Tag Archives: Customer Solutions

Use generative AI on AWS for efficient clinical document analysis

Post Syndicated from Alex Boudreau original https://aws.amazon.com/blogs/architecture/use-generative-ai-on-aws-for-efficient-clinical-document-analysis/

Clinical trials involve the ingestion and processing of vast amounts of highly regulated data, including complex protocol documents that describe how the trial will be conducted. Managing this volume of information can be overwhelming, but generative AI offers a solution by helping automate the process and enabling clinical researchers to quickly focus on the most relevant information. Currently, the drug approval process takes on average 10–12 years, with clinical trial study startup time accounting for 1 year of that timeframe. Much of the challenge with study startup lies in the complex and non-standard nature of protocol documents. These often require weeks or months of effort to review and assess. This review time adds to the already long cycle time to bring a new drug to market.

In this post, we show how Clario uses the AWS platform to accelerate clinical document analysis.

About Clario

Clario is a leading provider of endpoint data solutions to the clinical trials industry providing regulatory-grade clinical evidence for pharmaceutical, biotech, and medical device partners. Since Clario’s founding more than 50 years ago, their endpoint data solutions have supported clinical trials more than 26,000 times with over 700 regulatory approvals across more than 100 countries. One of the critical challenges Clario faces is the time-consuming process of generating documentation for clinical trials, which can take weeks or months.

The business challenge

Clinical trials are essential for the approval of new health innovations, including treatments, procedures, and medical devices. They require the collection of vast quantities of complex data from dispersed clinical trial sites to support assessments of medical benefits and risks, all while maintaining privacy and regulatory compliance. To make matters even more challenging, capturing data in clinical trial occurs not only in healthcare centers but also through remote capture through various aspects of trial participants’ daily activities.

Partners like Clario understand the challenges faced by life sciences companies when it comes to analyzing large volumes of complex clinical documents, such as study protocols. These documents often contain a mix of structured and unstructured data, including tables, images, and diagrams, making it difficult to accurately interpret and extract key information at scale. In this post, we explore how Clario has used the power of generative AI on AWS to efficiently analyze clinical documents and drive better outcomes for its clients.

Harnessing the power of large language models

The rapid progress in large language models (LLMs) has expanded the potential applications of natural language processing beyond simple conversational AI assistants. Clario has experimented with various techniques, such as zero-shot learning, few-shot learning, classification, entity extraction, and summarization, for the effective use of LLMs in specialized use cases. By employing prompt engineering, AI orchestration, and content retrieval, Clario can guide the models to accurately generate insights and extract relevant information from key clinical research documents, including complex clinical trial protocols.

Four pillars of effective document analysis on AWS

Through its research and development efforts, Clario has identified four core pillars that enable effective document analysis using generative AI on AWS:

  • Parsing – Clario uses AWS services such as Amazon Textract and Amazon Comprehend to extract text, images, and tables from clinical documents, maintaining both data privacy and security.
  • Retrieval – By using embedding models and vector databases like Amazon OpenSearch Service, Clario efficiently stores and retrieves relevant information from large document collections based on similarity search. The team has experimented with various chunking and retrieval strategies to optimize accuracy and performance.
  • Prompting – Using techniques like zero-shot and few-shot learning, Clario has enhanced the accuracy of LLMs for classifying and extracting information . AWS services such as and Amazon Bedrock simplify experimentation with different prompting strategies and the evaluation of model performance.
  • Generation – Clario carefully considers factors such as context size, reasoning capabilities, and latency when selecting the appropriate LLMs for generating structured outputs. AWS offers a range of pre-trained models and frameworks that seamlessly integrate into Clario’s pipeline.

Solution overview

To tackle the unique challenges associated with analyzing clinical documents, Clario has built a custom generative AI platform on AWS. This platform incorporates an orchestration engine that combines multiple LLMs and deep learning models, enabling it to extract key information accurately and at scale. By using AWS services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Simple Storage Service (Amazon S3), SageMaker, and AWS Lambda, Clario can efficiently process thousands of documents in a matter of seconds.

The following diagram illustrates the solution architecture.

Solution Overview

The workflow consists of the following steps:

  • Documents are collected on premises (1) and uploaded using AWS Direct Connect (2) with encryption in transit to Amazon S3 (3). All uploaded documents are then automatically and securely stored with server-side object-level encryption.
  • After the documents are uploaded and the user has reviewed them, the Clario AI Orchestration Engine (4) determines the best document parsing strategy based on file type, and extracts text using Amazon Textract (5). Once extracted, the text is vectorized and stored in the Amazon OpenSearch Service vector engine (6) for later semantic retrieval.
  • After vectorization, the Clario AI Orchestration Engine (4), which runs as a distributed service in Amazon EKS, launches a document classification async task using Amazon MQ. Amazon EC2 and Lambda are used for additional processing if needed. This triggers the Document Classification Agent, which uses Amazon Bedrock LLMs (8), for automatically determining the document type.
  • After the documents are classified, the Clario AI Orchestration Engine (4) launches the appropriate document analysis agent for further background processing. In the case of study protocols, the engine launches the Protocol Analysis agent, which uses a predefined analysis graph configuration stored in Amazon Relational Database Service (Amazon RDS) (7), as well as a combination of retrieval strategies and AI models, including custom deep learning models on SageMaker (9), and pre-trained LLMs on Amazon Bedrock (8). This orchestration powers advanced document analysis, transforming massive amounts of unstructured multi-modal data into structured data and insights.
  • Following the analysis, all structured data is then persisted to Amazon RDS (7) for later visualization, review, and querying.

Recommendations and best practices

Based on their experience developing and deploying generative AI solutions on AWS, Clario learned the following best practices:

  • Adopt an incremental and iterative development approach to gradually build and refine your models
  • Follow a standard machine learning approach for evaluating and validating model performance using representative test sets
  • Optimize the four pillars of document analysis before investing in fine-tuning and continuous pre-training of LLMs
  • Tailor your approaches to specific use cases, because not all problems require the same models or techniques

Conclusion

By using the power of generative AI on AWS, Clario has been able to efficiently analyze complex clinical trial documents and extract valuable insights for its clients in the life sciences industry. Through a combination of careful model selection, iterative development, and adherence to best practices, Clario has built a scalable and accurate document analysis pipeline using AWS. Unlock the full potential of your clinical trial data by applying these best practices with an AWS generative AI solution today.


About the Authors

How Open Universities Australia modernized their data platform and significantly reduced their ETL costs with AWS Cloud Development Kit and AWS Step Functions

Post Syndicated from Michael Davies original https://aws.amazon.com/blogs/big-data/how-open-universities-australia-modernized-their-data-platform-and-significantly-reduced-their-etl-costs-with-aws-cloud-development-kit-and-aws-step-functions/

This is a guest post co-authored by Michael Davies from Open Universities Australia.

At Open Universities Australia (OUA), we empower students to explore a vast array of degrees from renowned Australian universities, all delivered through online learning. We offer students alternative pathways to achieve their educational aspirations, providing them with the flexibility and accessibility to reach their academic goals. Since our founding in 1993, we have supported over 500,000 students to achieve their goals by providing pathways to over 2,600 subjects at 25 universities across Australia.

As a not-for-profit organization, cost is a crucial consideration for OUA. While reviewing our contract for the third-party tool we had been using for our extract, transform, and load (ETL) pipelines, we realized that we could replicate much of the same functionality using Amazon Web Services (AWS) services such as AWS Glue, Amazon AppFlow, and AWS Step Functions. We also recognized that we could consolidate our source code (much of which was stored in the ETL tool itself) into a code repository that could be deployed using the AWS Cloud Development Kit (AWS CDK). By doing so, we had an opportunity to not only reduce costs but also to enhance the visibility and maintainability of our data pipelines.

In this post, we show you how we used AWS services to replace our existing third-party ETL tool, improving the team’s productivity and producing a significant reduction in our ETL operational costs.

Our approach

The migration initiative consisted of two main parts: building the new architecture and migrating data pipelines from the existing tool to the new architecture. Often, we would work on both in parallel, testing one component of the architecture while developing another at the same time.

From early in our migration journey, we began to define a few guiding principles that we would apply throughout the development process. These were:

  • Simple and modular – Use simple, reusable design patterns with as few moving parts as possible. Structure the code base to prioritize ease of use for developers.
  • Cost-effective – Use resources in an efficient, cost-effective way. Aim to minimize situations where resources are running idly while waiting for other processes to be completed.
  • Business continuity – As much as possible, make use of existing code rather than reinventing the wheel. Roll out updates in stages to minimize potential disruption to existing business processes.

Architecture overview

The following Diagram 1 is the high-level architecture for the solution.

Diagram 1: Overall architecture of the solution, using AWS Step Functions, Amazon Redshift and Amazon S3

The following AWS services were used to shape our new ETL architecture:

  • Amazon Redshift – A fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift served as our central data repository, where we would store data, apply transformations, and make data available for use in analytics and business intelligence (BI). Note: The provisioned cluster itself was deployed separately from the ETL architecture and remained unchanged throughout the migration process.
  • AWS Cloud Development Kit (AWS CDK) – The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. Our infrastructure was defined as code using the AWS CDK. As a result, we simplified the way we defined the resources we wanted to deploy while using our preferred coding language for development.
  • AWS Step Functions – With AWS Step Functions, you can create workflows, also called State machines, to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning pipelines. AWS Step Functions can call over 200 AWS services including AWS Glue, AWS Lambda, and Amazon Redshift. We used the AWS Step Function state machines to define, orchestrate, and execute our data pipelines.
  • Amazon EventBridge – We used Amazon EventBridge, the serverless event bus service, to define the event-based rules and schedules that would trigger our AWS Step Functions state machines.
  • AWS Glue – A data integration service, AWS Glue consolidates major data integration capabilities into a single service. These include data discovery, modern ETL, cleansing, transforming, and centralized cataloging. It’s also serverless, which means there’s no infrastructure to manage. includes the ability to run Python scripts. We used it for executing long-running scripts, such as for ingesting data from an external API.
  • AWS Lambda – AWS Lambda is a highly scalable, serverless compute service. We used it for executing simple scripts, such as for parsing a single text file.
  • Amazon AppFlow – Amazon AppFlow enables simple integration with software as a service (SaaS) applications. We used it to define flows that would periodically load data from selected operational systems into our data warehouse.
  • Amazon Simple Storage Service (Amazon S3) – An object storage service offering industry-leading scalability, data availability, security, and performance. Amazon S3 served as our staging area, where we would store raw data prior to loading it into other services such as Amazon Redshift. We also used it as a repository for storing code that could be retrieved and used by other services.

Where practical, we made use of the file structure of our code base for defining resources. We set up our AWS CDK to refer to the contents of a specific directory and define a resource (for example, an AWS Step Functions state machine or an AWS Glue job) for each file it found in that directory. We also made use of configuration files so we could customize the attributes of specific resources as required.

Details on specific patterns

In the above architecture Diagram 1, we showed multiple flows by which data could be ingested or unloaded from our Amazon Redshift data warehouse. In this section, we highlight four specific patterns in more detail which were utilized in the final solution.

Pattern 1: Data transformation, load, and unload

Several of our data pipelines included significant data transformation steps, which were primarily performed through SQL statements executed by Amazon Redshift. Others required ingestion or unloading of data from the data warehouse, which could be performed efficiently using COPY or UNLOAD statements executed by Amazon Redshift.

In keeping with our aim of using resources efficiently, we sought to avoid running these statements from within the context of an AWS Glue job or AWS Lambda function because these processes would remain idle while waiting for the SQL statement to be completed. Instead, we opted for an approach where SQL execution tasks would be orchestrated by an AWS Step Functions state machine, which would send the statements to Amazon Redshift and periodically check their progress before marking them as either successful or failed. The following Diagram 2 shows this workflow.

Data transformation, load, and unload

Diagram 2: Data transformation, load, and unload pattern using Amazon Lambda and Amazon Redshift within an AWS Step Function

Pattern 2: Data replication using AWS Glue

In cases where we needed to replicate data from a third-party source, we used AWS Glue to run a script that would query the relevant API, parse the response, and store the relevant data in Amazon S3. From here, we used Amazon Redshift to ingest the data using a COPY statement. The following Diagram 3 shows this workflow.

Image 3: Copying from external API to Redshift with AWS Glue

Diagram 3: Copying from external API to Redshift with AWS Glue

Note: Another option for this step would be to use Amazon Redshift auto-copy, but this wasn’t available at time of development.

Pattern 3: Data replication using Amazon AppFlow

For certain applications, we were able to use Amazon AppFlow flows in place of AWS Glue jobs. As a result, we could abstract some of the complexity of querying external APIs directly. We configured our Amazon AppFlow flows to store the output data in Amazon S3, then used an EventBridge rule based on an End Flow Run Report event (which is an event which is published when a flow run is complete) to trigger a load into Amazon Redshift using a COPY statement. The following Diagram 4 shows this workflow.

By using Amazon S3 as an intermediate data store, we gave ourselves greater control over how the data was processed when it was loaded into Amazon Redshift, when compared with loading the data directly to the data warehouse using Amazon AppFlow.

Image 4: Using Amazon AppFlow to integrate external data

Diagram 4: Using Amazon AppFlow to integrate external data to Amazon S3 and copy to Amazon Redshift

Pattern 4: Reverse ETL

Although most of our workflows involve data being brought into the data warehouse from external sources, in some cases we needed the data to be exported to external systems instead. This way, we could run SQL queries with complex logic drawing on multiple data sources and use this logic to support operational requirements, such as identifying which groups of students should receive specific communications.

In this flow, shown in the following Diagram 5, we start by running an UNLOAD statement in Amazon Redshift to unload the relevant data to files in Amazon S3. From here, each file is processed by an AWS Lambda function, which performs any necessary transformations and sends the data to the external application through one or more API calls.

Image 5: Reverse ETL workflow, sending data back out to external data sources

Diagram 5: Reverse ETL workflow, sending data back out to external data sources

Outcomes

The re-architecture and migration process took 5 months to complete, from the initial concept to the successful decommissioning of the previous third-party tool. Most of the architectural effort was completed by a single full-time employee, with others on the team primarily assisting with the migration of pipelines to the new architecture.

We achieved significant cost reductions, with final expenses on AWS native services representing only a small percentage of projected costs compared to continuing with the third-party ETL tool. Moving to a code-based approach also gave us greater visibility of our pipelines and made the process of maintaining them quicker and easier. Overall, the transition was seamless for our end users, who were able to view the same data and dashboards both during and after the migration, with minimal disruption along the way.

Conclusion

By using the scalability and cost-effectiveness of AWS services, we were able to optimize our data pipelines, reduce our operational costs, and improve our agility.

Pete Allen, an analytics engineer from Open Universities Australia, says, “Modernizing our data architecture with AWS has been transformative. Transitioning from an external platform to an in-house, code-based analytics stack has vastly improved our scalability, flexibility, and performance. With AWS, we can now process and analyze data with much faster turnaround, lower costs, and higher availability, enabling rapid development and deployment of data solutions, leading to deeper insights and better business decisions.”

Additional resources


About the Authors

Michael Davies is a Data Engineer at OUA. He has extensive experience within the education industry, with a particular focus on building robust and efficient data architecture and pipelines.

Emma Arrigo is a Solutions Architect at AWS, focusing on education customers across Australia. She specializes in leveraging cloud technology and machine learning to address complex business challenges in the education sector. Emma’s passion for data extends beyond her professional life, as evidenced by her dog named Data.

How MuleSoft achieved cloud excellence through an event-driven Amazon Redshift lakehouse architecture

Post Syndicated from Sean Zou original https://aws.amazon.com/blogs/big-data/how-mulesoft-achieved-cloud-excellence-through-an-event-driven-amazon-redshift-lakehouse-architecture/

This post is cowritten with Sean Zou, Terry Quan and Audrey Yuan from MuleSoft.

In our previous thought leadership blog post Why a Cloud Operating Model we defined a COE Framework and showed why MuleSoft implemented it and the benefits they received from it. In this post, we’ll dive into the technical implementation describing how MuleSoft used Amazon EventBridge, Amazon Redshift, Amazon Redshift Spectrum, Amazon S3, & AWS Glue to implement it.

Solution overview

MuleSoft’s solution was to build a lakehouse built on top of AWS services, illustrated in the following diagram, supporting a portal. To provide near real-time analytics we used an event-driven strategy that which would trigger AWS Glue jobs an refresh materialized views.  We also implemented a layered approach that included collection, preparation, and enrichment making it straightforward to identify areas that affect data accuracy.

For MuleSoft’s lakehouse end-to-end solution, the following phases are key:

  • Preparation phase
  • Enrichment phase
  • Action phase

In the following sections, we discuss these phases in more detail.

Preparation phase

Using the COE Framework, we engaged with the stakeholders in the preparation phase to determine the business goals and identify the data sources to ingest. Examples of data sources were cloud assets inventory, AWS Cost and Usage Reports, and AWS Trusted Advisor data. The ingested data is processed in the lakehouse to implement the Well-Architected pillars, utilization, security, and compliance status checks and measures.

How you configure the CUR data and the Trusted Advisor data to land into S3?

The configuration process involves multiple components for both CUR and Trusted Advisor data storage. For CUR setup, customers need to configure an S3 bucket where the CUR report will be delivered, either by selecting an existing bucket or creating a new one. The S3 bucket requires a policy to be applied and customers must specify an S3 path prefix which creates a subfolder for CUR file delivery .

Trusted Advisor data is configured to use Kinesis Firehose to deliver customer summary data to the Support Data lake S3 bucket .The data ingestion process uses firehose buffer parameters (1MB buffer size and 60-second buffer time) to manage data flow to the S3 bucket .

The Trusted Advisor data is stored in JSON and GZIP format, following a specific folder structure with hourly partitions using the “YYYY-MM-DD-HH” format .

The S3 partition structure for Trusted Advisor customer summary data includes separate paths for success and error data, and the data is encrypted using a KMS key specific to Trusted Advisor data .

MuleSoft used AWS managed services and data ingestion tools to pull from multiple data sources and that can support customizations. Cloudquery is used tool to gather cloud infrastructure information, which can connect many infrastructure data sources out of the box and land it into an Amazon S3 bucket. The MuleSoft Anypoint Platform provides an integration layer to integrate infrastructure tools, accommodating many data sources like on-premise, SaaS, and commercial off-the-shelf (COTS) software. Cloud Custodian  was used for its capability of managing cloud resources and auto-remediation with customizations.

Enrichment phase

The enrichment phase includes ingesting raw data aligning with our business goals into the lakehouse through our pipelines, and consolidating the data to create a single pane of glass.

The pipelines adopt the event-driven architecture consisting of EventBridge, Amazon Simple Queue Service (Amazon SQS), and Amazon S3 Event Notifications to provide near real-time data for analysis. When new data arrives in the source bucket, new object creation is captured by the EventBridge rule, which invokes the AWS Glue workflow, consisting of an AWS Glue crawler and AWS Glue extract, transform, and load (ETL) jobs. We also configured S3 Event Notifications to send messages to the SQS queue to make sure the pipeline will only process the new data.

The AWS Glue ETL job cleanses and standardizes the data, so that it’s ready to be analyzed using Amazon Redshift. To tackle data with complex structures, additional processing is performed to flatten the nested data formats into a relational model. The flattening step also extracts the tags of AWS assets out of the nested JSON objects and pivots them into individual columns, enabling tagging enforcement controls and ownership attribution.  The ownership attribution of the infrastructure data provides accountability and holds teams responsible for the costs, utilization, security, compliance, and remediation of their cloud assets.  One important tag is asset ownership which is from the tags extracted from the flattening step, this data can be attributed to the corresponding owners by SQL scripts.

When the workflow is complete, the raw data from different sources and with various structures is now  centralized in the data warehouse.  From there, disjointed data with different purposes is ready to be consolidated and translated into actionable intelligence in the Well-Architected Pillars by coding out the business logic.

 Solutions for the enrichment phase

In the enrichment phase, we faced a number of storage, efficiency, and scalability challenges given the sheer volume of data. We used three techniques (file partitioning, Redshift Spectrum, and materialized views) to address these issues and scale without compromising performance.

File partitioning

MuleSoft’s infrastructure data is stored in folder structure: year, month, day, hour, account, and Region in an S3 bucket, so AWS Glue crawlers are able to automatically identify and add partitions to the tables in the AWS Glue Data Catalog. Partitioning helps improve query performance significantly because it optimizes parallel processing for queries. The amount of data scanned by each query is restricted based on the partition keys, helping reduce overall data transfers, processing time, and computation costs. Although partitioning is an optimization technique that helps improve query efficiency, it’s important to keep in mind two key points while using this technique:

  • The Data Catalog has a maximum cap of 10 million partitions per table
  • Query performance gets compromised as partitions grow rapidly

Therefore, balancing the number of partitions in the Data Catalog tables and query efficiency is essential. We decided on a data retention policy of 3 months and configured a lifecycle rule to expire any data older than that.

Our event-driven architecture–AWS Eventbridge event is invoked when objects are put into or removed from an S3 bucket, event messages are published to the SQS queue using S3 Event Notifications, which invokes an AWS Glue crawler to either add new partitions or removes old partitions from the Data Catalog based on the messages handling the partition cleanup.

Amazon Redshift and concurrency scaling

MuleSoft uses Amazon Redshift to query the data in S3 because it provides large scale compute and minimized data redundancy. MuleSoft also used Amazon Redshift concurrency scaling to run concurrent queries with consistently fast query performance. Amazon Redshift automatically added query processing power in seconds to process a high number of concurrent queries without any delays.

Materialized views

Another technique we used is Amazon Redshift materialized views. Materialized views store preset query results that future similar queries can use, so many computation steps can be skipped. Therefore, relevant data can be accessed efficiently, which leads to query optimization. Additionally, materialized views can be automatically and incrementally refreshed. Therefore, we can achieve a single pane of glass in our cloud infrastructure with the most up-to-date projections, trends, and actionable insights to our organization with improved query performance.

Amazon Redshift Materialized Views (MVs) are used extensively for reporting in MuleSoft’s Cloud Central portal, but if users needed to drill down into a granular view they could reference external tables.

Mulesoft is currently manually refreshing the materialized views through the event-driven architecture, but is evaluating a switch to automatic refresh.

Action phase

Using materialized views in Amazon Redshift, we developed a self-serve Cloud Central portal in Tableau to provide a display portal for each team, engineer, and manager offering guidance and recommendations to help them operate in a way that aligns with the organization’s requirements, standards, and budget. Managers are empowered with monitoring and decision-making information for their teams. Engineers are able to identify and tag assets with missing mandatory tagging information, as well as remediate non-compliant resources. A key feature of the portal is personalization, meaning that the portal is enabled to populate visualizations and analysis based on the relevant data associated with a manager’s or engineer’s login information.

Cloud Central also helps engineering teams improve their cloud maturity in the six Well-Architecture pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. The team proved out the “art of possible” by poc’ing Amazon Q to assist with 100 and 200 Well-Architected pillar inquiries and how to’s. The following screenshot illustrates the MuleSoft implementation of the portal, Cloud Central. Other companies will design portals that are more bespoke to their own use cases and requirements.

Conclusion

The technical and business impact of MuleSoft’s COE Framework enables an optimization strategy and a cloud usage show back approach which helps MuleSoft continue to grow with a scalable and sustainable cloud infrastructure. The framework also drives continual maturity and benefits in cloud infrastructure centered around the six Well-Architecture pillars shown in the following figure.

The framework helps organizations with expanded public cloud infrastructure achieve their business goals guided by the Well-Architected benefits powered by an event-driven architecture.

The event-driven Amazon Redshift lakehouse architecture solution offers near real-time actionable insights on decision-making, control, and accountability. The event-driven architecutre can be distilled into modules which can be added or deleted depending on your technical/business goals.

The team is exploring new ways to lower the total cost of ownership. They are evaluating Amazon Redshift Serverless for transient database workloads as well as exploring Amazon DataZone to aggregate and correlate data sources into a data catalog to share among teams, applications, and lines of businesses in a democratized way. We can increase visibility, productivity, and scalability with a well-thought-out lakehouse solution.

We invite organizations and enterprises to take a holistic approach to understand their cloud resources, infrastructure, and applications. You can enable and educate your teams through a single pane of glass, while running on a data modernization lakehouse applying Well-Architected concepts, best practices, and cloud-centric principles. This solution can ultimately enable near real-time streaming, leveling up a COE Framework well into the future.


About the Authors

Sean Zou is a Cloud Operations leader with MuleSoft at Salesforce. Sean has been involved in many aspects of MuleSoft’s Cloud Operations, and helped drive MuleSoft’s cloud infrastructure to scale more than tenfold in 7 years. He built the Oversight Engineering function at MuleSoft from scratch.

Terry Quan focuses on FinOps issues. He works on MuleSoft Engineering on cloud computing budgets and forecasting, cost reduction efforts, costs-to-serve, and coordinates with Salesforce Finance. Terry is a FinOps Practitioner and Professional Certified.

Audrey Yuan is a Software Engineer with MuleSoft at Salesforce. Audrey works on data lakehouse solutions to help drive cloud maturity across the six pillars of the Well-Architected Framework.

Rueben Jimenez is a Senior Solutions Architect at AWS, designing and implementing complex data analytics, AI/ML, and cloud infrastructure solutions.

Avijit Goswami is a Principal Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open source solutions. Outside of his work, Avijit likes to travel, hike, watch sports, and listen to music.

How Nielsen uses serverless concepts on Amazon EKS for big data processing with Spark workloads

Post Syndicated from Shani Adadi Kazaz original https://aws.amazon.com/blogs/architecture/how-nielsen-uses-serverless-concepts-on-amazon-eks-for-big-data-processing-with-spark-workloads/

Nielsen Marketing Cloud, a leading ad tech company, processes in one of their pipelines 25 TB of data and 30 billion events daily. As their data volumes grew, so did the challenges of scaling their Apache Spark workloads efficiently.

Nielsen’s team faced a scenario in which, as they scaled up their cluster by adding more instances, the performance per instance degraded. The degradation resulted in a decrease in the amount of work done per hour by each instance, and drove costs per GB of data processed up.

Furthermore, they encountered occasional data skew issues. Data skew, where data is unevenly distributed across partitions, created processing bottlenecks and further reduced cluster efficiency. In extreme cases, these combined factors led to cluster failures.

In this post, we follow Nielsen’s journey to build a robust and scalable architecture while enjoying linear scaling. We start by examining the initial challenges Nielsen faced and the root causes behind these issues. Then, we explore Nielsen’s solution: running Spark on Amazon Elastic Kubernetes Service (Amazon EKS) while adopting serverless concepts.

Evolving from a Spark cluster to Spark pods on Amazon EKS

Nielsen’s Marketing Cloud architecture began as a typical Spark cluster on Amazon EMR, receiving a constant stream of files of varying sizes to process. As both data volume and cluster size grew, the team noticed a degradation in performance per instance, as illustrated in the following graphs. Beyond the slower processing and the higher costs, Nielsen occasionally suffered production issues caused by data skew.

GB/Instance/Hour Compared to Cluster SizeCost to Process 1 GB of Data

The team realized the problem was the growing number of remote shuffles between instances as the cluster grew. Remote shuffle, a process in Spark where data is redistributed across partitions, involves significant data transfer over the network and can become a major bottleneck. Due to the streaming nature of the data in their scenario, Nielsen realized they could instead process data in smaller batches. This meant they didn’t have to lean on the distributed processing capabilities of Spark by using large Spark clusters, and opt for small ones instead.

To address the performance degradation, the team decided to change its growth strategy: instead of scaling up their single Spark cluster, they scaled out using multiple local mode Spark clusters (a single node cluster) running on Amazon EKS. When compared to Spark cluster mode, local mode provides better performance for small analytics workloads. Each local mode is running a limited, smaller amount of data, requiring no remote shuffle and no interaction with other Spark instances.

Moreover, the pods running on Amazon EKS can scale up and down based on the amount of pending work, meaning Nielsen could stop resources when they are not needed.

The new solution scales linearly, is 55% cheaper, and handles data faster, even under large burst conditions.

Why shuffle matters

Remote shuffle is triggered when data needs to be exchanged between Spark instances. Some transformations, like join or repartition, necessitate a shuffle of data. Remote shuffle is an order of magnitude slower than in-memory computations because it requires moving data over the network. It could slow down processing significantly, sometimes adding 100–200% to the total processing time.

The problem Nielsen ran into was that as cluster size grew, the amount of data shuffled grew proportionally to the cluster size. The following graph shows why this happens. It calculates the amount of data exchanged for a randomly distributed dataset as cluster size grows.

The following graph illustrates that the correlation is to the size of the cluster and not to the size of the data.

% of Data Shuffled vs Cluster Size

Addressing shuffle

The team hypothesized that minimizing shuffle could lead to substantial performance improvements. Nielsen’s engineers decided to implement ideas from serverless patterns by drastically reducing the size of each cluster to a minimum while at the same time adding more of these smaller clusters to compensate for the lower capacity of each one. This approach promised to eliminate remote shuffle entirely for each data work item, as illustrated in the preceding graph.

Although this strategy promised performance gains, it also introduced a constraint: a limit on the amount of data per work item.

Designing the new system based on serverless patterns

Nielsen’s team developed a new architecture that uses two core concepts:

  • A queue of work items to pull from
  • A group of local mode Spark modules pulling work items from the queue

They had the following design goals:

  • Keep the Spark modules busy at all times
  • Stop modules when not needed
  • Make sure all work items are processed successfully

The following diagram illustrates the workflow.

Work items Queue

Final design

The final design includes the following components:

  • File metadata storage – An Amazon Relational Database Service (Amazon RDS) cluster runs the PostgreSQL engine to store and manage statistics about each file entering the system.
  • Work manager – An AWS Lambda function is used to periodically pull waiting files from the database, prepare work items comprised of one or multiple files, and publish the work items to an Amazon Simple Queue Service (Amazon SQS) message queue.
  • Work queue – An SQS message queue is used for work items waiting to be pulled for processing.
  • Processing units – Local mode Spark instances run as pods on an EKS cluster. They pull work items from the SQS queue. As long as there are waiting work items in the queue, the pods are constantly busy.
  • Metrics adaptor – An adaptor (Kubernetes-cloudwatch-adapter) provides Amazon CloudWatch metrics to the Kubernetes Horizontal Pod Autoscaler.
  • Kubernetes Horizontal Pod Autoscaler – Horizontal Pod Autoscaler (HPA) uses a scaling rule to scale pods up or down based on the metrics from CloudWatch. It scales according to the number of messages (work items) visible in the queue, which are proportional to the work waiting to be processed. In Nielsen’s system, HPA scales the pods by targetValue = {SQS length/2}. .
  • Work completion queue – A second SQS message queue is used for reporting completion of work items. The completions get pulled by another Lambda function and get updated in the PostgreSQL database.

The following diagram illustrates the architecture of the final system.

Full architecture

⁠Analyzing the results

The following graphs demonstrate the EKS pods scaling based on the amount of work items. The active pods pick up new work items as soon as they finish their previous ones.

Analyzing - Messages and Spark Pods

The following graph shows a large burst of data coming in. The system reacts quickly and scales up to process the added work. It quickly scales down when work is complete.

Analyzing - Messages, Spark Pods and EC2 Instances

Analyzing the performance achieved per instance, the new system demonstrated a significant improvement. Performance per instance increased by approximately 130% while growing linearly and maintaining close to constant costs per GB processed.

The comparison of performance between the new system and the old system can be seen in the following graph.

Throughput - MB/Hour

The new system’s costs are 55% lower for the same amount of data processed.

The following graphs compare the costs before and after the implementation.

Cost Comparison

Conclusion

Nielsen’s journey from a traditional architecture to a serverless-inspired architecture on Amazon EKS exemplifies the power of rethinking established patterns in big data processing.

By addressing the core challenges of data shuffle and scaling, Nielsen not only achieved performance gains and cost reductions, but also demonstrated the potential for linear scaling in large-scale data operations.

If you have big data processing jobs that that can be broken down into many independent small parts, consider using similar ideas over Amazon EKS to achieve linear scaling and large cost savings.

This post was copyedited for grammar, spelling, capitalization, punctuation, terminology, and legal issues. Other important issues are noted in comments, and you should consider revising the content accordingly before publication.


About the Authors

Top Architecture Blog Posts of 2024

Post Syndicated from Andrea Courtright original https://aws.amazon.com/blogs/architecture/top-architecture-blog-posts-of-2024/

Well, it’s been another historic year! We’ve watched in awe as the use of real-world generative AI has changed the tech landscape, and while we at the Architecture Blog happily participated, we also made every effort to stay true to our channel’s original scope, and your readership this last year has proven that decision was the right one.

AI/ML carries itself in the top posts this year, but we’re also happy to see that foundational topics like resiliency and cost optimization are still of great interest to our audience.

(By the way, if you were hoping for more AI/ML content, head on over to our sister channel, the AWS Machine Learning Blog!).

Without further ado, here are our top posts from 2024!

#10 Deploy Stable Diffusion ComfyUI on AWS elastically and efficiently

This post helps you get started using ComfyUI, and was so successful that we followed it up later in the year with How to build custom nodes workflow with ComfyUI on EKS!

Architecture for deploying stable diffusion on ComfyUI

Figure 1. Architecture for deploying stable diffusion on ComfyUI

#9 Let’s Architect! Designing Well-Architected systems

In keeping with Let’s Architect! series, we have our first of three favorites for the year. This set of resources helps you apply Well-Architected standards in practice.

Let's Architect

Figure 2. Let’s Architect

#8 Let’s Architect! Learn About Machine Learning on AWS

As I said, Let’s Architect! has a winning series, and they’ve got a finger on the pulse of the tech world. This post about machine learning showcases some of the most exciting things happening at AWS.

Let's Architect

Figure 3. Let’s Architect

If you’re more interested in generative AI, you can also take a look at another post from 2024: Let’s Architect! GenAI

#7 Creating an organizational multi-Region failover strategy

Preparedness is another common theme in this year’s favorites. Michael, John, and Saurabh are well-versed in multi-Region architecture, and they’re here to share some strategies to contain failure impact.

When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region.

Figure 4. When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region.

#6 Building a three-tier architecture on a budget

Let’s talk cost optimization. This post about a three-tier architecture that relies on the AWS Free Tier is a must-read for anyone looking for tips to help them avoid unnecessary costs (and that’s everyone).

Example of a three-tier architecture on AWS

Figure 5. Example of a three-tier architecture on AWS

#5 Announcing updates to the AWS Well-Architected Framework guidance

As usual, Haleh & team are pros at making sure the Well-Architected Framework is current and relevant. Take a look at the enhanced and expanded guidance in all six pillars.

Well-Architected logo

Figure 6. Well-Architected logo

#4 Let’s Architect! Serverless developer experience in AWS

One more winning post from Luca, Federica, Vittorio, and Zamira! This collection of developer resources includes new ideas in AWS Lambda, Amazon Q Developer, and Amazon DynamoDB.

Let's Architect

Figure 7. Let’s Architect

#3 London Stock Exchange Group uses chaos engineering on AWS to improve resilience

This post from April 1 was not an April Fool’s joke! See how LSEG designed failure scenarios to test their resilience and observability.

Chaos engineering pattern for hybrid architecture (3-tier application)

Figure 8. Chaos engineering pattern for hybrid architecture (3-tier application)

#2 Achieving Frugal Architecture using the AWS Well-Architected Framework Guidance

Frugality AND Well-Architected? What a winning combo! This post, inspired by the 2023 re:Invent keynote, outlines the seven laws of Frugal Architecture.

Well-Architected logo

Figure 9. Well-Architected logo

#1 How an insurance company implements disaster recovery of 3-tier applications

And finally, our number one post of the year! Amit and Luiz showcase a customer solution with real-world applications that builds on the guidelines of other posts in this list! Well done!

The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions

Figure 10. The Pilot Light scenario for a 3-tier application that has application servers and a database deployed in two Regions

Thank you!

As always, thanks to our contributors for their dedication and desire to share, and to you, our readers! We would be nothing with you. Literally.

For other top post lists, see our Top 10 and Top 5 posts from previous years.

How EUROGATE established a data mesh architecture using Amazon DataZone

Post Syndicated from Dr. Leonard Heilig original https://aws.amazon.com/blogs/big-data/how-eurogate-established-a-data-mesh-architecture-using-amazon-datazone/

This post is co-written by Dr. Leonard Heilig and Meliena Zlotos from EUROGATE.

For container terminal operators, data-driven decision-making and efficient data sharing are vital to optimizing operations and boosting supply chain efficiency. Internally, making data accessible and fostering cross-departmental processing through advanced analytics and data science enhances information use and decision-making, leading to better resource allocation, reduced bottlenecks, and improved operational performance. Externally, sharing real-time data with partners such as shipping lines, trucking companies, and customs agencies fosters better coordination, visibility, and faster decision-making across the logistics chain. Together, these capabilities enable terminal operators to enhance efficiency and competitiveness in an industry that is increasingly data driven.

EUROGATE is a leading independent container terminal operator in Europe, known for its reliable and professional container handling services. Every day, EUROGATE handles thousands of freight containers moving in and out of ports as part of global supply chains. Their terminal operations rely heavily on seamless data flows and the management of vast volumes of data. Recently, EUROGATE has developed a digital twin for its container terminal Hamburg (CTH), generating millions of data points every second from Internet of Things (IoT)devices attached to its container handling equipment (CHE).

In this post, we show you how EUROGATE uses AWS services, including Amazon DataZone, to make data discoverable by data consumers across different business units so that they can innovate faster. Two use cases illustrate how this can be applied for business intelligence (BI) and data science applications, using AWS services such as Amazon Redshift and Amazon SageMaker. We encourage you to read Amazon DataZone concepts and terminology to become familiar with the terms used in this post.

Data landscape in EUROGATE and current challenges faced in data governance

The EUROGATE Group is a conglomerate of container terminals and service providers, providing container handling, intermodal transports, maintenance and repair, and seaworthy packaging services. In recent years, EUROGATE has made significant investments in modern cloud applications to enhance its operations and services along the logistics chains. With the addition of these technologies alongside existing systems like terminal operating systems (TOS) and SAP, the number of data producers has grown substantially. However, much of this data remains siloed and making it accessible for different purposes and other departments remains complex. Thus, managing data at scale and establishing data-driven decision support across different companies and departments within the EUROGATE Group remains a challenge.

Need for a data mesh architecture

Because entities in the EUROGATE group generate vast amounts of data from various sources—across departments, locations, and technologies—the traditional centralized data architecture struggles to keep up with the demands for real-time insights, agility, and scalability. The following requirements were essential to decide for adopting a modern data mesh architecture:

  • Domain-oriented ownership and data-as-a-product: EUROGATE aims to:
    • Enable scalable and straightforward data sharing across organizational boundaries.
    • Enhance agility by localizing changes within business domains and clear data contracts.
    • Improve accuracy and resiliency of analytics and machine learning by fostering data standards and high-quality data products.
    • Eliminate centralized bottlenecks and complex data pipelines.
  • Self-service and data governance: EUROGATE wants to ensure that the discovery, access, and use of data by consumers is as direct as possible through a data portal where information about shared data sets can be published, while data governance is streamlined through automated policy enforcement, ensuring compliance during key stages such as data discovery, access, and deployment.
  • Plug-and-play integration: A seamless, plug-and-play integration between data producers and consumers should facilitate rapid use of new data sets and enable quick proof of concepts, such as in the data science teams.

How Amazon DataZone helped EUROGATE address those challenges

In the first phase of establishing a data mesh, EUROGATE focused on standardized processes to allow data producers to share data in Amazon DataZone and to allow data consumers to discover and access data. The vision, as shown in the following figure, is that data from digital services, such as from the terminal operating system (TOS) and TwinSim (a project to create a digital twin of real-world operations), can be shared with Amazon DataZone and used by BI dashboards and data science teams, among others, while those digital services and other domain users can also consume subscribed data from Amazon DataZone.

EUROGATE_pic1

In the following section, two use cases demonstrate how the data mesh is established with Amazon DataZone to better facilitate machine learning for an IoT-based digital twin and BI dashboards and reporting using Tableau.

Use case 1: Machine learning for IoT-based digital twin

Through the TwinSim project, EUROGATE has developed a digital twin using AWS services that gathers real-time data (for example, positions, machinery, and pick/deck events) from CHE (including straddle carriers and quay cranes), integrates it with planning data from the TOS, and enhances it with additional sources such as weather information. In addition to real-time analytics and visualization, the data needs to be shared for long-term data analytics and machine learning applications. EUROGATE’s data science team aims to create machine learning models that integrate key data sources from various AWS accounts, allowing for training and deployment across different container terminals. To achieve this, EUROGATE designed an architecture that uses Amazon DataZone to publish specific digital twin data sets, enabling access to them with SageMaker in a separate AWS account.

As part of the required data, CHE data is shared using Amazon DataZone. The data originates in Amazon Kinesis Data Streams, from which it is copied to a dedicated Amazon Simple Storage Service (Amazon S3) bucket by using Amazon Data Firehose in combination with an AWS Lambda function for data filtering. An extract, transform, and load (ETL) process using AWS Glue is triggered once a day to extract the required data and transform it into the required format and quality, following the data product principle of data mesh architectures. From here, the metadata is published to Amazon DataZone by using AWS Glue Data Catalog. This process is shown in the following figure.

EUROGATE_2

To work with the shared data, the data science and AI teams subscribe to the data and query it using Amazon Athena by using Amazon SageMaker Data Wrangler. The following is an example query.

import awswrangler as wr
wr.athena.read_sql_query('SELECT * FROM "sagemakedatalakeenvironment_sub_db"."cycle_end"', "sagemakedatalakeenvironment_sub_db", ctas_approach=False)

A similar approach is used to connect to shared data from Amazon Redshift, which is also shared using Amazon DataZone.

import awswrangler as wr
con = wr.redshift.connect(secret_id="ai-dev-redshift-credentials",is_serverless=True,serverless_work_group="ai-dev-workgroup")
with con.cursor() as cursor:
cursor.execute('SELECT * FROM 
"datazone_datashare_db_269e5790f589258657fcc48d8cfd65ea3f3cd7f7"."datazone_env_twinsimsilverdata"."cycle_end";')
con.close()

With this, as the data lands in the curated data lake (Amazon S3 in parquet format) in the producer account, the data science and AI teams gain instant access to the source data eliminating traditional delays in the data availability. The data science and AI teams are able to explore and use new data sources as they become available through Amazon DataZone. Because Amazon DataZone integrates the data quality results, by subscribing to the data from Amazon DataZone, the teams can make sure that the data product meets consistent quality standards.

After experimentation, the data science teams can share their assets and publish their models to an Amazon DataZone business catalog using the integration between Amazon SageMaker and Amazon DataZone. This will be the future use case of EUROGATE where the ability to publish trained machine learning (ML) models back to an Amazon DataZone catalog promotes reusability, allowing models to be discovered by other teams and projects. This approach fosters knowledge sharing across the ML lifecycle.

Use case 2: BI for cloud applications

In recent years, EUROGATE has developed several cloud applications for supporting key container logistics processes and services, such as special container terminal and container depot applications or digital platforms for organizing container transports using rail and truck. The applications are hosted in dedicated AWS accounts and require a BI dashboard and reporting services based on Tableau. In the past, one-to-one connections were established between Tableau and respective applications. This led to a complex and slow computations. In this use case, EUROGATE implemented a hybrid data mesh architecture using Amazon Redshift as a centralized data platform. This approach transformed their fragmented Tableau connections into a scalable, efficient analytics ecosystem.

By centralizing container and logistics application data through Amazon Redshift and establishing a governance framework with Amazon DataZone, EUROGATE achieved both performance optimization and cost efficiency. The hybrid data mesh enables batch processing at scale while maintaining the data access controls, security, and governance; effectively balancing the distributed ownership with centralized analytics capabilities.

The data is shared from on-premises to an Amazon Relational Database Service (Amazon RDS) database in the AWS Cloud. AWS Database Migration Service (AWS DMS) is used to securely transfer the relevant data to a central Amazon Redshift cluster. AWS DMS tasks are orchestrated using AWS Step Functions. A Step Functions state machine is run on a daily using Amazon EventBridge scheduler. The data in the central data warehouse in Amazon Redshift is then processed for analytical needs and the metadata is shared to the consumers through Amazon DataZone. The consumer subscribes to the data product from Amazon DataZone and consumes the data with their own Amazon Redshift instance. This is further integrated into Tableau dashboards. The architecture is depicted in the following figure.

EUROGATE_3

Implementation benefits

As we continue to scale, efficient and seamless data sharing across services and applications becomes increasingly important. By using Amazon DataZone and other AWS services including Amazon Redshift and Amazon SageMaker, we can achieve a secure, streamlined, and scalable solution for data and ML model management, fostering effective collaboration and generating valuable insights. This approach supports both the immediate needs of visualization tools such as Tableau and the long-term demands of digital twin and IoT data analytics.

  • Centralized, scalable data sharing and native integration

Amazon DataZone facilitates integration with applications such as Tableau, enabling data to flow seamlessly within the AWS ecosystem. Those integrations reduce the need for complex, manual configurations, allowing EUROGATE to share data across the organization efficiently. The architecture centralizes key data, such as CHE data, for analytics and ML, ensuring that teams across the organization have access to consistent, up-to-date information, enhancing collaboration and decision-making at all levels. Insights from ML models can be channeled through Amazon DataZone to inform internal key decision makers internally and external partners.

  • Reduced complexity, greater scalability, and cost efficiency

The Amazon DataZone architecture reduces unnecessary complexity and scales with EUROGATE’s growing needs, whether through new data sources or increased user demand. In parallel, using Amazon Data Firehose to stream data into an S3 bucket and AWS Glue for daily ETL transformations provides an automated pipeline that prepares the data for long-term analytics. This batch-oriented approach reduces computational overhead and associated costs, allowing resources to be allocated efficiently. While real-time data is processed by other applications, this setup maintains high-performance analytics without the expense of continuous processing.

  • Faster and easier data integration for Tableau and enhanced data preparation for ML

Amazon DataZone streamlines data integration for tools such as Tableau, enabling BI teams to quickly add and visualize data without building complex pipelines. This agility accelerates EUROGATE’s insight generation, keeping decision-making aligned with current data. Additionally, daily ETL transformations through AWS Glue ensure high-quality, structured data for ML, enabling efficient model training and predictive analytics. This combination of ease and depth in data management equips EUROGATE to support both rapid BI needs and robust analytical processing for IoT and digital twin projects.

  • Faster onboarding and data sharing of data assets between organizational units

Amazon DataZone helps the teams to autonomously discover data assets that are created in the organization and to onboard data assets across AWS accounts within minutes with metadata synchronization. EUROGATE has already onboarded 500 data assets from different organizational units using Amazon DataZone. The new process of onboarding data assets is 15 times faster, leading to immediate visibility of data assets while simplifying data sharing and discovery through an intuitive point-and-click interface that removes traditional barriers to data access.

Conclusion

The implementation of Amazon DataZone marks a transformative step for EUROGATE’s data management by providing a scalable, and efficient solution for data sharing, machine learning and analytics. By integrating various data producers and connecting them to data consumers such as Amazon SageMaker and Tableau, Amazon DataZone functions as a digital library to streamline data sharing and integration across EUROGATE’s operations. In the first phase of production, Amazon DataZone has already demonstrated measurable benefits, including access to data and ML and the ability to incorporate a wider range of datasets to its unified catalog repository. By centralizing metadata with Amazon DataZone, EUROGATE is setting a solid foundation for efficient operations and improved data and ML governance, because teams can now discover, govern, and analyze data with greater confidence and speed. This capability supports rapid responses to business needs, helping EUROGATE to maintain agility and stay ahead of the curve. With this, EUROGATE is better positioned to onboard new data sources, integrate additional terminals, and expand machine learning applications across our container terminals.

Amazon DataZone empowers EUROGATE by setting the stage for long-term operational excellence and scalability. With a unified catalog, enhanced analytics capabilities, and efficient data transformation processes, we’re laying the groundwork for future growth. This infrastructure enables EUROGATE to extract predictive insights, drive smarter business decisions, and scale operations efficiently, ultimately supporting our goal of sustained innovation and competitive advantage.

Future vision and next steps

As EUROGATE continues to advance its digital transformation, the integration of Amazon DataZone and EUROGATE’s architecture lays the groundwork for a more data-driven and intelligent future. In the upcoming phases, the vision is to further expand the role of Amazon DataZone as the central platform for all data management, enabling seamless integration across an even broader set of data sources and consumers. This will include additional data from more container terminals and logistics service providers, enhanced operational metrics, IoT sensor data, and advanced third-party sources such as global supply chain data and maritime analytics.

The continued focus on secure data sharing and governance will also foster better collaboration with partners, suppliers, and customers, leading to improved service levels and a more resilient supply chain. This future vision will help EUROGATE maintain its position as a leader in container terminal operations while continuously adapting to technological advancements and market dynamics.

Ultimately, EUROGATE’s investment in this architecture ensures that the organization is well-positioned to scale and innovate in a dynamic industry through a future of smarter, more connected, and highly efficient container terminal operations.

To learn more about Amazon DataZone and how to get started, see the Getting started guide. See the YouTube playlist for some of the latest demos of Amazon DataZone and short descriptions of the capabilities available.


About the Authors

Dr. Leonard Heilig is CTO at driveMybox and drives digitalization and AI initiatives at EUROGATE, bringing over 10 years of research and industry experience in cloud-based platform development, data management, and AI. Combining a deep understanding of advanced technologies with a passion for innovation, Leonard is dedicated to transforming logistics processes through digitalization and AI-driven solutions.

Meliena ZlotosMeliena Zlotos is a DevOps Engineer at EUROGATE with a background in Industrial Engineering. She has been heavily involved in the Data Sharing Project, focusing on the implementation of Amazon DataZone into EUROGATE’s IT environment. Through this project, Meliena has gained valuable experience and insights into DataZone and Data Engineering, contributing to the successful integration and optimization of data management solutions within the organization.

Lakshmi Nair is a Senior Specialist Solutions Architect for Data Analytics at AWS. She focuses on architecting solutions for organizations across their end-to-end data analytics estate, including batch and real-time streaming, data governance, big data, data warehousing, and data lake workloads. She can reached via LinkedIn.

Siamak NarimanSiamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.

Juicebox recruits Amazon OpenSearch Service for improved talent search

Post Syndicated from Ishan Gupta original https://aws.amazon.com/blogs/big-data/juicebox-recruits-amazon-opensearch-service-for-improved-talent-search/

This post is cowritten by Ishan Gupta, Co-Founder and Chief Technology Officer, Juicebox.

Juicebox is an AI-powered talent sourcing search engine, using advanced natural language models to help recruiters identify the best candidates from a vast dataset of over 800 million profiles. At the core of this functionality is Amazon OpenSearch Service, which provides the backbone for Juicebox’s powerful search infrastructure, enabling a seamless combination of traditional full-text search methods with modern, cutting-edge semantic search capabilities.

In this post, we share how Juicebox uses OpenSearch Service for improved search.

Challenges in recruiting search

Recruiting search engines traditionally rely on simple Boolean or keyword-based searches. These methods aren’t effective in capturing the nuance and intent behind complex queries, often leading to large volumes of irrelevant results. Recruiters spend unnecessary time filtering through these results, a process that is both time-consuming and inefficient.

In addition, recruiting search engines often struggle to scale with large datasets, creating latency issues and performance bottlenecks as more data is indexed. At Juicebox, with a database growing to more than 1 billion documents and millions of profiles being searched per minute, we needed a solution that could not only handle massive-scale data ingestion and querying, but also support contextual understanding of complex queries.

Solution overview

The following diagram illustrates the solution architecture.

OpenSearch Service securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like application monitoring, log analytics, observability, and website search. You send search documents to OpenSearch Service and retrieve them with search queries that match text and vector embeddings for fast, relevant results.

At Juicebox, we solved five challenges with Amazon OpenSearch Service, which we discuss in the following sections.

Problem 1: High latency in candidate search

Initially, we faced significant delays in returning search results due to the scale of our dataset, especially for complex semantic queries that require deep contextual understanding. Other full-text search engines couldn’t meet our requirements for speed or relevance when it came to understanding recruiter intent behind each search.

Solution: BM25 for fast, accurate full-text search

The OpenSearch Service BM25 algorithm quickly proved invaluable by allowing Juicebox to optimize full-text search performance while maintaining accuracy. Through keyword relevance scoring, BM25 helps rank profiles based on the likelihood that they match the recruiter’s query. This optimization reduced our average query latency from around 700 milliseconds to 250 milliseconds, allowing recruiters to retrieve relevant profiles much faster than our previous search implementation.

With BM25, we observed a nearly threefold reduction in latency for keyword-based searches, improving the overall search experience for our users.

Problem 2: Matching intent, not just keywords

In recruiting, exact keyword matching can often lead to missing out on qualified candidates. A recruiter looking for “data scientists with NLP experience” might miss candidates with “machine learning” in their profiles, even though they have the right expertise.

Solution: k-NN-powered vector search for semantic understanding

To address this, Juicebox uses k-nearest neighbor (k-NN) vector search. Vector embeddings allow the system to understand the context behind recruiter queries and match candidates based on semantic meaning, not just keyword matches. We maintain a billion-scale vector search index that is capable of performing low-latency k-NN search, thanks to OpenSearch Service optimizations like product quantization capabilities. The neural search capability allowed us to build a Retrieval Augmented Generation (RAG) pipeline for embedding natural language queries before searching. OpenSearch Service allows us to optimize algorithm hyperparameters for Hidden Navigable Small Worlds (HNSW) like m, ef_search, and ef_construction. This enabled us to achieve our target latency, recall, and cost goals.

Semantic search, powered by k-NN, allowed us to surface 35% more relevant candidates compared to keyword-only searches for complex queries. The speed of these searches was still fast and accurate, with vectorized queries achieving a 0.9+ recall.

Problem 3: Difficulty in benchmarking machine learning models

There are several key performance indicators (KPIs) that measure the success of your search. When you use vector embeddings, you have a number of choices to make when selecting the model, fine-tuning the model, and choosing the hyperparameters to use. You need to benchmark your solution to make sure that you’re getting the right latency, cost, and especially accuracy. Benchmarking machine learning (ML) models for recall and performance is challenging due to the vast number of fast-evolving models available (such as MTEB leaderboard on Hugging Face). We faced difficulties in selecting and measuring models accurately while making sure we performed well across large-scale datasets.

Solution: Exact k-NN with scoring script in OpenSearch Service

Juicebox used exact k-NN with scoring script features to address these challenges. This feature allows for precise benchmarking by executing brute-force nearest neighbor searches and applying filters to a subset of vectors, making sure that recall metrics are accurate. Model testing was streamlined using the wide range of pre-trained models and ML connectors (integrated with Amazon Bedrock and Amazon SageMaker) provided by OpenSearch Service. The flexibility of applying filtering and custom scoring scripts helped us evaluate multiple models across high-dimensional datasets with confidence.

Juicebox was able to measure model performance with fine-grained control, achieving 0.9+ recall. The use of exact k-NN allowed Juicebox to benchmark faster and reliably, even on billion-scale data, providing the confidence needed for model selection.

Problem 4: Lack of data-driven insights

Recruiters need to not only find candidates, but also gain insights into broader talent industry trends. Analyzing hundreds of millions of profiles to identify trends in skills, geographies, and industries was computationally intensive. Most other search engines that support full-text search or k-NN search didn’t support aggregations.

Solution: Advanced aggregations with OpenSearch Service

The powerful aggregation features of OpenSearch Service allowed us to build Talent Insights, a feature that provides recruiters with actionable insights from aggregated data. By performing large-scale aggregations across millions of profiles, we identified key skills and hiring trends, and helped clients adjust their sourcing strategies.

Aggregation queries now run on over 100 million profiles and return results in under 800 milliseconds, allowing recruiters to generate insights instantly.

Problem 5: Streamlining data ingestion and indexing

Juicebox ingests data continuously from multiple sources across the web, reaching terabytes of new data per month. We needed a robust data pipeline to ingest, index, and query this data at scale without performance degradation.

Solution: Scalable data ingestion with Amazon OpenSearch Ingestion pipelines

Using Amazon OpenSearch Ingestion, we implemented scalable pipelines. This allowed us to efficiently process and index hundreds of millions of profiles every month without worrying about pipeline failures or system bottlenecks. We used AWS Glue to preprocess data from multiple sources, chunk it for optimal processing, and feed it into our indexing pipeline.

Conclusion

In this post, we shared how Juicebox uses OpenSearch Service for improved search. We can now index hundreds of millions of profiles per month, keeping our data fresh and up to date, while maintaining real-time availability for searches.


About the authors

Ishan Gupta is the Co-Founder and CTO of Juicebox, an AI-powered recruiting software startup backed by top Silicon Valley investors including Y Combinator, Nat Friedman, and Daniel Gross. He has built search products used by thousands of customers to recruit talent for their teams.

Jon Handler is the Director of Solutions Architecture for Search Services at Amazon Web Services, based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads for OpenSearch. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a Ph. D. in Computer Science and Artificial Intelligence from Northwestern University.

TVS Supply Chain Solutions built a file transfer platform using AWS Transfer Family for AS2 for B2B collaboration

Post Syndicated from Suresh Kanniappan original https://aws.amazon.com/blogs/architecture/tvs-supply-chain-solutions-built-a-file-transfer-platform-using-aws-transfer-family-for-as2-for-b2b-collaboration/

TVS Supply Chain Solutions (TVS SCS), promoted by the erstwhile TVS Group and now part of the $3 billion TVS Mobility Group, is an India-based multinational company who pioneered the development of the supply chain solutions market in India.

For the last 2 decades, it has provided supply chain management services to customers in the automotive, consumer goods, defense, and utility sectors in India, the United Kingdom, Europe, and the US. It has a presence in 26 countries with over 17,000 employees and provides services to 78 global Fortune 500 companies. The company went public in 2023.

To meet its customers’ compliance requirements, TVS SCS sought a reliable file transfer solution supporting Applicability Statement 2 (AS2), a business-to-business (B2B) messaging protocol. This post describes how TVS SCS built a secure file transfer platform using AWS Transfer Family for AS2 to exchange Electronic Data Interchange (EDI) documents with their B2B customers in the logistics industry.

Business use case

Several end customers in the manufacturing sector mandated the exchange of EDI documents through the AS2 protocol over the internet. To address this requirement while maintaining manageability, security, and scalability, TVS SCS implemented a file transfer platform on AWS.

TVS SCS serves end customers in the manufacturing sector who require supply chain solutions between various locations:

  • Source – Plants, warehouses, technology
  • Destination – OEM vendors, plants, dealers

The process involves the following steps:

  1. The end customer sends a booking request document (booking fact) to TVS SCS.
  2. TVS SCS and the end customer exchange a series of EDI documents.
  3. TVS SCS must acknowledge, process, and update the end customer upon receipt of each EDI document.

TVS SCS built a file transfer platform using Transfer Family with AS2 configuration to achieve the following:

  • Securely exchange EDI documents with end customers
  • Provide continuous notification using Message Disposition Notifications (MDNs)

The following diagram illustrates the end-to-end business process (requisition, sourcing, purchase orders, receiving, and invoicing) between TVS SCS and an end customer using the AS2 protocol.

End to end business process

Why the cloud?

TVS SCS chose AWS to build their AS2-compliant file transfer platform for three key reasons:

  • Data location – All relevant data (such as order creation and customer details) already resides in AWS
  • Infrastructure management – AWS addresses challenges in the following areas:
    • Maintaining highly available and scalable infrastructure
    • Maintaining correct AS2 system interoperability with trading partners
    • Meeting compliance requirements
  • Versatility for non-AS2 customers – TVS SCS uses multiple scalable and fully managed AWS services to build customized APIs and webhooks for customers not using AS2

This cloud-based approach allows TVS SCS to focus on their core business while AWS handles the complexities of secure, compliant, and scalable file transfer infrastructure.

Why Transfer Family and AS2?

AS2 is a B2B messaging protocol commonly used for exchanging EDI documents securely with integrity control according to the EDIFACT standard, reliably, and cost-effectively over the internet using the HTTP and HTTPS protocols. B2B integration over the AS2 protocol can be challenging, such as with trading partner onboarding, AS2 EDI integration, firewall configuration, certificate maintenance, and high licensing costs for commercial AS2 solutions.

By choosing Transfer Family with AS2 configuration, TVS SCS addresses these challenges and gains several advantages:

  • Simplified partner onboarding
  • Managed infrastructure, reducing maintenance overhead
  • Built-in security features
  • Flexible scaling to meet changing business needs
  • Pay-as-you-go pricing model

Solution overview

The following diagram shows the relationship between the AS2 objects involved in the inbound and outbound processes.

Relationship between the AS2 objects

The following diagram illustrates the solution architecture with AWS services.

Solution Architecture

For step-by-step instructions about creating an AS2 server using Transfer Family, refer to Create an AS2 server using the Transfer Family console.

The allowlisted IP address of the end-customer AS2 server is allowed to communicate with Transfer Family for AS2 on AWS. The customer sends the EDI document through Transfer Family, and the EDIs are stored in Amazon Simple Storage Service (Amazon S3). The business logic is implemented in AWS Lambda functions to read the EDI documents, process them, and update customers. AWS B2B Data Interchange, a fully managed service for automating EDI document transformation, can be considered as a complementary or alternative solution for EDI processing. There are two Lambda functions created: one handles truck booking using NodeJS, and the other handles outbound file transfer (from Amazon S3 to the AS2 server) using Python 3.2.

This architecture enables TVS SCS to securely and efficiently manage the EDI document flow, from receipt through processing and outbound transfer, using scalable and serverless AWS services. The solution provides a compliant and cost-effective approach to B2B data exchange with customers and partners.

Prerequisites

For the prerequisites to configure Transfer Family with AS2, see Configuring AS2. To learn more about the security features in Transfer Family, see Security in AWS Transfer Family.

End customer to TVS SCS communication workflow

The following diagram illustrates the step-by-step process of a truck booking request from an end customer to TVS SCS using AWS services.

End customer to TVS SCS communication workflow

This streamlined workflow demonstrates how TVS SCS uses AWS services to efficiently handle truck booking requests from customers:

  1. The customer initiates a truck booking by sending a booking fact EDI to TVS SCS. The EDI contains details like customer name, date, source location, destination location, and more.
  2. The signed and encrypted booking fact EDI is sent as an inbound HTTP AS2 payload to Transfer Family through the internet.
  3. Transfer Family writes the booking fact EDI to the S3 bucket.
  4. TVS SCS confirms receipt of the booking fact EDI either through the inline HTTP response or an asynchronous HTTP POST request to the originating server.
  5. The EDI exchange audit trail is logged in Amazon CloudWatch Logs.
  6. The EDI document is available for TVS SCS consumption, and a Lambda function processes the document using business logic.

TVS SCS to end customer communication workflow

The following diagram depicts the workflow from TVS SCS to the end customer.

TVS SCS to end customer communication workflow

This workflow demonstrates how TVS SCS uses AWS services to provide timely and accurate updates to customers throughout the delivery process:

  1. The customer confirms the price quote. TVS SCS uploads EDI documents to S3 bucket.
  2. TVS SCS sends a series of updates using the AS2 outbound connector, such as truck allocation, truck departure, truck in-transit status, truck delay notifications, delivery confirmation, and billing invoice. A Lambda function reads the EDI documents from Amazon S3 and runs business logic to generate responses for the end customer.
  3. The EDI documents are sent as an outbound HTTP payload.
  4. The customer AS2 server sends an acknowledgment using MDN.
  5. The EDI exchange audit trail is logged in CloudWatch Logs.
  6. The EDI document is available for the customer’s consumption and further processing.

Results

The following customer challenges were addressed with this solution:

  • It meets end customer requirements for EDI file exchange through AS2 protocol
  • It eliminates the need for in-house AS2 infrastructure management
  • It provides flexibility to add new customers to the file transfer platform

By addressing these challenges and using AWS services, TVS SCS has created a future-proof file transfer platform.

Summary

This post demonstrated how cloud-based services can transform traditional B2B communication processes, offering supply chain companies a path to improved efficiency, compliance, and customer satisfaction. For supply chain providers facing similar challenges, this solution offers a blueprint for modernizing file transfer systems while maintaining compliance with industry standards.

To learn more about this AWS solution for supply chain companies, contact AWS for further assistance. AWS can provide detailed information about implementation, pricing, and how to tailor the solution to your specific business needs. They have teams of experts who can guide companies through the process of modernizing their B2B communication systems using cloud-based services.


About the Authors

Fitch Group achieves multi-Region resiliency for mission-critical Kafka infrastructure with Amazon MSK Replicator

Post Syndicated from Kalyan Janaki original https://aws.amazon.com/blogs/big-data/fitch-group-achieves-multi-region-resiliency-for-mission-critical-kafka-infrastructure-with-amazon-msk-replicator/

Real-time data streaming and event processing are critical components of modern distributed systems architectures. Apache Kafka has emerged as a leading platform for building real-time data pipelines and enabling asynchronous communication between microservices and applications. However, running and managing Kafka clusters at scale can be challenging, requiring specialized expertise and significant operational overhead.

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that allows you to build and run production Kafka applications. With Amazon MSK, you can rely on AWS to handle the heavy lifting of provisioning and managing Kafka clusters, while you focus on building innovative applications and real-time data processing pipelines.

In this post, we explore how Fitch Group, one of the top credit rating companies, used Amazon MSK and Amazon MSK Replicator to achieve multi-Region resiliency for their mission-critical Kafka infrastructure.

About Fitch Group and their need for multi-region resiliency

As a leading global financial information services provider, Fitch Group delivers vital credit and risk insights, robust data, and dynamic tools to champion more efficient, transparent financial markets. With employees in over 30 countries, Fitch Group’s culture of credibility, independence, and transparency is embedded throughout its structure, which includes Fitch Ratings, one of the world’s top three credit ratings agencies, and Fitch Solutions, a leading provider of insights, data, and analytics.

To stay competitive and efficient in the fast-paced financial industry, Fitch Group strategically adopted an event-driven microservices architecture. At the heart of this ecosystem lies Kafka, specifically Amazon MSK, which serves as the backbone for their data integration systems.

Fitch Group uses Kafka to enable applications to send ratings-related business events, facilitating automation within their ratings workflow systems and providing real-time or near real-time processing. This architectural choice has significantly reduced the time to market for end-user-facing systems like Fitch Ratings Pro and Fitch Group Ratings websites. Moreover, Kafka’s robust capabilities allow for seamless aggregation and distribution of data from many disparate systems through their data platform, enhancing data consistency, reliability, and accessibility across the organization.

Given the critical role that Kafka plays in Fitch Group architecture, providing robust disaster recovery (DR) mechanisms became paramount. Any disruption to their Kafka infrastructure could have significant repercussions on their ratings workflow automation, real-time processing, and end-user-facing systems, potentially exposing Fitch Group to regulatory, financial, and reputational risks.

To achieve the desired levels of resiliency, Fitch Group had the following key requirements:

  • Multi-Region deployment – Deploy MSK clusters across multiple AWS Regions to provide business continuity and maintain service availability during Regional or service events
  • Automated replication – Replicate Kafka data across Regions in near real time with minimal latency and data loss
  • Consistent topic namespaces – Maintain the same Kafka topic names and structures across source and destination clusters to minimize application changes
  • Rapid recovery – In the event of a failover, enable applications to seamlessly start consuming from the replicated cluster with minimal Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

Solution overview

Fitch Group chose to implement their multi-Region Kafka deployment using Amazon MSK and MSK Replicator. MSK Replicator is a fully managed replication service that enables continuous, automated data replication between MSK clusters within the same Region or across different Regions. It supports replicating data between clusters with different configurations, including varying broker counts, storage volumes, and Kafka versions. Here’s how Fitch Group used MSK Replicator to achieve their multi-Region resiliency goals:

  • Deployed MSK clusters in two separate Regions, with the primary cluster in the main Region and the secondary cluster in a different Region for disaster recovery
  • Configured MSK Replicator to continuously replicate data from the primary cluster to the secondary cluster, maintaining the same topic names and structures across both clusters
  • Implemented application failover logic to automatically switch to consuming from the secondary cluster in case of a primary cluster unavailability, with minimal recovery time and data loss

The following diagram illustrates this architecture

Benefits achieved

By implementing Amazon MSK and MSK Replicator, Fitch Group realized several key benefits:

  • Enhanced disaster recovery – The multi-Region deployment provides business continuity even in the face of Regional or service events.
  • Simplified operations – The managed capability of MSK Replicator offloads the operational complexity of self-managing custom replication solutions, reducing the burden on Fitch Group’s IT team
  • Scalability – The solution can scale to handle varying data loads, making sure that DR capabilities grow alongside business needs
  • Minimal application changes – MSK Replicator supports replicating topics with the same name, which eliminates the need for consumer application modifications, reducing development effort and potential errors
  • Seamless failover and failback – Bidirectional replication capabilities enable quick switching of operations to the standby Region with minimal disruption, and straightforward reversion after the primary Region is restored
  • Improved testing capabilities – The setup facilitates regular DR exercises without impacting production systems, allowing Fitch Group to validate their DR plans consistently

Conclusion

By using Amazon MSK and MSK Replicator, Fitch Group has successfully implemented a highly resilient and scalable Kafka infrastructure that meets their stringent business continuity and disaster recovery requirements. This multi-Region deployment enables them to process mission-critical financial data at scale while providing minimal downtime and data loss in the event of service events or disasters. As Fitch Group continues to innovate and grow, their robust Kafka infrastructure provides a solid foundation for future expansion and the development of new data-driven services, ultimately enhancing their ability to deliver timely and accurate financial insights to their clients.


About the authors

Kalyan Janaki is Senior Big Data & Analytics Specialist with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Venu Nemallikanti is the Enterprise Architect and Lead for Event Streaming at Fitch Group, a globally recognized financial information services provider operating in over 30 countries. His primary responsibilities include overseeing the architecture and implementation of event streaming solutions, ensuring the seamless integration and performance of systems that deliver credit ratings, research, data, and analytics to a worldwide clientele.

Chaitanya Shah is a Principal Technical Account Manager with AWS, based out of New York. He loves to code and actively contributes to the AWS solutions labs to help customers solve complex problems. He provides guidance to AWS customers on best practices for their Cloud migrations. He is also specialized in AWS data transfer and the data and analytics domain.

Oleg Chugaev is a Principal Solutions Architect and Serverless evangelist with 20+ years in IT, holding multiple AWS certifications. At AWS, he drives customers through their cloud transformation journeys by converting complex challenges into actionable roadmaps for both technical and business audiences.

Jumia builds a next-generation data platform with metadata-driven specification frameworks

Post Syndicated from Ramon Diez Lejarazu original https://aws.amazon.com/blogs/big-data/jumia-builds-a-next-generation-data-platform-with-metadata-driven-specification-frameworks/

Jumia is a technology company born in 2012, present in 14 African countries, with its main headquarters in Lagos, Nigeria. Jumia is built around a marketplace, a logistics service, and a payment service. The logistics service enables the delivery of packages through a network of local partners, and the payment service facilitates the payments of online transactions within Jumia’s ecosystem. Jumia is present in NYSE and has a market cap of $554 million.

In this post, we share part of the journey that Jumia took with AWS Professional Services to modernize its data platform that ran under a Hadoop distribution to AWS serverless based solutions. Some of the challenges that motivated the modernization were the high cost of maintenance, lack of agility to scale computing at specific times, job queuing, lack of innovation when it came to acquiring more modern technologies, complex automation of the infrastructure and applications, and the inability to develop locally.

Solution overview

The basic concept of the modernization project is to create metadata-driven frameworks, which are reusable, scalable, and able to respond to the different phases of the modernization process. These phases are: data orchestration, data migration, data ingestion, data processing, and data maintenance.

This standardization for each phase was considered as a way to streamline the development workflows and minimize the risk of errors that can arise from using disparate methods. This also enabled migration of different kinds of data following a similar approach regardless of the use case. By adopting this approach, the data handling is consistent, more efficient, and more straightforward to manage across different projects and teams. In addition, although the use cases have autonomy in their domain from a governance perspective, on top of them is a centralized governance model that defines the access control in the shared architectural components. Importantly, this implementation emphasizes data protection by enforcing encryption across all services, including Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. Furthermore, it adheres to the principle of least privilege, thereby enhancing overall system security and reducing potential vulnerabilities.

The following diagram describes the frameworks that were created. In this design, the workloads in the new data platform are divided by use case. Each use case requires the creation of a set of YAML files for each phase, from data migration to data flow orchestration, and they are basically the input of the system. The output is a set of DAGs that run the specific tasks.

Overview

In the following sections, we discuss the objectives, implementation, and learnings of each phase in more detail.

Data orchestration

The objective of this phase is to build a metadata-driven framework to orchestrate the data flows along the whole modernization process. The orchestration framework provides a robust and scalable solution that has the following capacities: dynamically create DAGs, integrate natively with non-AWS services, allow the creation of dependencies based on past executions, and add an accessible metadata generation per each execution. Therefore, it was decided to use Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which, through the Apache Airflow engine, provides these functionalities while abstracting users from the management operation.

The following is the description of the metadata files that are provided as part of the data orchestration phase for a given use case that performs the data processing using Spark on Amazon EMR Serverless:

owner: # Use case owner
dags: # List of DAGs to be created for this use case
  - name: # Use case name
    type: # Type of DAG (could be migration, ingestion, transformation or maintenance)
    tags: # List of TAGs
    notification: # Defines notificacions for this DAGs
      on_success_callback: true
      on_failure_callback: true
    spark: # Spark job information 
      entrypoint: # Spark script 
      arguments: # Arguments required by the Spark script
      spark_submit_parameters: # Spark submit parameters. 

The idea behind all the frameworks is to build reusable artifacts that enable the development teams to accelerate their work while providing reliability. In this case, the framework provides the capabilities to create DAG objects within Amazon MWAA based on configuration files (YAML files).

This particular framework is built on layers that add different functionalities to the final DAG:

  • DAGs – The DAGs are built based on the metadata information provided to the framework. The data engineers don’t have to write Python code in order to create the DAGs, they are automatically created and this module is in charge of performing this dynamic creation of DAGs.
  • Validations – This layer handles YAML file validation in order to prevent corrupted files from affecting the creation of other DAGs.
  • Dependencies – This layer handles dependencies among different DAGs in order to handle complex interconnections.
  • Notifications – This layer handles the type of notifications and alerts that are part of the workflows.

Orchestration

One aspect to consider when using Amazon MWAA is that, being a managed service, it requires some maintenance from the users, and it’s important to have a good understanding of the number of DAGs and processes that you’re expected to have in order to fine-tune the instance and obtain the desired performance. Some of the parameters that were fine-tuned during the engagement were core.dagbag_import_timeout, core.dag_file_processor_timeout, core.min_serialized_dag_update_interval, core.min_serialized_dag_fetch_interval, scheduler.min_file_process_interval, scheduler.max_dagruns_to_create_per_loop, scheduler.processor_poll_interval, scheduler.dag_dir_list_interval, and celery.worker_autoscale.

One of the layers described in the preceding diagram corresponds to validation. This was an important component for the creation of dynamic DAGs. Because the input to the framework consists of YML files, it was decided to filter out corrupted files before attempting to create the DAG objects. Following this approach, Jumia could avoid undesired interruptions of the whole process. The module that actually builds DAGs only receives configuration files that follow the required specifications to successfully create them. In case of corrupted files, information regarding the actual issues is logged into Amazon CloudWatch so that developers can fix them.

Data migration

The objective of this phase is to build a metadata-driven framework for migrating data from HDFS to Amazon S3 with Apache Iceberg storage format, which involves the least operational overhead, provides scalability capacity during peak hours, and guarantees data integrity and confidentiality.

The following diagram illustrates the architecture.

Migration

During this phase, a metadata-driven framework built in PySpark receives a configuration file as input so that some migration tasks can run in an Amazon EMR Serverless job. This job uses the PySpark framework as the script location. Then the orchestration framework described previously is used to create a migration DAG that runs the following tasks:

  1. The first task creates the DDLs in Iceberg format in the AWS Glue Data Catalog using the migration framework within an Amazon EMR Serverless job.
  2. After the tables are created, the second task transfers HDFS data to a landing bucket in Amazon S3 using AWS DataSync to sync customer data. This process brings data from all the different layers of the data lake.
  3. When this process is complete, a third task converts data to Iceberg format from the landing bucket to the destination bucket (raw, process, or analytics) using again another option of the migration framework embedded in an Amazon EMR Serverless job.

Data transfer performance is better when the size of the files to be transferred is around 128–256 MB, so it’s recommended to compress the files at the source. By reducing the number of files, metadata analysis and integrity phases are reduced, speeding up the migration phase.

Data ingestion

The objective of this phase is to implement another framework based on metadata that responds to the two data ingestion models. A batch mode is responsible for extracting data from different data sources (such as Oracle or PostgreSQL) and a micro-batch-based mode extracts data from a Kafka cluster that, based on configuration parameters, has the capacity to run native streams in streaming.

The following diagram illustrates the architecture for the batch and micro-batch and streaming approach.

Ingestion

During this phase, a metadata-driven framework builds the logic to bring data from Kafka, databases, or external services, that will be run using an ingestion DAG deployed in Amazon MWAA.

Spark Structured Streaming was used to ingest data from Kafka topics. The framework receives configuration files in YAML format that indicate which topics to read, what extraction processes should be performed, whether it should be read in streaming or micro-batch, and in which destination table the information should be saved, among other configurations.

For batch ingestion, a metadata-driven framework written in Pyspark was implemented. In the same way as the previous one, the framework received a configuration in YAML format with the tables to be migrated and their destination.

One of the aspects to consider in this type of migration is the synchronization of data from the ingestion phase and the migration phase, so that there is no loss of data and that data is not reprocessed unnecessarily. To this end, a solution has been implemented that saves the timestamps of the last historical data (per table) migrated in a DynamoDB table. Both types of frameworks are programmed to use this data the first time they are run. For micro-batching use cases, which use Spark Structured Streaming, Kafka data is read by assigning the value stored in DynamoDB to the startingTimeStamp parameter. For all other executions, priority will be given to the metadata in the checkpoint folder. This way, you can make sure ingestion is synchronized with the data migration.

Data processing

The objective in this phase was to be able to handle updates and deletions of data in an object-oriented file system, so Iceberg is a key solution that was adopted throughout the project as delta lake files because of its ACID capabilities. Although all phases use Iceberg as delta files, the processing phase makes extensive use of Iceberg’s capabilities to do incremental processing of data, creating the processing layer using UPSERT using Iceberg’s ability to run MERGE INTO commands.

The following diagram illustrates the architecture.

Processing

The architecture is similar to the ingestion phase, with just changes to the data source to be Amazon S3. This approach speeds up the delivery phase and maintains quality with a production-ready solution.

By default, Amazon EMR Serverless has the spark.dynamicAllocation.enabled parameter set to True. This option scales up or down the number of executors registered within the application, based on the workload. This brings a lot of advantages when dealing with different types of workloads, but it also brings considerations when using Iceberg tables. For instance, while writing data into an Iceberg table, the Amazon EMR Serverless application can use a large number of executors in order to speed up the task. This can result in reaching Amazon S3 limits, specifically the number of requests per second per prefix. For this reason, it’s important to apply good data partitioning practices.

Another important aspect to consider in these cases is the object storage file layout. By default, Iceberg uses the Hive storage layout, but it can be set to use ObjectStoreLocationProvider. By setting this property, a deterministic hash is generated for each file, with a hash appended directly after write.data.path. This can considerably minimize throttle requests based on object prefix, as well as maximize throughput for Amazon S3 related I/O operations, because the files written are equally distributed across multiple prefixes.

Data maintenance

When working with data lake table formats such as Iceberg, it’s essential to engage in routine maintenance tasks to optimize table metadata file management, preventing a large number of unnecessary files from accumulating and promptly removing any unused files. The objective of this phase was to build another framework that can perform these types of tasks on the tables within the data lake.

The following diagram illustrates the architecture.

Maintenance

The framework, as well as the other ones, receives a configuration file (YAML files) indicating the tables and the list of maintenance tasks with their respective parameters. It was built on PySpark so that it could run as an Amazon EMR Serverless job and could be orchestrated using the orchestration framework just like the other frameworks built as part of this solution.

The following maintenance tasks are supported by the framework:

  • Expire snapshots – Snapshots can be used for rollback operations as well as time traveling queries. However, they can accumulate over time and can lead to performance degradation. It’s highly recommended to regularly expire snapshots that are no longer needed.
  • Remove old metadata files – Metadata files can accumulate over time just like snapshots. Removing them regularly is also recommended, especially when dealing with streaming or micro-batching operations, which was one of the cases of the overall solution.
  • Compact files – As the number of data files increases, the number of metadata stored in the manifest files also increases, and small data files can lead to less efficient queries. Because this solution uses a streaming and micro-batching application writing into Iceberg tables, the size of the files tends to be small. For this reason, a method to compact files was imperative to enhance the overall performance.
  • Hard delete data – One of the requirements was to be able to perform hard deletes in the data older than a certain period of time. This implies removing expiring snapshots and removing metadata files.

The maintenance tasks were scheduled with different frequencies depending on the use case and the specific task. For this reason, the schedule information for this tasks is defined in each of the YAML files of the specific use case.

At the time this framework was implemented, there was no any automatic maintenance solution on top of Iceberg tables. At AWS re:Invent 2024, Amazon S3 Tables functionality has been released to automatize the maintenance of Iceberg Tables . This functionality automates file compaction, snapshot management, and unreferenced file removal.

Conclusion

Building a data platform on top of standarized frameworks that use metadata for different aspects of the data handling process, from data migration and ingestion to orchestration, enhances the visibility and control over each of the phases and significantly speeds up implementation and development processes. Furthermore, by using services such as Amazon EMR Serverless and DynamoDB, you can bring all the benefits of serverless architectures, including scalability, simplicity, flexible integration, improved reliability, and cost-efficiency.

With this architecture, Jumia was able to reduce their data lake cost by 50%. Furthermore, with this approach, data and DevOps teams were able to deploy complete infrastructures and data processing capabilities by creating metadata files along with Spark SQL files. This approach has reduced turnaround time to production and reduced failure rates. Additionally, AWS Lake Formation provided the capabilities to collaborate and govern datasets on various storage layers on the AWS platform and externally.

Leveraging AWS for our data platform has not only optimized and reduced our infrastructure costs but also standardized our workflows and ways of working across data teams and established a more trustworthy single source of truth for our data assets. This transformation has boosted our efficiency and agility, enabling faster insights and enhancing the overall value of our data platform.

– Hélder Russa, Head of Data Engineering at Jumia Group.

Take the first step towards streamlining the data migration process now, with AWS.


About the Authors

Ramón Díez is a Senior Customer Delivery Architect at Amazon Web Services. He led the project with the firm conviction of using technology in service of the business.

Paula Marenco is a Data Architect at Amazon Web Services, she enjoys designing analytical solutions that bring light into complexity, turning intricate data processes into clear and actionable insights. Her work focuses on making data more accessible and impactful for decision-making.

 Hélder Russa is the Head of Data Engineering at Jumia Group, contributing to the strategy definition, design, and implementation of multiple Jumia data platforms that support the overall decision-making process, as well as operational features, data science projects, and real-time analytics.

Pedro Gonçalves is a Principal Data Engineer at Jumia Group, responsible for designing and overseeing the data architecture, emphasizing on AWS Platform and datalakehouse technologies to ensure robust and agile data solutions and analytics capabilities.

HEMA accelerates their data governance journey with Amazon DataZone

Post Syndicated from Luis Campos original https://aws.amazon.com/blogs/big-data/hema-accelerates-their-data-governance-journey-with-amazon-datazone/

This post is cowritten by Tommaso Paracciani and Oghosa Omorisiagbon from HEMA.

Data has become an invaluable asset for businesses, offering critical insights to drive strategic decision-making and operational optimization. However, many companies today still struggle to effectively harness and use their data due to challenges such as data silos, lack of discoverability, poor data quality, and a lack of data literacy and analytical capabilities to quickly access and use data across the organization. To address these growing data management challenges, AWS customers are using Amazon DataZone, a data management service that makes it fast and effortless to catalog, discover, share, and govern data stored across AWS, on-premises, and third-party sources.

HEMA is a household Dutch retail brand name since 1926, providing daily convenience products using unique design. HEMA’s more than 17,000 employees bring exclusive, sustainably designed products in more than 750 stores in the Netherlands but also in Belgium, Luxembourg, France, Germany, and Austria, with webstores available in all these countries. HEMA built its first ecommerce system on AWS in 2018 and 5 years later, its developers have the freedom to innovate and build software fast with their choice of tools in the AWS Cloud. Today, this is powering every part of the organization, from the customer-favorite online cake customization feature to democratizing data to drive business insight.

This post describes how HEMA used Amazon DataZone to build their data mesh and enable streamlined data access across multiple business areas. It explains HEMA’s unique journey of deploying Amazon DataZone, the key challenges they overcame, and the transformative benefits they have realized since deployment in May 2024. From establishing an enterprise-wide data inventory and improving data discoverability, to enabling decentralized data sharing and governance, Amazon DataZone has been a game changer for HEMA.

Data landscape at HEMA

After moving its entire data platform from on premises to the AWS Cloud, the wave of change presented a unique opportunity for the HEMA Data & Cloud function to invest and commit in building a data mesh.

HEMA has a bespoke enterprise architecture, built around the concept of services. These services are individual software functionalities that fulfill a specific purpose within the company. Each service is hosted in a dedicated AWS account and is built and maintained by a product owner and a development team, as illustrated in the following figure.

HEMA runs over 400 services, and 20 of them run extract, transform, and load (ETL) pipelines with dedicated data resources, which produce and consume data assets shared across the data mesh.

Data management in a data mesh

Weeks after launch, HEMA’s data platform wasn’t where the company wanted it to be. Building an agile organization that runs on reliable and streamlined processes was the primary goal. Initially, the data inventories of different services were siloed within isolated environments, making data discovery and sharing across services manual and time-consuming for all teams involved.

Implementing robust data governance is challenging. In a data mesh architecture, this complexity is amplified by the organization’s decentralized nature. In this context, HEMA concluded that data governance was no longer a nice-to-have, but had become a foundational piece required to build a healthy data organization.

Why HEMA selected Amazon DataZone

By exploring the preview, HEMA saw how Amazon DataZone covered all the critical pillars of data management in a single solution. It was clear how Amazon DataZone would bring benefit to both the technical teams as well as the business end-users. The technical organization could take advantage of a robust programmatic solution to manage the availability, accessibility, and quality of the data assets that make the enterprise data catalog. The business end-users were given a tool to discover data assets produced within the mesh and seamlessly self-serve on their data sharing needs.

Features such as AI-generated metadata were key to providing end-users with reliable and use case-driven explanations of what a certain data product could provide and solve, while the subscription feature allowed them to start using a certain data asset within their own environment in a matter of seconds, as opposed to the existing lengthy and human-driven process.

These reasons, as well as the self-service capabilities, resulted in HEMA’s decision to adopt and roll out Amazon DataZone at the enterprise level.

Solution overview

The HEMA data landscape is multifaceted, with various teams across the organization using a range of technologies and systems, including Databricks. To effectively govern this complex data environment, HEMA has adopted a data mesh architecture on AWS. This architecture maintains a central intelligence platform (CIP) that enables the activities of both data producers and data consumers by providing the necessary platform and infrastructure. The overall structure can be represented in the following figure.

Each service uses two AWS accounts, one for pre-production and one for production. This separation means changes can be tested thoroughly before being deployed to live operations.

Amazon DataZone is the central piece in this architecture. It helps HEMA centralize all data assets across disparate data stacks into a single catalog. It plays a pivotal role in bridging the gap and integrating different systems, such as Databricks and native AWS services. The integration of Databricks Delta tables into Amazon DataZone is done using the AWS Glue Data Catalog. Delta tables’ technical metadata is stored in the Data Catalog, which is a native source for creating assets in the Amazon DataZone business catalog. Access control is enforced using AWS Lake Formation, which manages fine-grained access control and data sharing on data lake data. The following figure illustrates the data mesh architecture.

The Amazon DataZone implementation follows the same approach as individual services: HEMA maintains two distinct domain data catalogs: preprod-hema-data-catalog and prod-hema-data-catalog. These catalogs serve as the backbone for data sharing across pre-production and production accounts, allowing flexible access to data assets based on the environment’s needs.

The prod-hema-data-catalog is the production-grade catalog that supports data sharing across production services and, in some cases, pre-production services. This catalog only facilitates the production of data assets from production services (disallows publishing of assets belonging to pre-production services) and allows pre-production services to access production-grade data. The following diagram illustrates the architecture of both accounts.

To establish isolation between services in the data mesh, a project is dedicated to a unique service account. The environment profiles and environments are configured to be explicitly used only by the service. This Amazon DataZone configuration is managed centrally by the core team using AWS CloudFormation. After projects are created and configured by the central team, project teams have access to self-service capabilities to create their own environments according to their needs.

The following diagram illustrates the full workflow for onboarding HEMA service teams in Amazon DataZone.

The workflow includes the following steps:

  1. A service team (either a data producer or a data consumer) initiates a request to the core data platform team to enable data sharing for their service accounts. This request is typically made when a service team has a use case where they need to either publish data to the catalog (for other teams to consume) or access data that another team has published.
  2. After the request is received, the core data platform team assesses the requirements and initiates the creation of projects and environments in Amazon DataZone. This is done using AWS CloudFormation and a continuous integration and delivery (CI/CD) pipeline. The core data platform team makes sure that the appropriate AWS account (pre-production or production) is linked to the environment within the project in the respective catalogs.
  3. After the projects and environments are set up, service teams can use Amazon DataZone features to produce and consume data assets:
    1. Producers (for example, Service A) can publish their data assets to the Data Catalog and approve or reject subscription requests.
    2. Consumers (for example, Service B) can search and access these published data assets using the Amazon DataZone catalog and request data access through subscription requests.

In a decentralized data mesh environment, there is a risk of service teams creating resources in service accounts they are not authorized to manage, which may lead to governance issues and data mismanagement. To address this challenge, HEMA followed two principles:

  • Amazon DataZone project structure – Each project contains resources that are solely managed by the service team (project members) responsible for it. Each service team’s project provides a clear boundary for the resources they manage.
  • Environment isolation – The core teams enforce governance policies in the Amazon DataZone configuration, allowing teams to only deploy resources within their own environments.

Adoption plan: Strategy

In HEMA’s data mesh, the catalog must be built in collaboration with all the services that produce data, so the key for the central data governance team was ideating an adoption plan that would add value to these teams, rather than disrupting the delivery of their projects. With that in mind, HEMA’s adoption strategy was designed on three core principles:

  • Launch it – Do not wait until you can ship to production a full-scale service that covers every single feature available. Instead, define an MVP that solves the most critical need for the business and make it available for the business as soon as you can.
  • Prove value – HEMA’s data team ran several internal seminars, and dedicated presentations with each of the involved teams to showcase how Amazon DataZone would simplify their data sharing needs. Do not tell them they must invest time to learn and start using a new service, but rather let them get drawn in by the new advantages of the new functionality and stimulate self-adoption.
  • Be there – This connects with what HEMA as a company stands for. Be close to the teams when they need support during the adoption stage, like HEMA is close to their customers whenever they need a new product for their lives. Create space for Q&A and develop a collaborative experience for everyone in their adoption curve.

Adoption plan: Action points

While deploying the adoption plan for a decentralized data marketplace using Amazon DataZone, HEMA followed a “start small, fine-tune, and iterate” approach. In practice, this meant that the Data & Cloud team started working with one business unit, expanding then to several business units, while focusing on one single feature: data asset subscription. To increase interest and adoption, this process was introduced for the core data assets that were more used in the company.

After this part of the process was well understood and embraced by everyone, the next step was to start supporting the data pipeline adaptation work needed for each business unit.

Finally, when all teams were onboarded and familiar with the subscription feature, HEMA moved to introduce the business units to the second critical feature: data publishing. In summary, HEMA released new features and allowed the domains to pick up the implementation at their preferred pace before moving onto the next one.

When adoption was at a point where all core data assets were being consumed through the Amazon DataZone catalog, the Lake Formation resource links used previously to share data across accounts were decommissioned, and at the same time the Data & Cloud team interrupted their duty to share data between business units, stimulating the peer-to-peer data sharing practice, where teams can directly talk to each other without having to involve a third party.

Results

The popularity of Amazon DataZone across the enterprise ramped up quickly, and all the involved business units started using the service daily to self-serve their needs. The existence of a central data catalog enabled teams to seamlessly search, discover, share, and subscribe to data assets produced within the business. Only a few months after launching the service, HEMA observed stunning statistics:

  • Over 200 data assets published to the catalog
  • Over 180 active subscriptions
  • Over 100 active users monthly
  • Over 20 business units (services) onboarded
  • Data sharing average turnaround time cut from 4 working days to few seconds, without the support of any other team

Additionally, they saw massive benefits that can’t be represented by statistics. Above all, the ability to autonomously discover data produced by other teams is enabling a series of new use cases for the business, which weren’t even visible to them earlier due to the lack of awareness and visibility on what others were producing. For example, the data science team quickly developed a new predictive model for sales by reusing data already available in Amazon DataZone, instead of rebuilding it from scratch. This is resulting in an energized data organization, which can collaborate and contribute to shaping the future of HEMA’s data operations.

Conclusion

At HEMA, Amazon DataZone made data governance a reality, and so the company wants to implement new features in close collaboration with AWS, while continuing to work on the rollout of items that are already in HEMA’s roadmap. The team is continuously developing the service, launching a series of new features that will continue to improve the data operations:

  • Data quality scores – This feature helps data producers monitor and optimize their data assets, while consumers can see upfront the nuances of a certain asset that they might be using or are looking to use within their ETL pipelines
  • Data lineage – This feature allows consumers and the central governance team to trace data sources, transformation stages, and observe cross-organizational usage of data assets
  • Fine-grained access control – This feature enables producers to be in full control of what they share with other units, making sure that only the relevant pieces of a data asset are shared with the consuming teams

The long-term vision of HEMA is clear: Amazon DataZone is set to become the central solution for data sharing and data cataloging across the enterprise. Although as of today, Amazon DataZone is focused on supporting the teams running ETL pipelines, the goal is to extend the service to all the business teams that work with data, with the ultimate goal of streamlining their daily operations. Data is one of the most valuable resources a company has, and HEMA is determined to democratize its role by building an efficient data organization, who relies on the most advanced data governance solution on the market.


About the authors

Luis Campos is the Data & AI Governance GTM Lead for the EMEA market at AWS where he helps customers with their data strategies starting with strong data governance and uses his expertise in end-to-end data & analytics management. Luis is also a public speaking coach, based in the Netherlands, and has two boys with 18 years apart, which has taught him to see problems from both ends of a spectrum.

Vincent Gromakowski is a Principal Analytics Solutions Architect at AWS where he enjoys solving customers’ data challenges. He uses his strong expertise on analytics, distributed systems and resource orchestration platform to be a trusted technical advisor for AWS customers.

Tommaso is the Head of Data & Cloud Platforms at HEMA. He joined the business with the goal of modernising the Data Organization by building cloud-based Data Platform – hosted in AWS – which would power a Data Mesh architecture. With a strong passion for both technical and organizational challenges, Tommaso leads the Solution Architecture efforts as well as all core Data Management and Data Governance initiatives, for which he is also a passionate public speaker. Outside the office, Tommaso is a full-time dad with a passion for traveling and sports.

Oghosa Omorisiagbon is a Senior Data Engineer at HEMA. He focuses on leveraging AWS-native tools to optimise data pipelines, modernise HEMA’s data infrastructure and introduce reliable and scalable end-to-end data architecture solutions. Outside of work, he enjoys traveling, playing video games and outdoor activities.

How DeNA Co., Ltd. accelerated anonymized data quality tests up to 100 times faster using Amazon Redshift Serverless and dbt

Post Syndicated from Momota Sasaki original https://aws.amazon.com/blogs/big-data/how-dena-co-ltd-accelerated-anonymized-data-quality-tests-up-to-100-times-faster-using-amazon-redshift-serverless-and-dbt/

This blog was co-authored by DeNA Co., Ltd. and Amazon Web Services Japan.

DeNA Co., Ltd. (DeNA) engages in a variety of businesses, from games and live communities to sports & the community and healthcare & medical, under our mission to delight people beyond their wildest dreams. Among these, the healthcare & medical business handles particularly sensitive data. To comply with their data policies for sensitive data, this healthcare & medical business set the following requirements for their data processing:

  • Process data in compliance with data policies – Mask or delete sensitive data as necessary to transform into anonymized data. Prevent the inclusion of invalid values in categorical data and process data without any data loss.
  • Conduct data quality tests on anonymized data in compliance with data policies – Conduct data quality tests to quickly identify and address data quality issues, maintaining high-quality data at all times.

This post introduces a case study where DeNA combined Amazon Redshift Serverless and dbt (dbt Core) to accelerate data quality tests in their business.

The challenge

Data quality tests require performing 1,300 tests on 10 TB of data monthly. Previously, DeNA ran Python-based batch jobs on Amazon Elastic Compute Cloud (Amazon EC2) to perform these data quality tests. As business and data volume grew over time, DeNA started to face the following challenges:

  • Performance – Data quality tests took days to weeks to complete because engineers hadn’t designed the batch jobs to handle big data.
  • Cost – Costs increased due to the batch job design, particularly for large datasets. The implementation required loading data into memory for processing. When handling large table data, DeNA needed to use large memory-optimized EC2 instances.
  • Maintainability – The batch job implementations varied significantly between engineers, leading to high maintenance overhead, because the required knowledge was siloed among individual engineers.

The switch to Redshift Serverless and dbt

To address these challenges, DeNA decided to adopt Redshift Serverless and dbt (an open source data transformation tool) for the following key reasons:

  • Scalable and cost-effective processing with Redshift Serverless
  • Standardized and maintainable data quality tests with dbt

This decision was made after careful comparison of alternative solutions. DeNA initially considered parallelizing the existing Python-based batch jobs but rejected this approach due to the high maintenance overhead and siloed knowledge associated with the batch jobs. Instead, DeNA decided to use dbt, which DeNA has been using in their healthcare & medical business, and connect it to an AWS service capable of large-scale distributed processing. dbt provides a SQL-first templating engine for repeatable and extensible data transformations, including a data tests feature, which allows verifying data models and tables against expected rules and conditions using SQL. By using dbt, DeNA could standardize the technical stack, implement data quality tests in maintainable SQL, and connect dbt to a managed service for scalable and cost-effective processing.

AWS offers several services that are compatible with dbt, including Amazon Redshift and AWS Glue. DeNA selected Redshift Serverless, primarily due to its serverless nature, optimal cost-performance, and the superior processing performance for structured data typical of a data warehouse service.

Solution overview

DeNA designed the following architecture using AWS serverless services.

The workflow consists of the following high-level steps and key design points:

  1. The source system stores the target data for the data quality tests in Amazon Simple Storage Service (Amazon S3). When new data files are added, Amazon EventBridge invokes an AWS Step Functions state machine (workflow). To make sure all files for target data are delivered, the source system stores a completion file in Amazon S3.
  2. dbt runs on Amazon Elastic Container Service (Amazon ECS) using AWS Fargate, an AWS serverless container service. DeNA selected Amazon ECS because it allows running dbt in a serverless, pay-per-use manner, and DeNA had prior experience developing and operating applications using Amazon ECS. To allow the containers to securely access Redshift Serverless, DeNA used the pass sensitive data to an ECS container feature to pass sensitive credentials that are stored in AWS Secrets Manager to the containers using an ECS task execution IAM role.
  3. DeNA segmented Redshift Serverless into separate workgroups for access control. Operation personnel may need to access the Redshift Serverless database using the Query Editor V2 to investigate issues with data quality tests, while maintaining strict access control. Redshift Serverless allows fine-grained access control to data by using database security features, similar to how the GRANT command is used in database products. However, in this workload, DeNA chose to use AWS Identity and Access Management (IAM) to control access to the workgroups at IAM level. This allowed DeNA to restrict access to specific Redshift Serverless workgroups based on users’ IAM roles, enabling unified management of authorization through IAM. Additionally, by separating the workgroups, DeNA could individually adjust Redshift Processing Units (RPUs) per workgroup, contributing to cost optimization.
  4. Amazon ECS sends execution logs of dbt running to Amazon CloudWatch Logs for observability. DeNA used metric filters to convert the logs into CloudWatch metrics, then created alarms based on these metrics. When triggered, these alarms invoke AWS Lambda functions using Amazon Simple Notification Service (Amazon SNS). The Lambda functions create result reports of dbt running and data quality tests and send them to an internal chat application. DeNA visualizes the results of data quality tests using the elementary CLI, a dbt-based data observability solution. This workflow enables even non-engineers to track data quality status effectively.

Outcomes

DeNA successfully addressed all the challenges they faced by designing the solution and migrating to a new platform:

  • Performance – Improved performance up to 100 times faster by reducing processing time from days or weeks to 1–2 hours. A certain data quality test that previously took 877 minutes now completes in 1 minute, thanks to the large-scale distributed processing capabilities of Redshift Serverless.
  • Cost – Reduced costs by 90% with AWS serverless services. Optimized expenses by incurring costs only for data quality tests.
  • Maintainability – Standardized the technical stack with dbt, eliminating siloed knowledge from custom programs. dbt’s data tests feature simplified the implementation of data quality tests. The elementary CLI improved the observability of data quality tests for non-engineers. AWS serverless services virtually eliminated the operational overhead for managing the workload infrastructure.

Conclusion

This post demonstrated how DeNA was able to securely and efficiently accelerate their data quality tests by combining Redshift Serverless and dbt. This combination is not only effective for DeNA’s use case but also applicable to various business use cases across different industries.

For more information on the combination of Redshift Serverless and dbt, refer to the following resources:


About the Author

Momota Sasaki is an Engineering Manager at DeSC Healthcare, a subsidiary of DeNA. He joined DeNA in 2021 and was seconded to DeSC Healthcare. Since then, he has been consistently involved in the healthcare business, leading and promoting the development and operation of the data platform.

Kaito Tawara is a Data Engineer at DeSC Healthcare, a subsidiary of DeNA, focusing on improving healthcare data platforms. After gaining experience in backend development for web systems and data science, he transitioned to data engineering. He joined DeNA in 2023 and was seconded to DeSC Healthcare. Currently, he works remotely from Nagoya-city, contributing to the enhancement of healthcare data platforms.

Shota Sato is an Analytics Specialist Solution Architect at AWS Japan, focusing on data analytics solutions powered by AWS for digital native business customers.

Enforce resource configuration to control access to new features with AWS

Post Syndicated from Yossi Cohen original https://aws.amazon.com/blogs/security/enforce-resource-configuration-to-control-access-to-new-features-with-aws/

Establishing and maintaining an effective security and governance posture has never been more important for enterprises. This post explains how you, as a security administrator, can use Amazon Web Services (AWS) to enforce resource configurations in a manner that is designed to be secure, scalable, and primarily focused on feature gating.

In this context, feature gating means that newly supported AWS features and configurations can’t be used unless you explicitly approve them. With feature gating, you maintain control over your AWS environment when new services and capabilities are introduced.

This blog post demonstrates a unique approach to giving users, such as DevOps teams, controlled flexibility within safe boundaries by allowing resource provisioning that uses only approved configurations. This approach also accommodates configurations that will be supported in future versions of the resource, keeping them restricted until explicitly approved, as shown in Figure 1.

Figure 1: Restrict resource provisioning to approved configurations only

Figure 1: Restrict resource provisioning to approved configurations only

Apply your resource configuration enforcement

As shown in Figure 2, our solution for resource configuration enforcement (RCFGE) uses AWS CloudFormation Hooks. By using Hooks, you can run custom logic during the provisioning of resources. These are proactive controls because you inspect and enforce resource configurations before the resource is created, updated, or deleted.

Your Hook will only be effective if CloudFormation supports the AWS resources that you are using and if you implement a service control policy (SCP) that helps prevent users from provisioning resources outside of CloudFormation.

Figure 2: How CloudFormation Hooks work

Figure 2: How CloudFormation Hooks work

The flow shown in Figure 2 consists of the following five steps:

  1. DevSecOps registers and configures a CloudFormation Hook in the account.
  2. DevOps specifies a CloudFormation template that defines the required resources and configurations.
  3. CloudFormation creates a new stack resource, starting the provisioning process based on the template.
  4. The Hook is triggered before provisioning for each resource that’s defined in the template, and runs custom validation logic.
  5. If the validation checks pass, CloudFormation proceeds with provisioning; if not, the process is terminated.

Make your solution scalable

To achieve scalable operations, you should implement a reusable and generic Hook that targets all supported CloudFormation resource types. This Hook enforces resource configuration by loading resource specification files from an external object storage, such as an Amazon Simple Storage Service (Amazon S3) bucket.

These specification files define validation rules in a declarative language. Using this approach, you can add and remove resource configuration validation rules by editing the declarative files. When you externalize custom logic as decoupled validation rules from the Hook, DevSecOps personnel can manage these rules at scale without affecting your infrastructure.

Figure 3: Externalize custom logic as validation rule files in an S3 bucket

Figure 3: Externalize custom logic as validation rule files in an S3 bucket

Figure 3 shows how the solution has been revised to support this approach. Steps 1–3 are the same as in the flow shown in Figure 2:

  1. DevSecOps registers and configures a CloudFormation Hook in the account.
  2. DevOps specifies a CloudFormation template that defines the required resources and configurations.
  3. CloudFormation creates a new stack resource, starting the provisioning process based on the template.
  4. The Hook is triggered before provisioning for each resource that’s defined in the template.
  5. The Hook loads the relevant resource specification file from the S3 bucket and executes the validation rules against the current resource in the CloudFormation template.
  6. If the validation checks pass, CloudFormation proceeds with provisioning; if not, the process is terminated.

You need to configure the Hook schema and the Hook configuration schema to evaluate the configurations of all supported resources across your AWS accounts before changes are provisioned. This setup should cover create, update, and delete operations so that the Hook can help prevent non-approved configurations across stacks.

By using AWS CloudFormation Guard, you can externalize validation rules from the Hook, as described in Extend your pre-commit hooks with AWS CloudFormation Guard. Guard is an open source, general purpose, policy-as-code (PaC) evaluation tool that validates CloudFormation templates against custom rules to help you stay aligned with your organizational policies. For example, the CT.S3.PR.1 rule specification demonstrates a Guard rule that requires an S3 bucket to have its settings configured to block public access. These validation rules apply to currently supported AWS resource configurations and features, but they don’t restrict potential future properties.

Boost your solution with feature gating

Your risk model might lead you to look for mechanisms that further restrict the AWS resource configurations that you allow in your environments. As you will see, the proposed solution restricts authorized workforce users so that they can use new configurations only if you enable them. The proposed approach uses feature gating because it continues to enforce your configurations even when AWS adds new options for your resources.

Guard aims to validate required constraints; but to meet the feature gating objective, you should implement validation rules that check whether resource configurations fulfill structural constraints described by the restricted version of CloudFormation resource schemas. These schemas help you confine the possible resource configurations that can be provisioned in your environment no matter what new configurations AWS introduces.

Figure 4: Enforce resource configuration with restricted resource schema templates

Figure 4: Enforce resource configuration with restricted resource schema templates

Figure 4 shows an updated version of the same flow where validation rules are implemented by using restricted resource schema templates, which are stored in an S3 bucket. These templates are based on the original CloudFormation resource schemas, representing a snapshot of these schemas at a specific point in time. Steps 1–4 are the same as in the flow shown in Figure 3:

  1. DevSecOps registers and configures a CloudFormation Hook in the account.
  2. DevOps specifies a CloudFormation template that defines the required resources and configurations.
  3. CloudFormation creates a new stack resource, starting the provisioning process based on the template.
  4. The Hook is triggered before provisioning for each resource that’s defined in the template.
  5. The Hook loads the relevant restricted resource schema template file from the S3 bucket and uses it to execute schema validation against the current resource in the CloudFormation template.
  6. If the validation checks pass, CloudFormation proceeds with provisioning; if not, the process is terminated.

A restricted resource schema template is a subset of its corresponding original CloudFormation resource schema. It includes additional constraints that limit certain properties to specific values and patterns or exclude certain properties entirely. Furthermore, these templates contain placeholders that you fill in with runtime values, such as the account ID, which your Hook provides as part of the Hook context.

Figure 5: Resource configuration enforcement (RCFGE) CloudFormation Hook flow

Figure 5: Resource configuration enforcement (RCFGE) CloudFormation Hook flow

As shown in Figure 5, the flow within the RCFGE CloudFormation Hook involves the following steps:

  1. The CloudFormation Hook is invoked with the Hook context and the resource’s configuration JSON object.
  2. The Hook loads the restricted resource schema template from the S3 bucket and substitutes placeholders with the Hook context runtime values, producing a valid JSON schema.
  3. The Hook validates the stack’s resource configuration JSON object against the schema. If it returns OperationStatus.SUCCESS, then CloudFormation proceeds with the provisioning process. If it returns OperationStatus.FAILED, then CloudFormation terminates the provisioning process.

If a restricted resource schema template for a CloudFormation resource type isn’t found in the S3 bucket, the schema validation step fails by default.

Sample excerpt of a restricted schema template for an S3 bucket resource

The following is an excerpt from a restricted schema template for an S3 bucket. At runtime, your Hook processes this template, substituting the placeholders with relevant values from the Hook context. In this example, the Hook replaces the <accountID> placeholder in the topic’s pattern with the actual account ID. The resulting JSON schema disallows additional properties beyond those defined by the schema and restricts the Amazon Simple Notification Service (Amazon SNS) topics that can be used for event notifications.

Note: In the code samples that follow, we’ve omitted some code for brevity—we’ve indicated these omissions with three periods: ...

{
  "type": "object",
  "required": [],
  "additionalProperties": false,
  "properties": {
        ...
      "NotificationConfiguration": {
          "$ref": "#/definitions/NotificationConfiguration"
      },
        ...
  },
  "definitions": {
        ...
      "NotificationConfiguration": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
            ...
              "TopicConfigurations": {
                  "type": "array",
                  "uniqueItems": true,
                  "items": {
                      "$ref": "#/definitions/TopicConfiguration"
                  }
              }
          }
      },
        ...
      "TopicConfiguration": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
        ...
              "Topic": {
                  "type": "string",
                  "pattern": "^arn:aws:sns::$<accountID>:.*$"
              },
        ...
            }
      },
  }
}

CloudFormation template for an S3 bucket that adheres to the restricted schema

Let’s assume that your account ID is 111122223333. The account ID is propagated to the Hook through the Hook context.

The following is an excerpt from a CloudFormation template that aligns with the restricted schema for an S3 bucket instantiated from the template shown previously. As a result, your Hook allows the corresponding CloudFormation stack to proceed.

{
   "AWSTemplateFormatVersion": "2010-09-09",
   "Resources": {
     "S3Bucket": {
       "Type": "AWS::S3::Bucket",
       "Properties": {
         "BucketName":
            "valid-bucket-sns-notification-configuration-template",
         "NotificationConfiguration": {
           "TopicConfigurations": [
             {
              "Topic":
                "arn:aws:sns:eu-west-1:111122223333:this-is-my-topic-and-I-trust-it",
              "Event": "s3:ObjectCreated:*"
             }
           ]
         }
       }
    }
  }
}

CloudFormation template for an S3 bucket that diverges from the restricted schema (example 1)

The following is an excerpt from a CloudFormation template that doesn’t align with the restricted schema for an S3 bucket instantiated from the template shown previously because it attempts to configure the Amazon SNS topic for the notification configuration, which uses an Amazon Resource Name (ARN) of another account. As a result, your Hook causes the corresponding CloudFormation stack to fail.

{
   "AWSTemplateFormatVersion": "2010-09-09",
   "Resources": {
     "S3Bucket": {
       "Type": "AWS::S3::Bucket",
       "Properties": {
         "BucketName":
           "invalid-bucket-sns-notification-configuration-template",
         "NotificationConfiguration": {
            "TopicConfigurations": [
              {
               "Topic":
                 "arn:aws:sns:eu-west-1:444455556666:this-is-not-my-topic",
               "Event": "s3:ObjectCreated:*"
              }
            ]
         }
       }
     }
   }
}

CloudFormation template for an S3 bucket that diverges from the restricted schema (example 2)

The following is an excerpt from a CloudFormation template that doesn’t align with the restricted schema for an S3 bucket instantiated from the template shown previously. This time, it violates your feature gating objective by attempting to use a new, imaginary feature of an S3 bucket that isn’t approved for use by your restricted schema for an S3 bucket. As a result, your Hook causes the corresponding CloudFormation stack to fail.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "S3Bucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": {
        "BucketName":
           "valid-bucket-sns-notification-configuration-template",
        "NewFeature": {
           "property-1": true,
           "property-2": "public"
        },                
        "NotificationConfiguration": {
          "TopicConfigurations": [
            {
              "Topic":
                 "arn:aws:sns:eu-west-1:111122223333:this-is-my-topic-and-I-trust-it",
              "Event": "s3:ObjectCreated:*"
            }
          ]
        }
      }
    }
  }
}

Protect your controls

If a security control itself isn’t protected adequately, it becomes a weak link in the security chain. For example, a surveillance camera (a physical security control) that isn’t securely mounted can be removed, rendering it useless. This principle also applies to your RCFGE solution.

Next, we will show you how to isolate management activities to a dedicated account and use SCPs as preventative controls.

Isolate RCFGE management in a dedicated account

Organizing your AWS environment by using multiple accounts is a best practice because it enhances security, simplifies management, and allows for better resource isolation and cost tracking. Isolating the operation and management of your RCFGE solution in its own dedicated account is essential for securing the solution’s resources.

With AWS CloudFormation StackSets, you can deploy and manage RCFGE stacks across multiple accounts and AWS Regions from a single central administrator account. This provides consistent and scalable infrastructure while maintaining centralized governance. With this functionality, you can deploy the RCFGE resources to existing accounts and automatically include new accounts as you add them to your organization, simplifying RCFGE management and providing uniformity across your environments. For more information, see Deploy CloudFormation Hooks to an Organization with service-managed StackSets.

Figure 6 shows how to extend that idea so that you can operate the RCFGE solution at scale while maintaining isolation and the separation of duties. The solution operates across three key account types:

  • Management account –use this account to create your organization and designate the CloudFormation StackSets delegated administrator account.
  • Delegated administrator account – this account serves as the centralized management point for the RCFGE solution. It contains a continuous integration and continuous delivery (CI/CD) pipeline that provisions RCFGE resources across the organization by using CloudFormation StackSets with service managed permissions. The account hosts a centralized S3 bucket that stores the RCFGE restricted resource schema templates. The security engineering team uses this account to submit Hook code and restricted resource schema template changes, which trigger the CI/CD pipeline.
  • Member accounts – each member account contains an RCFGE StackSet instance and an AWS Identity and Access Management (IAM) role for provisioning RCFGE resources. It also includes a CloudFormation Hook and an IAM role that allows the Hook to access the centralized S3 bucket with RCFGE restricted resource schema templates.

Figure 6: Securely operate the RCFGE solution

Figure 6: Securely operate the RCFGE solution

Let’s explore how the RCFGE solution architecture enforces resource configuration step by step, as shown in Figure 7.

Figure 7: CloudFormation stack deployment flow with RCFGE validation and enforcement

Figure 7: CloudFormation stack deployment flow with RCFGE validation and enforcement

  1. DevOps initiates the deployment by specifying a CloudFormation template that defines the resources and configurations needed.
  2. CloudFormation creates a new stack resource, initiating the resource provisioning process based on the provided template.
  3. The RCFGE CloudFormation Hook is triggered for each resource defined in the CloudFormation template.
  4. The Hook loads the corresponding restricted resource schema template from the S3 bucket.
  5. The Hook validates a resource configuration:
    • The Hook processes the restricted resource schema template to create a JSON schema.
    • It uses this JSON schema to validate the current resource in the CloudFormation template.
    • If the resource is invalid according to the schema, the provisioning process is terminated.
  6. If the current resource passes validation, CloudFormation proceeds with the resource provisioning process by creating and configuring the resources as specified in the template.

Use SCPs as preventive controls for your organization to help protect RCFGE

The following SCP excerpt accomplishes three objectives:

  • Implements a statement (see AllowedListActions) to explicitly specify the access that is allowed while other access is implicitly blocked.
  • Implements control objectives to help prevent changes to resources set up by the RCFGE solution (see ProtectRCFGEResources and ProtectStackSetExecutionRole).
  • Makes sure that AWS resource provisioning does not occur outside of CloudFormation (see ProvisionResourcesViaCloudFormationOnly).

In this SCP excerpt, the ProvisionResourcesViaCloudFormationOnly statement restricts CloudFormation stacks to being managed only through forward access sessions (FAS) in AWS IAM.

The ProvisionResourcesViaCloudFormationOnly statement explicitly prohibits direct create, update, and delete actions for all supported resources used in your environment. If needed, split this statement into multiple parts so you don’t exceed SCP size limits, while providing comprehensive coverage of your resources to make sure that they are provisioned and managed only through CloudFormation.

The ProtectStackSetExecutionRole statement in this example assumes that CloudFormation trusted access is activated with AWS Organizations, which is required by StackSets to deploy across accounts and Regions by using service managed permissions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowedListActions",
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:DeleteBucket",
        "s3:DeleteBucketPolicy",
        "s3:PutAnalyticsConfiguration",
        "s3:PutBucketLogging",
        "s3:PutBucketNotification",
        "s3:PutBucketObjectLockConfiguration",
        "s3:PutBucketPolicy",
        "s3:PutBucketTagging",
        "s3:PutBucketVersioning",
        "s3:PutLifecycleConfiguration",
        "s3:PutMetricsConfiguration",
        "s3:PutReplicationConfiguration",
        "s3:GetObject",
        ...
      ],
      "Resource": "*"
    },
    {
      "Sid": "ProtectRCFGEResources",
      "Effect": "Deny",
      "Action": "*",
      "Resource": [
        "arn:aws:cloudformation:*:*:stack/RCFGEStackSet",
        "arn:aws:cloudformation:*:*:*/hook/RCFGEHook/*",
        "arn:aws:iam::*:role/RCFGEHookExecutionRole"
      ],
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::*:role/stacksets-exec-*"
          ]
        }
      }
    },
    {
      "Sid": "ProtectStackSetExecutionRole",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "arn:aws:iam::*:role/stacksets-exec-*"
    },
    {
      "Sid": "ProvisionResourcesViaCloudFormationOnly",
      "Effect": "Deny",
      "Action": [
        "s3:CreateBucket",
        "s3:DeleteBucket",
        "s3:DeleteBucketPolicy",
        "s3:PutAnalyticsConfiguration",
        "s3:PutBucketLogging",
        "s3:PutBucketNotification",
        "s3:PutBucketObjectLockConfiguration",
        "s3:PutBucketPolicy",
        "s3:PutBucketTagging",
        "s3:PutBucketVersioning",
        "s3:PutLifecycleConfiguration",
        "s3:PutMetricsConfiguration",
        "s3:PutReplicationConfiguration",
        ...
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:CalledViaFirst": "cloudformation.amazonaws.com"
        }
      }
    }
  ]
}

To allow the Hook to retrieve the necessary restricted resource schema templates, member accounts must be able to access the S3 bucket that contains the RCFGE templates. The following code sample shows the bucket policy for the S3 bucket that contains the RCFGE templates.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRCFGEHookExecutionRoleGetRCFGETemplates",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Principal": "*",
      "Resource": "arn:aws:s3:::RCFGETemplates/*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalOrgID": "o-abcdef0123"
        },
        "ArnLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/RCFGEHookExecutionRole"
        }
      }
    }
  ]
}

As shown in the following code sample, the RCFGEHookExecutionRole IAM role in member accounts has a policy that grants read-only access to the RCFGE templates that are stored in an S3 bucket in the RCFGE delegated administrator account, where 555555555555 represents the account ID.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRCFGEHookExecutionRoleGetRCFGETemplates",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::RCFGETemplates/*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceAccount": "555555555555"
        }
      }
    }
  ]
}

In the following code sample, the RCFGEHookExecutionRole IAM role in member accounts has a trust policy that allows it to be assumed only by the relevant CloudFormation service principals, where 444455556666 represents the account ID of the member account.

{
  "Version": "2012-10-17",
  "Statement": {
    "Sid": "AllowRCFGEHookExecutionRoleGetRCFGETemplatesTrust",
    "Effect": "Allow",
    "Principal": {
      "Service": [
        "hooks.cloudformation.amazonaws.com",
        "resources.cloudformation.amazonaws.com"
      ]
    },
    "Action": "sts:AssumeRole",
    "Condition": {
      "ArnLike": {
        "aws:SourceArn": "arn:aws:cloudformation:eu-west-1:444455556666:type/hook/RCFGEHook/*"
      }
    }
  }
}

Define baseline configuration for RCFGE and continuous monitoring with AWS Config

Defense in depth is an effective strategy because if one line of defense fails, additional layers are in place to help stop threats at subsequent points. With AWS Config, you can capture the configuration of RCFGE resources over time. You can set up AWS Config custom rules to automatically assess the compliance of your RCFGE resources against predefined policies. For example, you can use an AWS Config custom rule to make sure that the RCFGE Hook hasn’t been altered or removed.

Conclusion

In this post, you learned how to use CloudFormation Hooks to create a resource configuration enforcement (RCFGE) solution on AWS that is designed to be secure and scalable and that supports feature gating. Using this approach, you, as a security administrator, can maintain strict control over resource configurations and feature adoption across your AWS environments. The solution provides a balanced approach to governance, so that DevOps teams have the flexibility to work within approved boundaries while making sure that new AWS features are only accessible after explicit approval.

If you have feedback about this post, submit comments in the Comments section. For questions, start a new thread on the CloudFormation re:Post or contact AWS Support.
 

Yossi Cohen
Yossi Cohen

Yossi is a Senior Security Specialist Solutions Architect at AWS for the public sector in the EMEA region. Yossi has over two decades of experience in cloud-native architecture development, design, operations, technical due diligence, and governance in highly regulated environments. At AWS, Yossi collaborates closely with defense, intelligence, government, and public sector clients, helping them navigate their unique threat landscapes.
Yaniv Rozenboim
Yaniv Rozenboim

Yaniv is a Senior Solutions Architect at AWS with extensive experience in cloud architecture and security. He specializes in designing and implementing secure, scalable, and efficient cloud infrastructures. Yaniv works closely with clients to help them achieve their business goals through AWS technologies.

AWS KMS: How many keys do I need?

Post Syndicated from Ishva Kanani original https://aws.amazon.com/blogs/security/aws-kms-how-many-keys-do-i-need/

As organizations continue their cloud journeys, effective data security in the cloud is a top priority. Whether it’s protecting customer information, intellectual property, or compliance-mandated data, encryption serves as a fundamental security control. This is where AWS Key Management Service (AWS KMS) steps in, offering a robust foundation for encryption key management on AWS.

One of the first questions that often arises for customers is, “How many keys do I actually need?” This seemingly simple question requires careful consideration of various factors. Although AWS KMS makes encryption straightforward, organizations need to consider several aspects of their key management strategy. These include choosing between AWS managed keys, customer managed keys, and importing your own keys (BYOK), as well as deciding between centralized and decentralized key management approaches. Each option has its own benefits and trade-offs in terms of security, control, and operational overhead. By understanding these choices and how they align with your organization’s needs, you can develop an effective and efficient key management strategy.

In this blog post, we explore the main considerations that drive your AWS KMS key strategy, from organizational structure to compliance requirements. Should you maintain a centralized key management approach with a single team controlling all keys, or adopt a decentralized model where individual teams manage their own keys? These decisions are important because they relate to the AWS shared responsibility model, where AWS maintains the security of the cloud, while customers remain responsible for security in the cloud, including the proper management and use of encryption keys.

Overview – What is AWS Key Management Service?

AWS Key Management Service (AWS KMS) is an AWS managed service that makes it convenient for you to create and control the encryption keys that are used to encrypt your data. The keys that you create in AWS KMS are protected by FIPS 140 Level 3 validated hardware security modules (HSM). The keys never leave AWS KMS unencrypted. To use or manage your KMS keys, you interact with AWS KMS.

Customers are responsible for deciding what data to encrypt, choosing the appropriate encryption keys, and implementing encryption across AWS services with the help of the key policy. Customers are responsible for monitoring and auditing the use of encryption keys through services such as AWS CloudTrail.

A critical aspect of customer responsibility is determining how to manage the keys and how many KMS keys are needed. This decision depends on various factors such as data classification, application architecture, regulatory requirements, and operational needs. We look at these areas in more detail in the next sections.

Guiding principles for key strategy

Following are four guiding engineering principles that, based on our experience, help create a secure and easier-to-maintain system. They will assist you in determining the approximate number of KMS keys for your organization based on your management requirements.

Principle 1 – Data Classification: If a system processes data of different classification levels, employ separate data resources and separate KMS keys to separately govern and audit access to the data. With similarly classified data or a single type of data, the usage of just one KMS key may be justified.

Why it matters: This principle helps to ensure that data that is classified into different sensitivity levels is protected appropriately based on access to encryption keys for that same classification, reducing the risk of unauthorized access and simplifying governance.

Principle 2 – Applications: Multiple applications can run in one AWS account. We recommend that you use distinct KMS keys for each application, because managing access to an individual key can become a complex task when it is delegated to two or more application administrators. Use separate KMS keys for applications running in distinct AWS accounts to further make use of the account boundary, limiting the potential impact in case of a security incident. Use separate keys for distinct application stages (such as development, staging, or production).

Why it matters: This approach isolates access to applications and application access to data. This reduces the potential impact of unintended access to a key.

Principle 3 – AWS Services: When you consider key management across multiple AWS services, focus on both the services and the nature of the data. If you are dealing with one type of data (for example, customer information) that flows through multiple AWS services as part of one application or workflow, consider using a single KMS key. This simplifies key management while maintaining consistent access control. For instance, a customer record that is stored in Amazon Simple Storage Service (Amazon S3), processed by AWS Lambda, and then stored in Amazon DynamoDB could use the same KMS key across these services as mentioned in Principle 1.

However, if you are handling different types of data (such as financial records and user preferences) across various AWS services, even within the same application, consider using separate KMS keys on a per-service basis. This allows for more granular access control and adheres to the principle of least privilege. For example, in an e-commerce application, you might use one KMS key for encrypting payment information in Amazon Relational Database Service (Amazon RDS) and a different key for encrypting user browsing history in Amazon Redshift.

The decision to use one key or multiple keys should be based on your data classification policies and access control requirements. With this approach, you can keep your key management strategy aligned with your data governance policies, regardless of which AWS services you are using.

Why it matters: This principle balances the need for simplicity with the requirement for granular control over data access across different AWS services.

Principle 4 – Separation of Duties: Key policies define who can administer and who can use the key. In the case of distinct encryption use cases and distinct administrators, we recommend that you create separate KMS keys. Another aspect of separation of duties is that, with KMS key policies, two different principals can be made responsible for governing data and data decryption access. However, this does not influence the count of keys.

Why it matters: This principle supports the implementation of least privilege access and helps maintain clear accountability in key management.

By applying these principles, you can develop a key management strategy that describes how many KMS keys you may need, and that balances security, compliance, and operational efficiency. In the following sections, we explore how to apply these principles in various scenarios.

Examples of key management strategy and comparison of centralized and decentralized approaches

In addition to the guiding principles discussed earlier, the structure of your organization and its specific needs play a crucial role in determining the most suitable approach to key management. When implementing key management strategies, organizations generally choose from three main approaches: centralized, decentralized, or a hybrid model. The choice depends on the organization’s structure, needs, and operational context. Each approach offers distinct advantages for specific organizational scenarios.

A decentralized approach is our recommended approach, as most customers fit into the following scenarios:

  • Organizations with autonomous business units or where governance controls provide oversight of key usage
  • Companies where development teams are agile and ownership of keys can be centrally audited
  • Companies that operate in multiple regulatory frameworks
  • Companies that require to operate in a particular AWS Region

A centralized KMS approach is best suited for the following scenarios:

  • Organizations that require strict compliance oversight and centralized management
  • Companies with centralized security or data protection functions

In a hybrid model, there is a blend between centralized and decentralized:

  • Core key policies are managed centrally
  • Day-to-day key operations are handled by teams

    For example, organizations or companies could have independent product teams, but a centralized security team.

Example 1 (Hybrid): A retail website with public product catalog data and confidential customer data should use two KMS keys—one for the public catalog that is encrypted in Amazon S3, and one for customer data that is encrypted in Amazon RDS and other AWS services.

Rationale: This recommendation is based primarily on Principle 1 (Data Classification). The public catalog data and confidential customer data represent different classification levels, justifying the use of separate keys. This approach is further supported by Principle 3 (AWS Services), because the data resides in different AWS services and is of a varied nature.

The benefits of this approach:

  • Implement appropriate access controls for each data type
  • Manage encryption independently for each data classification
  • Enhance overall data security and compliance

Example 2 (Decentralized): A healthcare company with several application teams could use a separate KMS key for each application team, with distinct key policies for each key based on the data and roles of each team.

Rationale: This recommendation is primarily based on Principle 2 (Applications). With multiple application teams operating within the healthcare company, each potentially dealing with distinct types of data and having different access requirements, separate KMS keys provide for independent management of encryption and access for each team. This approach is further supported by Principle 4 (Separation of Duties), allowing for team-specific key policies.

The benefits of this approach:

  • Maintain granular control over data access.
  • Implement team-specific encryption policies.
  • Uphold the principle of least privilege across the organization.
  • Enhance data security: By using separate keys, the company limits the impact of improper access to any given key, enables more precise access control, facilitates independent key rotation schedules, and improves the ability to monitor and audit key usage for each application.
  • Simplify alignment with healthcare regulations: Separate keys support data segregation requirements, enable fine-grained role-based access control, provide clear audit trails for each application’s data access, and allow for tailored data lifecycle management. This functionality is crucial for aligning with various healthcare compliance standards such as HIPAA.
  • Allow for efficient and distributed key management that is tailored to each application team’s needs.

These examples demonstrate how applying the guiding principles can lead to a well-structured key management strategy, tailored to the specific needs of different organizations and use cases.

Considerations for key management

When you implement your key management strategy, several factors need to be considered beyond just the number of keys. This section explores these considerations to help you make informed decisions about your key management approach.

Key types

AWS offers different types of KMS keys, each with its own benefits and use cases.

AWS owned keys are managed by AWS in service accounts, used across multiple customer accounts, and provide no customer visibility or audit capability. Choose AWS owned keys when there are no management or audit requirements for the keys, but encryption of the data at rest is needed.

AWS managed keys are managed entirely by AWS and are used only for your AWS account. Although customers can view these keys in the AWS Management Console and track their usage in AWS CloudTrail logs, they have limited ability to directly control or modify these keys. Choose AWS managed keys when managing keys is not a requirement, but having an audit trail is. It’s worth noting that AWS managed keys are automatically rotated every year, which can be convenient for many use cases.

Customer-managed keys offer the highest level of control and customization, allowing creation of key policies and control over key rotation. However, customer managed keys provide more flexibility, allowing you to set your own rotation schedule or even enable rotation if you are required to do so for regulatory reasons. Choose customer managed keys when you need strict control over key usage and the ability to share keys or control access through key policies, detailed auditing capabilities, alignment with specific compliance requirements, or the ability to integrate key management with your existing processes and tools.

The decision between AWS managed and customer managed keys often comes down to balancing the convenience of automatic management with the need for granular control and customization. As the number of keys increases, so does the complexity of management. More keys mean more policies to create, manage, and audit. Making sure that the right people have access to the right keys becomes more challenging. However, to help audit KMS key access, you can use the IAM Access Analyzer to determine external access to your keys. Managing rotation schedules for multiple keys requires more effort, and more keys mean more policies to analyze and monitor, as well as growing costs.

Cost

Security should be the primary concern, but cost is also a factor. Each customer managed key incurs a monthly storage cost. Both AWS managed and customer managed KMS keys have API usage costs associated with them. Key rotation can increase costs over time, as old key versions are retained.

Manageability

Finding the right balance between security and manageability is crucial. Too few keys might not provide adequate separation of duties or granular access control, while too many keys can lead to increased complexity, higher costs, and potential mismanagement.

Specific requirements

Different industries and regions may have specific requirements for key management. Some regulations might require separation of duties, necessitating multiple keys. Certain compliance standards might dictate specific key rotation or audit trail requirements.

By carefully considering these factors alongside the guiding principles discussed earlier, you can develop a key management strategy that balances security, compliance, cost-effectiveness, and operational efficiency for your specific needs. It is important to approach your KMS strategy holistically, considering not just your immediate security needs, but also the long-term management implications. Regular review and adjustment of your key management strategy will provide assurance that it continues to meet your evolving needs while maintaining robust security and compliance.

Conclusion

As we explored throughout this post, determining the optimal number of AWS KMS keys for your organization is a nuanced decision that balances security, compliance, cost, and operational efficiency. The guiding principles we discussed—data classification, application segregation, AWS service integration, and separation of duties—provide a solid framework for making these decisions. Remember that there’s no one-size-fits-all solution; the right approach depends on your specific needs and circumstances.

As you move forward in implementing or refining your KMS key strategy, consider these next steps: First, conduct a thorough audit of your current data assets, their classifications, and the applications and services that interact with them. Next, map out your ideal key management structure based on the principles we’ve discussed. Then, evaluate the costs and operational overhead of your proposed strategy, adjusting as necessary to find the right balance for your organization. Finally, implement your strategy incrementally, starting with your most sensitive or critical data assets.

Remember that key management is an ongoing process. Regularly review and update your strategy as your data landscape evolves, new compliance requirements emerge, or AWS introduces new features. By thoughtfully applying the principles and considerations we’ve discussed, you can create a robust, scalable, and efficient key management strategy that helps your overall security posture and meets your organization’s unique needs.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Ishva Kanani
Ishva Kanani

Ishva is a Security Consultant at AWS. She assists customers with secure cloud migrations and accelerates their cloud journeys by delivering innovative solutions. Passionate about cybersecurity, Ishva provides strategic guidance and best practices for cloud environments. When not safeguarding digital assets, she enjoys exploring local hiking trails and trying new recipes in her kitchen.
Hardik Thakkar
Hardik Thakkar

Hardik is a Prototyping Solutions Architect at AWS Global Financial Services (GFS). He specializes in secure architecture design and foundations on AWS, leveraging his security expertise to serve financial services customers. His focus areas include security-first design patterns, financial services compliance frameworks, and helping institutions build robust cloud infrastructures on AWS.

Accelerate Amazon Redshift secure data use with Satori – Part 2

Post Syndicated from Rohit Vashishtha original https://aws.amazon.com/blogs/big-data/accelerate-amazon-redshift-secure-data-use-with-satori-part-2/

This post is co-written by Adam Gaulding, Solution Architect at Satori.

In this post, we continue from Accelerate Amazon Redshift secure data use with Satori – Part 1, and explain how Satori, an Amazon Redshift Ready partner, simplifies both the user experience of gaining access to data and the admin practice of granting and revoking access to data in Amazon Redshift. Satori enables both just-in-time and self-service access to data.

Solution overview

Satori creates a transparent layer providing visibility and control capabilities that is deployed in front of your existing Redshift data warehouse. When adding a new data store to Satori, a new, Satori-provided URL is generated for the data store, which data consumers use instead of connecting directly.

The following diagram illustrates the solution architecture.

Data consumers don’t have to change how they work with data, such as installing different database drivers, changing their queries, or compromising on features or functionality. Satori is not a data virtualization or database federation solution that abstracts your existing data stores.

Self-service access to data is fully automated. The admin is responsible for setting up the access rules. User access privileges can be preconfigured for automated dataset access. The user can see the datasets that are available to them in their personalized data portal. The user then selects the dataset they want to use and Satori automatically applies the appropriate security, privacy, and compliance requirements.

Just-in-time access to data is also flexible but requires approval from an admin. From the user’s personalized data portal, they can see the available datasets—the only datasets they have self-service access to are already included in their My Data folder. If they see a dataset that they need but don’t have access to, they can request access to this data on-demand. The request is sent to the admin and, based on the user’s credentials, the admin can choose to approve or deny access.

The ability to facilitate and automate access to data provides the following benefits:

  • Satori improves the user experience by providing quick access to data. This increases the time-to-value of data and drives innovative decision-making.
  • Admins benefit from automating the process, significantly reducing the amount of time spent on granting and revoking access to data.

Prerequisites

Follow the steps outlined in Accelerate Amazon Redshift secure data use with Satori – Part 1 to complete the following prerequisite steps:

  1. Prepare the data.
  2. Connect to Amazon Redshift.
  3. Create a dataset and give Satori control over access to the dataset.
  4. Optionally, create security policies and revisit the concepts related to secure data access and masking policies.

After you complete the prerequisites, you’re ready to explore self-service and just-in-time access to data.

Self-service access

The following steps explain how to create self-service rules from admin and user perspectives.

Create access request and self-service rules (admin perspective)

After the admin gives Satori control over access to the dataset, they need to first preconfigure the user access rules. Complete the following steps:

  1. Navigate to the Datasets page and choose User Access Requests.
  2. In the Self-Service Access section, choose Self-Service Rule.

  1. Specify the required level of access.

The admin has several options when configuring the access rules. You can set the level of access by user or group, define when it expires, and set revocation rules.

The following screenshot shows the configuration rule for data access requests we created. In this example, the self-service user group has read-only access during the next 30 days that is set to revoke within 7 days if it’s not used.

The following figure shows an example configuration rule to add a user.

The newly created access rule and details are displayed in the list of self-service rules.

The next steps outline the data user view and steps to gain self-service access to data.

Create access request and self-service rules (user perspective)

As a user, complete the following steps:

  1. Enter the Satori personalized data portal using the Data Portal option on the options menu (three vertical dots).

The data portal will display all available datasets. Any datasets that the user already has self-service access to will appear under My Data, as shown in the following screenshot. All other datasets appear under Available Datasets.

  1. Choose the desired dataset (in this case, CustomerDataset) and request immediate access to this dataset by choosing Ask for Access to Dataset.

  1. For Access Request, choose Self Service.
  2. For Request Message, enter a reason for the request.
  3. Choose Request.

Based on the user’s identity, preconfigured access rules match the user to their respective qualifications and authorizations. In this case, the user is automatically granted access to CustomerDataset using the preconfigured self-service rules. The requested dataset appears with Status – Access Granted under My Data.

The preconfigured access rules are applied so that when this user runs their queries, certain sensitive data is redacted.

Now that access is granted, query the data using a SQL editor of your choice. In this post, we use DBeaver to connect to a Redshift cluster using the Satori hostname on the data stores tab.

When you query the data, you will see the security policies applied to the result set at runtime. In the following example, the customer table is displayed with redacted field values based on security policies.

In the following example, the credit_cards table is displayed with masking policies applied to the result values.

Just-in-time access

Just-in-time access is similar to self-service access; the only difference is that it includes an additional step of requesting access from the admin.

Create access request and self-service rules (user perspective)

The user enters the Satori personalized data portal with the same view as shown in the self-service access to data.

If the data that you need isn’t included under My Data but shows under Available Datasets, you can request access to this dataset. For this example, we consider a new user John Doe trying to access CustomerDataset from the available datasets. The process consists of the following steps:

  1. User John Doe logs in to the Satori portal and finds the Available Datasets section in their data portal.
  2. The user submits a request for CustomerDataset.

The request from user John Doe for CustomerDataset stays in Pending Approval status until approved from the admin.

  1. The admin receives the request from user John Doe through email and portal notifications for dataset requests.

The admin can approve or deny the request and might also designate the level of access and when that access expires.

The following screenshot shows an example email notification.

  1. The admin can choose View Request in the email and then approve or deny the request on the Satori portal.

  1. The admin can choose the pencil icon to edit the request before approval and modify the approval conditions.

In this example, the admin modifies a couple of criteria as shown and then approves the request.

Create access request rules (admin perspective)

Users can request access to datasets and the admin can approve or reject those requests, but the admin can also preconfigure the user access rules. Complete the following steps as the admin:

  1. On the Datasets page, choose User Access Requests.
  2. Fill out the access request rule.
  3. Choose Add.

The access request rule creation will be treated as an approval workflow when dataset requests are placed from the data portal.

Dataset requests from users will follow the course of action configured by the admin during access request rules creation. The preconfigured access rules specific to that user are applied so that when this user runs their queries, security policies and masking conditions are applied, and sensitive data is redacted or masked as applicable. The access control is maintained according to the admin settings for both just-in-time access and self-service access.

Clean up

To avoid unintended costs, clean up the resources provisioned as part of Accelerate Amazon Redshift secure data use with Satori – Part 1 or provisioned for this post. Make sure to delete the following resources:

  • Redshift cluster or serverless endpoint
  • Security group to allow inbound traffic from Satori
  • Configurations within your Satori account

Summary

In this post, we described how Satori can help automate secure data access for both data users and admins. The ability to automate this process increases the time-to-value of data for users and reduces the time and resources admins need to allocate for granting and revoking data access.

Satori is available on the AWS Marketplace. To learn more, start a free trial or request a demo meeting.

Amazon Redshift provides comprehensive security and governance features to protect your data, and continues to expand its out-of-the-box capabilities. For the latest features and updates, explore Amazon Redshift What’s New.


About the Authors

Rohit Vashishtha is a Senior Analytics Specialist Solutions Architect at AWS based in Dallas, Texas. He has over 17 years of experience architecting, building, leading, and maintaining big data platforms. Rohit helps customers modernize their analytic workloads using the breadth of AWS services and ensures that customers get the best price/performance with utmost security and data governance.

Jagadish Kumar (Jag) is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

 Adam Gaulding is a Solution Architect at Satori. At Satori, Adam is helping customers implement data security controls on databases, data lakes and data warehouses. Adam has been in and around the data space throughout his 20+ year career. He’s worked with companies large and small and prides himself in building creative solutions for technical problems.

Efficient satellite imagery supply with AWS Serverless at BASF Digital Farming GmbH

Post Syndicated from Kevin S. Ridolfi original https://aws.amazon.com/blogs/architecture/efficient-satellite-imagery-supply-with-aws-serverless-at-basf-digital-farming-gmbh/

This post was co-written with Dr. Jan Melchior at BASF Digital Farming GmbH and xarvio Digital Farming Solutions.

BASF Digital Farming’s mission is to support farmers worldwide with cutting-edge digital agronomic decision advice by using its main crop optimization platform, xarvio FIELD MANAGER. This necessitates providing the most recent satellite imagery available as quickly as possible. This blog post describes the serverless architecture developed by BASF Digital Farming for efficiently downloading and supplying satellite imagery from various providers to support its xarvio platform.

Screenshot showing the xarvio Field Manager platform

Figure 1. Screenshot showing the xarvio Field Manager platform

Architecture

Figure 2 shows the serverless architecture implemented with AWS services for downloading and processing satellite imagery. The subscription management components handle subscription creation, updates, and deletions, while the actual data downloading and processing occurs in AWS Step Functions.

Serverless implementation of the new imagery service

Figure 2. Serverless implementation of the new imagery service

  1. Subscriptions are created using Amazon API Gateway for external API access, which provides request throttling and can be used to manage API request authorizations.
  2. An AWS Lambda API function manages subscriptions. It implements common create, read, update, and delete operations with request validations and provides an endpoint for replaying failed requests. Subscriptions contain geometry, data provider, as well as start and end date and other parameters, which are stored in the subscription database (Step 7) before a message is sent out for processing.
    Notice that the entire architecture is serverless and thus allows for theoretically unbounded scaling. In case of a bug, this can lead to severe cost impacts, so we implemented a safety buffer, which enables us to prioritize and limit the number of Step Functions executions of the processing pipeline.
  3. All requests (such as the initial request for imagery when a subscription is created) are sent to the Amazon Simple Queue Service (Amazon SQS) processing queue first, which functions as a processing buffer and allows for request prioritization.
  4. Subsequently, Amazon EventBridge Pipes connects the processing buffer with AWS Step Functions. It handles pipe-internal errors automatically; for example, when the Step Functions concurrency limit is reached, the invocation will be retired automatically. This does not handle exceptions raised within Step Functions, such as runtime errors.
  5. AWS Step Functions then performs the actual downloading, processing, and ingestion to the STAC catalog of satellite data from different providers. In case of failure, the request message with error description is sent to the failure queue.
  6. Step Functions uploads the data to Amazon Simple Storage Service (Amazon S3), which stores satellite imagery data.
  7. Following this, Step Functions updates the subscriptions in the Amazon DynamoDB-based subscription database, which stores relevant metadata, such as start and end date, boundary, provider, collection, and last update.
  8. A notification is sent out to inform the user that new data is available through Amazon Simple Notification Service (Amazon SNS), which informs users and services about any updates on a subscription, such as new data being available or subscriptions having been created, deleted, updated, or having failed.
  9. Next, the data is published to our internal STAC catalog, which registers the satellite imagery and makes it directly accessible for subsequent processing.
  10. In case of failed Step Functions execution in Step 5, the Amazon SQS-based failure queue buffers failed executions. Failure messages contain the error message and request body. Depending on error reasons, they can be replayed using the corresponding API endpoint, enabling reprocessing through the replay endpoint on the API Lambda function. The endpoint also allows users to filter messages based on their failure type and to delete messages that cannot be replayed.
  11. An update checker, built on AWS Lambda, regularly checks whether a subscription can be updated. It is triggered in conjunction with an event scheduler every 5 minutes, checks the database for subscriptions that can be updated, and sends update request messages to the processing buffer. Besides actively checking resources, such as API endpoints and STAC catalogs, it also sends out an update message if a notification was received, for example, through an external notification service.
  12. Finally, a delete checker, also built on AWS Lambda, identifies subscriptions that can be deleted. It is triggered in conjunction with an event scheduler every 12 hours. It regularly checks the database for subscriptions that can be deleted and removes them from the database, the S3 bucket, and the STAC catalog. As a safety mechanism, a subscription will first be marked for deletion for 6 months before it gets deleted.

Imagery step function

The actual downloading and processing of data from different providers is handled by the imagery function, illustrated for two different providers (Public and Planet) in Figure 3.

Diagram showing detail state machine for the Imagery Step Function

Figure 3. Diagram showing detail state machine for the Imagery Step Function

  1. When a request arrives, the provider choice state determines the provider from the request body, depending on which the Step Functions flow routes to different Lambda states.
  2. In case a public provider is selected (for example, Earth Search), the Public_Provider Lambda function downloads the data from STAC-based open data providers and directly uploads it to the S3 data bucket, as shown in Figure 2.
  3. In case Planet data is selected, the data retrieval involves an asynchronous call to an external API: First, the Planet_Requester sends an order to the Planet API, together with a task token for pausing Step Functions and the URL of the Planet_Webhook Lambda function.
  4. The Planet_Webhook function is invoked by Planet when the requested order is available for downloading. Given the transmitted task token, Step Functions is resumed with the next state.
  5. Subsequently, the Planet_Provider Lambda function downloads and processes the Planet data.
  6. For both public providers and Planet, the subsequent Public_Provider Lambda function updates the subscription database entries, as shown in Figure 2 (for example, with the latest available timestamp), and adds the download and processed data to the internal STAC catalog, before it ends in the Success state.
  7. If an error occurs in any of the Lambda functions (2, 3, 5, 6), an error message is prepared in the Error_Parsing If an unknown provider is handed in, an error message, including the request body, is prepared in the Error_Provider_Unknown state. In both cases, the error message is pushed to the Failure_Queue (refer to #10 of Figure 2), before it ends in the Failure state.

Conclusion

BASF Digital Farming GmbH developed a serverless architecture on AWS for efficiently downloading and supplying satellite imagery for use by its xarvio platform. This architecture led to a 5x faster delivery rate, an 80% cost reduction through on-demand data downloading, and a 3x accelerated development cycle. Future work will include optimizing the architecture, exploring additional AWS services, and onboarding more satellite imagery providers. Similar serverless architectures using AWS services like AWS Step Functions, AWS Lambda, and Amazon API Gateway can enhance flexibility, scalability, and cost efficiency in imagery provisioning. Learn more about AWS serverless offerings at aws.amazon.com/serverless.

How REA Group approaches Amazon MSK cluster capacity planning

Post Syndicated from Eunice Aguilar original https://aws.amazon.com/blogs/big-data/how-rea-group-approaches-amazon-msk-cluster-capacity-planning/

This post was written by Eunice Aguilar and Francisco Rodera from REA Group.

Enterprises that need to share and access large amounts of data across multiple domains and services need to build a cloud infrastructure that scales as need changes. REA Group, a digital business that specializes in real estate property, solved this problem using Amazon Managed Streaming for Apache Kafka (Amazon MSK) and a data streaming platform called Hydro.

REA Group’s team of more than 3,000 people is guided by our purpose: to change the way the world experiences property. We help people with all aspects of their property experience—not just buying, selling, and renting—through the richest content, data and insights, valuation estimates, and home financing solutions. We deliver unparalleled value to our customers, Australia’s real estate agents, by providing access to the largest and most engaged audience of property seekers.

To achieve this, the different technical products within the company regularly need to move data across domains and services efficiently and reliably.

Within the Data Platform team, we have built a data streaming platform called Hydro to provide this capability across the whole organization. Hydro is powered by Amazon MSK and other tools with which teams can move, transform, and publish data at low latency using event-driven architectures. This type of structure is foundational at REA for building microservices and timely data processing for real-time and batch use cases like time-sensitive outbound messaging, personalization, and machine learning (ML).

In this post, we share our approach to MSK cluster capacity planning.

The problem

Hydro manages a large-scale Amazon MSK infrastructure by providing configuration abstractions, allowing users to focus on delivering value to REA without the cognitive overhead of infrastructure management. As the use of Hydro grows within REA, it’s crucial to perform capacity planning to meet user demands while maintaining optimal performance and cost-efficiency.

Hydro uses provisioned MSK clusters in development and production environments. In each environment, Hydro manages a single MSK cluster that hosts multiple tenants with differing workload requirements. Proper capacity planning makes sure the clusters can handle high traffic and provide all users with the desired level of service.

Real-time streaming is a relatively new technology at REA. Many users aren’t yet familiar with Apache Kafka, and accurately assessing their workload requirements can be challenging. As the custodians of the Hydro platform, it’s our responsibility to find a way to perform capacity planning to proactively assess the impact of the user workloads on our clusters.

Goals

Capacity planning involves determining the appropriate size and configuration of the cluster based on current and projected workloads, as well as considering factors such as data replication, network bandwidth, and storage capacity.

Without proper capacity planning, Hydro clusters can become overwhelmed by high traffic and fail to provide users with the desired level of service. Therefore, it’s very important to us to invest time and resources into capacity planning to make sure Hydro clusters can deliver the performance and availability that modern applications require.

The capacity planning approach we follow for Hydro covers three main areas:

  • The models used for the calculation of current and estimated future capacity needs, including the attributes used as variables in them
  • The models used to assess the approximate expected capacity required for a new Hydro workload joining the platform
  • The tooling available to operators and custodians to assess the historical and current capacity consumption of the platform and, based on them, the available headroom

The following diagram shows the interaction of capacity usage and the precalculated maximum usage.

Although we don’t have this capability yet, the goal is to take this approach one step further in the future and predict the approximate resource depletion time, as shown in the following diagram.

To make sure our digital operations are resilient and efficient, we must maintain a comprehensive observability of our current capacity usage. This detailed oversight allows us not only to understand the performance limits of our existing infrastructure, but also to identify potential bottlenecks before they impact our services and users.

By proactively setting and monitoring well-understood thresholds, we can receive timely alerts and take necessary scaling actions. This approach makes sure our infrastructure can meet demand spikes without compromising on performance, ultimately supporting a seamless user experience and maintaining the integrity of our system.

Solution overview

The MSK clusters in Hydro are configured with a PER_TOPIC_PER_BROKER level of monitoring, which provides metrics at the broker and topic levels. These metrics help us determine the attributes of the cluster usage effectively.

However, it wouldn’t be wise to display an excessive number of metrics on our monitoring dashboards because that could lead to less clarity and slower insights on the cluster. It’s more valuable to choose the most relevant metrics for capacity planning rather than displaying numerous metrics.

Cluster usage attributes

Based on the Amazon MSK best practices guidelines, we have identified several key attributes to assess the health of the MSK cluster. These attributes include the following:

  • In/out throughput
  • CPU usage
  • Disk space usage
  • Memory usage
  • Producer and consumer latency
  • Producer and consumer throttling

For more information on right-sizing your clusters, see Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost, Best practices for Standard brokers, Monitor CPU usage, Monitor disk space, and Monitor Apache Kafka memory.

The following table contains the detailed list of all the attributes we use for MSK cluster capacity planning in Hydro.

Attribute Name Attribute Type Units Comments
Bytes in Throughput Bytes per second Relies on the aggregate Amazon EC2 network, Amazon EBS network, and Amazon EBS storage throughput
Bytes out Throughput Bytes per second Relies on the aggregate Amazon EC2 network, Amazon EBS network, and Amazon EBS storage throughput
Consumer latency Latency Milliseconds High or unacceptable latency values usually indicate user experience degradation before reaching actual resource (for example, CPU and memory) depletion
CPU usage Capacity limits % CPU user + CPU system Should stay under 60%
Disk space usage Persistent storage Bytes Should stay under 85%
Memory usage Capacity limits % Memory in use Should stay under 60%
Producer latency Latency Milliseconds High or unacceptable sustained latency values usually indicate user experience degradation before reaching actual capacity limits or actual resource (for example, CPU or memory) depletion
Throttling Capacity limits Milliseconds, bytes, or messages High or unacceptable sustained throttling values indicate capacity limits are being reached before actual resource (for example, CPU or memory) depletion

By monitoring these attributes, we can quickly evaluate the performance of the clusters as we add more workloads to the platform. We then match these attributes to the relevant MSK metrics available.

Cluster capacity limits

During the initial capacity planning, our MSK clusters weren’t receiving enough traffic to provide us with a clear idea of their capacity limits. To address this, we used the AWS performance testing framework for Apache Kafka to evaluate the theoretical performance limits. We conducted performance and capacity tests on the test MSK clusters that had the same cluster configurations as our development and production clusters. We obtained a more comprehensive understanding of the cluster’s performance by conducting these various test scenarios. The following figure shows an example of a test cluster’s performance metrics.

To perform the tests within a specific time frame and budget, we focused on the test scenarios that could efficiently measure the cluster’s capacity. For instance, we conducted tests that involved sending high-throughput traffic to the cluster and creating topics with many partitions.

After every test, we collected the metrics of the test cluster and extracted the maximum values of the key cluster usage attributes. We then consolidated the results and determined the most appropriate limits of each attribute. The following screenshot shows an example of the exported test cluster’s performance metrics.

Capacity monitoring dashboards

As part of our platform management process, we conduct monthly operational reviews to maintain optimal performance. This involves analyzing an automated operational report that covers all the systems on the platform. During the review, we evaluate the service level objectives (SLOs) based on select service level indicators (SLIs) and assess the monitoring alerts triggered from the previous month. By doing so, we can identify any issues and take corrective actions.

To assist us in conducting the operational reviews and to provide us with an overview of the cluster’s usage, we developed a capacity monitoring dashboard, as shown in the following screenshot, for each environment. We built the dashboard as infrastructure as code (IaC) using the AWS Cloud Development Kit (AWS CDK). The dashboard is generated and managed automatically as a component of the platform infrastructure, along with the MSK cluster.

By defining the maximum capacity limits of the MSK cluster in a configuration file, the limits are automatically loaded into the capacity dashboard as annotations in the Amazon CloudWatch graph widgets. The capacity limits annotations are clearly visible and provide us with a view of the cluster’s capacity headroom based on usage.

We determined the capacity limits for throughput, latency, and throttling through the performance testing. Capacity limits of the other metrics, such as CPU, disk space, and memory, are based on the Amazon MSK best practices guidelines.

During the operational reviews, we proactively assess the capacity monitoring dashboards to determine if more capacity needs to be added to the cluster. This approach allows us to identify and address potential performance issues before they have a significant impact on user workloads. It’s a preventative measure rather than a reactive response to a performance degradation.

Preemptive CloudWatch alarms

We have implemented preemptive CloudWatch alarms in addition to the capacity monitoring dashboards. These alarms are configured to alert us before a specific capacity metric reaches its threshold, notifying us when the sustained value reaches 80% of the capacity limit. This method of monitoring enables us to take immediate action instead of waiting for our monthly review cadence.

Value added by our capacity planning approach

As operators of the Hydro platform, our approach to capacity planning has provided a consistent way to assess how far we are from the theoretical capacity limits of all our clusters, regardless of their configuration. Our capacity monitoring dashboards are a key observability instrument that we review on a regular basis; they’re also useful while troubleshooting performance issues. They help us quickly tell if capacity constraints could be a potential root cause of any ongoing issues. This means that we can use our current capacity planning approach and tooling both proactively or reactively, depending on the situation and need.

Another benefit of this approach is that we calculate the theoretical maximum usage values that a given cluster with a specific configuration can withstand from a separate cluster without impacting any actual users of the platform. We spin up short-lived MSK clusters through our AWS CDK based automation and perform capacity tests on them. We do this quite often to assess the impact, if any, that changes made to the cluster’s configurations have on the known capacity limits. According to our current feedback loop, if these newly calculated limits change from the previously known ones, they are used to automatically update our capacity dashboards and alarms in CloudWatch.

Future evolution

Hydro is a platform that is constantly improving with the introduction of new features. One of these features includes the ability to conveniently create Kafka client applications. To meet the increasing demand, it’s essential to stay ahead of capacity planning. Although the approach discussed here has served us well so far, it’s by no means the final stage , and there are capabilities that we need to extend and areas we need to improve on.

Multi-cluster architecture

To support critical workloads, we’re considering using a multi-cluster architecture using Amazon MSK, which would also affect our capacity planning. In the future, we plan to profile workloads based on metadata, cross-check them with capacity metrics, and place them in the appropriate MSK cluster. In addition to the existing provisioned MSK clusters, we will evaluate how the Amazon MSK Serverless cluster type can complement our platform architecture.

Usage trends

We have added CloudWatch anomaly detection graphs to our capacity monitoring dashboards to track any unusual trends. However, because the CloudWatch anomaly detection algorithm only evaluates up to 2 weeks of metric data, we will reassess its usefulness as we onboard more workloads. Aside from identifying usage trends, we will explore options to implement an algorithm with predictive capabilities to detect when MSK cluster resources degrade and deplete.

Conclusion

Initial capacity planning lays a solid foundation for future improvements and provides a safe onboarding process for workloads. To achieve optimal performance of our platform, we must make sure that our capacity planning strategy evolves in line with the platform’s growth. As a result, we maintain a close collaboration with AWS to continually develop additional features that meet our business needs and are in sync with the Amazon MSK roadmap. This makes sure we stay ahead of the curve and can deliver the best possible experience to our users.

We recommend all Amazon MSK users not miss out on maximizing their cluster’s potential and to start planning their capacity. Implementing the strategies listed in this post is a great first step and will lead to smoother operations and significant savings in the long run.


About the Authors

Eunice Aguilar is a Staff Data Engineer at REA. She has worked in software engineering in various industries throughout the years and recently for property data. She’s also an advocate for women interested in transitioning into tech, along with the well-versed who she takes inspiration from.

Francisco Rodera is a Staff Systems Engineer at REA. He has extensive experience building and operating large-scale distributed systems. His interests are automation, observability, and applying SRE practices to business-critical services and platforms.

Khizer Naeem is a Technical Account Manager at AWS. He specializes in Efficient Compute and has a deep passion for Linux and open-source technologies, which he leverages to help enterprise customers modernize and optimize their cloud workloads.

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

Post Syndicated from Leo Ramsamy original https://aws.amazon.com/blogs/big-data/how-anz-institutional-division-built-a-federated-data-platform-to-enable-their-domain-teams-to-build-data-products-to-support-business-outcomes/

In today’s rapidly evolving financial landscape, data is the bedrock of innovation, enhancing customer and employee experiences and securing a competitive edge. Recognizing this paradigm shift, ANZ Institutional Division has embarked on a transformative journey to redefine its approach to data management, utilization, and extracting significant business value from data insights.

Like many large financial institutions, ANZ Institutional Division operated with siloed data practices and centralized data management teams. As time went on, the limitations of this approach became apparent due to rising data complexity, larger volumes, and the growing demand for swift, business-driven insights. Consequently, the bank encountered several challenges and needed to take the following actions:

  • Create business insights from untapped data potential, estimated to be approximately $150 million in the Institutional Division alone
  • Improve operational efficiency by removing manual data handling, the use of spreadsheets, and duplicate data entries
  • Increase agility by making data expertise more readily available, thereby improving time to market and overall customer experience
  • Address data quality
  • Standardize tooling and remove the Shadow IT culture, driving scalability, reducing risk, and minimizing overall operational inefficiencies

These challenges are not unique to ANZ Institutional Division. Globally, financial institutions have been experiencing similar issues, prompting a widespread reassessment of traditional data management approaches.

One major trend, embraced by many financial institutions, has been the adoption of the data mesh architecture and the shift towards treating data as a product. This paradigm, pioneered by thought leaders like Zhamak Dehghani, introduces a decentralized approach to data management that aligns closely with modern organizational structures and agile methodologies.

Some notable global examples of leading companies embracing and implementing this trend are JPMorgan Chase, Capital One, and Saxo Bank.

Inspired by these global trends and driven by its own unique challenges, ANZ’s Institutional Division decided to pivot from viewing data as a byproduct of projects to treating it as a valuable product in its own right.

This shift promises several business benefits:

  • Empowered domain expertise – By decentralizing data ownership to domain-based teams, ANZ can use the deep business knowledge within each unit to create more relevant and valuable data products
  • Increased agility – Domain teams can now respond more quickly to business needs, creating and iterating on data products without relying on a centralized bottleneck
  • Improved data quality – With domain experts overseeing their own data, there’s a greater likelihood of catching and correcting quality issues at the source
  • Scalability – The federated approach allows for greater scalability, enabling ANZ to handle increasing data volumes and complexity more effectively
  • Innovation catalyst – By democratizing data access and empowering teams to create data products, ANZ is fostering a culture of innovation and data-driven decision-making across the organization

This transition is not just about technology; it represents a fundamental shift in how ANZ views and values its data assets. By treating data as a product, the bank is positioned to not only overcome current challenges, but to unlock new opportunities for growth, customer service, and competitive advantage.

This post explores how the shift to a data product mindset is being implemented, the challenges faced, and the early wins that are shaping the future of data management in the Institutional Division.

ANZ’s federated data strategy

In response to the challenges, ANZ Group formulated a data strategy that focuses on empowering employees to securely use data to improve the sustainability and financial well-being of their customers. At its core are the following pillars:

  • Introducing new ways of working that focus on generating customer value first
  • New technology platforms and tooling that allow the bank to collect, share, archive, and dispose data in a secure and controlled way
  • Achieving consistency in how data is produced and consumed across the entire bank through data products and better-connected systems
  • Supporting the bank’s risk and regulatory obligations by providing a secure and resilient data platform that provides fine-grained, controlled access to quality data products

ANZ has made the strategic decision to adopt an architectural and operational model aligned with the data mesh paradigm, which revolves around four key principles: domain ownership, data as a product, a self-serve data platform, and federated computational governance.

Domain ownership recognizes that the teams generating the data have the deepest understanding of it and are therefore best suited to manage, govern, and share it effectively. This principle makes sure data accountability remains close to the source, fostering higher data quality and relevance.

Treating data as a product instils a product-centric mindset, emphasizing that data must be secure, discoverable, understandable, interoperable, reusable, and managed throughout its lifecycle. This principle makes sure data consumers, both internal and external, derive consistent value from well-designed data products.

A self-serve data platform empowers domains to create, discover, and consume data products independently. It abstracts technical complexities and provides user-friendly tools, enabling a scalable, repeatable, and automated approach to producing high-quality data products.

Under the federated mesh architecture, each divisional mesh functions as a node within the broader enterprise data mesh, maintaining a degree of autonomy in managing its data products. To effectively coordinate these autonomous nodes and facilitate seamless integration, enterprise-wide standards, such as those related to data governance, interoperability, and security, are essential to maintain alignment and consistency across all nodes and domains and teams within.

With this approach, each node in ANZ maintains its divisional alignment and adherence to data risk and governance standards and policies to manage local data products and data assets. This enables global discoverability and collaboration without centralizing ownership or operations.

As a result, governance resides with the data products themselves, making sure standards and policies, such as access control, data quality, and compliance, are enforced where the data lives. In this regard, the enterprise data product catalog acts as a federated portal, facilitating cross-domain access and interoperability while maintaining alignment with governance principles. This model balances node or domain-level autonomy with enterprise-level oversight, creating a scalable and consistent framework across ANZ.

Within the ANZ enterprise data mesh strategy, aligning data mesh nodes with the ANZ Group’s divisional structure provides optimal alignment between data mesh principles and organizational structure, as shown in the following diagram.

Central to the success of this strategy is its support for each division’s autonomy and freedom to choose their own domain structure, which is closely aligned to their business needs. Divisions decide how many domains to have within their node; some may have one, others many. These nodes can implement analytical platforms like data lake houses, data warehouses, or data marts, all united by producing data products. Nodes and domains serve business needs and are not technology mandated.

Under the federated computational governance model, the ANZ Group strategy defines guardrails that treat a node as a logical data container suitable for the following:

  • Ingestion and metadata management
  • Creating source-aligned data products complying with ANZ’s Data Product Specification (DPS)
  • Integrating source-aligned data products from other nodes
  • Producing consumer-aligned data products for specific business purposes
  • Publishing conforming data products to ANZ’s Data Product Catalog (DPC)

Following on from this strategy is organizing its domain structure to provide autonomy to various functional teams while preserving the core values of data mesh. The following diagram depicts an example of the possible structure.

For instance, Domain A will have the flexibility to create data products that can be published to the divisional catalog, while also maintaining the autonomy to develop data products that are exclusively accessible to teams within the domain. These products will not be available to others until they are deemed ready for broader enterprise use.

This strategy supports each division’s autonomy to implement their own data catalogs and decide which data products to publish to the group-level catalog. This flexibility extends to divisional domains, which can choose which data products to publish to the divisional catalog or keep visible only to domain consumers.

Institutional Data & AI Platform architecture

The Institutional Division has implemented a self-service data platform to enable the domain teams to build and manage data products autonomously. The Institutional Data & AI platform adopts a federated approach to data while centralizing the metadata to facilitate simpler discovery and sharing of data products. The following diagram illustrates the building blocks of the Institutional Data & AI Platform.

The building blocks are as follows:

  1. Foundational Data & AI Platform capabilities – A dedicated data platform team provides domain-agnostic tools, systems, and capabilities to enable autonomous data product development across domains. This self-serve infrastructure allows domain teams to manage the full data lifecycle without relying on a centralized data team. Key capabilities include data storage, data onboarding and transformation, and data utilities that facilitate data sharing with interoperability between domains. These capabilities abstract the technical complexities associated with data management infrastructure, allowing domain experts to focus on creating valuable data products rather than infrastructure management.
  2. Domain-owned data assets – The domain-oriented data ownership approach distributes responsibility for data across the business units within the Institutional Division. Domain teams are responsible for developing, deploying, and managing their own analytical data products alongside operational data services. Data contracts authored by data product owners automate data product creation and provide a standard to access data products. By treating the data as a product, the outcome is a reusable asset that outlives a project and meets the needs of the enterprise consumer. Consumer feedback and demand drives creation and maintenance of the data product.
  3. Division-level metadata management and data governance – A centrally hosted service provides domain teams with the capability to publish their data products along with relevant metadata, like business definitions and lineage. Some of the key features implemented are:
    1. Metadata management that centralizes metadata and presents it within the context of data products, such as data quality scores and data product lineage.
    2. A data portal for consumers to discover data products and access associated metadata.
    3. Subscription workflows that simplify access management to the data products.
    4. Computational governance that enforces divisional and enterprise data policies and standards, such as data classification and business data models for aligning terminology.

The following diagram is a high-level example of the technical architecture approach towards the Institutional Data & AI Platform. The solution uses a building block approach, on a cloud-centered platform comprised of AWS services, with partner solutions and open standards like OpenLineage and Apache Iceberg.

Let’s look at the key services that enable the federated platform to operate at scale:

  • Data storage and processing:
    • Apache Iceberg on Amazon Simple Storage Service (Amazon S3) offers an optimized way to store data assets and products and promotes interoperability across other services
    • Amazon Redshift allows domain teams to create and manage fit-for-purpose data marts
    • AWS Lambda and AWS Glue are used for data onboarding and processing, and data utilities created in Python and PySpark promote reusability and quality across the data processing pipelines
    • dbt simplifies data transformation rules and allows sub-domain data analysts to build modeling logic as SQL statements
    • Amazon Managed Workflows for Apache Airflow (Amazon MWAA) enables efficient management of workflows and data pipeline orchestration using out-of-the-box integrations with AWS services
  • Metadata management and data governance:
    • To maintain data reliability and accuracy, a robust data quality framework using Soda core is used that automates data quality using checks defined in a data contract
    • Amazon DataZone enables data product cataloging, discovery, metadata management, and implementing computational governance
    • OpenLineage simplifies harvesting and collection of data and process-level lineage, which are then published to Amazon DataZone
    • AWS Lake Formation, combined with AWS Glue Data Catalog, provides data governance and access management to data products that reside within sub-domains
  • Analytics:
    • Tableau offers capabilities for sub-domains with data visualization and business intelligence capabilities
  • Observability and security:
    • Observability needs of the platform are built into all the processes using monitoring, with logging functionality provided by Amazon CloudWatch and AWS CloudTrail
    • AWS Secrets Manager makes sure secrets are stored and made available for data pipelines to access services in a secure manner

The technical implementation actualizes the data product strategy at ANZ Institutional Division. Amazon DataZone plays an essential role in facilitating data product management for the domain teams. The service addresses several critical aspects of the Institutional Division’s data product strategy, including:

  • Data cataloging and metadata management – Amazon DataZone provides comprehensive data cataloging and metadata management capabilities
  • Data governance and compliance – Effective data governance is essential for scaling data products
  • Self-service capabilities – Amazon DataZone empowers domain teams with self-service capabilities, enabling them to create, manage, and deploy data products independently
  • Integration and interoperability – One of the challenges in scaling data products is providing seamless integration across various data sources and systems
  • Collaboration and sharing – Amazon DataZone provides a platform for sharing data and metadata across teams and domains

Institutional Division’s delivery model to achieve scale

The Institutional Division has successfully used the federated architecture, and key to this delivery model is the implementation of Foundational Data & AI Platform capabilities that serve all domains within the division. This model promotes self-service and accelerates the delivery of subsequent initiatives by using the capabilities built for previous use cases.

To evaluate the success of the delivery model, ANZ has implemented key metrics, such as cost transparency and domain adoption, to guide the data mesh governance team in refining the delivery approach. For instance, one enhancement involves integrating cross-functional squads to support data literacy.

The key to scaling the Institutional Division operating model are the following considerations:

  • Data as a product approach – Use techniques like event storming and domain-driven design to capture business events and their meanings.
  • Education and enablement – Conduct learning interventions to upskill teams on understanding and using the data as a product approach.
  • Iterative data platform delivery – Work backward from business initiative to iteratively deliver self-service data platform infrastructure capabilities.
  • Managing demand efficiently – Implement a feedback mechanism to manage demand on data products. Track and manage data debt using standard data contract specifications. Most importantly, adopt governance and standards to make sure data products are built and maintained with a long-term perspective, minimizing technical debt.

“The Institutional Data & Analytics Platform (IDAP) has allowed the Institutional team to establish a base foundation to allow various teams to aggregate and consume the wealth of data across the division. This self-service platform enables business leaders to both create and consume reusable data products, unlocking value across this division. It’s also an excellent proof point for our broader data mesh architecture, allowing us to connect this divisional data to broader enterprise data stores—further positioning us to put the customer at the center of everything we do.”

– Tim Hogarth, CTO ANZ

“AWS believes that democratizing data, while not compromising on security and fine-grained access, is a key component of any future-proof, scalable data platform, so we are pleased to be enabling ANZ bank’s IDAP metadata management and data governance capabilities through Amazon DataZone. This allows the diverse business functions at ANZ the autonomy to self-serve on their data needs with built-in governance.”

– Shikha Verma, Head of Product, Amazon DataZone

Conclusion

ANZ’s journey to move towards a data product approach has improved the organization’s approach to manage data and reduce data silos, and has positioned it to become a data-driven, customer-centric organization. By combining federated platform practices and adopting AWS services and open standards, ANZ Institutional Division is achieving its objectives in decentralization with a scalable data platform that enables its domain teams to make informed decisions, drive innovation, and maintain a competitive edge.

Special thanks: This implementation success is a result of close collaboration between ANZ Institutional Division, AWS ProServe, and the AWS account team. We want to thank ANZ Institutional Executives and the Leadership Team for the strong sponsorship and direction.


About the Authors

Leo Ramsamy is a Platform Architect specializing in data and analytics for ANZ’s Institutional division. He focuses on modern data practices, including Data Mesh architecture, data governance, quality management, and observability. His work aligns data strategies with business goals, improving accessibility and enabling better decision-making across ANZ.

Srinivasan Kuppusamy is a Senior Cloud Architect – Data at AWS ProServe, where he helps customers solve their business problems using the power of AWS Cloud technology. His areas of interests are data and analytics, data governance, and AI/ML.

Rada Stanic is a Chief Technologist at Amazon Web Services, where she helps ANZ customers across different segments solve their business problems using AWS Cloud technologies. Her special areas of interest are data analytics, machine learning/AI, and application modernization.

How SmugMug Increased Data Modeling Productivity with Amazon Q Developer

Post Syndicated from Will Matos original https://aws.amazon.com/blogs/devops/how-smugmug-increased-data-modeling-productivity-with-amazon-q-developer/

This post is co-written with Dr. Geoff Ryder, Manager, at SmugMug.

Introduction

SmugMug operates two very large online photo platforms: SmugMug and Flickr. These platforms enable more than 100 million customers to safely store, search, share, and sell tens of billions of photos every day. However, the data science and engineering team at SmugMug and Flickr often faces complex data modeling challenges that require significant time to resolve.

These challenges arise due to several factors. First, the team has to contend with diverse datasets from different sources. Additionally, the database schema and tables are highly complex, and the team needs to quickly understand application (PHP) code and database table structures in order to generate the necessary complex database queries. Specifically, SmugMug uses Amazon Redshift as its cloud data warehouse to analyze patterns in petabyte-scale data stored in Amazon S3, as well as transactional data in Amazon Aurora and Amazon DynamoDB. This allows them to generate dozens of business reports daily.

However, the complexity increases further as many database tables also need to be imported from third-party organizations into Amazon Redshift, where they are joined with SmugMug and Flickr’s internal tables. In extreme cases, properly modeling all these database tables and handling issues like granularity, cardinality, timestamps and missing data could take years – an impractical timeline for the business. We are excited to walk through SmugMug’s data modeling use cases and how SmugMug uses Amazon Q Developer to improve the data science and engineering team’s productivity.

Discovering Amazon Q Developer

SmugMug was one of the first customers to pilot Amazon Q Developer (previously Amazon CodeWhisperer), the most capable AI-powered assistant for software development that re-imagines the experience across the entire software development lifecycle, making it easier and faster to build, secure, manage, optimize, operate, and transform applications on AWS. There are multiple Amazon Q Developer use cases at SmugMug and Flickr, such as using Amazon Q Developer agent (/dev) for software development (i.e. generating implementation plans and the accompanying code), generating inline code suggestions, asking Amazon Q Developer in chat about AWS services and best practices, and analyzing AWS usage and costs for Cloud Financial Management (CFM) needs. For the data science and engineering team specifically, the key feature is chatting with Amazon Q Developer in integrated development environments (IDEs) like Intellij DataGrip. The data analysts and data scientists at SmugMug and Flickr ask questions in Amazon Q Developer chat to analyze database schemas, generate data model diagrams from DDL (Data Definition Language) statements, convert queries between languages, automatically generate complex database queries for data analysis, generate code to validate table contents, and predict trends using ML (Machine Learning).

Implementing Amazon Q Developer

To solve the data modeling challenges SmugMug faced, the team collaborated closely with their AWS Account Team, AWS Professional Services, and the Amazon Q Developer service team to create and test a data modeling assistant solution using Amazon Q Developer.

As a first step, the data modeler needs to bring the right metadata to bear. For simpler cases, the commands “show view myschema.v” or “show table myschema.t“ retrieve DDL schema information about the specified view or table from Amazon Redshift into the IDE console.

Here’s an example using simulated data for a hypothetical company. For this typical company that handles orders for products, the result of typing “show table sample.orderinfo” and “show table sample.skuinfo”might be:

Image of SQL statement generated by the show table statement. "CREATE TABLE sample.skuinfo ( sku_id bigint ENCODE raw, sku_vendor bigint ENCODE az64, sku_category character varying(18) ENCODE lzo, sku_description character varying(255) ENCODE lzo, date_sku_created timestamp without time zone ENCODE az64, date_sku_updated timestamp without time zone ENCODE az64, pipeline_inserted_at timestamp without time zone ENCODE az64 ) DISTSTYLE KEY SORTKEY ( sku_id );"

Image of SQL statement generated by the show table statement. "CREATE TABLE sample.orderinfo ( order_id bigint ENCODE raw, shipper_id bigint ENCODE az64 distkey, product_id bigint ENCODE az64, quantity_ordered integer ENCODE az64, date_order_placed timestamp without time zone ENCODE az64 ) DISTSTYLE KEY SORTKEY ( order_id );"

This DDL text is now in the open tab. By selecting the text to highlight it, that DDL text becomes part of the context that Amazon Q Developer sees. The modeler can start asking questions about them in the Amazon Q Developer chat window in the IDE.

Diagram showing what is considered part of the context included in a request including the RAG query result, related documents when using the at-workspace key word, the highlighted text in the IDE open tab,the chat history, and the prompt.

In complex scenarios, establishing the correct modeling context requires a combination of schema information, legacy SQL, application source code in various programming languages, sample values, and natural language documentation. Amazon Q Developer addresses this by creating a local index of relevant files and content. When a question is asked using @workspace, this index is consulted to identify and include pertinent sections of code and information in the request. (See this article for additional details on workspace). The prompt plays a crucial role in measuring similarity, so providing comprehensive context within it is essential. To optimize this process, the IDE settings feature a tunable workspace index function, allowing for enhanced performance in identifying and incorporating relevant context.

Image showing the Amazon Q Settings window where you enable the Workspace feature by checking the "Workspace index" box. You can also change the number of worker threads used, and the maximum workspace index size in MB.

Workspace Index Settings

By adopting Amazon Q Developer as a team, we are able to jointly develop and share proprietary prompt text to address the four steps in our modeling process, as follows.

Step 1. Define the goal for the data modeling project

From prior knowledge, sketch a high-level goal for a data model. Gather the data for it manually, or by e.g. querying a vector database and adding its documents to the project.

For this example, we choose as the goal to compute aggregated metrics from a new table or view composed of two existing tables, sample.orderinfo and sample.skuinfo. These contain simulated data about product sales that are common to many companies. The order table is in the style of a fact table that logs customer orders, and the stock keeping unit (SKU) table is a dimension table that provides additional data points of interest about each order. The order and SKU information need to be combined by a join operation before we can compute the metrics. We would like Amazon Q Developer to tell us how to write that SQL join statement.

Step 2. Conduct an exploratory analysis and generate candidates

Next, prompt Amazon Q Developer for candidate foreign keys to join the tables, and for SQL code to execute those joins. Generate an entity-relationship diagram (ERD) as a visual aid. Prompts do not have to be complicated. For example:

@workspace What columns of database tables sample.orderinfo and sample.skuinfo 
would be best to join the two tables? Provide SQL code for the join. Draw an 
entity relationship diagram that shows the joins between the two tables, and 
includes only the fields involved in the join. Add a crow's foot cardinality 
marker to indicate a 1:many relationship, and add it next to the high 
cardinality table.

Image with the first part of the response to the prompt with the following text: "Based on the table schemas, sku_id is the appropriate column to join these tables. The relationship is likely one-to-many (1:M) where one SKU can appear in multiple orders. Here's the SQL join: SELECT o.order_id, o.sku_id, s.sku_description FROM sample.orderinfo o JOIN sample.skuinfo s ON o.sku_id = s.sku_id;

Image with the second part of the response to the prompt with the ASCII relationship diagram showing the join relationship.

Each time tables are joined together, new aggregated metrics become available to drive business insights. Now, for instance, we can find the top selling SKUs in October thanks to our results:

Image shows the top 5 results from the prior query showing the top skus in October.

Sometimes we need to look at code written in languages other than SQL to complete the data model. For example, the names of some vendors this company works with happen to appear in application PHP code as human readable strings, but are saved in the application database as numbers. The analytics data staged in Redshift only contain the numbers. So, we pull a copy of the PHP text file into @workspace, and ask Amazon Q Developer to translate the relevant string-integer mappings into a SQL case statement.

Image shows the selected PHP code with a switch statement mapping Vendor Ids to Vendor Names.

PHP Switch statement showing the mapping of Vendor Ids to String Names.

I am a Redshift database administrator and I am working on a data modeling 
problem. I would like to write SQL statements to join tables sample.orderinfo 
and sample.skuinfo. Please write that SQL to join the two tables. Also, I 
would like to write a SQL case statement to recover all string values defined 
in PHP that are represented as integer values in the database table.

The output of that prompt is shown below.

Image showing the updated SQL query that maps the Vendor Id to the Vendor Name.

Amazon Q Developer automatically detected the PHP switch case statement, converted to SQL, and added it to the final query. Many other programming languages are supported, and modelers should try this technique with other kinds of source code. Note that data scientists and analysts may not know where to look in complex application code for these details, so this discovery-plus-code translation step is a net new benefit to our company that is only possible thanks to Amazon Q Developer.

Step 3. Create code to test the analysis

Now we request SQL source code for a battery of small test queries. These can return cardinality, grain, arithmetic, and null count results.

Please write a short SQL test to compute counts of the key fields that are used 
in the joins, which will verify the cardinality assignments indicated in the 
entity relationship diagram above. The SQL test should compare distinct counts 
to total counts and null counts when it verifies the cardinality.

Image of resulting SQL queries to check cardinality.

Step 4. Validate the results of the analysis

Run the test queries to see if the candidate solution from step 2 meets our goals. The “Insert at cursor” button at the bottom of the response is handy for this. The data modeler can easily spot an error in the join logic and ERD from inspecting the output of the test query. (Or, if it’s hard to interpret the results, keep making the test queries simpler.) If errors arise from the AI misinterpreting or miscalculating a result, or from a vaguely worded prompt, simply adjust the prompt in step 2 to fix the known errors, and repeat steps 2 – 4.

Image showing the query results from the cardinality query.

After a few iterations, taking from seconds to at most tens of minutes each, the modeling errors have been worked out and we arrive at a valid production query.

Key Benefits and Results

With this Amazon Q Developer powered solution and iterative approach, SmugMug has achieved highly accurate data modeling results across numerous database tables. Once the correct modeling configuration is established, various useful outputs may become available.

We already described production SQL, unit tests, and ERDs for documentation. By the end of the process, because Amazon Q Developer has a good understanding of the data it just modeled in its chat history, it will also generate useful Python machine learning programs to predict business trends. Here is a prompt for that, and a partial screenshot of the Python output:

Please write Python code to implement a linear regression that predicts the 
quantity_ordered value based on other fields in the data set. Choose predictor 
variables that are less likely to cause multi-collinearity problems.

Image showing the python code generated to predict quantity_ordered value.

This only shows the model training step, but the full response included all library imports, a Redshift query, feature engineering steps, ML performance metrics, and code for plotting the metrics. And the AI can produce other types of predictive models. For example, you can try:

Please write Python code to implement an XGBoost model that predicts the 
quantity_ordered value based on other fields in the data set.

Ultimately, the solution has improved team productivity for both existing and new team members, while maintaining legacy knowledge needed to onboard new team members more efficiently. Key benefits include:

  1. Reducing SmugMug data analyst and scientist’s time spent on data modeling tasks from days to hours, allowing them to reallocate this time to other high-priority projects.
  2. Automating the generation of BI documentation and predictive ML, also saving crucial time.
  3. Providing net new value by translating application code constant definitions into SQL. Due to organizational boundaries, we would not have achieved this without an assist from the AI.

Future Plans and Expansion

SmugMug conducted the initial data modeling use case testing with over a dozen data science team members and analysts. We are moving on to analyze more complex tables and data schemas, and generating Python code in Amazon SageMaker for ML tasks like data preparation, training, inference, and MLOps. From our experience, Amazon Q Developer has become a preferred internal tool for development that has a data modeling component, and its use continues to expand to different groups around the company.

For SmugMug’s data modeling projects, we continue to enhance the four-step process described above. In order to gather the most relevant context to solve a problem, we build vector database collections to pull from schemas, older SQL code, application source code, BI tool content, and curated documentation. The vector search operation surfaces the right content, and spares data modelers from manually searching in different code archives. We use ChromaDB to do the searches, and bring the results from ChromaDB into the workspace as additional files.

Conclusion

Using Amazon Q Developer for data modeling use cases, SmugMug has managed to increase data science and engineering team productivity by up to 100% when compared to prior workflows. To explore how Amazon Q Developer can benefit your organization, get started here. If you have questions or suggestions, please leave a comment below.

About the Authors

Image of Dr. Geoffrey Ryder

Dr. Geoffrey Ryder

Dr. Geoff Ryder serves as the Manager of Data Science and Engineering at SmugMug, where he leads Team Prophecy in managing the company’s cloud-based data warehouse and analytics platforms. With a focus on leveraging the best AI tools, his team empowers photography clients to enhance their sales of both physical and digital photographic products. Geoff brings over two decades of experience in technical and business roles across Silicon Valley companies, and holds a PhD in Computer Engineering from UC-Santa Cruz.

Will Matos

Will Matos is a Principal Specialist Solutions Architect at AWS, revolutionizing developer productivity through Generative AI, AI-powered chat interfaces, and code generation. With 25 years of tech experience, and over 9 years with AWS, he collaborates with product teams to create intelligent solutions that streamline workflows and accelerate software development cycles. A thought leader engaging early adopters, Will bridges innovation and real-world needs.

Sreenivas Adiki

Sreenivas Adiki is a Sr. Customer Delivery Architect in ProServe, with a focus on data and analytics. He ensures success in designing, building, optimizing, and transforming in the area of Big Data/Analytics. Ensuring solutions are well-designed for successful deployment, Sreenivas participates in deep architectural discussions and design exercises. He has also published several AWS assets, such as whitepapers and proof-of-concept papers.

Kevin Bell

Kevin Bell is a Sr. Solutions Architect at AWS based in Seattle. He has been building things in the cloud for about 10 years. You can find him online as @bellkev on GitHub.

Corey Keane

Corey Keane is a Media and Entertainment (M&E) Sr. Account Manager at AWS. Corey has held a number of positions at Amazon and AWS throughout his 8 years with the company across M&E—including technical business development for strategic partnerships with international game developers, in addition to his current role managing AWS customers in the Media vertical. He leans on his pan-Amazon experience from working on other teams to identify new partnerships between our customers and other Amazon businesses to bring disruptive products to market.

How FINRA established real-time operational observability for Amazon EMR big data workloads on Amazon EC2 with Prometheus and Grafana

Post Syndicated from Sumalatha Bachu original https://aws.amazon.com/blogs/big-data/how-finra-established-real-time-operational-observability-for-amazon-emr-big-data-workloads-on-amazon-ec2-with-prometheus-and-grafana/

This is a guest post by FINRA (Financial Industry Regulatory Authority). FINRA is dedicated to protecting investors and safeguarding market integrity in a manner that facilitates vibrant capital markets.

FINRA performs big data processing with large volumes of data and workloads with varying instance sizes and types on Amazon EMR. Amazon EMR is a cloud-based big data environment designed to process large amounts of data using open source tools such as Hadoop, Spark, HBase, Flink, Hudi, and Presto.

Monitoring EMR clusters is essential for detecting critical issues with applications, infrastructure, or data in real time. A well-tuned monitoring system helps quickly identify root causes, automate bug fixes, minimize manual actions, and increase productivity. Additionally, observing cluster performance and usage over time helps operations and engineering teams find potential performance bottlenecks and optimization opportunities to scale their clusters, thereby reducing manual actions and improving compliance with service level agreements.

In this post, we talk about our challenges and show how we built an observability framework to provide operational metrics insights for big data processing workloads on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) clusters.

Challenge

In today’s data-driven world, organizations strive to extract valuable insights from large amounts of data. The challenge we faced was finding an efficient way to monitor and observe big data workloads on Amazon EMR due to its complexity. Monitoring and observability for Amazon EMR solutions come with various challenges:

  • Complexity and scale – EMR clusters often process massive volumes of data across numerous nodes. Monitoring such a complex, distributed system requires handling high data throughput and achieving minimal performance impact. Managing and interpreting the large volume of monitoring data generated by EMR clusters can be overwhelming, making it difficult to identify and troubleshoot issues in a timely manner.
  • Dynamic environments – EMR clusters are often ephemeral, created and shut down based on workload demands. This dynamism makes it challenging to consistently monitor, collect metrics, and maintain observability over time.
  • Data variety – Monitoring cluster health and having visibility into clusters to detect bottlenecks, unexpected behavior during processing, data skew, job performance, and so on are crucial. Detailed observability into long-running clusters, nodes, tasks, potential data skews, stuck tasks, performance issues, and job-level metrics (like Spark and JVM) is very critical to understand. Achieving comprehensive observability across these varied data types was difficult.
  • Resource utilization – EMR clusters consist of various components and services working together, making it challenging to effectively monitor all aspects of the system. Monitoring resource utilization (CPU, memory, disk I/O) across multiple nodes to prevent bottlenecks and inefficiencies is essential but complex, especially in a distributed environment.
  • Latency and performance metrics –Capturing and analyzing latency and comprehensive performance metrics in real time to identify and resolve issues promptly is critical, but it’s challenging due to the distributed nature of Amazon EMR.
  • Centralized observability dashboards – Having a single pane of glass for all aspects of EMR cluster metrics, including cluster health, resource utilization, job execution, logs, and security, in order to provide a complete picture of the system’s performance and health, was a challenge.
  • Alerting and incident management – Setting up effective centralized alerting and notification systems was challenging. Configuring alerts for critical events or performance thresholds requires careful consideration to avoid alert fatigue while making sure important issues are addressed promptly. Responding to incidents from performance slowdowns or disruptions takes time and effort to detect and remediate the issues if proper alerting mechanism is not in place.
  • Cost management – Lastly, optimizing costs while maintaining effective monitoring is an ongoing challenge. Balancing the need for comprehensive monitoring with cost constraints requires careful planning and optimization strategies to avoid unnecessary expenses while still providing adequate monitoring coverage.

Effective observability for Amazon EMR requires a combination of the right tools, practices, and strategies to address these challenges and provide reliable, efficient, and cost-effective big data processing.

The Ganglia system on Amazon EMR is designed to monitor complete cluster and all nodes’ health, which shows several metrics like Hadoop, Spark, and JVM. When we view the Ganglia web UI in a browser, we see an overview of the EMR cluster’s performance, detailing the load, memory usage, CPU utilization, and network traffic of the cluster through different graphs. However, with Ganglia’s deprecation announced by AWS for higher versions of Amazon EMR, it became important for FINRA to build this solution.

Solution overview

Insights drawn from the post Monitor and Optimize Analytic Workloads on Amazon EMR with Prometheus and Grafana inspired our approach. The post demonstrated how to set up a monitoring system using Amazon Managed Service for Prometheus and Amazon Managed Grafana to effectively monitor an EMR cluster and use Grafana dashboards to view metrics to troubleshoot and optimize performance issues.

Based on these insights, we completed a successful proof of concept. Next, we built our enterprise central monitoring solution with Managed Prometheus and Managed Grafana to mimic Ganglia-like metrics at FINRA. Managed Prometheus allows for real-time high-volume data collection, which scales the ingestion, storage, and querying of operational metrics as workloads increase or decrease. These metrics are fed to the Managed Grafana workspace for visualizations.

Our solution includes a data ingestion layer for every cluster, with configuration for metrics collection through a custom-built script stored in Amazon Simple Storage Service (Amazon S3). We also installed Managed Prometheus at startup for EC2 instances on Amazon EMR through a bootstrap script. Additionally, application-specific tags are defined in the configuration file to optimize inclusion and collect the specific metrics.

After Managed Prometheus (installed on EMR clusters) collects the metrics, they are sent to a remote Managed Prometheus workspace. Managed Prometheus workspaces are logical and isolated environments dedicated to Managed Prometheus servers that manage specific metrics. They also provide access control for authorizing who or what sends and receives metrics from that workspace. You can create one more workspace by account or application depending on the need, which facilitates better management.

After metrics are collected, we built a mechanism to render them on Managed Grafana dashboards that are then used for consumption through an endpoint. We customized the dashboards for task-level, node-level, and cluster-level metrics so they can be promoted from lower environments to higher environments. We also built several templated dashboards that display node-level metrics like OS-level metrics (CPU, memory, network, disk I/O), HDFS metrics, YARN metrics, Spark metrics, and job-level metrics (Spark and JVM), maximizing the potential for each environment through automated metric aggregation in each account.

We chose a SAML-based authentication option, which allowed us to integrate with existing Active Directory (AD) groups, helping minimize the work needed to manage user access and grant user-based Grafana dashboard access. We arranged three main groups—admins, editors, and viewers—for Grafana user authentication based on user roles.

Through elaborate monitoring automation, these desired metrics are pushed to Amazon CloudWatch. We use CloudWatch for necessary alerting when it exceeds the desired thresholds for each metric.

The following diagram illustrates the solution architecture.

Sample dashboards

The following screenshots showcase example dashboards.

Conclusion

In this post, we shared how FINRA enhanced data-driven decision-making with comprehensive EMR workload observability to optimize performance, maintain reliability, and gain critical insights into big data operations, leading to operational excellence.

FINRA’s solution enabled the operations and engineering teams to use a single pane of glass for monitoring big data workloads and quickly detecting any operational issues. The scalable solution significantly reduced time to resolution and enhanced our overall operational stance. The solution empowered the operations and engineering teams with comprehensive insights into various Amazon EMR metrics like OS levels, Spark, JMX, HDFS, and Yarn, all consolidated in one place. We also extended the solution to use cases such as Amazon Elastic Kubernetes Service (Amazon EKS) clusters, including EMR on EKS clusters and other applications, establishing it as a one-stop system for monitoring metrics across our infrastructure and applications.


About the Authors

Sumalatha Bachu is Senior Director, Technology at FINRA. She manages Big Data Operations which includes managing petabyte-scale data and complex workloads processing in cloud. Additionally, she is an expert in developing Enterprise Application Monitoring and Observability Solutions, Operational Data Analytics, & Machine Learning Model Governance work flows. Outside of work, she enjoys doing yoga, practicing singing, and teaching in her free time.

PremKiran Bejjam is Lead Engineer Consultant at FINRA, specializing in developing resilient and scalable systems. With a keen focus on designing monitoring solutions to enhance infrastructure reliability, he is dedicated to optimizing system performance. Beyond work, he enjoys quality family time and continually seeks out new learning opportunities.

Akhil Chalamalasetty is Director, Market Regulation Technology at FINRA. He is a Big Data subject matter expert specializing in building cutting edge solutions at scale along with optimizing workloads, data, and its processing capabilities. Akhil enjoys sim racing and Formula 1 in his free time.