Tag Archives: Customer Solutions

How Klarna Bank AB built real-time decision-making with Amazon Kinesis Data Analytics for Apache Flink

Post Syndicated from Nir Tsruya original https://aws.amazon.com/blogs/big-data/how-klarna-bank-ab-built-real-time-decision-making-with-amazon-kinesis-data-analytics-for-apache-flink/

This is a joint post co-authored with Nir Tsruya from Klarna Bank AB.

Klarna is a leading global payments and shopping service, providing smarter and more flexible shopping and purchase experiences to 150 million active consumers across more than 500,000 merchants in 45 countries. Klarna offers direct payments, pay after delivery options, and instalment plans in a smooth one-click purchase experience that lets consumers pay when and how they prefer to. The ability to utilize data to make near-real-time decisions is a source of competitive advantage for Klarna.

This post presents a reference architecture for real-time queries and decision-making on AWS using Amazon Kinesis Data Analytics for Apache Flink. In addition, we explain why the Klarna Decision Tooling team selected Kinesis Data Analytics for Apache Flink for their first real-time decision query service. We show how Klarna uses Kinesis Data Analytics for Apache Flink as part of an end-to-end solution including Amazon DynamoDB and Apache Kafka to process real-time decision-making.

AWS offers a rich set of services that you can use to realize real-time insights. These services include Kinesis Data Analytics for Apache Flink, the solution that Klarna uses to underpin automated decision-making in their business today. Kinesis Data Analytics for Apache Flink allows you to easily build stream processing applications for a variety of sources, including Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon MQ.

The challenge: Real-time decision-making at scale

Klarna’s customers expect a real-time, frictionless, online experience when shopping and paying online. In the background, Klarna needs to assess risks such as credit risk, fraud attempts, and money laundering for every customer credit request in every operating geography. The outcome of this risk assessment is called a decision. Decisions generate millions of risk assessment transactions a day that must be run in near-real time. The final decision is the record of whether Klarna has approved or rejected the request to extend credit to a consumer. These underwriting decisions are critical artefacts. First, they contain information that must be persisted for legal reasons. Second, they are used to build profiles and models that are fed into underwriting policies to improve the decision process. Under the hood, a decision is the sum of a number of transactions (for example, credit checks), coordinated and persisted via a decision store.

Klarna wanted to build a framework to ensure decisions persist successfully, enabling timely risk assessment and quick decisions for customers. The Klarna team first looked to solve the problem of producing and capturing decisions with a combination of Apache Kafka and AWS Lambda. By publishing decision artefacts directly to a Kafka topic, the Klarna team found that high latency could cause long transaction wait times or transactions to be rejected altogether, leading to delays in getting ratified decisions to customers in a timely fashion and potential lost revenue. This approach also caused operational overhead for the Klarna team, including management of schema evolution, replaying old events, and native integration of Lambda with their self-managed Apache Kafka clusters.

Design requirements

Klarna set out their requirements for a solution to capture risk assessment artefacts (decisions), acting as a source of truth for all underwriting decisions within Klarna. The key requirements included at-least-once reliability and millisecond latency, enabling real-time access to decision-making and the ability to replay past events in case of missing data in downstream systems. Additionally, the team needed a system that could scale to keep pace with Klarna’s rapid (10 times) growth.

Solution overview

The solution consists of two components: a highly available API with DynamoDB as the data store for each decision, and Amazon DynamoDB Streams with Kinesis Data Analytics. Kinesis Data Analytics is a fully managed Apache Flink service and is used to stream, process, enrich, and standardize decisions in real time and to replay past events if needed.

The following diagram illustrates the overall flow from end-user to the downstream systems.

The flow includes the following steps:

  1. As the end-user makes a purchase, the policy components assess risk and the decision is sent to a decision store via the Decision Store API.
  2. The Decision Store API persists the data in DynamoDB and responds to the requester. Decisions for each transaction are time-ordered and streamed by DynamoDB Streams. Decision Store also enables centralised schema management and handles evolution of event schemas.
  3. The Kinesis Data Analytics for Apache Flink application is the consumer of DynamoDB Streams. The application makes sure that the decisions captured conform to the expected event schema before they are published to a Kafka topic to be consumed by various downstream systems. Here, Kinesis Data Analytics for Apache Flink plays a vital part in the delivery of those events: aggregating, enriching, and mapping data to adhere to the event schema. This provides a standardized way for consumers to access decisions from their respective producers. The application provides at-least-once delivery, and Flink’s checkpoint and retry mechanism guarantees that every event is processed and persisted (a minimal sketch of the checkpointing setup follows this list).
  4. The published Kafka events are consumed by the downstream systems and stored in an Amazon Simple Storage Service (Amazon S3) bucket. The events stored in Amazon S3 reflect every decision ever taken by the producing policy components, and can be used by the decision store to backfill and replay any past events. In addition to preserving the history of decision events, events are also stored as a set of variables in the variable store.
  5. Policy components use the variable store to check for similar past decisions to determine if a request can be accepted or denied immediately. The request is then processed as described by the preceding workflow, and subsequent requests are answered by the variable store based on the result of the previous decision.
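
The at-least-once behavior in step 3 comes from Flink's checkpointing. The following is a minimal, illustrative PyFlink sketch of how checkpointing might be enabled; it is not Klarna's actual application code, and the checkpoint interval is an assumed value.

from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint periodically so that, on failure, the job restarts from the last
# successful checkpoint and reprocesses any in-flight events (at-least-once).
env.enable_checkpointing(60000)  # interval in milliseconds (assumed value)
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.AT_LEAST_ONCE)

# The DynamoDB Streams source and Kafka sink are configured separately and are
# omitted here; end-to-end delivery guarantees also depend on the sink configuration.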

The decision store provides a standardized workflow for processing and producing events for downstream systems and customer support. With all the events captured and safely stored in DynamoDB, the decision store provides an API for support engineers (and other supporting tools like chatbots) to query and access past decisions in near-real time.

Solution impact

The solution provided benefits in three areas.

First, the managed nature of Kinesis Data Analytics allowed the Klarna team to focus on value-adding application development instead of managing infrastructure. The team is able to onboard new use cases in less than a week. They can take full advantage of the auto scaling feature in Kinesis Data Analytics and pre-built sources and destinations.

Second, the team can use Apache Flink to ensure the accuracy, completeness, consistency, and reliability of data. Flink’s native support for stateful computation and data accuracy through checkpoints and savepoints directly supports the Klarna team’s vision to add more logic into the pipelines, allowing the team to expand to different use cases confidently. Additionally, the low latency of the service ensures that enriched decision artefacts are available to consumers, and subsequently to the policy agents for future decision-making, in near-real time.

Third, the solution enables the Klarna team to take advantage of the Apache Flink open-source community, which provides rich community support and the opportunity to contribute back to the community by bug fixing or adding new features.

This solution has proven to scale with increased adoption of a new use case, translating to a 10-times increase in events over 3 months.

Lessons learned

The Klarna team faced a few challenges with Flink serialization and with upgrading Apache Flink versions. Flink serialization is critical for the application’s performance: Flink uses different serializers to serialize data passed between operators, and it’s up to the team to configure the most efficient serializer for the use case. The Klarna team configured their objects as Flink POJOs, which reduced the pipeline runtime by 85%. For more information, refer to Flink Serialization Tuning Vol. 1: Choosing your Serializer — if you can, before deploying a Flink application to production.

The other challenge faced by the team was upgrading the Apache Flink version in Kinesis Data Analytics. Presently, upgrading the Apache Flink version requires the creation of a new Kinesis Data Analytics for Apache Flink application, and reusing a snapshot (the binary artefact representing the state of the Flink application, used to restore the application to the last checkpoint taken) is not possible between two different applications. For that reason, upgrading the Apache Flink version requires additional steps to ensure the application doesn’t lose data.

What’s next for Klarna and Kinesis Data Analytics for Apache Flink?

The team is looking into expanding the usage of Kinesis Data Analytics and Flink in Klarna. Because the team is already highly experienced in the technology, their first ambition will be to own the infrastructure of a Kinesis Data Analytics for Apache Flink deployment and connect it to different Klarna data sources. The team will then host business logic provided by other departments in Klarna, such as Fraud Prevention. This will allow the specialised teams to concentrate on the business logic and fraud detection algorithms, while Decision Tooling handles the infrastructure.

Klarna, AWS, and the Flink community

A key part of choosing Kinesis Data Analytics for Apache Flink was the open-source community and support.

Several teams within Klarna created different implementations of a Flink DynamoDB connector, which were used internally by multiple teams. Klarna then identified the opportunity to create a single maintained DynamoDB Flink connector and contribute it to the open-source community. This has initiated a collaboration within Klarna, led by the Klarna Flink experts and accompanied by Flink open-source contributors from AWS.

The main principle for designing the DynamoDB Flink connector was utilizing the different write capacity modes of DynamoDB. DynamoDB supports On-demand and Provisioned capacity modes, and each behaves differently when it comes to handling incoming throughput. On-demand mode automatically scales DynamoDB write capacity to adapt to the incoming load. Provisioned mode is more limiting, and throttles incoming traffic that exceeds the provisioned write capacity.
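
For illustration, the capacity mode is chosen when a table is created (or later updated). The following boto3 sketch shows both modes; the table and key names are hypothetical and unrelated to Klarna's setup.

import boto3

dynamodb = boto3.client("dynamodb")

# On-demand mode: DynamoDB scales write capacity automatically with the load.
dynamodb.create_table(
    TableName="decisions-on-demand",  # hypothetical table name
    AttributeDefinitions=[{"AttributeName": "decision_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "decision_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)

# Provisioned mode: writes beyond the provisioned capacity are throttled,
# which is why the connector exposes backpressure handling.
dynamodb.create_table(
    TableName="decisions-provisioned",  # hypothetical table name
    AttributeDefinitions=[{"AttributeName": "decision_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "decision_id", "KeyType": "HASH"}],
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 100},
)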

To accommodate both modes, the DynamoDB Flink connector was designed to allow concurrent writes to DynamoDB. The number of concurrent requests can be configured to match DynamoDB’s capacity mode. In addition, the DynamoDB Flink connector supports backpressure handling in case the DynamoDB write provisioning is low compared to the incoming load from the Apache Flink application.

At the time of writing, the DynamoDB Flink connector has been open sourced.

Conclusion

Klarna has successfully been running Kinesis Data Analytics for Apache Flink in production since October 2020. It provides several key benefits. The Klarna development team can focus on development, not on cluster and operational management. Their applications can be quickly modified and uploaded. The low latency properties of the service ensure a near-real-time experience for end-users, data consumers, and producers, which underpins the risk assessment and decision-making processes that support continuous traffic growth. At the same time, exactly-once processing in combination with Flink checkpoints and savepoints means that critical decision-making and legal data is not lost.

To learn more about Kinesis Data Analytics and getting started, refer to Using a Studio notebook with Kinesis Data Analytics for Apache Flink and More Kinesis Data Analytics Solutions on GitHub.


About the authors

Nir Tsruya is a Lead Engineer at Klarna. He leads two engineering teams focusing mainly on real-time data processing and analytics at large scale.

Ankit Gupta is a Senior Solutions Architect at Amazon Web Services based in Stockholm, Sweden, where he helps customers across the Nordics succeed in the cloud. He’s particularly passionate about building strong networking foundations in the cloud.

Daniel Arenhage is a Solutions Architect at Amazon Web Services based in Gothenburg, Sweden.

Joulica unifies real-time and historical customer experience analytics with Amazon QuickSight

Post Syndicated from Tony McCormack original https://aws.amazon.com/blogs/big-data/joulica-unifies-real-time-and-historical-customer-experience-analytics-with-amazon-quicksight/

This is a guest post by Tony McCormack from Joulica.

Joulica is an Ireland-based startup in the contact center industry. Our founders previously led contact center research and development for a global contact center technology provider, and we founded Joulica because we saw that the shift to the cloud and growing demand for data and analytics would transform the customer service industry. Our platform delivers real-time and historical analytics across the wide variety of data sources and customer interaction channels that are now common in most organizations.

It is our mission to make customer experience analytics accessible to all users in an organization. Our customers need to have instant access to the current state of ongoing customer engagements as well as trends and historical data patterns. They find it difficult to harness all of the data sources needed to achieve this, particularly in real time. To help solve this problem, we chose Amazon QuickSight, a cloud-native business intelligence tool that allows you to embed insightful analytics into any application with customized, interactive visuals and dashboards.

In this post, we discuss how QuickSight has allowed us to achieve our goal of unifying real-time and historical analytics for our customers.

Embedding QuickSight to provide a 360-degree view

At Joulica, our specialty is real-time analytics. Real-time analytics means that as soon as new data is available, analytics are updated, including any visualizations currently being viewed by end-users. The rich embedding capability of QuickSight allowed us to use QuickSight for historical analytics and our technology for real-time analytics, all in one place. Visual embedding with QuickSight has saved us more than two resource-years of research and development time, enabling us to serve customers sooner. We have also been able to integrate Joulica real-time analytics into QuickSight dashboards—so within QuickSight, our users can use both standard QuickSight visualizations and Joulica widgets, as shown in the following example.
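
As a rough illustration of the server-side part of that embedding flow (not Joulica's actual code), an embed URL for a registered QuickSight user can be generated with boto3; the account ID, user ARN, Region, and dashboard ID below are placeholders.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")  # assumed Region

response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="111122223333",  # placeholder account ID
    UserArn="arn:aws:quicksight:us-east-1:111122223333:user/default/analyst",  # placeholder
    SessionLifetimeInMinutes=60,
    ExperienceConfiguration={
        "Dashboard": {"InitialDashboardId": "dashboard-id"}  # placeholder dashboard ID
    },
)

embed_url = response["EmbedUrl"]  # embed this URL in an iframe in the application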

Our customer journey analytics go beyond typical contact center reporting, stitching together how customers interact with organizations across channels. By pushing our analytics to QuickSight, we allow users to dive deep into patterns as well as individual customer journeys.

Integration with Amazon Connect and other AWS services

One of the main reasons we chose QuickSight was its alignment with Amazon Connect and the extended set of contact center services that AWS provides. As the leading provider of analytics for Amazon Connect, we considered this essential. We are deeply integrated with the full capabilities of Amazon Connect, including Amazon Lex and Contact Lens. In the preceding example of a QuickSight dashboard, we see that as soon as a KPI is updated, the visualization automatically updates. Behind the scenes, Joulica ingests agent event streams, contact events, and a host of real-time data feeds from Amazon Connect, analyzes them, and pushes a stream of real-time analytics to these widgets. This enables our customers to have up-to-date information, in addition to the historical context, empowering them to improve contact center performance and make better decisions. This is all visualized within QuickSight.

We can also ingest data from Amazon Kinesis and Apache Kafka, allowing live analytics from additional data sources to be visualized.
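
As a simplified, hedged sketch of how such a live feed could be polled (production consumers typically use the Kinesis Client Library or a stream processing framework instead), the stream name below is a placeholder.

import json
import boto3

kinesis = boto3.client("kinesis")
stream_name = "agent-event-stream"  # placeholder stream name

# Read from the first shard, starting at the latest records.
shard_id = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream_name, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
for record in records:
    event = json.loads(record["Data"])  # each record is assumed to be a JSON event
    # ... update real-time KPIs and push the results to the embedded widgets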

Empowering our customers with insights in QuickSight

The key goals driving the need for analytics in the contact center industry are customer experience optimization and efficiency gains. This is relevant across all customer touchpoints, including digital channels, traditional contact center channels, and emerging areas such as social and smart devices. In addition to KPIs related to service levels and agent performance, organizations need access to customer feedback and sentiment analytics, as well as an explanation of how these relate to business performance such as sales and customer retention. With QuickSight, we can offer out-of-the-box analytics for all of these areas. The following example shows how easy it is to use the QuickSight visual embedding feature to extend one of our real-time visualizations to show a geographic breakdown of customer survey results.

Thanks to the QuickSight and AWS pay-as-you-go pricing model, we can provide real-time and historical embedded analytics to our customers automatically. We have saved significantly by choosing QuickSight, both in cost and development time, and the cost savings provided by QuickSight are passed on to our customers. Like AWS, our pricing is usage-based, meaning that our customers can scale as their operations grow.

We currently have customers in North America, the United Kingdom, the EU, and APAC. As we continue to grow, we plan to explore other QuickSight features like Amazon QuickSight Q as we pursue providing more ways for our customers to analyze, visualize, and interpret their contact center data.

To learn more about how you can embed customized data visuals and interactive dashboards into any application, visit Amazon QuickSight Embedded. To learn more about Joulica, please visit Joulica.io.


About the Author

Tony McCormack is the CEO and Co-founder of Joulica. Based in Galway, Ireland, he is focused on providing enterprise-grade reporting and analytics for Amazon Connect, Salesforce Service Cloud, and other platforms in the customer experience market. He has extensive experience in the contact center domain, with a passion for real-time analytics and their integration into end-user applications.

Selecting cost effective capacity reservations for your business-critical workloads on Amazon EC2

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/selecting-cost-effective-capacity-reservations-for-your-business-critical-workloads-on-amazon-ec2/

This blog post is written by Sarath Krishnan, Senior Solutions Architect and Navdeep Singh, Senior Customer Solutions Manager.

Amazon CTO Werner Vogels famously said, “everything fails all the time.” Designing your systems for failure is important for ensuring availability, scalability, fault tolerance and business continuity. Resilient systems scale with your business demand changes, prevent data loss, and allow for seamless recovery from failures. There are many strategies and architectural patterns to build resilient systems on AWS. Building resiliency often involves running duplicate workloads and maintaining backups and failover mechanisms. However, these additional resources may translate into higher costs. It is important to balance the cost of implementing resiliency measures against the potential cost of downtime and the associated risks to the organization.

In addition to these resilient architectural patterns, if your business-critical workloads are running on Amazon Elastic Compute Cloud (Amazon EC2) instances, it is imperative to understand the different EC2 capacity reservation options available in AWS. Capacity reservations ensure that you always have access to Amazon EC2 capacity when you need it. For instance, Multi-AZ deployment is one of the architectural patterns for building highly resilient systems on AWS. In a Multi-AZ deployment, you spread your workload across multiple Availability Zones (AZs) with an Auto Scaling group. In the unlikely event of an AZ failure, the Auto Scaling group will try to bring up your instance in another AZ. In a rare scenario, the other AZ may not have capacity for your specific instance type at that time, which is why capacity reservations are important for your crucial workloads.

While implementing capacity reservations, it is important to understand how to control costs for your capacity reservations. In this post, we describe different EC2 capacity reservation and cost savings options available at AWS.

Amazon EC2 Purchase Options

Before we dive into the capacity reservation options, it is important to understand the different EC2 instance purchase options available on AWS. The On-Demand purchase option enables you to pay by the second for the instances you launch. The Spot Instances purchase option allows you to request unused EC2 capacity at a steep discount. Savings Plans enable you to reduce cost through one- or three-year usage commitments.

Dedicated Hosts and Dedicated Instances allow you to run EC2 instances on single-tenant hardware. However, only On-Demand Capacity Reservations and zonal reservations can reserve capacity for your EC2 instances.

On-Demand Capacity Reservations Deep Dive

On-Demand Capacity Reservations enable you to reserve compute capacity for your Amazon EC2 instances in a specific AZ for any duration. On-Demand Capacity Reservations ensure On-Demand capacity allocation during capacity constraints without entering into a long-term commitment. With On-Demand Capacity Reservations, you pay the On-Demand price whether or not your instance is running. If your business needs capacity reservations only for a shorter duration, like a holiday season, or for a critical business event, such as a large streaming event held once a quarter, On-Demand Capacity Reservations are the right fit for your needs. However, if you need capacity reservations for your business-critical workloads for a longer period consistently, we recommend combining On-Demand Capacity Reservations with Savings Plans to achieve capacity reservations and cost savings.

Savings Plans

Savings Plans is a flexible pricing model that can help you reduce your bill by up to 72% compared to On-Demand prices, in exchange for a one- or three-year hourly spend commitment. AWS offers three types of Savings Plans: Compute Savings Plans, EC2 Instance Savings Plans, and Amazon SageMaker Savings Plans.

With EC2 Instance Savings Plans, you can make an hourly spend commitment for an instance family and Region (for example, M5 usage in N. Virginia) for a one- or three-year term. Savings are automatically applied to instances launched in the selected instance family and Region, irrespective of size, tenancy, and operating system. EC2 Instance Savings Plans also give you the flexibility to change your usage between instances within a family in that Region. For example, you can move from c5.xlarge running Windows to c5.2xlarge running Linux and automatically benefit from the Savings Plans prices. EC2 Instance Savings Plans offer the maximum discount of up to 72%.

Compute Savings Plans offer great flexibility, as you can change instance types, migrate workloads between Regions, or move workloads to AWS Fargate or AWS Lambda and automatically continue to pay the discounted Savings Plans price. If you are an EC2 customer today and planning to modernize your applications by leveraging AWS Fargate or AWS Lambda, evaluating Compute Savings Plans is recommended. This plan offers great flexibility so that your commercial agreements support your long-term changing architectural needs, and it offers cost savings of up to 66%. For example, with Compute Savings Plans, you can change from C4 to M5 instances, shift a workload from EU (Ireland) to EU (London), or move a workload from EC2 to Fargate or Lambda at any time and automatically continue to pay the Savings Plans price. Combining On-Demand Capacity Reservations with Compute Savings Plans gives you capacity reservations, significant discounts, and maximum flexibility.

We recommend utilizing Savings Plans for discounts due to their flexibility. However, some AWS customers might still have older Reserved Instances. If you have already purchased Reserved Instances and want to ensure capacity reservations, you can combine On-Demand Capacity Reservations with Reserved Instances to get both the capacity reservations and the discounts. As your Reserved Instances expire, we recommend signing up for Savings Plans, as they offer the same savings as Reserved Instances but with additional flexibility.

You may find Savings Plans pricing discount examples explained in the Savings Plans documentation.

Zonal Reservations

Zonal reservations offer reservation of capacity in a specific AZ. A zonal reservation requires a one- or three-year commitment, and the reservation applies to a pre-defined instance family. Zonal reservations provide less flexibility compared to Savings Plans: you do not have the flexibility to change the instance family and its size, and zonal reservations do not support queuing your purchase for a future date. We recommend considering Savings Plans and On-Demand Capacity Reservations over zonal Reserved Instances so that you get similar discounts with much better flexibility. If you are already on a zonal reservation, as your plan expires, we recommend you sign up for Savings Plans and On-Demand Capacity Reservations.

Working with Capacity Reservations and Savings Plans

You may provision capacity reservations using the AWS Management Console, the AWS Command Line Interface (CLI), or the API.

The Work with Capacity Reservations documentation explains the steps to provision On-Demand Capacity Reservations using the AWS console and CLI in detail. You may find the steps to purchase Savings Plans explained in the Savings Plans documentation.
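
For the API route, the following is a minimal boto3 sketch; the instance type, platform, Availability Zone, and count are placeholder values.

import boto3

ec2 = boto3.client("ec2")

# Reserve capacity for 10 m5.xlarge Linux instances in a specific AZ.
reservation = ec2.create_capacity_reservation(
    InstanceType="m5.xlarge",        # placeholder instance type
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="us-east-1a",   # placeholder AZ
    InstanceCount=10,
    InstanceMatchCriteria="open",    # matching instances use the reservation automatically
    EndDateType="unlimited",         # keep the reservation until explicitly cancelled
)

reservation_id = reservation["CapacityReservation"]["CapacityReservationId"]

# Cancel the reservation when it is no longer needed to stop incurring charges.
ec2.cancel_capacity_reservation(CapacityReservationId=reservation_id)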

Conclusion

In this post, we discussed different options for capacity reservations and cost control for your mission-critical workloads on EC2. For the most flexibility and value, we recommend using On-Demand Capacity Reservations with Savings Plans. If you have steady EC2 workloads that are not suitable candidates for modernization, EC2 Instance Savings Plans are recommended. If you are looking for more flexibility to change instance types, migrate workloads between Regions, or modernize your workloads by leveraging AWS Fargate or AWS Lambda, consider Compute Savings Plans. Zonal reservations are not the preferred capacity reservation approach due to their lack of flexibility. If you need capacity reservations for a short period of time, you may leverage the flexibility of On-Demand Capacity Reservations to book and cancel reservations at any time.

For implementation guidance, refer to the blog post Reserving EC2 Capacity across Availability Zones by utilizing On-Demand Capacity Reservations.

How Cargotec uses metadata replication to enable cross-account data sharing

Post Syndicated from Sumesh M R original https://aws.amazon.com/blogs/big-data/how-cargotec-uses-metadata-replication-to-enable-cross-account-data-sharing/

This is a guest blog post co-written with Sumesh M R from Cargotec and Tero Karttunen from Knowit Finland.

Cargotec (Nasdaq Helsinki: CGCBV) is a Finnish company that specializes in cargo handling solutions and services. The company is headquartered in Helsinki, Finland, and operates globally in over 100 countries. With its leading cargo handling solutions and services, Cargotec is a pioneer in its field. Through its unique position in ports, at sea, and on roads, it optimizes global cargo flows and creates sustainable customer value.

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization. For this, Cargotec built an Amazon Simple Storage Service (Amazon S3) data lake and cataloged the data assets in AWS Glue Data Catalog. They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed.

In this blog, we discuss the technical challenges faced by Cargotec in replicating their AWS Glue metadata across AWS accounts, and how they navigated these challenges successfully to enable cross-account data sharing.  By sharing their story, we hope to inspire readers facing similar challenges and provide insights into how our services can be customized to meet your specific needs.

Challenges

Like many customers, Cargotec’s data lake is distributed across multiple AWS accounts that are owned by different teams. Cargotec wanted to find a solution to share datasets across accounts and use Amazon Athena to query them. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views. Cargotec’s use cases also required them to create views that span tables and views across catalogs. Cargotec’s implementation covers three discrete AWS accounts, 25 databases, 150 tables, and 10 views.

Solution overview

Cargotec required a single catalog per account that contained metadata from their other AWS accounts. The solution that best fit their needs was to replicate metadata using an in-house version of the publicly available Metastore Migration utility. Cargotec extended the utility by changing the overall orchestration layer, adding an Amazon SNS notification and an AWS Lambda function. The approach was to programmatically copy and make available each catalog entity (databases, tables, and views) to all consumer accounts. This makes the tables or views local to the account where the query is being run, while the data still remains in its source S3 bucket.

Cargotec’s solution architecture

The following diagram summarizes the architecture and overall flow of events in Cargotec’s design.

Solution Architecture

Catalog entries from a source account are programmatically replicated to multiple target accounts using the following series of steps.

  1. An AWS Glue job (metadata exporter) runs daily on the source account. It reads the table and partition information from the source AWS Glue Data Catalog. Since the target account is used for analytical purposes and does not require real-time schema changes, the metadata exporter runs only once a day. Cargotec uses partition projection, which ensures that the new partitions are available in real-time.
  2. The job then writes the metadata to an S3 bucket in the same account. Please note that the solution doesn’t involve movement of the data across accounts. The target accounts read data from the source account S3 buckets. For guidance on setting up the right permissions, please see the Amazon Athena User Guide.
  3. After the metadata export has been completed, the AWS Glue job pushes a notification to an Amazon Simple Notification Service (Amazon SNS) topic. This message contains the S3 path to the latest metadata export. The SNS notification is Cargotec’s customization to the existing open-source utility.
  4. Every target account runs an AWS Lambda function that is notified when the source account SNS topic receives a push. In short, there are multiple subscriber Lambda functions (one per target account) for the source account SNS topics that get triggered when an export job is completed.
  5. Once triggered, the Lambda function then initiates an AWS Glue job (metadata importer) on the respective target account. The job receives as input the source account’s S3 path to the metadata that has been recently exported.
  6. Based on the path provided, the metadata importer reads the exported metadata from the source S3 bucket.
  7. The metadata importer now uses this information to create or update the corresponding catalog information in the target account.

All along the way, any errors are published to a separate SNS topic for logging and monitoring purposes. With this approach, Cargotec was able to create and consume views that span tables and views from multiple catalogs spread across different AWS accounts.
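
For illustration, the notification in step 3 can be published from the exporter job with a few lines of boto3; the topic ARN and message shape here are assumptions, not Cargotec's exact implementation.

import json
import boto3

sns = boto3.client("sns")

# Notify target accounts that a new metadata export is available.
sns.publish(
    TopicArn="arn:aws:sns:eu-west-1:111122223333:metadata-export-topic",  # placeholder ARN
    Message=json.dumps({
        "export_path": "s3://metadata-export-bucket/exports/2023-05-01/"  # placeholder S3 path
    }),
)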

Implementation

The core of the catalog replication utility is two AWS Glue scripts:

  • Metadata exporter – An AWS Glue job that reads the source data catalog and creates an export of the databases, tables, and partitions in an S3 bucket in the source account.
  • Metadata importer – An AWS Glue job that reads the export that was created by the metadata exporter and applies the metadata to target databases. This code is triggered by a Lambda function once files are written to S3. The job runs in the target account.

Metadata exporter

This section provides details on the AWS Glue job that exports the AWS Glue Data Catalog into an S3 location. The source code for the application is hosted on the AWS Glue GitHub repository. Though it may need to be customized to suit your needs, we will go over the core components of the code in this blog.

Metadata exporter inputs

The application takes a few job input parameters as described below:

  • --mode key accepts either to-s3 or to-jdbc. The latter is used when the code is moving the metadata directly into a JDBC Hive Metastore. In the case of Cargotec, since we are moving the metadata to files on S3, the value for --mode will remain to-s3.
  • --output-path accepts an S3 location to which the exported metadata should be written. The code creates subdirectories corresponding to databases, tables, and partitions.
  • --database-names accepts a semicolon-separated list of databases on the source catalog that need to be replicated to the target. An example invocation with these parameters is shown after the list.
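
A run of the exporter with these parameters could be started programmatically as in the following sketch; the job name, bucket, and database names are hypothetical.

import boto3

glue = boto3.client("glue")

# Kick off the metadata exporter with the parameters described above.
glue.start_job_run(
    JobName="metadata-exporter",  # hypothetical job name
    Arguments={
        "--mode": "to-s3",
        "--output-path": "s3://metadata-export-bucket/exports/2023-05-01/",  # hypothetical path
        "--database-names": "sales_db;telemetry_db",                          # hypothetical databases
    },
)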

Reading the catalog

The metadata about the databases, tables, and partitions is read from the AWS Glue Data Catalog:

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type='com.amazonaws.services.glue.connections.DataCatalogConnection',
    connection_options={
        'catalog.name': 'datacatalog',
        'catalog.database': database,
        'catalog.region': region
    })

The preceding code snippet reads the metadata into an AWS Glue DynamicFrame. The frame is then converted to a Spark DataFrame, which is filtered into individual DataFrames based on whether each record belongs to a database, table, or partition. A schema is attached to the DataFrame using one of the following:

DATACATALOG_DATABASE_SCHEMA = StructType([
        StructField('items', ArrayType(
            DATACATALOG_DATABASE_ITEM_SCHEMA, False),
                    True),
        StructField('type', StringType(), False)
    ])
DATACATALOG_TABLE_SCHEMA = StructType([
        StructField('database', StringType(), False),
        StructField('type', StringType(), False),
        StructField('items', ArrayType(DATACATALOG_TABLE_ITEM_SCHEMA, False), True)
    ])
DATACATALOG_PARTITION_SCHEMA = StructType([
        StructField('database', StringType(), False),
        StructField('table', StringType(), False),
        StructField('items', ArrayType(DATACATALOG_PARTITION_ITEM_SCHEMA, False), True),
        StructField('type', StringType(), False)
    ])

For details on the individual item schema, refer to the schema definition on GitHub.

Persisting the metadata

After converting to a DataFrame with the schema attached, the data is persisted to the S3 location specified by the --output-path parameter:

databases.write.format('json').mode('overwrite').save(output_path + 'databases')
tables.write.format('json').mode('overwrite').save(output_path + 'tables')
partitions.write.format('json').mode('overwrite').save(output_path + 'partitions')

Exploring the output

Navigate to the S3 bucket that contains the output location, and you should be able to see the output metadata in JSON format. An example export for a table would look like the following code snippet.

{
    "database": "default",
    "type": "table",
    "item": {
        "createTime": "1651241372000",
        "lastAccessTime": "0",
        "owner": "spark",
        "retention": 0,
        "name": "an_example_table",
        "tableType": "EXTERNAL_TABLE",
        "parameters": {
            "totalSize": "2734148",
            "EXTERNAL": "TRUE",
            "last_commit_time_sync": "20220429140907",
            "spark.sql.sources.schema.part.0": "{redacted_schema}",
            "numFiles": "1",
            "transient_lastDdlTime": "1651241371",
            "spark.sql.sources.schema.numParts": "1",
            "spark.sql.sources.provider": "hudi"
        },
        "partitionKeys": [],
        "storageDescriptor": {
            "inputFormat": "org.apache.hudi.hadoop.HoodieParquetInputFormat",
            "compressed": false,
            "storedAsSubDirectories": false,
            "location": "s3://redacted_bucket_name/table/an_example_table",
            "numberOfBuckets": -1,
            "outputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "bucketColumns": [],
            "columns": [{
                    "name": "_hoodie_commit_time",
                    "type": "string"
                },
                {
                    "name": "_hoodie_commit_seqno",
                    "type": "string"
                }
            ],
            "parameters": {},
            "serdeInfo": {
                "serializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
                "parameters": {
                    "hoodie.query.as.ro.table": "false",
                    "path": "s3://redacted_bucket_name/table/an_example_table",
                    "serialization.format": "1"
                }
            },
            "skewedInfo": {
                "skewedColumnNames": [],
                "skewedColumnValueLocationMaps": {},
                "skewedColumnValues": []
            },
            "sortColumns": []
        }
    }
}

Once the export job is complete, the output S3 path will be pushed to an SNS topic. A Lambda function at the target account processes this message and invokes the import AWS Glue job by passing the S3 import location.
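
A hedged sketch of such a Lambda handler is shown below; the job name and the message format are assumptions rather than Cargotec's exact code.

import json
import boto3

glue = boto3.client("glue")

def handler(event, context):
    # The SNS message carries the S3 path of the latest metadata export.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    export_path = message["export_path"]  # assumed message field

    # Start the metadata importer job in this (target) account.
    glue.start_job_run(
        JobName="metadata-importer",  # hypothetical job name
        Arguments={
            "--mode": "from-s3",
            "--database-input-path": export_path + "databases/",
            "--table-input-path": export_path + "tables/",
            "--partition-input-path": export_path + "partitions/",
        },
    )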

Metadata importer

The import job runs on the target account. The code for the job is available on GitHub. As with the exporter, you may need to customize it to suit your specific requirements, but the code as-is should work for most scenarios.

Metadata importer inputs

The inputs to the application are provided as job parameters. Below is a list of parameters that are used for the import process:

  • --mode key accepts either from-s3 or from-jdbc. The latter is used when migration is from a JDBC source to the AWS Glue Data Catalog. At Cargotec, the metadata is already written to Amazon S3, and hence the value for this key is always set to from-s3.
  • --region key accepts a valid AWS Region for the AWS Glue Catalog. The target Region is specified using this key.
  • --database-input-path key accepts the path to the file containing the database metadata. This is the output of the metadata exporter job.
  • --table-input-path key accepts the path to the file containing the table metadata. This is the output of the metadata exporter job.
  • --partition-input-path key accepts the path to the file containing the partition metadata. This is the output of the metadata exporter job.

Reading the metadata

The metadata, as previously discussed, is stored as files on Amazon S3. The files are read into individual Spark DataFrames with their respective schema information:

databases = sql_context.read.json(path=db_input_dir, schema=METASTORE_DATABASE_SCHEMA)
tables = sql_context.read.json(path=tbl_input_dir, schema=METASTORE_TABLE_SCHEMA)
partitions = sql_context.read.json(path=parts_input_dir, schema=METASTORE_PARTITION_SCHEMA)

Loading the catalog

Once the Spark DataFrames are read, they are converted to AWS Glue DynamicFrames and then loaded into the catalog, as shown in the following snippet.

glue_context.write_dynamic_frame.from_options(
        frame=dyf_databases, 
        connection_type='catalog',
        connection_options={
               'catalog.name': datacatalog_name, 
               'catalog.region': region
         }
)
glue_context.write_dynamic_frame.from_options(
        frame=dyf_tables, 
        connection_type='catalog',
        connection_options={
                'catalog.name': datacatalog_name, 
                'catalog.region': region
        }
)
glue_context.write_dynamic_frame.from_options(
        frame=dyf_partitions, 
        connection_type='catalog',
        connection_options={
                 'catalog.name': datacatalog_name, 
                 'catalog.region': region
         }
)

Once the job concludes, you can query the target AWS Glue catalog to ensure the tables from the source have been synced with the destination. To keep things simple and easy to manage, instead of implementing a mechanism to identify tables that change over time, Cargotec updates the catalog information of all databases or tables that are configured in the export job.

Considerations

Though the setup works effectively for Cargotec’s current business requirements, there are a few drawbacks to this approach, which are highlighted below:

  1. The solution involves code. Customizations were made to the existing open-source utility to be able to publish an SNS notification once an export is complete and a Lambda function to trigger the import process.
  2. The export process on the source account is a scheduled job. Hence there is no real-time sync between the source and target accounts. This was not a requirement for Cargotec’s business process.
  3. For tables that don’t use Athena partition projection, query results may be outdated until the new partitions are added to the metastore through MSCK REPAIR TABLE, ALTER TABLE ADD PARTITION, AWS Glue crawler, and so on.
  4. The current approach requires syncing all the tables across the source and target. If the requirement is to capture only the ones that changed instead of a scheduled daily export, the design needs to change and could benefit from the Amazon EventBridge integration with AWS Glue. An example implementation of using AWS Glue APIs to identify changes is shown in Identify source schema changes using AWS Glue.

Conclusion

In this blog post, we have explored a solution for cross-account sharing of data and tables that makes it possible for Cargotec to create views that combine data from multiple AWS accounts. We’re excited to share Cargotec’s success and believe the post has provided you with valuable insights and inspiration for your own projects.

We encourage you to explore our range of services and see how they can help you achieve your goals. Lastly, for more data and analytics blogs, feel free to bookmark the AWS Blogs.


About the Authors

Sumesh M R is a Full Stack Machine Learning Architect at Cargotec. He has several years of software engineering and ML background. Sumesh is an expert in Amazon SageMaker and other AWS ML and analytics services. He is passionate about data science and loves to explore the latest ML libraries and techniques. Before joining Cargotec, he worked as a Solution Architect at TCS. In his spare time, he loves to play cricket and badminton.

 Tero Karttunen is a Senior Cloud Architect at Knowit Finland. He advises clients on architecting and adopting Data Architectures that best serve their Data Analytics and Machine Learning needs. He has helped Cargotec in their data journey for more than two years. Outside of work, he enjoys running, winter sports, and role-playing games.

Arun A K is a Big Data Specialist Solutions Architect at AWS.  He works with customers to provide architectural guidance for running analytics solutions on AWS Glue, AWS Lake Formation, Amazon Athena, and Amazon EMR. In his free time, he likes to spend time with his friends and family.

How SumUp made digital analytics more accessible using AWS Glue

Post Syndicated from Mira Daniels original https://aws.amazon.com/blogs/big-data/how-sumup-made-digital-analytics-more-accessible-using-aws-glue/

This is a guest blog post by Mira Daniels and Sean Whitfield from SumUp.

SumUp is a leading global financial technology company driven by the purpose of leveling the playing field for small businesses. Founded in 2012, SumUp is the financial partner for more than 4 million small merchants in over 35 markets worldwide, helping them start, run and grow their business. Through its Super App, SumUp provides merchants with a free business account and card, an online store, and an invoicing solution – as well as in-person and remote payments seamlessly integrated with SumUp’s card terminals and point-of-sale registers. For more information, please visit sumup.co.uk.

As most organizations that have turned to Google Analytics (GA) as a digital analytics solution mature, they discover a more pressing need to integrate this data silo with the rest of their organization’s data to enable better analytics and the resulting product development and fraud detection (unless, of course, the rest of their data also resides in Google Cloud). In this post we showcase how we used AWS Glue to move siloed digital analytics data, with inconsistent arrival times, to Amazon S3 (our data lake) and our central data warehouse (DWH), Snowflake. AWS Glue gave us a cost-efficient option to migrate the data, and we further optimized storage cost by pruning cold data. What was essential to the solution development was a good understanding of the nature of the data, the source of it (export from GA), as well as the form and scope of the data useful to the data consumers.

Business context

At SumUp we use GA and Firebase as our digital analytics solutions and AWS as our main Cloud Provider. In order to mature our data marts, it became clear that we needed to provide Analysts and other data consumers with all tracked digital analytics data in our DWH as they depend on it for analyses, reporting, campaign evaluation, product development and A/B testing. We further use the Digital Analytics data for our reverse ETL pipelines that ingest merchant behavior data back into the Ad tools. As the SumUp merchants user journey only starts onsite (with a Sign up or product purchase), but extends to offline card reader transactions or usage of our products from within our app and web dashboard, it is important to combine Digital Analytics data with other (backend) data sources in our Data Lake or DWH. The Data Science teams also use this data for churn prediction and CLTV modeling.

Given that the only way to access all of the raw data is to export it to BigQuery first, data accessibility becomes challenging if BigQuery isn’t your DWH solution. What we needed was a data pipeline from BigQuery to our data lake and the DWH. The pipeline further needed to run based on the trigger of new data arriving in BigQuery; it could not run on a simple schedule because data would not arrive at the same time consistently.

We experimented with some no-code tools that allowed for the direct export of Google Analytics data (not available for Firebase) from BigQuery directly to Snowflake, but due to our growing data volume, this wasn’t financially scalable. Other no-code tools, even if they can move data to S3, did not meet our technical requirements. The solutions we experimented with did not give us the flexibility to monitor and scale resources per pipeline run and optimize the pipeline ourselves.

We had a solid business case to build our own internal digital analytics pipeline to not only reduce spending on our existing outsourced Google Analytics pipeline (that moved from BigQuery to Snowflake directly), but also to make GA and Firebase data available in both the Data Lake and DWH.

Technical challenges

Data source specifics: The data in BigQuery is the export of GA 360 data and Firebase Analytics data. It consists of full-day and intraday tables. BigQuery uses a columnar storage format that can efficiently query semi-structured data, which in the case of GA and Firebase data is stored as arrays of structs.

Full-day: Daily tables that do not change retroactively for GA data, but do for Firebase data.

Intraday: Until the daily (full-day) tables are created, the intraday tables are populated with new and corrected records.

Update interval: The intraday table is refreshed and overwritten at least three times a day. When the full-day table is created, the intraday table from that day is deleted.

Since we started exporting GA tracking data to BigQuery in 2015, the amount of data tracked and stored has grown 70x (logical bytes) and now exceeds 3 TB in total. Our solution needed to not only ingest new data but also backfill historical data from the last 7 years. Firebase data has been exported to BigQuery since 2019 and has grown 10x in size (logical bytes) over time.

Our major challenge with ingesting Firebase data to the DWH was the combination of its size (historically >35TB) with the arrival of late data upstream. Since BigQuery processes hits with timestamps up to 72 hours earlier for Firebase data, historical data must be updated for each daily run of the pipeline for the last 3 days. Consequently, this greatly increases compute and storage resources required for the pipeline.

The intraday source data usually arrives every 2-4 hours so real-time downstream pipelines are currently not needed with our GA and Firebase data export (into BigQuery) setup.

Querying the dataset (containing nested and repeated fields) in Snowflake presented an accessibility problem for our users, as it required unfriendly, verbose query syntax and greatly strained our compute resources. We used AWS Glue to run a UDF that transforms this nested data into a Snowflake object (key-value) data structure that is both more user-friendly and requires fewer compute resources to access.
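
As a simplified illustration of that idea (not SumUp's actual UDF), nested GA-style event parameters can be serialized with PySpark into a JSON string that Snowflake can then parse into a key-value object; the column names and paths below are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_json

spark = SparkSession.builder.appName("ga-unnest-sketch").getOrCreate()

# event_params is assumed to be an array of (key, value) structs as exported by GA/Firebase.
events = spark.read.parquet("s3://digital-analytics-export/events/")  # hypothetical path

# Serializing the nested column avoids forcing analysts to unnest arrays of structs
# in every Snowflake query; Snowflake can parse the JSON into an OBJECT/VARIANT column.
flattened = events.withColumn("event_params_json", to_json(col("event_params")))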

Regarding version control of the Glue script, our team wanted to contain all of the pipeline logic in the same repository for simplicity and ease of local development. Since we used Amazon Managed Workflows for Apache Airflow (MWAA) for orchestrating our platform pipelines, we wanted our DAG code to be close to the Glue script. This required us to add an initial first step in our pipeline to push the script to the respective bucket that AWS Glue is synced with.

Our ELT design pattern required that we overwrite or update data stored in S3 before loading it into Snowflake, which required us to use a Spark DataFrame.

In order to save on Snowflake storage costs, we decided to only keep hot data in a materialised table and have access to colder historical data through external tables.

Solution overview

We chose AWS Glue for the first step in the pipeline (moving data to S3) as it nicely integrates into our serverless AWS infrastructure, and PySpark makes it very flexible to script transformation steps and add partitions to the data storage in S3. It already provided a connector to BigQuery that was easy to configure, and the complete Glue job nicely abstracted the underlying system infrastructure.

We used the existing Google BigQuery Connector for Glue following parts of this blog post.

Ingestion

The code for the ingestion pipeline was set up so that it is easily extendable and reusable, splitting jobs and tasks into different methods and classes and building a framework for all (future) ingestions from BigQuery to Snowflake.

Initialise Glue Job: First, we push the Glue script to S3, as the boto3 client expects it to be there for job execution. We then check if the Glue job already exists. If it doesn’t exist, it is created; otherwise, it is updated with the most recent job parameters. We made sure to add appropriate tags for cost monitoring. A sketch of this create-or-update step follows.
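
Here is a hedged boto3 sketch of that create-or-update step; the job name, role, script location, and worker settings are illustrative rather than our production values.

import boto3

glue = boto3.client("glue")

job_name = "bigquery-to-s3-ingestion"  # hypothetical job name
job_spec = {
    "Role": "arn:aws:iam::111122223333:role/glue-ingestion-role",  # placeholder role ARN
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://glue-scripts-bucket/bigquery_to_s3.py",  # placeholder script path
        "PythonVersion": "3",
    },
    "GlueVersion": "3.0",
    "WorkerType": "G.1X",
    "NumberOfWorkers": 10,
    "Tags": {"team": "data-platform", "pipeline": "digital-analytics"},
}

try:
    glue.get_job(JobName=job_name)
    # Job exists: update it with the latest parameters (Tags are not part of JobUpdate).
    update_spec = {k: v for k, v in job_spec.items() if k != "Tags"}
    glue.update_job(JobName=job_name, JobUpdate=update_spec)
except glue.exceptions.EntityNotFoundException:
    # Job does not exist yet: create it.
    glue.create_job(Name=job_name, **job_spec)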

Pull From Google Cloud Platform: Based on the data source, GA or Firebase, we pick dynamic (or, for backfilling, hard-coded) days to be pulled from the source. We then check if the selected dates exist as sharded tables in BigQuery; if the data is late, we cannot pull it and will try again a few minutes later (using a generous retry schedule). If the dates we chose to ingest can be found in BigQuery, we run the Glue job.

Dump to S3: Some transformation steps are carried out within the Glue job before the data is moved into S3. The data is saved in day-partitioned folders and repartitioned into 300-500 MB files for better table query performance in Snowflake.
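
The sizing step could look roughly like the following inside the Glue job; the partition column, repartition count, and S3 path are illustrative.

# `events` is assumed to be the transformed Spark DataFrame for one ingestion run.
(
    events
    .repartition(8)                     # tuned so each output file lands around 300-500 MB
    .write
    .mode("overwrite")
    .partitionBy("event_date")          # day-partitioned folders in S3
    .parquet("s3://digital-analytics-lake/firebase/")  # hypothetical bucket and prefix
)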

Data Catalog: We also wanted to automate a Glue crawler to have metadata in a Data Catalog and be able to explore our files in S3 with Athena. With the help of the boto3 library, our pipeline contains a step that runs the crawler at the end of the pipeline (shown below). In order to simplify schema evolution, we moved the Data Catalog creation out of the Glue script.
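
Running the crawler at the end of the pipeline is a single boto3 call; the crawler name is hypothetical.

import boto3

glue = boto3.client("glue")

# Refresh the Data Catalog so new day partitions are immediately queryable with Athena.
glue.start_crawler(Name="digital-analytics-s3-crawler")  # hypothetical crawler name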

DWH: The solution for loading the data into Snowflake consists of tasks that are scheduled to access storage integrations with S3 and materialise the data in Snowflake tables. The jobs are batch jobs; we haven’t tested any streaming solutions because the data arrives batched in BigQuery. After loading new data into Snowflake, we use a table pruner function that deletes the days that fall outside of the retention period we defined for the table. This way we make sure to provide the data timeframe used by our stakeholders and avoid unnecessary storage costs.

In addition to the materialised tables in Snowflake, we use external tables to make more historical data available (at slower query speeds). These external tables are refreshed at the end of the data pipeline in Airflow using the Snowflake operator.

Development and Testing: Before the Glue job script was integrated into the task orchestration process and code, we tested the functionality of the job within Glue Studio using the Spark script editor and Jupyter notebooks. This was great for quick iteration over a new feature. We used boto3 to access AWS infrastructure services and to create, update, and run the Glue job.

Data quality is ensured on the one hand by deduplicating Firebase events and adding a new hash key within the Glue script, and on the other hand by comparing total row counts per day between Snowflake and BigQuery (the comparison is currently done in Tableau). We also get alerted if our Airflow pipeline or the Snowflake tasks fail.

In the future we will add Data Quality checks with DBT.

Conclusion

As SumUp expands its product suite through innovation and acquisition, so does its infrastructure. AWS Glue has proven to be a scalable solution to ingest and transform siloed data in different clouds using marketplace connectors and the flexibility provided by pyspark.

What surprised us was how easily Glue can be customized (from no-code to a highly customised script and execution setup) to fit our volatile source data (size and shape). In the future we can think of further customization of the Glue script in terms of transformation of the data (more useful unnesting) as well as Glue job resource optimization.

The consumers of Digital Analytics data appreciate being able to make use of the full source data schema in both Data Lake and DWH. The transparency around the process being managed by the central Data Platform team facilitates trust and reliance on the data. Through pipeline abstraction, we’re now able to provide data products in high demand to all our data consumers.

The pipeline framework we built can easily be extended if needed as its components were built separately and are only forming the final pipeline during the last step, the orchestration.

What we plan to test and optimize in the future is loading historical data in automated batch jobs, which depends on both API limits on the source side and compute resource orchestration on Glue’s side (we did not test automated batching in the past and manually chunked data into jobs for backfilling). Additionally, we will incorporate these pipeline features into our main ingestion framework, which would allow our stakeholders to define digital analytics pipelines themselves in a self-service manner.

If your organization faces similar challenges with digital analytics data silos, we recommend developing a proof of concept to migrate your data to S3 using the Google BigQuery Connector for Glue and this blog post. Experiment with the Glue job settings and PySpark script options to find the best match for the size and latency requirements of your data. For migrating to a DWH like Snowflake, leverage COPY INTO statements, as they are a cheaper alternative to Snowpipe for this volume of data. Once you have prototyped, develop a properly tested solution with Amazon Managed Workflows for Apache Airflow (MWAA) that includes version control for the infrastructure and the pipeline logic.


About the Authors

Mira Daniels (Data Engineer in Data Platform team), recently moved from Data Analytics to Data Engineering to make quality data more easily accessible for data consumers. She has been focusing on Digital Analytics and marketing data in the past.

Sean Whitfield (Senior Data Engineer in Data Platform team), a data enthusiast with a life science background who pursued his passion for data analysis into the realm of IT. His expertise lies in building robust data engineering and self-service tools. He also has a fervor for sharing his knowledge with others and mentoring aspiring data professionals.

BWH Hotels scales enterprise business intelligence adoption while reducing costs with Amazon QuickSight

Post Syndicated from Joseph W. Landucci original https://aws.amazon.com/blogs/big-data/bwh-hotels-scales-enterprise-business-intelligence-adoption-while-reducing-costs-with-amazon-quicksight/

This is a guest post by Joseph Landucci from BWH Hotels. In their own words, "BWH Hotels is a leading, global hospitality enterprise comprised of three hotel companies, including WorldHotels, Best Western Hotels & Resorts and SureStay Hotels. Our mission is to deliver trusted guest experiences, drive hotel success and foster a caring, inclusive culture that respects the environment. Technology plays a key role in meeting these standards."

My team supports analytics throughout BWH Hotels, including some affiliate and international organizations. Four years ago, we switched to Amazon QuickSight, a fully managed, cloud-native business intelligence (BI) service, and have not looked back. In March 2021, I wrote about how QuickSight had helped us cut costs by over 70%, while also improving our operations. Since then, we have moved fully onto Amazon Web Services (AWS) from IBM Cognos and Oracle, and improved our BI adoption while navigating the effects of the COVID-19 pandemic.

In this post, I discuss how we are using QuickSight today, how it helped us through the pandemic and how we are using it to build the future of our business.

Migrating to AWS and QuickSight

After a six-month proof of concept, we decided to move fully to AWS for our data and analytics needs. Our previous BI solution, IBM Cognos, was prohibitively expensive because we had to pay for a license for each user. After we made the decision to make QuickSight our BI solution, we were able to move our hotels from Cognos to QuickSight within 10 months. Our affiliate users made the migration four months later, followed by our corporate users.

Having used Cognos since 2008, this was not just a lift-and-shift migration. We determined what dashboards and data we needed to migrate from our previous solution to QuickSight. During the migration, we discovered that a significant portion of what was in Cognos wasn’t being used. With AWS Database Migration Service (AWS DMS), we were able to put our data into Amazon Redshift and then build dashboards with QuickSight.

Navigating the pandemic with consumption-based pricing

Shortly after we switched to AWS and QuickSight, the COVID-19 pandemic hit, which impacted the hospitality industry considerably. If we hadn’t made this migration, we would have been stuck with fixed licensing costs. Thanks to the AWS pay-as-you-go pricing model, we were able to only pay for what we needed when demand was lower in 2020. Then when travel increased again, we were able to seamlessly scale back up with demand.

This consumption-based pricing model was one of the reasons we chose QuickSight, and it turned out to be a game changer during an unanticipated event. We didn't have to handle any of the scaling down and back up ourselves.

Increased usage and functionality

Even as we have reduced costs, our enterprise BI adoption and usage have gone up almost tenfold. Our corporate users and hoteliers find QuickSight much more intuitive than Cognos. We have 24,000 registered users and 8,000 monthly active users within BWH Hotels.

With greater usage comes greater responsibility for governance and training. When we began the move to AWS, we had a weekly governance meeting, which we have now moved to biweekly. In this governance meeting, we determined how we were going to standardize data handling and set rules. These meetings also gave us the opportunity to provide training on the new tools we were using, like QuickSight. Finally, we also had learning sessions which team members were able to access virtually. Now, whenever a new team begins to use QuickSight, we have 3–4 years’ worth of training material to help them get up to speed. We have also found the training resources provided by QuickSight to be valuable.

Improving our business with QuickSight

The BWH Hotels Revenue Management team was an early adopter of QuickSight. We have around 20–30 specific revenue management dashboards used both by corporate and hoteliers worldwide. Our sales team has implemented QuickSight to perform recency, frequency and monetary (RFM) analysis to determine which customers it should prioritize. With these dashboards, our sales teams at both the corporate and property levels are able to score customer and corporate accounts so that they can optimize how they use their time. We have also begun to experiment with sentiment analysis based on call center data that is fed into Amazon Transcribe. Gathering a multitude of data and then visualizing it in QuickSight has made us a much nimbler organization.

We are also excited about the value we are able to provide to our hoteliers with QuickSight Paginated Reports for creating billing invoices. With this new feature, our hotel operators no longer need to go onto the credit card processing portal to access transactions. They can access the information they need right within QuickSight.

Building for the future

In the near future, we plan to embed QuickSight into our property management system (PMS), AutoClerk. This will provide BWH Hotels and our hoteliers with analytics within their PMS, saving them time and giving them easy access to the information they need to improve their businesses. By including analytics, AutoClerk will provide incredible value to customers.

By switching to QuickSight, BWH Hotels has provided more value to our hoteliers – increasing adoption, saving money and building for the future.

To learn more about how QuickSight can help your business with dashboards, reports and more, visit Amazon QuickSight.


About the Author
Joseph W. Landucci serves as Director of Technology Management at BWH Hotels where he has spent over eight years building and evolving their data management strategy. In this role he is responsible for the management and strategic direction of six product teams: Database, Revenue Management Systems, Data Services, Loyalty Applications, Loyalty Automation and Enterprise Analytics.

Trakstar unlocks new analytical opportunities for its HR customers with Amazon QuickSight

Post Syndicated from Brian Kasen original https://aws.amazon.com/blogs/big-data/trakstar-unlocks-new-analytical-opportunities-for-its-hr-customers-with-amazon-quicksight/

This is a guest post by Brian Kasen and Rebecca McAlpine from Trakstar, now a part of Mitratech.

Trakstar, now a part of Mitratech, is a human resources (HR) software company that serves customers from small businesses and educational institutions to large enterprises, globally. Trakstar supercharges employee performance around pivotal moments in talent development. Our team focuses on helping HR leaders make smarter decisions to attract, retain, and engage their workforce. Trakstar has been used by HR leaders for over 20 years, specifically in the areas of applicant tracking, performance management, and learning management.

In 2023, Trakstar joined Mitratech’s world-class portfolio of Human Resources and Compliance products. This includes complementary solutions for OFCCP compliance management; diversity, equity, and inclusion (DEI) strategy and recruiting (now with Circa); advanced background screening; I-9 compliance; workflow automation; policy management; and more.

In 2022, Trakstar launched what is now called Trakstar Insights, which unlocks new analytical insights for HR across the employee life cycle. It’s powered by Amazon QuickSight, a cloud-native business intelligence (BI) tool that enables embedded customized, interactive visuals and dashboards within the product experience.

In this post, we discuss how we use QuickSight to deliver powerful HR analytics to over 3,000 customers and 43,000 users while forecasting savings of 84% year-over-year as compared to our legacy reporting solutions.

The evolving HR landscape

Over the past few years, new realities have emerged, creating unfamiliar challenges for businesses and new opportunities for HR leaders. In 2020, new working arrangements were spawned by immediate necessity, with many companies shifting to fully remote or hybrid setups for the first time. As we continue to adapt to this environment, organizations have trouble finding talent amid record-level resignation rates and a tight labor market.

As companies look to combat these new challenges, we’ve seen the rise of the Chief People Officer because organizations now recognize people as their greatest asset. With our three products, Trakstar Hire, Trakstar Perform, and Trakstar Learn, HR leaders can use data to take an integrated approach and foster a better employee experience.

Choosing QuickSight to bring solutions to the new reality of work

To help HR leaders navigate the new challenges of our time and answer new questions, we decided to embed interactive dashboards directly into each Trakstar product focused on the growing area of people analytics. QuickSight allowed us to meet our customers’ needs quickly and played a key role in our overall analytics strategy. Because QuickSight is fully managed and serverless, we were able to focus on building value for customers and develop an embedded dashboard delivery solution to support all three products, rather than focusing on managing and optimizing infrastructure. QuickSight allowed us to focus on building dashboards that address key pain points for customers and rapidly innovate.
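As a general illustration of the embedding flow (simplified, with placeholder identifiers rather than Trakstar's actual setup), a backend can request a dashboard embed URL for a registered QuickSight user with the AWS SDK for Python (Boto3) and hand that URL to the front end:

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="123456789012",   # placeholder account ID
    SessionLifetimeInMinutes=60,
    UserArn="arn:aws:quicksight:us-east-1:123456789012:user/default/hr-analyst",  # placeholder user
    ExperienceConfiguration={
        "Dashboard": {"InitialDashboardId": "11111111-2222-3333-4444-555555555555"}  # placeholder dashboard ID
    },
)

# The returned URL is rendered inside the product, for example in an iframe.
embed_url = response["EmbedUrl"]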

In a 12-month timespan, we designed and launched five workforce dashboards to over 20,000 users, spanning hiring, performance, learning, and peer group benchmarking. During the same 12-month time period, in addition to our Trakstar Insights dashboard releases, we also migrated Trakstar Learn’s legacy reporting to QuickSight, which supports an additional 20,000 users.

Delighting our customers by embedding QuickSight

Our goal was to build something that would delight our customers by making a material difference in their daily lives. We set out to create something that went beyond a typical Minimum Viable Product and instead deliver a Minimum Lovable Product. By this, we mean delivering something that would make the most significant difference for customers in the shortest time possible.

We used QuickSight to build a curated dashboard that went beyond traditional dashboards of bar charts and tables. Our dashboards present retention trends, hiring trends, and learning outcomes supplemented with data narratives that empower our customers to easily interpret trends and make data-driven decisions.

In January 2022, we launched the Perform Insights dashboards. This enabled our customers to see their data in a way they had never seen before. With this dashboard, HR leaders can compare organizational strengths and weaknesses over time. The power of QuickSight lets our customers slice and dice the data in different ways. As shown in the following screenshot, customers can filter by Review Process Type or Group Type and then take actionable next steps based on data. They can see where top and bottom performers reside within teams and take steps to retain top performers and address lower-performing employees. These were net new analytics for our HR customers.

Our investment in building with QuickSight was quickly validated just days after launch. One of our sales reps was able to engage a lost opportunity and land a multi-year contract for double our typical average contract value. We followed up our first dashboard launch by expanding Trakstar Insights into our other products with the Learning Insights dashboard for Trakstar Learn and Hiring Insights for Trakstar Hire (see the following screenshots). These dashboards provided new lenses into how customers can look at their recruitment and training data.

Through our benchmarking dashboards, we empowered our customers so they can now compare their trends against other Trakstar customers in the same industry or of similar size, as shown in the following screenshots. These benchmarking dashboards can help our customers answer the “Am I normal?” question when it comes to talent acquisition and other areas.

Telling a story with people data for reporting

With the custom narrative visual type in QuickSight, our benchmarking dashboards offer dynamic, customer-specific interpretations of their trends and do the heavy lifting interpretation for them while providing action-oriented recommendations. The burden of manual spreadsheet creation, manual export, data manipulation, and analysis has been eliminated for our customers. They can now simply screenshot sections from the dashboards, drop them into a slide deck, and then speak confidently with their executive teams on what the trends mean for their organization, thereby saving tremendous time and effort and opening the door for new opportunities.

A Trakstar Hire customer shared with us, “You literally just changed my life. I typically spend hours creating slides, and this is the content—right here, ready to screenshot for my presentations!”

Building on our success with QuickSight

With the success of launching Trakstar Insights with QuickSight, we knew we could modernize the reporting functionality in Trakstar Learn by migrating to QuickSight from our legacy embedded BI vendor. Our legacy solution was antiquated and expensive. QuickSight brings a more cohesive and modern look to reporting at a significantly lower overall cost. With the session-based pricing model in QuickSight, we are projecting to save roughly 84% this year while offering customers a more powerful analytics experience.

Summary

Building with QuickSight has helped us thrive by delivering valuable HR solutions to our customers. We are excited to continue innovating with QuickSight to deliver even more value to our customers.

To learn more about how you can embed customized data visuals and interactive dashboards into any application, visit Amazon QuickSight Embedded.


About the authors:

Brian Kasen is the Director, Business Intelligence at Mitratech. He is passionate about helping HR leaders be more data-driven in their efforts to hire, retain, and engage their employees. Prior to Mitratech, Brian spent much of his career building analytic solutions across a range of industries, including higher education, restaurant, and software.

Rebecca McAlpine is the Senior Product Manager for Trakstar Insights at Mitratech. Her experience in HR tech has allowed her to work in various areas, including data analytics, business systems optimization, candidate experience, job application management, talent engagement strategy, training, and performance management.

Large-scale digital biomarker computation with AWS serverless services

Post Syndicated from Avni Patel original https://aws.amazon.com/blogs/architecture/large-scale-digital-biomarker-computation-with-aws-serverless-services/

Digital biomarkers are quantitative, objective measures of physiological and behavioral data. They are collected and measured using digital devices that better represent free-living activity in contrast to a highly structured in-clinic setting. This approach generates large amounts of data that requires processing.

Digital biomarker data is typically saved in different formats, and various sources require different (and often multiple) processing steps. In 2020, the Digital Sciences and Translational Imaging group at Pfizer developed a specialized pipeline using AWS services for analyzing incoming sensor data from wrist devices for sleep and scratch activity.

In time, the original pipeline needed updating to compute digital biomarkers at scale while maintaining reproducibility and data provenance. The key consideration in redesigning the pipeline was flexibility, so that the platform can:

  • Handle various incoming data sources and different algorithms
  • Handle distinct sets of algorithm parameters

These goals were accomplished with a framework for handling file analysis that uses a two-part approach:

  1. A Python package using a custom common architecture
  2. An AWS-based pipeline to handle processing mapping and computing distribution

This blog post introduces a data processing pipeline that combines this custom Python package with AWS services. The AWS architecture maintains data provenance while enabling fast, efficient, and scalable data processing. These large data sets would otherwise consume significant local resources.

Let’s explore each part of this framework in detail.

Custom Python package for data processing

Computation of digital biomarkers requires specific algorithms. Typically, these algorithms—for example, for gait, activity, and more—lived in separate code repositories, with little thought given to interaction and reproducibility.

Pfizer addressed this issue by creating the SciKit-Digital-Health Python package (SKDH) to implement these algorithms and data ingestion methods. Because the SKDH framework is designed with reproducibility and integration in mind, modules can be chained together in a pipeline structure and run sequentially. Pipelines can be saved and loaded later, a key feature enabling dynamic pipeline selection and reproducibility.
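To illustrate the chained-pipeline idea, here is a hedged sketch along the lines of the SKDH documentation; the module and class names below are assumptions for illustration and may differ from the actual SKDH API:

import skdh

# Assumed module/class names; consult the SKDH documentation for the real API.
pipeline = skdh.Pipeline()
pipeline.add(skdh.io.ReadCwa())                      # ingest raw wrist-device data
pipeline.add(skdh.preprocessing.CalibrateAccelerometer())
pipeline.add(skdh.gait.Gait())                       # compute gait digital biomarkers

# Pipelines can be persisted and reloaded later, which is what enables
# dynamic pipeline selection and reproducibility.
pipeline.save("gait_pipeline.json")

reloaded = skdh.Pipeline()
reloaded.load("gait_pipeline.json")
results = reloaded.run(file="subject01_wrist.cwa")   # placeholder input file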

AWS framework for file analysis

While SKDH is the core that computes the digital biomarkers, it requires an operational framework for handling all the data files. To that end, Pfizer and AWS designed the following architecture to handle file analysis on a study-by-study basis, as shown in Figure 1:

Digital biomarker catalog and file processing workflow

Figure 1. Digital biomarker catalog and file processing workflow

The overall platform can be broken into two components: the Catalog and File Processing AWS Step Functions workflows. Let’s explore each of these components, along with the necessary configuration file to process a study.

Catalog Step Functions workflow

The Catalog workflow searches study Amazon Simple Storage Service (Amazon S3) buckets for files that need processing. These files are defined in a study configuration document. The workflow consists of the following steps:

  1. The Catalog Step Function is manually triggered.
  2. If the study configuration file is found and the S3 Study buckets specified in the configuration file exist, search the S3 Study buckets for any files matching those enumerated in the study configuration file.
  3. If the configuration file or S3 Study bucket are not found, send a message to an Amazon Simple Queue Service (Amazon SQS) queue.
  4. For any cataloged files, start a new Processing workflow for each of these files.
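A simplified Python sketch of this cataloging logic (the real workflow is implemented as Step Functions states rather than a single script; the bucket name, file patterns, and state machine ARN are placeholders):

import fnmatch
import json
import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

STUDY_BUCKET = "my-study-bucket"                                    # placeholder
FILE_PATTERNS = ["raw/*/wrist_*.bin", "raw/*/metadata_*.csv"]       # placeholder patterns
PROCESSING_STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:FileProcessing"

def catalog_study_files() -> None:
    """Find files matching the study configuration and start one processing run per file."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=STUDY_BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if any(fnmatch.fnmatch(key, pattern) for pattern in FILE_PATTERNS):
                sfn.start_execution(
                    stateMachineArn=PROCESSING_STATE_MACHINE_ARN,
                    input=json.dumps({"bucket": STUDY_BUCKET, "key": key}),
                )

catalog_study_files()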

File processing Step Functions workflow

The Processing workflow verifies processing requirements and processes both study data files and metadata files through the following steps:

  1. From the study configuration file, the workflow gets relevant details for the file. This includes metadata processing requirements as needed, and the specified set of SKDH algorithms to run.
  2. If the file contains metadata, a special AWS Lambda function cleans the data and saves it to the metadata Amazon DynamoDB table. Otherwise, check whether all requirements are met, including processing requirements (for example, height for gait algorithms) and logistics requirements (for example, file name information extraction functions).
  3. For missing requirements, the workflow sends a message to an Amazon SQS queue and waits for them to be met.
  4. If all requirements are met, it will start a batch job for the file.
  5. The batch job runs on AWS Fargate with a custom Docker image with a specified version of SKDH. This allows for easy versioning and reproducibility.
  6. The Docker image loads the specified SKDH pipeline and computes the digital biomarkers. Any results generated are then uploaded back to the S3 Study bucket at the end of processing.
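Step 4's batch submission might look roughly like the following with boto3; the job queue, job definition, and SKDH version tag are placeholders rather than the actual pipeline configuration:

import re
import boto3

batch = boto3.client("batch")

def submit_processing_job(bucket: str, key: str, skdh_version: str = "0.11.0") -> str:
    """Submit an AWS Batch job (running on Fargate) that processes one study file."""
    job_name = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:128]      # Batch job names allow a limited charset
    response = batch.submit_job(
        jobName=job_name,
        jobQueue="digital-biomarker-fargate-queue",            # placeholder queue
        jobDefinition=f"skdh-processing:{skdh_version}",       # placeholder job definition
        containerOverrides={
            "environment": [
                {"name": "STUDY_BUCKET", "value": bucket},
                {"name": "STUDY_KEY", "value": key},
            ]
        },
    )
    return response["jobId"]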

Study configuration file

The configuration file defines everything necessary for processing a study. This includes which S3 bucket(s) the study data is located in, under which prefixes, and the list of file patterns of files to be processed. Figure 2 shows a sample configuration file:

Sample study configuration file

Figure 2. Sample study configuration file

Building a stack with AWS CloudFormation

AWS CloudFormation provisions and configures all of the resources for the pipeline, allowing AWS services to be easily added and removed. CloudFormation integration with GitHub makes it easy to keep track of service changes. AWS Step Functions workflows can also be defined in the CloudFormation template files, which captures the AWS specification in essentially one master file.

To iterate and update the stack, a combination of Python, the AWS SDK for Python (Boto3), and AWS Command Line Interface (AWS CLI) tools are used. Python allows for a more intricate command line script with positional and keyword arguments depending on the process. For example, updating the pipeline is a simple command, where <env> is the dev/stage/prod environment that needs to be updated:

Updating the pipeline with a simple command

Figure 3. Updating the pipeline with a simple command

Meanwhile, the Python function would be something like the following:

Sample Python function

Figure 4. Sample Python function

While it might seem unconventional to use both the SDK for Python and a call to the AWS CLI from Python in the same script, the AWS CLI has some useful features that make the CloudFormation packaging and stack creation much easier. The end of the example file in Figure 3 pertains to the main utility in using the Python argparse package, which provides an easy way to write a user-friendly CLI.
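Because the figures above are screenshots, here is a hedged sketch of what such an update script could look like, combining argparse with calls out to the AWS CLI for packaging and deployment; the stack naming scheme, template file, and artifact bucket are placeholders, not the actual configuration:

import argparse
import subprocess

def update_stack(env: str) -> None:
    """Package and deploy the CloudFormation stack for the given environment."""
    stack_name = f"digital-biomarker-pipeline-{env}"          # placeholder naming scheme
    packaged_template = f"packaged-{env}.yaml"
    subprocess.run(
        ["aws", "cloudformation", "package",
         "--template-file", "template.yaml",
         "--s3-bucket", f"my-cfn-artifacts-{env}",             # placeholder bucket
         "--output-template-file", packaged_template],
        check=True,
    )
    subprocess.run(
        ["aws", "cloudformation", "deploy",
         "--template-file", packaged_template,
         "--stack-name", stack_name,
         "--capabilities", "CAPABILITY_NAMED_IAM"],
        check=True,
    )

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Update the pipeline stack")
    parser.add_argument("env", choices=["dev", "stage", "prod"])
    args = parser.parse_args()
    update_stack(args.env)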

This could also be achieved using the AWS Cloud Development Kit (AWS CDK), an open source software development framework that allows you to define infrastructure in familiar programming languages. AWS CDK uses constructs that can be customized and reused.

Conclusion

This blog post shared a custom data processing pipeline using AWS services that, through intentional integration with a Python package, can run arbitrary algorithms to process data. By using AWS as the framework for this pipeline, data provenance is maintained while enabling fast, efficient, and scalable processing of large amounts of data that would typically consume significant company resources.

Automate alerting and reporting for AWS Glue job resource usage

Post Syndicated from Michael Hamilton original https://aws.amazon.com/blogs/big-data/automate-alerting-and-reporting-for-aws-glue-job-resource-usage/

Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small and large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset. Many organizations today are using AWS Glue to build ETL pipelines that bring data from disparate sources and store the data in repositories like a data lake, database, or data warehouse for further consumption. These organizations are looking for ways they can reduce cost across their IT environments and still be operationally performant and efficient.

Picture a scenario where you, the VP of Data and Analytics, are in charge of your data and analytics environments and workloads running on AWS, where you manage a team of data engineers and analysts. This team is allowed to create AWS Glue for Spark jobs in development, test, and production environments. During testing, one of the jobs wasn't configured to automatically scale its compute resources, resulting in jobs timing out and costing the organization more than anticipated. The next steps usually include completing an analysis of the jobs, looking at cost reports to see which account generated the spike in usage, going through logs to see what happened with the job, and so on. After the ETL job has been corrected, you may want to implement monitoring and set standard alert thresholds for your AWS Glue environment.

This post will help organizations proactively monitor and cost optimize their AWS Glue environments by providing an easier path for teams to measure efficiency of their ETL jobs and align configuration details according to organizational requirements. Included is a solution you will be able to deploy that will notify your team via email about any Glue job that has been configured incorrectly. Additionally, a weekly report is generated and sent via email that aggregates resource usage and provides cost estimates per job.

AWS Glue cost considerations

AWS Glue for Apache Spark jobs are provisioned with a number of workers and a worker type. These jobs can be either G.1X, G.2X, G.4X, G.8X or Z.2X (Ray) worker types that map to data processing units (DPUs). DPUs include a certain amount of CPU, memory, and disk space. The following table contains more details.

Worker Type   DPUs   vCPUs   Memory (GB)   Disk (GB)
G.1X          1      4       16            64
G.2X          2      8       32            128
G.4X          4      16      64            256
G.8X          8      32      128           512
Z.2X          2      8       32            128

For example, if a job is provisioned with 10 workers of the G.1X worker type, the job has access to 40 vCPUs and 160 GB of memory to process data; with G.2X, those figures double. Over-provisioning workers can lead to increased cost because not all workers are utilized efficiently.

In April 2022, Auto Scaling for AWS Glue was released for AWS Glue version 3.0 and later, which includes AWS Glue for Apache Spark and streaming jobs. Enabling auto scaling on your AWS Glue for Apache Spark jobs allows you to only allocate workers as needed, up to the worker maximum you specify. We recommend enabling auto scaling for your AWS Glue 3.0 and 4.0 jobs because this feature helps reduce cost and optimize your ETL jobs.
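Auto scaling is enabled per job by setting the --enable-auto-scaling job parameter to true, for example with boto3; the job name, role, and script location below are placeholders:

import boto3

glue = boto3.client("glue")

# Placeholder job definition; the important parts are the --enable-auto-scaling
# argument and the worker maximum set via NumberOfWorkers.
glue.update_job(
    JobName="my-etl-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
        "Command": {"Name": "glueetl", "ScriptLocation": "s3://my-glue-scripts/etl.py"},
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,          # upper bound when auto scaling is enabled
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    },
)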

Amazon CloudWatch metrics are also a great way to monitor your AWS Glue environment by creating alarms for certain metrics like average CPU or memory usage. To learn more about how to use CloudWatch metrics with AWS Glue, refer to Monitoring AWS Glue using Amazon CloudWatch metrics.

The following solution provides a simple way to set AWS Glue worker and job duration thresholds, configure monitoring, and receive email notifications on how your AWS Glue environment is performing. If a Glue job run exceeds the worker or job duration thresholds, the solution notifies you after the run has completed, failed, or timed out.

Solution overview

The following diagram illustrates the solution architecture.

Solution Architecture

When you deploy this application via AWS Serverless Application Model (AWS SAM), it will ask what AWS Glue worker and job duration thresholds you would like to set to monitor the AWS Glue for Apache Spark and AWS Glue for Ray jobs running in that account. The solution will use these values as the decision criteria when invoked. The following is a breakdown of each step in the architecture:

  1. Any AWS Glue for Apache Spark job that succeeds, fails, stops, or times out is sent to Amazon EventBridge.
  2. EventBridge picks up the event from AWS Glue and triggers an AWS Lambda function.
  3. The Lambda function processes the event and determines if the data and analytics team should be notified about the particular job run. The function performs the following tasks:
    1. The function sends an email using Amazon Simple Notification Service (Amazon SNS) if needed.
      • If the AWS Glue job succeeded or was stopped without going over the worker or job duration thresholds, or is tagged to not be monitored, no alerts or notifications are sent.
      • If the job succeeded but ran with a worker or job duration thresholds higher than allowed, or the job either failed or timed out, Amazon SNS sends a notification to the designated email with information about the AWS Glue job, run ID, and reason for alerting, along with a link to the specific run ID on the AWS Glue console.
    2. The function logs the job run information to Amazon DynamoDB for a weekly aggregated report delivered via email. The DynamoDB table has a Time to Live (TTL) of 7 days, which keeps storage to a minimum.
  4. Once a week, the data within DynamoDB is aggregated by a separate Lambda function with meaningful information like longest-running jobs, number of retries, failures, timeouts, cost analysis, and more.
  5. Amazon Simple Email Service (Amazon SES) is used to deliver the report because it can be formatted better than with Amazon SNS. The email is formatted as HTML output that provides tables for the aggregated job run data.
  6. The data and analytics team is notified about the ongoing job runs through Amazon SNS, and they receive the weekly aggregation report through Amazon SES.
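For reference, the EventBridge rule in step 1 matches Glue job runs that reach a terminal state. A hedged boto3 sketch of creating such a rule and pointing it at the Lambda function follows; the actual rules are created by the AWS SAM template, and the names and ARNs below are placeholders:

import json
import boto3

events = boto3.client("events")

# Match AWS Glue job runs that reach a terminal state.
event_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"]},
}

events.put_rule(
    Name="glue-job-state-change",                 # placeholder rule name
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

events.put_targets(
    Rule="glue-job-state-change",
    Targets=[{
        "Id": "glue-job-monitor-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:glue-job-monitor",  # placeholder ARN
    }],
)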

Note that AWS Glue Python shell and streaming ETL jobs are not supported because they’re not in scope of this solution.

Prerequisites

You must have the following prerequisites:

  • An AWS account to deploy the solution to
  • Proper AWS Identity and Access Management (IAM) privileges to create the resources
  • The AWS SAM CLI to build and deploy the solution into your AWS environment

Deploy the solution

This AWS SAM application provisions the following resources:

  • Two EventBridge rules
  • Two Lambda functions
  • An SNS topic and subscription
  • A DynamoDB table
  • An SES subscription
  • The required IAM roles and policies

To deploy the AWS SAM application, complete the following steps:

Clone the aws-samples GitHub repository:

git clone https://github.com/aws-samples/aws-glue-job-tracker.git

Deploy the AWS SAM application:

cd aws-glue-job-tracker
sam deploy --guided

sam deploy configuration

Provide the following parameters:

  • GlueJobWorkerThreshold – Enter the maximum number of workers you want an AWS Glue job to be able to run with before sending a threshold alert. The default is 10. An alert will be sent if a Glue job runs with more workers than specified.
  • GlueJobDurationThreshold – Enter the maximum duration in minutes you want an AWS Glue job to run before sending a threshold alert. The default is 480 minutes (8 hours). An alert will be sent if a Glue job runs with a longer duration than specified.
  • GlueJobNotifications – Enter an email or distribution list of those who need to be notified through Amazon SNS and Amazon SES. You can go to the SNS topic after the deployment is complete and add emails as needed.

To receive emails from Amazon SNS and Amazon SES, you must confirm your subscriptions. After the stack is deployed, check your email that was specified in the template and confirm by choosing the link in each message. When the application is successfully provisioned, it will begin monitoring your AWS Glue for Apache Spark job environment. The next time a job fails, times out, or exceeds a specified threshold, you will receive an email via Amazon SNS. For example, the following screenshot shows an SNS message about a job that succeeded but had a job duration threshold violation.

You might have jobs that need to run at a higher worker or job duration threshold, and you don’t want the solution to evaluate them. You can simply tag that job with the key/value of remediate and false. The step function will still be invoked, but will use the PASS state when it recognizes the tag. For more information on job tagging, refer to AWS tags in AWS Glue.

Adding tags to glue job configuration

Configure weekly reporting

As mentioned previously, when an AWS Glue for Apache Spark job succeeds, fails, times out, or is stopped, EventBridge forwards this event to Lambda, where it logs specific information about each job run. Once a week, a separate Lambda function queries DynamoDB and aggregates your job runs to provide meaningful insights and recommendations about your AWS Glue for Apache Spark environment. This report is sent via email with a tabular structure as shown in the following screenshot. It’s meant for top-level visibility so you’re able to see your longest job runs over time, jobs that have had many retries, failures, and more. It also provides an overall cost calculation as an estimate of what each AWS Glue job will cost for that week. It should not be used as a guaranteed cost. If you would like to see exact cost per job, the AWS Cost and Usage Report is the best resource to use. The following screenshot shows one table (of five total) from the AWS Glue report function.

weekly report

Clean up

If you don’t want to run the solution anymore, delete the AWS SAM application for each account that it was provisioned in. To delete your AWS SAM stack, run the following command from your project directory:

sam delete

Conclusion

In this post, we discussed how you can monitor and cost-optimize your AWS Glue job configurations to comply with organizational standards and policy. This method can provide cost controls over AWS Glue jobs across your organization. Some other ways to help control the costs of your AWS Glue for Apache Spark jobs include the newly released AWS Glue Flex jobs and Auto Scaling. We also provided an AWS SAM application as a solution to deploy into your accounts. We encourage you to review the resources provided in this post to continue learning about AWS Glue. To learn more about monitoring and optimizing for cost using AWS Glue, please visit this recent blog. It goes in depth on all of the cost optimization options and includes a template that builds a CloudWatch dashboard for you with metrics about all of your Glue job runs.


About the authors

Michael Hamilton is a Sr Analytics Solutions Architect focusing on helping enterprise customers in the south east modernize and simplify their analytics workloads on AWS. He enjoys mountain biking and spending time with his wife and three children when not working.

Angus Ferguson is a Solutions Architect at AWS who is passionate about meeting customers across the world, helping them solve their technical challenges. Angus specializes in Data & Analytics with a focus on customers in the financial services industry.

Defontana provides business administration solutions to Latin American customers using Amazon QuickSight

Post Syndicated from Cynthia Valeriano original https://aws.amazon.com/blogs/big-data/defontana-provides-business-administration-solutions-to-latin-american-customers-using-amazon-quicksight/

This is a guest post by Cynthia Valeriano, Jaime Olivares, and Guillermo Puelles from DeFontana.

Defontana develops fully cloud-based business applications for the administration and management of companies. Based in Santiago, Chile, with operations in Peru, Mexico, and most recently Colombia, we have built our main product, a 100% cloud-based enterprise resource planning (ERP) system that has been providing value to our business customers for about 20 years. In addition to our core ERP product, we have developed integration modules for ecommerce, banks, financial and regulatory institutions, digital signatures, business reports, and many other solutions. Our goal is to continue building solutions for customers and make Defontana the best business administration tool in Chile and Latin America.

Most of our customers are small and medium businesses (SMBs) who need to optimize their resources. Our ERP system helps customers manage cash flow, time, human resources (HR), and other resources. As we were exploring how to continue innovating for our customers and looking for an embedded analytics solution, we chose Amazon QuickSight, which allows us to seamlessly integrate data-driven experiences into our web application.

In this post, we discuss how QuickSight has sped up our development time, enabled us to provide more value to our customers, and even improved our own internal operations.

Saving development time by embedding QuickSight

We built our ERP service as a web application from the beginning. Ten years ago, this was a big differentiator for us, but to continue to serve the SMBs that trust us for information on a daily basis, we wanted to offer even more advanced analytics. By embedding QuickSight into our web application, we have been able to provide business intelligence (BI) functionalities to customers two or three times faster than we would have if we had opted for libraries for generating HTML reports. Thanks to our embedded QuickSight solution, we are able to focus more of our energy on analyzing the requirements and functionalities that we want to offer our customers in each BI report.

The following screenshots show our ERP service, accessed on a web browser, with insights and rich visualizations by QuickSight.

We enjoy using QuickSight because of how well it integrates with other AWS services. Our data is stored in a legacy relational database management system (RDBMS), Amazon Aurora, and Amazon DynamoDB. We are in the process of moving away from that legacy RDBMS to PostgreSQL through a Babelfish for Aurora PostgreSQL project. This will allow us to reduce costs while also being able to use a multi-Region database with disaster recovery in the future, which would have been too expensive with the legacy RDBMS. To seamlessly transfer data from these databases to Amazon Simple Storage Service (Amazon S3), we use AWS Database Migration Service (AWS DMS). Then, AWS Glue allows us to generate several extract, transform, and load (ETL) processes to prepare the data in Amazon S3 to be used in QuickSight. Finally, we use Amazon Athena to generate views to be used as base information in QuickSight.

Providing essential insights for SMB customers

QuickSight simplifies the generation of dashboards. We have made several out-of-the-box dashboards in QuickSight for our customers that they can use directly on our web app right after they sign up. These dashboards provide insights on sales, accounting, cash flow, financial information, and customer clusters based on their data in our ERP service. These free-to-use reports can be used by all customers in the system. We also have dashboards that can be activated by any user of the system for a trial period. Since we launched the add-on dashboards, more than 300 companies have activated them, with over 100 of them choosing to continue using them after the free trial.

Besides generic reports, we have created several tailor-made dashboards according to the specific requirements of each customer. These are managed through a customer-focused development process by our engineering team according to the specifications of each customer. With this option, our customers can get reports on accounts payable, accounts receivable, supply information (purchase order flow, receipts and invoices sent by suppliers), inventory details, and more. We have more than 50 customers who have worked with us on tailor-made dashboards. With the broad range of functionalities within QuickSight, we can offer many data visualization options to our customers.

Empowering our own operations

Beyond using QuickSight to serve our customers, we also use QuickSight for our own BI reporting. So far, we have generated more than 80 dashboards to analyze different business flows. For example, we monitor daily sales in specific services, accounting, software as a service (SaaS) metrics, and the operation of our customers. We do all of this from within our own web application, with the power of QuickSight, giving us the opportunity to experience the interface just like our customers do. In 2023, one of our top goals is to provide a 360-degree view of Defontana using QuickSight.

Conclusion

QuickSight has enabled us to seamlessly embed analytics into our ERP service, providing valuable insights to our SMB customers. We have been able to cut costs and continue to grow throughout Latin America. We plan to use QuickSight even more within our own organization, making us more data-driven. QuickSight will empower us to democratize the information that our own employees receive, establish better processes, and create more tools to analyze customer information for behavioral patterns, which we can use to better meet our customers’ needs.

To learn more about how you can embed customized data visuals and interactive dashboards into any application, visit Amazon QuickSight Embedded.


About the authors

Cynthia Valeriano is a Business Intelligence Developer at Defontana, with skills focused on data analysis and visualization. With 3 years of experience in administrative areas and 2 years of experience in business intelligence projects, she has been in charge of implementing data migration and transformation tasks with various AWS tools, such as AWS DMS and AWS Glue, in addition to generating multiple dashboards in Amazon QuickSight.

Jaime Olivares is a Senior Software Developer at Defontana, with 6 years of experience in the development of various technologies focused on the analysis of solutions and customer requirements. He has experience with various AWS services, including product development through QuickSight for the analysis of business and accounting data.

Guillermo Puelles is a Technical Manager of the “Appia” Integrations team at Defontana, with 9 years of experience in software development and 5 years working with AWS tools. Responsible for planning and managing various projects for the implementation of BI solutions through QuickSight and other AWS services.

Reserving EC2 Capacity across Availability Zones by utilizing On Demand Capacity Reservations (ODCRs)

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/reserving-ec2-capacity-across-availability-zones-by-utilizing-on-demand-capacity-reservations-odcrs/

This post is written by Johan Hedlund, Senior Solutions Architect, Enterprise PUMA.

Many customers have successfully migrated business critical legacy workloads to AWS, utilizing services such as Amazon Elastic Compute Cloud (Amazon EC2) and Auto Scaling groups (ASGs), as well as multiple Availability Zones (AZs) and Regions for business continuity and high availability.

These critical applications require increased levels of availability to meet strict business Service Level Agreements (SLAs), even in extreme scenarios such as when EC2 functionality is impaired (see Advanced Multi-AZ Resilience Patterns for examples). Following AWS best practices such as architecting for flexibility will help here, but for some more rigid designs there can still be challenges around EC2 instance availability.

In this post, I detail an approach for Reserving Capacity for this type of scenario to mitigate the risk of the instance type(s) that your application needs being unavailable, including code for building it and ways of testing it.

Baseline: Multi-AZ application with restrictive instance needs

To focus on the problem of Capacity Reservation, our reference architecture is a simple horizontally scalable monolith. This consists of a single executable running across multiple instances as a cluster in an Auto Scaling group across three AZs for High Availability.

Architecture diagram featuring an Auto Scaling Group spanning three Availability Zones within one Region for high availability.

The application in this example is both business critical and memory intensive. It needs six r6i.4xlarge instances to meet the required specifications. R6i was chosen to meet the memory-to-vCPU requirements.

The third-party application we need to run has a significant license cost, so we want to optimize our workload to make sure we run only the minimally required number of instances for the shortest amount of time.

The application should be resilient to issues in a single AZ. In the case of multi-AZ impact, it should fail over to Disaster Recovery (DR) in an alternate Region, where service level objectives are instituted to return operations to defined parameters. However, that is outside the scope of this post.

The problem: capacity during AZ failover

In this solution, the Auto Scaling Group automatically balances its instances across the selected AZs, providing a layer of resilience in the event of a disruption in a single AZ. However, this hinges on those instances being available for use in the Amazon EC2 capacity pools. The criticality of our application comes with SLAs which dictate that even the very low likelihood of instance types being unavailable in AWS must be mitigated.

The solution: Reserving Capacity

There are two main ways of Reserving Capacity for this scenario: (a) running extra capacity 24/7, or (b) using On Demand Capacity Reservations (ODCRs).

In the past, another recommendation would have been to utilize Zonal Reserved Instances (non-zonal Reserved Instances will not Reserve Capacity). But although Zonal Reserved Instances do provide similar functionality to On Demand Capacity Reservations combined with Savings Plans, they do so in a less flexible way. Therefore, the recommendation from AWS is now to instead use On Demand Capacity Reservations in combination with Savings Plans for scenarios where Capacity Reservation is required.

The TCO impact of the licensing situation rules out the first of the two valid options. Merely keeping the spare capacity up and running all the time also doesn’t cover the scenario in which an instance needs to be stopped and started, for example for maintenance or patching. Without Capacity Reservation, there is a theoretical possibility that that instance type would not be available to start up again.

This leads us to the second option: On Demand Capacity Reservations.

How much capacity to reserve?

Our failure scenario is when functionality in one AZ is impaired and the Auto Scaling Group must shift its instances to the remaining AZs while maintaining the total number of instances. With a minimum requirement of six instances, this means that we need 6/2 = 3 instances worth of Reserved Capacity in each AZ (as we can’t know in advance which one will be affected).

Illustration of number of instances required per Availability Zone, in order to keep the total number of instances at six when one Availability Zone is removed. When using three AZs there are two instances per AZ. When using two AZs there are three instances per AZ.

Spinning up the solution

If you want to get hands-on experience with On Demand Capacity Reservations, refer to this CloudFormation template and its accompanying README file for details on how to spin up the solution that we’re using. The README also contains more information about the Stack architecture. Upon successful creation, you have the following architecture running in your account.

Architecture diagram featuring adding a Resource Group of On Demand Capacity Reservations with 3 On Demand Capacity Reservations per Availability Zone.

Note that the default instance type for the AWS CloudFormation stack has been downgraded to t2.micro to keep our experiment within the AWS Free Tier.

Testing the solution

Now we have a fully functioning solution with Reserved Capacity dedicated to this specific Auto Scaling Group. However, we haven’t tested it yet.

The tests utilize the AWS Command Line Interface (AWS CLI), which we execute using AWS CloudShell.

To interact with the resources created by CloudFormation, we need some names and IDs that have been collected in the “Outputs” section of the stack. These can be accessed from the console in a tab under the Stack that you have created.

Example of outputs from running the CloudFormation stack. AutoScalingGroupName, SubnetForManuallyAddedInstance, and SubnetsToKeepWhenDroppingASGAZ.

We set these as variables for easy access later (replace the values with the values from your stack):

export AUTOSCALING_GROUP_NAME=ASGWithODCRs-CapacityBackedASG-13IZJWXF9QV8E
export SUBNET_FOR_MANUALLY_ADDED_INSTANCE=subnet-03045a72a6328ef72
export SUBNETS_TO_KEEP=subnet-03045a72a6328ef72,subnet-0fd00353b8a42f251

How does the solution react to scaling out the Auto Scaling Group beyond the Capacity Reservation?

First, let’s look at what happens if the Auto Scaling Group wants to Scale Out. Our requirements state that we should have a minimum of six instances running at any one time. But the solution should still adapt to increased load. Before knowing anything about how this works in AWS, imagine two scenarios:

  1. The Auto Scaling Group can scale out to a total of nine instances, as that’s how many On Demand Capacity Reservations we have. But it can’t go beyond that even if there is On Demand capacity available.
  2. The Auto Scaling Group can scale just as much as it could when On Demand Capacity Reservations weren’t used, and it continues to launch unreserved instances when the On Demand Capacity Reservations run out (assuming that capacity is in fact available, which is why we have the On Demand Capacity Reservations in the first place).

The instances section of the Amazon EC2 Management Console can be used to show our existing Capacity Reservations, as created by the CloudFormation stack.

Listing of consumed Capacity Reservations across the three Availability Zones, showing two used per Availability Zone.

As expected, this shows that we are currently using six out of our nine On Demand Capacity Reservations, with two in each AZ.

Now let’s scale out our Auto Scaling Group to 12, thus using up all On Demand Capacity Reservations in each AZ, as well as requesting one extra Instance per AZ.

aws autoscaling set-desired-capacity \
--auto-scaling-group-name $AUTOSCALING_GROUP_NAME \
--desired-capacity 12

The Auto Scaling Group now has the desired Capacity of 12:

Group details of the Auto Scaling Group, showing that Desired Capacity is set to 12.

And in the Capacity Reservation screen we can see that all our On Demand Capacity Reservations have been used up:

Listing of consumed Capacity Reservations across the three Availability Zones, showing that all nine On Demand Capacity Reservations are used.

In the Auto Scaling Group we see that – as expected – we weren’t restricted to nine instances. Instead, the Auto Scaling Group fell back on launching unreserved instances when our On Demand Capacity Reservations ran out:

Listing of Instances in the Auto Scaling Group, showing that the total count is 12.

How does the solution react to adding a matching instance outside the Auto Scaling Group?

But what if someone else/another process in the account starts an EC2 instance of the same type for which we have the On Demand Capacity Reservations? Won’t they get that Reservation, and our Auto Scaling Group will be left short of its three instances per AZ, which would mean that we won’t have enough reservations for our minimum of six instances in case there are issues with an AZ?

This all comes down to the type of On Demand Capacity Reservation that we have created, or the “Eligibility”. Looking at our Capacity Reservations, we can see that they are all of the “targeted” type. This means that they are only used if explicitly referenced, like we’re doing in our Target Group for the Auto Scaling Group.

Listing of existing Capacity Reservations, showing that they are of the targeted type.
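For context, the reservations in this post are created by the CloudFormation stack, but a targeted On Demand Capacity Reservation can also be created directly with the AWS SDK for Python (Boto3); the instance type and AZ below are placeholders matching the downgraded t2.micro example:

import boto3

ec2 = boto3.client("ec2")

# A "targeted" reservation is only consumed by launches that explicitly reference it,
# which is what keeps other instances in the account from using it by accident.
reservation = ec2.create_capacity_reservation(
    InstanceType="t2.micro",
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="eu-west-2a",        # placeholder AZ
    InstanceCount=3,
    InstanceMatchCriteria="targeted",
)
print(reservation["CapacityReservation"]["CapacityReservationId"])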

It’s time to prove that. First, we scale in our Auto Scaling Group so that only six instances are used, resulting in there being one unused capacity reservation in each AZ. Then, we try to add an EC2 instance manually, outside the target group.

First, scale in the Auto Scaling Group:

aws autoscaling set-desired-capacity \
--auto-scaling-group-name $AUTOSCALING_GROUP_NAME \
--desired-capacity 6

Listing of consumed Capacity Reservations across the three Availability Zones, showing two used reservations per Availability Zone.

Listing of Instances in the Auto Scaling Group, showing that the total count is six

Then, spin up the new instance, and save its ID for later when we clean up:

export MANUALLY_CREATED_INSTANCE_ID=$(aws ec2 run-instances \
--image-id resolve:ssm:/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2 \
--instance-type t2.micro \
--subnet-id $SUBNET_FOR_MANUALLY_ADDED_INSTANCE \
--query 'Instances[0].InstanceId' --output text) 

Listing of the newly created instance, showing that it is running.

We still have the three unutilized On Demand Capacity Reservations, as expected, proving that the On Demand Capacity Reservations with the “targeted” eligibility only get used when explicitly referenced:

Listing of consumed Capacity Reservations across the three Availability Zones, showing two used reservations per Availability Zone.

How does the solution react to an AZ being removed?

Now we’re comfortable that the Auto Scaling Group can grow beyond the On Demand Capacity Reservations if needed, as long as there is capacity, and that other EC2 instances in our account won’t use the On Demand Capacity Reservations specifically purchased for the Auto Scaling Group. It’s time for the big test. How does it all behave when an AZ becomes unavailable?

For our purposes, we can simulate this scenario by changing the Auto Scaling Group to be across two AZs instead of the original three.

First, we scale out to seven instances so that we can see the impact of overflow outside the On Demand Capacity Reservations when we subsequently remove one AZ:

aws autoscaling set-desired-capacity \
--auto-scaling-group-name $AUTOSCALING_GROUP_NAME \
--desired-capacity 7

Then, we change the Auto Scaling Group to only cover two AZs:

aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name $AUTOSCALING_GROUP_NAME \
--vpc-zone-identifier $SUBNETS_TO_KEEP

Give it some time, and we see that the Auto Scaling Group is now spread across two AZs, On Demand Capacity Reservations cover the minimum six instances as per our requirements, and the rest is handled by instances without Capacity Reservation:

Network details for the Auto Scaling Group, showing that it is configured for two Availability Zones.

Listing of consumed Capacity Reservations across the three Availability Zones, showing two Availability Zones using three On Demand Capacity Reservations each, with the third Availability Zone not using any of its On Demand Capacity Reservations.

Listing of Instances in the Auto Scaling Group, showing that there are 4 instances in the eu-west-2a Availability Zone.

Cleanup

It’s time to clean up, as those Instances and On Demand Capacity Reservations come at a cost!

  1. First, remove the EC2 instance that we made:
    aws ec2 terminate-instances --instance-ids $MANUALLY_CREATED_INSTANCE_ID
  2. Then, delete the CloudFormation stack.

Conclusion

Using a combination of Auto Scaling Groups, Resource Groups, and On Demand Capacity Reservations (ODCRs), we have built a solution that provides High Availability backed by reserved capacity. It applies to workloads where the requirement for availability when an AZ becomes temporarily unavailable outweighs the increased cost of reserving capacity, and where the best practices for architecting for flexibility cannot be followed due to limitations on applicable architectures.

We have tested the solution and confirmed that the Auto Scaling Group falls back on using unreserved capacity when the On Demand Capacity Reservations are exhausted. Moreover, we confirmed that targeted On Demand Capacity Reservations won’t risk getting accidentally used by other solutions in our account.

Now it’s time for you to try it yourself! Download the IaC template and give it a try! And if you are planning on using On Demand Capacity Reservations, then don’t forget to look into Savings Plans, as they significantly reduce the cost of that Reserved Capacity..

Softbrain provides advanced analytics to sales customers with Amazon QuickSight

Post Syndicated from Kenta Oda original https://aws.amazon.com/blogs/big-data/softbrain-provides-advanced-analytics-to-sales-customers-with-amazon-quicksight/

This is a guest post by Kenta Oda from SOFTBRAIN Co., Ltd.

Softbrain is a leading Japanese producer of software for sales force automation (SFA) and customer relationship management (CRM). Our main product, e-Sales Manager (eSM), is an SFA/CRM tool that provides sales support to over 5,500 companies in Japan. We provide our sales customers with a one-stop source for information and visualization of sales activity, improving their efficiency and agility, which leads to greater business opportunity.

With increasing demand from our customers for analyzing data from different angles throughout the sales process, we needed an embedded analytics tool. We chose Amazon QuickSight, a cloud-native business intelligence (BI) tool that allows you to embed insightful analytics into any application with customized, interactive visuals and dashboards. It integrates seamlessly with eSM and is easy to use at a low cost.

In this post, we discuss how QuickSight is helping us provide our sales customers with the insights they need, and why we consider this business decision a win for Softbrain.

There were four things we were looking for in an embedded analytics solution:

  • Rich visualization – With our previous solution, which was built in-house, there were only four types of visuals, so it was difficult to combine multiple graphs for an in-depth analysis.
  • Development speed – We needed to be able to quickly implement BI functionalities. QuickSight requires minimal development due to its serverless architecture, embedding, and API.
  • Cost – We moved from Tableau to QuickSight because QuickSight allowed us to provide data analysis and visualizations to our customers at a competitive price—ensuring that more of our customers can afford it.
  • Ease of use – QuickSight is cloud-based and has an intuitive UX for our sales customers to work with.

Innovating with QuickSight

Individual productivity must be greatly improved to keep up with the shifting labor market in Japan. At Softbrain, we aim to innovate using the latest technology to provide science-based insights into customer and sales interactions, enabling those who use eSM to be much more productive and empowering sales reps and managers to make informed decisions.

By using QuickSight as our embedded analytics solution, we can offer data visualizations at a much lower price point, making it much more accessible for our customers than we could with other BI solutions. When we combine the process management system offered by eSM with the intuitive user experience and rich visualization capability of QuickSight, we empower customers to understand their sales data, which sits in Amazon Simple Storage Service (Amazon S3) and Amazon Aurora, and act on it.

Seamless console embedding

What sets QuickSight apart from other BI tools is console embedding, which means our customers have the ability to build their own dashboards within eSM. They can choose which visualizations they want and take an in-depth look at their data. Sales strategy requires agility, and our customers need more than a fixed dashboard. QuickSight offers freedom and flexibility with console embedding.

Console embedding allows eSM to be a one-stop source for all the information sales reps and managers need. They can access all the analyses they need to make decisions right from their web browser because QuickSight is fully managed and serverless. With other BI solutions, the user would need to have the client application installed on their computer to create their own dashboards.
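For illustration, a console embed URL for a registered user is generated through the QuickSight API and then loaded inside the host application. The following minimal Python (boto3) sketch is not Softbrain’s actual integration code; the account ID, user ARN, and Region are hypothetical placeholders.

import boto3

quicksight = boto3.client("quicksight", region_name="ap-northeast-1")

# Hypothetical account ID and user ARN; the user must already be registered in QuickSight
response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="111122223333",
    UserArn="arn:aws:quicksight:ap-northeast-1:111122223333:user/default/esm-author",
    ExperienceConfiguration={"QuickSightConsole": {"InitialPath": "/start"}},
    SessionLifetimeInMinutes=600,
)
embed_url = response["EmbedUrl"]  # loaded in an iframe inside the host web application

The returned URL is short-lived, so an application would typically request it on demand each time a user opens the embedded authoring experience.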

Empowering our sales customers

With insights from QuickSight embedded into eSM, sales reps can analyze the gap between their budget and actual revenue to build an action plan to fill the gap. They can use their dashboards to analyze data on a weekly and monthly basis. They can share this information at meetings and explore the data to figure out why there might be low attainment for certain customers. Our customers can use eSM and QuickSight to understand why win or loss opportunities are increasing. Managers can analyze and compare the performance of their sales reps to learn what high-performing reps are doing and help low performers. Sales reps can also evaluate their own performance.

Driving 95% customer retention rate

All of these insights come from putting sales data into eSM and QuickSight. It’s no secret that our customers love QuickSight. We can boast a 95% customer retention rate and offer QuickSight as an embedded BI solution at the largest scale in Japan.

To learn more about how you can embed customized data visuals and interactive dashboards into any application, visit Amazon QuickSight Embedded.


About the author

Kenta Oda is the Chief Technology Officer at SOFTBRAIN Co., Ltd. He is responsible for new product development, with a keen eye for customer experience and go-to-market strategy.

Peloton embraces Amazon Redshift to unlock the power of data during changing times

Post Syndicated from Phil Goldstein original https://aws.amazon.com/blogs/big-data/peloton-embraces-amazon-redshift-to-unlock-the-power-of-data-during-changing-times/

Credit: Phil Goldstein
Jerry Wang, Peloton’s Director of Data Engineering (left), and Evy Kho, Peloton’s Manager of Subscription Analytics, discuss how the company has benefited from using Amazon Redshift.

New York-based Peloton, which aims to help people around the world reach their fitness goals through its connected fitness equipment and subscription-based classes, saw booming growth in the early stage of the COVID-19 pandemic. In 2020, as gyms shuttered and people looked for ways to stay active from the safety of their homes, the company’s annual revenue soared from $915 million in 2019 to $4 billion in 2021. Meanwhile, the company’s subscribers jumped from around 360,000 in 2019 to 2.76 million at the end of 2022.

As Peloton’s business continued to evolve amid a changing macroeconomic environment, it was essential that it could make smart business decisions quickly, and one of the best ways to do that was to harness insights from the huge amount of data that it had been gathering over recent years.

During that same time, AWS has been focused on helping customers manage their ever-growing volumes of data with tools like Amazon Redshift, the first fully managed, petabyte-scale cloud data warehouse. Amazon Redshift has since grown into a multifaceted service used by tens of thousands of customers to process exabytes of data on a daily basis (1 exabyte is equivalent to 119 billion song downloads). With Amazon Redshift, you get access to a modern data architecture that helps you break down internal data silos, share data securely and seamlessly, and support multiple users who don’t have specialized data and analytics skills.

When Jerry Wang, Peloton’s director of data engineering, joined the company in 2019, he needed to make sure the service would handle the company’s massive and growing amounts of data. He also needed to ensure Amazon Redshift could help the company efficiently manage the wide variety of data and the users who would need to access it, and deliver insights on that data at high velocity—all while being cost-effective and secure.

Wang was delighted to see that as Peloton experienced its massive growth and change, AWS continued to release new Amazon Redshift features and associated capabilities that would perfectly suit his company’s needs at just the right time.

“Over the years, I’ve always been in the stage where I hope Redshift can have a new, specific feature,” Wang says, “and then, in a very limited amount of time, AWS releases that kind of feature.”

Peloton’s data volumes soar as the business grows

Man working out with a weight while viewing a Peloton class on his TV in a living room.

Credit: Peloton

As Peloton’s business has evolved, the amount of data it is generating and analyzing has grown exponentially. From 2019 to now, Wang reports the amount of data the company holds has grown by a factor of 20. In fact, a full 95% of the total historical data the company has generated has come in the last 4 years. This growth has been driven both by surges in the number of users on Peloton’s platform and the variety of data the company is collecting.

Peloton collects reams of data on its sales of internet-connected exercise equipment like stationary bikes and treadmills. The company also collects data on customers’ workouts, which it then provides back to them in various reports such as a monthly summary, giving them insights into how often they worked out, their best output, trends in their workouts, the instructor they used the most, how many calories they burned, and more. All of this data helps Peloton make strategic business decisions, refine its operations to become more efficient, adjust its programming, and drive subscriber engagement and growth.

In 2019 and into 2020, as Peloton’s business boomed, the company needed an analytics system that could help it manage an explosion of data, both from users and related to its business. The company embraced Amazon Redshift because of the service’s versatility, ease of use, price-performance at scale, continuous pace of innovation, and ability to handle concurrent queries from dozens of internal data teams.

Wang said that when he joined the company, there were two kinds of users performing daily data operations in Peloton’s Amazon Redshift data warehouse. One group performed extract, transform, and load (ETL) operations to take raw data and make it available for analysis. The other was a group of business users who, each morning, would run queries to generate local data visualizations, creating a surge in demand on the Amazon Redshift data warehouse. “So, when these two loads ran together, the performance suffered directly,” Wang says.

One of the features Peloton adopted was Amazon Redshift Concurrency Scaling, which provides consistent and fast query performance even across thousands of concurrent users and concurrent queries. This helped solve the problem by automatically adding query processing power in seconds and processing queries without delays. When the workload demand subsided, the extra processing power was automatically removed, so Peloton only had to pay for the time when Concurrency Scaling data warehouses were in use. Wang says Peloton was running about 10 hours of Concurrency Scaling on a consistent daily basis to deal with the congestion, which, he says, “solved my problem at that moment.”
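As a rough illustration of how such a cap is configured (not Peloton’s actual setup), the maximum number of Concurrency Scaling clusters is controlled by a cluster parameter; the parameter group name below is hypothetical, and routing specific WLM queues to Concurrency Scaling is configured separately in the WLM JSON configuration.

import boto3

redshift = boto3.client("redshift")

# Hypothetical parameter group name; caps how many transient
# Concurrency Scaling clusters Redshift may add under load.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-cluster-params",
    Parameters=[
        {
            "ParameterName": "max_concurrency_scaling_clusters",
            "ParameterValue": "4",
            "ApplyType": "dynamic",
        }
    ],
)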

In 2020, as the pandemic inspired hordes to hop on bikes in their living rooms, Wang also upgraded Amazon Redshift with the newly introduced Amazon Redshift RA3 instances with managed storage (RMS). These represented a new generation of compute instances with managed, analytics-optimized storage designed for high-transaction, fast query performance and lower costs.

“The new instance … was a great feature for us,” Wang says. “It solved our concern about moving from terabyte scale to petabyte scale.”
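For illustration only, moving an existing provisioned cluster onto RA3 nodes can be done with an elastic resize; the cluster identifier and sizing below are hypothetical placeholders, not Peloton’s configuration.

import boto3

redshift = boto3.client("redshift")

# Hypothetical cluster identifier and sizing; elastic resize (Classic=False)
# moves the cluster onto RA3 nodes backed by managed storage.
redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",
    ClusterType="multi-node",
    NodeType="ra3.4xlarge",
    NumberOfNodes=4,
    Classic=False,
)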

Peloton’s business is driven by a variety of data for a wide range of users

Man watching a female Peloton biking instructor through a touch screen display on his Peloton bike.

Credit: Peloton

Peloton’s business model is driven by a wide variety of large volumes of data. In addition to selling bikes, treadmills, and indoor rowing machines, and expanding its subscription platform to include non-equipment-based workouts, the company has dozens of instructors in five countries, and it licenses music from three major music licensors. In 2022, it began renting bikes as well as selling them. Internally, Peloton employees working in finance, accounting, marketing, supply chain operations, music and content, and more are using data to track subscriber growth, content engagement, and which sales channels are leading to the most net new subscriptions.

“There was a time when we were just a bike company, and now we’re so much more than that,” says Evy Kho, manager of subscription analytics at Peloton.

There is also a much wider range of sales channels for Peloton equipment than just a few years ago. In the past, Peloton customers could only purchase bikes through the Peloton website or secondhand. Now, customers can purchase hardware from third-party sites like Amazon. That introduced “a really interesting data problem” for Peloton, says Kho, as it strives to determine how to link subscription signups back to exercise equipment sales.

In the face of this variability, complexity, and need for instant access to data to inform business decision-makers, Peloton embraced Amazon Redshift Serverless as an early adopter after AWS introduced the feature in late 2021. Redshift Serverless allows companies to quickly run and scale analytics capacity without database managers and data engineers needing to manage data warehouse infrastructure.

Redshift Serverless also has the ability to quickly spin up analytics capacity for different users, or personas, within an organization. This allows different teams across Peloton to perform analytics on the same datasets at the same time to generate insights on their individual parts of the business. It’s “incredibly important in terms of assessing what’s been good for our business,” Kho says.

Wang also says Peloton is considering supporting specific personas for those who need analytics around financial information governed by securities regulations, and another for users who need to perform analytics on data governed by regulations around personally identifiable information (PII).

Wang points out that Redshift Serverless also allows him to spin up Amazon Redshift data warehouses to handle special usage patterns. For example, ETL loads are often high I/O but require low CPU resources, and are very predictable because Peloton controls the process. However, when internal users want to perform data analytics or machine learning, the company doesn’t have control over the demand for those queries, and the load on Amazon Redshift data warehouses can be variable, with some queries more CPU-intensive than others. Previously, any provisioned data warehouse would have a fixed cost, and it would have to be provisioned to cope with the highest possible workloads even if the utilization rates turned out to be low. Now, for these different scenarios, Wang creates different Amazon Redshift instances to handle that variability without those heavy, fixed costs.
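To sketch what spinning up isolated serverless compute for a specific usage pattern might look like (the names and capacity value are hypothetical, not Peloton’s configuration), a namespace holds the database objects and a workgroup provides the compute:

import boto3

rss = boto3.client("redshift-serverless")

# Hypothetical names; a separate namespace/workgroup pair can be created
# per usage pattern (for example, predictable ETL vs. spiky ad hoc analytics).
rss.create_namespace(namespaceName="adhoc-analytics")

rss.create_workgroup(
    workgroupName="adhoc-analytics-wg",
    namespaceName="adhoc-analytics",
    baseCapacity=32,  # base Redshift Processing Units; compute scales automatically with load
)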

As Peloton’s use of Amazon Redshift has evolved and matured, its costs have gone down, according to Wang. “If you look at Serverless, the amount … that we spend on the Serverless is actually much smaller than we did previously, compared to the Concurrency Scaling cost.”

In a serverless environment, there is no upfront cost to Peloton. “I can set it up as quickly as I can and we pay as we need it,” Wang says. “It scales up when the load goes up. So, it’s a perfect fit.”

Peloton uses Amazon Redshift to get to insights faster

Women running on a Peloton treadmill with a touch screen display

Credit: Peloton

Peloton’s focus on efficiency and sustainable growth has meant that it needs to act more quickly than ever to make sound, data-informed business decisions. Peloton, Wang notes, is long past the stage where all it cared about was growth. “We are a mature company now, so operational efficiency is very important; it’s key to the business,” he says.

When Peloton launches new products, for example, two things typically happen, Wang says. One is that there is a spike in data volumes, both in traffic to its website and the number of sales transactions it’s processing. The second is that the company’s management team will want real-time updates and analysis of how sales are performing.

Redshift Serverless and data sharing let users quickly start performing real-time analytics and build reporting and dashboard applications without any additional engineering required. Wang confirms this benefit, especially in the example of a new product launch, saying it “will scale up by itself without me having to intervene. I don’t need to allocate a budget. I don’t need to change any configurations.”

In the past, when Peloton only offered its fitness equipment through its own website, it was easy to associate fulfillment data on orders with subscriptions. However, as those channels grew and became more complex, Peloton turned to the data sharing capabilities of Amazon Redshift to share data quickly and easily across teams. Peloton’s teams for subscriber analytics, supply chain, accounting, and more need fast access to fulfillment data to ensure they can track it accurately, respond if changes are needed, and determine how fulfillment data aligns with subscriptions and revenue.

“Getting them those results even faster has been incredibly helpful, and is only becoming more important as we have become far more data-driven than I think you could argue we were before,” Kho says.

Amazon Redshift marries data security, governance, and compliance with innovation

Like all customers, Peloton is concerned about data security, governance, and compliance. With security features like dynamic data masking, role-based access control, and row-level security, Amazon Redshift protects customers’ data with granular authorization features and comprehensive identity management.

Customers also are able to easily provide authorizations for the right users or groups. These features are available out of the box, within the standard pricing model.

Wang notes that Amazon Redshift’s security model is based on a traditional database model, which is a well-understood and robust model. “So for us, to provision access on that model is quite straightforward,” Wang says.

At every stage of Peloton’s evolution over the last 4 years, the company has been able to turn to AWS and Amazon Redshift to help it effectively manage that growth and complexity.

“When I started,” Wang says, “I said, OK, I need a temporary boost in capacity. Then came Concurrency Scaling. And then I said, I need cheaper storage, and [RA3] comes along. And then the ultimate challenge [was], I’m no longer satisfied with a monolithic Redshift instance. Serverless solved that issue.”

Join AWS Data Insights Day 2023

If you want to learn how your company can use Amazon Redshift to analyze large volumes of data in an easy-to-use, scalable, cost-effective, and secure way, don’t miss AWS Data Insights Day on May 24, 2023. During the day-long virtual event, learn from AWS leaders, experts, partners, and customers—including Peloton, Gilead, McDonald’s, Global Foundries, Schneider Electric, and Flutter Entertainment—how Amazon Redshift and features like Amazon Redshift ML are helping drive business innovation, optimization, and cost savings, especially in today’s uncertain economic times.

Register Now for a calendar reminder.

To learn more about Amazon Redshift, see Amazon Redshift and Amazon Redshift: Ten years of continuous reinvention.


About the author

Phil Goldstein is a copywriter and editor with AWS product marketing. He has 15 years of technology writing experience, and prior to joining AWS was a senior editor at a content marketing agency and a business journalist covering the wireless industry.

How Zoom implemented streaming log ingestion and efficient GDPR deletes using Apache Hudi on Amazon EMR

Post Syndicated from Sekar Srinivasan original https://aws.amazon.com/blogs/big-data/how-zoom-implemented-streaming-log-ingestion-and-efficient-gdpr-deletes-using-apache-hudi-on-amazon-emr/

In today’s digital age, logging is a critical aspect of application development and management, but efficiently managing logs while complying with data protection regulations can be a significant challenge. Zoom, in collaboration with the AWS Data Lab team, developed an innovative architecture to overcome these challenges and streamline their logging and record deletion processes. In this post, we explore the architecture and the benefits it provides for Zoom and its users.

Application log challenges: Data management and compliance

Application logs are an essential component of any application; they provide valuable information about the usage and performance of the system. These logs are used for a variety of purposes, such as debugging, auditing, performance monitoring, business intelligence, system maintenance, and security. However, although these application logs are necessary for maintaining and improving the application, they also pose an interesting challenge. These application logs may contain personally identifiable data, such as user names, email addresses, IP addresses, and browsing history, which creates a data privacy concern.

Laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) require organizations to retain application logs for a specific period of time. The exact length of time required for data storage varies depending on the specific regulation and the type of data being stored. The reason for these data retention periods is to ensure that companies aren’t keeping personal data longer than necessary, which could increase the risk of data breaches and other security incidents. This also helps ensure that companies aren’t using personal data for purposes other than those for which it was collected, which could be a violation of privacy laws. These laws also give individuals the right to request the deletion of their personal data, also known as the “right to be forgotten.” Individuals have the right to have their personal data erased, without undue delay.

So, on one hand, organizations need to collect application log data to ensure the proper functioning of their services, and keep the data for a specific period of time. But on the other hand, they may receive requests from individuals to delete their personal data from the logs. This creates a balancing act for organizations because they must comply with both data retention and data deletion requirements.

This issue becomes increasingly challenging for larger organizations that operate in multiple countries and states, because each country and state may have their own rules and regulations regarding data retention and deletion. For example, the Personal Information Protection and Electronic Documents Act (PIPEDA) in Canada and the Australian Privacy Act in Australia are similar laws to GDPR, but they may have different retention periods or different exceptions. Therefore, organizations big or small must navigate this complex landscape of data retention and deletion requirements, while also ensuring that they are in compliance with all applicable laws and regulations.

Zoom’s initial architecture

During the COVID-19 pandemic, the use of Zoom skyrocketed as more and more people were asked to work and attend classes from home. The company had to rapidly scale its services to accommodate the surge and worked with AWS to deploy capacity across most Regions globally. With a sudden increase in the large number of application endpoints, they had to rapidly evolve their log analytics architecture and worked with the AWS Data Lab team to quickly prototype and deploy an architecture for their compliance use case.

At Zoom, the data ingestion throughput and performance needs are very stringent. Data had to be ingested from several thousand application endpoints that produced over 30 million messages every minute, resulting in over 100 TB of log data per day. The existing ingestion pipeline consisted of writing the data to Apache Hadoop HDFS storage through Apache Kafka first and then running daily jobs to move the data to persistent storage. This took several hours while also slowing the ingestion and creating the potential for data loss. Scaling the architecture was also an issue because HDFS data would have to be moved around whenever nodes were added or removed. Furthermore, transactional semantics on billions of records were necessary to help meet compliance-related data delete requests, and the existing architecture of daily batch jobs was operationally inefficient.

It was at this time, through conversations with the AWS account team, that the AWS Data Lab team got involved to assist in building a solution for Zoom’s hyper-scale.

Solution overview

The AWS Data Lab offers accelerated, joint engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data, analytics, artificial intelligence (AI), machine learning (ML), serverless, and container modernization initiatives. The Data Lab has three offerings: the Build Lab, the Design Lab, and Resident Architect. During the Build and Design Labs, AWS Data Lab Solutions Architects and AWS experts supported Zoom specifically by providing prescriptive architectural guidance, sharing best practices, building a working prototype, and removing technical roadblocks to help meet their production needs.

Zoom and the AWS team (collectively referred to as “the team” going forward) identified two major workflows for data ingestion and deletion.

Data ingestion workflow

The following diagram illustrates the data ingestion workflow.

Data Ingestion Workflow

To prototype the solution, the team needed to quickly populate millions of Kafka messages in the dev/test environment. To expedite the process, we (the team) opted to use Amazon Managed Streaming for Apache Kafka (Amazon MSK), which makes it simple to ingest and process streaming data in real time, and we were up and running in under a day.

To generate test data that resembled production data, the AWS Data Lab team created a custom Python script that evenly populated over 1.2 billion messages across several Kafka partitions. To match the production setup in the development account, we had to increase the cloud quota limit via a support ticket.

We used Amazon MSK and the Spark Structured Streaming capability in Amazon EMR to ingest and process the incoming Kafka messages with high throughput and low latency. Specifically, we inserted the data from the source into EMR clusters at a maximum incoming rate of 150 million Kafka messages every 5 minutes, with each Kafka message holding 7–25 log data records.

To store the data, we chose to use Apache Hudi as the table format. We opted for Hudi because it’s an open-source data management framework that provides record-level insert, update, and delete capabilities on top of an immutable storage layer like Amazon Simple Storage Service (Amazon S3). Additionally, Hudi is optimized for handling large datasets and works well with Spark Structured Streaming, which was already being used at Zoom.

After 150 million messages were buffered, we processed the messages using Spark Structured Streaming on Amazon EMR and wrote the data into Amazon S3 in Apache Hudi-compatible format every 5 minutes. We first flattened the message array, creating a single record from the nested array of messages. Then we added a unique key, known as the Hudi record key, to each message. This key allows Hudi to perform record-level insert, update, and delete operations on the data. We also extracted the field values, including the Hudi partition keys, from incoming messages.
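The following PySpark sketch shows the general shape of such a micro-batch pipeline: read from Kafka, flatten and key the records, and write them to Amazon S3 in Hudi format every 5 minutes. It is a simplified illustration, not Zoom’s production code; the broker address, topic name, schema, key fields, and S3 paths are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json, to_date
from pyspark.sql.types import ArrayType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hudi").getOrCreate()

# Hypothetical schema: each Kafka message carries a nested array of log records
message_schema = StructType([
    StructField("messages", ArrayType(StructType([
        StructField("record_id", StringType()),
        StructField("event_ts", TimestampType()),
        StructField("payload", StringType()),
    ])))
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "b-1.example-msk:9092")  # hypothetical brokers
       .option("subscribe", "application-logs")                     # hypothetical topic
       .option("startingOffsets", "latest")
       .load())

# Flatten the nested message array into one row per log record and derive the partition key
records = (raw.select(from_json(col("value").cast("string"), message_schema).alias("m"))
           .select(explode(col("m.messages")).alias("r"))
           .select("r.*")
           .withColumn("log_date", to_date(col("event_ts"))))

hudi_options = {
    "hoodie.table.name": "application_logs",
    "hoodie.datasource.write.recordkey.field": "record_id",
    "hoodie.datasource.write.partitionpath.field": "log_date",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.operation": "upsert",
}

def write_batch(batch_df, batch_id):
    # Each 5-minute micro-batch is written to S3 as one Hudi commit
    (batch_df.write.format("hudi")
     .options(**hudi_options)
     .mode("append")
     .save("s3://example-bucket/hudi/application_logs/"))

query = (records.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "s3://example-bucket/checkpoints/application_logs/")
         .trigger(processingTime="5 minutes")
         .start())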

This architecture allowed end-users to query the data stored in Amazon S3 using Amazon Athena with the AWS Glue Data Catalog or using Apache Hive and Presto.

Data deletion workflow

The following diagram illustrates the data deletion workflow.

Data Deletion Workflow

Our architecture allowed for efficient data deletions. To help comply with the customer-initiated data retention policy for GDPR deletes, scheduled jobs ran daily to identify the data to be deleted in batch mode.

We then spun up a transient EMR cluster to run the GDPR upsert job to delete the records. The data was stored in Amazon S3 in Hudi format, and Hudi’s built-in index allowed us to efficiently delete records using bloom filters and file ranges. Because only those files that contained the record keys needed to be read and rewritten, it only took about 1–2 minutes to delete 1,000 records out of the 1 billion records, which had previously taken hours to complete as entire partitions were read.
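As a simplified sketch of how such a batch delete can be expressed with the Hudi Spark datasource (the field names, table path, and deletion criterion are hypothetical), the job selects the keys to remove and issues a delete operation:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("gdpr-delete").getOrCreate()

# Hypothetical list of identifiers covered by deletion requests
user_ids_to_delete = ["user-123", "user-456"]

# Identify the records to delete; only the Hudi key fields are needed for the delete
to_delete = (spark.read.format("hudi")
             .load("s3://example-bucket/hudi/application_logs/")
             .where(col("user_id").isin(user_ids_to_delete))
             .select("record_id", "log_date", "event_ts"))

# Issue a delete operation; Hudi's index limits the rewrite to the affected files
(to_delete.write.format("hudi")
 .option("hoodie.table.name", "application_logs")
 .option("hoodie.datasource.write.operation", "delete")
 .option("hoodie.datasource.write.recordkey.field", "record_id")
 .option("hoodie.datasource.write.partitionpath.field", "log_date")
 .option("hoodie.datasource.write.precombine.field", "event_ts")
 .mode("append")
 .save("s3://example-bucket/hudi/application_logs/"))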

Overall, our solution enabled efficient deletion of data, which provided an additional layer of data security that was critical for Zoom, in light of its GDPR requirements.

Architecting to optimize scale, performance, and cost

In this section, we share the following strategies Zoom took to optimize scale, performance, and cost:

  • Optimizing ingestion
  • Optimizing throughput and Amazon EMR utilization
  • Decoupling ingestion and GDPR deletion using EMRFS
  • Efficient deletes with Apache Hudi
  • Optimizing for low-latency reads with Apache Hudi
  • Monitoring

Optimizing ingestion

To keep the storage in Kafka lean and optimal, as well as to get a real-time view of data, we created a Spark job to read incoming Kafka messages in batches of 150 million messages and wrote to Amazon S3 in Hudi-compatible format every 5 minutes. Even during the initial stages of the iteration, when we hadn’t started scaling and tuning yet, we were able to successfully load all Kafka messages consistently under 2.5 minutes using the Amazon EMR runtime for Apache Spark.

Optimizing throughput and Amazon EMR utilization

We launched a cost-optimized EMR cluster and switched from uniform instance groups to using EMR instance fleets. We chose instance fleets because we needed the flexibility to use Spot Instances for task nodes and wanted to diversify the risk of running out of capacity for a specific instance type in our Availability Zone.

We started experimenting with test runs by first changing the number of Kafka partitions from 400 to 1,000, and then changing the number of task nodes and instance types. Based on the results of the run, the AWS team came up with the recommendation to use Amazon EMR with three core nodes (r5.16xlarge (64 vCPUs each)) and 18 task nodes using Spot fleet instances (a combination of r5.16xlarge (64 vCPUs), r5.12xlarge (48 vCPUs), r5.8xlarge (32 vCPUs)). These recommendations helped Zoom to reduce their Amazon EMR costs by more than 80% while meeting their desired performance goals of ingesting 150 million Kafka messages under 5 minutes.

Decoupling ingestion and GDPR deletion using EMRFS

A well-known benefit of separation of storage and compute is that you can scale the two independently. But a not-so-obvious advantage is that you can decouple continuous workloads from sporadic workloads. Previously data was stored in HDFS. Resource-intensive GDPR delete jobs and data movement jobs would compete for resources with the stream ingestion, causing a backlog of more than 5 hours in upstream Kafka clusters, which was close to filling up the Kafka storage (which only had 6 hours of data retention) and potentially causing data loss. Offloading data from HDFS to Amazon S3 allowed us the freedom to launch independent transient EMR clusters on demand to perform data deletion, helping to ensure that the ongoing data ingestion from Kafka into Amazon EMR is not starved for resources. This enabled the system to ingest data every 5 minutes and complete each Spark Streaming read in 2–3 minutes. Another side effect of using EMRFS is a cost-optimized cluster, because we removed reliance on Amazon Elastic Block Store (Amazon EBS) volumes for over 300 TB storage that was used for three copies (including two replicas) of HDFS data. We now pay for only one copy of the data in Amazon S3, which provides 11 9s of durability and is relatively inexpensive storage.

Efficient deletes with Apache Hudi

What about the conflict between ingest writes and GDPR deletes when running concurrently? This is where the power of Apache Hudi stands out.

Apache Hudi provides a table format for data lakes with transactional semantics that enables the separation of ingestion workloads and updates when run concurrently. The system was able to consistently delete 1,000 records in less than a minute. There were some limitations in concurrent writes in Apache Hudi 0.7.0, but the Amazon EMR team quickly addressed this by back-porting Apache Hudi 0.8.0, which supports optimistic concurrency control, to the then-current Amazon EMR 6.4 release (at the time of the AWS Data Lab collaboration). This allowed a quick transition to the new version with minimal testing. It also enabled us to query the data directly using Athena without having to spin up a cluster to run ad hoc queries, as well as to query the data using Presto, Trino, and Hive. The decoupling of the storage and compute layers provided the flexibility to not only query data across different EMR clusters, but also delete data using a completely independent transient cluster.

Optimizing for low-latency reads with Apache Hudi

To optimize for low-latency reads with Apache Hudi, we needed to address the issue of too many small files being created within Amazon S3 due to the continuous streaming of data into the data lake.

We utilized Apache Hudi’s features to tune file sizes for optimal querying. Specifically, we reduced the degree of parallelism in Hudi from the default value of 1,500 to a lower number. Parallelism refers to the number of threads used to write data to Hudi; by reducing it, we were able to create larger files that were more optimal for querying.

Because we needed to optimize for high-volume streaming ingestion, we chose to implement the merge on read table type (instead of copy on write) for our workload. This table type allowed us to quickly ingest the incoming data into delta files in row format (Avro) and asynchronously compact the delta files into columnar Parquet files for fast reads. To do this, we ran the Hudi compaction job in the background. Compaction is the process of merging row-based delta files to produce new versions of columnar files. Because the compaction job would use additional compute resources, we adjusted the degree of parallelism for insertion to a lower value of 1,000 to account for the additional resource usage. This adjustment allowed us to create larger files without sacrificing performance throughput.
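A rough sketch of the relevant writer configuration looks like the following. The keys and values are illustrative, roughly matching Hudi 0.8-era option names rather than Zoom’s actual settings, and would be merged into the streaming writer options shown earlier.

# Illustrative Hudi writer options for a merge on read table with async compaction
mor_hudi_options = {
    "hoodie.table.name": "application_logs",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    # lower the write parallelism (default 1,500) so fewer, larger files are produced
    "hoodie.insert.shuffle.parallelism": "1000",
    "hoodie.upsert.shuffle.parallelism": "1000",
    # compact row-based delta files into columnar Parquet asynchronously,
    # so ingestion latency is not affected
    "hoodie.compact.inline": "false",
    "hoodie.datasource.compaction.async.enable": "true",
    "hoodie.compact.inline.max.delta.commits": "10",
}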

Overall, our approach to optimizing for low-latency reads with Apache Hudi allowed us to better manage file sizes and improve the overall performance of our data lake.

Monitoring

The team monitored MSK clusters with Prometheus (an open-source monitoring tool). Additionally, we showcased how to monitor Spark streaming jobs using Amazon CloudWatch metrics. For more information, refer to Monitor Spark streaming applications on Amazon EMR.

Outcomes

The collaboration between Zoom and the AWS Data Lab demonstrated significant improvements in data ingestion, processing, storage, and deletion using an architecture with Amazon EMR and Apache Hudi. One key benefit of the architecture was a reduction in infrastructure costs, which was achieved through the use of cloud-native technologies and the efficient management of data storage. Another benefit was an improvement in data management capabilities.

We showed that the costs of EMR clusters can be reduced by about 82% while bringing the storage costs down by about 90% compared to the prior HDFS-based architecture. All of this while making the data available in the data lake within 5 minutes of ingestion from the source. We also demonstrated that data deletions from a data lake containing multiple petabytes of data can be performed much more efficiently. With our optimized approach, we were able to delete approximately 1,000 records in just 1–2 minutes, as compared to the previously required 3 hours or more.

Conclusion

In conclusion, the log analytics process (collecting, processing, storing, analyzing, and deleting log data from sources such as servers, applications, and devices) is critical in helping organizations meet their service resiliency, security, performance monitoring, troubleshooting, and compliance needs, such as GDPR.

This post shared what Zoom and the AWS Data Lab team have accomplished together to solve critical data pipeline challenges, and Zoom has extended the solution further to optimize extract, transform, and load (ETL) jobs and resource efficiency. However, you can also use the architecture patterns presented here to quickly build cost-effective and scalable solutions for other use cases. Please reach out to your AWS team for more information or contact Sales.


About the Authors

Sekar Srinivasan is a Sr. Specialist Solutions Architect at AWS focused on Big Data and Analytics. Sekar has over 20 years of experience working with data. He is passionate about helping customers build scalable solutions, modernize their architecture, and generate insights from their data. In his spare time, he likes to work on non-profit projects focused on underprivileged children’s education.

Chandra Dhandapani is a Senior Solutions Architect at AWS, where he specializes in creating solutions for customers in Analytics, AI/ML, and Databases. He has a lot of experience in building and scaling applications across different industries including Healthcare and Fintech. Outside of work, he is an avid traveler and enjoys sports, reading, and entertainment.

Amit Kumar Agrawal is a Senior Solutions Architect at AWS, based out of San Francisco Bay Area. He works with large strategic ISV customers to architect cloud solutions that address their business challenges. During his free time he enjoys exploring the outdoors with his family.

Viral Shah is an Analytics Sales Specialist who has been working with AWS for 5 years, helping customers to be successful in their data journey. He has more than 20 years of experience working with enterprise customers and startups, primarily in the data and database space. He loves to travel and spend quality time with his family.

Best practices to optimize your Amazon EC2 Spot Instances usage

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/best-practices-to-optimize-your-amazon-ec2-spot-instances-usage/

This blog post is written by Pranaya Anshu, EC2 PMM, and Sid Ambatipudi, EC2 Compute GTM Specialist.

Amazon EC2 Spot Instances are a powerful tool that thousands of customers use to optimize their compute costs. The National Football League (NFL) is one example of a customer using Spot Instances: it leverages 4,000 EC2 Spot Instances across more than 20 instance types to build its season schedule, and by using Spot Instances it saves 2 million dollars every season! Virtually any organization, small or big, can benefit from using Spot Instances by following best practices.

Overview of Spot Instances

Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud and are available at up to a 90% discount compared to On-Demand prices. Through Spot Instances, you can take advantage of the massive operating scale of AWS and run hyperscale workloads at a significant cost saving. In exchange for these discounts, AWS has the option to reclaim Spot Instances when EC2 requires the capacity. AWS provides a two-minute notification before reclaiming Spot Instances, allowing workloads running on those instances to be gracefully shut down.

In this blog post, we explore four best practices that can help you optimize your Spot Instances usage and minimize the impact of Spot Instances interruptions: diversifying your instances, considering attribute-based instance type selection, leveraging Spot placement scores, and using the price-capacity-optimized allocation strategy. By applying these best practices, you’ll be able to leverage Spot Instances for appropriate workloads and ultimately reduce your compute costs. Note for the purposes of this blog, we will focus on the integration of Spot Instances with Amazon EC2 Auto Scaling groups.

Pre-requisites

Spot Instances can be used for various stateless, fault-tolerant, or flexible applications such as big data, containerized workloads, CI/CD, web servers, high-performance computing (HPC), and AI/ML workloads. However, as previously mentioned, AWS can interrupt Spot Instances with a two-minute notification, so it is best not to use Spot Instances for workloads that cannot handle individual instance interruption — that is, workloads that are inflexible, stateful, fault-intolerant, or tightly coupled.

Best practices

  1. Diversify your instances

The fundamental best practice when using Spot Instances is to be flexible. A Spot capacity pool is a set of unused EC2 instances of the same instance type (for example, m6i.large) within the same AWS Region and Availability Zone (for example, us-east-1a). When you request Spot Instances, you are requesting instances from a specific Spot capacity pool. Because Spot Instances are spare EC2 capacity, you want to base your request on as many pools of spare capacity as possible in order to increase your likelihood of getting Spot Instances. You should diversify across instance sizes, generations, instance types, and Availability Zones to maximize your savings with Spot Instances. For example, if you are currently using c5a.large in us-east-1a, consider including c6a instances (a newer generation of instances), c5a.xlarge (a larger size), or us-east-1b (a different Availability Zone) to increase your overall flexibility. Instance diversification is beneficial not only for selecting Spot Instances, but also for scaling, resilience, and cost optimization.

To get hands-on experience with Spot Instances and to practice instance diversification, check out Amazon EC2 Spot Instances workshops. And once you’ve diversified your instances, you can leverage AWS Fault Injection Simulator (AWS FIS) to test your applications’ resilience to Spot Instance interruptions to ensure that they can maintain target capacity while still benefiting from the cost savings offered by Spot Instances. To learn more about stress testing your applications, check out the Back to Basics: Chaos Engineering with AWS Fault Injection Simulator video and AWS FIS documentation.

  2. Consider attribute-based instance type selection

We have established that flexibility is key when it comes to getting the most out of Spot Instances. Similarly, we have said that in order to access your desired Spot Instances capacity, you should select multiple instance types. While building and maintaining instance type configurations in a flexible way may seem daunting or time-consuming, it doesn’t have to be if you use attribute-based instance type selection. With attribute-based instance type selection, you can specify instance attributes — for example, CPU, memory, and storage — and EC2 Auto Scaling will automatically identify and launch instances that meet your defined attributes. This removes the manual-lift of configuring and updating instance types. Moreover, this selection method enables you to automatically use newly released instance types as they become available so that you can continuously have access to an increasingly broad range of Spot Instance capacity. Attribute-based instance type selection is ideal for workloads and frameworks that are instance agnostic, such as HPC and big data workloads, and can help to reduce the work involved with selecting specific instance types to meet specific requirements.

For more information on how to configure attribute-based instance selection for your EC2 Auto Scaling group, refer to Create an Auto Scaling Group Using Attribute-Based Instance Type Selection documentation. To learn more about attribute-based instance type selection, read the Attribute-Based Instance Type Selection for EC2 Auto Scaling and EC2 Fleet news blog or check out the Using Attribute-Based Instance Type Selection and Mixed Instance Groups section of the Launching Spot Instances workshop.
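As a minimal sketch (the group name, launch template, subnets, and attribute ranges below are hypothetical), an Auto Scaling group using attribute-based instance type selection replaces an explicit instance type list with instance requirements:

import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical names and values; EC2 Auto Scaling resolves the attribute ranges
# to every matching instance type, including newly released ones.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="spot-flexible-asg",
    MinSize=0,
    MaxSize=20,
    DesiredCapacity=10,
    VPCZoneIdentifier="subnet-0123456789abcdef0,subnet-0fedcba9876543210",
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "spot-workload-template",
                "Version": "$Latest",
            },
            "Overrides": [
                {
                    "InstanceRequirements": {
                        "VCpuCount": {"Min": 4, "Max": 16},
                        "MemoryMiB": {"Min": 8192},
                    }
                }
            ],
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,  # run the scaling capacity on Spot
            "SpotAllocationStrategy": "capacity-optimized",  # allocation strategies are covered in best practice 4
        },
    },
)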

  3. Leverage Spot placement scores

Now that we’ve stressed the importance of flexibility when it comes to Spot Instances and covered the best way to select instances, let’s dive into how to find preferred times and locations to launch Spot Instances. Because Spot Instances are unused EC2 capacity, Spot Instances capacity fluctuates. Correspondingly, it is possible that you won’t always get the exact capacity at a specific time that you need through Spot Instances. Spot placement scores are a feature of Spot Instances that indicates how likely it is that you will be able to get the Spot capacity that you require in a specific Region or Availability Zone. Your Spot placement score can help you reduce Spot Instance interruptions, acquire greater capacity, and identify optimal configurations to run workloads on Spot Instances. However, it is important to note that Spot placement scores serve only as point-in-time recommendations (scores can vary depending on current capacity) and do not provide any guarantees in terms of available capacity or risk of interruption.  To learn more about how Spot placement scores work and to get started with them, see the Identifying Optimal Locations for Flexible Workloads With Spot Placement Score blog and Spot placement scores documentation.
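For a quick sense of how the scores are retrieved (the instance types, target capacity, and Regions below are hypothetical), the GetSpotPlacementScores API returns a score from 1 to 10 per Region or Availability Zone:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical diversified instance types and target capacity
response = ec2.get_spot_placement_scores(
    InstanceTypes=["c5a.large", "c5a.xlarge", "c6a.large", "m5.large"],
    TargetCapacity=50,
    TargetCapacityUnitType="units",
    SingleAvailabilityZone=False,
    RegionNames=["us-east-1", "us-west-2", "eu-west-1"],
)

# Higher scores indicate a higher likelihood that the Spot request will succeed right now
for entry in response["SpotPlacementScores"]:
    print(entry["Region"], entry.get("AvailabilityZoneId"), entry["Score"])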

As a near real-time tool, Spot placement scores are often integrated into deployment automation. However, if you are looking to understand historical Spot placement scores for your workload, check out the Spot placement score tracker, a tool that automates the capture of Spot placement scores and stores the metrics in Amazon CloudWatch; its logging and graphing capabilities can make it a valuable resource even before you launch a workload in the cloud. The tracker is available through AWS Labs, a GitHub organization that hosts tools from AWS teams. Learn more about the tracker through the Optimizing Amazon EC2 Spot Instances with Spot Placement Scores blog.

When considering ideal times to launch Spot Instances and exploring different options via Spot placement scores, be sure to consider running Spot Instances at off-peak hours – or hours when there is less demand for EC2 Instances. As you may assume, there is less unused capacity – Spot Instances – available during typical business hours than after business hours. So, in order to leverage as much Spot capacity as you can, explore the possibility of running your workload at hours when there is reduced demand for EC2 instances and thus greater availability of Spot Instances. Similarly, consider running your Spot Instances in “off-peak Regions” – or Regions that are not experiencing business hours at that certain time.

On a related note, to maximize your usage of Spot Instances, you should consider using previous generation of instances if they meet your workload needs. This is because, as with off-peak vs peak hours, there is typically greater capacity available for previous generation instances than current generation instances, as most people tend to use current generation instances for their compute needs.

  4. Use the price-capacity-optimized allocation strategy

Once you’ve selected a diversified and flexible set of instances, you should select your allocation strategy. When launching instances, your Auto Scaling group uses the allocation strategy that you specify to pick the specific Spot pools from all your possible pools. Spot offers four allocation strategies: price-capacity-optimized, capacity-optimized, capacity-optimized-prioritized, and lowest-price. Each of these allocation strategies select Spot Instances in pools based on price, capacity, a prioritized list of instances, or a combination of these factors.

The price-capacity-optimized strategy launched in November 2022. This strategy makes Spot Instance allocation decisions based on the most capacity at the lowest price. It essentially enables Auto Scaling groups to identify the Spot pools with the highest capacity availability for the number of instances that are launching. In other words, if you select this allocation strategy, we will find the Spot capacity pools that we believe have the lowest chance of interruption in the near term. Your Auto Scaling groups then request Spot Instances from the lowest priced of these pools.

We recommend you leverage the price-capacity-optimized allocation strategy for the majority of your workloads that run on Spot Instances. To see how the price-capacity-optimized allocation strategy selects Spot Instances in comparison with lowest-price and capacity-optimized allocation strategies, read the Introducing the Price-Capacity-Optimized Allocation Strategy for EC2 Spot Instances blog post.
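To show roughly where the strategy is set (the group name, launch template, and instance type list are hypothetical), the allocation strategy lives in the Auto Scaling group’s mixed instances policy:

import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical diversified instance types plus the price-capacity-optimized strategy
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="spot-flexible-asg",
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "spot-workload-template",
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": "c5a.large"},
                {"InstanceType": "c5a.xlarge"},
                {"InstanceType": "c6a.large"},
                {"InstanceType": "m5.large"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    },
)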

Clean-up

If you’ve explored the different Spot Instances workshops we recommended throughout this blog post and spun up resources, please remember to delete resources that you are no longer using to avoid incurring future costs.

Conclusion

Spot Instances can be leveraged to reduce costs across a wide variety of use cases, including containers, big data, machine learning, HPC, and CI/CD workloads. In this blog, we discussed four Spot Instances best practices that can help you optimize your Spot Instance usage to maximize savings: diversifying your instances, considering attribute-based instance type selection, leveraging Spot placement scores, and using the price-capacity-optimized allocation strategy.

To learn more about Spot Instances, check out Spot Instances getting started resources. Or to learn of other ways of reducing costs and improving performance, including leveraging other flexible purchase models such as AWS Savings Plans, read the Increase Your Application Performance at Lower Costs eBook or watch the Seven Steps to Lower Costs While Improving Application Performance webinar.

AWS Nitro System gets independent affirmation of its confidential compute capabilities

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/aws-nitro-system-gets-independent-affirmation-of-its-confidential-compute-capabilities/

This blog post was written by Anthony Liguori, VP/Distinguished Engineer, EC2 AWS.

Customers around the world trust AWS to keep their data safe, and keeping their workloads secure and confidential is foundational to how we operate. Since the inception of AWS, we have relentlessly innovated on security, privacy tools, and practices to meet, and even exceed, our customers’ expectations.

The AWS Nitro System is the underlying platform for all modern AWS compute instances, and it has allowed us to deliver the data isolation, performance, cost, and pace of innovation that our customers require. It’s a pioneering design of specialized hardware and software that protects customer code and data from unauthorized access during processing.

When we launched the Nitro System in 2017, we delivered a unique architecture that restricts any operator access to customer data. This means no person or even service from AWS, can access data when it is being used in an Amazon EC2 instance. We knew that designing the system this way would present several architectural and operational challenges for us. However, we also knew that protecting customers’ data in this way was the best way to support our customer’s needs.

When AWS made its Digital Sovereignty Pledge last year, we committed to providing greater transparency and assurances to customers about how AWS services are designed and operated, especially when it comes to handling customer data. As part of that increased transparency, we engaged NCC Group, a leading cybersecurity consulting firm based in the United Kingdom, to conduct an independent architecture review of the Nitro System and the security assurances we make to our customers. NCC Group has now issued its report and affirmed our claims.

The report states, “As a matter of design, NCC Group found no gaps in the Nitro System that would compromise [AWS] security claims.” Specifically, the report validates the following statements about our Nitro System production hosts:

  1. There is no mechanism for a cloud service provider employee to log in to the underlying host.
  2. No administrative API can access customer content on the underlying host.
  3. There is no mechanism for a cloud service provider employee to access customer content stored on instance storage and encrypted EBS volumes.
  4. There is no mechanism for a cloud service provider employee to access encrypted data transmitted over the network.
  5. Access to administrative APIs always requires authentication and authorization.
  6. Access to administrative APIs is always logged.
  7. Hosts can only run tested and signed software that is deployed by an authenticated and authorized deployment service. No cloud service provider employee can deploy code directly onto hosts.

The report details NCC’s analysis for each of these claims. You can also find additional details about the scope, methodology, and steps that NCC used to evaluate the claims.

How Nitro System protects customer data

At AWS, we know that our customers, especially those who have sensitive or confidential data, may have worries about putting that data in the cloud. That’s why we’ve architected the Nitro System to ensure that your confidential information is as secure as possible. We do this in several ways:

There is no mechanism for any system or person to log in to Amazon EC2 servers, read the memory of EC2 instances, or access any data on encrypted Amazon Elastic Block Store (EBS) volumes.

If any AWS operator, including those with the highest privileges, needs to perform maintenance work on the EC2 server, they can do so only by using a strictly limited set of authenticated, authorized, and audited administrative APIs. Critically, none of these APIs have the ability to access customer data on the EC2 server. These restrictions are built into the Nitro System itself, and no AWS operator can circumvent these controls and protections.

The Nitro System also protects customers from AWS system software through the innovative design of our lightweight Nitro Hypervisor, which manages memory and CPU allocation. Typical commercial hypervisors provide administrators with full access to the system, but with the Nitro System, the only interface operators can use is a restricted API. This means that customers and operators cannot interact with the system in unapproved ways and there is no equivalent of a “root” user. This approach enhances security and allows AWS to update systems in the background, fix system bugs, monitor performance, and even perform upgrades without impacting customer operations or customer data. Customers are unaffected during system upgrades, and their data remains protected.

Finally, the Nitro System can also provide customers an extra layer of data isolation from their own operators and software. AWS created AWS Nitro Enclaves, which provide isolated compute environments that are ideal for organizations that need to process personally identifiable information, as well as healthcare, financial, and intellectual property data, within their compute instances. These enclaves do not share memory or CPU cores with the customer instance. Further, Nitro Enclaves have cryptographic attestation capabilities that let customers verify that all of the software deployed has been validated and not compromised.
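As a small illustration (the AMI and subnet IDs are hypothetical placeholders, and the parent instance type must support Nitro Enclaves), enclave support is enabled when launching the parent EC2 instance:

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Hypothetical AMI and subnet IDs; EnclaveOptions enables Nitro Enclaves
# on the parent instance so an isolated enclave can be launched inside it.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",
    EnclaveOptions={"Enabled": True},
)
print(response["Instances"][0]["InstanceId"])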

All of these aspects of the Nitro System’s security and confidential compute capabilities required AWS to invest time and resources into building the system’s architecture. We did so because we wanted to ensure that our customers felt confident entrusting us with their most sensitive and confidential data, and we have worked to continue earning that trust. We are not done; this is just one step AWS is taking to increase the transparency about how our services are designed and operated. We will continue to innovate on and deliver unique features that further enhance our customers’ security without compromising on performance.

Learn more:

Watch Anthony speak about AWS Nitro System Security here.

How Encored Technologies built serverless event-driven data pipelines with AWS

Post Syndicated from Younggu Yun original https://aws.amazon.com/blogs/big-data/how-encored-technologies-built-serverless-event-driven-data-pipelines-with-aws/

This post is a guest post co-written with SeonJeong Lee, JaeRyun Yim, and HyeonSeok Yang from Encored Technologies.

Encored Technologies (Encored) is an energy IT company in Korea that helps their customers generate higher revenue and reduce operational costs in renewable energy industries by providing various AI-based solutions. Encored develops machine learning (ML) applications predicting and optimizing various energy-related processes, and their key initiative is to predict the amount of power generated at renewable energy power plants.

In this post, we share how Encored runs data engineering pipelines for containerized ML applications on AWS and how they use AWS Lambda to achieve performance improvement, cost reduction, and operational efficiency. We also demonstrate how to use AWS services to ingest and process GRIB (GRIdded Binary) format data, which is a file format commonly used in meteorology to store and exchange weather and climate data in a compressed binary form. It allows for efficient data storage and transmission, as well as easy manipulation of the data using specialized software.

Business and technical challenge

Encored is expanding their business into multiple countries to provide power trading services for end customers. The amount of data and the number of power plants from which they need to collect data are rapidly increasing over time. For example, the volume of data required for training one of the ML models is more than 200 TB. To meet the growing requirements of the business, the data science and platform team needed to speed up the process of delivering model outputs. As a solution, Encored aimed to migrate existing data and run ML applications in the AWS Cloud environment to efficiently run a scalable and robust end-to-end data and ML pipeline.

Solution overview

The primary objective of the solution is to develop an optimized data ingestion pipeline that addresses the scaling challenges related to data ingestion. During its previous deployment in an on-premises environment, the time taken to process data from ingestion to preparing the training dataset exceeded the required service level agreement (SLA). One of the input datasets required for ML models is weather data supplied by the Korea Meteorological Administration (KMA). In order to use the GRIB datasets for the ML models, Encored needed to prepare the raw data to make it suitable for building and training ML models. The first step was to convert GRIB to the Parquet file format.

Encored used Lambda to run an existing data ingestion pipeline built in a Linux-based container image. Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda runs your code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, and logging. Lambda is triggered to ingest and process GRIB data files when they are uploaded to Amazon Simple Storage Service (Amazon S3). Once the files are processed, they are stored in Parquet format in another S3 bucket. Encored receives GRIB files throughout the day, and whenever new files arrive, a Lambda function runs a container image registered in Amazon Elastic Container Registry (Amazon ECR). This event-driven pipeline invokes a customized data pipeline packaged as a container-based solution. Leveraging AWS Lambda, this solution is cost-effective, scalable, and high-performing. Encored uses Python as their preferred language.

The following diagram illustrates the solution architecture.

Lambda-data-pipeline

For data-intensive tasks such as extract, transform, and load (ETL) jobs and ML inference, Lambda is an ideal solution because it offers several key benefits, including rapid scaling to meet demand, automatic scaling to zero when not in use, and S3 event triggers that can initiate actions in response to object-created events. All this contributes to building a scalable and cost-effective data event-driven pipeline. In addition to these benefits, Lambda allows you to configure ephemeral storage (/tmp) between 512–10,240 MB. Encored used this storage for their data application when reading or writing data, enabling them to optimize performance and cost-effectiveness. Furthermore, Lambda’s pay-per-use pricing model means that users only pay for the compute time in use, making it a cost-effective solution for a wide range of use cases.
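
Ephemeral storage is configured per function. The following is a minimal boto3 sketch of raising /tmp to its maximum; the function name is hypothetical and the value is illustrative.

import boto3

lambda_client = boto3.client("lambda")

# Raise the function's /tmp allocation so large GRIB files can be staged locally
# before conversion to Parquet. Size is in MB; the valid range is 512-10,240.
lambda_client.update_function_configuration(
    FunctionName="process-grib",        # hypothetical function name
    EphemeralStorage={"Size": 10240},
)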

Prerequisites

For this walkthrough, you should have the following:

Build your application required for your Docker image

The first step is to develop an application that can ingest and process files. This application reads the bucket name and object key passed by the trigger added to the Lambda function. The processing logic involves three parts: downloading the file from Amazon S3 into ephemeral storage (/tmp), parsing the GRIB formatted data, and saving the parsed data to Parquet format.

The customer has a Python script (for example, app.py) that performs these tasks as follows:

import os
import tempfile
import boto3
import numpy as np
import pandas as pd
import pygrib

s3_client = boto3.client('s3')
def handler(event, context):
    try:
        # Get trigger file name
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]

        # Handle temp files: all temp objects are deleted when the with-clause is closed
        with tempfile.NamedTemporaryFile(delete=True) as tmp_file:
            # Step1> Download file from s3 into temp area
            s3_file_basename = os.path.basename(s3_file_name)
            s3_file_dirname = os.path.dirname(s3_file_name)
            local_filename = tmp_file.name
            s3_client.download_file(
                Bucket=bucket_name,
                Key=f"{s3_file_dirname}/{s3_file_basename}",
                Filename=local_filename
            )

            # Step2> Parse – GRIB2 
            grbs = pygrib.open(local_filename)
            list_of_name = []
            list_of_values = []
            for grb in grbs:
                list_of_name.append(grb.name)
                list_of_values.append(grb.values)
            _, lat, lon = grb.data()
            list_of_name += ["lat", "lon"]
            list_of_values += [lat, lon]
            grbs.close()

            dat = pd.DataFrame(
                np.transpose(np.stack(list_of_values).reshape(len(list_of_values), -1)),
                columns=list_of_name,
            )

        # Step3> To Parquet
        # S3path is a placeholder for the destination S3 URI of the Parquet output,
        # for example "s3://<output-bucket>/<prefix>/output.parquet" (writing directly
        # to Amazon S3 with to_parquet requires the s3fs package).
        s3_dest_uri = S3path
        dat.to_parquet(s3_dest_uri, compression="snappy")

    except Exception as err:
        print(err)

Prepare a Docker file

The second step is to create a Docker image using an AWS base image. To achieve this, you can create a new Dockerfile using a text editor on your local machine. The AWS base image for Lambda already provides two environment variables:

  • LAMBDA_TASK_ROOT=/var/task
  • LAMBDA_RUNTIME_DIR=/var/runtime

It’s important to install any dependencies under the ${LAMBDA_TASK_ROOT} directory alongside the function handler to ensure that the Lambda runtime can locate them when the function is invoked. Refer to the available Lambda base images for custom runtime for more information.

FROM public.ecr.aws/lambda/python:3.8

# Install the function's dependencies using file requirements.txt
# from your project folder.

COPY requirements.txt  .
RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

# Copy function code
COPY app.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "app.handler" ]

Build a Docker image

The third step is to build your Docker image using the docker build command. When running this command, make sure to enter a name for the image. For example:

docker build -t process-grib .

In this example, the name of the image is process-grib. You can choose any name you like for your Docker image.

Upload the image to the Amazon ECR repository

Your container image needs to reside in an Amazon Elastic Container Registry (Amazon ECR) repository. Amazon ECR is a fully managed container registry offering high-performance hosting, so you can reliably deploy application images and artifacts anywhere. For instructions on creating an ECR repository, refer to Creating a private repository.
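
If you prefer to create the repository with the AWS SDK rather than the console, a minimal boto3 sketch might look like the following; the repository name mirrors the image name used in this walkthrough.

import boto3

ecr = boto3.client("ecr", region_name="ap-northeast-2")

# Create a private repository to hold the container image; scan-on-push is optional.
response = ecr.create_repository(
    repositoryName="process-grib",
    imageScanningConfiguration={"scanOnPush": True},
)
print(response["repository"]["repositoryUri"])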

The first step is to authenticate the Docker CLI to your ECR registry as follows:

aws ecr get-login-password --region ap-northeast-2 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com 

The second step is to tag your image to match your repository name, and deploy the image to Amazon ECR using the docker push command:

docker tag process-grib:latest 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/process-grib:latest
docker push 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/process-grib:latest

Deploy Lambda functions as container images

To create your Lambda function, complete the following steps:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose Create function.
  3. Choose the Container image option.
  4. For Function name, enter a name.
  5. For Container image URI, provide a container image. You can enter the ECR image URI or browse for the ECR image.
  6. Under Container image overrides, you can override configuration settings such as the entry point or working directory that are included in the Dockerfile.
  7. Under Permissions, expand Change default execution role.
  8. Choose to create a new role or use an existing role.
  9. Choose Create function.
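
The console steps above can also be scripted. The following is a minimal boto3 sketch, assuming the image URI pushed earlier and a pre-existing execution role; both values are illustrative.

import boto3

lambda_client = boto3.client("lambda", region_name="ap-northeast-2")

# Hypothetical image URI and execution role; the role needs S3 read/write and
# CloudWatch Logs permissions in addition to the basic Lambda execution policy.
response = lambda_client.create_function(
    FunctionName="process-grib",
    PackageType="Image",
    Code={"ImageUri": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/process-grib:latest"},
    Role="arn:aws:iam::123456789012:role/process-grib-execution-role",
    MemorySize=2048,
    Timeout=900,
    EphemeralStorage={"Size": 10240},
)
print(response["FunctionArn"])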

Key considerations

To handle a large amount of data concurrently and quickly, Encored needed to store GRIB formatted files in the ephemeral storage (/tmp) that comes with Lambda. To achieve this requirement, Encored used tempfile.NamedTemporaryFile, which makes it easy to create temporary files that are deleted automatically when no longer needed. With Lambda, you can configure ephemeral storage between 512 MB–10,240 MB for reading or writing data, allowing you to run ETL jobs, ML inference, or other data-intensive workloads.
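
As a small, self-contained illustration of that pattern (separate from the GRIB-processing code above):

import os
import tempfile

# Files created under /tmp count against the function's ephemeral storage and
# are removed automatically when the with block exits.
with tempfile.NamedTemporaryFile(dir="/tmp", delete=True) as tmp:
    tmp.write(b"staged GRIB bytes")
    tmp.flush()
    print(os.path.exists(tmp.name))  # True while the file is open

print(os.path.exists(tmp.name))      # False: the temporary file has been deleted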

Business outcome

Hyoseop Lee (CTO at Encored Technologies) said, “Encored has experienced positive outcomes since migrating to AWS Cloud. Initially, there was a perception that running workloads on AWS would be more expensive than using an on-premises environment. However, we discovered that this was not the case once we started running our applications on AWS. One of the most fascinating aspects of AWS services is the flexible architecture options it provides for processing, storing, and accessing large volumes of data that are only required infrequently.”

Conclusion

In this post, we covered how Encored built serverless data pipelines with Lambda and Amazon ECR to achieve performance improvement, cost reduction, and operational efficiency.

Encored successfully built an architecture that will support their global expansion and enhance technical capabilities through AWS services and the AWS Data Lab program. Based on the architecture and various internal datasets Encored has consolidated and curated, Encored plans to provide renewable energy forecasting and energy trading services.

Thanks for reading this post and hopefully you found it useful. To accelerate your digital transformation with ML, AWS is available to support you by providing prescriptive architectural guidance on a particular use case, sharing best practices, and removing technical roadblocks. You’ll leave the engagement with an architecture or working prototype that is custom fit to your needs, a path to production, and deeper knowledge of AWS services. Please contact your AWS Account Manager or Solutions Architect to get started. If you don’t have an AWS Account Manager, please contact Sales.

To learn more about ML inference use cases with Lambda, check out the following blog posts:

These resources will provide you with valuable insights and practical examples of how to use Lambda for ML inference.


About the Authors

SeonJeong Lee is the Head of Algorithms at Encored. She is a data practitioner who finds peace of mind from beautiful code and formulas.

JaeRyun Yim is a Senior Data Scientist at Encored. He strives to improve both work and life by focusing on simplicity and essence in his work.

HyeonSeok Yang is the platform team lead at Encored. He always strives to work with passion and spirit, to keep challenging himself like a junior developer, and to become a role model for others.

Younggu Yun works at AWS Data Lab in Korea. His role involves helping customers across the APAC region meet their business objectives and overcome technical challenges by providing prescriptive architectural guidance, sharing best practices, and building innovative solutions together.

The history and future roadmap of the AWS CloudFormation Registry

Post Syndicated from Eric Z. Beard original https://aws.amazon.com/blogs/devops/cloudformation-coverage/

AWS CloudFormation is an Infrastructure as Code (IaC) service that allows you to model your cloud resources in template files that can be authored or generated in a variety of languages. You can manage stacks that deploy those resources via the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the API. CloudFormation helps customers to quickly and consistently deploy and manage cloud resources, but like all IaC tools, it faced challenges keeping up with the rapid pace of innovation of AWS services. In this post, we will review the history of the CloudFormation registry, which is the result of a strategy we developed to address scaling and standardization, as well as integration with other leading IaC tools and partner products. We will also give an update on the current state of CloudFormation resource coverage and review the future state, which has a goal of keeping CloudFormation and other IaC tools up to date with the latest AWS services and features.

History

The CloudFormation service was first announced in February of 2011, with sample templates that showed how to deploy common applications like blogs and wikis. At launch, CloudFormation supported 13 out of 15 available AWS services with 48 total resource types. At first, resource coverage was tightly coupled to the core CloudFormation engine, and all development on those resources was done by the CloudFormation team itself. Over the past decade, AWS has grown at a rapid pace, and there are currently 200+ services in total. A challenge over the years has been the coverage gap between what was possible for a customer to achieve using AWS services, and what was possible to define in a CloudFormation template.

It became obvious that we needed a change in strategy to scale resource development in a way that could keep up with the rapid pace of innovation set by hundreds of service teams delivering new features on a daily basis. Over the last decade, our pace of innovation has increased nearly 40-fold, with 80 significant new features launched in 2011 versus more than 3,000 in 2021. Since CloudFormation was a key adoption driver (or blocker) for new AWS services, those teams needed a way to create and manage their own resources. The goal was to enable day one support of new services at the time of launch with complete CloudFormation resource coverage.

In 2016, we launched an internal self-service platform that allowed service teams to control their own resources. This began to solve the scaling problems inherent in the prior model where the core CloudFormation team had to do all the work themselves. The benefits went beyond simply distributing developer effort, as the service teams have deep domain knowledge on their products, which allowed them to create more effective IaC components. However, as we developed resources on this model, we realized that additional design features were needed, such as standardization that could enable automatic support for features like drift detection and resource imports.

We embarked on a new project to address these concerns, with the goal of improving the internal developer experience as well as providing a public registry where customers could use the same programming model to define their own resource types. We realized that it wasn’t enough to simply make the new model available—we had to evangelize it with a training campaign, conduct engineering boot-camps, build better tooling like dashboards and deployment pipeline templates, and produce comprehensive on-boarding documentation. Most importantly, we made CloudFormation support a required item on the feature launch checklist for new services, a requirement that goes beyond documentation and is built into internal release tooling (exceptions to this requirement are rare as training and awareness around the registry have improved over time). This was a prime example of one of the maxims we repeat often at Amazon: good mechanisms are better than good intentions.

In 2019, we made this new functionality available to customers when we announced the CloudFormation registry, a capability that allowed developers to create and manage private resource types. We followed up in 2021 with the public registry where third parties, such as partners in the AWS Partner Network (APN), can publish extensions. The open source resource model that customers and partners use to publish third-party registry extensions is the same model used by AWS service teams to provide CloudFormation support for their features.

Once a service team on-boards their resources to the new resource model and builds the expected Create, Read, Update, Delete, and List (CRUDL) handlers, managed experiences like drift detection and resource import are all supported with no additional development effort. One recent example of day-1 CloudFormation support for a popular new feature was Lambda Function URLs, which offered a built-in HTTPS endpoint for single-function micro-services. We also migrated the Amazon Relational Database Service (Amazon RDS) Database Instance resource (AWS::RDS::DBInstance) to the new resource model in September 2022, and within a month, Amazon RDS delivered support for Amazon Aurora Serverless v2 in CloudFormation. This accelerated delivery is possible because teams can now publish independently by taking advantage of the de-centralized Registry ownership model.

Current State

We are building out future innovations for the CloudFormation service on top of this new standardized resource model so that customers can benefit from a consistent implementation of event handlers. We built AWS Cloud Control API on top of this new resource model. Cloud Control API takes the Create-Read-Update-Delete-List (CRUDL) handlers written for the new resource model and makes them available as a consistent API for provisioning resources. APN partner products such as HashiCorp Terraform, Pulumi, and Red Hat Ansible use Cloud Control API to stay in sync with AWS service launches without recurring development effort.
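
As an illustration of that consistency, the following is a minimal boto3 sketch that lists and reads Lambda functions through Cloud Control API; the resource identifier is hypothetical and pagination is omitted.

import boto3

cloudcontrol = boto3.client("cloudcontrol")

# The same generic verbs work for any registry-backed resource type; this reads
# only the first page of results.
page = cloudcontrol.list_resources(TypeName="AWS::Lambda::Function")
for summary in page["ResourceDescriptions"]:
    print(summary["Identifier"])

# Read a single resource back with the same uniform API (identifier is hypothetical).
detail = cloudcontrol.get_resource(
    TypeName="AWS::Lambda::Function",
    Identifier="my-function",
)
print(detail["ResourceDescription"]["Properties"])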

Figure 1. Cloud Control API Resource Handler Diagram

Besides third-party application support, the public registry can also be used by the developer community to create useful extensions on top of AWS services. A common approach to extending the capabilities of CloudFormation resources is to write a custom resource, which generally involves inline AWS Lambda function code that runs in response to CREATE, UPDATE, and DELETE signals during stack operations. Some of those use cases can now be solved by writing a registry extension resource type instead. For more information on custom resources and resource types, and the differences between the two, see Managing resources using AWS CloudFormation Resource Types.

CloudFormation Registry modules, which are building blocks authored in JSON or YAML, give customers a way to replace fragile copy-paste template reuse with template snippets that are published in the registry and consumed as if they were resource types. Best practices can be encapsulated and shared across an organization, which allows infrastructure developers to easily adhere to those best practices using modular components that abstract away the intricate details of resource configuration.

CloudFormation Registry hooks give security and compliance teams a vital tool to validate stack deployments before any resources are created, modified, or deleted. An infrastructure team can activate hooks in an account to ensure that stack deployments cannot avoid or suppress preventative controls implemented in hook handlers. Provisioning tools that are strictly client-side do not have this level of enforcement.

A useful by-product of publishing a resource type to the public registry is that you get automatic support for the AWS Cloud Development Kit (CDK) via an experimental open source repository on GitHub called cdk-cloudformation. In large organizations it is typical to see a mix of CloudFormation deployments using declarative templates and deployments that make use of the CDK in languages like TypeScript and Python. By publishing re-usable resource types to the registry, all of your developers can benefit from higher level abstractions, regardless of the tool they choose to create and deploy their applications. (Note that this project is still considered a developer preview and is subject to change)

If you want to see if a given CloudFormation resource is on the new registry model or not, check if the provisioning type is either Fully Mutable or Immutable by invoking the DescribeType API and inspecting the ProvisioningType response element.

Here is a sample CLI command that gets a description for the AWS::Lambda::Function resource, which is on the new registry model.

$ aws cloudformation describe-type --type RESOURCE \
    --type-name AWS::Lambda::Function | grep ProvisioningType

   "ProvisioningType": "FULLY_MUTABLE",

The difference between FULLY_MUTABLE and IMMUTABLE is the presence of the Update handler. FULLY_MUTABLE types include an update handler to process updates during stack update operations. IMMUTABLE types, by contrast, do not include an update handler, so the type can’t be updated and must instead be replaced during stack update operations. Legacy resource types will be NON_PROVISIONABLE.
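
The same check can be made programmatically; a minimal boto3 sketch follows.

import boto3

cfn = boto3.client("cloudformation")

# Resources on the registry model report FULLY_MUTABLE or IMMUTABLE;
# legacy resources report NON_PROVISIONABLE.
response = cfn.describe_type(Type="RESOURCE", TypeName="AWS::Lambda::Function")
print(response["ProvisioningType"])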

Opportunities for improvement

As we continue to strive towards our ultimate goal of achieving full feature coverage and a complete migration away from the legacy resource model, we are constantly identifying opportunities for improvement. We are currently addressing feature gaps in supported resources, such as tagging support for EC2 VPC Endpoints and boosting coverage for resource types to support drift detection, resource import, and Cloud Control API. We have fully migrated more than 130 resources, and acknowledge that there are many left to go, and the migration has taken longer than we initially anticipated. Our top priority is to maintain the stability of existing stacks—we simply cannot break backwards compatibility in the interest of meeting a deadline, so we are being careful and deliberate. One of the big benefits of a server-side provisioning engine like CloudFormation is operational stability—no matter how long ago you deployed a stack, any future modifications to it will work without needing to worry about upgrading client libraries. We remain committed to streamlining the migration process for service teams and making it as easy and efficient as possible.

The developer experience for creating registry extensions has some rough edges, particularly for languages other than Java, which is the language of choice on AWS service teams for their resource types. It needs to be easier to author schemas, write handler functions, and test the code to make sure it performs as expected. We are devoting more resources to the maintenance of the CLI and plugins for Python, Typescript, and Go. Our response times to issues and pull requests in these and other repositories in the aws-cloudformation GitHub organization have not been as fast as they should be, and we are making improvements. One example is the cloudformation-cli repository, where we have merged more than 30 pull requests since October of 2022.

To keep up with progress on resource coverage, check out the CloudFormation Coverage Roadmap, a GitHub project where we catalog all of the open issues to be resolved. You can submit bug reports and feature requests related to resource coverage in this repository and keep tabs on the status of open requests. One of the steps we took recently to improve responses to feature requests and bugs reported on GitHub is to create a system that converts GitHub issues into tickets in our internal issue tracker. These tickets go directly to the responsible service teams—an example is the Amazon RDS resource provider, which has hundreds of merged pull requests.

We have recently announced a new GitHub repository called community-registry-extensions where we are managing a namespace for public registry extensions. You can submit and discuss new ideas for extensions and contribute to any of the related projects. We handle the testing, validation, and deployment of all resources under the AwsCommunity:: namespace, which can be activated in any AWS account for use in your own templates.

To get started with the CloudFormation registry, visit the user guide, and then dive in to the detailed developer guide for information on how to use the CloudFormation Command Line Interface (CFN-CLI) to write your own resource types, modules, and hooks.

We recently created a new Discord server dedicated to CloudFormation. Please join us to ask questions, discuss best practices, provide feedback, or just hang out! We look forward to seeing you there.

Conclusion

In this post, we hope you gained some insights into the history of the CloudFormation registry, and the design decisions that were made during our evolution towards a standardized, scalable model for resource development that can be shared by AWS service teams, customers, and APN partners. Some of the lessons that we learned along the way might be applicable to complex design initiatives at your own company. We hope to see you on Discord and GitHub as we build out a rich set of registry resources together!

About the authors:

Eric Beard

Eric is a Solutions Architect at Amazon Web Services in Seattle, Washington, where he leads the field specialist group for Infrastructure as Code. His technology career spans two decades, preceded by service in the United States Marine Corps as a Russian interpreter and arms control inspector.

Rahul Sharma

Rahul is a Senior Product Manager-Technical at Amazon Web Services with over two years of product management spanning AWS CloudFormation and AWS Cloud Control API.

Integrating DevOps Guru Insights with CloudWatch Dashboard

Post Syndicated from Suresh Babu original https://aws.amazon.com/blogs/devops/integrating-devops-guru-insights-with-cloudwatch-dashboard/

Many customers use Amazon CloudWatch dashboards to monitor applications and often ask how they can integrate Amazon DevOps Guru Insights in order to have a unified dashboard for monitoring.  This blog post showcases integrating DevOps Guru proactive and reactive insights to a CloudWatch dashboard by using Custom Widgets. It can help you to correlate trends over time and spot issues more efficiently by displaying related data from different sources side by side and to have a single pane of glass visualization in the CloudWatch dashboard.

Amazon DevOps Guru is a machine learning (ML) powered service that helps developers and operators automatically detect anomalies and improve application availability. DevOps Guru’s anomaly detectors can proactively detect anomalous behavior even before it occurs, helping you address issues before they happen; detailed insights provide recommendations to mitigate that behavior.

Amazon CloudWatch dashboard is a customizable home page in the CloudWatch console that monitors multiple resources in a single view. You can use CloudWatch dashboards to create customized views of the metrics and alarms for your AWS resources.

Solution overview

This post will help you to create a Custom Widget for Amazon CloudWatch dashboard that displays DevOps Guru Insights. A custom widget is part of your CloudWatch dashboard that calls an AWS Lambda function containing your custom code. The Lambda function accepts custom parameters, generates your dataset or visualization, and then returns HTML to the CloudWatch dashboard. The CloudWatch dashboard will display this HTML as a widget. In this post, we are providing sample code for the Lambda function that will call DevOps Guru APIs to retrieve the insights information and displays as a widget in the CloudWatch dashboard. The architecture diagram of the solution is below.

Figure 1: Reference architecture diagram
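
Before walking through the console steps, the following minimal Python sketch illustrates the custom widget contract described above: handle the describe event, read the dashboard time range from widgetContext, and return HTML. The walkthrough below uses a Node.js implementation; the names here are illustrative, and pagination of list_insights and the zoomed time range are omitted for brevity.

from datetime import datetime, timezone

import boto3

DOCS = "## DevOps Guru Insights Count\nDisplays the count of reactive insights."

def handler(event, context):
    # CloudWatch sends {"describe": true} when it needs the widget documentation.
    if event.get("describe"):
        return DOCS

    # The dashboard passes its current time range (epoch milliseconds) in widgetContext.
    time_range = event["widgetContext"]["timeRange"]
    start = datetime.fromtimestamp(time_range["start"] / 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(time_range["end"] / 1000, tz=timezone.utc)

    guru = boto3.client("devops-guru")
    insights = guru.list_insights(
        StatusFilter={"Any": {"Type": "REACTIVE",
                              "StartTimeRange": {"FromTime": start, "ToTime": end}}}
    )
    count = len(insights.get("ReactiveInsights", []))

    # Whatever string the function returns is rendered as HTML inside the widget.
    return f"<h1>Reactive insights: {count}</h1>"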

Prerequisites and Assumptions

  • An AWS account
  • DevOps Guru should be enabled in the account. To enable DevOps Guru, see DevOps Guru Setup
  • Follow this Workshop to deploy a sample application in your AWS Account which can help generate some DevOps Guru insights.

Solution Deployment

We are providing two options to deploy the solution – using the AWS console and AWS CloudFormation. The first section has instructions to deploy using the AWS console followed by instructions for using CloudFormation. The key difference is that we will create one Widget while using the Console, but three Widgets are created when we use AWS CloudFormation.

Using the AWS Console:

We will first create a Lambda function that will retrieve the DevOps Guru insights. We will then modify the default IAM role associated with the Lambda function to add DevOps Guru permissions. Finally we will create a CloudWatch dashboard and add a custom widget to display the DevOps Guru insights.

  1. Navigate to the Lambda Console after logging to your AWS Account and click on Create function.

    Figure 2a: Create Lambda Function

  2. Choose Author from Scratch and use the runtime Node.js 16.x. Leave the rest of the settings at default and create the function.

    Figure 2b: Create Lambda Function

  3. After a few seconds, the Lambda function will be created and you will see a code source box. Copy the code below and replace the code in the code source editor, as shown in the following screenshot.
    // SPDX-License-Identifier: MIT-0
    // CloudWatch Custom Widget sample: displays count of Amazon DevOps Guru Insights
    const aws = require('aws-sdk');
    
    const DOCS = `## DevOps Guru Insights Count
    Displays the total counts of Proactive and Reactive Insights in DevOps Guru.
    `;
    
    async function getProactiveInsightsCount(DevOpsGuru, StartTime, EndTime) {
        let NextToken = null;
        let proactivecount=0;
    
        do {
            const args = { StatusFilter: { Any : { StartTimeRange: { FromTime: StartTime, ToTime: EndTime }, Type: 'PROACTIVE'  }}}
            const result = await DevOpsGuru.listInsights(args).promise();
            console.log(result)
            NextToken = result.NextToken;
            result.ProactiveInsights.forEach(res => {
            console.log(result.ProactiveInsights[0].Status)
            proactivecount++;
            });
            } while (NextToken);
        return proactivecount;
    }
    
    async function getReactiveInsightsCount(DevOpsGuru, StartTime, EndTime) {
        let NextToken = null;
        let reactivecount=0;
    
        do {
            const args = { StatusFilter: { Any : { StartTimeRange: { FromTime: StartTime, ToTime: EndTime }, Type: 'REACTIVE'  }}}
            const result = await DevOpsGuru.listInsights(args).promise();
            NextToken = result.NextToken;
            result.ReactiveInsights.forEach(res => {
            reactivecount++;
            });
            } while (NextToken);
        return reactivecount;
    }
    
    function getHtmlOutput(proactivecount, reactivecount, region, event, context) {
    
        return `DevOps Guru Proactive Insights<br><font size="+10" color="#FF9900">${proactivecount}</font>
        <p>DevOps Guru Reactive Insights</p><font size="+10" color="#FF9900">${reactivecount}`;
    }
    
    exports.handler = async (event, context) => {
        if (event.describe) {
            return DOCS;
        }
        const widgetContext = event.widgetContext;
        const timeRange = widgetContext.timeRange.zoom || widgetContext.timeRange;
        const StartTime = new Date(timeRange.start);
        const EndTime = new Date(timeRange.end);
        const region = event.region || process.env.AWS_REGION;
        const DevOpsGuru = new aws.DevOpsGuru({ region });
    
        const proactivecount = await getProactiveInsightsCount(DevOpsGuru, StartTime, EndTime);
        const reactivecount = await getReactiveInsightsCount(DevOpsGuru, StartTime, EndTime);
    
        return getHtmlOutput(proactivecount, reactivecount, region, event, context);
        
    };

    Figure 3: Lambda Function Source Code

  4. Click on Deploy to save the function code
  5. Since we used the default settings while creating the function, a default Execution role is created and associated with the function. We will need to modify the IAM role to grant DevOps Guru permissions to retrieve Proactive and Reactive insights.
  6. Click on the Configuration tab and select Permissions from the left side option list. You can see the IAM execution role associated with the function as shown in figure 4.

    Figure 4: Lambda function execution role

  7. Click on the IAM role name to open the role in the IAM console. Click on Add Permissions and select Attach policies.

    Figure 5: IAM Role Update

  8. Search for DevOps and select the AmazonDevOpsGuruReadOnlyAccess. Click on Add permissions to update the IAM role.

    Figure 6: IAM Role Policy Update

  9. Now that we have created the Lambda function for our custom widget and assigned appropriate permissions, we can navigate to CloudWatch to create a Dashboard.
  10. Navigate to CloudWatch and click on dashboards from the left side list. You can choose to create a new dashboard or add the widget in an existing dashboard.
  11. We will choose to create a new dashboard

    Figure 7: Create New CloudWatch dashboard

  12. Choose Custom Widget in the Add widget page

    Figure 8: Add widget

  13. Click Next on the custom widget page without choosing a sample

    Figure 9: Custom Widget Selection

  14. Choose the Region where DevOps Guru is enabled. Select the Lambda function that we created earlier. In the preview pane, choose Preview to view the DevOps Guru metrics. Once the preview is successful, create the widget.

    Figure 10: Create Custom Widget

  15. Congratulations, you have now successfully created a CloudWatch dashboard with a custom widget to get insights from DevOps Guru. The sample code that we provided can be customized to suit your needs.

Using AWS CloudFormation

You may skip this step and move to future scope section if you have already created the resources using AWS Console.

In this step we will show you how to deploy the solution using AWS CloudFormation. AWS CloudFormation lets you model, provision, and manage AWS and third-party resources by treating infrastructure as code. Customers define an initial template and then revise it as their requirements change. For more information on CloudFormation stack creation, refer to this blog post.

The following resources are created.

  • Three Lambda functions that will support CloudWatch Dashboard custom widgets
  • An AWS Identity and Access Management (IAM) role that allows the Lambda functions to access DevOps Guru insights and to publish logs to CloudWatch
  • Three Log Groups under CloudWatch
  • A CloudWatch dashboard with widgets to pull data from the Lambda Functions

To deploy the solution by using the CloudFormation template

  1. You can use this downloadable template to set up the resources. To launch directly through the console, choose the Launch Stack button, which creates the stack in the us-east-1 AWS Region.
  2. Choose Next to go to the Specify stack details page.
  3. (Optional) On the Configure Stack Options page, enter any tags, and then choose Next.
  4. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  5. Choose Create stack.

It takes approximately 2-3 minutes for the provisioning to complete. After the stack status is CREATE_COMPLETE, proceed to validate the resources as listed below.
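
If you prefer to script the deployment, the following is a minimal boto3 sketch; the template URL is a placeholder for the downloadable template referenced above.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# The template URL below is a placeholder; substitute the downloadable template.
cfn.create_stack(
    StackName="devopsguru-cloudwatch-dashboard",
    TemplateURL="https://example-bucket.s3.amazonaws.com/devopsguru-dashboard.yaml",
    Capabilities=["CAPABILITY_IAM"],
)

# Block until provisioning finishes (typically 2-3 minutes for this stack).
waiter = cfn.get_waiter("stack_create_complete")
waiter.wait(StackName="devopsguru-cloudwatch-dashboard")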

Validate the resources

Now that the stack creation has completed successfully, you should validate the resources that were created.

  • On AWS Console, head to CloudWatch, under Dashboards – there will be a dashboard created with name <StackName-Region>.
  • On AWS Console, head to CloudWatch, under LogGroups there will be 3 new log-groups created with the name as:
    • lambdaProactiveLogGroup
    • lambdaReactiveLogGroup
    • lambdaSummaryLogGroup
  • On AWS Console, head to Lambda, there will be lambda function(s) under the name:
    • lambdaFunctionDGProactive
    • lambdaFunctionDGReactive
    • lambdaFunctionDGSummary
  • On AWS Console, head to IAM, under Roles there will be a new role created with name “lambdaIAMRole”
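
The same validation can be scripted. The following boto3 sketch spot-checks the log groups and the dashboard; the name prefixes are based on the listing above and may differ from the physical names CloudFormation generates in your account.

import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Spot-check the log groups created by the stack; prefixes follow the listing above
# and may carry additional stack-generated suffixes in practice.
for prefix in ("lambdaProactive", "lambdaReactive", "lambdaSummary"):
    groups = logs.describe_log_groups(logGroupNamePrefix=prefix)["logGroups"]
    print(prefix, [group["logGroupName"] for group in groups])

# The dashboard name starts with the stack name (here assumed to be
# devopsguru-cloudwatch-dashboard).
dashboards = cloudwatch.list_dashboards(DashboardNamePrefix="devopsguru")["DashboardEntries"]
print([dashboard["DashboardName"] for dashboard in dashboards])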

To View Results/Outcome

With the appropriate time-range setup on CloudWatch Dashboard, you will be able to navigate through the insights that have been generated from DevOps Guru on the CloudWatch Dashboard.

Figure 11: DevOps Guru Insights in CloudWatch Dashboard

Cleanup

For cost optimization, after you complete and test this solution, clean up the resources. You can delete them manually if you used the AWS Console or by deleting the AWS CloudFormation stack called devopsguru-cloudwatch-dashboard if you used AWS CloudFormation.

For more information on deleting the stacks, see Deleting a stack on the AWS CloudFormation console.

Conclusion

This blog post outlined how you can integrate DevOps Guru insights into a CloudWatch Dashboard. As a customer, you can start leveraging CloudWatch Custom Widgets to include DevOps Guru Insights in an existing Operational dashboard.

AWS Customers are now using Amazon DevOps Guru to monitor and improve application performance. You can start monitoring your applications by following the instructions in the product documentation. Head over to the Amazon DevOps Guru console to get started today.

To learn more about AIOps for Serverless using Amazon DevOps Guru check out this video.

Suresh Babu

Suresh Babu is a DevOps Consultant at Amazon Web Services (AWS) with 21 years of experience in designing and implementing software solutions from various industries. He helps customers in Application Modernization and DevOps adoption. Suresh is a passionate public speaker and often speaks about DevOps and Artificial Intelligence (AI).

Venkat Devarajan

Venkat Devarajan is a Senior Solutions Architect at Amazon Webservices (AWS) supporting enterprise automotive customers. He has over 18 years of industry experience in helping customers design, build, implement and operate enterprise applications.

Ashwin Bhargava

Ashwin is a DevOps Consultant at AWS working in Professional Services Canada. He is a DevOps expert and a security enthusiast with more than 15 years of development and consulting experience.

Murty Chappidi

Murty is an APJ Partner Solutions Architecture Lead at Amazon Web Services with a focus on helping customers with accelerated and seamless journey to AWS by providing solutions through our GSI partners. He has more than 25 years’ experience in software and technology and has worked in multiple industry verticals. He is the APJ SME for AI for DevOps Focus Area. In his free time, he enjoys gardening and cooking.

How SOCAR handles large IoT data with Amazon MSK and Amazon ElastiCache for Redis

Post Syndicated from Younggu Yun original https://aws.amazon.com/blogs/big-data/how-socar-handles-large-iot-data-with-amazon-msk-and-amazon-elasticache-for-redis/

This is a guest blog post co-written with SangSu Park and JaeHong Ahn from SOCAR. 

As companies continue to expand their digital footprint, the importance of real-time data processing and analysis cannot be overstated. The ability to quickly measure and draw insights from data is critical in today’s business landscape, where rapid decision-making is key. With this capability, businesses can stay ahead of the curve and develop new initiatives that drive success.

This post is a continuation of How SOCAR built a streaming data pipeline to process IoT data for real-time analytics and control. In this post, we provide a detailed overview of streaming messages with Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon ElastiCache for Redis, covering technical aspects and design considerations that are essential for achieving optimal results.

SOCAR is the leading Korean mobility company with strong competitiveness in car-sharing. SOCAR wanted to design and build a solution for a new Fleet Management System (FMS). This system involves the collection, processing, storage, and analysis of Internet of Things (IoT) streaming data from various vehicle devices, as well as historical operational data such as location, speed, fuel level, and component status.

This post demonstrates a solution for SOCAR’s production application that allows them to load streaming data from Amazon MSK into ElastiCache for Redis, optimizing the speed and efficiency of their data processing pipeline. We also discuss the key features, considerations, and design of the solution.

Background

SOCAR operates about 20,000 cars and is planning to include other large vehicle types such as commercial vehicles and courier trucks. SOCAR has deployed in-car devices that capture data using AWS IoT Core. This data was then stored in Amazon Relational Database Service (Amazon RDS). The challenges with this approach included inefficient performance and high resource usage. Therefore, SOCAR looked for purpose-built databases tailored to the needs of their application and usage patterns, while meeting SOCAR’s future business and technical requirements. The key requirements for SOCAR included achieving maximum performance for real-time data analytics, which required storing data in an in-memory data store.

After careful consideration, ElastiCache for Redis was selected as the optimal solution due to its ability to handle complex data aggregation rules with ease. One of the challenges faced was loading data from Amazon MSK into the database, because there was no built-in Kafka connector and consumer available for this task. This post focuses on the development of a Kafka consumer application that was designed to tackle this challenge by enabling performant data loading from Amazon MSK to Redis.

Solution overview

Extracting valuable insights from streaming data can be a challenge for businesses with diverse use cases and workloads. That’s why SOCAR built a solution to seamlessly bring data from Amazon MSK into multiple purpose-built databases, while also empowering users to transform data as needed. With fully managed Apache Kafka, Amazon MSK provides a reliable and efficient platform for ingesting and processing real-time data.

The following figure shows an example of the data flow at SOCAR.

solution overview

This architecture consists of three components:

  • Streaming data – Amazon MSK serves as a scalable and reliable platform for streaming data, capable of receiving and storing messages from a variety of sources, including AWS IoT Core, with messages organized into multiple topics and partitions
  • Consumer application – With a consumer application, users can seamlessly bring data from Amazon MSK into a target database or data storage while also defining data transformation rules as needed
  • Target databases – With the consumer application, the SOCAR team was able to load data from Amazon MSK into two separate databases, each serving a specific workload

Although this post focuses on a specific use case with ElastiCache for Redis as the target database and a single topic called gps, the consumer application we describe can handle additional topics and messages, as well as different streaming sources and target databases such as Amazon DynamoDB. Our post covers the most important aspects of the consumer application, including its features and components, design considerations, and a detailed guide to the code implementation.

Components of the consumer application

The consumer application comprises three main parts that work together to consume, transform, and load messages from Amazon MSK into a target database. The following diagram shows an example of data transformations in the handler component.

consumer-application

The details of each component are as follows:

  • Consumer – This consumes messages from Amazon MSK and then forwards the messages to a downstream handler.
  • Loader – This is where users specify a target database. For example, SOCAR’s target databases include ElastiCache for Redis and DynamoDB.
  • Handler – This is where users can apply data transformation rules to the incoming messages before loading them into a target database.

Features of the consumer application

This connection has three features:

  • Scalability – This solution is designed to be scalable, ensuring that the consumer application can handle an increasing volume of data and accommodate additional applications in the future. For instance, SOCAR sought to develop a solution capable of handling not only the current data from approximately 20,000 vehicles but also a larger volume of messages as the business and data continue to grow rapidly.
  • Performance – With this consumer application, users can achieve consistent performance, even as the volume of source messages and target databases increases. The application supports multithreading, allowing for concurrent data processing, and can handle unexpected spikes in data volume by easily increasing compute resources.
  • Flexibility – This consumer application can be reused for any new topics without having to build the entire consumer application again. The consumer application can be used to ingest new messages with different configuration values in the handler. SOCAR deployed multiple handlers to ingest many different messages. Also, this consumer application allows users to add additional target locations. For example, SOCAR initially developed a solution for ElastiCache for Redis and then replicated the consumer application for DynamoDB.

Design considerations of the consumer application

Note the following design considerations for the consumer application:

  • Scale out – A key design principle of this solution is scalability. To achieve this, the consumer application runs with Amazon Elastic Kubernetes Service (Amazon EKS) because it can allow users to increase and replicate consumer applications easily.
  • Consumption patterns – To receive, store, and consume data efficiently, it’s important to design Kafka topics depending on messages and consumption patterns. Depending on messages consumed at the end, messages can be received into multiple topics of different schemas. For example, SOCAR has many different topics that are consumed by different workloads.
  • Purpose-built database – The consumer application supports loading data into multiple target options based on the specific use case. For example, SOCAR stored real-time IoT data in ElastiCache for Redis to power real-time dashboard and web applications, while storing recent trip information in DynamoDB that didn’t require real-time processing.

Walkthrough overview

The producer of this solution is AWS IoT Core, which sends out messages into a topic called gps. The target database of this solution is ElastiCache for Redis. ElastiCache for Redis is a fast in-memory data store that provides sub-millisecond latency to power internet-scale, real-time applications. Built on open-source Redis and compatible with the Redis APIs, ElastiCache for Redis combines the speed, simplicity, and versatility of open-source Redis with the manageability, security, and scalability from Amazon to power the most demanding real-time applications.
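
The Go handler shown later in this post stores GPS coordinates as Redis geolocation members. As a minimal, language-neutral illustration of that pattern, the following redis-py sketch writes and queries a geo key; the endpoint, key, and member names are placeholders, and a cluster-mode ElastiCache deployment would typically use a cluster-aware client and proper credentials.

import redis

# Placeholder endpoint for an ElastiCache for Redis cluster.
client = redis.Redis(
    host="my-cluster.xxxxxx.clustercfg.apn2.cache.amazonaws.com",
    port=6379,
    ssl=True,
)

# Store one vehicle position as a geospatial member, then query nearby vehicles.
client.execute_command("GEOADD", "cars#gps", 127.0276, 37.4979, "car-0001")
nearby = client.execute_command(
    "GEOSEARCH", "cars#gps",
    "FROMLONLAT", 127.0276, 37.4979,
    "BYRADIUS", 5, "km", "ASC",
)
print(nearby)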

The target location can be either another database or storage depending on the use case and workload. SOCAR uses Amazon EKS to operate the containerized solution to achieve scalability, performance, and flexibility. Amazon EKS is a managed Kubernetes service to run Kubernetes in the AWS Cloud. Amazon EKS automatically manages the availability and scalability of the Kubernetes control plane nodes responsible for scheduling containers, managing application availability, storing cluster data, and other key tasks.

For the programming language, the SOCAR team decided to use Go, utilizing both the AWS SDK for Go and Goroutines, lightweight threads managed by the Go runtime that make it easy to run many concurrent tasks. The AWS SDK for Go simplifies the use of AWS services by providing a set of libraries that are consistent and familiar for Go developers.

In the following sections, we walk through the steps to implement the solution:

  1. Create a consumer.
  2. Create a loader.
  3. Create a handler.
  4. Build a consumer application with the consumer, loader, and handler.
  5. Deploy the consumer application.

Prerequisites

For this walkthrough, you should have the following:

Create a consumer

In this example, we use a topic called gps, and the consumer includes a Kafka client that receives messages from the topic. SOCAR created a struct and built a consumer (called NewConsumer in the code) to make it extendable. With this approach, any additional parameters and rules can be added easily.

To authenticate with Amazon MSK, SOCAR uses IAM. Because SOCAR already uses IAM to authenticate other resources, such as Amazon EKS, it uses the same IAM role, together with the aws_msk_iam_v2 SASL mechanism shown in the following code, to authenticate clients for both Amazon MSK and Apache Kafka actions.

The following code creates the consumer:

type Consumer struct {
	logger      *zerolog.Logger
	kafkaReader *kafka.Reader
}

func NewConsumer(logger *zerolog.Logger, awsCfg aws.Config, brokers []string, consumerGroupID, topic string) *Consumer {
	return &Consumer{
		logger: logger,
		kafkaReader: kafka.NewReader(kafka.ReaderConfig{
			Dialer: &kafka.Dialer{
				TLS:           &tls.Config{MinVersion: tls.VersionTLS12},
				Timeout:       10 * time.Second,
				DualStack:     true,
				SASLMechanism: aws_msk_iam_v2.NewMechanism(awsCfg),
			},
			Brokers:     brokers,
			GroupID:     consumerGroupID,
			Topic:       topic,
			StartOffset: kafka.LastOffset,
		}),
	}
}

func (consumer *Consumer) Close() error {
	var err error = nil
	if consumer.kafkaReader != nil {
		err = consumer.kafkaReader.Close()
		consumer.logger.Info().Msg("closed kafka reader")
	}
	return err
}

func (consumer *Consumer) Consume(ctx context.Context) (kafka.Message, error) {
	return consumer.kafkaReader.ReadMessage(ctx)
}

Create a loader

The loader function, represented by the Loader struct, is responsible for loading messages to the target location, which in this case is ElastiCache for Redis. The NewLoader function initializes a new instance of the Loader struct with a logger and a Redis cluster client, which is used to communicate with the ElastiCache cluster. The redis.NewClusterClient object is initialized using the NewRedisClient function, which uses IAM to authenticate the client for Redis actions. This ensures secure and authorized access to the ElastiCache cluster. The Loader struct also contains the Close method to close the Kafka reader and free up resources.

The following code creates a loader:

type Loader struct {
	logger      *zerolog.Logger
	redisClient *redis.ClusterClient
}

func NewLoader(logger *zerolog.Logger, redisClient *redis.ClusterClient) *Loader {
	return &Loader{
		logger:      logger,
		redisClient: redisClient,
	}
}

func (consumer *Consumer) Close() error {
	var err error = nil
	if consumer.kafkaReader != nil {
		err = consumer.kafkaReader.Close()
		consumer.logger.Info().Msg("closed kafka reader")
	}
	return err
}

func (consumer *Consumer) Consume(ctx context.Context) (kafka.Message, error) {
	return consumer.kafkaReader.ReadMessage(ctx)
}

func NewRedisClient(ctx context.Context, awsCfg aws.Config, addrs []string, replicationGroupID, username string) (*redis.ClusterClient, error) {
	redisClient := redis.NewClusterClient(&redis.ClusterOptions{
		NewClient: func(opt *redis.Options) *redis.Client {
			return redis.NewClient(&redis.Options{
				Addr: opt.Addr,
				CredentialsProvider: func() (username string, password string) {
					token, err := BuildRedisIAMAuthToken(ctx, awsCfg, replicationGroupID, opt.Username)
					if err != nil {
						panic(err)
					}
					return opt.Username, token
				},
				PoolSize:    opt.PoolSize,
				PoolTimeout: opt.PoolTimeout,
				TLSConfig:   &tls.Config{InsecureSkipVerify: true},
			})
		},
		Addrs:       addrs,
		Username:    username,
		PoolSize:    100,
		PoolTimeout: 1 * time.Minute,
	})
	pong, err := redisClient.Ping(ctx).Result()
	if err != nil {
		return nil, err
	}
	if pong != "PONG" {
		return nil, fmt.Errorf("failed to verify connection to redis server")
	}
	return redisClient, nil
}

Create a handler

A handler is used to include business rules and data transformation logic that prepares data before loading it into the target location. It acts as a bridge between a consumer and a loader. In this example, the topic name is cars.gps.json, and the message includes two keys, lng and lat, with data type Float64. The business logic can be defined in a function like handlerFuncGpsToRedis and then applied as follows:

type (
	handlerFunc    func(ctx context.Context, loader *Loader, key, value []byte) error
	handlerFuncMap map[string]handlerFunc
)

var HandlerRedis = handlerFuncMap{
	"cars.gps.json": handlerFuncGpsToRedis,
}

func GetHandlerFunc(funcMap handlerFuncMap, topic string) (handlerFunc, error) {
	handlerFunc, exist := funcMap[topic]
	if !exist {
		return nil, fmt.Errorf("failed to find handler func for '%s'", topic)
	}
	return handlerFunc, nil
}

func handlerFuncGpsToRedis(ctx context.Context, loader *Loader, key, value []byte) error {
	// unmarshal raw data to map
	data := map[string]interface{}{}
	err := json.Unmarshal(value, &data)
	if err != nil {
		return err
	}

	// prepare things to load on redis as geolocation
	name := string(key)
	lng, err := getFloat64ValueFromMap(data, "lng")
	if err != nil {
		return err
	}
	lat, err := getFloat64ValueFromMap(data, "lat")
	if err != nil {
		return err
	}

	// add geolocation to redis
	return loader.RedisGeoAdd(ctx, "cars#gps", name, lng, lat)
}

Build a consumer application with the consumer, loader, and handler

Now you have created the consumer, loader, and handler. The next step is to build a consumer application using them. In a consumer application, you read messages from your stream with a consumer, transform them using a handler, and then load transformed messages into a target location with a loader. These three components are parameterized in a consumer application function such as the one shown in the following code:

type Connector struct {
	ctx    context.Context
	logger *zerolog.Logger

	consumer *Consumer
	handler  handlerFuncMap
	loader   *Loader
}

func NewConnector(ctx context.Context, logger *zerolog.Logger, consumer *Consumer, handler handlerFuncMap, loader *Loader) *Connector {
	return &Connector{
		ctx:    ctx,
		logger: logger,

		consumer: consumer,
		handler:  handler,
		loader:   loader,
	}
}

func (connector *Connector) Close() error {
	var err error = nil
	if connector.consumer != nil {
		err = connector.consumer.Close()
	}
	if connector.loader != nil {
		err = connector.loader.Close()
	}
	return err
}

func (connector *Connector) Run() error {
	wg := sync.WaitGroup{}
	defer wg.Wait()
	handlerFunc, err := GetHandlerFunc(connector.handler, connector.consumer.kafkaReader.Config().Topic)
	if err != nil {
		return err
	}
	for {
		msg, err := connector.consumer.Consume(connector.ctx)
		if err != nil {
			if errors.Is(err, context.Canceled) {
				break
			}
		}

		wg.Add(1)
		go func(key, value []byte) {
			defer wg.Done()
			err = handlerFunc(connector.ctx, connector.loader, key, value)
			if err != nil {
				connector.logger.Err(err).Msg("")
			}
		}(msg.Key, msg.Value)
	}
	return nil
}

Deploy the consumer application

To achieve maximum parallelism, SOCAR containerizes the consumer application and deploys it into multiple pods on Amazon EKS. Each consumer application contains a unique consumer, loader, and handler. For example, if you need to receive messages from a single topic with five partitions, you can deploy five identical consumer applications, each running in its own pod. Similarly, if you have two topics with three partitions each, you should deploy two consumer applications, resulting in a total of six pods. It’s a best practice to run one consumer application per topic, and the number of pods should match the number of partitions to enable concurrent message processing. The pod count can be specified in the Kubernetes deployment configuration.

There are two stages in the Dockerfile. The first stage is the builder, which installs build tools and dependencies, and builds the application. The second stage is the runner, which uses a smaller base image (Alpine) and copies only the necessary files from the builder stage. It also sets the appropriate user permissions and runs the application. It’s also worth noting that the builder stage uses a specific version of the Golang image, while the runner stage uses a specific version of the Alpine image, both of which are considered to be lightweight and secure images.

The following code is an example of the Dockerfile:

# builder
FROM golang:1.18.2-alpine3.16 AS builder
RUN apk add build-base
WORKDIR /usr/src/app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN go build -o connector .

# runner
FROM alpine:3.16.0 AS runner
WORKDIR /usr/bin/app
RUN apk add --no-cache tzdata
RUN addgroup --system app && adduser --system --shell /bin/false --ingroup app app
COPY --from=builder /usr/src/app/connector .
RUN chown -R app:app /usr/bin/app
USER app
ENTRYPOINT ["/usr/bin/app/connector"]

Conclusion

In this post, we discussed SOCAR’s approach to building a consumer application that enables IoT real-time streaming from Amazon MSK to target locations such as ElastiCache for Redis. We hope you found this post informative and useful. Thank you for reading!


About the Authors

SangSu Park is the Head of Operation Group at SOCAR. His passion is to keep learning, embrace challenges, and strive for mutual growth through communication. He loves to travel in search of new cities and places.

JaeHong Ahn is a DevOps Engineer in SOCAR’s cloud infrastructure team. He is dedicated to promoting collaboration between developers and operators. He enjoys creating DevOps tools and is committed to using his coding abilities to help build a better world. He loves to cook delicious meals as a private chef for his wife.

Younggu Yun works at AWS Data Lab in Korea. His role involves helping customers across the APAC region meet their business objectives and overcome technical challenges by providing prescriptive architectural guidance, sharing best practices, and building innovative solutions together.