How Gilead used Amazon Redshift to quickly and cost-effectively load third-party medical claims data

Post Syndicated from Rajiv Arora original https://aws.amazon.com/blogs/big-data/how-gilead-used-amazon-redshift-to-quickly-and-cost-effectively-load-third-party-medical-claims-data/

This post was co-written with Rajiv Arora, Director of Data Science Platform at Gilead Life Sciences.

Gilead Sciences, Inc. is a biopharmaceutical company committed to advancing innovative medicines to prevent and treat life-threatening diseases, including HIV, viral hepatitis, inflammation, and cancer. A leader in virology, Gilead historically relied on these drugs for growth, but through strategic investments it is now expanding its focus in oncology, having acquired Kite and Immunomedics to boost its exposure to cell therapy and non-cell therapy and make oncology its primary growth engine. Because Gilead is expanding into biologics and large molecule therapies, with an ambitious goal of launching 10 innovative therapies by 2030, there is heavy emphasis on using data with AI and machine learning (ML) to accelerate the drug discovery pipeline.

Amazon Redshift Serverless is a fully managed cloud data warehouse that allows you to seamlessly create your data warehouse with no infrastructure management required. You pay only for the compute resources and storage that you use. Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs), which are part of the compute resources. All of the data stored in your warehouse, such as tables, views, and users, make up a namespace in Redshift Serverless.

One of the benefits of Redshift Serverless is that you don’t need to size your data warehouse for your peak workload. The peak workload includes loading periodic large datasets in multi-terabyte range. You can set a base RPU from 8 up to 512 and Redshift Serverless will automatically scale the RPUs to meet your workload demands. This makes it straightforward to manage your data warehouse in a cost-effective manner.
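As a sketch of what this looks like in practice, the following Python snippet builds the arguments for creating a Redshift Serverless workgroup with a chosen base RPU capacity via boto3's redshift-serverless API. The workgroup and namespace names are illustrative placeholders:

```python
# Illustrative helper: build the arguments for creating a Redshift Serverless
# workgroup with a chosen base capacity. Names are placeholders.

VALID_RPU_RANGE = (8, 512)

def workgroup_params(workgroup: str, namespace: str, base_rpus: int) -> dict:
    lo, hi = VALID_RPU_RANGE
    if not lo <= base_rpus <= hi:
        raise ValueError(f"base capacity must be between {lo} and {hi} RPUs")
    return {
        "workgroupName": workgroup,
        "namespaceName": namespace,
        "baseCapacity": base_rpus,
    }

# The actual call would then be:
#   boto3.client("redshift-serverless").create_workgroup(**params)
params = workgroup_params("claims-ingest-wg", "claims-ns", 512)
```

Redshift Serverless scales RPUs above this base automatically as the workload demands, so the base capacity is a floor, not a cap.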

In this post, we share how Gilead collaborated with AWS to redesign their data ingestion process. They used Redshift Serverless as their data producer to load third-party medical claims data in a fast and cost-effective way, reducing load times from days to hours.

Gilead use case

Gilead loads a variety of data from hundreds of sources into their R&D data environment. They recently needed to load 140 TB of uncompressed healthcare claims data each month, within 24 hours of receiving it, to give analysts and data scientists up-to-date information on a patient’s healthcare journey. This data volume is expected to grow monthly and is fully refreshed each month. The 3-node RA3 16XL provisioned cluster that had previously hosted their warehouse took around 12 hours to ingest this data into Amazon Redshift, and Gilead wanted to make the ingestion process more dynamic. Working with Amazon Redshift specialists from AWS, Gilead chose Redshift Serverless to cost-effectively load this data, then used Redshift data sharing to share the final dataset with two additional Redshift data warehouses for end-user queries.

Loading data is a key process for any analytical system, including Amazon Redshift. When loading very large datasets, it’s important to not only load the data as quickly as possible but also in a way that optimizes the consumption queries.

Gilead’s healthcare claims data previously took 40 hours to load, which delayed downstream processes. The teams targeted a maximum 24-hour SLA for the load and ultimately achieved it in 8 hours, an 80% reduction in the time to make the data available.

Solution overview

After collaborating, the Gilead and AWS teams decided on a two-step process to load the data to Amazon Redshift. First, the data was loaded without a distkey and sortkey, which let the load process use the full parallel resources of the cluster. Then we used a deep copy to redistribute this data and add the desired distribution and sort characteristics.
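The two-step process can be sketched as SQL. The following Python helper builds the deep-copy statement used in the second step; the table and column names (claims_staging, patient_id, claim_date) are hypothetical placeholders, not Gilead's actual schema:

```python
# Sketch of the deep-copy step: CTAS from the unkeyed staging table into the
# final table with the desired DISTKEY and SORTKEY. All names are hypothetical.

def deep_copy_sql(staging: str, final: str, distkey: str, sortkey: str) -> str:
    return (
        f"CREATE TABLE {final} "
        f"DISTKEY({distkey}) SORTKEY({sortkey}) AS "
        f"SELECT * FROM {staging};"
    )

# Step 1 would be a plain COPY into the staging table, for example:
#   COPY claims_staging FROM 's3://bucket/claims/' IAM_ROLE '...' GZIP;
sql = deep_copy_sql("claims_staging", "claims", "patient_id", "claim_date")
```

Loading without keys first lets every slice ingest in parallel; the CTAS then pays the redistribution cost once, after the data is already inside the cluster.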

The solution uses Redshift Serverless. The team wanted to ingest data to meet the required SLA, and the following approaches were benchmarked:

  • COPY command – The COPY command uses the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files on Amazon Simple Storage Service (Amazon S3)
  • Data lake analytics – Amazon Redshift Spectrum queries data directly from files on Amazon S3, selecting only a subset of columns and avoiding the intermediate step of copying data into a staging table

Initial solution approach: single COPY command

The team determined it would be more effective to apply the distribution and sort keys in a post-copy step. The data was first loaded using automatic distribution, which took roughly 12 hours to complete. The team then created open and closed claims tables with defined distribution keys and with 20% of the columns, to alleviate the need to query the larger table. With this success, we learned that we could still improve the large COPY operation, as detailed in the following sections.

Proposed solution approach 1: parallel COPY commands

Based on the initial solution approach above, the team tested yearly parallel COPY commands, as illustrated in the following diagram.

Yearly Parallel Copy Commands

Below are the findings and learnings from this approach:

  • Ingesting 4 years of data using parallel COPY commands showed a 25% performance improvement over the single COPY command.
  • Compared to the initial solution approach, which took 12 hours to ingest the data, segregating the ingestion into separate yearly staging tables and running parallel COPY commands reduced the runtime by a further 67%.
  • After the data was loaded into the yearly staging tables, we created the open and closed claims tables with an automatic distribution key and the subset of columns required by the larger reporting groups. Creating them took an additional hour.

The team used a manifest file to make sure that each COPY command loaded all of the required files for its respective year.
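A COPY manifest is a small JSON document listing the exact S3 files to load. The following sketch builds one; the bucket and key names are illustrative:

```python
import json

# Build a Redshift COPY manifest listing the files for one year's load.
# Bucket and key names are illustrative.

def copy_manifest(s3_urls: list, mandatory: bool = True) -> str:
    entries = [{"url": url, "mandatory": mandatory} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)

manifest = copy_manifest([
    "s3://claims-bucket/2020/part-0001.gz",
    "s3://claims-bucket/2020/part-0002.gz",
])
# The COPY command then references the manifest, for example:
#   COPY claims_2020 FROM 's3://claims-bucket/manifests/2020.manifest'
#   IAM_ROLE '...' MANIFEST GZIP;
```

Setting "mandatory" to true makes the COPY fail if any listed file is missing, which is a useful guard when a monthly delivery is expected to be complete.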

Proposed solution approach 2: data lake analytics

The team used Redshift Spectrum to load only the required columns into Redshift Serverless, avoiding the multiple yearly staging tables and loading directly into a single table. The following diagram illustrates this approach.

Using Spectrum Approach

The workflow consists of the following steps:

  1. Crawl the files using AWS Glue.
  2. Create a data lake external schema and table in Redshift Serverless.
  3. Create two separate claims tables for open and closed claims, because open claims are the most frequently consumed and need only 20% of the columns across 100% of the data.
  4. Create the open and closed tables with only the columns needed for optimal performance during consumption, instead of all columns in the original third-party dataset. The data volume distribution is as follows:
    • Total number of open claims records = 50 billion
    • Total number of closed claims records = 200 billion
    • Overall, total number of records = 250 billion
  5. Distribute open and closed tables with a customer-identified distkey.
  6. Configure data ingestion into the open and closed claims tables combined using Redshift Serverless with 512 RPUs. This took 1.5 hours, a further 70% improvement over approach 1. We chose 512 RPUs to load the data as quickly as possible.
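The steps above can be sketched as SQL generated from Python: the external schema maps the AWS Glue Data Catalog database, and a CTAS statement pulls only the needed columns straight into Redshift. All schema, table, and column names below are hypothetical:

```python
# Sketch of the Spectrum-based load. The external schema is created once;
# the CTAS selects only the required columns from the data lake table.
# Object and column names here are illustrative, not Gilead's schema.

def spectrum_ctas(ext_schema: str, ext_table: str, target: str,
                  columns: list, distkey: str, predicate: str) -> str:
    cols = ", ".join(columns)
    return (
        f"CREATE TABLE {target} DISTKEY({distkey}) AS "
        f"SELECT {cols} FROM {ext_schema}.{ext_table} WHERE {predicate};"
    )

# One-time setup:
#   CREATE EXTERNAL SCHEMA claims_lake FROM DATA CATALOG
#   DATABASE 'claims_db' IAM_ROLE '...';
open_claims_sql = spectrum_ctas(
    "claims_lake", "raw_claims", "open_claims",
    ["claim_id", "patient_id", "service_date", "status"],
    distkey="patient_id", predicate="status = 'OPEN'",
)
```

Because the column pruning and the open/closed filtering happen in the SELECT, only the bytes that will actually be stored cross from Amazon S3 into the warehouse.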

In this method, data ingestion was streamlined by loading only the essential fields from the medical claims dataset and by splitting the table into open and closed claims. Open claims data is the most frequently accessed and constitutes only 20% of the columns, so by splitting the tables, the team improved not only ingestion performance but also consumption.

Amazon Redshift recently launched automatic mounting of AWS Glue Data Catalog, making it easier to run data lake analytics without manually creating external schemas. You can query data lake tables directly from Amazon Redshift Query Editor v2 or your favorite SQL editors.

Recommendations and best practices

Consider the following recommendations when loading large-scale data in Amazon Redshift.

  • Use Redshift Serverless with the maximum of 512 RPUs to load data quickly and efficiently
  • Depending on the consumption use case and query pattern, adopt one of the following approaches:
    • When consumption queries require only selected fields from the dataset and most frequently access a subset of data, use data lake queries to load only the relevant columns from Amazon S3 into Amazon Redshift
    • When consumption queries require all fields, use COPY commands with manifest files to ingest data in parallel into multiple logically separated tables, and create a database view with a UNION ALL of all the tables
  • Avoid VARCHAR(MAX) when creating tables; size VARCHAR columns appropriately
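As a sketch of the second approach, the following helper builds a UNION ALL view over yearly tables so consumers query one logical table; the view and table names are illustrative:

```python
# Sketch: a view that UNION ALLs the yearly staging tables so consumers can
# query one logical table. View and table names are illustrative.

def union_all_view(view_name: str, tables: list) -> str:
    body = " UNION ALL ".join(f"SELECT * FROM {t}" for t in tables)
    return f"CREATE VIEW {view_name} AS {body};"

sql = union_all_view("claims_all", [f"claims_{year}" for year in range(2019, 2023)])
```

UNION ALL (rather than UNION) is deliberate here: it avoids an expensive de-duplication pass that the yearly partitioning makes unnecessary.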

Final Architecture

The following diagram shows the high-level final architecture that was implemented.

Final Architecture

Conclusion

With the scalability of Redshift Serverless, data sharing to decouple ingestion from consumption workloads, and data lake analytics to ingest data, Gilead made their 140 TB dataset available to their analysts within hours of it being delivered. The innovative architecture of using a serverless ingestion data warehouse, a serverless consumption data warehouse for power users, and their original 3-node provisioned cluster for standard queries gives Gilead isolation to ensure data loads don’t affect their users. The architecture provides scalability to serve infrequent large queries with their serverless consumer along with the benefit of a fixed-cost and fixed-performance option of their provisioned cluster for their standard user queries. Due to the monthly schedule of the data load and the variable need for large queries by consumers, Redshift Serverless proved to be a cost-effective option compared to simply increasing the provisioned cluster to serve each of these use cases.

This split producer/consumer model of using Redshift Serverless can benefit many workloads with performance characteristics similar to Gilead’s warehouse. Many customers run large data loads infrequently, and those processes compete with user queries. With this pattern, your queries perform consistently regardless of whether new data is being loaded into the system. This balances cost against performance, and it frees system administrators to load data without affecting users.


About the Authors

Rajiv Arora is a Director of Clinical Data Science at Gilead Sciences with over 20 years of experience in the industry. He is responsible for the multi-modal data platform for the development organization and supports all statistical and predictive analytical infrastructure for RWE and Advanced Analytical functions.

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Brent Strong is a Senior Solutions Architect in the Healthcare and Life Sciences team at AWS. He has more than 15 years of experience in the industry, focusing on data and analytics and DevOps. At AWS, he works closely with large Life Sciences customers to help them deliver new and innovative treatments.

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

GoDaddy benchmarking results in up to 24% better price-performance for their Spark workloads with AWS Graviton2 on Amazon EMR Serverless

Post Syndicated from Mukul Sharma original https://aws.amazon.com/blogs/big-data/godaddy-benchmarking-results-in-up-to-24-better-price-performance-for-their-spark-workloads-with-aws-graviton2-on-amazon-emr-serverless/

This is a guest post co-written with Mukul Sharma, Software Development Engineer, and Ozcan Ilikhan, Director of Engineering, from GoDaddy.

GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 22 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, attract customers, and manage their work.

GoDaddy is a data-driven company, and getting meaningful insights from data helps us drive business decisions to delight our customers. At GoDaddy, we embarked on a journey to uncover the efficiency promises of AWS Graviton2 on Amazon EMR Serverless as part of our long-term vision for cost-effective intelligent computing.

In this post, we share the methodology and results of our benchmarking exercise comparing the cost-effectiveness of EMR Serverless on the arm64 (Graviton2) architecture against the traditional x86_64 architecture. EMR Serverless on Graviton2 demonstrated an advantage in cost-effectiveness, resulting in significant savings in total run costs. We achieved 23.85% improvement in price-performance for sample production Spark workloads—an outcome that holds tremendous potential for businesses striving to maximize their computing efficiency.

Solution overview

GoDaddy’s intelligent compute platform envisions simplification of compute operations for all personas, without limiting power users, to ensure out-of-the-box cost and performance optimization for data and ML workloads. As a part of this vision, GoDaddy’s Data & ML Platform team plans to use EMR Serverless as one of the compute solutions under the hood.

The following diagram shows a high-level illustration of the intelligent compute platform vision.

Benchmarking EMR Serverless for GoDaddy

EMR Serverless is a serverless option in Amazon EMR that eliminates the complexities of configuring, managing, and scaling clusters when running big data frameworks like Apache Spark and Apache Hive. With EMR Serverless, businesses can enjoy numerous benefits, including cost-effectiveness, faster provisioning, simplified developer experience, and improved resilience to Availability Zone failures.

At GoDaddy, we embarked on a comprehensive study to benchmark EMR Serverless using real production workflows at GoDaddy. The purpose of the study was to evaluate the performance and efficiency of EMR Serverless and develop a well-informed adoption plan. The results of the study have been extremely promising, showcasing the potential of EMR Serverless for our workloads.

Having achieved compelling results in favor of EMR Serverless for our workloads, our attention turned to evaluating the utilization of the Graviton2 (arm64) architecture on EMR Serverless. In this post, we focus on comparing the performance of Graviton2 (arm64) with the x86_64 architecture on EMR Serverless. By conducting this apples-to-apples comparative analysis, we aim to gain valuable insights into the benefits and considerations of using Graviton2 for our big data workloads.

By using EMR Serverless and exploring the performance of Graviton2, GoDaddy aims to optimize their big data workflows and make informed decisions regarding the most suitable architecture for their specific needs. The combination of EMR Serverless and Graviton2 presents an exciting opportunity to enhance the data processing capabilities and drive efficiency in our operations.

AWS Graviton2

The Graviton2 processors are specifically designed by AWS, utilizing powerful 64-bit Arm Neoverse cores. This custom-built architecture provides a remarkable boost in price-performance for various cloud workloads.

In terms of cost, Graviton2 offers an appealing advantage. As indicated in the following table, the pricing for Graviton2 is 20% lower compared to the x86 architecture option.

                           x86_64       arm64 (Graviton2)
Per vCPU per hour          $0.052624    $0.042094
Per GB per hour            $0.0057785   $0.004628
Per storage GB per hour*   $0.000111    (same for both architectures)

*Ephemeral storage: 20 GB of ephemeral storage is available for all workers by default—you pay only for any additional storage that you configure per worker.
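To see how the roughly 20% price advantage follows from these rates, consider a hypothetical worker size of 4 vCPUs and 16 GB of memory (the worker dimensions are illustrative, not GoDaddy's configuration); the hourly cost is the sum of the vCPU and memory charges:

```python
# Reproduce the ~20% price gap from the per-vCPU and per-GB rates above,
# for a hypothetical 4 vCPU / 16 GB worker (sizes are illustrative).

def worker_cost_per_hour(vcpus: int, mem_gb: int,
                         vcpu_rate: float, gb_rate: float) -> float:
    return vcpus * vcpu_rate + mem_gb * gb_rate

x86_cost = worker_cost_per_hour(4, 16, 0.052624, 0.0057785)  # ~$0.3030/hour
arm_cost = worker_cost_per_hour(4, 16, 0.042094, 0.004628)   # ~$0.2424/hour
savings = 1 - arm_cost / x86_cost                            # ~0.20, i.e. ~20% cheaper
```

Because both the vCPU and the memory rates are discounted by the same fraction, the ~20% saving holds for any worker shape.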

For specific pricing details and current information, refer to Amazon EMR pricing.

AWS benchmark

The AWS team performed benchmark tests on Spark workloads with Graviton2 on EMR Serverless using the TPC-DS 3 TB scale performance benchmarks. The summary of their analysis is as follows:

  • Graviton2 on EMR Serverless demonstrated an average improvement of 10% for Spark workloads in terms of runtime. This indicates that the runtime for Spark-based tasks was reduced by approximately 10% when utilizing Graviton2.
  • Although the majority of queries showcased improved performance, a small subset of queries experienced a regression of up to 7% on Graviton2. These specific queries showed a slight decrease in performance compared to the x86 architecture option.
  • In addition to the performance analysis, the AWS team considered the cost factor. Graviton2 is offered at a 20% lower cost than the x86 architecture option. Taking this cost advantage into account, the AWS benchmark set yielded an overall 27% better price-performance for workloads. This means that by using Graviton2, users can achieve a 27% improvement in performance per unit of cost compared to the x86 architecture option.
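A quick sanity check of these numbers: a ~10% runtime reduction combined with 20% lower per-hour pricing implies roughly 28% lower cost for the same workload, in line with the reported ~27% price-performance figure (the exact published number depends on how the per-query results are aggregated):

```python
# Relative cost of running the same workload on Graviton2, assuming the
# average ~10% runtime improvement and the 20% lower per-hour price.
runtime_ratio = 0.90  # Graviton2 runtime / x86_64 runtime
price_ratio = 0.80    # Graviton2 hourly price / x86_64 hourly price

relative_cost = runtime_ratio * price_ratio  # 0.72: same work at 72% of the cost
improvement = 1 - relative_cost              # ~0.28, close to the reported ~27%
```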

These findings highlight the significant benefits of using Graviton2 on EMR Serverless for Spark workloads, with improved performance and cost-efficiency. It showcases the potential of Graviton2 in delivering enhanced price-performance ratios, making it an attractive choice for organizations seeking to optimize their big data workloads.

GoDaddy benchmark

During our initial experimentation, we observed that arm64 on EMR Serverless consistently outperformed or performed on par with x86_64. One of the jobs showed a 7.51% increase in resource usage on arm64 compared to x86_64, but due to the lower price of arm64, it still resulted in a 13.48% cost reduction. In another instance, we achieved an impressive 43.7% reduction in run cost, attributed to both the lower price and reduced resource utilization. Overall, our initial tests indicated that arm64 on EMR Serverless delivered superior price-performance compared to x86_64. These promising findings motivated us to conduct a more comprehensive and rigorous study.

Benchmark results

To gain a deeper understanding of the value of Graviton2 on EMR Serverless, we conducted our study using real-life production workloads from GoDaddy, which are scheduled to run at a daily cadence. Without any exceptions, EMR Serverless on arm64 (Graviton2) is significantly more cost-effective compared to the same jobs run on EMR Serverless on the x86_64 architecture. In fact, we recorded an impressive 23.85% improvement in price-performance across the sample GoDaddy jobs using Graviton2.

Like the AWS benchmarks, we observed slight regressions of less than 5% in the total runtime of some jobs. However, given that these jobs will be migrated from Amazon EMR on EC2 to EMR Serverless, the overall total runtime will still be shorter due to the minimal provisioning time in EMR Serverless. Additionally, across all jobs, we observed an average speed up of 2.1% in addition to the cost savings achieved.

These benchmarking results provide compelling evidence of the value and effectiveness of Graviton2 on EMR Serverless. The combination of improved price-performance, shorter runtimes, and overall cost savings makes Graviton2 a highly attractive option for optimizing big data workloads.

Benchmarking methodology

As an extension of GoDaddy’s larger EMR Serverless benchmarking study, in which we divided Spark jobs into brackets based on total runtime (quick-run, medium-run, long-run), we measured the effect of architecture (arm64 vs. x86_64) on total cost and total runtime. All other parameters were kept the same to achieve an apples-to-apples comparison.

The team followed these steps:

  1. Prepare the data and environment.
  2. Choose two random production jobs from each job bracket.
  3. Make necessary changes to avoid interference with actual production outputs.
  4. Run tests to execute scripts over multiple iterations to collect accurate and consistent data points.
  5. Validate input and output datasets, partitions, and row counts to ensure identical data processing.
  6. Gather relevant metrics from the tests.
  7. Analyze results to draw insights and conclusions.

The following table shows the summary of an example Spark job.

Metric                     EMR Serverless (Average) – x86_64   EMR Serverless (Average) – Graviton   x86_64 vs Graviton (% Difference)
Total Run Cost             $2.76                               $1.85                                 32.97%
Total Runtime (hh:mm:ss)   00:41:31                            00:34:32                              16.82%
EMR Release Label          emr-6.9.0
Job Type                   Spark
Spark Version              Spark 3.3.0
Hadoop Distribution        Amazon 3.3.3
Hive/HCatalog Version      Hive 3.1.3, HCatalog 3.1.3

Summary of results

The following table presents a comparison of job performance between EMR Serverless on arm64 (Graviton2) and EMR Serverless on x86_64. For each architecture, every job was run at least three times to obtain the accurate average cost and runtime.

 Job  Average x86_64 Cost Average arm64 Cost Average x86_64 Runtime (hh:mm:ss) Average arm64 Runtime (hh:mm:ss)  Average Cost Savings %  Average Performance Gain % 
1 $1.64 $1.25 00:08:43 00:09:01 23.89% -3.24%
2 $10.00 $8.69 00:27:55 00:28:25 13.07% -1.79%
3 $29.66 $24.15 00:50:49 00:53:17 18.56% -4.85%
4 $34.42 $25.80 01:20:02 01:24:54 25.04% -6.08%
5 $2.76 $1.85 00:41:31 00:34:32 32.97% 16.82%
6 $34.07 $24.00 00:57:58 00:51:09 29.57% 11.76%
Average  23.85% 2.10%

Note that the improvement calculations are based on higher-precision results for more accuracy.
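The percentage columns in this and the earlier table can be reproduced directly from the raw cost and runtime figures; job 5 below is the example job summarized earlier:

```python
# Recompute job 5's cost savings and performance gain from the table's
# raw figures ($2.76 vs $1.85; 00:41:31 vs 00:34:32).

def pct_savings(x86_cost: float, arm_cost: float) -> float:
    return (x86_cost - arm_cost) / x86_cost * 100

def pct_speedup(x86_seconds: int, arm_seconds: int) -> float:
    return (x86_seconds - arm_seconds) / x86_seconds * 100

cost_savings = pct_savings(2.76, 1.85)               # 32.97%
perf_gain = pct_speedup(41 * 60 + 31, 34 * 60 + 32)  # 16.82%
```

Negative values in the performance-gain column (jobs 1 through 4) simply mean the arm64 run was slightly slower, even though its lower price still produced a net cost saving.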

Conclusion

Based on this study, GoDaddy observed a significant 23.85% improvement in price-performance for sample production Spark jobs utilizing the arm64 architecture compared to the x86_64 architecture. These compelling results have led us to strongly recommend internal teams to use arm64 (Graviton2) on EMR Serverless, except in cases where there are compatibility issues with third-party packages and libraries. By adopting an arm64 architecture, organizations can achieve enhanced cost-effectiveness and performance for their workloads, contributing to more efficient data processing and analytics.


About the Authors

Mukul Sharma is a Software Development Engineer on Data & Analytics (DnA) organization at GoDaddy. He is a polyglot programmer with experience in a wide array of technologies to rapidly deliver scalable solutions. He enjoys singing karaoke, playing various board games, and working on personal programming projects in his spare time.

Ozcan Ilikhan is a Director of Engineering on Data & Analytics (DnA) organization at GoDaddy. He is passionate about solving customer problems and increasing efficiency using data and ML/AI. In his spare time, he loves reading, hiking, gardening, and working on DIY projects.

Harsh Vardhan Singh Gaur is an AWS Solutions Architect, specializing in analytics. He has over 6 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

Ramesh Kumar Venkatraman is a Senior Solutions Architect at AWS who is passionate about containers and databases. He works with AWS customers to design, deploy, and manage their AWS workloads and architectures. In his spare time, he loves to play with his two kids and follows cricket.

SmugMug’s durable search pipelines for Amazon OpenSearch Service

Post Syndicated from Lee Shepherd original https://aws.amazon.com/blogs/big-data/smugmugs-durable-search-pipelines-for-amazon-opensearch-service/

SmugMug operates two very large online photo platforms, SmugMug and Flickr, enabling more than 100 million customers to safely store, search, share, and sell tens of billions of photos. Customers uploading and searching through decades of photos helped turn search into critical infrastructure, which has grown steadily since SmugMug first adopted Amazon CloudSearch in 2012 and then moved to Amazon OpenSearch Service in 2018, after reaching billions of documents and terabytes of search storage.

Here, Lee Shepherd, SmugMug Staff Engineer, shares SmugMug’s search architecture used to publish, backfill, and mirror live traffic to multiple clusters. SmugMug uses these pipelines to benchmark, validate, and migrate to new configurations, including Graviton-based r6gd.2xlarge instances from i3.2xlarge, along with testing Amazon OpenSearch Serverless. We cover three pipelines used for publishing, backfilling, and querying without introducing spiky, unrealistic traffic patterns, and without any impact on production services.

There are two main architectural pieces critical to the process:

  • A durable source of truth for index data. It’s best practice and part of our backup strategy to have a durable store beyond the OpenSearch index, and Amazon DynamoDB provides scalability and integration with AWS Lambda that simplifies a lot of the process. We use DynamoDB for other non-search services, so this was a natural fit.
  • A Lambda function for publishing data from the source of truth into OpenSearch. Using function aliases helps run multiple configurations of the same Lambda function at the same time and is key to keeping data in sync.

Publishing

The publishing pipeline is driven from events like a user entering keywords or captions, new uploads, or label detection through Amazon Rekognition. These events are processed, combining data from a few other asset stores like Amazon Aurora MySQL Compatible Edition and Amazon Simple Storage Service (Amazon S3), before writing a single item into DynamoDB.

Writing to DynamoDB invokes a Lambda publishing function, through the DynamoDB Streams Kinesis Adapter, that takes a batch of updated items from DynamoDB and indexes them into OpenSearch. There are other benefits to using the DynamoDB Streams Kinesis Adapter such as reducing the number of concurrent Lambdas required.

The publishing Lambda function uses environment variables to determine which OpenSearch domain and index to publish to. A production alias is configured to write to the production OpenSearch domain, triggered off the DynamoDB table or Kinesis stream.

When testing new configurations or migrating, a migration alias is configured to write to the new OpenSearch domain but use the same trigger as the production alias. This enables dual indexing of data to both OpenSearch Service domains simultaneously.

Here’s an example of the DynamoDB table schema:

{
 "Id": 123456,  // partition key
 "Fields": {
  "format": "JPG",
  "height": 1024,
  "width": 1536,
  ...
 },
 "LastUpdated": 1600107934
}
The ‘LastUpdated’ value is used as the document version when indexing, allowing OpenSearch to reject any out-of-order updates.
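Using ‘LastUpdated’ as an external version can be sketched as follows; this builds an OpenSearch _bulk action pair for one DynamoDB item. The index name is illustrative, and this is a simplification of SmugMug's actual publishing function:

```python
import json

# Sketch: build an OpenSearch _bulk action pair for one DynamoDB item, using
# LastUpdated as an external version so out-of-order updates are rejected.
# The index name is illustrative; the real publishing Lambda does more.

def bulk_index_action(index: str, item: dict) -> str:
    meta = {"index": {
        "_index": index,
        "_id": item["Id"],
        "version": item["LastUpdated"],
        "version_type": "external",
    }}
    return json.dumps(meta) + "\n" + json.dumps(item["Fields"]) + "\n"

lines = bulk_index_action("photos", {
    "Id": 123456,
    "Fields": {"format": "JPG", "height": 1024, "width": 1536},
    "LastUpdated": 1600107934,
})
```

With `version_type: external`, OpenSearch only accepts a write when the supplied version is higher than the stored one, so replayed or reordered stream batches cannot overwrite newer data.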

Backfilling

Now that changes are being published to both domains, the new domain (index) needs to be backfilled with historical data. To backfill a newly created index, a combination of Amazon Simple Queue Service (Amazon SQS) and DynamoDB is used. A script populates an SQS queue with messages that contain instructions for parallel scanning a segment of the DynamoDB table.

The SQS queue launches a Lambda function that reads the message instructions, fetches a batch of items from the corresponding segment of the DynamoDB table, and writes them into an OpenSearch index. New messages are written to the SQS queue to keep track of progress through the segment. After the segment completes, no more messages are written to the SQS queue and the process stops itself.

Concurrency is determined by the number of segments, with additional controls provided by Lambda concurrency scaling. SmugMug is able to index more than 1 billion documents per hour on their OpenSearch configuration while incurring zero impact to the production domain.

A NodeJS AWS-SDK based script is used to seed the SQS queue. Here’s a snippet of the SQS configuration script’s options:

Usage: queue_segments [options]

Options:
--search-endpoint <url>  OpenSearch endpoint url
--sqs-url <url>          SQS queue url
--index <string>         OpenSearch index name
--table <string>         DynamoDB table name
--key-name <string>      DynamoDB table partition key name
--segments <int>         Number of parallel segments

Along with the format of the resulting SQS message:

{
  searchEndpoint: opts.searchEndpoint,
  sqsUrl: opts.sqsUrl,
  table: opts.table,
  keyName: opts.keyName,
  index: opts.index,
  segment: i,
  totalSegments: opts.segments,
  exclusiveStartKey: <lastEvaluatedKey from previous iteration>
}

As each segment is processed, the ‘lastEvaluatedKey’ from the previous iteration is added to the message as the ‘exclusiveStartKey’ for the next iteration.
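The segment bookkeeping can be sketched as a small function that, given the current message and the DynamoDB Scan response page, either builds the follow-up message or stops; the field names follow the message format above, and the rest is illustrative:

```python
# Sketch of the backfill's segment bookkeeping: given the current SQS message
# and the DynamoDB Scan response page, build the follow-up message, or return
# None when the segment has been fully scanned.

def next_segment_message(message: dict, scan_page: dict):
    last_key = scan_page.get("LastEvaluatedKey")
    if last_key is None:
        return None  # segment complete: write no more messages
    return {**message, "exclusiveStartKey": last_key}

msg = {"table": "photos", "segment": 3, "totalSegments": 64}
more = next_segment_message(msg, {"LastEvaluatedKey": {"Id": 987}})
done = next_segment_message(msg, {"Items": []})  # no LastEvaluatedKey left
```

Because a missing LastEvaluatedKey means the Scan reached the end of that segment, each of the parallel segments naturally drains its own queue traffic to zero, which is how the overall process stops itself.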

Mirroring

Last, we mirror search queries by sending each production OpenSearch query to an SQS queue in addition to the production domain. The SQS queue invokes a Lambda function that replays the query against the replica domain. The search results from these requests are never returned to users, but they replicate production load on the OpenSearch domain under test without any impact on production systems or customers.

Conclusion

When evaluating a new OpenSearch domain or configuration, the main metrics we are interested in are query latencies, namely the took latencies reported by OpenSearch and, most importantly, the latencies experienced when searching. In our move to Graviton-based r6gd instances, we saw about 40 percent lower P50-P99 latencies, along with similar gains in CPU usage compared to i3 instances (even before accounting for Graviton's lower cost). Another welcome benefit was the more predictable and monitorable JVM memory pressure, thanks to the garbage collection changes from the addition of G1GC on r6gd and other new instances.

Using this pipeline, we’re also testing OpenSearch Serverless and finding its best use cases. We’re excited about that service and fully intend to have an entirely serverless architecture in time. Stay tuned for results.


About the Authors

Lee Shepherd is a SmugMug Staff Software Engineer

Aydn Bekirov is an Amazon Web Services Principal Technical Account Manager

Enabling highly available connectivity from on premises to AWS Local Zones

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/enabling-highly-available-connectivity-from-on-premises-to-aws-local-zones/

This post is written by Leonardo Solano, Senior Hybrid Cloud Solutions Architect, and Robert Belson, Developer Advocate.

Planning your network topology is a foundational requirement of the reliability pillar of the AWS Well-Architected Framework. REL02-BP02 defines how to provide redundant connectivity between private networks in the cloud and on-premises environments, using AWS Direct Connect for resilient redundant connections, AWS Site-to-Site VPN, or AWS Direct Connect failing over to AWS Site-to-Site VPN. As more customers use a combination of on-premises environments, Local Zones, and AWS Regions, they have asked for guidance on how to extend this pillar of the AWS Well-Architected Framework to include Local Zones. As an example, if you are on an application modernization journey, you may have existing Amazon EKS clusters that have dependencies on persistent on-premises data.

AWS Local Zones enables single-digit millisecond latency to power applications such as real-time gaming, live streaming, augmented and virtual reality (AR/VR), virtual workstations, and more. Local Zones can also help you meet data sovereignty requirements in regulated industries such as healthcare, financial services, and the public sector. Additionally, enterprises can leverage a hybrid architecture and seamlessly extend their on-premises environment to the cloud using Local Zones. In the example above, you could extend Amazon EKS clusters to include node groups in a Local Zone (or multiple Local Zones) or on premises using AWS Outposts racks.

To provide connectivity between private networks in Local Zones and on-premises environments, customers typically consider Direct Connect or software VPNs available in the AWS Marketplace. This post provides a reference implementation that eliminates single points of failure in connectivity while offering automatic network impairment detection and intelligent failover using both Direct Connect and software VPNs from the AWS Marketplace. Moreover, this solution minimizes latency by ensuring that traffic does not hairpin through the parent AWS Region to the Local Zone.

Solution overview

In Local Zones, all architectural patterns based on AWS Direct Connect follow the same architecture as in AWS Regions and can be deployed using the AWS Direct Connect Resiliency Toolkit. As of the date of publication, Local Zones do not support AWS managed Site-to-Site VPN (view latest Local Zones features). Thus, for customers that have access to only a single Direct Connect location or require resiliency beyond a single connection, this post will demonstrate a solution using an AWS Direct Connect failover strategy with a software VPN appliance. You can find a range of third-party software VPN appliances as well as the throughput per VPN tunnel that each offering provides in the AWS Marketplace.

Prerequisites:

To get started, make sure that your account is opt-in for Local Zones and configure the following:

  1. Extend a Virtual Private Cloud (VPC) from the Region to the Local Zone, with at least 3 subnets. Use Getting Started with AWS Local Zones as a reference.
    1. Public subnet in Local Zone (public-subnet-1)
    2. Private subnets in Local Zone (private-subnet-1 and private-subnet-2)
    3. Private subnet in the Region (private-subnet-3)
    4. Modify DNS attributes in your VPC, including both “enableDnsSupport” and “enableDnsHostnames”;
  2. Attach an Internet Gateway (IGW) to the VPC;
  3. Attach a Virtual Private Gateway (VGW) to the VPC;
  4. Create an EC2 interface VPC endpoint attached to private-subnet-3;
  5. Define the following routing tables (RTB):
    1. Private-subnet-1 RTB: enabling propagation for VGW;
    2. Private-subnet-2 RTB: enabling propagation for VGW;
    3. Public-subnet-1 RTB: with a default route with IGW-ID as the next hop;
  6. Configure a Direct Connect Private Virtual Interface (VIF) from your on-premises environment to Local Zones Virtual Gateway’s VPC. For more details see this post: AWS Direct Connect and AWS Local Zones interoperability patterns;
  7. Launch any software VPN appliance from AWS Marketplace on Public-subnet-1. In this blog post on simulating Site-to-Site VPN customer gateways using strongSwan, you can find an example that provides the steps to deploy a third-party software VPN in AWS Region;
  8. Capture the following parameters from your environment:
    1. Software VPN Elastic Network Interface (ENI) ID
    2. Private-subnet-1 RTB ID
    3. Probe IP, which must be an on-premises resource that can respond to Internet Control Message Protocol (ICMP) requests.

High level architecture

This architecture requires a utility Amazon Elastic Compute Cloud (Amazon EC2) instance in a private subnet (private-subnet-2), sending ICMP probes over the Direct Connect connection. Once the utility instance detects lost packets to on-premises network from the Local Zone it initiates a failover by adding a static route with the on-premises CIDR range as the destination and the VPN Appliance ENI-ID as the next hop in the production private subnet (private-subnet-1), taking priority over the Direct Connect propagated route. Once healthy, this utility will revert back to the default route to the original Direct Connect connection.
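The monitoring-and-failover loop described above can be sketched as a small state machine. This is a minimal sketch in Python; the names `FailoverMonitor`, `create_route`, and `delete_route` are illustrative, and the route actions are injected so the decision logic stands alone without any AWS calls:

```python
DX_RECOVERY_TRIES = 10  # consecutive healthy probes required before failing back


class FailoverMonitor:
    def __init__(self, create_route, delete_route):
        self.create_route = create_route  # e.g. would wrap `aws ec2 create-route`
        self.delete_route = delete_route  # e.g. would wrap `aws ec2 delete-route`
        self.active_path = "DX"
        self.dx_tries = 0

    def on_probe(self, reachable: bool) -> str:
        """Process one ICMP probe result and return the currently active path."""
        if not reachable and self.active_path == "DX":
            # Direct Connect path lost: point the subnet at the VPN appliance ENI
            self.create_route()
            self.active_path = "VPN"
            self.dx_tries = DX_RECOVERY_TRIES
        elif reachable and self.active_path == "VPN":
            self.dx_tries -= 1
            if self.dx_tries == 0:
                # Probes stayed healthy long enough: remove the static route
                # so the Direct Connect propagated route takes over again
                self.delete_route()
                self.active_path = "DX"
        return self.active_path
```

The utility instance in this post implements the same logic as a shell script, shown later in the walkthrough.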

On-premises considerations

To add redundancy in the on-premises environment, you can use two routers running a First Hop Redundancy Protocol (FHRP) such as Hot Standby Router Protocol (HSRP) or Virtual Router Redundancy Protocol (VRRP). The router connected to the Direct Connect link has the highest priority, taking the Primary role in the FHRP process, while the VPN router remains the Secondary router. The FHRP failover mechanism relies on interface state or on protocol state such as BGP to trigger the switch to the secondary router.

High level HA architecture for Software VPN

Figure 1. High level HA architecture for Software VPN

Failover by modifying the production subnet RTB

Figure 2. Failover by modifying the production subnet RTB

Step-by-step deployment

Create an IAM role with permissions to create and delete routes in your private-subnet-1 route table:

  1. Create ec2-role-trust-policy.json file on your local machine:
cat > ec2-role-trust-policy.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF
  2. Create your EC2 IAM role, such as my_ec2_role:
aws iam create-role --role-name my_ec2_role --assume-role-policy-document file://ec2-role-trust-policy.json
  3. Create a file with the necessary permissions to attach to the EC2 IAM role (at minimum, permissions to create and delete routes in the route table). Name it ec2-role-iam-policy.json.
  4. Create the IAM policy and attach the policy to the IAM role my_ec2_role that you previously created:
aws iam create-policy --policy-name my-ec2-policy --policy-document file://ec2-role-iam-policy.json

aws iam attach-role-policy --policy-arn arn:aws:iam::<account_id>:policy/my-ec2-policy --role-name my_ec2_role
  5. Create an instance profile and attach the IAM role to it:
aws iam create-instance-profile --instance-profile-name my_ec2_instance_profile
aws iam add-role-to-instance-profile --instance-profile-name my_ec2_instance_profile --role-name my_ec2_role

Launch and configure your utility instance

  1. Capture the Amazon Linux 2 AMI ID through CLI:
aws ec2 describe-images --filters "Name=name,Values=amzn2-ami-kernel-5.10-hvm-2.0.20230404.1-x86_64-gp2" | grep ImageId 

Sample output:

            "ImageId": "ami-069aabeee6f53e7bf",

  2. Create an EC2 key pair for the utility instance:
aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem
  3. Launch the utility instance in the Local Zone (replace the variables with your account and environment parameters):
aws ec2 run-instances --image-id ami-069aabeee6f53e7bf --key-name MyKeyPair --count 1 --instance-type t3.medium  --subnet-id <private-subnet-2-id> --iam-instance-profile Name=my_ec2_instance_profile

Deploy failover automation shell script on the utility instance

  1. Create the following shell script in your utility instance (replace the health check variables with your environment values):
cat > vpn_monitoring.sh <<'EOF'
#!/bin/bash
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
# Health Check variables
Wait_Between_Pings=2
RTB_ID=<private-subnet-1-rtb-id>
PROBE_IP=<probe-ip>
remote_cidr=<remote-cidr>
GW_ENI_ID=<software-vpn-eni_id>
Active_path=DX

echo `date` "-- Starting VPN monitor"

while true; do
  # Check health of main VPN Appliance path to remote probe ip
  pingresult=`ping -c 3 -W 1 $PROBE_IP | grep time= | wc -l`
  # Check to see if any of the health checks succeeded
  if [ "$pingresult" == "0" ]; then
    if [ "$Active_path" == "DX" ]; then
      echo `date` "-- Direct Connect failed. Failing over vpn"
      aws ec2 create-route --route-table-id $RTB_ID --destination-cidr-block $remote_cidr --network-interface-id $GW_ENI_ID --region us-east-1
      Active_path=VPN
      DX_tries=10
      echo "probe_ip: unreachable -- active_path: vpn"
    else
      echo "probe_ip: unreachable -- active_path: vpn"
    fi
  else
    if [ "$Active_path" == "VPN" ]; then
      let DX_tries=DX_tries-1
      if [ "$DX_tries" == "0" ]; then
        echo `date` "-- failing back to Direct Connect"
        aws ec2 delete-route --route-table-id $RTB_ID --destination-cidr-block $remote_cidr --region us-east-1
        Active_path=DX
        echo "probe_ip: reachable -- active_path: Direct Connect"
      else
        echo "probe_ip: reachable -- active_path: vpn"
      fi
    else
      echo "probe_ip: reachable -- active_path: Direct Connect"
    fi
  fi
  sleep $Wait_Between_Pings
done
EOF
  2. Modify permissions to your shell script file:
chmod +x vpn_monitoring.sh
  3. Start the shell script:
./vpn_monitoring.sh

Test the environment

Failover process between Direct Connect and software VPN

Figure 3. Failover process between Direct Connect and software VPN

Simulate failure of the Direct Connect link, breaking the available path from the Local Zone to the on-premises environment. You can simulate the failure using the failure test feature in Direct Connect console.

Bringing BGP session down

Figure 4. Bringing BGP session down

Setting the failure time

Figure 5. Setting the failure time

In the utility instance you will see the following logs:

Thu Sep 21 14:39:34 UTC 2023 -- Direct Connect failed. Failing over vpn

The shell script in action will detect packet loss by ICMP probes against a probe IP destination on premises, triggering the failover process. As a result, it will make an API call (aws ec2 create-route) to AWS using the EC2 interface endpoint.

The script will create a static route in the private-subnet-1-RTB toward the on-premises CIDR, with the VPN appliance’s elastic network interface (ENI) ID as the next hop.

private-subnet-1-RTB during the test

Figure 6. private-subnet-1-RTB during the test

The FHRP mechanisms detect the failure in the Direct Connect Link and then reduce the FHRP priority on this path, which triggers the failover to the secondary link through the VPN path.

Once you cancel the test or the test finishes, the failback procedure will revert the private-subnet-1 route table to its initial state, resulting in the following logs to be emitted by the utility instance:

Thu Sep 21 14:42:34 UTC 2023 -- failing back to Direct Connect

private-subnet-1 route table initial state

Figure 7. private-subnet-1 route table initial state

Cleaning up

To clean up your AWS-based resources, run the following AWS CLI commands:

aws ec2 terminate-instances --instance-ids <your-utility-instance-id>
aws iam remove-role-from-instance-profile --instance-profile-name my_ec2_instance_profile --role-name my_ec2_role
aws iam delete-instance-profile --instance-profile-name my_ec2_instance_profile
aws iam detach-role-policy --policy-arn arn:aws:iam::<account_id>:policy/my-ec2-policy --role-name my_ec2_role
aws iam delete-policy --policy-arn arn:aws:iam::<account_id>:policy/my-ec2-policy
aws iam delete-role --role-name my_ec2_role

Conclusion

This post demonstrates how to create a failover strategy for Local Zones using the same resilience mechanisms already established in the AWS Regions. By leveraging Direct Connect and software VPNs, you can achieve high availability in scenarios where you are constrained to a single Direct Connect location due to geographical limitations. In the architectural pattern illustrated in this post, the failover strategy relies on a utility instance with least-privileged permissions. The utility instance identifies network impairment and dynamically modifies your production route tables to keep connectivity established from a Local Zone to your on-premises location. The same mechanism automatically fails back from the software VPN to Direct Connect once the utility instance validates that the Direct Connect path is sufficiently reliable to avoid network flapping. To learn more about Local Zones, you can visit the AWS Local Zones user guide.

Quickly Restore Amazon EC2 Mac Instances using Replace Root Volume capability

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/new-reset-amazon-ec2-mac-instances-to-a-known-state-using-replace-root-volume-capability/

This post is written by Sebastien Stormacq, Principal Developer Advocate.

Amazon Elastic Compute Cloud (Amazon EC2) now supports replacing the root volume on a running EC2 Mac instance, enabling you to restore the root volume of an EC2 Mac instance to its initial launch state, to a specific snapshot, or to a new Amazon Machine Image (AMI).

Since 2021, we have offered on-demand and pay-as-you-go access to Amazon EC2 Mac instances, in the same manner as our Intel, AMD and Graviton-based instances. Amazon EC2 Mac instances integrate all the capabilities you know and love from macOS with dozens of AWS services such as Amazon Virtual Private Cloud (VPC) for network security, Amazon Elastic Block Store (EBS) for expandable storage, Elastic Load Balancing (ELB) for distributing build queues, Amazon FSx for scalable file storage, and AWS Systems Manager Agent (SSM Agent) for configuring, managing, and patching macOS environments.

Just like for every EC2 instance type, AWS is responsible for protecting the infrastructure that runs all of the services offered in the AWS cloud. To ensure that EC2 Mac instances provide the same security and data privacy as other Nitro-based EC2 instances, Amazon EC2 performs a scrubbing workflow on the underlying Dedicated Host as soon as you stop or terminate an instance. This scrubbing process erases the internal SSD, clears the persistent NVRAM variables, and updates the device firmware to the latest version enabling you to run the latest macOS AMIs. The documentation has more details about this process.

The scrubbing process ensures a sanitized dedicated host for each EC2 Mac instance launch and takes some time to complete. Our customers have shared two use cases where they may need to set back their instance to a previous state in a shorter time period or without the need to initiate the scrubbing workflow. The first use case is when patching an existing disk image to bring OS-level or applications-level updates to your fleet, without manually patching individual instances in-place. The second use case is during continuous integration and continuous deployment (CI/CD) when you need to restore an Amazon EC2 Mac instance to a defined well-known state at the end of a build.

To restart your EC2 Mac instance in its initial state without stopping or terminating them, we created the ability to replace the root volume of an Amazon EC2 Mac instance with another EBS volume. This new EBS volume is created either from a new AMI, an Amazon EBS Snapshot, or from the initial volume state during boot.

You just swap the root volume with a new one and initiate an OS-level reboot. Local data on the instance store, additional attached EBS volumes, networking configurations (including the instance IP addresses), IAM profiles, and security groups are all preserved.

Let’s see how Replace Root Volume works

To prepare and initiate an Amazon EBS root volume replacement, you can use the AWS Management Console, the AWS Command Line Interface (AWS CLI), or one of our AWS SDKs. For this demo, I used the AWS CLI to show how you can automate the entire process.

To start the demo, I first allocate a Dedicated Host and then start an EC2 Mac instance, SSH-connect to it, and install the latest version of Xcode. I use the open-source xcodeinstall CLI tool to download and install Xcode. Typically, you also download, install, and configure a build agent and additional build tools or libraries as required by your build pipelines.

Once the instance is ready, I create an Amazon Machine Image (AMI). AMIs are disk images you can reuse to launch additional and identical EC2 Mac instances. This can be done from any machine that has the credentials to make API calls on your AWS account. In the following, you can see the commands I issued from my laptop’s Terminal application.

#
# Find the instance’s ID based on the instance name tag
#
~ aws ec2 describe-instances \
--filters "Name=tag:Name,Values=RRV-Demo" \
--query "Reservations[].Instances[].InstanceId" \
--output text 

i-0fb8ffd5dbfdd5384

#
# Create an AMI based on this instance
#
~ aws ec2 create-image \
--instance-id i-0fb8ffd5dbfdd5384 \
--name "macOS_13.3_Gold_AMI" \
--description "macOS 13.2 with Xcode 13.4.1"

{
    "ImageId": "ami-0012e59ed047168e4"
}

It takes a few minutes to complete the AMI creation process.

After I created this AMI, I can use my instance as usual. I can use it to build, test, and distribute my application, or make any other changes on the root volume.

When I want to reset the instance to the state of my AMI, I initiate the replace root volume operation:

~ aws ec2 create-replace-root-volume-task \
--instance-id i-0fb8ffd5dbfdd5384 \
--image-id ami-0012e59ed047168e4

{
    "ReplaceRootVolumeTask": {
        "ReplaceRootVolumeTaskId": "replacevol-07634c2a6cf2a1c61",
        "InstanceId": "i-0fb8ffd5dbfdd5384",
        "TaskState": "pending",
        "StartTime": "2023-05-26T12:44:35Z",
        "Tags": [],
        "ImageId": "ami-0012e59ed047168e4",
        "SnapshotId": "snap-02be6b9c02d654c83",
        "DeleteReplacedRootVolume": false
    }
}

The root Amazon EBS volume is replaced with a fresh one created from the AMI, and the system triggers an OS-level reboot.

I can observe the progress with the DescribeReplaceRootVolumeTasks API:

~ aws ec2 describe-replace-root-volume-tasks \
--replace-root-volume-task-ids replacevol-07634c2a6cf2a1c61

{
    "ReplaceRootVolumeTasks": [
        {
            "ReplaceRootVolumeTaskId": "replacevol-07634c2a6cf2a1c61",
            "InstanceId": "i-0fb8ffd5dbfdd5384",
            "TaskState": "succeeded",
            "StartTime": "2023-05-26T12:44:35Z",
            "CompleteTime": "2023-05-26T12:44:43Z",
            "Tags": [],
            "ImageId": "ami-0012e59ed047168e4",
            "DeleteReplacedRootVolume": false
        }
    ]
}
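Repeated polling of this describe call until the task settles can be sketched as follows. This is a minimal sketch: `wait_for_replacement` is an illustrative name, and the injected `describe_task` function would wrap the describe-replace-root-volume-tasks call in practice, which keeps the polling logic testable without AWS access:

```python
import time


def wait_for_replacement(describe_task, task_id, poll_seconds=5, max_polls=60):
    """Poll the injected describe function until the task reaches a terminal state."""
    for _ in range(max_polls):
        state = describe_task(task_id)["TaskState"]
        # "succeeded" and the failure states are terminal; "pending",
        # "in-progress", and "failing" mean the task is still running
        if state in ("succeeded", "failed", "failed-detached"):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not reach a terminal state")
```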

After a short time, the instance becomes available again, and I can connect over ssh.

~ ssh ec2-user@3.0.0.86
Warning: Permanently added '3.0.0.86' (ED25519) to the list of known hosts.
Last login: Wed May 24 18:13:42 2023 from 81.0.0.0

(Amazon EC2 macOS Ventura 13.2.1 login banner)
 
ec2-user@ip-172-31-58-100 ~ %

Additional thoughts

There are a couple of additional points to know before using this new capability:

  • By default, the old root volume is preserved. You can pass the --delete-replaced-root-volume option to delete it automatically. Do not forget to delete old volumes and their corresponding Amazon EBS Snapshots when you don’t need them anymore to avoid being charged for them.
  • During the replacement, the instance will be unable to respond to health checks and hence might be marked as unhealthy if placed inside an Auto Scaling group. You can write a custom health check to change that behavior.
  • When replacing the root volume with an AMI, the AMI must have the same product code, billing information, architecture type, and virtualization type as that of the instance.
  • When replacing the root volume with a snapshot, you must use snapshots from the same lineage as the instance’s current root volume.
  • The size of the new volume is the larger of the AMI’s block device mapping and the size of the old Amazon EBS root volume.
  • Any non-root Amazon EBS volume stays attached to the instance.
  • Finally, the content of the instance store (the internal SSD drive) is untouched, and all other metadata of the instance is unmodified (the IP addresses, ENIs, IAM policies, and so on).
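The sizing rule in the list above amounts to a simple maximum. This is a small sketch under that reading; the function name and GiB units are illustrative:

```python
def replacement_root_size_gib(ami_root_gib: int, old_root_gib: int) -> int:
    # The replacement root volume gets the larger of the AMI's root block
    # device mapping size and the old EBS root volume's size
    return max(ami_root_gib, old_root_gib)
```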

Pricing and availability

Replace Root Volume for EC2 Mac is available in all AWS Regions where Amazon EC2 Mac instances are available. There is no additional cost to use this capability. You are charged for the storage consumed by the Amazon EBS Snapshots and AMIs.

Check other options available on the API or AWS CLI and go configure your first root volume replacement task today!

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

Post Syndicated from Bhupesh Sharma original https://aws.amazon.com/blogs/big-data/modernize-a-legacy-real-time-analytics-application-with-amazon-managed-service-for-apache-flink/

Organizations with legacy, on-premises, near-real-time analytics solutions typically rely on self-managed relational databases as their data store for analytics workloads. To reap the benefits of cloud computing, like increased agility and just-in-time provisioning of resources, organizations are migrating their legacy analytics applications to AWS. The lift and shift migration approach is limited in its ability to transform businesses because it relies on outdated, legacy technologies and architectures that limit flexibility and slow down productivity. In this post, we discuss ways to modernize your legacy, on-premises, real-time analytics architecture to build serverless data analytics solutions on AWS using Amazon Managed Service for Apache Flink.

Near-real-time streaming analytics captures the value of operational data and metrics to provide new insights to create business opportunities. In this post, we discuss challenges with relational databases when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions. We introduce you to Amazon Managed Service for Apache Flink Studio and get started querying streaming data interactively using Amazon Kinesis Data Streams.

Solution overview

In this post, we walk through a call center analytics solution that provides insights into the call center’s performance in near-real time through metrics that determine agent efficiency in handling calls in the queue. Key performance indicators (KPIs) of interest for a call center from a near-real-time platform could be calls waiting in the queue, highlighted in a performance dashboard within a few seconds of data ingestion from call center streams. These metrics help agents improve their call handle time and also reallocate agents across organizations to handle pending calls in the queue.

Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformations through stored procedures and use of materialized views to curate datasets and generate insights is a known pattern with relational databases. However, as data loses its relevance with time, transformations in a near-real-time analytics platform need only the latest data from the streams to generate insights. This may require frequent truncation in certain tables to retain only the latest stream of events. Also, the need to derive near-real-time insights within seconds requires frequent materialized view refreshes in this traditional relational database approach. Frequent materialized view refreshes on top of constantly changing base tables due to streamed data can lead to snapshot isolation errors. Also, a data model that allows table truncations at a regular frequency (for example, every 15 seconds) to store only relevant data in tables can cause locking and performance issues.

The following diagram provides the high-level architecture of a legacy call center analytics platform. In this traditional architecture, a relational database is used to store data from streaming data sources. Datasets used for generating insights are curated using materialized views inside the database and published for business intelligence (BI) reporting.

Modernizing this traditional database-driven architecture in the AWS Cloud allows you to use sophisticated streaming technologies like Amazon Managed Service for Apache Flink, which are built to transform and analyze streaming data in real time. With Amazon Managed Service for Apache Flink, you can gain actionable insights from streaming data with serverless, fully managed Apache Flink. You can use Amazon Managed Service for Apache Flink to quickly build end-to-end stream processing applications and process data continuously, getting insights in seconds or minutes. With Amazon Managed Service for Apache Flink, you can use Apache Flink code or Flink SQL to continuously generate time-series analytics over time windows and perform sophisticated joins across streams.

The following architecture diagram illustrates how a legacy call center analytics platform running on databases can be modernized to run on the AWS Cloud using Amazon Managed Service for Apache Flink. It shows a call center streaming data source that sends the latest call center feed in every 15 seconds. The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day. You can perform sophisticated joins over these streaming datasets and create views on top of it using Amazon Managed Service for Apache Flink to generate KPIs required for the business using Amazon OpenSearch Service. You can analyze streaming data interactively using managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio in near-real time. The near-real-time insights can then be visualized as a performance dashboard using OpenSearch Dashboards.

In this post, you perform the following high-level implementation steps:

  1. Ingest data from streaming data sources to Kinesis Data Streams.
  2. Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the stream data within seconds of data ingestion.
  3. Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards.

Prerequisites

This post requires you to set up the Amazon Kinesis Data Generator (KDG) to send data to a Kinesis data stream using an AWS CloudFormation template. For the template and setup information, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.

We use two datasets in this post. The first dataset is fact data, which contains call center organization data. The KDG generates a fact data feed in every 15 seconds that contains the following information:

  • AgentId – Agents work in a call center setting surrounded by other call center employees answering customers’ questions and referring them to the necessary resources to solve their problems.
  • OrgId – A call center contains different organizations and departments, such as Care Hub, IT Hub, Allied Hub, Prem Hub, Help Hub, and more.
  • QueueId – Call queues provide an effective way to route calls using simple or sophisticated patterns to ensure that all calls are getting into the correct hands quickly.
  • WorkMode – An agent work mode determines your current state and availability to receive incoming calls from the Automatic Call Distribution (ACD) and Direct Agent Call (DAC) queues. Call Center Elite does not route ACD and DAC calls to your phone when you are in an Aux mode or ACW mode.
  • WorkSkill – Working as a call center agent requires several soft skills to see the best results, like problem-solving, bilingualism, channel experience, aptitude with data, and more.
  • HandleTime – This customer service metric measures the length of a customer’s call.
  • ServiceLevel – The call center service level is defined as the percentage of calls answered within a predefined amount of time—the target time threshold. It can be measured over any period of time (such as 30 minutes, 1 hour, 1 day, or 1 week) and for each agent, team, department, or the company as a whole.
  • WorkStates – This specifies what state an agent is in. For example, an agent in an available state is available to handle calls from an ACD queue. An agent can have several states with respect to different ACD devices, or they can use a single state to describe their relationship to all ACD devices. Agent states are reported in agent-state events.
  • WaitingTime – This is the average time an inbound call spends waiting in the queue or waiting for a callback if that feature is active in your IVR system.
  • EventTime – This is the time when the call center stream is sent (via the KDG in this post).

The following fact payload is used in the KDG to generate sample fact data:

{
"AgentId" : {{random.number(
{
"min":2001,
"max":2005
}
)}},
"OrgID" : {{random.number(
{
"min":101,
"max":105
}
)}},
"QueueId" : {{random.number(
{
"min":1,
"max":5
}
)}},
"WorkMode" : "{{random.arrayElement(
["TACW","ACW","AUX"]
)}}",
"WorkSkill": "{{random.arrayElement(
["Problem Solving","Bilingualism","Channel experience","Aptitude with data"]
)}}",
"HandleTime": {{random.number(
{
"min":1,
"max":150
}
)}},
"ServiceLevel":"{{random.arrayElement(
["Sev1","Sev2","Sev3"]
)}}",
"WorkSates": "{{random.arrayElement(
["Unavailable","Available","On a call"]
)}}",
"WaitingTime": {{random.number(
{
"min":10,
"max":150
}
)}},
"EventTime":"{{date.utc("YYYY-MM-DDTHH:mm:ss")}}"
}

The following screenshot shows the output of the sample fact data in an Amazon Managed Service for Apache Flink notebook.

The second dataset is dimension data. This data contains metadata information like organization names for their respective organization IDs, agent names, and more. The frequency of the dimension dataset is twice a day, whereas the fact dataset gets loaded in every 15 seconds. In this post, we use Amazon Simple Storage Service (Amazon S3) as a data storage layer to store metadata information (Amazon DynamoDB can be used to store metadata information as well). We use AWS Lambda to load metadata from Amazon S3 to another Kinesis data stream that stores metadata information. The following JSON file stored in Amazon S3 has metadata mappings to be loaded into the Kinesis data stream:

[{"OrgID": 101,"OrgName" : "Care Hub","Region" : "APAC"},
{"OrgID" : 102,"OrgName" : "Prem Hub","Region" : "AMER"},
{"OrgID" : 103,"OrgName" : "IT Hub","Region" : "EMEA"},
{"OrgID" : 104,"OrgName" : "Help Hub","Region" : "EMEA"},
{"OrgID" : 105,"OrgName" : "Allied Hub","Region" : "LATAM"}]

Ingest data from streaming data sources to Kinesis Data Streams

To start ingesting your data, complete the following steps:

  1. Create two Kinesis data streams for the fact and dimension datasets, as shown in the following screenshot. For instructions, refer to Creating a Stream via the AWS Management Console.

  2. Create a Lambda function on the Lambda console to load metadata files from Amazon S3 to Kinesis Data Streams. Use the following code:
    import boto3
    import json
    
    # Create clients for Amazon S3 and Kinesis Data Streams
    s3_client = boto3.client("s3")
    S3_BUCKET = '<S3 Bucket Name>'
    kinesis_client = boto3.client("kinesis")
    stream_name = '<Kinesis Stream Name>'
        
    def lambda_handler(event, context):
      
      # Read the metadata file on Amazon S3
      object_key = "<S3 File Name.json>"  
      file_content = s3_client.get_object(
          Bucket=S3_BUCKET, Key=object_key)["Body"].read()
      
      # Decode the S3 object into a list of JSON records
      decoded_data = file_content.decode("utf-8").replace("'", '"')
      json_data = json.loads(decoded_data)
      
      # Upload each metadata record to the Kinesis data stream,
      # partitioned by its OrgID value
      for record in json_data:
        response = kinesis_client.put_record(
          StreamName=stream_name,
          Data=json.dumps(record),
          PartitionKey=str(record['OrgID']))
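The decode-and-iterate step above can be checked locally without any AWS calls. This is a sketch with an illustrative helper name; it returns (partition key, data) pairs like those handed to put_record:

```python
import json


def to_kinesis_records(file_content: bytes):
    # The S3 object holds a JSON array; each element becomes one Kinesis
    # record, keyed by its OrgID so related metadata shares a partition
    records = json.loads(file_content.decode("utf-8").replace("'", '"'))
    return [(str(r["OrgID"]), json.dumps(r)) for r in records]
```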

Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the streaming data

The next step is to create tables in Amazon Managed Service for Apache Flink Studio for further transformations (joins, aggregations, and so on). To set up and query Kinesis Data Streams using Amazon Managed Service for Apache Flink Studio, refer to Query your data streams interactively using Amazon Managed Service for Apache Flink Studio and Python and create an Amazon Managed Service for Apache Flink notebook. Then complete the following steps:

  1. In the Amazon Managed Service for Apache Flink Studio notebook, create a fact table from the facts data stream you created earlier, using the following query.

The event time attribute is defined using a WATERMARK statement in the CREATE table DDL. A WATERMARK statement defines a watermark generation expression on an existing event time field, which marks the event time field as the event time attribute.

The event time refers to the processing of streaming data based on timestamps that are attached to each row. The timestamps can encode when an event happened. Processing time (PROCTime) refers to the machine’s system time that is running the respective operation.

%flink.ssql
CREATE TABLE <Fact Table Name> (
AgentId INT,
OrgID INT,
QueueId BIGINT,
WorkMode VARCHAR,
WorkSkill VARCHAR,
HandleTime INT,
ServiceLevel VARCHAR,
WorkStates VARCHAR,
WaitingTime INT,
EventTime TIMESTAMP(3),
WATERMARK FOR EventTime AS EventTime - INTERVAL '4' SECOND
)
WITH (
'connector' = 'kinesis',
'stream' = '<fact stream name>',
'aws.region' = '<AWS region ex. us-east-1>',
'scan.stream.initpos' = 'LATEST',
'format' = 'json',
'json.timestamp-format.standard' = 'ISO-8601'
);

  2. Create a dimension table in the Amazon Managed Service for Apache Flink Studio notebook that uses the metadata Kinesis data stream:
    %flink.ssql
    CREATE TABLE <Metadata Table Name> (
    AgentId INT,
    AgentName VARCHAR,
    OrgID INT,
    OrgName VARCHAR,
    update_time as CURRENT_TIMESTAMP,
    WATERMARK FOR update_time AS update_time
    )
    WITH (
    'connector' = 'kinesis',
    'stream' = '<Metadata Stream Name>',
    'aws.region' = '<AWS region ex. us-east-1>',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
    );

  3. Create a versioned view to extract the latest version of metadata table values to be joined with the facts table:
    %flink.ssql(type=update)
    CREATE VIEW versioned_metadata AS 
    SELECT OrgID,OrgName
      FROM (
          SELECT *,
          ROW_NUMBER() OVER (PARTITION BY OrgID
             ORDER BY update_time DESC) AS rownum 
          FROM <Metadata Table Name>)
    WHERE rownum = 1;
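The deduplication this ROW_NUMBER() query performs (keeping only the newest metadata row per OrgID) can be sketched in plain Python for intuition. This is illustrative only, not Flink code:

```python
# Illustrative sketch: keep only the most recently updated OrgName per OrgID,
# equivalent to ROW_NUMBER() OVER (PARTITION BY OrgID ORDER BY update_time DESC)
# filtered to rownum = 1.
def latest_per_org(rows):
    latest = {}
    for row in rows:  # each row: {"OrgID": ..., "OrgName": ..., "update_time": ...}
        current = latest.get(row["OrgID"])
        if current is None or row["update_time"] > current["update_time"]:
            latest[row["OrgID"]] = row
    return {org_id: r["OrgName"] for org_id, r in latest.items()}
```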

  4. Join the facts table and the versioned metadata view on OrgID and create a view that provides the total calls in the queue in each organization over 5-second intervals, for further reporting. Additionally, create a tumbling window of 5 seconds so the final output is emitted every 5 seconds. See the following code:
    %flink.ssql(type=update)
    
    CREATE VIEW joined_view AS
    SELECT streamtable.window_start, streamtable.window_end,metadata.OrgName,streamtable.CallsInQueue
    FROM
    (SELECT window_start, window_end, OrgID, count(QueueId) as CallsInQueue
      FROM TABLE(
        TUMBLE( TABLE <Fact Table Name>, DESCRIPTOR(EventTime), INTERVAL '5' SECOND))
      GROUP BY window_start, window_end, OrgID) as streamtable
    JOIN
        versioned_metadata metadata
        ON metadata.OrgID = streamtable.OrgID;
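For intuition, the count that the 5-second TUMBLE window produces can be sketched in plain Python. This is illustrative only; Flink computes these counts incrementally with event-time semantics:

```python
from collections import defaultdict

# Illustrative sketch: count rows (calls in queue) per (5-second window, OrgID).
# Each event is (event_time_seconds, org_id, queue_id); every row is counted,
# matching count(QueueId) in the tumbling-window query.
def tumble_counts(events, window_size=5):
    counts = defaultdict(int)
    for event_time, org_id, queue_id in events:
        window_start = (int(event_time) // window_size) * window_size
        counts[(window_start, org_id)] += 1
    return dict(counts)
```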

  5. Now you can run the following query from the view you created and see the results in the notebook:
%flink.ssql(type=update)

SELECT jv.window_end, jv.CallsInQueue, jv.window_start, jv.OrgName
FROM joined_view jv where jv.OrgName = 'Prem Hub';

Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards

You can publish the metrics generated within Amazon Managed Service for Apache Flink Studio to OpenSearch Service and visualize metrics in near-real time by creating a call center performance dashboard, as shown in the following example. Refer to Stream the data and validate output to configure OpenSearch Dashboards with Amazon Managed Service for Apache Flink. After you configure the connector, you can run the following command from the notebook to create an index in an OpenSearch Service cluster.

%flink.ssql(type=update)

drop table if exists active_call_queue;
CREATE TABLE active_call_queue (
window_start TIMESTAMP,
window_end TIMESTAMP,
OrgID int,
OrgName varchar,
CallsInQueue BIGINT,
Agent_cnt bigint,
max_handle_time bigint,
min_handle_time bigint,
max_wait_time bigint,
min_wait_time bigint
) WITH (
'connector' = 'elasticsearch-7',
'hosts' = '<Amazon OpenSearch host name>',
'index' = 'active_call_queue',
'username' = '<username>',
'password' = '<password>'
);

The active_call_queue index is created in OpenSearch Service. The following screenshot shows an index pattern from OpenSearch Dashboards.

Now you can create visualizations in OpenSearch Dashboards. The following screenshot shows an example.

Conclusion

In this post, we discussed ways to modernize a legacy, on-premises, real-time analytics architecture and build a serverless data analytics solution on AWS using Amazon Managed Service for Apache Flink. We also discussed challenges with relational databases when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions.

If you have any questions or suggestions, please leave us a comment.


About the Authors

Bhupesh Sharma is a Senior Data Engineer with AWS. His role is helping customers architect highly-available, high-performance, and cost-effective data analytics solutions to empower customers with data-driven decision-making. In his free time, he enjoys playing musical instruments, road biking, and swimming.

Devika Singh is a Senior Data Engineer at Amazon, with a deep understanding of AWS services, architecture, and cloud-based best practices. With her expertise in data analytics and AWS cloud migrations, she provides technical guidance to the community on architecting and building cost-effective, secure, and scalable solutions driven by business needs. Beyond her professional pursuits, she is passionate about classical music and travel.

Blue/Green deployments using AWS CDK Pipelines and AWS CodeDeploy

Post Syndicated from Luiz Decaro original https://aws.amazon.com/blogs/devops/blue-green-deployments-using-aws-cdk-pipelines-and-aws-codedeploy/

Customers often ask for help with implementing Blue/Green deployments to Amazon Elastic Container Service (Amazon ECS) using AWS CodeDeploy. Their use cases usually involve cross-Region and cross-account deployment scenarios. These requirements are challenging enough on their own, but in addition to those, there are specific design decisions that need to be considered when using CodeDeploy. These include how to configure CodeDeploy, when and how to create CodeDeploy resources (such as Application and Deployment Group), and how to write code that can be used to deploy to any combination of account and Region.

Today, I will discuss those design decisions in detail and how to use CDK Pipelines to implement a self-mutating pipeline that deploys services to Amazon ECS in cross-account and cross-Region scenarios. At the end of this blog post, I also introduce a demo application, available in Java, that follows best practices for developing and deploying cloud infrastructure using AWS Cloud Development Kit (AWS CDK).

The Pipeline

CDK Pipelines is an opinionated construct library used for building pipelines with different deployment engines. It abstracts implementation details that developers or infrastructure engineers need to solve when implementing a cross-Region or cross-account pipeline. For example, in cross-Region scenarios, AWS CloudFormation needs artifacts to be replicated to the target Region. For that reason, AWS Key Management Service (AWS KMS) keys, an Amazon Simple Storage Service (Amazon S3) bucket, and policies need to be created for the secondary Region. This enables artifacts to be moved from one Region to another. In cross-account scenarios, CodeDeploy requires a cross-account role with access to the KMS key used to encrypt configuration files. This is the sort of detail that our customers want to avoid dealing with manually.

AWS CodeDeploy is a deployment service that automates application deployment across different scenarios. It deploys to Amazon EC2 instances, on-premises instances, serverless Lambda functions, or Amazon ECS services. It integrates with AWS Identity and Access Management (IAM) to implement access control for deploying or re-deploying old versions of an application. In the Blue/Green deployment type, it is possible to automate the rollback of a deployment using Amazon CloudWatch alarms.

CDK Pipelines was designed to automate AWS CloudFormation deployments. Using AWS CDK, these CloudFormation deployments may include deploying application software to instances or containers. However, some customers prefer using CodeDeploy to deploy application software. In this blog post, CDK Pipelines will deploy using CodeDeploy instead of CloudFormation.

A pipeline built with CDK Pipelines that deploys to Amazon ECS using AWS CodeDeploy. It contains at least five stages: Source, Build, UpdatePipeline, Assets, and at least one deployment stage.

Design Considerations

In this post, I’m considering the use of CDK Pipelines to implement different use cases for deploying a service to any combination of accounts (single-account and cross-account) and Regions (single-Region and cross-Region) using CodeDeploy. More specifically, there are four problems that need to be solved, and the following sections address each of them in turn.

CodeDeploy Configuration

The most popular options for implementing a Blue/Green deployment type using CodeDeploy are using CloudFormation Hooks or using a CodeDeploy construct. I decided to operate CodeDeploy using its configuration files. This is a flexible design that doesn’t rely on using custom resources, which is another technique customers have used to solve this problem. On each run, a pipeline pushes a container image to a repository in Amazon Elastic Container Registry (Amazon ECR) and creates a tag. CodeDeploy needs that information to deploy the container.

I recommend creating a pipeline action to scan the AWS CDK cloud assembly and retrieve the repository and tag information. The same action can create the CodeDeploy configuration files. Three configuration files are required to configure CodeDeploy: appspec.yaml, taskdef.json and imageDetail.json. This pipeline action should be executed before the CodeDeploy deployment action. I recommend creating template files for appspec.yaml and taskdef.json. The following script can be used to implement the pipeline action:

##
#!/bin/sh
#
# Action Configure AWS CodeDeploy
# It customizes the files template-appspec.yaml and template-taskdef.json to the environment
#
# Account = The target Account Id
# AppName = Name of the application
# StageName = Name of the stage
# Region = Name of the region (us-east-1, us-east-2)
# PipelineId = Id of the pipeline
# ServiceName = Name of the service. It will be used to define the role and the task definition name
#
# Primary output directory is codedeploy/. All the 3 files created (appspec.json, imageDetail.json and 
# taskDef.json) will be located inside the codedeploy/ directory
#
##
Account=$1
Region=$2
AppName=$3
StageName=$4
PipelineId=$5
ServiceName=$6
repo_name=$(cat assembly*$PipelineId-$StageName/*.assets.json | jq -r '.dockerImages[] | .destinations[] | .repositoryName' | head -1) 
tag_name=$(cat assembly*$PipelineId-$StageName/*.assets.json | jq -r '.dockerImages | to_entries[0].key')  
echo ${repo_name} 
echo ${tag_name} 
printf '{"ImageURI":"%s"}' "$Account.dkr.ecr.$Region.amazonaws.com/${repo_name}:${tag_name}" > codedeploy/imageDetail.json                     
sed 's#APPLICATION#'$AppName'#g' codedeploy/template-appspec.yaml > codedeploy/appspec.yaml 
sed 's#APPLICATION#'$AppName'#g' codedeploy/template-taskdef.json | sed 's#TASK_EXEC_ROLE#arn:aws:iam::'$Account':role/'$ServiceName'#g' | sed 's#fargate-task-definition#'$ServiceName'#g' > codedeploy/taskdef.json 
cat codedeploy/appspec.yaml
cat codedeploy/taskdef.json
cat codedeploy/imageDetail.json
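If you prefer Python over shell and jq for this pipeline action, the repository and tag extraction can be sketched as follows. This is a hypothetical helper that assumes the same *.assets.json shape the jq commands above parse:

```python
# Hypothetical helper: extract the ECR repository name and image tag from the
# parsed contents of a CDK cloud-assembly *.assets.json file, mirroring the
# jq expressions used in the shell script.
def repo_and_tag(assets):
    images = assets["dockerImages"]
    tag = next(iter(images))  # the first image key serves as the tag
    destinations = images[tag]["destinations"]
    repo = next(iter(destinations.values()))["repositoryName"]
    return repo, tag
```

The returned pair is what the script writes into imageDetail.json as `ImageURI`, combined with the account and Region.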

Using a Toolchain

A good strategy is to encapsulate the pipeline inside a Toolchain to abstract how to deploy to different accounts and regions. This helps decouple clients from details such as how the pipeline is created, how CodeDeploy is configured, and how cross-account and cross-Region deployments are implemented. To create the pipeline, deploy a Toolchain stack. Out-of-the-box, it allows different environments to be added as needed. Depending on the requirements, the pipeline may be customized to reflect the different stages or waves that different components might require. For more information, please refer to our best practices on how to automate safe, hands-off deployments and its reference implementation.

In detail, the Toolchain stack follows the builder pattern used throughout the CDK for Java. This is a convenience that allows complex objects to be created using a single statement:

 Toolchain.Builder.create(app, Constants.APP_NAME+"Toolchain")
        .stackProperties(StackProps.builder()
                .env(Environment.builder()
                        .account(Demo.TOOLCHAIN_ACCOUNT)
                        .region(Demo.TOOLCHAIN_REGION)
                        .build())
                .build())
        .setGitRepo(Demo.CODECOMMIT_REPO)
        .setGitBranch(Demo.CODECOMMIT_BRANCH)
        .addStage(
                "UAT",
                EcsDeploymentConfig.CANARY_10_PERCENT_5_MINUTES,
                Environment.builder()
                        .account(Demo.SERVICE_ACCOUNT)
                        .region(Demo.SERVICE_REGION)
                        .build())                                                                                                             
        .build();

In the statement above, the continuous deployment pipeline is created in the TOOLCHAIN_ACCOUNT and TOOLCHAIN_REGION. It implements a stage that builds the source code and creates the Java archive (JAR) using Apache Maven. The pipeline then creates a Docker image containing the JAR file.

The UAT stage will deploy the service to the SERVICE_ACCOUNT and SERVICE_REGION using the deployment configuration CANARY_10_PERCENT_5_MINUTES. This means 10 percent of the traffic is shifted in the first increment and the remaining 90 percent is deployed 5 minutes later.

To create additional deployment stages, you need a stage name, a CodeDeploy deployment configuration and an environment where it should deploy the service. As mentioned, the pipeline is, by default, a self-mutating pipeline. For example, to add a Prod stage, update the code that creates the Toolchain object and submit this change to the code repository. The pipeline will run and update itself adding a Prod stage after the UAT stage. Next, I show in detail the statement used to add a new Prod stage. The new stage deploys to the same account and Region as in the UAT environment:

... 
        .addStage(
                "Prod",
                EcsDeploymentConfig.CANARY_10_PERCENT_5_MINUTES,
                Environment.builder()
                        .account(Demo.SERVICE_ACCOUNT)
                        .region(Demo.SERVICE_REGION)
                        .build())                                                                                                                                      
        .build();

In the statement above, the Prod stage deploys new versions of the service using the CANARY_10_PERCENT_5_MINUTES deployment configuration: 10 percent of the traffic is shifted in the first increment, and the remaining 90 percent is shifted 5 minutes later. Please refer to the Organizing Your AWS Environment Using Multiple Accounts whitepaper for best practices on how to isolate and manage your business applications.
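As a rough model, the traffic split implied by CANARY_10_PERCENT_5_MINUTES can be sketched as follows. This is illustrative only; CodeDeploy manages the actual shifting and rollback behavior:

```python
# Illustrative model of CANARY_10_PERCENT_5_MINUTES: 10 percent of traffic
# goes to the new version immediately, and the remaining 90 percent shifts
# after a 5-minute bake period.
def traffic_to_new_version(minutes_since_start, canary_pct=10, bake_minutes=5):
    return canary_pct if minutes_since_start < bake_minutes else 100
```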

Some customers might find this approach interesting and decide to provide this as an abstraction to their application development teams. In this case, I advise creating a construct that builds such a pipeline. Using a construct would allow for further customization. Examples are stages that promote quality assurance or deploy the service in a disaster recovery scenario.

The implementation creates a stack for the toolchain and another stack for each deployment stage. As an example, consider a toolchain created with a single deployment stage named UAT. After running successfully, the DemoToolchain and DemoService-UAT stacks should be created as in the next image:

Two stacks are needed to create a Pipeline that deploys to a single environment. One stack deploys the Toolchain with the Pipeline and another stack deploys the Service compute infrastructure and CodeDeploy Application and DeploymentGroup. In this example, for an application named Demo that deploys to an environment named UAT, the stacks deployed are: DemoToolchain and DemoService-UAT

CodeDeploy Application and Deployment Group

CodeDeploy configuration requires an application and a deployment group. Depending on the use case, you need to create these in the same account as the toolchain (pipeline) or in a different one. The pipeline includes the CodeDeploy deployment action that performs the blue/green deployment. My recommendation is to create the CodeDeploy application and deployment group as part of the Service stack. This approach aligns the lifecycle of the CodeDeploy application and deployment group with that of the related Service stack instance.

CodePipeline allows you to create a CodeDeploy deployment action that references a non-existing CodeDeploy application and deployment group. This allows us to implement the following approach:

  • Toolchain stack deploys the pipeline with CodeDeploy deployment action referencing a non-existing CodeDeploy application and deployment group
  • When the pipeline executes, it first deploys the Service stack that creates the related CodeDeploy application and deployment group
  • The next pipeline action executes the CodeDeploy deployment action. When the pipeline executes the CodeDeploy deployment action, the related CodeDeploy application and deployment group will already exist.

Below is the pipeline code that references the (initially non-existing) CodeDeploy application and deployment group.

private IEcsDeploymentGroup referenceCodeDeployDeploymentGroup(
        final Environment env, 
        final String serviceName, 
        final IEcsDeploymentConfig ecsDeploymentConfig, 
        final String stageName) {

    IEcsApplication codeDeployApp = EcsApplication.fromEcsApplicationArn(
            this,
            Constants.APP_NAME + "EcsCodeDeployApp-"+stageName,
            Arn.format(ArnComponents.builder()
                    .arnFormat(ArnFormat.COLON_RESOURCE_NAME)
                    .partition("aws")
                    .region(env.getRegion())
                    .service("codedeploy")
                    .account(env.getAccount())
                    .resource("application")
                    .resourceName(serviceName)
                    .build()));

    IEcsDeploymentGroup deploymentGroup = EcsDeploymentGroup.fromEcsDeploymentGroupAttributes(
            this,
            Constants.APP_NAME + "-EcsCodeDeployDG-"+stageName,
            EcsDeploymentGroupAttributes.builder()
                    .deploymentGroupName(serviceName)
                    .application(codeDeployApp)
                    .deploymentConfig(ecsDeploymentConfig)
                    .build());

    return deploymentGroup;
}
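The Arn.format call above assembles the CodeDeploy application ARN using the COLON_RESOURCE_NAME format. The shape it produces can be sketched with a small hypothetical helper (the account, Region, and service name below are placeholders):

```python
# Hypothetical helper mirroring the ARN built by the Java code above: CodeDeploy
# application ARNs place a colon before the resource name (COLON_RESOURCE_NAME).
def codedeploy_application_arn(partition, region, account, application_name):
    return f"arn:{partition}:codedeploy:{region}:{account}:application:{application_name}"
```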

To make this work, you should use the same application name and deployment group name values when creating the CodeDeploy deployment action in the pipeline and when creating the CodeDeploy application and deployment group in the Service stack (where the Amazon ECS infrastructure is deployed). This approach is necessary to avoid a circular dependency error that would occur when creating the CodeDeploy application and deployment group inside the Service stack and referencing those objects to configure the CodeDeploy deployment action inside the pipeline. Below is the code that uses the Service stack construct ID to name the CodeDeploy application and deployment group. I set the Service stack construct ID to the same name I used when creating the CodeDeploy deployment action in the pipeline.

   // configure AWS CodeDeploy Application and DeploymentGroup
   EcsApplication app = EcsApplication.Builder.create(this, "BlueGreenApplication")
           .applicationName(id)
           .build();

   EcsDeploymentGroup.Builder.create(this, "BlueGreenDeploymentGroup")
           .deploymentGroupName(id)
           .application(app)
           .service(albService.getService())
           .role(createCodeDeployExecutionRole(id))
           .blueGreenDeploymentConfig(EcsBlueGreenDeploymentConfig.builder()
                   .blueTargetGroup(albService.getTargetGroup())
                   .greenTargetGroup(tgGreen)
                   .listener(albService.getListener())
                   .testListener(listenerGreen)
                   .terminationWaitTime(Duration.minutes(15))
                   .build())
           .deploymentConfig(deploymentConfig)
           .build();

CDK Pipelines roles and permissions

CDK Pipelines creates roles and permissions the pipeline uses to execute deployments in different scenarios of regions and accounts. When using CodeDeploy in cross-account scenarios, CDK Pipelines deploys a cross-account support stack that creates a pipeline action role for the CodeDeploy action. This cross-account support stack is defined in a JSON file that needs to be published to the AWS CDK assets bucket in the target account. If the pipeline has the self-mutation feature on (default), the UpdatePipeline stage will do a cdk deploy to deploy changes to the pipeline. In cross-account scenarios, this deployment also involves deploying/updating the cross-account support stack. For this, the SelfMutate action in the UpdatePipeline stage needs to assume the CDK file-publishing and deploy roles in the remote account.

The IAM role associated with the AWS CodeBuild project that runs the UpdatePipeline stage does not have these permissions by default. CDK Pipelines cannot grant these permissions automatically, because the information about the permissions that the cross-account stack needs is only available after the AWS CDK app finishes synthesizing. At that point, the permissions that the pipeline has are already locked in. Hence, for cross-account scenarios, the toolchain should extend the permissions of the pipeline’s UpdatePipeline stage to include the file-publishing and deploy roles.

In cross-account environments it is possible to manually add these permissions to the UpdatePipeline stage. To accomplish that, the Toolchain stack may be used to hide this sort of implementation detail. In the end, a method like the one below can be used to add these missing permissions. For each mapping of stage and environment in the pipeline, it validates whether the target account differs from the account where the pipeline is deployed. When it does, it grants the UpdatePipeline stage permission to assume the CDK bootstrap roles in the target account (tagged with key aws-cdk:bootstrap-role and value file-publishing or deploy). The example below shows how to add these permissions to the UpdatePipeline stage:

private void grantUpdatePipelineCrossAccountPermissions(Map<String, Environment> stageNameEnvironment) {

    if (!stageNameEnvironment.isEmpty()) {

        this.pipeline.buildPipeline();
        for (String stage : stageNameEnvironment.keySet()) {

            HashMap<String, String[]> condition = new HashMap<>();
            condition.put(
                    "iam:ResourceTag/aws-cdk:bootstrap-role",
                    new String[] {"file-publishing", "deploy"});
            pipeline.getSelfMutationProject()
                    .getRole()
                    .addToPrincipalPolicy(PolicyStatement.Builder.create()
                            .actions(Arrays.asList("sts:AssumeRole"))
                            .effect(Effect.ALLOW)
                            .resources(Arrays.asList("arn:*:iam::"
                                    + stageNameEnvironment.get(stage).getAccount() + ":role/*"))
                            .conditions(new HashMap<String, Object>() {{
                                    put("ForAnyValue:StringEquals", condition);
                            }})
                            .build());
        }
    }
}
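Rendered as an IAM policy document, the statement that this method attaches looks roughly like the following. This is a sketch with a placeholder account ID, showing the JSON the Java builder produces:

```python
# Sketch of the IAM policy statement the method above adds: allow the
# UpdatePipeline role to assume only roles in the target account that are
# tagged as CDK bootstrap file-publishing or deploy roles.
def update_pipeline_statement(account_id):
    return {
        "Effect": "Allow",
        "Action": ["sts:AssumeRole"],
        "Resource": [f"arn:*:iam::{account_id}:role/*"],
        "Condition": {
            "ForAnyValue:StringEquals": {
                "iam:ResourceTag/aws-cdk:bootstrap-role": [
                    "file-publishing",
                    "deploy",
                ]
            }
        },
    }
```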

The Deployment Stage

Let’s consider a pipeline that has a single deployment stage, UAT. The UAT stage deploys a DemoService. For that, it requires four actions: DemoService-UAT.Prepare, DemoService-UAT.Deploy, ConfigureBlueGreenDeploy, and Deploy.

When using CodeDeploy the deployment stage is expected to have four actions: two actions to create CloudFormation change set and deploy the ECS or compute infrastructure, an action to configure CodeDeploy and the last action that deploys the application using CodeDeploy. In the diagram, these are (in the diagram in the respective order): DemoService-UAT.Prepare and DemoService-UAT.Deploy, ConfigureBlueGreenDeploy and Deploy.

The DemoService-UAT.Deploy action will create the ECS resources and the CodeDeploy application and deployment group. The ConfigureBlueGreenDeploy action will read the AWS CDK cloud assembly and use the configuration files to identify the Amazon Elastic Container Registry (Amazon ECR) repository and the container image tag pushed. The pipeline will send this information to the Deploy action, which starts the deployment using CodeDeploy.

Solution Overview

As a convenience, I created an application, written in Java, that solves all these challenges and can be used as an example. The application deployment follows the same 5 steps for all deployment scenarios of account and Region, and this includes the scenarios represented in the following design:

A pipeline created by a Toolchain should be able to deploy to any combination of accounts and regions. This includes four scenarios: single-account and single-Region, single-account and cross-Region, cross-account and single-Region and cross-account and cross-Region

Conclusion

In this post, I identified, explained and solved challenges associated with the creation of a pipeline that deploys a service to Amazon ECS using CodeDeploy in different combinations of accounts and regions. I also introduced a demo application that implements these recommendations. The sample code can be extended to implement more elaborate scenarios. These scenarios might include automated testing, automated deployment rollbacks, or disaster recovery. I wish you success in your transformative journey.

Luiz Decaro

Luiz is a Principal Solutions architect at Amazon Web Services (AWS). He focuses on helping customers from the Financial Services Industry succeed in the cloud. Luiz holds a master’s in software engineering and he triggered his first continuous deployment pipeline in 2005.

Automate legacy ETL conversion to AWS Glue using Cognizant Data and Intelligence Toolkit (CDIT) – ETL Conversion Tool

Post Syndicated from Deepak Singh original https://aws.amazon.com/blogs/big-data/automate-legacy-etl-conversion-to-aws-glue-using-cognizant-data-and-intelligence-toolkit-cdit-etl-conversion-tool/

This blog post is co-written with Govind Mohan and Kausik Dhar from Cognizant. 

Migrating on-premises data warehouses to the cloud is no longer viewed as an option but a necessity for companies to save cost and take advantage of what the latest technology has to offer. Although we have seen a lot of focus on migrating data from legacy data warehouses to the cloud, and multiple tools to support this initiative, data is only part of the journey. Successful migration of the legacy extract, transform, and load (ETL) processes that acquire, enrich, and transform the data plays a key role in the success of any end-to-end data warehouse migration to the cloud.

The traditional approach of manually rewriting a large number of ETL processes to cloud-native technologies like AWS Glue is time consuming and can be prone to human error. Cognizant Data & Intelligence Toolkit (CDIT) – ETL Conversion Tool automates this process, bringing in more predictability and accuracy, eliminating the risk associated with manual conversion, and providing faster time to market for customers.

Cognizant is an AWS Premier Tier Services Partner with several AWS Competencies. With its industry-based, consultative approach, Cognizant helps clients envision, build, and run more innovative and efficient businesses.

In this post, we describe how Cognizant’s Data & Intelligence Toolkit (CDIT)- ETL Conversion Tool can help you automatically convert legacy ETL code to AWS Glue quickly and effectively. We also describe the main steps involved, the supported features, and their benefits.

Solution overview

Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool automates conversion of ETL pipelines and orchestration code from legacy tools to AWS Glue and AWS Step Functions and eliminates the manual processes involved in a customer’s ETL cloud migration journey.

It comes with an intuitive user interface (UI). You use the tool by selecting the source and target ETL tool for conversion and then uploading an XML file of the ETL mapping to be converted as input.

The tool also supports continuous monitoring of the overall progress, and alerting mechanisms are in place in the event of any failures, errors, or operational issues.

Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool internally uses many native AWS services, such as Amazon Simple Storage Service (Amazon S3) and Amazon Relational Database Service (Amazon RDS) for storage and metadata management; Amazon Elastic Compute Cloud (Amazon EC2) and AWS Lambda for processing; Amazon CloudWatch, AWS Key Management Service (AWS KMS), and AWS IAM Identity Center (successor to AWS Single Sign-On) for monitoring and security; and AWS CloudFormation for infrastructure management. The following diagram illustrates this architecture.

How to use CDIT: ETL Conversion Tool for ETL migration.

Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool supports the following legacy ETL tools as source and supports generating corresponding AWS Glue ETL scripts in both Python and Scala:

  • Informatica
  • DataStage
  • SSIS
  • Talend

Let’s look at the migration steps in more detail.

Assess the legacy ETL process

Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool enables you to assess in bulk the potential automation percentage and complexity of a set of ETL jobs and workflows that are in scope for migration to AWS Glue. The assessment option helps you understand what kind of savings can be achieved using Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool, the complexity of the ETL mappings, and the extent of manual conversion needed, if any. You can upload a single ETL mapping or a folder containing multiple ETL mappings as input for assessment and generate an assessment report, as shown in the following figure.

Convert the ETL code to AWS Glue

To convert legacy ETL code, you upload the XML file of the ETL mapping as input to the tool. User inputs are stored in the internal metadata repository of the tool and Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool parses these XML input files and breaks them down to a patented canonical model, which is then forward engineered into the target AWS Glue scripts in Python or Scala. The following screenshot shows an example of the Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool GUI and Output Console pane.

If any part of the input ETL job couldn’t be converted completely to the equivalent AWS Glue script, it’s tagged between comment lines in the output so that it can be manually fixed.

Convert the workflow to Step Functions

The next logical step after converting the legacy ETL jobs is to orchestrate the run of these jobs in the logical order. Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool lets you automate the conversion of on-premises ETL workflows by converting them to corresponding Step Functions workflows. The following figure illustrates a sample input Informatica workflow.

Workflow conversion follows a pattern similar to that of ETL mapping conversion: XML files for ETL workflows are uploaded as input, and Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool generates the equivalent Step Functions JSON file based on the input XML data.
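To make the idea concrete, converting a trivial linear workflow into a Step Functions state machine definition can be sketched as follows. This is purely illustrative and is not CDIT's actual conversion logic; the job names are placeholders, and the Glue service integration shown is the standard Step Functions optimized integration:

```python
# Purely illustrative sketch: turn an ordered list of job names (a trivial
# linear workflow) into a Step Functions state machine definition with one
# Task state per AWS Glue job, chained in order.
def linear_workflow_to_sfn(job_names):
    names = list(job_names)
    states = {}
    for i, name in enumerate(names):
        state = {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": name},
        }
        if i + 1 < len(names):
            state["Next"] = names[i + 1]  # chain to the following job
        else:
            state["End"] = True  # last job terminates the workflow
        states[name] = state
    return {"StartAt": names[0], "States": states}
```

A real conversion would also have to map branching, error handling, and scheduling constructs from the source workflow, which is where most of the complexity lies.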

Benefits of using Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool

The following are the key benefits of using Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool to automate legacy ETL conversion:

  • Cost reduction – You can reduce the overall migration effort by as much as 80% by automating the conversion of ETL and workflows to AWS Glue and Step Functions
  • Better planning and implementation – You can assess the ETL scope and determine automation percentage, complexity, and unsupported patterns before the start of the project, resulting in accurate estimation and timelines
  • Completeness – Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool offers one solution with support for multiple legacy ETL tools like Informatica, DataStage, Talend, and more
  • Improved customer experience – You can achieve migration goals seamlessly without errors caused by manual conversion and with high automation percentage

Case study: Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool proposed implementation

A large US-based insurance and annuities company wanted to migrate their legacy ETL process in Informatica to AWS Glue as part of their cloud migration strategy.

As part of this engagement, Cognizant helped the customer successfully migrate their Informatica-based data acquisition and integration ETL jobs and workflows to AWS. A proof of concept (PoC) using Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool was completed first to showcase and validate its automation capabilities.

Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool was used to automate the conversion of over 300 Informatica mappings and workflows to equivalent AWS Glue jobs and Step Functions workflows, respectively. As a result, the customer was able to migrate all legacy ETL code to AWS as planned and retire the legacy application.

The following are key highlights from this engagement:

  • Migration of over 300 legacy Informatica ETL jobs to AWS Glue
  • Automated conversion of over 6,000 transformations from legacy ETL to AWS Glue
  • 85% automation achieved using CDIT: ETL Conversion Tool
  • The customer saved licensing fees and retired their legacy application as planned

Conclusion

In this post, we discussed how migrating legacy ETL processes to the cloud is critical to the success of a cloud migration journey. Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool enables you to perform an assessment of the existing ETL process to derive complexity and automation percentage for better estimation and planning. We also discussed the ETL technologies supported by Cognizant Data & Intelligence Toolkit (CDIT): ETL Conversion Tool and how ETL jobs can be converted to corresponding AWS Glue scripts. Lastly, we demonstrated how to use existing ETL workflows to automatically generate corresponding Step Functions orchestration jobs.

To learn more, please reach out to Cognizant.


About the Authors

Deepak Singh is a Senior Solutions Architect at Amazon Web Services with 20+ years of experience in Data & AI. He enjoys working with AWS partners and customers on building scalable analytical solutions for their business outcomes. When not at work, he loves spending time with family or exploring new technologies in the analytics and AI space.

Piyush Patra is a Partner Solutions Architect at Amazon Web Services where he supports partners with their Analytics journeys and is the global lead for strategic Data Estate Modernization and Migration partner programs.

Govind Mohan is an Associate Director at Cognizant with over 18 years of experience in the data and analytics space. He has helped design and implement multiple large-scale data migration, application lift-and-shift, and legacy modernization projects, and works closely with customers to accelerate their cloud modernization journey using the Cognizant Data and Intelligence Toolkit (CDIT) platform.

Kausik Dhar is a technology leader with more than 23 years of IT experience, primarily focused on data and analytics, data modernization, application development, delivery management, and solution architecture. He has played a pivotal role in guiding clients through designing and executing large-scale data and process migrations, in addition to spearheading successful cloud implementations. Kausik possesses expertise in formulating migration strategies for complex programs and in constructing data lake/lakehouse architectures using a wide array of tools and technologies.

Stitch Fix seamless migration: Transitioning from self-managed Kafka to Amazon MSK

Post Syndicated from Karthik Kondamudi original https://aws.amazon.com/blogs/big-data/stitch-fix-seamless-migration-transitioning-from-self-managed-kafka-to-amazon-msk/

This post is co-written with Karthik Kondamudi and Jenny Thompson from Stitch Fix.

Stitch Fix is a personalized clothing styling service for men, women, and kids. The company has been powered by data science since its founding and relies on many modern data lake and data processing technologies. In our infrastructure, Apache Kafka has emerged as a powerful tool for managing event streams and facilitating real-time data processing. We have used Kafka extensively to support various needs across the business for over six years, and it plays a central role in our efforts to overhaul our event delivery infrastructure and build a self-service data integration platform.

If you’d like to know more background about how we use Kafka at Stitch Fix, please refer to our previously published blog post, Putting the Power of Kafka into the Hands of Data Scientists. This post includes much more information on business use cases, architecture diagrams, and technical infrastructure.

In this post, we will describe how and why we decided to migrate from self-managed Kafka to Amazon Managed Streaming for Apache Kafka (Amazon MSK). We’ll start with an overview of our self-managed Kafka, why we chose to migrate to Amazon MSK, and ultimately how we did it.

  1. Kafka clusters overview
  2. Why migrate to Amazon MSK
  3. How we migrated to Amazon MSK
  4. Navigating challenges and lessons learned
  5. Conclusion

Kafka Clusters Overview

At Stitch Fix, we rely on several different Kafka clusters dedicated to specific purposes. This allows us to scale these clusters independently and apply more stringent SLAs and message delivery guarantees per cluster. This also reduces overall risk by minimizing the impact of changes and upgrades and allows us to isolate and fix any issues that occur within a single cluster.

Our main Kafka cluster serves as the backbone of our data infrastructure. It handles a multitude of critical functions, including managing business events, facilitating microservice communication, supporting feature generation for machine learning workflows, and much more. The stability, reliability, and performance of this cluster are of utmost importance to our operations.

Our logging cluster plays a vital role in our data infrastructure. It serves as a centralized repository for various application logs, including web server logs and Nginx server logs. These logs provide valuable insights for monitoring and troubleshooting purposes. The logging cluster ensures smooth operations and efficient analysis of log data.

Why migrate to Amazon MSK

In the past six years, our data infrastructure team has diligently managed our Kafka clusters. While our team has acquired extensive knowledge in maintaining Kafka, we have also faced challenges such as rolling deployments for version upgrades, applying OS patches, and the overall operational overhead.

At Stitch Fix, our engineers thrive on creating new features and expanding our service offerings to delight our customers. However, we recognized that allocating significant resources to Kafka maintenance was taking away precious time from innovation. To overcome this challenge, we set out to find a managed service provider that could handle maintenance tasks like upgrades and patching while granting us complete control over cluster operations, including partition management and rebalancing. We also sought an effortless scaling solution for storage volumes, keeping our costs in check while being ready to accommodate future growth.

After thorough evaluation of multiple options, we found the perfect match in Amazon MSK because it allows us to offload cluster maintenance to the highly skilled Amazon engineers. With Amazon MSK in place, our teams can now focus their energy on developing innovative applications unique and valuable to Stitch Fix, instead of getting caught up in Kafka administration tasks.

Amazon MSK streamlines the process, eliminating the need for manual configurations, additional software installations, and worries about scaling. It simply works, enabling us to concentrate on delivering exceptional value to our cherished customers.

How we migrated to Amazon MSK

While planning our migration, we wanted to switch specific services to Amazon MSK individually with no downtime, migrating only a specific subset of services at a time. The overall infrastructure would run in a hybrid environment where some services connect to Amazon MSK and others to the existing Kafka infrastructure.

We decided to start the migration with our less critical logging cluster first and then proceed to migrating the main cluster. Although the logs are essential for monitoring and troubleshooting purposes, they hold relatively less significance to the core business operations. Additionally, the number and types of consumers and producers for the logging cluster is smaller, making it an easier choice to start with. Then, we were able to apply our learnings from the logging cluster migration to the main cluster. This deliberate choice enabled us to execute the migration process in a controlled manner, minimizing any potential disruptions to our critical systems.

Over the years, our experienced data infrastructure team has employed Apache Kafka MirrorMaker 2 (MM2) to replicate data between different Kafka clusters. Currently, we rely on MM2 to replicate data from two different production Kafka clusters. Given its proven track record within our organization, we decided to use MM2 as the primary tool for our data migration process.

The general guidance for MM2 is as follows:

  1. Begin with less critical applications.
  2. Perform active migrations.
  3. Familiarize yourself with key best practices for MM2.
  4. Implement monitoring to validate the migration.
  5. Accumulate essential insights for migrating other applications.

MM2 offers flexible deployment options, allowing it to function as a standalone cluster or be embedded within an existing Kafka Connect cluster. For our migration project, we deployed a dedicated Kafka Connect cluster operating in distributed mode.

This setup provided the scalability we needed, allowing us to easily expand the standalone cluster if necessary. Depending on specific use cases such as geoproximity, high availability (HA), or migrations, MM2 can be configured for active-active replication, active-passive replication, or both. In our case, as we migrated from self-managed Kafka to Amazon MSK, we opted for an active-passive configuration, where MirrorMaker was used for migration purposes and subsequently taken offline upon completion.

MirrorMaker configuration and replication policy

By default, MirrorMaker renames replication topics by prefixing the name of the source Kafka cluster to the destination cluster. For instance, if we replicate topic A from the source cluster “existing” to the new cluster “newkafka,” the replicated topic would be named “existing.A” in “newkafka.” However, this default behavior can be modified to maintain consistent topic names within the newly created MSK cluster.
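The two naming behaviors can be expressed as one-liners. This is a sketch of the policies' effect on topic names, not the actual Java implementations:

```python
def default_replicated_topic(source_alias, topic):
    """MirrorMaker 2's DefaultReplicationPolicy prefixes the source
    cluster alias to the topic name on the destination cluster."""
    return f"{source_alias}.{topic}"

def identity_replicated_topic(source_alias, topic):
    """An identity policy (the CustomReplicationPolicy approach) keeps
    topic names unchanged on the destination cluster."""
    return topic

print(default_replicated_topic("existing", "A"))   # existing.A
print(identity_replicated_topic("existing", "A"))  # A
```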

To maintain consistent topic names in the newly created MSK cluster and avoid downstream issues, we utilized the CustomReplicationPolicy jar provided by AWS. This jar, included in our MirrorMaker setup, allowed us to replicate topics with identical names in the MSK cluster. Additionally, we utilized MirrorCheckpointConnector to synchronize consumer offsets from the source cluster to the target cluster and MirrorHeartbeatConnector to ensure connectivity between the clusters.
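A sketch of the three connector configurations, as they might be submitted to the Kafka Connect REST API, is shown below. The cluster aliases, bootstrap servers, task count, and the custom replication policy class name are placeholders; substitute the values for your environment and the class provided in the AWS jar.

```python
import json

# Placeholder values: aliases, bootstrap servers, and the custom policy
# class name must be replaced with your environment's actual values.
COMMON = {
    "source.cluster.alias": "existing",
    "target.cluster.alias": "newkafka",
    "source.cluster.bootstrap.servers": "existing-broker:9092",
    "target.cluster.bootstrap.servers": "msk-broker:9098",
    "replication.policy.class": "com.example.CustomReplicationPolicy",
}

connectors = {
    "mm2-source": {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
        "topics": ".*",
        "tasks.max": "8",
        **COMMON,
    },
    "mm2-checkpoint": {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorCheckpointConnector",
        "sync.group.offsets.enabled": "true",  # sync consumer offsets to target
        **COMMON,
    },
    "mm2-heartbeat": {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorHeartbeatConnector",
        **COMMON,
    },
}

# Each config would be submitted to the Kafka Connect REST API,
# e.g. PUT /connectors/mm2-source/config with this JSON body:
print(json.dumps(connectors["mm2-source"], indent=2))
```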

Monitoring and metrics

MirrorMaker comes equipped with built-in metrics to monitor replication lag and other essential parameters. We integrated these metrics into our MirrorMaker setup, exporting them to Grafana for visualization. Since we have been using Grafana to monitor other systems, we decided to use it during migration as well. This enabled us to closely monitor the replication status during the migration process. The specific metrics we monitored will be described in more detail below.

Additionally, we monitored the MirrorCheckpointConnector included with MirrorMaker, as it periodically emits checkpoints in the destination cluster. These checkpoints contained offsets for each consumer group in the source cluster, ensuring seamless synchronization between the clusters.

Network layout

At Stitch Fix, we use several virtual private clouds (VPCs) through Amazon Virtual Private Cloud (Amazon VPC) for environment isolation in each of our AWS accounts. We have been using separate production and staging VPCs since we initially started using AWS. When necessary, peering of VPCs across accounts is handled through AWS Transit Gateway. To maintain the strong isolation between environments we have been using all along, we created separate MSK clusters in their respective VPCs for production and staging environments.

Side note: Connecting Kafka clients hosted in different VPCs is now easier with the recently announced Amazon MSK multi-VPC private connectivity, which was not available at the time of our migration.

Migration steps: High-level overview

In this section, we outline the high-level sequence of events for the migration process.

Kafka Connect setup and MM2 deploy

First, we deployed a new Kafka Connect cluster on an Amazon Elastic Compute Cloud (Amazon EC2) cluster as an intermediary between the existing Kafka cluster and the new MSK cluster. Next, we deployed the three MirrorMaker connectors to this Kafka Connect cluster. Initially, this cluster was configured to mirror all the existing topics and their configurations into the destination MSK cluster. (We eventually changed this configuration to be more granular, as described in the “Navigating challenges and lessons learned” section below.)

Monitor replication progress with MM metrics

Take advantage of the JMX metrics offered by MirrorMaker to monitor the progress of data replication. In addition to comprehensive metrics, we primarily focused on key metrics, namely replication-latency-ms and checkpoint-latency-ms. These metrics provide invaluable insights into the replication status, including crucial aspects such as replication lag and checkpoint latency. By seamlessly exporting these metrics to Grafana, you gain the ability to visualize and closely track the progress of replication, ensuring the successful reproduction of both historical and new data by MirrorMaker.
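A simple alerting check over the exported metric samples might look like the following sketch. The threshold and metric sample values are arbitrary; in practice the alert would live in Grafana on top of the scraped JMX metrics.

```python
def latency_breaches(samples, max_latency_ms=60_000):
    """samples: (metric_name, value_ms) pairs scraped from MirrorMaker's
    JMX metrics (e.g. replication-latency-ms, checkpoint-latency-ms).
    Returns the metric names that exceed the latency threshold."""
    return [name for name, value in samples if value > max_latency_ms]

breaches = latency_breaches([
    ("replication-latency-ms", 1_250),
    ("checkpoint-latency-ms", 95_000),
])
print(breaches)  # ['checkpoint-latency-ms']
```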

Evaluate usage metrics and provisioning

Analyze the usage metrics of the new MSK cluster to ensure proper provisioning. Consider factors such as storage, throughput, and performance. If required, resize the cluster to meet the observed usage patterns. While resizing may introduce additional time to the migration process, it is a cost-effective measure in the long run.

Sync consumer offsets between source and target clusters

Ensure that consumer offsets are synchronized between the source in-house clusters and the target MSK clusters. Once the consumer offsets are in sync, redirect the consumers of the existing in-house clusters to consume data from the new MSK cluster. This step ensures a seamless transition for consumers and allows uninterrupted data flow during the migration.

Update producer applications

After confirming that all consumers are successfully consuming data from the new MSK cluster, update the producer applications to write data directly to the new cluster. This final step completes the migration process, ensuring that all data is now being written to the new MSK cluster and taking full advantage of its capabilities.

Navigating challenges and lessons learned

During our migration, we encountered three challenges that required careful attention: scalable storage, more granular replication configuration, and memory allocation.

Initially, we faced issues with auto scaling Amazon MSK storage: storage auto scaling requires a 24-hour cool-off period before another scaling event can occur. We observed this when migrating the logging cluster and factored the cool-off period into the production cluster migration.
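The cool-off constraint amounts to a simple guard, sketched below. This only models the rule for planning purposes; Amazon MSK enforces it server-side.

```python
from datetime import datetime, timedelta

COOL_OFF = timedelta(hours=24)  # MSK storage auto scaling cool-off window

def can_autoscale_storage(last_scale_at, now):
    """Amazon MSK storage auto scaling allows at most one scaling
    event per 24-hour cool-off window."""
    return last_scale_at is None or now - last_scale_at >= COOL_OFF

now = datetime(2023, 6, 2, 12, 0)
print(can_autoscale_storage(datetime(2023, 6, 2, 3, 0), now))   # False (9h ago)
print(can_autoscale_storage(datetime(2023, 6, 1, 11, 0), now))  # True (25h ago)
```

Modeling this ahead of time helps you schedule load phases so a needed scale-up is never blocked mid-migration.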

Additionally, to optimize MirrorMaker replication speed, we updated the original configuration to divide the replication jobs into batches based on volume and allocated more tasks to high-volume topics.

During the initial phase, we initiated replication using a single connector to transfer all topics from the source to target clusters, encompassing a significant number of tasks. However, we encountered challenges such as increasing replication lag for high-volume topics and slower replication for specific topics. Upon careful examination of the metrics, we adopted an alternative approach by segregating high-volume topics into multiple connectors. In essence, we divided the topics into categories of high, medium, and low volumes, assigning them to respective connectors and adjusting the number of tasks based on replication latency. This strategic adjustment yielded positive outcomes, allowing us to achieve faster and more efficient data replication across the board.
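The segregation strategy above can be sketched as a small planning function. The throughput thresholds and task counts are illustrative assumptions, not our production values.

```python
def group_topics_by_volume(topic_mb_per_s, high=50, medium=5):
    """Assign each topic to a connector group by throughput (MB/s) and
    give higher-volume groups more MirrorMaker tasks. Thresholds and
    task counts are illustrative."""
    groups = {"high": [], "medium": [], "low": []}
    for topic, mbps in topic_mb_per_s.items():
        tier = "high" if mbps >= high else "medium" if mbps >= medium else "low"
        groups[tier].append(topic)
    tasks = {"high": 16, "medium": 8, "low": 4}  # tasks.max per connector
    return {tier: {"topics": sorted(ts), "tasks.max": tasks[tier]}
            for tier, ts in groups.items() if ts}

plan = group_topics_by_volume({"events": 120.0, "logs": 30.0, "audit": 0.5})
print(plan["high"]["topics"])  # ['events']
```

Each tier then maps to its own MirrorSourceConnector instance, which is what let us tune task counts per tier as replication latency dictated.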

Lastly, we encountered Java virtual machine heap memory exhaustion, resulting in missing metrics while running MirrorMaker replication. To address this, we increased memory allocation and restarted the MirrorMaker process.

Conclusion

Stitch Fix’s migration from self-managed Kafka to Amazon MSK has allowed us to shift our focus from maintenance tasks to delivering value for our customers. It has reduced our infrastructure costs by 40 percent and given us the confidence that we can easily scale the clusters in the future if needed. By strategically planning the migration and using Apache Kafka MirrorMaker, we achieved a seamless transition while ensuring high availability. The integration of monitoring and metrics provided valuable insights during the migration process, and Stitch Fix successfully navigated challenges along the way. The migration to Amazon MSK has empowered Stitch Fix to maximize the capabilities of Kafka while benefiting from the expertise of Amazon engineers, setting the stage for continued growth and innovation.

About the Authors

Karthik Kondamudi is an Engineering Manager in the Data and ML Platform Group at Stitch Fix. His interests lie in distributed systems and large-scale data processing. Beyond work, he enjoys spending time with family and hiking. A dog lover, he’s also passionate about sports, particularly cricket, tennis, and football.

Jenny Thompson is a Data Platform Engineer at Stitch Fix. She works on a variety of systems for Data Scientists, and enjoys making things clean, simple, and easy to use. She also likes making pancakes and Pavlova, browsing for furniture on Craigslist, and getting rained on during picnics.

Rahul Nammireddy is a Senior Solutions Architect at AWS, focused on guiding digital native customers through their cloud native transformation. With a passion for AI/ML technologies, he works with customers in industries such as retail and telecom, helping them innovate at a rapid pace. Throughout his 23+ year career, Rahul has held key technical leadership roles in a diverse range of companies, from startups to publicly listed organizations, showcasing his expertise as a builder and driving innovation. In his spare time, he enjoys watching football and playing cricket.

Todd McGrath is a data streaming specialist at Amazon Web Services where he advises customers on their streaming strategies, integration, architecture, and solutions. On the personal side, he enjoys watching and supporting his 3 teenagers in their preferred activities as well as following his own pursuits such as fishing, pickleball, ice hockey, and happy hour with friends and family on pontoon boats. Connect with him on LinkedIn.

Accelerate Amazon Redshift secure data use with Satori – Part 1

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/accelerate-amazon-redshift-secure-data-use-with-satori-part-1/

This post is co-written by Lisa Levy, Content Specialist at Satori.

Data democratization enables users to discover and gain access to data faster, improving informed data-driven decisions and using data to generate business impact. It also increases collaboration across teams and organizations, breaking down data silos and enabling cross-functional teams to work together more effectively.

A significant barrier to data democratization is ensuring that data remains secure and compliant. The ability to search, locate, and mask sensitive data is critical for the data democratization process. Amazon Redshift provides numerous features such as role-based access control (RBAC), row-level security (RLS), column-level security (CLS), and dynamic data masking to facilitate the secure use of data.

In this two-part series, we explore how Satori, an Amazon Redshift Ready partner, can help Amazon Redshift users automate secure access to data and provide their data users with self-service data access. Satori integrates natively with both Amazon Redshift provisioned clusters and Amazon Redshift Serverless for easy setup of your Amazon Redshift data warehouse in the secure Satori portal.

In part 1, we provide detailed steps on how to integrate Satori with your Amazon Redshift data warehouse and control how data is accessed with security policies.

In part 2, we will explore how to set up self-service data access with Satori to data stored in Amazon Redshift.

Satori’s data security platform

Satori is a data security platform that enables frictionless self-service access for users with built-in security. Satori accelerates implementing data security controls on data warehouses like Amazon Redshift, is straightforward to integrate, and doesn’t require any changes to your Amazon Redshift data, schema, or how your users interact with data.

Integrating Satori with Amazon Redshift accelerates organizations’ ability to make use of their data to generate business value. This faster time-to-value is achieved by enabling companies to manage data access more efficiently and effectively.

By using Satori with the Modern Data Architecture on AWS, you can find and get access to data using a personalized data portal, and companies can set policies such as just-in-time access to data and fine-grained access control. Additionally, all data access is audited. Satori seamlessly works with native Redshift objects, external tables that can be queried through Amazon Redshift Spectrum, as well as shared database objects through Redshift data sharing.

Satori anonymizes data on the fly, based on your requirements, according to users, roles, and datasets. The masking is applied regardless of the underlying database and doesn’t require writing code or making changes to your databases, data warehouses, and data lakes. Satori continuously monitors data access, identifies the location of each dataset, and classifies the data in each column. The result is a self-populating data inventory, which also classifies the data for you and allows you to add your own customized classifications.

Satori integrates with identity providers to enrich its identity context and deliver better analytics and more accurate access control policies. Satori interacts with identity providers either via API or by using the SAML protocol. Satori also integrates with business intelligence (BI) tools like Amazon QuickSight, Tableau, and Power BI to monitor and enforce security and privacy policies for data consumers who use BI tools to access data.

In this post, we explore how organizations can accelerate secure data use in Amazon Redshift with Satori, including the benefits of integration and the necessary steps to get started. We’ll walk through an example of integrating Satori with a Redshift cluster and see how security policies are applied dynamically when the data is queried through DBeaver.

Prerequisites

You should have the following prerequisites:

  • An AWS account.
  • A Redshift provisioned cluster and a Redshift Serverless endpoint to store and manage data. You can create and manage your cluster through the AWS Management Console, AWS Command Line Interface (AWS CLI), or Redshift API.
  • A Satori account and the Satori connector for Amazon Redshift.
  • A Redshift security group. You’ll need to configure your Redshift security group to allow inbound traffic from the Satori connector for Amazon Redshift. Note that Satori can be deployed as a software as a service (SaaS) data access controller or within your VPC.

Prepare the data

To set up our example, complete the following steps:

  1. On the Amazon Redshift console, navigate to Query Editor v2.

If you’re familiar with SQL Notebooks, you can download this SQL notebook for the demonstration and import it to quickly get started.

  2. In the Amazon Redshift provisioned cluster, use the following code to create tables, populate them, and create roles and users:
-- 1- Create Schema
create schema if not exists customer_schema;

-- 2- Create customer and credit_cards table
CREATE TABLE customer_schema.credit_cards (
customer_id INT,
name TEXT,
is_fraud BOOLEAN,
credit_card TEXT
);


create table customer_schema.customer (
id INT,
first_name TEXT,
last_name TEXT,
email TEXT,
gender TEXT,
ssn TEXT
);

-- 3- Populate the tables with sample data
INSERT INTO customer_schema.credit_cards
VALUES
(100,'John Smith','n', '4532109867542837'),
(101,'Jane Doe','y', '4716065243786267'),
(102,'Mahendra Singh','n', '5243111024532276'),
(103,'Adaku Zerhouni','n', '6011011238764578'),
(104,'Miguel Salazar','n', '6011290347689234'),
(105,'Jack Docket','n', '3736165700234635');

INSERT INTO customer_schema.customer VALUES
(1,'Yorke','Khomishin','[email protected]','Male','866-95-2246'),
(2,'Tedd','Donwell','[email protected]','Male','726-62-3033'),
(3,'Lucien','Keppe','[email protected]','Male','865-28-6322'),
(4,'Hester','Arnefield','[email protected]','Female','133-72-9078'),
(5,'Abigale','Bertouloume','[email protected]','Female','780-69-6814'),
(6,'Larissa','Bremen','[email protected]','Female','121-78-7749');

-- 4- Grant SELECT permissions on the tables
GRANT SELECT ON customer_schema.credit_cards TO PUBLIC;
GRANT SELECT ON customer_schema.customer TO PUBLIC;

-- 5- create roles
CREATE ROLE customer_service_role;
CREATE ROLE auditor_role;
CREATE ROLE developer_role;
CREATE ROLE datasteward_role;


-- 6- create four users
CREATE USER Jack WITH PASSWORD '1234Test!';
CREATE USER Kim WITH PASSWORD '1234Test!';
CREATE USER Mike WITH PASSWORD '1234Test!';
CREATE USER Sarah WITH PASSWORD '1234Test!';


-- 7- Grant roles to above users
GRANT ROLE customer_service_role TO Jack;
GRANT ROLE auditor_role TO Kim;
GRANT ROLE developer_role TO Mike;
GRANT ROLE datasteward_role TO Sarah;
  3. Get the namespaces for the Redshift provisioned cluster and the Redshift Serverless endpoint.

Connect to the provisioned cluster through Query Editor v2 and run the following SQL:

select current_namespace; -- (Save as <producer_namespace>)

Repeat the step for the Redshift Serverless endpoint and get its namespace:

select current_namespace; -- (Save as <consumer_namespace>)
  4. Connect to the Redshift provisioned cluster and create an outbound datashare (producer) with the following SQL:
-- Creating a datashare

CREATE DATASHARE cust_share SET PUBLICACCESSIBLE TRUE;

-- Adding schema to datashare

ALTER DATASHARE cust_share ADD SCHEMA customer_schema;

-- Adding the credit_cards table to the datashare. We can add all the tables if required

ALTER DATASHARE cust_share ADD TABLE customer_schema.credit_cards;

GRANT USAGE ON DATASHARE cust_share TO NAMESPACE '<consumer_namespace>'; -- (replace with the consumer namespace obtained earlier)
  5. Connect to the Redshift Serverless endpoint and run the following statement to set up the inbound datashare:
CREATE DATABASE cust_db FROM DATASHARE cust_share OF NAMESPACE '<producer_namespace>'; -- (replace with the producer namespace obtained earlier)
  6. Optionally, create the credit_cards table as an external table by using this sample file in Amazon S3 and adding the table to the AWS Glue Data Catalog through an AWS Glue crawler. Once the table is available in the Data Catalog, you can create the external schema in your Amazon Redshift Serverless endpoint using the following SQL:
CREATE external SCHEMA satori_external
FROM data catalog DATABASE 'satoriblog'
IAM_ROLE default
CREATE external DATABASE if not exists;

Verify that the external table credit_cards is available from your Redshift Serverless endpoint:

select * from satori_external.credit_cards;

Connect to Amazon Redshift

If you don’t have a Satori account, you can either create a test drive account or get Satori from the AWS Marketplace. Then complete the following steps to connect to Amazon Redshift:

  1. Log in to Satori.
  2. Choose Data Stores in the navigation pane, choose Add Datastore, and choose Amazon Redshift.

DatastoreSetup001

  3. Add your cluster identifier from the Amazon Redshift console. Satori will automatically detect the Region where your cluster resides within your AWS account.
  4. Satori will generate a Satori hostname for your cluster, which you will use to connect to your Redshift cluster.
  5. In this demonstration, we add a Redshift provisioned cluster and a Redshift Serverless endpoint to create two datastores in Satori.

DatastoreProvisioned003

Datastore Serverless002

  6. Allow inbound access for the Satori IP addresses listed in your Redshift cluster security group.

For more details on connecting Satori to your Redshift cluster, refer to Adding an AWS Redshift Data Store to Satori.

  7. Under Authentication Settings, enter your root or superuser credentials for each datastore.

AuthenticationSettings004

  8. Leave the rest of the tabs with their default settings and choose Save.

Now your data stores are ready to be accessed through Satori.

Create a dataset

Complete the following steps to create a dataset:

  1. Choose Datasets in the navigation pane and choose Add New Dataset.
  2. Select your datastore and enter the details of your dataset.

CustomerDataset005

A dataset is a collection of database objects that you group together. For the Redshift provisioned cluster, we created a customer dataset with details on the database and schema. You can also optionally focus on a specific table within the schema or even exclude certain schemas or tables from the dataset.

For Redshift Serverless, we created a dataset with all datastore locations to include the shared table and the external table.

ServerlessDataset006

  3. Choose Save.
  4. For each dataset, navigate to User Access Rules and create dataset user access policies for the roles we created.

UserAccessRoles007

  5. Enable Give Satori Control Over Access to the Dataset.
  6. Optionally, add expiration and revoke time configurations to the access policies to limit how long access is granted to the Redshift cluster.

Create a security policy for the dataset

Satori provides multiple masking profile templates that you can use as a baseline and customize before adding them to your security policies. Complete the following steps to create your security policy:

  1. Choose Masking Profiles in the navigation pane and use the Restrictive Policy template to create a masking policy.

MaskingProfiles008

  1. Provide a descriptive name for the policy.
  2. You can customize the policy further to add custom fields and their respective masking policies. The following example shows the additional field Credit Card Number that was added with the action to mask everything but the last four characters.

  1. Choose Security Policies in the navigation pane and create a security policy called Customer Data Security Policy.

  1. Associate the policy with the masking profile created in the previous step.

  1. Associate the created security policy with the datasets by editing the dataset and navigating to the Security Policies tab.
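The masking behavior configured earlier is straightforward to picture. The following is a minimal Python sketch of the "mask everything but the last four characters" action; it is illustrative only, not Satori's actual implementation:

```python
def mask_all_but_last_four(value: str, mask_char: str = "*") -> str:
    """Mask every character except the last four, preserving length."""
    if len(value) <= 4:
        return value  # nothing to mask
    return mask_char * (len(value) - 4) + value[-4:]

print(mask_all_but_last_four("4111111111111111"))  # ************1111
```

At query time, a policy engine applies a transformation like this to each protected column in the result set, so the underlying stored data is never changed.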

Now that the integration, policy, and access controls are set, let’s query the data through DBeaver.

Query secure data

To query your data, connect to the Redshift cluster and the Redshift Serverless endpoint using the respective Satori hostnames obtained earlier.

When you query the data in the Redshift provisioned cluster, you will see the security policies applied to the result set at runtime.

When you query the data through the Redshift Serverless endpoint, you will see the security policies applied to the credit_cards table shared from the Redshift provisioned cluster.

You will get similar results, with policies applied, if you query the external table in Amazon S3 from the Redshift Serverless endpoint.

Summary

In this post, we described how Satori can help you with secure data access from your Redshift cluster without requiring any changes to your Redshift data, schema, or how your users interact with data. In part 2, we will explore how to set up self-service data access to data stored in Amazon Redshift with the different roles we created as part of the initial setup.

Satori is available on the AWS Marketplace. To learn more, start a free trial or request a demo meeting.


About the authors

Jagadish Kumar is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Lisa Levy is a Content Specialist at Satori. She publishes informative content to effectively describe how Satori’s data security platform enhances organizational productivity.

How Chime Financial uses AWS to build a serverless stream analytics platform and defeat fraudsters

Post Syndicated from Khandu Shinde original https://aws.amazon.com/blogs/big-data/how-chime-financial-uses-aws-to-build-a-serverless-stream-analytics-platform-and-defeat-fraudsters/

This is a guest post by Khandu Shinde, Staff Software Engineer, and Edward Paget, Senior Software Engineer at Chime Financial.

Chime is a financial technology company founded on the premise that basic banking services should be helpful, easy, and free. Chime partners with national banks to design member-first financial products. This creates a more competitive market with better, lower-cost options for everyday Americans who aren’t being served well by traditional banks. We help drive innovation, inclusion, and access across the industry.

Chime has a responsibility to protect our members against unauthorized transactions on their accounts. Chime’s Risk Analysis team constantly monitors trends in our data to find patterns that indicate fraudulent transactions.

This post discusses how Chime utilizes AWS Glue, Amazon Kinesis, Amazon DynamoDB, and Amazon SageMaker to build an online, serverless fraud detection solution — the Chime Streaming 2.0 system.

Problem statement

In order to keep up with the rapid movement of fraudsters, our decision platform must continuously monitor user events and respond in real time. Our legacy data warehouse-based solution was not equipped for this challenge: it was designed to handle complex queries and business intelligence (BI) use cases at large scale, but with a minimum data freshness of 10 minutes, that architecture inherently couldn’t support the near real-time fraud detection use case.

To make high-quality decisions, we need to collect user event data from various sources and update risk profiles in real time. We also need to be able to add new fields and metrics to the risk profiles as our team identifies new attacks, without needing engineering intervention or complex deployments.

We decided to explore streaming analytics solutions where we can capture, transform, and store event streams at scale, and serve rule-based fraud detection models and machine learning (ML) models with milliseconds latency.

Solution overview

The following diagram illustrates the design of the Chime Streaming 2.0 system.

The design included the following key components:

  1. We have Amazon Kinesis Data Streams as our streaming data service to capture and store event streams at scale. Our stream pipelines capture various event types, including user enrollment events, user login events, card swipe events, peer-to-peer payments, and application screen actions.
  2. Amazon DynamoDB is another data source for our Streaming 2.0 system. It acts as the application backend and stores data such as blocked devices list and device-user mapping. We mainly use it as lookup tables in our pipeline.
  3. AWS Glue jobs form the backbone of our Streaming 2.0 system. The simple AWS Glue icon in the diagram represents thousands of AWS Glue jobs performing different transformations. To achieve the 5-15 seconds end-to-end data freshness service level agreement (SLA) for the Streaming 2.0 pipeline, we use streaming ETL jobs in AWS Glue to consume data from Kinesis Data Streams and apply near-real-time transformation. We choose AWS Glue mainly due to its serverless nature, which simplifies infrastructure management with automatic provisioning and worker management, and the ability to perform complex data transformations at scale.
  4. The AWS Glue streaming jobs generate derived fields and risk profiles that get stored in Amazon DynamoDB. We use Amazon DynamoDB as our online feature store due to its millisecond performance and scalability.
  5. Our applications call Amazon SageMaker Inference endpoints for fraud detections. The Amazon DynamoDB online feature store supports real-time inference with single digit millisecond query latency.
  6. We use Amazon Simple Storage Service (Amazon S3) as our offline feature store. It contains historical user activities and other derived ML features.
  7. Our data scientist team can access the dataset and perform ML model training and batch inferencing using Amazon SageMaker.
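Conceptually, the derived fields and risk profiles in steps 3 and 4 come down to incremental aggregation keyed by user. The following simplified Python sketch uses a plain dict where the real system uses AWS Glue streaming jobs and DynamoDB; the event shape and field names are illustrative, not Chime's schema:

```python
from collections import defaultdict

# Stand-in for the DynamoDB online feature store: user_id -> risk profile.
feature_store: dict = defaultdict(lambda: {"login_count": 0, "devices": set()})

def update_risk_profile(event: dict) -> dict:
    """Fold one user event into that user's risk profile."""
    profile = feature_store[event["user_id"]]
    if event["type"] == "login":
        profile["login_count"] += 1
        profile["devices"].add(event["device_id"])
    return profile

# Two login events from the same user on different devices:
update_risk_profile({"user_id": "u1", "type": "login", "device_id": "d1"})
update_risk_profile({"user_id": "u1", "type": "login", "device_id": "d2"})
print(feature_store["u1"]["login_count"])  # 2
```

A fraud model can then read the up-to-date profile with a single key lookup, which is what makes single-digit millisecond inference latency possible.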

AWS Glue pipeline implementation deep dive

There are several key design principles for our AWS Glue Pipeline and the Streaming 2.0 project.

  • We want to democratize our data platform and make the data pipeline accessible to all Chime developers.
  • We want to implement cloud financial backend services and achieve cost efficiency.

To achieve data democratization, we needed to enable different personas in the organization to use the platform and define transformation jobs quickly, without worrying about the actual implementation details of the pipelines. The data infrastructure team built an abstraction layer on top of Spark and integrated services. This layer contained API wrappers over integrated services, job tags, scheduling configurations and debug tooling, hiding Spark and other lower-level complexities from end users. As a result, end users were able to define jobs with declarative YAML configurations and define transformation logic with SQL. This simplified the onboarding process and accelerated the implementation phase.
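Chime's abstraction layer is internal, but the shape of a declarative job definition can be sketched as follows. The keys, the tag enforcement, and the validation logic here are our illustration under stated assumptions, not Chime's actual schema:

```python
# A declarative job spec, as it might look after parsing a user's YAML file.
job_spec = {
    "name": "card_swipe_features",
    "source": {"kinesis_stream": "card-swipe-events"},
    "sink": {"dynamodb_table": "risk-profiles"},
    "transform_sql": "SELECT user_id, COUNT(*) AS swipes FROM events GROUP BY user_id",
    "tags": {"team": "risk-analysis"},  # enforced for cost attribution
}

def validate_job_spec(spec: dict) -> list:
    """Return a list of problems; an empty list means the spec is acceptable."""
    problems = []
    for key in ("name", "source", "sink", "transform_sql", "tags"):
        if key not in spec:
            problems.append(f"missing required key: {key}")
    if "team" not in spec.get("tags", {}):
        problems.append("jobs must carry a team tag for cost attribution")
    return problems

print(validate_job_spec(job_spec))  # []
```

Validating specs like this at submission time is also how tagging can be enforced platform-wide, which is what makes the cost attribution dashboard described below possible.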

To achieve cost efficiency, our team built a cost attribution dashboard based on AWS cost allocation tags. We enforced tagging with the above abstraction layer and had clear cost attribution for all AWS Glue jobs down to the team level. This enabled us to track down less optimized jobs and work with job owners to implement best practices with impact-based priority. One common misconfiguration we found was sizing of AWS Glue jobs. With data democratization, many users lacked the knowledge to right-size their AWS Glue jobs. The AWS team introduced AWS Glue auto scaling to us as a solution. With AWS Glue Auto Scaling, we no longer needed to plan AWS Glue Spark cluster capacity in advance. We could just set the maximum number of workers and run the jobs. AWS Glue monitors the Spark application execution, and allocates more worker nodes to the cluster in near-real time after Spark requests more executors based on our workload requirements. We noticed a 30–45% cost saving across our AWS Glue Jobs once we turned on Auto Scaling.

Conclusion

In this post, we showed you how Chime’s Streaming 2.0 system allows us to ingest events and make them available to the decision platform just seconds after they are emitted from other services. This enables us to write better risk policies, provide fresher data for our machine learning models, and protect our members from unauthorized transactions on their accounts.

Over 500 developers in Chime are using this streaming pipeline and we ingest more than 1 million events per second. We follow the sizing and scaling process from the AWS Glue streaming ETL jobs best practices blog and land on a 1:1 mapping between Kinesis Shard and vCPU core. The end-to-end latency is less than 15 seconds, and it improves the model score calculation speed by 1200% compared to legacy implementation. This system has proven to be reliable, performant, and cost-effective at scale.

We hope this post will inspire your organization to build a real-time analytics platform using serverless technologies to accelerate your business goals.


About the Authors

Khandu Shinde is a Staff Engineer focused on Big Data Platforms and Solutions for Chime. He helps to make the platform scalable for Chime’s business needs with architectural direction and vision. He’s based in San Francisco where he plays cricket and watches movies.

Edward Paget is a Software Engineer working on building Chime’s capabilities to mitigate risk to ensure our members’ financial peace of mind. He enjoys being at the intersection of big data and programming language theory. He’s based in Chicago where he spends his time running along the lake shore.

Dylan Qu is a Specialist Solutions Architect focused on Big Data & Analytics with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

How Vercel Shipped Cron Jobs in 2 Months Using Amazon EventBridge Scheduler

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/how-vercel-shipped-cron-jobs-in-2-months-using-amazon-eventbridge-scheduler/

Vercel implemented Cron Jobs using Amazon EventBridge Scheduler, enabling their customers to create, manage, and run scheduled tasks at scale. The adoption of this feature was rapid, reaching over 7 million weekly cron invocations within a few months of release. This article shows how they did it and how they handle the massive scale they’re experiencing.

Vercel builds a front-end cloud that makes it easier for engineers to deploy and run their front-end applications. With more than 100 million deployments in Vercel in the last two years, Vercel helps users take advantage of best-in-class AWS infrastructure with zero configuration by relying heavily on serverless technology. Vercel provides a lot of features that help developers host their front-end applications. However, until the beginning of this year, they hadn’t built Cron Jobs yet.

A cron job is a scheduled task that automates running specific commands or scripts at predetermined intervals or fixed times. It enables users to set up regular, repetitive actions, such as backups, sending notification emails to customers, or processing payments when a subscription needs to be renewed. Cron jobs are widely used in computing environments to improve efficiency and automate routine operations, and they were a commonly requested feature from Vercel’s customers.

In December 2022, Vercel hosted an internal hackathon to foster innovation. That’s where Vincent Voyer and Andreas Schneider joined forces to build a prototype cron job feature for the Vercel platform. They formed a team of five people and worked on the feature for a week. The team worked on different tasks, from building a user interface to display the cron jobs to creating the backend implementation of the feature.

Amazon EventBridge Scheduler
When the hackathon team started thinking about solving the cron job problem, their first idea was to use Amazon EventBridge rules that run on a schedule. However, they realized quickly that this feature has a limit of 300 rules per account per AWS Region, which wasn’t enough for their intended use. Luckily, one of the team members had read the announcement of Amazon EventBridge Scheduler in the AWS Compute blog and they thought this would be a perfect tool for their problem.

By using EventBridge Scheduler, they could schedule millions of one-time or recurring tasks across more than 270 AWS services without provisioning or managing the underlying infrastructure.

How cron jobs work

To create a new cron job in Vercel, a customer defines the frequency at which the task will run and the API endpoint to invoke. In the backend, Vercel uses EventBridge Scheduler to create a new schedule for each cron job.

To call the endpoint, the team used an AWS Lambda function that receives the path that needs to be invoked as input parameters.

How cron jobs work

When the time comes for the cron job to run, EventBridge Scheduler invokes the function, which then calls the customer website endpoint that was configured.
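A backend like the one described could assemble the EventBridge Scheduler call roughly as follows. This is a hypothetical sketch: the names, ARNs, and input payload are placeholders, and Vercel's actual implementation is not public:

```python
def build_schedule_request(job_id: str, cron_expr: str,
                           lambda_arn: str, role_arn: str) -> dict:
    """Assemble kwargs for EventBridge Scheduler's create_schedule call."""
    return {
        "Name": f"cron-job-{job_id}",
        "ScheduleExpression": f"cron({cron_expr})",
        "FlexibleTimeWindow": {"Mode": "OFF"},  # invoke at the exact time
        "Target": {
            "Arn": lambda_arn,
            "RoleArn": role_arn,
            # The Lambda receives the customer's endpoint path as its input.
            "Input": '{"path": "/api/cron/renew-subscriptions"}',
        },
    }

params = build_schedule_request(
    "abc123", "0 * * * ? *",
    "arn:aws:lambda:us-east-1:123456789012:function:invoke-cron",
    "arn:aws:iam::123456789012:role/scheduler-invoke")
# import boto3
# boto3.client("scheduler").create_schedule(**params)  # requires AWS credentials
print(params["ScheduleExpression"])  # cron(0 * * * ? *)
```

One schedule per cron job keeps the mapping simple and stays well within EventBridge Scheduler's limits, which was the motivation for moving off EventBridge rules.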

By the end of the week, Vincent and his team had a working prototype version of the cron jobs feature, and they won a prize at the hackathon.

Building Vercel Cron Jobs
After working for one week on this prototype in December, the hackathon ended, and Vincent and his team returned to their regular jobs. In early January 2023, Vincent and the Vercel team decided to take the project and turn it into a real product.

During the hackathon, the team built the fundamental parts of the feature, but there were some details that they needed to polish to make it production ready. Vincent and Andreas worked on the feature, and in less than two months, on February 22, 2023, they announced Vercel Cron Jobs to the public. The announcement tweet got over 400 thousand views, and the community loved the launch.

Tweet from Vercel announcing cron jobs

The adoption of this feature was very rapid. Within a few months of launching Cron Jobs, Vercel reached over 7 million cron invocations per week, and they expect the adoption to continue growing.

Cron jobs adoption

How Vercel Cron Jobs Handles Scale
With this pace of adoption, scaling this feature is crucial for Vercel. In order to scale the number of cron invocations at this pace, they had to make some business and architectural decisions.

From the business perspective, they defined limits for their free-tier customers. Free-tier customers can create a maximum of two cron jobs in their account, and they can only have hourly schedules. This means that free customers cannot run a cron job every 30 minutes; instead, they can do it at most every hour. Only customers on Vercel paid tiers can take advantage of EventBridge Scheduler minute granularity for scheduling tasks.

Also, for free customers, minute precision isn’t guaranteed. To achieve this, Vincent took advantage of the time window configuration from EventBridge Scheduler. The flexible time window configuration allows you to start a schedule within a window of time. This means that the scheduled tasks are dispersed across the time window to reduce the impact of multiple requests on downstream services. This is very useful if, for example, many customers want to run their jobs at midnight. By using the flexible time window, the load can spread across a set window of time.
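The effect of the flexible time window can be modeled simply: each invocation due at the same moment starts at a bounded random offset within the window, so load spreads out instead of spiking. The following sketch is a simplified model of that dispersion, not EventBridge Scheduler's internal algorithm:

```python
import random

def disperse(due_minute: int, window_minutes: int, rng: random.Random) -> int:
    """Return the actual start minute for a task due at due_minute,
    spread uniformly across a flexible window of window_minutes."""
    return due_minute + rng.randrange(window_minutes)

rng = random.Random(42)  # seeded for reproducibility
# 1,000 jobs all scheduled for midnight (minute 0) with a 15-minute window:
starts = [disperse(0, 15, rng) for _ in range(1000)]
print(all(0 <= s < 15 for s in starts))  # True
```

Instead of 1,000 simultaneous requests at minute 0, downstream services see roughly 67 per minute over the window, which is exactly the smoothing described above.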

From the architectural perspective, Vercel took advantage of hosting the APIs and owning the functions that the cron jobs invoke.

Validating the calls

This means that when the Lambda function is started by EventBridge Scheduler, the function ends its run without waiting for a response from the API. Then Vercel validates if the cron job ran by checking if the API and Vercel function ran correctly from its observability mechanisms. In this way, the function duration is very short, less than 400 milliseconds. This allows Vercel to run a lot of functions per second without affecting their concurrency limits.

Lambda invocations and duration dashboard

What Was The Impact?
Vercel’s implementation of Cron Jobs is an excellent example of what serverless technologies enable. In two months, with two people working full time, they were able to launch a feature that their community needed and enthusiastically adopted. It shows the completeness of Vercel’s platform and is an important capability for convincing their customers to move to a paid account.

If you want to get started with EventBridge Scheduler, see Serverless Land patterns for EventBridge Scheduler, where you’ll find a broad range of examples to help you.

Marcia

Using and Managing Security Groups on AWS Snowball Edge devices

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/using-and-managing-security-groups-on-aws-snowball-edge-devices/

This blog post is written by Jared Novotny & Tareq Rajabi, Specialist Hybrid Edge Solution Architects. 

The AWS Snow family of products are purpose-built devices that allow petabyte-scale movement of data from on-premises locations to AWS Regions. Snow devices also enable customers to run Amazon Elastic Compute Cloud (Amazon EC2) instances with Amazon Elastic Block Storage (Amazon EBS), and Amazon Simple Storage Service (Amazon S3) in edge locations.

Security groups are used to protect EC2 instances by controlling ingress and egress traffic. Once a security group is created and associated with an instance, customers can add ingress and egress rules to control data flow. Just like the default VPC in a Region, there is a default security group on Snow devices. A default security group is applied when an instance is launched and no other security group is specified. This default security group in a Region allows all inbound traffic from network interfaces and instances that are assigned to the same security group, and allows all outbound traffic. On Snowball Edge, the default security group allows all inbound and outbound traffic.

In this post, we will review the tools and commands required to create, manage and use security groups on the Snowball Edge device.

Some things to keep in mind:

  1. AWS Snowball Edge is limited to 50 security groups.
  2. An instance can have only one security group attached, but each group can have a total of 120 rules, comprising 60 inbound and 60 outbound rules.
  3. Security groups can contain only allow statements to permit network traffic.
  4. Deny statements aren’t supported.
  5. Some commands in the Snowball Edge client (AWS CLI) don’t provide an output.
  6. AWS CLI commands can use the name or the security group ID.
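Because security groups contain only allow rules, deciding whether traffic is admitted reduces to checking whether any rule matches; absence of a match is the only way traffic is blocked. The following Python sketch models that evaluation (the rule shape is our illustration, IPv4 only):

```python
import ipaddress

def is_allowed(rules: list, protocol: str, port: int, source_ip: str) -> bool:
    """Allow-only semantics: traffic passes if any rule matches."""
    for rule in rules:
        proto_ok = rule["protocol"] in ("-1", protocol)  # "-1" means all protocols
        port_ok = rule["protocol"] == "-1" or rule["from_port"] <= port <= rule["to_port"]
        ip_ok = ipaddress.ip_address(source_ip) in ipaddress.ip_network(rule["cidr"])
        if proto_ok and port_ok and ip_ok:
            return True
    return False  # no deny rules exist; no match means dropped

ssh_only = [{"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr": "10.100.10.0/24"}]
print(is_allowed(ssh_only, "tcp", 22, "10.100.10.7"))   # True
print(is_allowed(ssh_only, "tcp", 443, "10.100.10.7"))  # False
```

This is the mental model to keep in mind when reading the authorize and revoke commands later in this post.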

Prerequisites and tools

Customers must place an order for Snowball Edge from their AWS Console to be able to run the following AWS CLI commands and configure security groups to protect their EC2 instances.

The AWS Snowball Edge client is a standalone terminal application that customers can run on their local servers and workstations to manage and operate their Snowball Edge devices. It supports Windows, Mac, and Linux systems.

AWS OpsHub is a graphical user interface that you can use to manage your AWS Snowball devices. Furthermore, it’s the easiest tool to use to unlock Snowball Edge devices. It can also be used to configure the device, launch instances, manage storage, and provide monitoring.

Customers can download and install the Snowball Edge client and AWS OpsHub from AWS Snowball resources.

Getting Started

To get started, when a Snow device arrives at a customer site, the customer must unlock the device and launch an EC2 instance. This can be done via AWS OpsHub or the AWS Snowball Edge client. The AWS Snow Family of devices supports both virtual network interfaces (VNIs) and direct network interfaces (DNIs); customers should review both interface types before deciding which one best fits their use case. Note that security groups are only supported with VNIs, so that is what is used in this post. A post explaining how to use these interfaces should be reviewed before proceeding.

Viewing security group information

Once the AWS Snowball Edge is unlocked, configured, and has an EC2 instance running, we can dig deeper into using security groups to act as a virtual firewall and control incoming and outgoing traffic.

Although the AWS OpsHub tool provides various functionalities for compute and storage operations, it can only be used to view the name of the security group associated with an instance on a Snowball Edge device:

view the name of the security group associated to an instance in a Snowball Edge device

Every other interaction with security groups must be through the AWS CLI.

The following command shows how to easily read the outputs describing the protocols, sources, and destinations. This particular command will show information about the default security group, which allows all inbound and outbound traffic on EC2 instances running on the Snowball Edge.

In the following sections we review the most common commands with examples and outputs.

View (all) existing security groups:

aws ec2 describe-security-groups --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge
{
    "SecurityGroups": [
        {
            "Description": "default security group",
            "GroupName": "default",
            "IpPermissions": [
                {
                    "IpProtocol": "-1",
                    "IpRanges": [
                        {
                            "CidrIp": "0.0.0.0/0"
                        }
                    ]
                }
            ],
            "GroupId": "s.sg-8ec664a23666db719",
            "IpPermissionsEgress": [
                {
                    "IpProtocol": "-1",
                    "IpRanges": [
                        {
                            "CidrIp": "0.0.0.0/0"
                        }
                    ]
                }
            ]
        }
    ]
}

Create new security group:

aws ec2 create-security-group --group-name allow-ssh --description "allow only ssh inbound" --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

The output returns a GroupId:

{  "GroupId": "s.sg-8f25ee27cee870b4a" }

Add port 22 ingress to security group:

aws ec2 authorize-security-group-ingress --group-id s.sg-8f25ee27cee870b4a --protocol tcp --port 22 --cidr 10.100.10.0/24 --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

{    "Return": true }

Note that if you’re using the default security group, then the outbound rule is still to allow all traffic.

Revoke port 22 ingress rule from security group

aws ec2 revoke-security-group-ingress --group-id s.sg-8f25ee27cee870b4a --ip-permissions IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges=[{CidrIp=10.100.10.0/24}] --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

{ "Return": true }

Revoke default egress rule:

aws ec2 revoke-security-group-egress --group-id s.sg-8f25ee27cee870b4a --ip-permissions IpProtocol="-1",IpRanges=[{CidrIp=0.0.0.0/0}] --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

{ "Return": true }

Note that this rule will remove all outbound ephemeral ports.

Add default outbound rule (revoked above):

aws ec2 authorize-security-group-egress --group-id s.sg-8f25ee27cee870b4a --ip-permissions IpProtocol="-1",IpRanges=[{CidrIp=0.0.0.0/0}] --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

{  "Return": true }

Changing an instance’s existing security group:

aws ec2 modify-instance-attribute --instance-id s.i-852971d05144e1d63 --groups s.sg-8f25ee27cee870b4a --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

Note that this command produces no output. We can verify that it worked with the “aws ec2 describe-instances” command. See the example as follows (command output simplified):

aws ec2 describe-instances --instance-id s.i-852971d05144e1d63 --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge


{
    "Reservations": [{
            "Instances": [{
                    "InstanceId": "s.i-852971d05144e1d63",
                    "InstanceType": "sbe-c.2xlarge",
                    "LaunchTime": "2022-06-27T14:58:30.167000+00:00",
                    "PrivateIpAddress": "34.223.14.193",
                    "PublicIpAddress": "10.100.10.60",
                    "SecurityGroups": [
                        {
                            "GroupName": "allow-ssh",
                            "GroupId": "s.sg-8f25ee27cee870b4a"
                        }
                    ]
            }]
    }]
}

Changing an instance’s security group back to default:

aws ec2 modify-instance-attribute --instance-id s.i-852971d05144e1d63 --groups s.sg-8ec664a23666db719 --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

Note that this command produces no output. You can verify that it worked with the “aws ec2 describe-instances” command. See the example as follows:

aws ec2 describe-instances --instance-id s.i-852971d05144e1d63 --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

{
    "Reservations": [{
            "Instances": [{
                    "AmiLaunchIndex": 0,
                    "ImageId": "s.ami-8b0223704ca8f08b2",
                    "InstanceId": "s.i-852971d05144e1d63",
                    "InstanceType": "sbe-c.2xlarge",
                    "LaunchTime": "2022-06-27T14:58:30.167000+00:00",
                    "PrivateIpAddress": "34.223.14.193",
                    "PublicIpAddress": "10.100.10.60",
                    "SecurityGroups": [
                        {
                            "GroupName": "default",
                            "GroupId": "s.sg-8ec664a23666db719"
                        }
                    ]
            }]
    }]
}

Delete security group:

aws ec2 delete-security-group --group-id s.sg-8f25ee27cee870b4a --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

Sample walkthrough to add a SSH Security Group

As an example, assume a single EC2 instance “A” running on a Snowball Edge device. By default, all traffic is allowed to EC2 instance “A”. As per the following diagram, we want to tighten security and allow only the management PC to SSH to the instance.

1. Create an SSH security group:

aws ec2 create-security-group --group-name MySshGroup --description "ssh access" --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

2. This will return a “GroupId” as an output:

{   "GroupId": "s.sg-8a420242d86dbbb89" }

3. After the creation of the security group, we must allow port 22 ingress from the management PC’s IP:

aws ec2 authorize-security-group-ingress --group-name MySshGroup --protocol tcp --port 22 --cidr 192.168.26.193/32 --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

4. Verify that the security group has been created:

aws ec2 describe-security-groups --group-names MySshGroup --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

{
    "SecurityGroups": [
        {
            "Description": "ssh access",
            "GroupName": "MySshGroup",
            "IpPermissions": [
                {
                    "FromPort": 22,
                    "IpProtocol": "tcp",
                    "IpRanges": [
                        {
                            "CidrIp": "192.168.26.193/32"
                        }
                    ],
                    "ToPort": 22
                }
            ]
        }
    ]
}

5. After the security group has been created, we must associate it with the instance:

aws ec2 modify-instance-attribute --instance-id s.i-8f7ab16867ffe23d4 --groups s.sg-8a420242d86dbbb89 --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

6. Optionally, we can delete the Security Group after it is no longer required:

aws ec2 delete-security-group --group-id s.sg-8a420242d86dbbb89 --endpoint http://MySnowIPAddress:8008 --profile SnowballEdge

Note that for the above association, the instance ID is an output of the “aws ec2 describe-instances” command, while the security group ID is an output of the “describe-security-groups” command (or the “GroupId” returned in Step 2 above).

Conclusion

This post addressed the most common commands used to create and manage security groups with the AWS Snowball Edge device. We explored the prerequisites, tools, and commands used to view, create, and modify security groups to ensure the EC2 instances deployed on AWS Snowball Edge are restricted to authorized users. We concluded with a simple walkthrough of how to restrict access to an EC2 instance over SSH from a single IP address. If you would like to learn more about the Snowball Edge product, there are several resources available on the AWS Snow Family site.

How Ontraport reduced data processing cost by 80% with AWS Glue

Post Syndicated from Elijah Ball original https://aws.amazon.com/blogs/big-data/how-ontraport-reduced-data-processing-cost-by-80-with-aws-glue/

This post is written in collaboration with Elijah Ball from Ontraport.

Customers are implementing data and analytics workloads in the AWS Cloud to optimize cost. When implementing data processing workloads in AWS, you have the option to use technologies like Amazon EMR or serverless technologies like AWS Glue. Both options minimize the undifferentiated heavy lifting activities like managing servers, performing upgrades, and deploying security patches and allow you to focus on what is important: meeting core business objectives. The difference between both approaches can play a critical role in enabling your organization to be more productive and innovative, while also saving money and resources.

Services like Amazon EMR focus on offering you flexibility to support data processing workloads at scale using frameworks you’re accustomed to. For example, with Amazon EMR, you can choose from multiple open-source data processing frameworks such as Apache Spark, Apache Hive, and Presto, and fine-tune workloads by customizing things such as cluster instance types on Amazon Elastic Compute Cloud (Amazon EC2) or use containerized environments running on Amazon Elastic Kubernetes Service (Amazon EKS). This option is best suited when migrating workloads from big data environments like Apache Hadoop or Spark, or when used by teams that are familiar with open-source frameworks supported on Amazon EMR.

Serverless services like AWS Glue minimize the need to think about servers and focus on offering additional productivity and DataOps tooling for accelerating data pipeline development. AWS Glue is a serverless data integration service that helps analytics users discover, prepare, move, and integrate data from multiple sources via a low-code or no-code approach. This option is best suited when organizations are resource-constrained and need to build data processing workloads at scale with limited expertise, allowing them to expedite development and reduce total cost of ownership (TCO).

In this post, we show how our AWS customer Ontraport evaluated the use of AWS Glue and Amazon EMR to reduce TCO, and how they reduced their storage cost by 92% and their processing cost by 80% with only one full-time developer.

Ontraport’s workload and solution

Ontraport is a CRM and automation service that powers businesses’ marketing, sales and operations all in one place—empowering businesses to grow faster and deliver more value to their customers.

Log processing and analysis is critical to Ontraport. It allows them to provide better services and insight to customers, such as email campaign optimization. For example, email logs alone record 3–4 events for every one of the 15–20 million messages Ontraport sends on behalf of their clients each day. Analysis of email transactions with providers such as Google and Microsoft allows Ontraport’s delivery team to optimize open rates for the campaigns of clients with big contact lists.

Some of the big log contributors are web server and CDN events, email transaction records, and custom event logs within Ontraport’s proprietary applications. The following is a sample breakdown of their daily log contributions:

  • Cloudflare request logs – 75 million records
  • CloudFront request logs – 2 million records
  • Nginx/Apache logs – 20 million records
  • Email logs – 50 million records
  • General server logs – 50 million records
  • Ontraport app logs – 6 million records

Ontraport’s solution uses Amazon Kinesis and Amazon Kinesis Data Firehose to ingest log data and write recent records into an Amazon OpenSearch Service database, from where analysts and administrators can analyze the last 3 months of data. Custom application logs record interactions with the Ontraport CRM so client accounts can be audited or recovered by the customer support team. Originally, all logs were retained back to 2018. Retention is multi-leveled by age:

  • Less than 1 week – OpenSearch hot storage
  • Between 1 week and 3 months – OpenSearch cold storage
  • More than 3 months – Extract, transform, and load (ETL) processed in Amazon Simple Storage Service (Amazon S3), available through Amazon Athena

The following diagram shows the architecture of their log processing and analytics data pipeline.

Evaluating the optimal solution

In order to optimize storage and analysis of their historical records in Amazon S3, Ontraport implemented an ETL process to transform and compress TSV and JSON files into Parquet files with partitioning by the hour. The compression and transformation helped Ontraport reduce their S3 storage costs by 92%.

In phase 1, Ontraport implemented an ETL workload with Amazon EMR. Given the scale of their data (hundreds of billions of rows) and only one developer, Ontraport’s first attempt at the Apache Spark application required a 16-node EMR cluster with r5.12xlarge core and task nodes. The configuration allowed the developer to process 1 year of data and minimize out-of-memory issues with a rough version of the Spark ETL application.

To help optimize the workload, Ontraport reached out to AWS for optimization recommendations. There were a considerable number of options to optimize the workload within Amazon EMR, such as right-sizing Amazon Elastic Compute Cloud (Amazon EC2) instance type based on workload profile, modifying Spark YARN memory configuration, and rewriting portions of the Spark code. Considering the resource constraints (only one full-time developer), the AWS team recommended exploring similar logic with AWS Glue Studio.

Some of the initial benefits with using AWS Glue for this workload include the following:

  • AWS Glue provides crawlers, which offer a no-code approach to cataloging data sources and identifying schemas from multiple data sources, in this case Amazon S3.
  • AWS Glue provides built-in data processing capabilities with abstract methods on top of Spark that reduce the overhead required to develop efficient data processing code. For example, AWS Glue supports a DynamicFrame class corresponding to a Spark DataFrame that provides additional flexibility when working with semi-structured datasets and can be quickly transformed into a Spark DataFrame. DynamicFrames can be generated directly from crawled tables or directly from files in Amazon S3. See the following example code:
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={'paths': ['s3://<bucket/paths>']},
        format='json')

  • It minimizes the need for Ontraport to right-size instance types and auto scaling configurations.
  • Using AWS Glue Studio interactive sessions allows Ontraport to quickly iterate on code changes needed as they detect historical log schema evolution.

Ontraport had to process 100 terabytes of log data. The cost of processing each terabyte with the initial configuration was approximately $500. That cost came down to approximately $100 per terabyte after using AWS Glue. By using AWS Glue and AWS Glue Studio, Ontraport’s cost of processing the jobs was reduced by 80%.

Diving deep into the AWS Glue workload

Ontraport’s first AWS Glue application was a PySpark workload that ingested data from TSV and JSON files in Amazon S3, performed basic transformations on timestamp fields, and converted the data types of a couple of fields. Finally, it wrote the output data into a curated S3 bucket as compressed Parquet files of approximately 1 GB in size, partitioned in 1-hour intervals to optimize for queries with Athena.
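The hour-interval partition layout can be illustrated with a small helper that maps an event timestamp to an S3 partition prefix. This is a hypothetical sketch for illustration only; the bucket path and partition naming are assumptions, not Ontraport's actual code:

```python
from datetime import datetime, timezone

def hourly_partition_prefix(epoch_seconds: int,
                            base: str = "s3://curated-logs/email") -> str:
    """Map an event timestamp to the hour-aligned S3 prefix it belongs to."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"

# An event at 2023-06-01 13:45:00 UTC lands in the hour=13 partition
print(hourly_partition_prefix(1685627100))
# → s3://curated-logs/email/year=2023/month=06/day=01/hour=13/
```

Writing roughly 1 GB Parquet files under prefixes like these lets Athena prune partitions by hour instead of scanning the whole dataset.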

With an AWS Glue job configured with 10 workers of the type G.2x configuration, Ontraport was able to process approximately 500 million records in less than 60 minutes. When processing 10 billion records, they were able to increase the job configuration to a maximum of 100 workers with auto scaling enabled to complete the job within 1 hour.

What’s next?

Ontraport has been able to process logs dating back to 2018. The team is updating the processing code to handle schema evolution scenarios (such as new fields) and has parameterized some components to fully automate the batch processing. They are also looking to fine-tune the number of provisioned AWS Glue workers to obtain optimal price-performance.

Conclusion

In this post, we showed you how Ontraport used AWS Glue to help reduce development overhead and simplify development efforts for their ETL workloads with only one full-time developer. Although services like Amazon EMR offer great flexibility and optimization, the ease of use and simplification in AWS Glue often offer a faster path for cost-optimization and innovation for small and medium businesses. For more information about AWS Glue, check out Getting Started with AWS Glue.


About the Authors

Elijah Ball has been a Sys Admin at Ontraport for 12 years. He is currently working to move Ontraport’s production workloads to AWS and develop data analysis strategies for Ontraport.

Pablo Redondo is a Principal Solutions Architect at Amazon Web Services. He is a data enthusiast with over 16 years of FinTech and healthcare industry experience and is a member of the AWS Analytics Technical Field Community (TFC). Pablo has been leading the AWS Gain Insights Program to help AWS customers achieve better insights and tangible business value from their data analytics initiatives.

Vikram Honmurgi is a Customer Solutions Manager at Amazon Web Services. With over 15 years of software delivery experience, Vikram is passionate about assisting customers and accelerating their cloud journey, delivering frictionless migrations, and ensuring our customers capture the full potential and sustainable business advantages of migrating to the AWS Cloud.

Estimating Scope 1 Carbon Footprint with Amazon Athena

Post Syndicated from Thomas Burns original https://aws.amazon.com/blogs/big-data/estimating-scope-1-carbon-footprint-with-amazon-athena/

Today, more than 400 organizations have signed The Climate Pledge, a commitment to reach net-zero carbon by 2040. Some of the drivers that lead to setting explicit climate goals include customer demand, current and anticipated government relations, employee demand, investor demand, and sustainability as a competitive advantage. AWS customers are increasingly interested in ways to drive sustainability actions. In this blog, we will walk through how we can apply existing enterprise data to better understand and estimate Scope 1 carbon footprint using Amazon Simple Storage Service (S3) and Amazon Athena, a serverless interactive analytics service that makes it easy to analyze data using standard SQL.

The Greenhouse Gas Protocol

The Greenhouse Gas Protocol (GHGP) provides standards for measuring and managing global warming impacts from an organization’s operations and value chain.

The greenhouse gases covered by the GHGP are the seven gases required by the UNFCCC/Kyoto Protocol (which is often called the “Kyoto Basket”). These gases are carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), the so-called F-gases (hydrofluorocarbons and perfluorocarbons), sulfur hexafluoride (SF6), and nitrogen trifluoride (NF3). Each greenhouse gas is characterized by its global warming potential (GWP), which is determined by the gas’s greenhouse effect and its lifetime in the atmosphere. Because carbon dioxide (CO2) accounts for about 76 percent of total man-made greenhouse gas emissions, the global warming potential of each greenhouse gas is measured relative to CO2, and is thus expressed as CO2-equivalent (CO2e).

The GHGP divides an organization’s emissions into three primary scopes:

  • Scope 1 – Direct greenhouse gas emissions (for example from burning fossil fuels)
  • Scope 2 – Indirect emissions from purchased energy (typically electricity)
  • Scope 3 – Indirect emissions from the value chain, including suppliers and customers

How do we estimate greenhouse gas emissions?

There are different methods for estimating GHG emissions, including the Continuous Emissions Monitoring System (CEMS) Method, the Spend-Based Method, and the Consumption-Based Method.

Direct Measurement – CEMS Method

An organization can estimate its carbon footprint from stationary combustion sources by performing a direct measurement of carbon emissions using the CEMS method. This method requires continuously measuring the pollutants emitted in exhaust gases from each emissions source using equipment such as gas analyzers, gas samplers, gas conditioning equipment (to remove particulate matter, water vapor and other contaminants), plumbing, actuated valves, Programmable Logic Controllers (PLCs) and other controlling software and hardware. Although this approach may yield useful results, CEMS requires specific sensing equipment for each greenhouse gas to be measured, requires supporting hardware and software, and is typically more suitable for Environment Health and Safety applications of centralized emission sources. More information on CEMS is available here.

Spend-Based Method

Because the financial accounting function is mature and often already audited, many organizations choose to use financial controls as a foundation for their carbon footprint accounting. The Economic Input-Output Life Cycle Assessment (EIO LCA) method is a spend-based method that combines expenditure data with monetary-based emission factors to estimate the emissions produced. The emission factors are published by the U.S. Environmental Protection Agency (EPA) and other peer-reviewed academic and government sources. With this method, you can multiply the amount of money spent on a business activity by the emission factor to produce the estimated carbon footprint of the activity.

For example, you can convert the amount your company spends on truck transport to estimated kilograms (KG) of carbon dioxide equivalent (CO₂e) emitted as shown below.

Estimated Carbon Footprint = Amount of money spent on truck transport * Emission Factor [1]

Although these computations are very easy to make from general ledgers or other financial records, they are most valuable for initial estimates or for reporting minor sources of greenhouse gases. As the only user-provided input is the amount spent on an activity, EIO LCA methods aren’t useful for modeling improved efficiency. This is because the only way to reduce EIO-calculated emissions is to reduce spending. Therefore, as a company continues to improve its carbon footprint efficiency, other methods of estimating carbon footprint are often more desirable.
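The spend-based calculation itself is a single multiplication. As a minimal sketch in Python, using the truck-transport figures from reference [1] at the end of this post:

```python
def spend_based_footprint(spend_usd: float, factor_kg_per_usd: float) -> float:
    """Spend-based estimate: money spent * monetary emission factor (kg CO2e)."""
    return spend_usd * factor_kg_per_usd

# $100,000 of truck transport at 1.556 kg CO2e per dollar ≈ 155,600 kg CO2e
print(spend_based_footprint(100_000, 1.556))
```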

Consumption-Based Method

From either Enterprise Resource Planning (ERP) systems or electronic copies of fuel bills, it’s straightforward to determine the amount of fuel an organization procures during a reporting period. Fuel-based emission factors are available from a variety of sources such as the US Environmental Protection Agency and commercially-licensed databases. Multiplying the amount of fuel procured by the emission factor yields an estimate of the CO2e emitted through combustion. This method is often used for estimating the carbon footprint of stationary emissions (for instance backup generators for data centers or fossil fuel ovens for industrial processes).

If for a particular month an enterprise consumed a known amount of motor gasoline for stationary combustion, the Scope 1 CO2e footprint of the stationary gasoline combustion can be estimated in the following manner:

Estimated Carbon Footprint = Amount of Fuel Consumed * Stationary Combustion Emission Factor[2]

Organizations may estimate their carbon emissions by using existing data found in fuel and electricity bills, ERP data, and relevant emission factors, which are then consolidated into a data lake. Using existing analytics tools such as Amazon Athena and Amazon QuickSight, an organization can gain insight into its estimated carbon footprint.

The data architecture diagram below shows an example of how you could use AWS services to calculate and visualize an organization’s estimated carbon footprint.

Analytics Architecture

Customers have the flexibility to choose the services in each stage of the data pipeline based on their use case. For example, in the data ingestion phase, depending on the existing data requirements, there are many options to ingest data into the data lake such as using the AWS Command Line Interface (CLI), AWS DataSync, or AWS Database Migration Service.

Example of calculating a Scope 1 stationary emissions footprint with AWS services

Let’s assume you burned 100 standard cubic feet (scf) of natural gas in an oven. Using the US EPA emission factors for stationary emissions we can estimate the carbon footprint associated with the burning. In this case the emission factor is 0.05449555 Kg CO2e /scf.[3]
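The per-scf factor above is itself a combination of three per-gas factors and their GWPs, as listed in reference [3] at the end of this post. A short Python check makes the arithmetic explicit (note that the CH4 and N2O factors are given in grams, so they are divided by 1,000):

```python
GWP_CH4, GWP_N2O = 25, 298  # 100-year global warming potentials

def co2e_factor(co2_kg: float, ch4_kg: float, n2o_kg: float) -> float:
    """Combine per-gas emission factors into one kg-CO2e factor."""
    return co2_kg + GWP_CH4 * ch4_kg + GWP_N2O * n2o_kg

factor = co2e_factor(0.05444, 0.00103 / 1000, 0.0001 / 1000)
print(factor)        # ≈ 0.05449555 kg CO2e per scf of natural gas
print(100 * factor)  # footprint of burning 100 scf, ≈ 5.45 kg CO2e
```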

Amazon S3 is ideal for building a data lake on AWS to store disparate data sources in a single repository, due to its virtually unlimited scalability and high durability. Athena, a serverless interactive query service, allows the analysis of data directly from Amazon S3 using standard SQL without having to load the data into Athena or run complex extract, transform, and load (ETL) processes. Amazon QuickSight supports creating visualizations of different data sources, including Amazon S3 and Athena, and the flexibility to use custom SQL to extract a subset of the data. QuickSight dashboards can provide you with insights (such as your company’s estimated carbon footprint) quickly, and also provide the ability to generate standardized reports for your business and sustainability users.

In this example, the sample data is stored in a file system and uploaded to Amazon S3 using the AWS Command Line Interface (CLI) as shown in the following architecture diagram. AWS recommends creating AWS resources and managing CLI access in accordance with the Best Practices for Security, Identity, & Compliance guidance.

The AWS CLI command below demonstrates how to upload the sample data folders into the S3 target location.

aws s3 cp /path/to/local/folder s3://bucket-name/path/to/destination --recursive

The snapshot of the S3 console shows two newly added folders that contain the files.

S3 Bucket Overview of Files

To create new table schemas, we start by running the following script for the gas utilization table in the Athena query editor using Hive DDL. The script defines the data format, column details, table properties, and the location of the data in S3.

CREATE EXTERNAL TABLE `gasutilization`(
`fuel_id` int,
`month` string,
`year` int,
`usage_therms` float,
`usage_scf` float,
`g-nr1_schedule_charge` float,
`accountfee` float,
`gas_ppps` float,
`netcharge` float,
`taxpercentage` float,
`totalcharge` float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<bucketname>/Scope 1 Sample Data/gasutilization'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')

Athena Hive DDL

The script below shows another example of using Hive DDL to generate the table schema for the gas emission factor data.

CREATE EXTERNAL TABLE `gas_emission_factor`(
`fuel_id` int,
`gas_name` string,
`emission_factor` float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<bucketname>/Scope 1 Sample Data/gas_emission_factor'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')

After creating the table schema in Athena, we run the below query against the gas utilization table that includes details of gas bills to show the gas utilization and the associated charges, such as gas public purpose program surcharge (PPPS) and total charges after taxes for the year of 2020:

SELECT * FROM "gasutilization" where year = 2020;

Athena gas utilization overview by month

We are also able to analyze the emission factor data showing the different fuel types and their corresponding CO2e emission as shown in the screenshot.

athena co2e emission factor

With the emission factor and the gas utilization data, we can run the following query to get an estimated Scope 1 carbon footprint alongside other details. In this query, we join the gas utilization table and the gas emission factor table on fuel id and multiply the gas usage in standard cubic feet (scf) by the emission factor to get the estimated CO2e impact. We also select the month, year, total charge, and gas usage measured in therms and scf, as these are often attributes that are of interest for customers.

SELECT "gasutilization"."usage_scf" * "gas_emission_factor"."emission_factor" 
AS "estimated_CO2e_impact", 
"gasutilization"."month", 
"gasutilization"."year", 
"gasutilization"."totalcharge", 
"gasutilization"."usage_therms", 
"gasutilization"."usage_scf" 
FROM "gasutilization" 
JOIN "gas_emission_factor" 
on "gasutilization"."fuel_id"="gas_emission_factor"."fuel_id";

athena join

Lastly, Amazon QuickSight allows visualization of different data sources, including Amazon S3 and Athena, and the flexibility to use custom SQL to get a subset of the data. The following is an example of a QuickSight dashboard showing the gas utilization, gas charges, and estimated carbon footprint across different years.

QuickSight sample dashboard

We have just estimated the Scope 1 carbon footprint for one source of stationary combustion. If we were to do the same process for all sources of stationary and mobile emissions (with different emissions factors) and add the results together, we could roll up an accurate estimate of our Scope 1 carbon emissions for the entire business by only utilizing native AWS services and our own data. A similar process will yield an estimate of Scope 2 emissions, with grid carbon intensity in the place of Scope 1 emission factors.
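Rolling the per-source results into a company-wide Scope 1 total is then just a sum. The sketch below uses the two worked examples from this post (100 scf of natural gas and 1,000 gallons of gasoline); the source names are illustrative:

```python
# Per-source Scope 1 estimates in kg CO2e, from the worked examples
scope1_sources = {
    "natural_gas_ovens": 5.45,       # 100 scf * 0.05449555 kg CO2e/scf
    "gasoline_generators": 8_810.0,  # 1,000 gal * 8.81 kg CO2e/gal
}

total_kg_co2e = sum(scope1_sources.values())
print(round(total_kg_co2e, 2))  # → 8815.45
```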

Summary

This blog discusses how organizations can use existing data in disparate sources to build a data architecture to gain better visibility into Scope 1 greenhouse gas emissions. With Athena, S3, and QuickSight, organizations can now estimate their stationary emissions carbon footprint in a repeatable way by applying the consumption-based method to convert fuel utilization into an estimated carbon footprint.

Other approaches available on AWS include Carbon Accounting on AWS, Sustainability Insights Framework, Carbon Data Lake on AWS, and general guidance detailed at the AWS Carbon Accounting Page.

If you are interested in information on estimating your organization’s carbon footprint with AWS, please reach out to your AWS account team and check out AWS Sustainability Solutions.

References

  1. An example from page four of Amazon’s Carbon Methodology document illustrates this concept.
    Amount spent on truck transport: $100,000
    EPA Emission Factor: 1.556 KG CO2e /dollar of truck transport
    Estimated CO₂e emission: $100,000 * 1.556 KG CO₂e/dollar of truck transport = 155,600 KG of CO2e
  2. For example,
    Gasoline consumed: 1,000 US Gallons
    EPA Emission Factor: 8.81 Kg of CO2e /gallon of gasoline combusted
    Estimated CO2e emission = 1,000 US Gallons * 8.81 Kg of CO2e per gallon of gasoline consumed= 8,810 Kg of CO2e.
    EPA Emissions Factor for stationary emissions of motor gasoline is 8.78 kg CO2 plus .38 grams of CH4, plus .08 g of N2O.
    Combining these emission factors using 100-year global warming potential for each gas (CH4:25 and N2O:298) gives us Combined Emission Factor = 8.78 kg + 25*.00038 kg + 298 *.00008 kg = 8.81 kg of CO2e per gallon.
  3. The Emission factor per scf is 0.05444 kg of CO2 plus 0.00103 g of CH4 plus 0.0001 g of N2O. To get this in terms of CO2e we need to multiply the emission factor of the other two gases by their global warming potentials (GWP). The 100-year GWP for CH4  and N2O are 25 and 298 respectively. Emission factors and GWPs come from the US EPA website.


About the Authors


Thomas Burns, SCR, CISSP, is a Principal Sustainability Strategist and Principal Solutions Architect at Amazon Web Services. Thomas supports manufacturing and industrial customers worldwide. Thomas’s focus is using the cloud to help companies reduce their environmental impact both inside and outside of IT.

Aileen Zheng is a Solutions Architect supporting US Federal Civilian Sciences customers at Amazon Web Services (AWS). She partners with customers to provide technical guidance on enterprise cloud adoption and strategy and helps with building well-architected solutions. She is also very passionate about data analytics and machine learning. In her free time, you’ll find Aileen doing pilates, taking her dog Mumu out for a hike, or hunting down another good spot for food! You’ll also see her contributing to projects to support diversity and women in technology.

How FIS ingests and searches vector data for quick ticket resolution with Amazon OpenSearch Service

Post Syndicated from Rupesh Tiwari original https://aws.amazon.com/blogs/big-data/how-fis-ingests-and-searches-vector-data-for-quick-ticket-resolution-with-amazon-opensearch-service/

This post was co-written by Sheel Saket, Senior Data Science Manager at FIS, and Rupesh Tiwari, Senior Architect at Amazon Web Services.

Do you ever find yourself grappling with multiple defect logging mechanisms, scattered project management tools, and fragmented software development platforms? Have you experienced the frustration of lacking a unified view, hindering your ability to efficiently manage and identify common trending issues within your enterprise? Are you constantly facing challenges when it comes to addressing defects and their impact, causing disruptions in your production cycles?

If these questions resonate with you, then you’re not alone. FIS, a leading technology and services provider, has encountered these very challenges. In their quest for a solution, they teamed up with AWS to tackle these obstacles head-on. In this post, we take you on a journey through their collaborative project, exploring how they used Amazon OpenSearch Service to transform their operations, enhance efficiency, and gain valuable insights.

This post shares FIS’s journey in overcoming challenges and provides step-by-step instructions for provisioning the solution architecture in your AWS account. You’ll learn how to implement a transformative solution that empowers your organization with near-real-time data indexing and visualization capabilities.

In the following sections, we dive into the details of FIS’s journey and discover how they overcame these challenges, revolutionizing their approach to defect management and software development.

Challenges for near-real-time ticket visualization and search

FIS faced several challenges in achieving near-real-time ticket visualization and search capabilities, including the following:

  • Integrating ticket data from tens of different third-party systems
  • Overcoming API call thresholds and limitations from various systems
  • Implementing an efficient KNN vector search algorithm for resolving issues and performing trend analysis
  • Establishing a robust data ingestion and indexing process for real-time updates from 15,000 tickets per day
  • Ensuring unified access to ticket information across 20 development teams
  • Providing secure and scalable access to ticket data for up to 250 teams

Despite these challenges, FIS successfully enhanced their operational efficiency, enabled quick ticket resolution, and gained valuable insights through the integration of OpenSearch Service.

Let’s delve into the technical walkthrough of the architecture diagram and mechanisms. The following section provides step-by-step instructions for provisioning and implementing the solution on your AWS Management Console, along with a helpful video tutorial.

Solution overview

The architecture diagram of FIS’s near-real-time data indexing and visualization solution incorporates various AWS services for specific functions. The solution uses GitHub as the data source, employs Amazon Simple Storage Service (Amazon S3) for scalable storage, manages APIs with Amazon API Gateway, performs serverless computing using AWS Lambda, and facilitates data streaming and ETL (extract, transform, and load) processes through Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose. OpenSearch Service is employed for analytics and application monitoring. This architecture ensures a robust and scalable solution, enabling FIS to efficiently index and visualize data in near-real time. With these AWS services, FIS effectively manages their data pipeline and gains valuable insights for their business processes.

The following diagram illustrates the solution architecture.

Architecture Diagram

The workflow includes the following steps:

  1. GitHub webhook events stream data to both Amazon S3 and OpenSearch Service, facilitating real-time data analysis.
  2. A Lambda function connects to an API Gateway REST API, processing and structuring the received payloads.
  3. The Lambda function adds the structured data to a Kinesis data stream, enabling immediate data streaming and quick ticket insights.
  4. Kinesis Data Firehose streams the records from the Kinesis data stream to an S3 bucket, simultaneously creating an index in OpenSearch Service.
  5. OpenSearch Service uses the indexed data to provide near-real-time visualization and enable efficient ticket analysis through K-Nearest Neighbor (KNN) search, enhancing productivity and optimizing data operations.
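Steps 2 and 3 might look roughly like the following Lambda handler sketch, which structures the webhook payload and puts it on the Kinesis data stream. The field selection is illustrative, and the boto3 Kinesis client is passed in as an argument here so the shaping logic can be exercised without AWS access (a real Lambda handler would create the client at module scope):

```python
import json

STREAM_NAME = "git-data-stream"  # the stream created in the walkthrough

def structure_event(payload: dict) -> dict:
    """Keep only the fields needed downstream (illustrative selection)."""
    return {
        "repo": payload.get("repository", {}).get("full_name"),
        "event": payload.get("action"),
        "sender": payload.get("sender", {}).get("login"),
    }

def handler(event, context, kinesis_client):
    """Structure the API Gateway payload and put it on the Kinesis stream."""
    record = structure_event(json.loads(event["body"]))
    kinesis_client.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["repo"] or "unknown",
    )
    return {"statusCode": 200, "body": json.dumps(record)}
```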

The following sections provide step-by-step instructions for setting up the solution. Additionally, we have created a video guide that demonstrates each step in detail. You are welcome to watch the video and follow along with this post if you prefer.

Prerequisites

You should have the following prerequisites:

Implement the solution

Complete the following steps to implement the solution:

  1. Create an OpenSearch Service domain.
  2. Create an S3 bucket named git-data.
  3. Create a Kinesis data stream named git-data-stream.
  4. Create a Firehose delivery stream named git-data-delivery-stream with git-data-stream as the source and git-data as the destination, and a buffer interval of 60 seconds.
  5. Create a Lambda function named git-webhook-handler with a timeout of 5 minutes. Add code to add data to the Kinesis data stream.
  6. Grant the Lambda function’s execution role permission to put_record on the Kinesis data stream.
  7. Create a REST API in API Gateway named git-webhook-handler-api. Create a resource named git-data with a POST method, integrate it with the Lambda function git-webhook-handler created in the previous step, and deploy the REST API.
  8. Create a delivery stream with the Kinesis data stream as the source and OpenSearch Service as the destination. Provide the AWS Identity and Access Management (IAM) role for Kinesis Data Firehose with the necessary permissions to create an index in OpenSearch Service. Finally, add the IAM role as a backend service in OpenSearch Service.
  9. Navigate to your GitHub repository and create a webhook to enable seamless integration with the solution. Copy the REST API URL and enter it in the newly created webhook.

Test the solution

To test the solution, complete the following steps:

  1. Go to your GitHub repository, choose the Star button, and verify that you receive a response with a status code of 200.
  2. Also, check for the ShardId and SequenceNumber in the recent deliveries to confirm successful event addition to the Kinesis data stream.

Kinesis data stream

  3. On the Kinesis console, use the Data Viewer to confirm the arrival of data records.

kinesis record data

  4. Navigate to OpenSearch Dashboards and choose the dev tool.
  5. Search for the records and observe that all the Git events are displayed in the result pane.

opensearch devtool

  6. On the Amazon S3 console, open the bucket and view the data records.

s3 bucket records

Security

We adhere to IAM best practices to uphold security:

  1. Craft a Lambda execution role for read/write operations on the Kinesis data stream.
  2. Generate an IAM role for Kinesis Data Firehose to manage Amazon S3 and OpenSearch Service access.
  3. Link this IAM role in OpenSearch Service security to confer backend user privileges.

Clean up

To avoid incurring future charges, delete all the resources you created.

Benefits of near-real-time ticket visualization and search

During our demonstration, we showcased the utilization of GitHub as the streaming data source. However, it’s important to note that the solution we presented has the flexibility to scale and incorporate multiple data sources from various services. This allows for the consolidation and visualization of diverse data in near-real time, using the capabilities of OpenSearch Service.

With the implementation of the solution described in this post, FIS effectively overcame all the challenges they faced.

In this section, we delve into the details of the challenges and benefits they achieved:

  • Integrating ticket data from multiple third-party systems – Near-real-time data streaming ensures an up-to-date information flow from third-party providers for timely insights
  • Overcoming API call thresholds and limitations imposed by different systems – Unrestricted data flow with no threshold or rate limiting enables seamless integration and continuous updates
  • Accommodating scalability requirements for up to 250 teams – The asynchronous, serverless architecture effortlessly scales more than 250 times larger without infrastructure modifications
  • Efficiently resolving tickets and performing trend analysis – OpenSearch Service semantic KNN search identifies duplicates and defects, and optimizes operations for improved efficiency
  • Gaining valuable insights for business processes – Artificial intelligence (AI) and machine learning (ML) analytics use the data stored in the S3 bucket, empowering deeper insights and informed decision-making
  • Ensuring secure access to ticket data and regulatory compliance – Secure data access and compliance with data protection regulations ensure data privacy and regulatory compliance
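The semantic KNN search called out above boils down to an OpenSearch k-NN query against a `knn_vector` field. The sketch below builds such a query body as a plain dict; the field name `ticket_embedding` and the vector values are assumptions for illustration, not FIS's actual schema:

```python
import json

def knn_query(vector: list, k: int = 5, field: str = "ticket_embedding") -> dict:
    """Build an OpenSearch k-NN query body for the k most similar tickets."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": vector, "k": k}}},
    }

# Find the 3 tickets whose embeddings are closest to a new ticket's embedding
body = knn_query([0.12, -0.45, 0.33], k=3)
print(json.dumps(body))
```

In practice, the vector would come from an embedding model applied to the new ticket's text, and the body would be sent to the index's `_search` endpoint.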

Conclusion

FIS, in collaboration with AWS, successfully addressed several challenges to achieve near-real-time ticket visualization and search capabilities. With OpenSearch Service, FIS enhanced operational efficiency by efficiently resolving tickets and performing trend analysis. With their data ingestion and indexing process, FIS processed 15,000 tickets per day in real time. The solution provided secure and scalable access to ticket data for more than 250 teams, enabling unified collaboration. FIS experienced a remarkable 30% reduction in ticket resolution time, empowering teams to quickly address issues.

As Sheel Saket, Senior Data Science Manager at FIS, states, “Our near-real-time solution transformed how we identify and resolve tickets, improving our overall productivity.”

Furthermore, organizations can further improve the solution by adopting Amazon OpenSearch Ingestion for data ingestion, which offers cost savings and out-of-the-box data processing capabilities. By embracing this transformative solution, organizations can optimize their ticket management, drive productivity, and deliver exceptional experiences to customers.

Want to know more? You can reach out to FIS through their official FIS contact page, follow FIS on Twitter, and visit the FIS LinkedIn page.


About the Author

Rupesh Tiwari is a Senior Solutions Architect at AWS in New York City, with a focus on Financial Services. He has over 18 years of IT experience in the finance, insurance, and education domains, and specializes in architecting large-scale applications and cloud-native big data workloads. In his spare time, Rupesh enjoys singing karaoke, watching comedy TV series, and creating joyful moments with his family.

Sheel Saket is a Senior Data Science Manager at FIS in Chicago, Illinois. He has over 11 years of IT experience in the finance, insurance, and e-commerce domains, and specializes in architecting large-scale AI solutions and cloud MLOps. In his spare time, Sheel enjoys listening to audiobooks, podcasts, and watching movies with his family.

Alcion supports their multi-tenant platform with Amazon OpenSearch Serverless

Post Syndicated from Zack Rossman original https://aws.amazon.com/blogs/big-data/alcion-supports-their-multi-tenant-platform-with-amazon-opensearch-serverless/

This is a guest blog post co-written with Zack Rossman from Alcion.

Alcion, a security-first, AI-driven backup-as-a-service (BaaS) platform, helps Microsoft 365 administrators quickly and intuitively protect data from cyber threats and accidental data loss. In the event of data loss, Alcion customers need to search metadata for the backed-up items (files, emails, contacts, events, and so on) to select specific item versions for restore. Alcion uses Amazon OpenSearch Service to provide their customers with accurate, efficient, and reliable search capability across this backup catalog. The platform is multi-tenant, which means that Alcion requires data isolation and strong security to ensure that tenants can only search their own data.

OpenSearch Service is a fully managed service that makes it easy to deploy, scale, and operate OpenSearch in the AWS Cloud. OpenSearch is an Apache-2.0-licensed, open-source search and analytics suite, comprising OpenSearch (a search and analytics engine and vector database), OpenSearch Dashboards (a visualization and utility user interface), and plugins that provide advanced capabilities like enterprise-grade security, anomaly detection, observability, alerting, and much more. Amazon OpenSearch Serverless is a serverless deployment option that makes it simple to use OpenSearch without configuring, managing, and scaling OpenSearch Service domains.

In this post, we share how adopting OpenSearch Serverless enabled Alcion to meet their scale requirements, reduce their operational overhead, and secure their tenants’ data by enforcing tenant isolation within their multi-tenant environment.

OpenSearch Service managed domains

For the first iteration of their search architecture, Alcion chose the managed domains deployment option in OpenSearch Service and was able to launch their search functionality in production in less than a month. To meet their security, scale, and tenancy requirements, they stored data for each tenant in a dedicated index and used fine-grained access control in OpenSearch Service to prevent cross-tenant data leaks. As their workload evolved, Alcion engineers tracked OpenSearch domain utilization via the provided Amazon CloudWatch metrics, making changes to increase storage and optimize their compute resources.

The team at Alcion used several features of OpenSearch Service managed domains to improve their operational stance. They introduced index aliases, which provide a single alias name to access (read and write) multiple underlying indexes. They also configured Index State Management (ISM) policies to help them control their data lifecycle by rolling indexes over based on index size. Together, the ISM policies and index aliases were necessary to scale indexes for large tenants. Alcion also used index templates to define the shards per index (partitioning) of their data, automating their data lifecycle and improving the performance and stability of their domains.
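
A rollover-by-size ISM policy of the kind described can be sketched as follows. The 50 GB threshold and the `index-v1-*` pattern are assumptions for illustration, not Alcion's actual values:

```python
import json

# Hypothetical alias pattern and size threshold; tune both to the workload.
ism_policy = {
    "policy": {
        "description": "Roll indexes over once they grow past 50 GB",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                # The rollover action creates a new index behind the alias
                # once the current write index exceeds min_size.
                "actions": [{"rollover": {"min_size": "50gb"}}],
                "transitions": [],
            }
        ],
        # Attach the policy automatically to matching indexes.
        "ism_template": [{"index_patterns": ["index-v1-*"], "priority": 100}],
    }
}

# The policy would be installed with:
#   PUT _plugins/_ism/policies/rollover-by-size
print(json.dumps(ism_policy, indent=2))
```

Paired with an index template that sets the shard count, new rollover indexes inherit consistent partitioning without manual intervention.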

The following architecture diagram shows how Alcion configured their OpenSearch managed domains.

The following diagram shows how Microsoft 365 data was indexed to and queried from tenant-specific indexes. Alcion implemented request authentication by providing the OpenSearch primary user credentials with each API request.

OpenSearch Serverless overview and tenancy model options

OpenSearch Service managed domains provided a stable foundation for Alcion’s search functionality, but the team needed to manually provision resources to the domains for their peak workload. This left room for cost optimizations because Alcion’s workload is bursty—there are large variations in the number of search and indexing transactions per second, both for a single customer and taken as a whole. To reduce costs and operational burden, the team turned to OpenSearch Serverless, which offers auto-scaling capability.

To use OpenSearch Serverless, the first step is to create a collection. A collection is a group of OpenSearch indexes that work together to support a specific workload or use case. The compute resources for a collection, called OpenSearch Compute Units (OCUs), are shared across all collections in an account that share an encryption key. The pool of OCUs is automatically scaled up and down to meet the demands of indexing and search traffic.

The level of effort required to migrate from an OpenSearch Service managed domain to OpenSearch Serverless was manageable because OpenSearch Serverless collections support the same OpenSearch APIs and libraries as OpenSearch Service managed domains. This allowed Alcion to focus on optimizing the tenancy model for the new search architecture. Specifically, the team needed to decide how to partition tenant data within collections and indexes while balancing security and total cost of ownership. Alcion engineers, in collaboration with the OpenSearch Serverless team, considered three tenancy models:

  • Silo model: Create a collection for each tenant
  • Pool model: Create a single collection and use a single index for multiple tenants
  • Bridge model: Create a single collection and use a single index per tenant

All three design choices had benefits and trade-offs that had to be considered for designing the final solution.

Silo model: Create a collection for each tenant

In this model, Alcion would create a new collection whenever a new customer onboarded to their platform. Although tenant data would be cleanly separated between collections, this option was disqualified because the collection creation time meant that customers wouldn’t be able to back up and search data immediately after registration.

Pool model: Create a single collection and use a single index for multiple tenants

In this model, Alcion would create a single collection per AWS account and index tenant-specific data in one of many shared indexes belonging to that collection. Initially, pooling tenant data into shared indexes was attractive from a scale perspective because this led to the most efficient use of index resources. But after further analysis, Alcion found that they would be well within the per-collection index quota even if they allocated one index for each tenant. With that scalability concern resolved, Alcion pursued the third option because siloing tenant data into dedicated indexes results in stronger tenant isolation than the shared index model.

Bridge model: Create a single collection and use a single index per tenant

In this model, Alcion would create a single collection per AWS account and create an index for each of the hundreds of tenants managed by that account. By assigning each tenant to a dedicated index and pooling these indexes in a single collection, Alcion reduced onboarding time for new tenants and siloed tenant data into cleanly separated buckets.

Implementing role-based access control for supporting multi-tenancy

OpenSearch Serverless offers a multi-point, inheritable set of security controls, covering data access, network access, and encryption. Alcion took full advantage of OpenSearch Serverless data access policies to implement role-based access control (RBAC) for each tenant-specific index with the following details:

  • Allocate an index with a common prefix and the tenant ID (for example, index-v1-<tenantID>)
  • Create a dedicated AWS Identity and Access Management (IAM) role that is used to sign requests to the OpenSearch Serverless collection
  • Create an OpenSearch Serverless data access policy that grants document read/write permissions within a dedicated tenant index to the IAM role for that tenant
  • OpenSearch API requests to a tenant index are signed with temporary credentials belonging to the tenant-specific IAM role

The following is an example OpenSearch Serverless data access policy for a mock tenant with ID t-eca0acc1-12345678910. This policy grants the IAM role document read/write access to the dedicated tenant index.

[
    {
        "Rules": [
            {
                "Resource": [
                    "index/collection-searchable-entities/index-v1-t-eca0acc1-12345678910"
                ],
                "Permission": [
                    "aoss:ReadDocument",
                    "aoss:WriteDocument"
                ],
                "ResourceType": "index"
            }
        ],
        "Principal": [
            "arn:aws:iam::12345678910:role/OpenSearchAccess-t-eca0acc1-1b9f-4b3f-95d6-12345678910"
        ],
        "Description": "Allow document read/write access to OpenSearch index belonging to tenant t-eca0acc1-1b9f-4b3f-95d6-12345678910"
    }
] 
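
At onboarding time, a policy like the one above can be generated per tenant. The following is a sketch that reuses the mock collection and role naming from this post; the helper function itself is hypothetical, not Alcion's code:

```python
import json

def build_tenant_access_policy(account_id, tenant_id,
                               collection="collection-searchable-entities",
                               index_prefix="index-v1-"):
    """Build an OpenSearch Serverless data access policy granting one
    tenant's IAM role read/write access to that tenant's index only."""
    return [
        {
            "Rules": [
                {
                    "Resource": [f"index/{collection}/{index_prefix}{tenant_id}"],
                    "Permission": ["aoss:ReadDocument", "aoss:WriteDocument"],
                    "ResourceType": "index",
                }
            ],
            "Principal": [
                f"arn:aws:iam::{account_id}:role/OpenSearchAccess-{tenant_id}"
            ],
            "Description": f"Allow document read/write access for tenant {tenant_id}",
        }
    ]

policy = build_tenant_access_policy("12345678910", "t-eca0acc1-12345678910")
print(json.dumps(policy, indent=2))
```

Because the index name and role name are both derived from the tenant ID, the policy can never accidentally grant one tenant's role access to another tenant's index.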

The following architecture diagram depicts how Alcion implemented indexing and searching for Microsoft 365 resources using the OpenSearch Serverless shared collection approach.

The following is the sample code snippet for sending an API request to an OpenSearch Serverless collection. Notice how the API client is initialized with a signer object that signs requests with the same IAM principal that is linked to the OpenSearch Serverless data access policy from the previous code snippet.

package alcion

import (
	"context"
	"encoding/json"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/sts"
	"github.com/opensearch-project/opensearch-go/v2"
	"github.com/opensearch-project/opensearch-go/v2/opensearchapi"
	"github.com/opensearch-project/opensearch-go/v2/signer"
	awssignerv2 "github.com/opensearch-project/opensearch-go/v2/signer/awsv2"
	"github.com/pkg/errors"
)

const (
	// Scope the API request to the AWS OpenSearch Serverless service
	aossService = "aoss"

	// Mock values
	indexPrefix        = "index-v1-"
	collectionEndpoint = "https://kfbr3928z4y6vot2mbpb.us-east-1.aoss.amazonaws.com"
	tenantID           = "t-eca0acc1-1b9f-4b3f-95d6-b0b96b8c03d0"
	roleARN            = "arn:aws:iam::1234567890:role/OpenSearchAccess-t-eca0acc1-1b9f-4b3f-95d6-b0b96b8c03d0"
)

func SearchTenantIndex(ctx context.Context, tenantID string) (*opensearchapi.Response, error) {

	sig, err := createRequestSigner(ctx)
	if err != nil {
		return nil, errors.Wrapf(err, "error creating new signer for AWS OSS")
	}

	cfg := opensearch.Config{
		Addresses: []string{collectionEndpoint},
		Signer:    sig,
	}

	aossClient, err := opensearch.NewClient(cfg)
	if err != nil {
		return nil, errors.Wrapf(err, "error creating new OpenSearch API client")
	}

	body, err := getSearchBody()
	if err != nil {
		return nil, errors.Wrapf(err, "error getting search body")
	}

	req := opensearchapi.SearchRequest{
		Index: []string{indexPrefix + tenantID},
		Body:  body,
	}

	return req.Do(ctx, aossClient)
}

func createRequestSigner(ctx context.Context) (signer.Signer, error) {

	awsCfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return nil, errors.Wrapf(err, "error loading default config")
	}

	stsClient := sts.NewFromConfig(awsCfg)
	provider := stscreds.NewAssumeRoleProvider(stsClient, roleARN)

	awsCfg.Credentials = aws.NewCredentialsCache(provider)
	return awssignerv2.NewSignerWithService(awsCfg, aossService)
}

func getSearchBody() (*strings.Reader, error) {
	// Match all documents, page size = 10
	query := map[string]interface{}{
		"size": 10,
	}

	queryJson, err := json.Marshal(query)
	if err != nil {
		return nil, err
	}

	return strings.NewReader(string(queryJson)), nil
} 

Conclusion

In May of 2023, Alcion rolled out its search architecture based on the shared collection and dedicated index-per-tenant model in all production and pre-production environments. The team was able to tear out complex code and operational processes that had been dedicated to scaling OpenSearch Service managed domains. Furthermore, thanks to the auto scaling capabilities of OpenSearch Serverless, Alcion has reduced their OpenSearch costs by 30% and expects the cost profile to scale favorably.

In their journey from managed to serverless OpenSearch Service, Alcion benefited from their initial choice of OpenSearch Service managed domains. When migrating, they were able to reuse the same OpenSearch APIs and libraries with their OpenSearch Serverless collections that they had used with their OpenSearch Service managed domain. Additionally, they updated their tenancy model to take advantage of OpenSearch Serverless data access policies. With OpenSearch Serverless, they were able to effortlessly adapt to their customers’ scale needs while ensuring tenant isolation.

For more information about Alcion, visit their website.


About the Authors

Zack Rossman is a Member of Technical Staff at Alcion. He is the tech lead for the search and AI platforms. Prior to Alcion, Zack was a Senior Software Engineer at Okta, developing core workforce identity and access management products for the Directories team.

Niraj Jetly is a Software Development Manager for Amazon OpenSearch Serverless. Niraj leads several data plane teams responsible for launching Amazon OpenSearch Serverless. Prior to AWS, Niraj led several product and engineering teams as CTO, VP of Engineering, and Head of Product Management for over 15 years. Niraj is a recipient of over 15 innovation awards, including being named CIO of the year in 2014 and top 100 CIO in 2013 and 2016. A frequent speaker at several conferences, he has been quoted in NPR, WSJ, and The Boston Globe.

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Enable data analytics with Talend and Amazon Redshift Serverless

Post Syndicated from Tamara Astakhova original https://aws.amazon.com/blogs/big-data/enable-data-analytics-with-talend-and-amazon-redshift-serverless/

This is a guest post co-written with Cameron Davie from Talend.

Today, in order to accelerate and scale data analytics, companies are looking for an approach to minimize infrastructure management and predict computing needs for different types of workloads, including spikes and ad hoc analytics.

The integration of Talend Cloud and Talend Stitch with Amazon Redshift Serverless can help you achieve successful business outcomes without data warehouse infrastructure management.

In this post, we demonstrate how Talend easily integrates with Redshift Serverless to help you accelerate and scale data analytics with trusted data.

About Redshift Serverless

Redshift Serverless makes it simple to run and scale analytics without having to manage your data warehouse infrastructure. Data scientists, developers, and data analysts can access meaningful insights and build data-driven applications with zero maintenance. Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use. Because Redshift Serverless seamlessly integrates with your data landscape, you can load your data and start querying in your favorite business intelligence (BI) tools, build machine learning (ML) models in SQL, or combine your data with third-party data for new insights. Existing Amazon Redshift customers can migrate their Redshift clusters to Redshift Serverless using the Amazon Redshift console or API without making changes to their applications.

About Talend

Talend is an AWS ISV Partner with the Amazon Redshift Ready Product designation and AWS Competencies in both Data and Analytics and Migration. Talend Cloud combines data integration, data integrity, and data governance in a single, unified platform that makes it easy to collect, transform, clean, govern, and share your data. Talend Stitch is a fully managed, scalable service that helps replicate data into your cloud data warehouse so you can quickly access analytics to make better, faster decisions.

Solution overview

The integration of Talend with Amazon Redshift adds new features and capabilities. As of this writing, Talend has 14 distinct native connectivity and configuration components for Amazon Redshift, which are fully documented in the Talend Help Center.

From the Talend Studio interface, there are no differences or changes required to support or access a Redshift Serverless instance or provisioned cluster.

In the following sections, we detail the steps to integrate the Talend Studio interface with Redshift Serverless.

Prerequisites

To complete the integration, you need a Redshift Serverless data warehouse. For setup instructions, see the Getting Started Guide. You also need a Talend Cloud account and Talend Studio. For setup instructions, see the Talend Cloud installation guide.

Integrate Talend Studio with Redshift Serverless

In the Talend Studio interface, you first create and establish a connection to Redshift Serverless. Then you add an output component for standard loading from your desired source into your Redshift Serverless data warehouse, using the established connection. Alternatively, you can use the tRedshiftBulkExec bulk loading component to load large amounts of data directly into your Redshift Serverless data warehouse. Complete the following steps:

  1. Configure a tRedshiftConnection component to connect to Redshift Serverless:
    • For Database, choose Amazon Redshift.
    • Leave the values for Property Type and Driver version as default.
    • For Host, enter the Redshift Serverless endpoint’s host URL.
    • For Port, enter 5439.
    • For Database, enter your database name.
    • For Schema, enter your preferred schema.
    • For Username and Password, enter your user name and password, respectively.

Follow security best practices by using a strong password policy and regular password rotation to reduce the risk of password-based attacks or exploits.

For more information on how to connect to a database, refer to tDBConnection.

After you create the connection object, you can add an output component to your Talend Studio job. The output component defines that the data being processed in the job’s workflow will land in Redshift Serverless. The following examples show standard output and bulk loading output.

  2. Add a tRedshiftOutput database component.

tRedshiftOutput database component

  3. Configure the tRedshiftOutput database component to write, update, or make changes to the connected Redshift Serverless data warehouse.
  4. When using the tRedshiftOutput component, select Use an existing component and choose the connection you created.

This step makes sure that this component is pre-configured.

tDBOutput component

For more information on how to set up a tDBOutput component, see tDBOutput.

  5. Alternatively, you can configure a tRedshiftBulkExec database component to run the insert operations on the connected Redshift Serverless data warehouse.

Using the tRedshiftBulkExec database component allows you to mass load data files directly from Amazon Simple Storage Service (Amazon S3) into Redshift Serverless as tables. The following screenshot illustrates that Talend is able to use connection information in a job across multiple components, saving time and effort when establishing connections to both Amazon Redshift and Amazon S3.

  6. When using the tRedshiftBulkExec component, select Use an existing component for Database settings and choose the connection you created.

This makes sure that this component is preconfigured.

  7. For S3 Setting, select Use an existing S3 connection and enter your existing connection that you will configure separately.

tDBBulkExec component

For more information on how to set up a tDBBulkExec component, see tDBBulkExec.
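
Behind the scenes, tRedshiftBulkExec issues an Amazon Redshift COPY command to mass load files from Amazon S3. A minimal sketch of the equivalent SQL, built in Python; the table, bucket, and IAM role are placeholders, not values from this post:

```python
def build_copy_statement(table, s3_path, iam_role_arn, fmt="CSV"):
    """Build the Redshift COPY statement that bulk loads a table from S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS {fmt};"
    )

sql = build_copy_statement(
    "public.sales",
    "s3://my-bucket/staging/sales/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```

COPY loads files in parallel across the warehouse's compute, which is why the bulk path is much faster than row-by-row inserts for large datasets.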

In addition to using Talend Cloud for enterprise-level data transformation needs, you can use Talend Stitch to handle data ingestion and data replication to Redshift Serverless. All configuration for ingesting or replicating data from your desired sources to Redshift Serverless is done in a single input screen.

  1. Provide the following parameters:
    • For Display Name, enter your preferred display name for this connection.
    • For Description, enter a description of the connection. This is optional.
    • For Host, enter the Redshift Serverless endpoint’s host URL.
    • For Port, enter 5439.
    • For Database, enter your database name.
    • For Username and Password, enter your user name and password, respectively.

All support documents and information (including diagrams, steps, and screenshots) can be found in the Talend Cloud and Talend Stitch documentation.

Summary

In this post, we demonstrated how the integration of Talend with Redshift Serverless helps you quickly integrate multiple data sources into a fully managed, secure platform and immediately enable business-wide analytics.

Check out AWS Marketplace and sign up for a free trial with Talend. For more information about Redshift Serverless, refer to the Getting Started Guide.


About the Authors

Tamara Astakhova is a Sr. Partner Solutions Architect in Data and Analytics at AWS. She has over 18 years of experience in the architecture and development of large-scale data analytics systems. Tamara is working with strategic partners helping them build complex AWS-optimized architectures.

Cameron Davie is a Principal Solutions Engineer for the Tech Alliances team. He oversees the technical responsibilities of Talend’s most strategic ISV partnerships. Cameron has been with Talend for 6 years in this role, working directly as the primary technical resource for partners such as AWS, Snowflake, and more. Cameron’s role at Talend is primarily focused on technical enablement and evangelism. This includes showcasing key capabilities of partners’ solutions internally as well as demonstrating Talend’s core technical capabilities to the technical sellers at Talend’s strategic ISV partners. Cameron is a veteran of ISV partnerships and enterprise software, with over 23 years of experience. Before Talend, he spent 14 years at SAP on their OEM/Embedded Solutions partnership team.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/orca-securitys-journey-to-a-petabyte-scale-data-lake-with-apache-iceberg-and-aws-analytics/

This post is co-written with Eliad Gat and Oded Lifshiz from Orca Security.

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. One key component that plays a central role in modern data architectures is the data lake, which allows organizations to store and analyze large amounts of data in a cost-effective manner and run advanced analytics and machine learning (ML) at scale.

Orca Security is an industry-leading Cloud Security Platform that identifies, prioritizes, and remediates security risks and compliance issues across your AWS Cloud estate. Orca connects to your environment in minutes with patented SideScanning technology to provide complete coverage across vulnerabilities, malware, misconfigurations, lateral movement risk, weak and leaked passwords, overly permissive identities, and more.

The Orca Platform is powered by a state-of-the-art anomaly detection system that uses cutting-edge ML algorithms and big data capabilities to detect potential security threats and alert customers in real time, ensuring maximum security for their cloud environment. At the core of Orca’s anomaly detection system is its transactional data lake, which enables the company’s data scientists, analysts, data engineers, and ML specialists to extract valuable insights from vast amounts of data and deliver innovative cloud security solutions to its customers.

In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics. We explore why Orca chose to build a transactional data lake and examine the key considerations that guided the selection of Apache Iceberg as the preferred table format.

In addition, we describe the Orca Platform architecture and the technologies used. Lastly, we discuss the challenges encountered throughout the project, present the solutions used to address them, and share valuable lessons learned.

Why did Orca build a data lake?

Prior to the creation of the data lake, Orca’s data was distributed among various data silos, each owned by a different team with its own data pipelines and technology stack. This setup led to several issues, including scaling difficulties as the data size grew, maintaining data quality, ensuring consistent and reliable data access, high costs associated with storage and processing, and difficulties supporting streaming use cases. Moreover, running advanced analytics and ML on disparate data sources proved challenging. To overcome these issues, Orca decided to build a data lake.

A data lake is a centralized data repository that enables organizations to store and manage large volumes of structured and unstructured data, eliminating data silos and facilitating advanced analytics and ML on the entire data. By decoupling storage and compute, data lakes promote cost-effective storage and processing of big data.

Why did Orca choose Apache Iceberg?

Orca considered several table formats that have evolved in recent years to support its transactional data lake. Among the options, Apache Iceberg stood out as the ideal choice because it met all of Orca’s requirements.

First, Orca sought a transactional table format that ensures data consistency and fault tolerance. Apache Iceberg’s transactional and ACID guarantees, which allow concurrent read and write operations while ensuring data consistency and simplified fault handling, fulfill this requirement. Furthermore, Apache Iceberg’s support for time travel and rollback capabilities makes it highly suitable for addressing data quality issues by reverting to a previous state in a consistent manner.

Second, a key requirement was to adopt an open table format that integrates with various processing engines. This was to avoid vendor lock-in and allow teams to choose the processing engine that best suits their needs. Apache Iceberg’s engine-agnostic and open design meets this requirement by supporting all popular processing engines, including Apache Spark, Amazon Athena, Apache Flink, Trino, Presto, and more.

In addition, given the substantial data volumes handled by the system, an efficient table format was required that can support querying petabytes of data very fast. Apache Iceberg’s architecture addresses this need by efficiently filtering and reducing scanned data, resulting in accelerated query times.

An additional requirement was to allow seamless schema changes without impacting end-users. Apache Iceberg’s range of features, including schema evolution, hidden partitions, and partition evolution, addresses this requirement.
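
Iceberg's schema evolution and hidden partitioning map to plain SQL in engines such as Spark. The following sketch uses a hypothetical `lake.events` table; each statement changes only table metadata, so no data files are rewritten:

```python
# Hypothetical table and column names, shown as Spark SQL for Iceberg.
schema_changes = [
    # Add a column; existing rows simply read it as NULL.
    "ALTER TABLE lake.events ADD COLUMN region STRING",
    # Rename a column; readers see the new name immediately.
    "ALTER TABLE lake.events RENAME COLUMN ip TO source_ip",
    # Hidden partitioning: users filter on event_time directly and never
    # need to know about the derived daily partition column.
    "CREATE TABLE lake.events_v2 (id BIGINT, event_time TIMESTAMP) "
    "USING iceberg PARTITIONED BY (days(event_time))",
]
for stmt in schema_changes:
    print(stmt)
```

Because these are metadata operations, they complete in seconds even on petabyte-scale tables, which is what makes "seamless schema changes without impacting end-users" practical.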

Lastly, it was important for Orca to choose a table format that is widely adopted. Apache Iceberg’s growing and active community aligned with the requirement for a popular and community-backed table format.

Solution overview

Orca’s data lake is based on open-source technologies that seamlessly integrate with Apache Iceberg. The system ingests data from various sources such as cloud resources, cloud activity logs, and API access logs, and processes billions of messages, resulting in terabytes of data daily. This data is sent to Apache Kafka, which is hosted on Amazon Managed Streaming for Apache Kafka (Amazon MSK). It is then processed using Apache Spark Structured Streaming running on Amazon EMR and stored in the data lake. Amazon EMR streamlines the process of loading all required Iceberg packages and dependencies, ensuring that the data is stored in Apache Iceberg format and ready for consumption as quickly as possible.
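
A Kafka-to-Iceberg streaming job of this shape typically needs an Iceberg catalog registered with the Spark session plus a Kafka source configuration. The following is a sketch of that configuration; the catalog name, bucket, broker address, and topic are assumptions, not Orca's actual values:

```python
# Register an Iceberg catalog named "lake" backed by the AWS Glue Data
# Catalog, with table data stored on S3. These keys are standard Iceberg
# Spark runtime settings; the values are placeholders.
spark_conf = {
    "spark.sql.catalog.lake": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lake.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.lake.warehouse": "s3://my-data-lake/warehouse/",
    "spark.sql.catalog.lake.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
}

# Structured Streaming source options for reading from an MSK topic.
kafka_source_options = {
    "kafka.bootstrap.servers": "msk-broker-1:9092",
    "subscribe": "cloud-activity-logs",
    "startingOffsets": "latest",
}

# The streaming writer would then be configured roughly as:
#   df.writeStream.format("iceberg")
#     .option("checkpointLocation", "s3://my-data-lake/checkpoints/events/")
#     .toTable("lake.db.events")
for key, value in sorted(spark_conf.items()):
    print(f"{key}={value}")
```

The checkpoint location is what gives the pipeline exactly-once delivery into the Iceberg table across restarts.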

The data lake is built on top of Amazon S3 using Apache Iceberg table format with Apache Parquet as the underlying file format. In addition, the AWS Glue Data Catalog enables data discovery, and AWS Identity and Access Management (IAM) enforces secure access controls for the lake and its operations.

The data lake serves as the foundation for a variety of capabilities that are supported by different engines.

Data pipelines built on Apache Spark and Athena SQL analyze and process the data stored in the data lake. These data pipelines generate valuable insights and curated data that are stored in Apache Iceberg tables for downstream usage. This data is then used by various applications for streaming analytics, business intelligence, and reporting.

Amazon SageMaker is used to build, train, and deploy a range of ML models. Specifically, the system uses Amazon SageMaker Processing jobs to process the data stored in the data lake, employing the AWS SDK for Pandas (previously known as AWS Wrangler) for various data transformation operations, including cleaning, normalization, and feature engineering. This ensures that the data is suitable for training purposes. Additionally, SageMaker training jobs are employed for training the models. After the models are trained, they are deployed and used to identify anomalies and alert customers in real time to potential security threats. The following diagram illustrates the solution architecture.

Orca security Data Lake Architecture

Challenges and lessons learned

Orca faced several challenges while building its petabyte-scale data lake, including:

  • Determining optimal table partitioning
  • Optimizing EMR streaming ingestion for high throughput
  • Taming the small files problem for fast reads
  • Maximizing performance with Athena version 3
  • Maintaining Apache Iceberg tables
  • Managing data retention
  • Monitoring the data lake infrastructure and operations
  • Mitigating data quality issues

In this section, we describe each of these challenges and the solutions implemented to address them.

Determining optimal table partitioning

Determining the optimal partitioning for each table is important for query performance, and partitioning changes should have minimal impact on the teams querying those tables. Apache Iceberg's hidden partitioning, combined with partition transforms, proved valuable in achieving this goal because it allows partitioning to change transparently, without affecting end-users. Additionally, partition evolution enables experimentation with various partitioning strategies to optimize cost and performance without requiring a rewrite of the table's data every time.

For example, with these features, Orca was able to easily change the partitioning of several tables from DAY to HOUR with no impact on user queries. Without this native Iceberg capability, they would have needed to coordinate the new schema with every team that queries the tables and rewrite the entire dataset, which would have been a costly, time-consuming, and error-prone process.

Optimizing EMR streaming ingestion for high throughput

As mentioned previously, the system ingests billions of messages daily, resulting in terabytes of data processed and stored each day. Therefore, optimizing the EMR clusters for this type of load while maintaining high throughput and low costs has been an ongoing challenge. Orca addressed this in several ways.

First, Orca chose to use instance fleets with its EMR clusters because they allow optimized resource allocation by combining different instance types and sizes. Instance fleets improve resilience by allowing multiple Availability Zones to be configured. As a result, the cluster will launch in an Availability Zone with all the required instance types, preventing capacity limitations. Additionally, instance fleets can use both Amazon Elastic Compute Cloud (Amazon EC2) On-Demand and Spot instances, resulting in cost savings.

The process of sizing the cluster for high throughput and lower costs involved adjusting the number of core and task nodes, selecting suitable instance types, and fine-tuning CPU and memory configurations. Ultimately, Orca arrived at an optimal configuration consisting of On-Demand core nodes and Spot task nodes of varying sizes, which provided high throughput while ensuring compliance with SLAs.

Orca also found that using different Kafka Spark Structured Streaming properties, such as minOffsetsPerTrigger, maxOffsetsPerTrigger, and minPartitions, provided higher throughput and better control of the load. Using minPartitions, which enables better parallelism and distribution across a larger number of tasks, was particularly useful for consuming high lags quickly.
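To make the effect of `minPartitions` concrete, the following pure-Python sketch (with hypothetical topic names and offsets) splits a batch of Kafka offset ranges into more tasks than there are topic partitions, proportionally to each partition's lag. This is a simplified model of what Spark does when `minPartitions` exceeds the topic's partition count, not Spark's actual implementation:

```python
def split_offset_ranges(ranges, min_partitions):
    """Split Kafka (start, end) offset ranges into at least
    min_partitions tasks, proportionally to each partition's lag.
    Simplified model of Spark's minPartitions behavior."""
    total = sum(end - start for start, end in ranges.values())
    tasks = []
    for tp, (start, end) in ranges.items():
        lag = end - start
        # Number of slices for this topic-partition, at least one
        n = max(1, round(min_partitions * lag / total))
        bounds = [start + (lag * i) // n for i in range(n)] + [end]
        for i in range(n):
            tasks.append((tp, bounds[i], bounds[i + 1]))
    return tasks

# Two topic-partitions, but four tasks: the lagging partition
# ("events-1") is split three ways, enabling better parallelism.
tasks = split_offset_ranges(
    {"events-0": (0, 1000), "events-1": (0, 3000)}, min_partitions=4
)
```

The same proportional split is why high lags drain faster: each slice becomes an independent Spark task instead of one task per Kafka partition.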

Lastly, when dealing with a high data ingestion rate, Amazon S3 may throttle the requests and return 503 errors. To address this scenario, Iceberg offers a table property called write.object-storage.enabled, which incorporates a hash prefix into the stored S3 object path. This approach effectively mitigates throttling problems.
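The idea behind `write.object-storage.enabled` can be sketched in a few lines: injecting a hash component into the object path spreads writes across many S3 prefixes instead of hammering one hot prefix. The path layout below is illustrative only; Iceberg's actual layout differs:

```python
import hashlib

def object_storage_path(table_location, data_file_name):
    """Prepend a short hash component to the object path so that
    concurrent writes fan out across many S3 prefixes. Simplified
    sketch of the idea behind write.object-storage.enabled."""
    digest = hashlib.sha256(data_file_name.encode()).hexdigest()[:8]
    return f"{table_location}/{digest}/{data_file_name}"

p1 = object_storage_path("s3://lake/events/data", "part-00001.parquet")
p2 = object_storage_path("s3://lake/events/data", "part-00002.parquet")
```

Because S3 partitions request capacity per prefix, distributing objects across hashed prefixes raises the aggregate request rate the bucket can sustain before returning 503 errors.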

Taming the small files problem for fast reads

A common challenge often encountered when ingesting streaming data into the data lake is the creation of many small files. This can have a negative impact on read performance when querying the data with Athena or Apache Spark. Having a high number of files leads to longer query planning and runtimes due to the need to process and read each file, resulting in overhead for file system operations and network communication. Additionally, this can result in higher costs due to the large number of S3 PUT and GET requests required.

To address this challenge, Apache Spark Structured Streaming provides the trigger mechanism, which can be used to tune the rate at which data is committed to Apache Iceberg tables. The commit rate has a direct impact on the number of files being produced: a higher commit rate, corresponding to a shorter trigger interval, produces a larger number of smaller data files.
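The relationship between trigger interval and file count is simple arithmetic: each micro-batch commit writes roughly one file per writer task. The numbers below are hypothetical, but they show why lengthening the trigger interval reduces the file count so dramatically:

```python
def files_per_day(trigger_seconds, writer_tasks):
    """Rough estimate: each commit writes about one file per
    writer task, so daily file count scales inversely with the
    trigger interval. Illustrative arithmetic, not an API."""
    commits = 24 * 3600 // trigger_seconds
    return commits * writer_tasks

frequent = files_per_day(trigger_seconds=60, writer_tasks=100)   # 1-minute trigger
batched = files_per_day(trigger_seconds=3600, writer_tasks=100)  # hourly trigger
```

With 100 writer tasks, a 1-minute trigger produces 144,000 files per day versus 2,400 for an hourly trigger, at the cost of data freshness.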

In certain cases, launching the Spark cluster on an hourly basis and configuring the trigger to AvailableNow facilitated the processing of larger data batches and reduced the number of small files created. Although this approach led to cost savings, it did involve a trade-off of reduced data freshness. However, this trade-off was deemed acceptable for specific use cases.

In addition, to address preexisting small files within the data lake, Apache Iceberg offers a data files compaction operation that combines these smaller files into larger ones. Running this operation on a schedule is highly recommended to optimize the number and size of the files. Compaction also proves valuable in handling late-arriving data and enables the integration of this data into consolidated files.
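In practice the compaction operation is run with Iceberg's Spark procedures; the stdlib sketch below only illustrates the planning step of the binpack strategy, greedily grouping small files into rewrite groups of roughly a target size (file sizes and target are hypothetical):

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedy bin-pack: group small files into rewrite groups of
    roughly target_mb each. Simplified sketch of what Iceberg's
    binpack rewrite strategy plans before Spark executes it."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 100 files of 10 MB each collapse into two ~500 MB rewrite groups.
groups = plan_compaction([10] * 100, target_mb=500)
```

Each group is then rewritten as one (or a few) large files, which is why scheduled compaction keeps both file counts and query planning time under control.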

Maximizing performance with Athena version 3

Orca was an early adopter of Athena version 3, Amazon’s implementation of the Trino query engine, which provides extensive support for Apache Iceberg. Whenever possible, Orca preferred using Athena over Apache Spark for data processing. This preference was driven by the simplicity and serverless architecture of Athena, which led to reduced costs and easier usage, unlike Spark, which typically required provisioning and managing a dedicated cluster at higher costs.

In addition, Orca used Athena as part of its model training and as the primary engine for ad hoc exploratory queries conducted by data scientists, business analysts, and engineers. However, for maintaining Iceberg tables and updating table properties, Apache Spark remained the more scalable and feature-rich option.

Maintaining Apache Iceberg tables

Ensuring optimal query performance and minimizing storage overhead became a significant challenge as the data lake grew to a petabyte scale. To address this challenge, Apache Iceberg offers several maintenance procedures, such as the following:

  • Data files compaction – This operation, as mentioned earlier, involves combining smaller files into larger ones and reorganizing the data within them. This operation not only reduces the number of files but also enables data sorting based on different columns or clustering similar data using z-ordering. Using Apache Iceberg’s compaction results in significant performance improvements, especially for large tables, making a noticeable difference in query performance between compacted and uncompacted data.
  • Expiring old snapshots – This operation provides a way to remove outdated snapshots and their associated data files, enabling Orca to maintain low storage costs.

Running these maintenance procedures efficiently and cost-effectively using Apache Spark, particularly the compaction operation, which operates on terabytes of data daily, requires careful consideration. This entails appropriately sizing the Spark cluster running on EMR and adjusting various settings such as CPU and memory.

In addition, using Apache Iceberg’s metadata tables proved to be very helpful in identifying issues related to the physical layout of Iceberg’s tables, which can directly impact query performance. Metadata tables offer insights into the physical data storage layout of the tables and can conveniently be queried with Athena version 3. By accessing the metadata tables, crucial information about tables’ data files, manifests, history, partitions, snapshots, and more can be obtained, which aids in understanding and optimizing the table’s data layout.

For instance, the following queries can uncover valuable information about the underlying data:

  • The number of files and their average size per partition:
    SELECT partition, file_count, (total_size / file_count) AS avg_file_size FROM "db"."table$partitions"

  • The number of data files pointed to by each manifest:
    SELECT path, added_data_files_count + existing_data_files_count AS number_of_data_files FROM "db"."table$manifests"

  • Information about the data files:
    SELECT file_path, file_size_in_bytes FROM "db"."table$files"

  • Information related to data completeness:
    SELECT record_count, partition FROM "db"."table$partitions"

Managing data retention

Effective management of data retention in a petabyte-scale data lake is crucial to ensure low storage costs as well as to comply with GDPR. However, implementing such a process can be challenging when dealing with Iceberg data stored in S3 buckets, because deleting files based on simple S3 lifecycle policies could potentially cause table corruption. This is because Iceberg’s data files are referenced in manifest files, so any changes to data files must also be reflected in the manifests.

To address this challenge, certain considerations must be taken into account while handling data retention properly. Apache Iceberg provides two modes for handling deletes, namely copy-on-write (CoW) and merge-on-read (MoR). In CoW mode, Iceberg rewrites the affected data files at deletion time and creates new data files, whereas in MoR mode, instead of rewriting the data files, it writes a delete file that lists the positions of deleted records. These delete files are then reconciled with the remaining data at read time.

Because it yields faster read times, CoW mode is preferable, and when used in conjunction with the expiring old snapshots operation, it allows for the hard deletion of data files that have exceeded the set retention period.

In addition, by storing the data sorted on the field used for deletion (for example, organizationID), it’s possible to reduce the number of files that require rewriting. This optimization significantly enhances the efficiency of the deletion process, resulting in improved deletion times.
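The effect of sorting can be modeled with each file's min/max value range for the deletion field. When data is sorted, each organization's records concentrate in a few files, so only those files need a copy-on-write rewrite; unsorted, every file may contain the organization. The ranges below are hypothetical:

```python
def files_to_rewrite(file_ranges, org_id):
    """Return indices of data files whose [min_org, max_org] range
    may contain the organization being deleted; only these need a
    CoW rewrite. Simplified model of min/max-based file pruning."""
    return [i for i, (lo, hi) in enumerate(file_ranges) if lo <= org_id <= hi]

# Sorted by organizationID: each file covers a narrow range.
sorted_files = [(1, 100), (101, 200), (201, 300), (301, 400)]
# Unsorted: any file can contain any organization.
unsorted_files = [(1, 400)] * 4

hit_sorted = files_to_rewrite(sorted_files, org_id=150)
hit_unsorted = files_to_rewrite(unsorted_files, org_id=150)
```

Here the sorted layout rewrites one file instead of four; at petabyte scale, that difference dominates the cost of a GDPR deletion job.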

Monitoring the data lake infrastructure and operations

Managing a data lake infrastructure is challenging due to the various components it encompasses, including those responsible for data ingestion, storage, processing, and querying.

Effective monitoring of all these components involves tracking resource utilization, data ingestion rates, query runtimes, and various other performance-related metrics, and is essential for maintaining optimal performance and detecting issues as soon as possible.

Monitoring Amazon EMR was crucial because it played a vital role in the system for data ingestion, processing, and maintenance. Orca monitored the cluster status and resource usage of Amazon EMR using the metrics available through Amazon CloudWatch. Furthermore, it used JMX Exporter and Prometheus to scrape specific Apache Spark metrics and create custom metrics to further improve the pipelines’ observability.

Another challenge emerged when attempting to further monitor the ingestion progress through Kafka lag. Although Kafka lag tracking is the standard method for monitoring ingestion progress, it posed a challenge because Spark Structured Streaming manages its offsets internally and doesn’t commit them back to Kafka. To overcome this, Orca used Spark’s StreamingQueryListener to track the processed offsets from query progress events, which were then committed to a dedicated Kafka consumer group for lag monitoring.
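Once the listener surfaces the processed offsets, the lag calculation itself is simple: subtract them from the brokers' latest offsets per topic-partition. The sketch below shows only that arithmetic (with hypothetical offsets); fetching end offsets and committing to the monitoring consumer group would use a Kafka client, which is omitted here:

```python
def consumer_lag(end_offsets, processed_offsets):
    """Per-partition and total lag, computed from the brokers'
    latest offsets and the offsets reported via Spark's
    StreamingQueryListener progress events (sketch only)."""
    per_partition = {
        tp: end_offsets[tp] - processed_offsets.get(tp, 0)
        for tp in end_offsets
    }
    return per_partition, sum(per_partition.values())

per_tp, total = consumer_lag(
    end_offsets={"events-0": 5000, "events-1": 7000},
    processed_offsets={"events-0": 4800, "events-1": 6500},
)
```

Committing the processed offsets to a dedicated consumer group means standard Kafka lag dashboards and alerts work unchanged, even though Spark never commits offsets itself.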

In addition, to ensure optimal query performance and identify potential performance issues, it was essential to monitor Athena queries. Orca addressed this by using key metrics from Athena and the AWS SDK for Pandas, specifically TotalExecutionTime and ProcessedBytes. These metrics helped identify any degradation in query performance and keep track of costs, which were based on the size of the data scanned.
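Because Athena bills by data scanned, the ProcessedBytes metric translates directly into cost. The helper below assumes the common list price of $5 per TB scanned, which varies by region, so it is kept as a parameter rather than hardcoded:

```python
def athena_query_cost(processed_bytes, usd_per_tb=5.0):
    """Estimate an Athena query's cost from its ProcessedBytes
    (data scanned) metric. $5/TB is a common list price but is
    region-dependent; treat it as an assumption."""
    tb = processed_bytes / 1024 ** 4
    return tb * usd_per_tb

cost = athena_query_cost(250 * 1024 ** 3)  # a 250 GB scan
```

Tracking this per query makes cost regressions visible the same way TotalExecutionTime makes performance regressions visible, and both tend to improve together after compaction, since less data is scanned.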

Mitigating data quality issues

Apache Iceberg’s capabilities and overall architecture played a key role in mitigating data quality challenges.

One of the ways Apache Iceberg addresses these challenges is through its schema evolution capability, which enables users to modify or add columns to a table’s schema without rewriting the entire data. This feature prevents data quality issues that may arise due to schema changes, because the table’s schema is managed as part of the manifest files, ensuring safe changes.

Furthermore, Apache Iceberg’s time travel feature provides the ability to review a table’s history and roll back to a previous snapshot. This functionality has proven to be extremely useful in identifying potential data quality issues and swiftly resolving them by reverting to a previous state with known data integrity.

These robust capabilities ensure that data within the data lake remains accurate, consistent, and reliable.

Conclusion

Data lakes are an essential part of a modern data architecture, and now it’s easier than ever to create a robust, transactional, cost-effective, and high-performance data lake by using Apache Iceberg, Amazon S3, and AWS Analytics services such as Amazon EMR and Athena.

Since building the data lake, Orca has observed significant improvements. The data lake infrastructure has allowed Orca’s platform to scale seamlessly while reducing the cost of running its data pipelines by over 50% using Amazon EMR. Additionally, query costs were reduced by more than 50% using the efficient querying capabilities of Apache Iceberg and Athena version 3.

Most importantly, the data lake has made a profound impact on Orca’s platform and continues to play a key role in its success, supporting new use cases such as change data capture (CDC) and others, and enabling the development of cutting-edge cloud security solutions.

If Orca’s journey has sparked your interest and you are considering implementing a similar solution in your organization, here are some strategic steps to consider:

  • Start by thoroughly understanding your organization’s data needs and how this solution can address them.
  • Reach out to experts, who can provide you with guidance based on their own experiences. Consider engaging in seminars, workshops, or online forums that discuss these technologies. The following resources are recommended for getting started:
  • An important part of this journey would be to implement a proof of concept. This hands-on experience will provide valuable insights into the complexities of a transactional data lake.

Embarking on a journey to a transactional data lake using Amazon S3, Apache Iceberg, and AWS Analytics can vastly improve your organization’s data infrastructure, enabling advanced analytics and machine learning, and unlocking insights that drive innovation.


About the Authors

Eliad Gat is a Big Data & AI/ML Architect at Orca Security. He has over 15 years of experience designing and building large-scale cloud-native distributed systems, specializing in big data, analytics, AI, and machine learning.

Oded Lifshiz is a Principal Software Engineer at Orca Security. He enjoys combining his passion for delivering innovative, data-driven solutions with his expertise in designing and building large-scale machine learning pipelines.

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value. Yonatan also leads the Apache Iceberg Israel community.

Carlos Rodrigues is a Big Data Specialist Solutions Architect at Amazon Web Services. He helps customers worldwide build transactional data lakes on AWS using open table formats like Apache Hudi and Apache Iceberg.

Sofia Zilberman is a Sr. Analytics Specialist Solutions Architect at Amazon Web Services. She has a track record of 15 years of creating large-scale, distributed processing systems. She remains passionate about big data technologies and architecture trends, and is constantly on the lookout for functional and technological innovations.

Reimagine Software Development With CodeWhisperer as Your AI Coding Companion

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/reimagine-software-development-with-codewhisperer-as-your-ai-coding-companion/

In the few months since Amazon CodeWhisperer became generally available, many customers have used it to simplify and streamline the way they develop software. CodeWhisperer uses generative AI powered by a foundation model to understand the semantics and context of your code and provide relevant and useful suggestions. It can help build applications faster and more securely, and it can help at different levels, from small suggestions to writing full functions and unit tests that help decompose a complex problem into simpler tasks.

Imagine you want to improve your code test coverage or implement a fine-grained authorization model for your application. As you begin writing your code, CodeWhisperer is there, working alongside you. It understands your comments and existing code, providing real-time suggestions that can range from snippets to entire functions or classes. This immediate assistance adapts to your flow, reducing the need for context-switching to search for solutions or syntax tips. Using a code companion can enhance focus and productivity during the development process.

When you encounter an unfamiliar API, CodeWhisperer accelerates your work by offering relevant code suggestions. In addition, CodeWhisperer offers a comprehensive code scanning feature that can detect elusive vulnerabilities and provide suggestions to rectify them. This aligns with best practices such as those outlined by the Open Worldwide Application Security Project (OWASP). This makes coding not just more efficient, but also more secure and with an increased assurance in the quality of your work.

CodeWhisperer can also flag code suggestions that resemble open-source training data, and flag and remove problematic code that might be considered biased or unfair. It provides you with the associated open-source project’s repository URL and license, making it easier for you to review them and add attribution where necessary.

Here are a few examples of CodeWhisperer in action that span different areas of software development, from prototyping and onboarding to data analytics and permissions management.

CodeWhisperer Speeds Up Prototyping and Onboarding
One customer using CodeWhisperer in an interesting way is BUILDSTR, a consultancy that provides cloud engineering services focused on platform development and modernization. They use Node.js and Python in the backend and mainly React in the frontend.

I talked with Kyle Hines, co-founder of BUILDSTR, who said, “leveraging CodeWhisperer across different types of development projects for different customers, we’ve seen a huge impact in prototyping. For example, we are impressed by how quickly we are able to create templates for AWS Lambda functions interacting with other AWS services such as Amazon DynamoDB.” Kyle said their prototyping now takes 40% less time, and they noticed a reduction of more than 50% in the number of vulnerabilities present in customer environments.

Screenshot of a code editor using CodeWhisperer to generate the handler of an AWS Lambda function.

Kyle added, “Because hiring and developing new talent is a perpetual process for consultancies, we leveraged CodeWhisperer for onboarding new developers and it helps BUILDSTR Academy reduce the time and complexity for onboarding by more than 20%.”

CodeWhisperer for Exploratory Data Analysis
Wendy Wong is a business performance analyst building data pipelines at Service NSW and agile projects in AI. For her contributions to the community, she’s also an AWS Data Hero. She says Amazon CodeWhisperer has significantly accelerated her exploratory data analysis process, when she is analyzing a dataset to get a summary of its main characteristics using statistics and visualization tools.

She finds CodeWhisperer to be a swift, user-friendly, and dependable coding companion that accurately infers her intent with each line of code she crafts, and ultimately aids in the enhancement of her code quality through its best practice suggestions.

“Using CodeWhisperer, building code feels so much easier when I don’t have to remember every detail as it will accurately autocomplete my code and comments,” she shared. “Earlier, it would take me 15 minutes to set up data preparation pre-processing tasks, but now I’m ready to go in 5 minutes.”

Screenshot of exploratory data analysis using Amazon CodeWhisperer in a Jupyter notebook.

Wendy says she has gained efficiency by delegating these repetitive tasks to CodeWhisperer, and she wrote a series of articles to explain how to use it to simplify exploratory data analysis.

Another tool used to explore data sets is SQL. Wendy is looking into how CodeWhisperer can help data engineers who are not SQL experts. For instance, she noticed they can just ask to “write multiple joins” or “write a subquery” to quickly get the correct syntax to use.

Asking Amazon CodeWhisperer to generate SQL syntax and code.

CodeWhisperer Accelerates Testing and Other Daily Tasks
I had the opportunity to spend some time with software engineers in the AWS Developer Relations Platform team. That’s the team that, among other things, builds and operates the community.aws website.

Screenshot of the community.aws website, built and operated by the AWS Developer Relations Platform team with some help from Amazon CodeWhisperer.

Nikitha Tejpal’s work primarily revolves around TypeScript, and CodeWhisperer aids her coding process by offering effective autocomplete suggestions that come up as she types. She said she specifically likes the way CodeWhisperer helps with unit tests.

“I can now focus on writing the positive tests, and then use a comment to have CodeWhisperer suggest negative tests for the same code,” she says. “In this way, I can write unit tests in 40% less time.”

Her colleague, Carlos Aller Estévez, relies on CodeWhisperer’s autocomplete feature to provide him with suggestions for a line or two to supplement his existing code, which he accepts or ignores based on his own discretion. Other times, he proactively leverages the predictive abilities of CodeWhisperer to write code for him. “If I want explicitly to get CodeWhisperer to code for me, I write a method signature with a comment describing what I want, and I wait for the autocomplete,” he explained.

For instance, when Carlos’s objective was to check if a user had permissions on a given path or any of its parent paths, CodeWhisperer provided a neat solution for part of the problem based on Carlos’s method signature and comment. The generated code checks the parent directories of a given resource, then creates a list of all possible parent paths. Carlos then implemented a simple permission check over each path to complete the implementation.
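The parent-path check Carlos describes is easy to sketch. The code below is a hypothetical reconstruction of the pattern, not his actual code: enumerate the resource's parent paths, then grant access if the user holds a grant on any of them:

```python
def parent_paths(path):
    """Return the path and all of its parents, from the resource
    itself up to the root.
    e.g. /a/b/c -> ['/a/b/c', '/a/b', '/a', '/']"""
    parts = [p for p in path.split("/") if p]
    paths = ["/" + "/".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return paths + ["/"]

def has_permission(user_grants, path):
    """Grant access if the user holds a grant on the path itself
    or on any parent path (illustrative permission model)."""
    return any(p in user_grants for p in parent_paths(path))

allowed = has_permission({"/projects/alpha"}, "/projects/alpha/docs/readme.md")
```

A grant on `/projects/alpha` thus covers everything beneath it, which matches the inherited-permission behavior the generated code was meant to support.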

“CodeWhisperer helps with algorithms and implementation details so that I have more time to think about the big picture, such as business requirements, and create better solutions,” he added.

Code generated by CodeWhisperer based on method signature and comment.

CodeWhisperer is a Multilingual Team Player
CodeWhisperer is polyglot, supporting code generation for 15 programming languages: Python, Java, JavaScript, TypeScript, C#, Go, Rust, PHP, Ruby, Kotlin, C, C++, Shell scripting, SQL, and Scala.

CodeWhisperer is also a team player. In addition to Visual Studio (VS) Code and the JetBrains family of IDEs (including IntelliJ, PyCharm, GoLand, CLion, PhpStorm, RubyMine, Rider, WebStorm, and DataGrip), CodeWhisperer is also available for JupyterLab, in AWS Cloud9, in the AWS Lambda console, and in Amazon SageMaker Studio.

At AWS, we are committed to helping our customers transform responsible AI from theory into practice by investing to build new services to meet the needs of our customers and make it easier for them to identify and mitigate bias, improve explainability, and help keep data private and secure.

You can use Amazon CodeWhisperer for free in the Individual Tier. See CodeWhisperer pricing for more information. To get started, follow these steps.

Danilo