Tag Archives: Migration

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

Post Syndicated from Sivaprasad Mahamkali original https://aws.amazon.com/blogs/big-data/modernize-your-etl-platform-with-aws-glue-studio-a-case-study-from-bms/

This post is co-written with Ramesh Daddala, Jitendra Kumar Dash, and Pavan Kumar Bijja from Bristol Myers Squibb.

Bristol Myers Squibb (BMS) is a global biopharmaceutical company whose mission is to discover, develop, and deliver innovative medicines that help patients prevail over serious diseases. BMS is consistently innovating, achieving significant clinical and regulatory successes. In collaboration with AWS, BMS identified a business need to migrate and modernize their custom extract, transform, and load (ETL) platform to a native AWS solution to reduce the complexity, resources, and investment required to upgrade when new Spark, Python, or AWS Glue versions are released. Beyond moving to native managed AWS services that BMS didn’t need to worry about upgrading, BMS wanted to offer non-technical business users an ETL service with which they could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine. AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor ETL jobs in AWS Glue. Offering this service reduced BMS’s operational maintenance and cost, and gave business users the flexibility to perform ETL jobs with ease.

For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. Although this framework met their ETL objectives, it was difficult to maintain and upgrade. BMS’s EDLS platform hosts over 5,000 jobs and is growing at 15% YoY (year over year). Each time the newer version of Apache Spark (and corresponding AWS Glue version) was released, it required significant operational support and time-consuming manual changes to upgrade existing ETL jobs. Manually upgrading, testing, and deploying over 5,000 jobs every few quarters was time consuming, error prone, costly, and not sustainable. Because another release for the EDLS framework was pending, BMS decided to assess alternate managed solutions to reduce their operational and upgrade challenges.

In this post, we share how BMS plans to modernize its ETL platform with AWS Glue Studio, building on the success of the proof of concept.

Solution overview

This solution addresses BMS’s EDLS requirements: replace a custom-built ETL framework that required frequent maintenance and component upgrades (with extensive testing cycles), avoid complexity, and reduce the overall cost of the underlying infrastructure, as validated by the proof of concept. BMS had the following goals:

  • Develop ETL jobs using visual workflows provided by the AWS Glue Studio visual editor. The AWS Glue Studio visual editor is a low-code environment that allows you to compose data transformation workflows, seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine, and inspect the schema and data results in each step of the job.
  • Migrate over 5,000 existing ETL jobs using native AWS Glue Studio in an automated and scalable manner.

EDLS job steps and metadata

Every EDLS job comprises one or more job steps chained together and run in a predefined order orchestrated by the custom ETL framework. Each job step incorporates the following ETL functions:

  • File ingest – File ingestion enables you to ingest or list files from multiple file sources, like Amazon Simple Storage Service (Amazon S3), SFTP, and more. The metadata holds configurations for the file ingestion step to connect to Amazon S3 or SFTP endpoints and ingest files to the target location. It retrieves the specified files and available metadata to show on the UI.
  • Data quality check – The data quality module enables you to perform quality checks on a huge amount of data and generate reports that describe and validate the data quality. The data quality step uses an EDLS ingested source object from Amazon S3 and runs one or more data conformance checks that are configured by the tenant.
  • Data transform join – This is one of the submodules of the data transform module that can perform joins between the datasets using a custom SQL based on the metadata configuration.
  • Database ingest – The database ingestion step is one of the important service components in EDLS, which enables you to obtain and import the desired data from a database and export it to a specific file in a location of your choice.
  • Data transform – The data transform module performs various data transformations against the source data using JSON-driven rules. Each data transform capability has its own JSON rule and, based on the specific JSON rule you provide, EDLS performs the data transformation on the files available in the Amazon S3 location.
  • Data persistence – The data persistence module is one of the important service components in EDLS, which enables you to obtain the desired data from the source and persist it to an Amazon Relational Database Service (Amazon RDS) database.

The metadata corresponding to each job step includes ingest sources, transformation rules, data quality checks, and data destinations stored in an RDS instance.

Migration utility

The solution involves building a Python utility that reads EDLS metadata from the RDS database and translates each of the job steps into an equivalent AWS Glue Studio visual editor JSON node representation.

AWS Glue Studio provides two types of transforms:

  • AWS Glue-native transforms – These are available to all users and are managed by AWS Glue.
  • Custom visual transforms – This new functionality allows you to upload custom-built transforms used in AWS Glue Studio. Custom visual transforms expand the managed transforms, enabling you to search and use transforms from the AWS Glue Studio interface.

The following is a high-level diagram depicting the sequence flow of migrating a BMS EDLS job to an AWS Glue Studio visual editor job.

Migrating BMS EDLS jobs to AWS Glue Studio includes the following steps:

  1. The Python utility reads existing metadata from the EDLS metadata database.
  2. For each job step type, based on the job metadata, the Python utility selects either the native AWS Glue transform, if available, or a custom-built visual transform (when the native functionality is missing).
  3. The Python utility parses the dependency information from metadata and builds a JSON object representing a visual workflow represented as a Directed Acyclic Graph (DAG).
  4. The JSON object is sent to the AWS Glue API, creating the AWS Glue ETL job. These jobs are visually represented in the AWS Glue Studio visual editor using a series of sources, transforms (native and custom), and targets.
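
The last step can be scripted with the AWS SDK for Python (Boto3), because the AWS Glue CreateJob API accepts the visual DAG through its CodeGenConfigurationNodes parameter. The following is a minimal sketch, not BMS’s actual utility; the job name, role ARN, script location, and worker settings are placeholders that would normally be derived from the EDLS metadata:

import boto3

def create_glue_studio_job(job_name: str, role_arn: str, code_gen_nodes: dict) -> str:
    """Create an AWS Glue Studio visual job from a CodeGenConfigurationNodes dict."""
    glue = boto3.client("glue")
    response = glue.create_job(
        Name=job_name,
        Role=role_arn,
        GlueVersion="4.0",
        WorkerType="G.1X",
        NumberOfWorkers=10,
        Command={
            "Name": "glueetl",
            "ScriptLocation": f"s3://my-glue-assets/scripts/{job_name}.py",  # placeholder location
            "PythonVersion": "3",
        },
        # The DAG JSON built by the utility from the EDLS metadata
        CodeGenConfigurationNodes=code_gen_nodes,
    )
    return response["Name"]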

Sample ETL job generation using AWS Glue Studio

The following flow diagram depicts a sample ETL job that incrementally ingests the source RDBMS data in AWS Glue based on modified timestamps using a custom SQL and merges it into the target data on Amazon S3.

The preceding ETL flow can be represented using the AWS Glue Studio visual editor through a combination of native and custom visual transforms.

Custom visual transform for incremental ingestion

After the proof of concept, BMS and AWS identified that custom transforms would be needed for a subset of jobs from the current EDLS service where native AWS Glue Studio functionality isn’t a natural fit. The BMS team’s requirement was to ingest data from various databases without depending on the existence of transaction logs or specific schema, so AWS Database Migration Service (AWS DMS) wasn’t an option for them. AWS Glue Studio provides the native SQL query visual transform, where a custom SQL query can be used to transform the source data. However, in order to query the source database table based on a modified timestamp column to retrieve new and modified records since the last ETL run, the previous timestamp column state needs to be persisted so it can be used in the current ETL run. This needs to be a recurring process and can also be abstracted across various RDBMS sources, including Oracle, MySQL, Microsoft SQL Server, SAP HANA, and more.

AWS Glue provides a job bookmark feature to track the data that has already been processed during a previous ETL run. An AWS Glue job bookmark supports one or more columns as the bookmark keys to determine new and processed data, and it requires that the keys are sequentially increasing or decreasing without gaps. Although this works for many incremental load use cases, the requirement is to ingest data from different sources without depending on any specific schema, so we didn’t use an AWS Glue job bookmark in this use case.

The SQL-based incremental ingestion pull can be developed in a generic way as a custom visual transform; we demonstrate it here with a sample incremental ingestion job from a MySQL database. The incremental data is merged into the target Amazon S3 location in Apache Hudi format using an upsert write operation.

In the following example, we’re using the MySQL data source node to define the connection, but the DynamicFrame of the data source itself is not used. The custom transform node (DB incremental ingestion) acts as the source for reading the data incrementally, using the custom SQL query and the previously persisted timestamp from the last ingestion.

The transform accepts as input parameters the preconfigured AWS Glue connection name, database type, table name, and custom SQL (parameterized timestamp field).

The following is the sample visual transform Python code:

import boto3
from awsglue import DynamicFrame
from datetime import datetime

region_name = "us-east-1"

dyna_client = boto3.client('dynamodb', region_name=region_name)
# Default watermark used when no previous run is recorded for the node
HISTORIC_DATE = datetime(1970, 1, 1).strftime("%Y-%m-%d %H:%M:%S")
DYNAMODB_TABLE = "edls_run_stats"

def db_incremental(self, transformation_node, con_name, con_type, table_name, sql_query):
    """Read new and modified rows from the source table since the last successful run."""
    logger = self.glue_ctx.get_logger()

    last_updt_tmst = get_table_last_updt_tmst(logger, DYNAMODB_TABLE, transformation_node)

    logger.info(f"Last updated timestamp from the DynamoDB-> {last_updt_tmst}")

    # Substitute the persisted watermark into the parameterized custom SQL
    sql_query = sql_query.format(**{"lastmdfdtmst": last_updt_tmst})

    connection_options_source = {
        "useConnectionProperties": "true",
        "connectionName": con_name,
        "dbtable": table_name,
        "sampleQuery": sql_query
    }

    df = self.glue_ctx.create_dynamic_frame.from_options(
        connection_type=con_type,
        connection_options=connection_options_source
    )

    return df

# Register the transform on DynamicFrame so AWS Glue Studio can invoke it
DynamicFrame.db_incremental = db_incremental

def get_table_last_updt_tmst(logger, table_name, transformation_node):
    """Fetch the last ingested timestamp for this node from DynamoDB."""
    response = dyna_client.get_item(
        TableName=table_name,
        Key={'transformation_node': {'S': transformation_node}}
    )
    if 'Item' in response and 'last_updt_tmst' in response['Item']:
        return response['Item']['last_updt_tmst']['S']
    return HISTORIC_DATE

To merge the source data into the Amazon S3 target, a data lake framework like Apache Hudi or Apache Iceberg can be used, which is natively supported in AWS Glue 3.0 and later.
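
For example, with AWS Glue 4.0 and the --datalake-formats job parameter set to hudi, the DynamicFrame returned by the custom transform can be upserted into the target location roughly as follows. This is an illustrative sketch: the table name, record key, precombine field, and S3 path are assumptions, and df refers to the DynamicFrame produced by the DB incremental ingestion node.

# Convert the DynamicFrame to a Spark DataFrame and upsert it into a Hudi table on Amazon S3
hudi_options = {
    "hoodie.table.name": "table1",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "modified_tmst",
}

(df.toDF()
   .write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://my-bucket/curated/table1/"))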

You can also use Amazon EventBridge to detect the final AWS Glue job state change and update the Amazon DynamoDB table’s last ingested timestamp accordingly.
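
One way to wire this up is an EventBridge rule on the Glue Job State Change event that invokes a small Lambda function. The following sketch assumes the DynamoDB item is keyed by the Glue job name; in practice you would map the event back to the transformation_node value used by the custom transform.

import boto3

dynamodb = boto3.client("dynamodb")
TABLE_NAME = "edls_run_stats"  # same table the custom transform reads

def lambda_handler(event, context):
    # EventBridge "Glue Job State Change" events carry jobName and state in "detail"
    detail = event.get("detail", {})
    if detail.get("state") != "SUCCEEDED":
        return {"status": "ignored"}
    dynamodb.put_item(
        TableName=TABLE_NAME,
        Item={
            "transformation_node": {"S": detail["jobName"]},  # assumption: job name matches the node key
            "last_updt_tmst": {"S": event["time"]},           # event timestamp as the new watermark
        },
    )
    return {"status": "updated"}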

Build the AWS Glue Studio job using the AWS SDK for Python (Boto3) and AWS Glue API

For the sample ETL flow and the corresponding AWS Glue Studio ETL job we showed earlier, the underlying CodeGenConfigurationNodes structure (part of the AWS Glue job definition, which you can pull using the AWS Command Line Interface (AWS CLI) command aws glue get-job --job-name <jobname>) is represented as a JSON object, shown in the following code:

"CodeGenConfigurationNodes": {<br />"node-1679802581077": {<br />"DynamicTransform": {<br />"Name": "DB Incremental Ingestion",<br />"TransformName": "db_incremental",<br />"Inputs": [<br />"node-1679707801419"<br />],<br />"Parameters": [<br />{<br />"Name": "node_name",<br />"Type": "str",<br />"Value": [<br />"job_123_incr_ingst_table1"<br />],<br />"IsOptional": false<br />},<br />{<br />"Name": "jdbc_url",<br />"Type": "str",<br />"Value": [<br />"jdbc:mysql://database.xxxx.us-west-2.rds.amazonaws.com:3306/db_schema"<br />],<br />"IsOptional": false<br />},<br />{<br />"Name": "db_creds",<br />"Type": "str",<br />"Value": [<br />"creds"<br />],<br />"IsOptional": false<br />},<br />{<br />"Name": "table_name",<br />"Type": "str",<br />"Value": [<br />"tables"<br />],<br />"IsOptional": false<br />}<br />]<br />}<br />}<br />}<br />}

The JSON object (ETL job DAG) represented in the CodeGenConfigurationNode is generated through a series of native and custom transforms with the respective input parameter arrays. This can be accomplished using Python JSON encoders that serialize the class objects to JSON and subsequently create the AWS Glue Studio visual editor job using the Boto3 library and AWS Glue API.
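
As a simplified illustration of that serialization step (the class and field names here are hypothetical, not the utility’s actual types), a custom json.JSONEncoder can turn node objects into the DynamicTransform shape shown in the preceding example:

import json
from dataclasses import asdict, dataclass, field, is_dataclass

@dataclass
class DynamicTransformNode:
    Name: str
    TransformName: str
    Inputs: list
    Parameters: list = field(default_factory=list)

class NodeEncoder(json.JSONEncoder):
    """Serialize node dataclasses into the Glue Studio DynamicTransform JSON shape."""
    def default(self, obj):
        if is_dataclass(obj):
            return {"DynamicTransform": asdict(obj)}
        return super().default(obj)

nodes = {
    "node-1679802581077": DynamicTransformNode(
        Name="DB Incremental Ingestion",
        TransformName="db_incremental",
        Inputs=["node-1679707801419"],
    )
}
print(json.dumps({"CodeGenConfigurationNodes": nodes}, cls=NodeEncoder, indent=2))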

Inputs required to configure the AWS Glue transforms are sourced from the EDLS jobs metadata database. The Python utility reads the metadata information, parses it, and configures the nodes automatically.

The order and sequencing of the nodes is sourced from the EDLS jobs metadata, with one node becoming the input to one or more downstream nodes building the DAG flow.

Benefits of the solution

The migration path will help BMS achieve their core objectives of decomposing their existing custom ETL framework to modular, visually configurable, less complex, and easily manageable pipelines using visual ETL components. The utility aids the migration of the legacy ETL pipelines to native AWS Glue Studio jobs in an automated and scalable manner.

With consistent out-of-the-box visual ETL transforms in the AWS Glue Studio interface, BMS will be able to build sophisticated data pipelines without having to write code.

The custom visual transforms will extend AWS Glue Studio capabilities and fulfill some of the BMS ETL requirements where the native transforms are missing that functionality. Custom transforms will help define, reuse, and share business-specific ETL logic among all the teams. The solution increases the consistency between teams and keeps the ETL pipelines up to date by minimizing duplicate effort and code.

With minor modifications, the migration utility can be reused to automate migration of pipelines during future AWS Glue version upgrades.

Conclusion

The successful outcome of this proof of concept has shown that migrating over 5,000 jobs from BMS’s custom application to native AWS services can deliver significant productivity gains and cost savings. By moving to AWS, BMS will be able to reduce the effort required to support AWS Glue, improve DevOps delivery, and save an estimated 58% on AWS Glue spend.

These results are very promising, and BMS is excited to embark on the next phase of the migration. We believe that this project will have a positive impact on BMS’s business and help them achieve their strategic goals.


About the authors

Sivaprasad Mahamkali is a Senior Streaming Data Engineer at AWS Professional Services. Siva leads customer engagements related to real-time streaming solutions, data lakes, and analytics using open-source and AWS services. Siva enjoys listening to music and loves to spend time with his family.

Dan Gibbar is a Senior Engagement Manager at AWS Professional Services. Dan leads healthcare and life science engagements collaborating with customers and partners to deliver outcomes. Dan enjoys the outdoors, attempting triathlons, music and spending time with family.

Shrinath Parikh is a Senior Cloud Data Architect with AWS. He works with customers around the globe to assist them with their data analytics, data lake, data lakehouse, serverless, governance, and NoSQL use cases. In Shrinath’s off time, he enjoys traveling, spending time with family, and learning/building new tools using cutting-edge technologies.

Ramesh Daddala is an Associate Director at BMS. Ramesh leads enterprise data engineering engagements related to Enterprise Data Lake Services (EDLS) and collaborates with data partners to deliver and support enterprise data engineering and ML capabilities. Ramesh enjoys the outdoors, traveling, and loves to spend time with family.

Jitendra Kumar Dash is a Senior Cloud Architect at BMS with expertise in hybrid cloud services, Infrastructure Engineering, DevOps, Data Engineering, and Data Analytics solutions. He is passionate about food, sports, and adventure.

Pavan Kumar Bijja is a Senior Data Engineer at BMS. Pavan enables data engineering and analytical services to BMS Commercial domain using enterprise capabilities. Pavan leads enterprise metadata capabilities at BMS. Pavan loves to spend time with his family, playing Badminton and Cricket.

Shovan Kanjilal is a Senior Data Lake Architect working with strategic accounts in AWS Professional Services. Shovan works with customers to design data and machine learning solutions on AWS.

An attendee’s guide to hybrid cloud and edge computing at AWS re:Invent 2023

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/an-attendees-guide-to-hybrid-cloud-and-edge-computing-at-aws-reinvent-2023/

This post is written by Savitha Swaminathan, AWS Sr. Product Marketing Manager

AWS re:Invent 2023 starts on Nov 27th in Las Vegas, Nevada. The event brings technology business leaders, AWS partners, developers, and IT practitioners together to learn about the latest innovations, meet AWS experts, and network among their peer attendees.

This year, AWS re:Invent will once again have a dedicated track for hybrid cloud and edge computing. The sessions in this track will feature the latest innovations from AWS to help you build and run applications securely in the cloud, on premises, and at the edge – wherever you need to. You will hear how AWS customers are using our cloud services to innovate on premises and at the edge. You will also be able to immerse yourself in hands-on experiences with AWS hybrid and edge services through innovative demos and workshops.

At re:Invent there are several session types, each designed to provide you with a way to learn in whichever format fits you best:

  • Innovation Talks provide a comprehensive overview of how AWS is working with customers to solve their most important problems.
  • Breakout sessions are lecture-style presentations focused on a topic or area of interest and are well liked by business leaders and IT practitioners alike.
  • Chalk talks deep dive on customer reference architectures and invite audience members to actively participate in the whiteboarding exercise.
  • Workshops and builder sessions, popular with developers and architects, provide the most hands-on experience, where attendees can build solutions with AWS experts in real time.

The hybrid edge track will include one leadership overview session and 15 other sessions (4 breakouts, 6 chalk talks, and 5 workshops). The sessions are organized around 4 key themes: Low latency, Data residency, Migration and modernization, and AWS at the far edge.

Hybrid Cloud & Edge Overview

HYB201 | AWS wherever you need it

Join Jan Hofmeyr, Vice President, Amazon EC2, in this leadership session where he presents a comprehensive overview of AWS hybrid cloud and edge computing services, and how we are helping customers innovate on AWS wherever they need it – from Regions, to metro centers, 5G networks, on premises, and at the far edge. Jun Shi, CEO and President of Accton, will also join Jan on stage to discuss how Accton enables smart manufacturing across its global manufacturing sites using AWS hybrid, IoT, and machine learning (ML) services.

Low latency

Many customer workloads require single-digit millisecond latencies for optimal performance. Customers in every industry are looking for ways to run these latency sensitive portions of their applications in the cloud while simplifying operations and optimizing for costs. You will hear about customer use cases and how AWS edge infrastructure is helping companies like Riot Games meet their application performance goals and innovate at the edge.

Breakout session

HYB305 | Delivering low-latency applications at the edge

Chalk talk

HYB308 | Architecting for low latency and performance at the edge with AWS

Workshops

HYB302 | Architecting and deploying applications at the edge

HYB303 | Deploying a low-latency computer vision application at the edge

Data residency

As cloud has become mainstream, governments and standards bodies continue to develop security, data protection, and privacy regulations. Having control over digital assets and meeting data residency regulations is becoming increasingly important for public sector customers and organizations operating in regulated industries. The data residency sessions deep dive into the challenges, solutions, and innovations that customers are addressing with AWS to meet their data residency requirements.

Breakout session

HYB309 | Navigating data residency and protecting sensitive data

Chalk talk

HYB307 | Architecting for data residency and data protection at the edge

Workshops

HYB301 | Addressing data residency requirements with AWS edge services

Migration and modernization

Migration and modernization in industries that have traditionally operated with on-premises infrastructure or self-managed data centers is helping customers achieve scale, flexibility, cost savings, and performance. We will dive into customer stories and real-world deployments, and share best practices for hybrid cloud migrations.

Breakout session

HYB203 | A migration strategy for edge and on-premises workloads

Chalk talk

HYB313 | Real-world analysis of successful hybrid cloud migrations

AWS at the far edge

Some customers operate in what we call the far edge: remote oil rigs, military and defense territories, and even space! In these sessions we cover customer use cases and explore how AWS brings cloud services to the far edge and helps customers gain the benefits of the cloud regardless of where they operate.

Breakout session

HYB306 | Bringing AWS to remote edge locations

Chalk talk

HYB312 | Deploying cloud-enabled applications starting at the edge

Workshops

HYB304 | Generative AI for robotics: Race for the best drone control assistant

In addition to the sessions across the 4 themes listed above, the track includes two additional chalk talks covering topics that are applicable more broadly to customers operating hybrid workloads. These chalk talks were chosen based on customer interest and will have repeat sessions, due to high customer demand.

HYB310 | Building highly available and fault-tolerant edge applications

HYB311 | AWS hybrid and edge networking architectures

Learn through interactive demos

In addition to breakout sessions, chalk talks, and workshops, make sure you check out our interactive demos to see the benefits of hybrid cloud and edge in action:

Drone Inspector: Generative AI at the Edge

Location: AWS Village | Venetian Level 2, Expo Hall, Booth 852 | AWS for Every App activation

Embark on a competitive adventure where generative artificial intelligence (AI) intersects with edge computing. Experience how drones can swiftly respond to chat instructions for a time-sensitive object detection mission. Learn how you can deploy foundation models and computer vision (CV) models at the edge using AWS hybrid and edge services for real-time insights and actions.

AWS Hybrid Cloud & Edge kiosk

Location: AWS Village | Venetian Level 2, Expo Hall, Booth 852 | Kiosk #9 & 10

Stop by and chat with our experts about AWS Local Zones, AWS Outposts, AWS Snow Family, AWS Wavelength, AWS Private 5G, AWS Telco Network Builder, and Integrated Private Wireless on AWS. Check out the hardware innovations inside an AWS Outposts rack up close and in person. Learn how you can set up a reliable private 5G network within days and live stream video content with minimal latency.

AWS Next Gen Infrastructure Experience

Location: AWS Village | Venetian Level 2, Expo Hall, Booth 852

Check out demos across Global Infrastructure, AWS for Hybrid Cloud & Edge, Compute, Storage, and Networking kiosks, share on social, and win prizes!

The Future of Connected Mobility

Location: Venetian Level 4, EBC Lounge, wall outside of Lando 4201B

Step into the driver’s seat and experience high fidelity 3D terrain driving simulation with AWS Local Zones. Gain real-time insights from vehicle telemetry with AWS IoT Greengrass running on AWS Snowcone and a broader set of AWS IoT services and Amazon Managed Grafana in the Region. Learn how to combine local data processing with cloud analytics for enhanced safety, performance, and operational efficiency. Explore how you can rapidly deliver the same experience to global users in 75+ countries with minimal application changes using AWS Outposts.

Immersive tourism experience powered by 5G and AR/VR

Location: Venetian, Level 2 | Expo Hall | Telco demo area

Explore and travel to Chichen Itza with an augmented reality (AR) application running on a private network fully built on AWS, which includes the Radio Access Network (RAN), the core, security, and applications, combined with services for deployment and operations. This demo features AWS Outposts.

AWS unplugged: A real time remote music collaboration session using 5G and MEC

Location: Venetian, Level 2 | Expo Hall | Telco demo area

We will demonstrate how musicians in Los Angeles and Las Vegas can collaborate in real time with AWS Wavelength. You will witness songwriters and musicians in Los Angeles and Las Vegas in a live jam session.

Disaster relief with AWS Snowball Edge and AWS Wickr

Location: AWS for National Security & Defense | Venetian, Casanova 606

The hurricane has passed, leaving you with no cell coverage and little chance of getting on the internet. You need to set up a situational awareness and communications network for your team, fast. Using Wickr on Snowball Edge Compute, you can rapidly deploy a platform that provides secure communications with rich collaboration functionality, as well as real-time situational awareness through the Wickr ATAK integration, allowing you to get on with what’s important.


We hope this guide to the Hybrid Cloud and Edge track at AWS re:Invent 2023 helps you plan for the event and we hope to see you there!

Approaches for migrating users to Amazon Cognito user pools

Post Syndicated from Edward Sun original https://aws.amazon.com/blogs/security/approaches-for-migrating-users-to-amazon-cognito-user-pools/

Update: An earlier version of this post was published on September 14, 2017, on the Front-End Web and Mobile Blog.


Amazon Cognito user pools offer a fully managed OpenID Connect (OIDC) identity provider so you can quickly add authentication and control access to your mobile app or web application. User pools scale to millions of users and add layers of additional features for security, identity federation, app integration, and customization of the user experience. Amazon Cognito is available in AWS Regions around the globe, processing over 100 billion authentications each month. You can take advantage of security features when using user pools in Cognito, such as email and phone number verification, multi-factor authentication, and advanced security features such as compromised credentials detection and adaptive authentication.

Many customers ask about the best way to migrate their existing users to Amazon Cognito user pools. In this blog post, we describe several different recommended approaches and provide step-by-step instructions on how to implement them.

Key considerations

The main consideration when migrating users across identity providers is maintaining a consistent end-user experience. Ideally, users can continue to use their existing passwords so that their experience is seamless. However, security best practices dictate that passwords should never be stored directly as cleartext in a user store. Instead, passwords are used to compute cryptographic hashes and verifiers that can later be used to verify submitted passwords. This means that you cannot securely export passwords in cleartext form from an existing user store and import them into a Cognito user pool. You might ask your users to choose a new password during the migration. Or, if you want to retain the existing passwords, you need to retain access to the existing hashes and verifiers, at least during the migration period.

A secondary consideration is the migration timeline. For example, do you need a faster migration timeline because your current identity store’s license is expiring? Or do you prefer a slow and steady migration because you are modernizing your current application, and it takes time to connect your existing systems to the new identity provider?

The following two methods define our recommended approaches for migrating existing users into a user pool:

  • Bulk user import – Export your existing users into a comma-separated (.csv) file, and then upload this .csv file to import users into a user pool. Your desired user attributes (except passwords) can be included and mapped to attributes in the target user pool. This approach requires users to reset their passwords when they sign in with Cognito. You can choose to migrate your existing user store entirely in a single import job or split users into multiple jobs for parallel or incremental processing.
  • Just-in-time user migration – Migrate users just in time into a Cognito user pool as they sign in to your mobile or web app. This approach allows users to retain their current passwords, because the migration process captures and verifies the password during the sign-in process, seamlessly migrating them to the Cognito user pool.

In the following sections, we describe the bulk user import and just-in-time user migration methods in more detail and then walk through the steps of each approach.

Bulk user import

You perform bulk import of users into an Amazon Cognito user pool by uploading a .csv file that contains user profile data, including usernames, email addresses, phone numbers, and other attributes. You can download a template .csv file for your user pool from Cognito, with a user schema structured in the template header.

Following is an example of performing bulk user import.

To create an import job

  1. Open the Cognito user pool console and select the target user pool for migration.
  2. On the Users tab, navigate to the Import users section, and choose Create import job.

    Figure 1: Create import job

  3. In the Create import job dialog box, download the template .csv file for user import.
  4. Export your existing user data from your existing user directory or user store into the .csv file.
  5. Match the user attribute types with the column headings in the template. Each user must have an email address or a phone number that is marked as verified in the .csv file, in order to receive the password reset confirmation code.

    Figure 2: Configure import job

  6. Go back to the Create import job dialog box (as shown in Figure 2) and do the following:
    1. Enter a Job name.
    2. Choose to Create a new IAM role or Use an existing IAM role. This role grants Amazon Cognito permission to write to Amazon CloudWatch Logs in your account, so that Cognito can provide logs for successful imports and errors for skipped or failed transactions.
    3. Upload the .csv file that you have prepared, and choose Create and start job.

Depending on the size of the .csv file, the job can run for minutes or hours, and you can follow the status from that same page in the Amazon Cognito console.

Figure 3: Check import job status

Cognito runs through the import job and imports users with a RESET_REQUIRED state. When users attempt to sign in, Cognito will return PasswordResetRequiredException from the sign-in API, and the app should direct the user into the ForgotPassword flow.
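
For example, a Python backend using Boto3 (with the USER_PASSWORD_AUTH flow enabled on the app client) could catch that exception and start the reset flow; this is a sketch, and the client ID is a placeholder:

import boto3
from botocore.exceptions import ClientError

cognito = boto3.client("cognito-idp")
CLIENT_ID = "YOUR_APP_CLIENT_ID"  # placeholder app client ID

def sign_in(username, password):
    try:
        return cognito.initiate_auth(
            ClientId=CLIENT_ID,
            AuthFlow="USER_PASSWORD_AUTH",
            AuthParameters={"USERNAME": username, "PASSWORD": password},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "PasswordResetRequiredException":
            # Imported users are in RESET_REQUIRED state; send them a confirmation code
            cognito.forgot_password(ClientId=CLIENT_ID, Username=username)
            return None
        raise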

Figure 4: View imported user

The bulk import approach can also be used continuously to incrementally import users. You can set up an Extract-Transform-Load (ETL) batch job process to extract incremental changes to your existing user directories, such as the new sign-ups on the existing systems before you switch over to a Cognito user pool. Your batch job will transform the changes into a .csv file to map user attribute schemas, and load the .csv file as a Cognito import job through the CreateUserImportJob CLI or SDK operation. Then start the import job through the StartUserImportJob CLI or SDK operation. For more information, see Importing users into user pools in the Amazon Cognito Developer Guide.
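
The following is a rough sketch of that load step using Boto3; the user pool ID, role ARN, and file name are placeholders, and the .csv file is uploaded to the pre-signed URL that Cognito returns for the job.

import boto3
import requests  # used only to PUT the .csv file to the pre-signed URL

cognito = boto3.client("cognito-idp")
USER_POOL_ID = "us-east-1_EXAMPLE"                                   # placeholder
LOGS_ROLE_ARN = "arn:aws:iam::111122223333:role/CognitoImportRole"   # placeholder

# Create the import job; the response includes a pre-signed URL for the upload
job = cognito.create_user_import_job(
    JobName="incremental-batch-001",
    UserPoolId=USER_POOL_ID,
    CloudWatchLogsRoleArn=LOGS_ROLE_ARN,
)["UserImportJob"]

# Upload the prepared .csv file produced by the ETL batch job
with open("users.csv", "rb") as f:
    requests.put(job["PreSignedUrl"], data=f,
                 headers={"x-amz-server-side-encryption": "aws:kms"})

# Start the job and check its status
cognito.start_user_import_job(UserPoolId=USER_POOL_ID, JobId=job["JobId"])
status = cognito.describe_user_import_job(UserPoolId=USER_POOL_ID,
                                          JobId=job["JobId"])["UserImportJob"]["Status"]
print(status)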

Just-in-time user migration

The just-in-time (JIT) user migration method involves first attempting to sign in the user through the Amazon Cognito user pool. Then, if the user doesn’t exist in the Cognito user pool, Cognito calls your Migrate User Lambda trigger and sends the username and password to the Lambda trigger to sign the user in through the existing user store. If successful, the Migrate User Lambda trigger will also fetch user attributes and return them to Cognito. Then Cognito silently creates the user in the user pool with user attributes, as well as salts and password verifiers from the user-provided password. With the Migrate User Lambda trigger, your client app can start to use the Cognito user pool to sign in users who have already been migrated, and continue migrating users who are signing in for the first time towards the user pool. This just-in-time migration approach helps to create a seamless authentication experience for your users.

Cognito, by default, uses the USER_SRP_AUTH authentication flow with the Secure Remote Password (SRP) protocol. This flow doesn’t involve sending the password across the network, but rather allows the client to exchange a cryptographic proof with the Cognito service to prove the client’s knowledge of the password. For JIT user migration, Cognito needs to verify the username and password against the existing user store. Therefore, you need to enable a different Cognito authentication flow. You can choose to use either the USER_PASSWORD_AUTH flow for client-side authentication or the ADMIN_USER_PASSWORD_AUTH flow for server-side authentication. This will allow the password to be sent to Cognito over an encrypted TLS connection, and allow Cognito to pass the information to the Lambda function to perform user authentication against the original user store.
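
As a server-side example (a sketch with placeholder pool and client IDs), the ADMIN_USER_PASSWORD_AUTH flow passes the password to Cognito over TLS, which in turn makes it available to the Migrate User Lambda trigger for users who don’t exist in the pool yet:

import boto3

cognito = boto3.client("cognito-idp")

# Server-side sign-in; the app client must have ALLOW_ADMIN_USER_PASSWORD_AUTH enabled
response = cognito.admin_initiate_auth(
    UserPoolId="us-east-1_EXAMPLE",     # placeholder
    ClientId="YOUR_APP_CLIENT_ID",      # placeholder
    AuthFlow="ADMIN_USER_PASSWORD_AUTH",
    AuthParameters={"USERNAME": "nikki_wolf", "PASSWORD": "ExamplePassword1$"},
)
print(response.get("AuthenticationResult", {}).get("AccessToken"))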

This JIT approach might not be compatible with existing identity providers that have multi-factor authentication (MFA) enabled, because the Lambda function cannot support multiple rounds of challenges. If the existing identity provider requires MFA, you might consider the alternative JIT migration approach discussed later in this blog post.

Figure 5 illustrates the steps for the JIT sign-in flow. The mobile or web app first tries to sign in the user in the user pool. If the user isn’t already in the user pool, Cognito handles user authentication and invokes the Migrate User Lambda trigger to migrate the user. This flow keeps the logic in the app simple and allows the app to use the Amazon Cognito SDK to sign in users in the standard way. The migration logic takes place in the Lambda function in the backend.

Figure 5: JIT migration user authentication flow

The flow in Figure 5 starts in the mobile or web app, which attempts to sign in the user by using the AWS SDK. If the user doesn’t exist in the user pool, the migration attempt starts. Cognito calls the Migrate User Lambda trigger with triggerSource set to UserMigration_Authentication, and passes the user’s username and password in the request in order to attempt to migrate the user.

This approach also works in the forgot password flow shown in Figure 6, where the user has forgotten their password and hasn’t been migrated yet. In this case, once the user makes a “Forgot Password” request, your mobile or web app will send a forgot password request to Cognito. Cognito invokes your Migrate User Lambda trigger with triggerSource set to UserMigration_ForgotPassword, and passes the username in the request in order to attempt user lookup, migrate the user profile, and facilitate the password reset process.

Figure 6: JIT migration forgot password flow

Just-in-time user migration sample code

In this section, we show sample source code for the overall structure of a Migrate User Lambda trigger. We will fill in the commented sections with additional code, shown later in the section. When you set up your own Lambda function, configure a Lambda execution role to grant permissions for CloudWatch Logs.

const handler = async (event) => {
    if (event.triggerSource == "UserMigration_Authentication") {
        //***********************************************************************
        // Attempt to sign in the user or verify the password with the existing identity store
        // (shown in Section A – Migrate User of this post)
        //***********************************************************************
    }
    else if (event.triggerSource == "UserMigration_ForgotPassword") {
        //***********************************************************************
        // Attempt to look up the user in your existing identity store
        // (shown in Section B – Forgot Password of this post)
        //***********************************************************************
    }
    return event;
};
export { handler };

In the migration flow, the Lambda trigger will sign in the user and verify the user’s password in the existing user store. That may involve a sign-in attempt against your existing user store or a check of the password against a stored hash. You need to customize this step based on your existing setup. You can also create a function to fetch user attributes that you want to migrate. If your existing user store conforms to the OIDC specification, you can parse the ID Token claims to retrieve the user’s attributes. The following example shows how to set the username and attributes for the migrated user.

// Section A – Migrate User
if (event.triggerSource == "UserMigration_Authentication") {
    // Attempt to sign in the user or verify the password with the existing user store.
    // Add an authenticateUser() function based on your existing user store setup.
    const user = await authenticateUser(event.userName, event.request.password);
    if (user) {
        // Migrating user attributes from the source user store. You can migrate additional attributes as needed.
        event.response.userAttributes = {
            // Setting username and email address
            username: event.userName,
            email: user.emailAddress,
            email_verified: "true",
        };
        // Setting user status to CONFIRMED to autoconfirm users so they can sign in to the user pool
        event.response.finalUserStatus = "CONFIRMED";
        // Setting messageAction to SUPPRESS to decline to send the welcome message that Cognito usually sends to new users
        event.response.messageAction = "SUPPRESS";
    }
}

The user, along with their attributes, is now migrated from the existing user store to the user pool. Users will also be redirected to your application with the authorization code or JSON Web Tokens, depending on the OAuth 2.0 grant types you configured in the user pool.

Let’s look at the forgot password flow. Your Lambda function calls the existing user store and migrates other attributes in the user’s profile first, and then Lambda sets user attributes in the response to the Cognito user pool. Cognito initiates the ForgotPassword flow and sends a confirmation code to the user to confirm the password reset process. The user needs to have a verified email address or phone number migrated from the existing user store to receive the forgot password confirmation code. The following sample code demonstrates how to complete the ForgotPassword flow.

// Section B – Forgot Password
else if (event.triggerSource == "UserMigration_ForgotPassword") {
    // Look up the user in your existing user store service.
    // Add a lookupUser() function based on your existing user store setup.
    const lookupResult = await lookupUser(event.userName);
    if (lookupResult) {
        // Setting user attributes from the source user store
        event.response.userAttributes = {
            username: event.userName,
            // Required to set verified communication to receive password recovery code
            email: lookupResult.emailAddress,
            email_verified: "true",
        };
        event.response.finalUserStatus = "RESET_REQUIRED";
        event.response.messageAction = "SUPPRESS";
    }
}

Just-in-time user migration – alternative approach

Using the Migrate User Lambda trigger, we showed the JIT migration approach where the app switches to use the Cognito user pool at the beginning of the migration period, to interface with the user for signing in and migrating them from the existing user store. An alternative JIT approach is to maintain the existing systems and user store, but to silently create each user in the Cognito user pool in a backend process as users sign in, then switch over to use Cognito after enough users have been migrated.

Figure 7: JIT migration alternative approach with backend process

Figure 7 shows this alternative approach in depth. When an end user signs in successfully in your mobile or web app, the backend migration process is initiated. This backend process first calls the Cognito admin API operation, AdminCreateUser, to create users and map user attributes in the destination user pool. The user will be created with a temporary password and be placed in FORCE_CHANGE_PASSWORD status. If you capture the user password during the sign-in process, you can also migrate the password by setting it permanently for the newly created user in the Cognito user pool using the AdminSetUserPassword API operation. This operation will also set the user status to CONFIRMED to allow the user to sign in to Cognito using the existing password.

Following is a code example for the AdminCreateUser function using the AWS SDK for JavaScript.

import { CognitoIdentityProviderClient, AdminCreateUserCommand } from "@aws-sdk/client-cognito-identity-provider";

var params = {
    MessageAction: "SUPPRESS",
    UserAttributes: [
        {
            Name: "name",
            Value: "Nikki Wolf"
        },
        {
            Name: "email",
            Value: "[email protected]"
        },
        {
            Name: "email_verified",
            Value: "True"
        }
    ],
    UserPoolId: "us-east-1_EXAMPLE",
    Username: "nikki_wolf"
};
const cognito = new CognitoIdentityProviderClient();
const createUserCommand = new AdminCreateUserCommand(params);
await cognito.send(createUserCommand);

The following is a code example for the AdminSetUserPassword function.

import { CognitoIdentityProviderClient, AdminSetUserPasswordCommand } from "@aws-sdk/client-cognito-identity-provider";

var params = {
    UserPoolId: 'us-east-1_EXAMPLE',
    Username: 'nikki_wolf',
    Password: 'ExamplePassword1$',
    Permanent: true
};
const cognito = new CognitoIdentityProviderClient();
const setUserPasswordCommand = new AdminSetUserPasswordCommand(params);
await cognito.send(setUserPasswordCommand);

This alternative approach does not require the app to update its authentication codebase until a majority of users are migrated, but you need to propagate user attribute changes and new user signups from the existing systems to Cognito. If you are capturing and migrating passwords, you should also build a similar logic to capture password changes in existing systems and set the new password in the user pool to keep it synchronized until you perform a full switchover from the existing identity store to the Cognito user pool.

Summary and best practices

In this post, we described our two recommended approaches for migrating users into an Amazon Cognito user pool. You can decide which approach is best suited for your use case. The bulk method is simpler to implement, but it doesn’t preserve user passwords like the just-in-time migration does. The just-in-time migration is transparent to users and mitigates the potential attrition of users that can occur when users need to reset their passwords.

You could also consider a hybrid approach, where you first apply JIT migration as users are actively signing in to your app, and perform bulk import for the remaining less-active users. This hybrid approach helps provide a good experience for your active user communities, while being able to decommission existing user stores in a manageable timeline because you don’t need to wait for every user to sign in and be migrated through JIT migration.

We hope you can use these explanations and code samples to set up the most suitable approach for your migration project.


Edward Sun

Edward is a Security Specialist Solutions Architect focused on identity and access management. He loves helping customers throughout their cloud transformation journey with architecture design, security best practices, migration, and cost optimizations. Outside of work, Edward enjoys hiking, golfing, and cheering for his alma mater, the Georgia Bulldogs.

Migrate Microsoft Azure Synapse Analytics to Amazon Redshift using AWS SCT

Post Syndicated from Ahmed Shehata original https://aws.amazon.com/blogs/big-data/migrate-microsoft-azure-synapse-analytics-to-amazon-redshift-using-aws-sct/

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. With Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers, such as the following:

  • Amazon Redshift seamlessly integrates with broader data, analytics, and AI or machine learning (ML) services on AWS, enabling you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. With Amazon Redshift, you can build lake house architectures and perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning, and more.
  • You don’t need to worry about workloads such as ETL (extract, transform, and load), dashboards, ad-hoc queries, and so on interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
  • When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get 1 hour of free concurrency scaling capacity for 24 hours of usage. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
  • Amazon Redshift is straightforward to use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
  • Fault tolerance is built in. All data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Any hardware failures are automatically replaced.
  • Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, serverless web services or event-driven applications. You can also use your favorite business intelligence (BI) and SQL tools to access, analyze, and visualize data in Amazon Redshift.
  • Amazon Redshift ML makes it straightforward for data scientists to create, train, and deploy ML models using familiar SQL. You can also run predictions using SQL.
  • Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row-level and column-level security controls on sensitive data, and more.

In this post, we show how to migrate a data warehouse from Microsoft Azure Synapse to Redshift Serverless using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT makes heterogeneous database migrations predictable by automatically converting the source database code and storage objects to a format compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. AWS SCT can also scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your Azure Synapse Analytics. First, we create a Redshift database into which Azure Synapse data will be migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert Azure Synapse schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from Azure Synapse, upload it into an S3 bucket, and copy it to Amazon Redshift.

The following diagram illustrates our solution architecture.

This walkthrough covers the following steps:

  1. Create a Redshift Serverless data warehouse.
  2. Create the S3 bucket and folder.
  3. Convert and apply the Azure Synapse schema to Amazon Redshift using AWS SCT:
    1. Connect to the Azure Synapse source.
    2. Connect to the Amazon Redshift target.
    3. Convert the Azure Synapse schema to a Redshift database.
    4. Analyze the assessment report and address the action items.
    5. Apply the converted schema to the target Redshift database.
  4. Migrate data from Azure Synapse to Amazon Redshift using AWS SCT data extraction agents:
    1. Generate trust and key stores (this step is optional).
    2. Install and configure the data extraction agent.
    3. Start the data extraction agent.
    4. Register the data extraction agent.
    5. Add virtual partitions for large tables (this step is optional).
    6. Create a local data migration task.
    7. Start the local data migration task.
  5. View data in Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

Create a Redshift Serverless data warehouse

In this step, we create a Redshift Serverless data warehouse with a workgroup and namespace. A workgroup is a collection of compute resources and a namespace is a collection of database objects and users. To isolate workloads and manage different resources in Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create a Redshift Serverless data warehouse with a workgroup and namespace:

  1. On the Amazon Redshift console, choose the AWS Region that you want to use.
  2. In the navigation pane, choose Redshift Serverless.
  3. Choose Create workgroup.
  4. For Workgroup name, enter a name that describes the compute resources.
  5. Verify that the VPC is the same as the VPC of the EC2 instance that runs AWS SCT.
  6. Choose Next.
  7. For Namespace, enter a name that describes your dataset.
  8. In the Database name and password section, select Customize admin user credentials.
  9. For Admin user name, enter a user name of your choice (for example, awsuser).
  10. For Admin user password, enter a password of your choice (for example, MyRedShiftPW2022).
  11. Choose Next.

Note that data in the Redshift Serverless namespace is encrypted by default.

  12. In the Review and Create section, choose Create.

Now you create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace. Note that there can only be one default IAM role.

  1. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the namespace you just created.
  2. On the Security and encryption tab, in the Permissions section, choose Manage IAM roles.
  3. Choose Manage IAM roles and choose Create IAM role.
  4. In the Specify an Amazon S3 bucket for the IAM role to access section, choose one of the following methods:
    1. Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with names containing the word redshift.
    2. Choose Any Amazon S3 bucket to allow the created IAM role to access all S3 buckets.
    3. Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
  5. Choose Create IAM role as default.
  6. Capture the endpoint for the Redshift Serverless workgroup you just created.
  7. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the workgroup you just created.
  8. In the General information section, copy the endpoint.
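
If you prefer to script this setup instead of clicking through the console, the same namespace and workgroup can be created with Boto3. The following is a minimal sketch; the names, credentials, and base capacity are placeholders, and production credentials should come from AWS Secrets Manager rather than being hardcoded.

import boto3

rs = boto3.client("redshift-serverless")

# Namespace: database objects and admin credentials
rs.create_namespace(
    namespaceName="synapse-migration-ns",
    dbName="dev",
    adminUsername="awsuser",
    adminUserPassword="MyRedShiftPW2022",  # placeholder; use Secrets Manager in practice
)

# Workgroup: compute resources attached to the namespace
rs.create_workgroup(
    workgroupName="synapse-migration-wg",
    namespaceName="synapse-migration-ns",
    baseCapacity=32,
    publiclyAccessible=False,
)

# Creation is asynchronous; the endpoint appears once the workgroup is available
endpoint = rs.get_workgroup(workgroupName="synapse-migration-wg")["workgroup"].get("endpoint")
print(endpoint)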

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create an S3 bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. For Bucket name, enter a unique DNS-compliant name for your bucket (for example, uniquename-as-rs).

For more information about bucket names, refer to Bucket naming rules.

  4. For AWS Region, choose the Region in which you created the Redshift Serverless workgroup.
  5. Choose Create bucket.

  6. Choose Buckets in the navigation pane and navigate to the S3 bucket you just created (uniquename-as-rs).
  7. Choose Create folder.
  8. For Folder name, enter incoming.
  9. Choose Create folder.
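
Equivalently, the bucket and folder can be created with a short Boto3 snippet; the bucket name and Region below are placeholders, and the Region must match the one where you created the Redshift Serverless workgroup.

import boto3

region = "us-west-2"         # placeholder; same Region as the workgroup
bucket = "uniquename-as-rs"  # must be globally unique

s3 = boto3.client("s3", region_name=region)
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": region},  # omit this argument in us-east-1
)
s3.put_object(Bucket=bucket, Key="incoming/")  # zero-byte key that acts as the folder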

Convert and apply the Azure Synapse schema to Amazon Redshift using AWS SCT

To convert the Azure Synapse schema to Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that you created previously and launch AWS SCT.

Connect to the Azure Synapse source

Complete the following steps to connect to the Azure Synapse source:

  1. On the File menu, choose Create New Project.
  2. Choose a location to store your project files and data.
  3. Provide a meaningful but memorable name for your project (for example, Azure Synapse to Amazon Redshift).
  4. To connect to the Azure Synapse source data warehouse, choose Add source.
  5. Choose Azure Synapse and choose Next.
  6. For Connection name, enter a name (for example, olap-azure-synapse).

AWS SCT displays this name in the object tree in the left pane.

  7. For Server name, enter your Azure Synapse server name.
  8. For SQL pool, enter your Azure Synapse pool name.
  9. Enter a user name and password.
  10. Choose Test connection to verify that AWS SCT can connect to your source Azure Synapse project.
  11. When the connection is successfully validated, choose Ok and Connect.

Connect to the Amazon Redshift target

Follow these steps to connect to Amazon Redshift:

  1. In AWS SCT, choose Add target.
  2. Choose Amazon Redshift, then choose Next.
  3. For Connection name, enter a name to describe the Amazon Redshift connection.

AWS SCT displays this name in the object tree in the right pane.

  4. For Server name, enter the Redshift Serverless workgroup endpoint you captured earlier.
  5. For Server port, enter 5439.
  6. For Database, enter dev.
  7. For User name, enter the user name you chose when creating the Redshift Serverless workgroup.
  8. For Password, enter the password you chose when creating the Redshift Serverless workgroup.
  9. Deselect Use AWS Glue.
  10. Choose Test connection to verify that AWS SCT can connect to your target Redshift workgroup.
  11. When the test is successful, choose OK.
  12. Choose Connect to connect to the Amazon Redshift target.

Alternatively, you can use connection values that are stored in AWS Secrets Manager.

Convert the Azure Synapse schema to a Redshift data warehouse

After you create the source and target connections, you will see the source Azure Synapse object tree in the left pane and the target Amazon Redshift object tree in the right pane. We then create mapping rules to describe the source target pair for the Azure Synapse to Amazon Redshift migration.

Follow these steps to convert the Azure Synapse dataset to Amazon Redshift format:

  1. In the left pane, choose (right-click) the schema you want to convert.
  2. Choose Convert schema.
  3. In the dialog box, choose Yes.

When the conversion is complete, you will see a new schema created in the Amazon Redshift pane (right pane) with the same name as your Azure Synapse schema.

The sample schema we used has three tables; you can see these objects in Amazon Redshift format in the right pane. AWS SCT converts all the Azure Synapse code and data objects to Amazon Redshift format. You can also use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects, but some objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from Main view to Assessment Report view as shown in the following screenshot.

The Summary tab shows objects that were converted automatically and objects that were not converted automatically. Green represents automatically converted objects or objects with simple action items. Blue represents medium and complex action items that require manual intervention.

The Action items tab shows the recommended actions for each conversion issue. If you choose an action item from the list, AWS SCT highlights the object that the action item applies to.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database and schema show you the effort required to design and implement the recommendations for converting action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. AWS SCT completes some actions automatically while converting the schema to Amazon Redshift; objects with such actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it in the right pane, and you can also edit it as needed. In the following example, AWS SCT modifies the ID column data type from decimal(3,0) in Azure Synapse to the smallint data type in Amazon Redshift.

Apply the converted schema to the target Redshift data warehouse

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and choose Apply to database.

Migrate data from Azure Synapse to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this section, we configure AWS SCT extraction agents to extract data from Azure Synapse and migrate to Amazon Redshift. For this post, we install the AWS SCT extraction agent on the same Windows instance that has AWS SCT installed. For better performance, we recommend that you use a separate Linux instance to install extraction agents if possible. For very large datasets, AWS SCT supports the use of multiple data extraction agents running on several instances to maximize throughput and increase the speed of data migration.

Generate trust and key stores (optional)

You can use Secure Sockets Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all data passed between the applications remains private and its integrity is protected. To use SSL communication, you need to generate trust and key stores using AWS SCT. You can skip this step if you don't want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

  1. In AWS SCT, choose Settings, Global settings, and Security.
  2. Choose Generate trust and key store.

  3. Enter a name and password for the trust and key stores.
  4. Enter a location to store them.
  5. Choose Generate, then choose OK.

Install and configure the data extraction agent

In the installation package for AWS SCT, you can find a subfolder called agents (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

  1. For Service port, enter the port number the agent listens on. It is 8192 by default.
  2. For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data.

The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.

  3. For Enter Redshift JDBC driver file or files, enter the location where you downloaded the Redshift JDBC drivers.
  4. For Add the Amazon Redshift driver, enter YES.
  5. For Enable SSL communication, enter Yes. Enter No here if you don't want to use SSL.
  6. Choose Next.

  7. For Trust store path, enter the storage location you specified when creating the trust and key store.
  8. For Trust store password, enter the password for the trust store.
  9. For Enable client SSL authentication, enter Yes.
  10. For Key store path, enter the storage location you specified when creating the trust and key store.
  11. For Key store password, enter the password for the key store.
  12. Choose Next.

Start the data extraction agent

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, use C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

On the computer that has the extraction agent installed, open a command prompt or terminal window and run the command for your operating system. To stop an agent, run the same command but replace Start with Stop in the file name. To restart an agent, run the RestartAgent.bat file.

You need administrator access to run these commands.

Register the data extraction agent

Follow these steps to register the data extraction agent:

  1. In AWS SCT, switch to the Data Migration view and choose Register.
  2. Select Redshift data agent, then choose OK.

  3. For Description, enter a name to identify the agent.
  4. For Host name, if you installed the extraction agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT extraction agent is installed. It is recommended to install extraction agents on Linux for better performance.
  5. For Port, enter the number you used for the listening port (default 8192) when installing the AWS SCT extraction agent.
  6. Select Use SSL to encrypt AWS SCT connection to Data Extraction Agent.

  7. If you're using SSL, navigate to the SSL tab.
  8. For Trust store, choose the trust store you created earlier.
  9. For Key store, choose the key store you created earlier.
  10. Choose Test connection.
  11. After the connection is validated successfully, choose OK and Register.

Create a local data migration task

To migrate data from Azure Synapse Analytics to Amazon Redshift, you create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local data migration task:

  1. In AWS SCT, under the schema name in the left pane, choose (right-click) the table you want to migrate (for this post, we use the table tbl_currency).
  2. Choose Create Local task.

  3. Choose from the following migration modes:
    1. Extract the source data and store it on a local PC or virtual machine where the agent runs.
    2. Extract the data and upload it to an S3 bucket.
    3. Extract the data, upload it to Amazon S3, and copy it into Amazon Redshift. (We choose this option for this post.)

  4. On the Advanced tab, provide the extraction and copy settings.

  5. On the Source server tab, make sure you are using the current connection properties.

  6. On the Amazon S3 settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket you created earlier.

The AWS SCT data extraction agent uploads the data in those S3 buckets and folders before copying it to Amazon Redshift.

  7. Choose Test Task.

  8. When the task is successfully validated, choose OK, then choose Create.

Start the local data migration task

To start the task, choose Start or Restart on the Tasks tab.

First, the data extraction agent extracts data from Azure Synapse. Then the agent uploads data to Amazon S3 and launches a copy command to move the data to Amazon Redshift.

At this point, AWS SCT has successfully migrated data from the source Azure Synapse table to the Redshift table.

View data in Amazon Redshift

After the data migration task is complete, you can connect to Amazon Redshift and validate the data. Complete the following steps:

  1. On the Amazon Redshift console, navigate to the Query Editor v2.
  2. Open the Redshift Serverless workgroup you created.
  3. Choose Query data.

  4. For Database, enter the database name (for example, dev).
  5. For Authentication, select Federated user.
  6. Choose Create connection.

  7. Open a new editor by choosing the plus sign.
  8. In the editor, write a query that selects from the schema and table or view you want to verify.

You can explore the data, run ad-hoc queries, and make visualizations, charts, and views.
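
You can also validate the migrated data without the console by using the Amazon Redshift Data API. The following sketch counts the rows in the sample table migrated earlier; the workgroup and schema names are placeholders.

# Submit a validation query against the Redshift Serverless workgroup
STATEMENT_ID=$(aws redshift-data execute-statement \
    --workgroup-name migration-workgroup \
    --database dev \
    --sql "SELECT COUNT(*) FROM myschema.tbl_currency;" \
    --query 'Id' --output text)

# Check progress, then fetch the result once the statement has finished
aws redshift-data describe-statement --id "$STATEMENT_ID"
aws redshift-data get-statement-result --id "$STATEMENT_ID"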

The following screenshot is the view of the source Azure Synapse dataset we used in this post.

Clean up

Follow the steps in this section to clean up any AWS resources you created as part of this post.

Terminate the EC2 instance

Follow these steps to terminate the EC2 instance:

  1. On the Amazon EC2 console, in the navigation pane, choose Instances.
  2. Select the instance you created.
  3. Choose Instance state, then choose Terminate instance.
  4. Choose Terminate when prompted for confirmation.

Delete the Redshift Serverless workgroup and namespace

Follow these steps to delete the Redshift Serverless workgroup and namespace:

  1. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the workgroup you created.
  2. On the Actions menu, choose Delete workgroup.
  3. Select Delete the associated namespace.
  4. Deselect Create final snapshot.
  5. Enter delete in the confirmation text box and choose Delete.

Delete the S3 bucket

Follow these steps to delete the S3 bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose the bucket you created.
  3. Choose Delete.
  4. To confirm deletion, enter the name of the bucket.
  5. Choose Delete bucket.
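
If you prefer to script the cleanup, the following AWS CLI sketch mirrors the console steps in this section. The instance ID, workgroup, namespace, and bucket names are placeholders.

# Terminate the EC2 instance that hosted AWS SCT
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

# Delete the Redshift Serverless workgroup, then the namespace once the workgroup is gone
aws redshift-serverless delete-workgroup --workgroup-name migration-workgroup
aws redshift-serverless delete-namespace --namespace-name migration-namespace

# Empty and delete the S3 staging bucket
aws s3 rb s3://uniquename-as-rs --force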

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. This post discussed how a data migration task extracts, downloads, and migrates data from Azure Synapse to Amazon Redshift. The solution we presented performs a one-time migration of database objects and data. Data changes made in Azure Synapse while the migration is in progress won't be reflected in Amazon Redshift, so put the ETL jobs that load Azure Synapse on hold during the migration, or rerun them against Amazon Redshift after the migration completes. Also consider following the best practices for AWS SCT.

To get started, download and install AWS SCT, sign in to the AWS Management Console, check out Redshift Serverless, and start migrating!


About the Authors

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytic solutions.

Jagadish Kumar is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and enabling customers to achieve success with their large-scale data projects.

Migrating your secrets to AWS Secrets Manager, Part 2: Implementation

Post Syndicated from Adesh Gairola original https://aws.amazon.com/blogs/security/migrating-your-secrets-to-aws-secrets-manager-part-2-implementation/

In Part 1 of this series, we provided guidance on how to discover and classify secrets and design a migration solution for customers who plan to migrate secrets to AWS Secrets Manager. We also mentioned steps that you can take to enable preventative and detective controls for Secrets Manager. In this post, we discuss how teams should approach the next phase, which is implementing the migration of secrets to Secrets Manager. We also provide a sample solution to demonstrate migration.

Implement secrets migration

Application teams lead the effort to design the migration strategy for their application secrets. Once you’ve made the decision to migrate your secrets to Secrets Manager, there are two potential options for migration implementation. One option is to move the application to AWS in its current state and then modify the application source code to retrieve secrets from Secrets Manager. Another option is to update the on-premises application to use Secrets Manager for retrieving secrets. You can use features such as AWS Identity and Access Management (IAM) Roles Anywhere to make the application communicate with Secrets Manager even before the migration, which can simplify the migration phase.

If the application code contains hardcoded secrets, the code should be updated so that it references Secrets Manager. A good interim state would be to pass these secrets as environment variables to your application. Using environment variables helps in decoupling the secrets retrieval logic from the application code and allows for a smooth cutover and rollback (if required).
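
For example, a small wrapper script can resolve the secret at startup and export it as an environment variable, so the application code itself does not change at cutover. This is a minimal sketch that assumes a hypothetical secret named myapp/prod/db with a password key, and jq installed on the host.

# Fetch the secret from Secrets Manager and expose it to the application as an environment variable
export DB_PASSWORD=$(aws secretsmanager get-secret-value \
    --secret-id myapp/prod/db \
    --query 'SecretString' --output text | jq -r '.password')

# Start the application with the secret available in its environment
exec ./start-my-application.sh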

Cutover to Secrets Manager should be done in a maintenance window. This minimizes downtime and impacts to production.

Before you perform the cutover procedure, verify the following:

  • Application components can access Secrets Manager APIs. Based on your environment, this connectivity might be provisioned through interface virtual private cloud (VPC) endpoints or over the internet.
  • Secrets exist in Secrets Manager and have the correct tags. This is important if you are using attribute-based access control (ABAC).
  • Applications that integrate with Secrets Manager have the required IAM permissions.
  • Have a well-documented cutover and rollback plan that contains the changes that will be made to the application during cutover. These would include steps like updating the code to use environment variables and updating the application to use IAM roles or instance profiles (for apps that are being migrated to Amazon Elastic Compute Cloud (Amazon EC2)).

After the cutover, verify that Secrets Manager integration was successful. You can use AWS CloudTrail to confirm that application components are using Secrets Manager.
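
For example, you can query CloudTrail for recent GetSecretValue calls and confirm that they come from the expected application roles. The following is an illustrative sketch; adjust the event name and filters to your needs.

# List recent GetSecretValue calls recorded by CloudTrail in the current Region
aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=GetSecretValue \
    --max-results 20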

We recommend that you further optimize your integration by enabling automatic secrets rotation. If your secrets were previously widely accessible (for example, they were stored in your Git repositories), we recommend rotating them as soon as possible after migrating.

Sample application to demo integration with Secrets Manager

In the next sections, we present a sample AWS Cloud Development Kit (AWS CDK) solution that demonstrates the implementation of the previously discussed guardrails, design, and migration strategy. You can use the sample solution as a starting point and expand upon it. It includes components that environment teams can deploy to provide governed, secure access for application teams as they migrate their secrets to Secrets Manager. The solution uses ABAC, a tagging scheme, and IAM Roles Anywhere to demonstrate regulated access to secrets for application teams. Additionally, the solution contains client-side utilities to assist application and migration teams in updating secrets. Teams with on-premises applications that are seeking integration with Secrets Manager before migration can use the client-side utility for access through IAM Roles Anywhere.

The sample solution is hosted on the aws-secrets-manager-abac-authorization-samples GitHub repository and is made up of the following components:

  • A common environment infrastructure stack (created and owned by environment teams). This stack provisions the following resources:
    • A sample VPC created with Amazon Virtual Private Cloud (Amazon VPC), with PUBLIC, PRIVATE_WITH_NAT, and PRIVATE_ISOLATED subnet types.
    • VPC endpoints for the AWS Key Management Service (AWS KMS) and Secrets Manager services to the sample VPC. The use of VPC endpoints means that calls to AWS KMS and Secrets Manager are not made over the internet and remain internal to the AWS backbone network.
    • An empty shell secret tagged with the supplied attributes, and an IAM managed policy that uses attribute-based access control conditions. This means that the secret is managed in code, but the actual secret value is not visible in version control systems like GitHub or in AWS CloudFormation parameter inputs.
  • An IAM Roles Anywhere infrastructure stack (created and owned by environment teams). This stack provisions the following resources:
    • An AWS Certificate Manager Private Certificate Authority (AWS Private CA).
    • An IAM Roles Anywhere public key infrastructure (PKI) trust anchor that uses AWS Private CA.
    • An IAM role for the on-premises application that uses the common environment infrastructure stack.
    • An IAM Roles Anywhere profile.

    Note: You can choose to use your existing CAs as trust anchors. If you do not have a CA, the stack described here provisions a PKI for you. IAM Roles Anywhere allows migration teams to use Secrets Manager before the application is moved to the cloud. Post migration, you could consider updating the applications to use native IAM integration (like instance profiles for EC2 instances) and revoking IAM Roles Anywhere credentials.

  • A client-side utility (primarily used by application or migration teams). This is a shell script that does the following:
    • Assists in provisioning a certificate by using OpenSSL.
    • Uses aws_signing_helper (Credential Helper) to set up AWS CLI profiles by using the credential_process for IAM Roles Anywhere.
    • Assists application teams to access and update their application secrets after assuming an IAM role by using IAM Roles Anywhere.
  • A sample application stack (created and owned by the application/migration team). This is a sample serverless application that demonstrates the use of the solution. It deploys the following components, which indicate that your ABAC-based IAM strategy is working as expected and is effectively restricting access to secrets:
    • The sample application stack uses the VPC deployed by the common environment infrastructure stack.
    • It deploys an Amazon Aurora MySQL serverless cluster in the PRIVATE_ISOLATED subnet and uses the secret that is created through a common environment infrastructure stack.
    • It deploys a sample Lambda function in the PRIVATE_WITH_NAT subnet.
    • It deploys two IAM roles for testing:
      • allowedRole (default role): When the application uses this role, it is able to use the GET action to get the secret and open a connection to the Aurora MySQL database.
      • Not allowedRole: When the application uses this role, it is unable to use the GET action to get the secret and open a connection to the Aurora MySQL database.

Prerequisites to deploy the sample solution

The following software packages need to be installed in your development environment before you deploy this solution:

Note: In this section, we provide examples of AWS CLI commands and configuration for Linux or macOS operating systems. For instructions on using AWS CLI on Windows, refer to the AWS CLI documentation.

Before deployment, make sure that the correct AWS credentials are configured in your terminal session. The credentials can be either in the environment variables or in ~/.aws. For more details, see Configuring the AWS CLI.

Next, use the following commands to set your AWS credentials to deploy the stack:

export AWS_ACCESS_KEY_ID=<>
export AWS_SECRET_ACCESS_KEY=<>
export AWS_REGION=<>

You can view the IAM credentials that are being used by your session by running the command aws sts get-caller-identity. If you are running the cdk command for the first time in your AWS account, you will need to run the following cdk bootstrap command to provision a CDK Toolkit stack that will manage the resources necessary to enable deployment of cloud applications with the AWS CDK.

cdk bootstrap aws://<AWS account number>/<Region> # Bootstrap CDK in the specified account and AWS Region

Select the applicable archetype and deploy the solution

This section outlines the design and deployment steps for two archetypes:

Archetype 1: Application is currently on premises

Archetype 1 has the following requirements:

  • The application is currently hosted on premises.
  • The application would consume API keys, stored credentials, and other secrets in Secrets Manager.

The application, environment, and security teams work together to define a tagging strategy that will be used to restrict access to secrets. After this, the proposed workflow for each persona is as follows:

  1. The environment engineer deploys a common environment infrastructure stack (as described earlier in this post) to bootstrap the AWS account with secrets and IAM policy by using the supplied tagging requirement.
  2. Additionally, the environment engineer deploys the IAM Roles Anywhere infrastructure stack.
  3. The application developer updates the secrets required by the application by using the client-side utility (helper.sh).
  4. The application developer uses the client-side utility to update the AWS CLI profile to consume the IAM Roles Anywhere role from the on-premises servers.

    Figure 1 shows the workflow for Archetype 1.

    Figure 1: Application on premises connecting to Secrets Manager

To deploy Archetype 1

  1. (Actions by the application team persona) Clone the repository and update the tagging details at configs/tagconfig.json.

    Note: Do not modify the tag/attributes name/key, only modify value.

  2. (Actions by the environment team persona) Run the following command to deploy the common environment infrastructure stack.
    ./helper.sh prepare
    Then, run the following command to deploy the IAM Roles Anywhere infrastructure stack.
    ./helper.sh on-prem
  3. (Actions by the application team persona) Update the secret value of the dummy secrets provided by the environment team, by using the following command.
    ./helper.sh update-secret

    Note: This command will only update the secret if it’s still using the dummy value.

    Then, run the following command to set up the client and server on premises.
    ./helper.sh client-profile-setup

    Follow the command prompt. It will help you request a client certificate and update the AWS CLI profile.

    Important: When you request a client certificate, make sure to supply at least one distinguished name, like CommonName.

The sample output should look like the following.


--> This role can be used by the application by using the AWS CLI profile 'developer'.
--> For instance, the following output illustrates how to access secret values by using the AWS CLI profile 'developer'.
--> Sample AWS CLI: aws secretsmanager get-secret-value --secret-id $SECRET_ARN --profile developer

At this point, the client-side utility (helper.sh client-profile-setup) should have updated the AWS CLI configuration file with the following profile.

[profile developer]
region = <aws-region>
credential_process = /Users/<local-laptop-user>/.aws/aws_signing_helper credential-process
    --certificate /Users/<local-laptop-user>/.aws/client_cert.pem
    --private-key /Users/<local-laptop-user>/.aws/my_private_key.clear.key
    --trust-anchor-arn arn:aws:rolesanywhere:<aws-region>:444455556666:trust-anchor/a1b2c3d4-5678-90ab-cdef-EXAMPLE11111
    --profile-arn arn:aws:rolesanywhere:<aws-region>:444455556666:profile/a1b2c3d4-5678-90ab-cdef-EXAMPLE22222
    --role-arn arn:aws:iam::444455556666:role/RolesanywhereabacStack-onPremAppRole-1234567890ABC

To test Archetype 1 deployment

  • The application team can verify that the AWS CLI profile has been properly set up and is capable of retrieving secrets from Secrets Manager by running the following client-side utility command.
    ./helper.sh on-prem-test

This client-side utility (helper.sh) command verifies that the AWS CLI profile (for example, developer) has been set up for IAM Roles Anywhere and can run the GetSecretValue API action to retrieve the value of the secret stored in Secrets Manager.

The sample output should look like the following.

--> Checking credentials ...
{
    "UserId": "AKIAIOSFODNN7EXAMPLE:EXAMPLE11111EXAMPLEEXAMPLE111111",
    "Account": "444455556666",
    "Arn": "arn:aws:sts::444455556666:assumed-role/RolesanywhereabacStack-onPremAppRole-1234567890ABC"
}
--> Assume role worked for:
arn:aws:sts::444455556666:assumed-role/RolesanywhereabacStack-onPremAppRole-1234567890ABC
--> This role can be used by the application by using the AWS CLI profile 'developer'.
--> For instance, the following output illustrates how to access secret values by using the AWS CLI profile 'developer'.
--> Sample AWS CLI: aws secretsmanager get-secret-value --secret-id $SECRET_ARN --profile $PROFILE_NAME
-------Output-------
{
  "password": "randomuniquepassword",
  "servertype": "testserver1",
  "username": "testuser1"
}
-------Output-------

Archetype 2: Application has migrated to AWS

Archetype 2 has the following requirement:

  • Deploy a sample application to demonstrate how ABAC authorization works for Secrets Manager APIs.

The application, environment, and security teams work together to define a tagging strategy that will be used to restrict access to secrets. After this, the proposed workflow for each persona is as follows:

  1. The environment engineer deploys a common environment infrastructure stack to bootstrap the AWS account with secrets and an IAM policy by using the supplied tagging requirement.
  2. The application developer updates the secrets required by the application by using the client-side utility (helper.sh).
  3. The application developer tests the sample application to confirm operability of ABAC.

Figure 2 shows the workflow for Archetype 2.

Figure 2: Sample migrated application connecting to Secrets Manager

To deploy Archetype 2

  1. (Actions by the application team persona) Clone the repository and update the tagging details at configs/tagconfig.json.

    Note: Don’t modify the tag/attributes name/key, only modify value.

  2. (Actions by the environment team persona) Run the following command to deploy the common platform infrastructure stack.
    ./helper.sh prepare
  3. (Actions by the application team persona) Update the secret value of the dummy secrets provided by the environment team, using the following command.
    ./helper.sh update-secret

    Note: This command will only update the secret if it is still using the dummy value.

    Then, run the following command to deploy a sample app stack.
    ./helper.sh on-aws

    Note: If your secrets were migrated from a system that did not have the correct access controls, as a best security practice, you should rotate them at least once manually.

At this point, the client-side utility should have deployed a sample application Lambda function. This function connects to a MySQL database by using credentials stored in Secrets Manager. It retrieves the secret values, validates them, and establishes a connection to the database. The function returns a message that indicates whether the connection to the database is working or not.

To test Archetype 2 deployment

  • The application team can use the following client-side utility (helper.sh) to invoke the Lambda function and verify whether the connection is functional or not.
    ./helper.sh on-aws-test

The sample output should look like the following.

--> Check if AWS CLI is installed
--> AWS CLI found
--> Using tags to create Lambda function name and invoking a test
--> Checking the Lambda invoke response.....
--> The status code is 200
--> Reading response from test function:
"Connection to the DB is working."
--> Response shows database connection is working from Lambda function using secret.

Conclusion

Building an effective secrets management solution requires careful planning and implementation. AWS Secrets Manager can help you effectively manage the lifecycle of your secrets at scale. We encourage you to take an iterative approach to building your secrets management solution, starting by focusing on core functional requirements like managing access, defining audit requirements, and building preventative and detective controls for secrets management. In future iterations, you can improve your solution by implementing more advanced functionalities like automatic rotation or resource policies for secrets.

To read Part 1 of this series, go to Migrating your secrets to AWS, Part I: Discovery and design.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Secrets Manager re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Adesh Gairola

Adesh Gairola is a Senior Security Consultant at Amazon Web Services in Sydney, Australia. Adesh is eager to help customers build robust defenses, and design and implement security solutions that enable business transformations. He is always looking for new ways to help customers improve their security posture.

Eric Swamy

Eric is a Senior Security Consultant working in the Professional Services team in Sydney, Australia. He is passionate about helping customers build the confidence and technical capability to move their most sensitive workloads to cloud. When not at work, he loves to spend time with his family and friends outdoors, listen to music, and go on long walks.

Migrating your secrets to AWS Secrets Manager, Part I: Discovery and design

Post Syndicated from Eric Swamy original https://aws.amazon.com/blogs/security/migrating-your-secrets-to-aws-secrets-manager-part-i-discovery-and-design/

“An ounce of prevention is worth a pound of cure.” – Benjamin Franklin

A secret can be defined as sensitive information that is not intended to be known or disclosed to unauthorized individuals, entities, or processes. Secrets like API keys, passwords, and SSH keys provide access to confidential systems and resources, but it can be a challenge for organizations to maintain secure and consistent management of these secrets. Commonly observed anti-patterns in organizational secrets management systems include sharing plaintext secrets in emails or messaging apps, allowing application developers to view secrets in plaintext, hard-coding secrets into applications and storing them in version control systems, failing to rotate secrets regularly, and not logging and monitoring access to secrets.

We have created a two-part Amazon Web Services (AWS) blog post that provides prescriptive guidance on how you can use AWS Secrets Manager to help you achieve a cloud-based and modern secrets management system. In this first blog post, we discuss approaches to discover and classify secrets. In Part 2 of this series, we elaborate on the implementation phase and discuss migration techniques that will help you migrate your secrets to AWS Secrets Manager.

Managing secrets: Best practices and personas

A secret’s lifecycle comprises four phases: create, store, use, and destroy. An effective secrets management solution protects the secret in each of these phases from unauthorized access. Besides being secure, robust, scalable, and highly available, the secrets management system should integrate closely with other tools, solutions, and services that are being used within the organization. Legacy secret stores may lack integration with privileged access management (PAM), logging and monitoring, DevOps, configuration management, and encryption and auditing, which leads to teams not having uniform practices for consuming secrets and creates discrepancies from organizational policies.

Secrets Manager is a secrets management service that helps you protect access to your applications, services, and IT resources. This is a non-exhaustive list of features that AWS Secrets Manager offers:

  • Access control through AWS Identity and Access Management (IAM) — Secrets Manager offers built-in integration with the AWS Identity and Access Management (IAM) service. You can attach access control policies to IAM principals or to secrets themselves (by using resource-based policies).
  • Logging and monitoring — Secrets Manager integrates with AWS logging and monitoring services such as AWS CloudTrail and Amazon CloudWatch. This means that you can use your existing AWS logging and monitoring stack to log access to secrets and audit their usage.
  • Integration with other AWS services — Secrets Manager can store and manage the lifecycle of secrets created by other AWS services like Amazon Relational Database Service (Amazon RDS), Amazon Redshift, and Amazon QuickSight. AWS is constantly working on integrating more services with Secrets Manager.
  • Secrets encryption at rest — Secrets Manager integrates with AWS Key Management Service (AWS KMS). Secrets are encrypted at rest by using an AWS-managed key or customer-managed key.
  • Framework to support the rotation of secrets securely — Rotation helps limit the scope of a compromise and should be an integral part of a modern approach to secrets management. You can use Secrets Manager to schedule automatic database credentials rotation for Amazon RDS, Amazon Redshift, and Amazon DocumentDB. You can use customized AWS Lambda functions to extend the Secrets Manager rotation feature to other secret types, such as API keys and OAuth tokens for on-premises and cloud resources.

Security, cloud, and application teams within an organization need to work together cohesively to build an effective secrets management solution. Each of these teams has unique perspectives and responsibilities when it comes to building an effective secrets management solution, as shown in the following table.

Persona Responsibilities What they want What they don’t want
Security teams/security architect Define control objectives and requirements from the secrets management system Least privileged short-lived access, logging and monitoring, and rotation of secrets Secrets sprawl
Cloud team/environment team Implement controls, create guardrails, detect events of interest Scalable, robust, and highly available secrets management infrastructure Application teams reaching out to them to provision or manage app secrets
Developer/migration engineer Migrate applications and their secrets to the cloud Independent control and management of their app secrets Dependency on external teams

To sum up the requirements from all the personas mentioned here: The approach to provision and consume secrets should be secure, governed, easily scalable, and self-service.

We’ll now discuss how to discover and classify secrets and design the migration in a way that helps you to meet these varied requirements.

Discovery — Assess and categorize existing secrets

The initial discovery phase involves running sessions aimed at discovering, assessing, and categorizing secrets. Migrating applications and associated infrastructure to the cloud requires a strategic and methodical approach to progressively discover and analyze IT assets. This analysis can be used to create high-confidence migration wave plans. You should treat secrets as IT assets and include them in the migration assessment planning.

For application-related secrets, arguably the most appropriate time to migrate a secret is when the application that uses the secret is being migrated itself. This lets you track and report the use of secrets as soon as the application begins to operate in the cloud. If secrets are left on-premises during an application migration, this often creates a risk to the availability of the application. The migrated application ends up having a dependency on the connectivity and availability of the on-premises secrets management system.

The activities performed in this phase are often handled by multiple teams. Depending on the purpose of the secret, this can be a mix of application developers, migration teams, and environment teams.

Following are some common secret types you might come across while migrating applications.

Type Description
Application secrets Secrets specific to an application
Client credentials Cloud to on-premises credentials or OAuth tokens (such as Okta, Google APIs, and so on)
Database credentials Credentials for cloud-hosted databases, for example, Amazon Redshift, Amazon RDS or Amazon Aurora, Amazon DocumentDB
Third-party credentials Vendor application credentials or API keys
Certificate private keys Custom applications or infrastructure that might require programmatic access to the private key
Cryptographic keys Cryptographic keys used for data encryption or digital signatures
SSH keys Centralized management of SSH keys can potentially make it easier to rotate, update, and track keys
AWS access keys On-premises to cloud credentials (IAM)

Creating an inventory for secrets becomes simpler when organizations have an IT asset management (ITAM) or Identity and Access Management (IAM) tool to manage their IT assets (such as secrets) effectively. For organizations that don’t have an on-premises secrets management system, creating an inventory of secrets is a combination of manual and automated efforts. Application subject matter experts (SMEs) should be engaged to find the location of secrets that the application uses. In addition, you can use commercial tools to scan endpoints and source code and detect secrets that might be hardcoded in the application. Amazon CodeGuru is a service that can detect secrets in code. It also provides an option to migrate these secrets to Secrets Manager.

AWS has previously described seven common migration strategies for moving applications to the cloud. These strategies are refactor, replatform, repurchase, rehost, relocate, retain, and retire. For the purposes of migrating secrets, we recommend condensing these seven strategies into three: retire, retain, and relocate. You should evaluate every secret that is being considered for migration against a decision tree to determine which of these three strategies to use. The decision tree evaluates each secret against key business drivers like cost reduction, risk appetite, and the need to innovate. This allows teams to assess if a secret can be replaced by native AWS services, needs to be retained on-premises, migrated to Secrets Manager, or retired. Figure 1 shows this decision process.

Figure 1: Decision tree for assessing a secret for migration

Capture the associated details for secrets that are marked as RELOCATE. This information is essential and must remain confidential. Some secret metadata is transitive and can be derived from related assets, including details such as itsm-tier, sensitivity-rating, cost-center, deployment pipeline, and repository name. With Secrets Manager, you will use resource tags to bind this metadata with the secret.
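
As an illustration, the captured metadata can be attached to an existing secret as resource tags with a single CLI call. The secret name and tag values below are hypothetical.

# Bind migration metadata to a secret as resource tags
aws secretsmanager tag-resource \
    --secret-id myapp/prod/db \
    --tags Key=appid,Value=ordersapp Key=appenv,Value=prod \
           Key=cost-center,Value=cc-1234 Key=sensitivity-rating,Value=confidential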

You should gather at least the following information for the secrets that you plan to relocate and migrate to AWS Secrets Manager.

Metadata about secrets Rationale for gathering data
Secrets team name or owner Gathering the name or email address of the individual or team responsible for managing secrets can aid in verifying that they are maintained and updated correctly.
Secrets application name or ID To keep track of which applications use which secrets, it is helpful to collect application details that are associated with these secrets.
Secrets environment name or ID Gathering information about the environment to which secrets belong, such as “prod,” “dev,” or “test,” can assist in the efficient management and organization of your secrets.
Secrets data classification Understanding your organization’s data classification policy can help you identify secrets that contain sensitive or confidential information. It is recommended to handle these secrets with extra care. This information, which may be labeled “confidential,” “proprietary,” or “personally identifiable information (PII),” can indicate the level of sensitivity associated with a particular secret according to your organization’s data classification policy or standard.
Secrets function or usage If you want to quickly find the secrets you need for a specific task or project, consider documenting their usage. For example, you can document secrets related to “backup,” “database,” “authentication,” or “third-party integration.” This approach can allow you to identify and retrieve the necessary secrets within your infrastructure without spending a lot of time searching for them.

This is also a good time to decide on the rotation strategy for each secret. When you rotate a secret, you update the credentials in both Secrets Manager and the service to which that secret provides access (in other words, the resource). Secrets Manager supports automatic rotation of secrets based on a schedule.
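
Once a secret and its rotation function exist in AWS, enabling the chosen rotation schedule is a single call. This is a minimal sketch with a hypothetical secret name and rotation Lambda function ARN.

# Turn on automatic rotation every 30 days using a rotation Lambda function
aws secretsmanager rotate-secret \
    --secret-id myapp/prod/db \
    --rotation-lambda-arn arn:aws:lambda:ap-southeast-2:111122223333:function:my-rotation-function \
    --rotation-rules AutomaticallyAfterDays=30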

Design the migration solution

In this phase, security and environment teams work together to onboard the Secrets Manager service to their organization’s cloud environment. This involves defining access controls, guardrails, and logging capabilities so that the service can be consumed in a regulated and governed manner.

As a starting point, use the following design principles mentioned in the Security Pillar of the AWS Well Architected Framework to design a migration solution:

  • Implement a strong identity foundation
  • Enable traceability
  • Apply security at all layers
  • Automate security best practices
  • Protect data at rest and in transit
  • Keep people away from data
  • Prepare for security events

The design considerations covered in the rest of this section will help you prepare your AWS environment to host production-grade secrets. This phase can be run in parallel with the discovery phase.

Design your access control system to establish a strong identity foundation

In this phase, you define and implement the strategy to restrict access to secrets stored in Secrets Manager. You can use the AWS Identity and Access Management (IAM) service to specify that identities (human and non-human IAM principals) are only able to access and manage secrets that they own. Organizations that organize their workloads and environments by using separate AWS accounts should consider using a combination of role-based access control (RBAC) and attribute-based access control (ABAC) to restrict access to secrets depending on the granularity of access that’s required.

You can use a scalable automation to deploy and update key IAM roles and policies, including the following:

  • Pipeline deployment policies and roles — This refers to IAM roles for CICD pipelines. These pipelines should be the primary mechanism for creating, updating, and deleting secrets in the organization.
  • IAM Identity Center permission sets — These allow human identities access to the Secrets Manager API. We recommend that you provision secrets by using infrastructure as code (IaC). However, there are instances where users need to interact directly with the service. This can be for initial testing, troubleshooting purposes, or updating a secret value when automatic rotation fails or is not enabled.
  • IAM permissions boundary — Boundary policies allow application teams to create IAM roles in a self-serviced, governed, and regulated manner.

Most organizations have Infrastructure, DevOps, or Security teams that deploy baseline configurations into AWS accounts. These solutions help these teams govern the AWS account and often have their own secrets. IAM policies should be created such that the IAM principals created by the application teams are unable to access secrets that are owned by the environment team, and vice versa. To enforce this logical boundary, you can use tagging and naming conventions on your secrets by using IAM.

A sample scheme for tagging your secrets can look like the following.

Tag key Tag value Notes Policy elements Secret tags
appname
  • Lowercase
  • Alphanumeric only
  • User friendly
  • Quickly identifiable
A user-friendly name for the application PrincipalTag/appname=<value> (applies to role)
RequestTag/appname=<value> (applies to caller)
secretsmanager:ResourceTag/appname=<value> (applies to the secret)
appname:<value>
appid
  • Lowercase
  • Alphanumeric only
  • Unique across the organization
  • Fixed length (5–7 characters)
Uniquely identifies the application among other cloud-hosted apps PrincipalTag/appid=<value>
RequestTag/appid=<value>
secretsmanager:ResourceTag/appid=<value>
appid:<value>
appfunc
  • Lowercase
  • Fixed values (for example, web, msg, dba, api, storage, container, middleware, tool, service)
Used to describe the function of a particular target that the secret material is associated with (for example, web server, message broker, database) PrincipalTag/appfunc=<value>
RequestTag/appfunc=<value>
secretsmanager:ResourceTag/appfunc=<value>
appfunc:<value>
appenv
  • Lowercase
  • Fixed values (for example, dev, test, nonp, prod)
An identifier for the secret usage environment PrincipalTag/appenv=<value>
RequestTag/appenv=<value>
secretsmanager:ResourceTag/appenv=<value>
appenv:<value>
dataclassification
  • Lowercase
  • Fixed values (for example, protected, confidential)
Use your organization’s data classification standards to classify the secrets PrincipalTag/dataclassification=<value>
RequestTag/dataclassification=<value>
secretsmanager:ResourceTag/dataclassification=<value>
dataclassification:<value>

If you maintain a registry that documents details of your cloud-hosted applications, most of these tags can be derived from the registry.

It’s common to apply different security and operational policies for the non-production and production environments of a given workload. Although production environments are generally deployed in a dedicated account, it’s common to have less critical non-production apps and environments coexisting in the same AWS account. For operation and governance at scale in these multi-tenanted accounts, you can use attribute-based access control (ABAC) to manage secure access to secrets. ABAC enables you to grant permissions based on tags. The main benefits of using tag-based access control are its scalability and operational efficiency.

Figure 2 shows an example of ABAC in action, where an IAM policy allows access to a secret only if the appfunc, appenv, and appid tags on the secret match the tags on the IAM principal that is trying to access the secrets.

Figure 2: ABAC access control

ABAC works as follows:

  • Tags on a resource define who can access the resource. It is therefore important that resources are tagged upon creation.
  • For a create secret operation, IAM verifies whether the Principal tags on the IAM identity that is making the API call match the request tags in the request.
  • For an update, delete, or read operation, IAM verifies that the Principal tags on the IAM identity that is making the API call match the resource tags on the secret.
  • Regardless of the number of workloads or environments that coexist in the same account, you only need to create one ABAC-based IAM policy. This policy is the same for different kinds of accounts and can be deployed by using a capability like AWS CloudFormation StackSets. This is the reason that ABAC scales well for scenarios where multiple applications and environments are deployed in the same AWS account.
  • IAM roles can use a common IAM policy, such as the one described in the previous bullet point. You need to verify that the roles have the correct tags set on them, according to your tagging convention. This will automatically grant the roles access to the secrets that have the same resource tags.
  • Note that with this approach, tagging secrets and IAM roles becomes the most critical component for controlling access. For this reason, all tags on IAM roles and secrets on Secrets Manager must follow a standard naming convention at all times.

The following is an ABAC-based IAM policy that allows creation, updates, and deletion of secrets based on the tagging scheme described in the preceding table.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Condition": {
                "StringEquals": {
                    "secretsmanager:ResourceTag/appfunc": "${aws:PrincipalTag/appfunc}",
                    "secretsmanager:ResourceTag/appenv": "${aws:PrincipalTag/appenv}",
                    "secretsmanager:ResourceTag/name": "${aws:PrincipalTag/name}",
                    "secretsmanager:ResourceTag/appid": "${aws:PrincipalTag/appid}"
                }
            },
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:PutSecretValue",
                "secretsmanager:UpdateSecret",
                "secretsmanager:DeleteSecret"
            ],
            "Resource": "arn:aws:secretsmanager:ap-southeast-2:*:secret:${aws:PrincipalTag/name}/${aws:PrincipalTag/appid}/${aws:PrincipalTag/appfunc}/${aws:PrincipalTag/appenv}*",
            "Effect": "Allow",
            "Sid": "AccessBasedOnResourceTags"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/appfunc": "${aws:PrincipalTag/appfunc}",
                    "aws:RequestTag/appid": "${aws:PrincipalTag/appid}",
                    "aws:RequestTag/name": "${aws:PrincipalTag/name}",
                    "aws:RequestTag/appenv": "${aws:PrincipalTag/appenv}"
                }
            },
            "Action": [
                "secretsmanager:TagResource",
                "secretsmanager:CreateSecret"
            ],
            "Resource": "arn:aws:secretsmanager:ap-southeast-2:*:secret:${aws:PrincipalTag/name}/${aws:PrincipalTag/appid}/${aws:PrincipalTag/appfunc}/${aws:PrincipalTag/appenv}*",
            "Effect": "Allow",
            "Sid": "AccessBasedOnRequestTags"
        }
    ]
}

In addition to controlling access, this policy also enforces a naming convention. IAM principals will only be able to create a secret that matches the following naming scheme.

Secret name = value of tag-key (appid + appfunc + appenv + name)
For example, /ordersapp/api/prod/logisticsapi
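
For illustration, the following sketch creates a secret whose name and tags follow this scheme. The secret value and tag values are placeholders; the path segments must be ordered exactly as your ABAC policy's Resource pattern expects, and the calling principal's tags must match the request tags for the call to be allowed.

# Create a secret whose name and tags follow the organization's tagging scheme
aws secretsmanager create-secret \
    --name /ordersapp/api/prod/logisticsapi \
    --secret-string '{"username":"svc_logistics","password":"example-only"}' \
    --tags Key=appid,Value=ordersapp Key=appfunc,Value=api \
           Key=appenv,Value=prod Key=name,Value=logisticsapi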

You can choose to implement ABAC so that the resource name matches the principal tags or the resource tags match the principal tags, or both. These are just different types of ABAC. The sample policy provided here implements both types. It’s important to note that because ABAC-based IAM policies are shared across multiple workloads, potential misconfigurations in the policies will have a wider scope of impact.

For more information about building your ABAC strategy, refer to the blog post Working backward: From IAM policies and principal tags to standardized names and tags for your AWS resources.

You can also add checks in your pipeline to provide early feedback for developers. These checks can verify that the appropriate tags have been set on IaC resources before the resources are created. Your pipeline-based controls provide an additional layer of defense and complement or extend the restrictions enforced by IAM policies.

Resource-based policies

Resource-based policies are a flexible and powerful mechanism to control access to secrets. They are directly associated with a secret and allow specific principals mentioned in the policy to have access to the secret. You can use these policies to grant identities (internal or external to the account) access to a secret.

If your organization uses resource policies, security teams should come up with control objectives for these policies. Controls should be set so that only resource-based policies meeting your organizations requirements are created. Control objectives for resource policies may be set as follows:

  • Allow statements in the policy should grant access to the secret only from within the same application.
  • Allow statements in the policy should grant access to organization-owned cross-account identities only if they belong to the same environment.

Controls that meet these objectives can be preventative (checks in the pipeline) or responsive (AWS Config rules and Amazon EventBridge invoked Lambda functions).

Environment teams can also choose to provision resource-based policies for application teams. The provisioning process can be manual, but is preferably automated. For example, these teams can allow application teams to tag secrets with specific values, like the Amazon Resource Name (ARN) of a cross-account IAM role that needs access. An automation invoked by EventBridge rules then asserts that the cross-account principal in the tag belongs to the organization and is in the same environment, and then provisions a resource-based policy for the application team. Using such mechanisms creates a self-service way for teams to create safe resource policies that meet common use cases.
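
As an illustration, such an automation could attach a resource-based policy like the following, which grants a single cross-account role read access to the secret. The account ID, role name, and secret name are placeholders.

# Attach a resource-based policy that allows one cross-account role to read the secret
aws secretsmanager put-resource-policy \
    --secret-id /ordersapp/api/prod/logisticsapi \
    --resource-policy '{
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "AllowCrossAccountRead",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::444455556666:role/ordersapp-prod-reader"},
        "Action": "secretsmanager:GetSecretValue",
        "Resource": "*"
      }]
    }'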

Resource-based policies for Secrets Manager can be a helpful tool for controlling access to secrets, but it is important to consider specific situations where alternative access control mechanisms might be more appropriate. For example, if your access control requirements for secrets involve complex conditions or dependencies that cannot be easily expressed using the resource-based policy syntax, it may be challenging to manage and maintain the policies effectively. In such cases, you may want to consider using a different access control mechanism that better aligns with your requirements. For help determining which type of policy to use, see Identity-based policies and resource-based policies.

Design detective controls to achieve traceability, monitoring, and alerting

Prepare your environment to record and flag events of interest when Secrets Manager is used to store and update secrets. We recommend that you start by identifying risks and then formulate objectives and devise control measures for each identified risk, as follows:

  • Control objectives — What does the control evaluate, and how is it configured? Controls can be configured by using CloudTrail events invoked by Lambda functions, AWS config rules, or CloudWatch alarms. Controls can evaluate a misconfigured property in a secrets resource or report on an event of interest.
  • Target audience — Identify teams that should be notified if the event occurs. This can be a combination of the environment, security, and application teams.
  • Notification type — SNS, email, Slack channel notifications, or an ITIL ticket.
  • Criticality — Low, medium, or high, based on the criticality of the event.

The following is a sample matrix that can serve as a starting point for documenting detective controls for Secrets Manager. The AWS services column in the table offers some implementation suggestions to help you meet your control objectives.

  • Risk: A secret is created without tags that match naming and tagging schemes. Control objectives: enforce least privilege; establish logging and monitoring; manage secrets. Criticality: HIGH (if using ABAC). AWS services: CloudTrail-invoked Lambda function or custom AWS Config rule.
  • Risk: IAM-related tags on a secret are updated or removed. Control objectives: manage secrets; enforce least privilege. Criticality: HIGH (if using ABAC). AWS services: CloudTrail-invoked Lambda function or custom AWS Config rule.
  • Risk: A resource policy is created when resource policies have not been onboarded to the environment. Control objectives: manage secrets; enforce least privilege. Criticality: HIGH. AWS services: pipeline check, CloudTrail-invoked Lambda function, or custom AWS Config rule.
  • Risk: A secret is marked for deletion from an unusual source, such as the root user or an admin break-glass role. Control objectives: improve availability; protect configurations; prepare for incident response; manage secrets. Criticality: HIGH. AWS services: CloudTrail-invoked Lambda function.
  • Risk: A non-compliant resource policy was created, for example to provide secret access to a foreign account. Control objectives: enforce least privilege; manage secrets. Criticality: HIGH. AWS services: CloudTrail-invoked Lambda function or custom AWS Config rule.
  • Risk: An AWS KMS key for secrets encryption is marked for deletion. Control objectives: manage secrets; protect configurations. Criticality: HIGH. AWS services: CloudTrail-invoked Lambda function.
  • Risk: A secret rotation failed. Control objectives: manage secrets; improve availability. Criticality: MEDIUM. AWS services: managed AWS Config rule.
  • Risk: A secret is inactive and has not been accessed for x number of days. Control objective: optimize costs. Criticality: LOW. AWS services: managed AWS Config rule.
  • Risk: Secrets are created that do not use a KMS key. Control objective: encrypt data at rest. Criticality: LOW. AWS services: managed AWS Config rule.
  • Risk: Automatic rotation is not enabled. Control objective: manage secrets. Criticality: LOW. AWS services: managed AWS Config rule.
  • Risk: Successful create, update, and read events for secrets. Control objective: establish logging and monitoring. Criticality: LOW. AWS services: CloudTrail logs.
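
As an illustration of the CloudTrail-invoked controls listed above, the following is a sample Amazon EventBridge event pattern that matches DeleteSecret API calls recorded by CloudTrail; a rule with this pattern can invoke a Lambda function or an SNS topic to notify the target audience. Treat it as a minimal sketch and extend it as needed, for example to also match on the calling identity.

{
    "source": ["aws.secretsmanager"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["secretsmanager.amazonaws.com"],
        "eventName": ["DeleteSecret"]
    }
}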

We suggest that you deploy these controls in your AWS accounts by using a scalable mechanism, such as CloudFormation StackSets.

For more details, see the following topics:

Design for additional protection at the network layer

You can use the guiding principles of Zero Trust networking to add further mechanisms that control access to secrets. The best security doesn’t come from making a binary choice between identity-centric and network-centric controls, but from using both effectively in combination with each other.

VPC endpoints allow you to provide a private connection between your VPC and the Secrets Manager API endpoints. They also provide the ability to attach a policy that lets you enforce identity-centric rules at a logical network boundary. You can use global condition context keys like aws:PrincipalOrgID in VPC endpoint policies to allow requests to the Secrets Manager service only from identities that belong to the same AWS organization. You can also use the aws:SourceVpce and aws:SourceVpc IAM conditions to allow access to a secret only if the request originates from a specific VPC endpoint or VPC, respectively.
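
The following is a minimal sketch of a VPC endpoint policy that applies this idea; the organization ID and the list of allowed actions are placeholders to adapt to your own requirements.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowOrgPrincipalsOnly",
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": "o-exampleorgid"
                }
            }
        }
    ]
}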

For more details on VPC endpoints, see Using an AWS Secrets Manager VPC endpoint.

Design for least privileged access to encryption keys

To reduce unauthorized access, secrets should be encrypted at rest. Secrets Manager integrates with AWS KMS and uses envelope encryption. Every secret in Secrets Manager is encrypted with a unique data key. Each data key is protected by a KMS key. Whenever the secret value inside a secret changes, Secrets Manager generates a new data key to protect it. The data key is encrypted under a KMS key and stored in the metadata of the secret. To decrypt the secret, Secrets Manager first decrypts the encrypted data key by using the KMS key in AWS KMS.

The following is a sample AWS KMS key policy that permits cryptographic operations on a KMS key only through the Secrets Manager service within a specific AWS account, and allows the AWS KMS Decrypt action for a specific IAM role throughout the organization.

{
    "Version": "2012-10-17",
    "Id": "secrets_manager_encrypt_org",
    "Statement": [
        {
            "Sid": "Root Access",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::444455556666:root"
            },
            "Action": "kms:*",
            "Resource": "*"
        },
        {
            "Sid": "Allow access for Key Administrators",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
             "arn:aws:iam::444455556666:role/platformRoles/KMS-key-admin-role",                    "arn:aws:iam::444455556666:role/platformRoles/KMS-key-automation-role"
                ]
            },
            "Action": [
                "kms:CancelKeyDeletion",
                "kms:Create*",
                "kms:Delete*",
                "kms:Describe*",
                "kms:Disable*",
                "kms:Enable*",
                "kms:Get*",
                "kms:List*",
                "kms:Put*",
                "kms:Revoke*",
                "kms:ScheduleKeyDeletion",
                "kms:TagResource",
                "kms:UntagResource",
                "kms:Update*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Allow Secrets Manager use of the KMS key for a specific account",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:CreateGrant",
                "kms:ListGrants",
                "kms:DescribeKey"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "kms:CallerAccount": "444455556666",
                    "kms:ViaService": "secretsmanager.us-east-1.amazonaws.com"
                }
            }
        },
        {
            "Sid": "Allow use of Secrets Manager secrets from a specific IAM role (service account) throughout your org",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "kms:Decrypt",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": "o-exampleorgid"
                },
                "StringLike": {
                    "aws:PrincipalArn": "arn:aws:iam::*:role/platformRoles/secretsAccessRole"
                }
            }
        }
    ]
}

Additionally, you can use the secretsmanager:KmsKeyId IAM condition key to allow secret creation only when a KMS key that you specify is associated with the secret. You can also add checks in your pipeline that allow the creation of a secret only when a KMS key is associated with it.
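
One way to express such a control is a Deny statement (for example, in an identity-based policy or a service control policy) that blocks secret creation whenever the request does not specify a KMS key, so that new secrets cannot silently fall back to the default key. The following is a sketch only; validate the condition behavior against your own request patterns before rolling it out.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenySecretCreationWithoutSpecifiedKmsKey",
            "Effect": "Deny",
            "Action": "secretsmanager:CreateSecret",
            "Resource": "*",
            "Condition": {
                "Null": {
                    "secretsmanager:KmsKeyId": "true"
                }
            }
        }
    ]
}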

Design or update applications for efficient retrieval of secrets

In applications, you can retrieve your secrets by calling the GetSecretValue function in the available AWS SDKs. However, we recommend that you cache your secret values by using client-side caching. Caching secrets can improve speed, help to prevent throttling by limiting calls to the service, and potentially reduce your costs.

Secrets Manager integrates with the following AWS services to provide efficient retrieval of secrets:

  • For Amazon RDS, you can integrate with Secrets Manager to simplify managing master user passwords for Amazon RDS database instances. Amazon RDS can manage the master user password and stores it securely in Secrets Manager, which may eliminate the need for custom AWS Lambda functions to manage password rotations. The integration can help you secure your database by encrypting the secrets, using your own managed key or an AWS KMS key provided by Secrets Manager. As a result, the master user password is not visible in plaintext during the database creation workflow. This feature is available for the Amazon RDS and Aurora engines, and more information can be found in the Amazon RDS and Aurora User Guides.
  • For Amazon Elastic Kubernetes Service (Amazon EKS), you can use the AWS Secrets and Configuration Provider (ASCP) for the Kubernetes Secrets Store CSI Driver. This open-source project enables you to mount Secrets Manager secrets as Kubernetes secrets. The driver translates Kubernetes secret objects into Secrets Manager API calls, allowing you to access and manage secrets from within Kubernetes. After you configure the Kubernetes Secrets Store CSI Driver, you can create Kubernetes secrets backed by Secrets Manager secrets. These secrets are securely stored in Secrets Manager and can be accessed by your applications that are running in Amazon EKS.
  • For Amazon Elastic Container Service (Amazon ECS), sensitive data can be securely stored in Secrets Manager secrets and then accessed by your containers through environment variables or as part of the log configuration. This provides a simple and secure way to inject sensitive data into your containers.
  • For AWS Lambda, you can use the AWS Parameters and Secrets Lambda Extension to retrieve and cache Secrets Manager secrets in Lambda functions without the need for an AWS SDK. Retrieving a cached secret is faster than calling Secrets Manager directly, and using a cache can reduce your costs because there is a charge for calling Secrets Manager APIs. For more details, see the Secrets Manager User Guide.

For additional information on how to use Secrets Manager secrets with AWS services, refer to the following resources:

Develop an incident response plan for security events

It is recommended that you prepare for unforeseeable incidents such as unauthorized access to your secrets. Developing an incident response plan can help minimize the impact of the security event, facilitate a prompt and effective response, and may help to protect your organization’s assets and reputation. The traceability and monitoring controls we discussed in the previous section can be used both during and after the incident.

The Computer Security Incident Handling Guide SP 800-61 Rev. 2, which was created by the National Institute of Standards and Technology (NIST), can help you create an incident response plan for specific incident types. It provides a thorough and organized approach to incident response, covering everything from initial preparation and planning to detection and analysis, containment, eradication, recovery, and follow-up. The framework emphasizes the importance of continual improvement and learning from past incidents to enhance the overall security posture of the organization.

Refer to the following documentation for further details and sample playbooks:

Conclusion

In this post, we discussed how organizations can take a phased approach to migrate their secrets to AWS Secrets Manager. Your teams can use the thought exercises mentioned in this post to decide if they would like to rehost, replatform, or retire secrets. We discussed what guardrails should be enabled for application teams to consume secrets in a safe and regulated manner. We also touched upon ways organizations can discover and classify their secrets.

In Part 2 of this series, we go into the details of the migration implementation phase and walk you through a sample solution that you can use to integrate on-premises applications with Secrets Manager.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Secrets Manager re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Eric Swamy

Eric is a Senior Security Consultant working in the Professional Services team in Sydney, Australia. He is passionate about helping customers build the confidence and technical capability to move their most sensitive workloads to cloud. When not at work, he loves to spend time with his family and friends outdoors, listen to music, and go on long walks.

Adesh Gairola

Adesh Gairola is a Senior Security Consultant at Amazon Web Services in Sydney, Australia. Adesh is eager to help customers build robust defenses, and design and implement security solutions that enable business transformations. He is always looking for new ways to help customers improve their security posture.

Migrating Netflix to GraphQL Safely

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/migrating-netflix-to-graphql-safely-8e1e4d4f1e72

By Jennifer Shin, Tejas Shikhare, Will Emmanuel

In 2022, a major change was made to Netflix’s iOS and Android applications. We migrated Netflix’s mobile apps to GraphQL with zero downtime, which involved a total overhaul from the client to the API layer.

Until recently, an internal API framework, Falcor, powered our mobile apps. They are now backed by Federated GraphQL, a distributed approach to APIs where domain teams can independently manage and own specific sections of the API.

Doing this safely for hundreds of millions of customers without disruption is exceptionally challenging, especially considering the many dimensions of change involved. This blog post will share broadly applicable techniques (beyond GraphQL) that we used to perform this migration. The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries.

Migration Details

Before diving into these techniques, let’s briefly examine the migration plan.

Before GraphQL: Monolithic Falcor API implemented and maintained by the API Team

Before moving to GraphQL, our API layer consisted of a monolithic server built with Falcor. A single API team maintained both the Java implementation of the Falcor framework and the API Server.

Phase 1

Created a GraphQL Shim Service on top of our existing Monolith Falcor API.

By the summer of 2020, many UI engineers were ready to move to GraphQL. Instead of embarking on a full-fledged migration top to bottom, we created a GraphQL shim on top of our existing Falcor API. The GraphQL shim enabled client engineers to move quickly onto GraphQL, figure out client-side concerns like cache normalization, experiment with different GraphQL clients, and investigate client performance without being blocked by server-side migrations. To launch Phase 1 safely, we used AB Testing.

Phase 2

Deprecate the GraphQL Shim Service and Legacy API Monolith in favor of GraphQL services owned by the domain teams.

We didn’t want the legacy Falcor API to linger forever, so we leaned into Federated GraphQL to power a single GraphQL API with multiple GraphQL servers.

We could also swap out the implementation of a field from GraphQL Shim to Video API with federation directives. To launch Phase 2 safely, we used Replay Testing and Sticky Canaries.

Testing Strategies: A Summary

Two key factors determined our testing strategies:

  • Functional vs. non-functional requirements
  • Idempotency

If we were testing functional requirements like data accuracy, and if the request was idempotent, we relied on Replay Testing. We knew we could test the same query with the same inputs and consistently expect the same results.

We couldn’t replay test GraphQL queries or mutations that requested non-idempotent fields.

And we definitely couldn’t replay test non-functional requirements like caching and logging user interaction. In such cases, we were not testing for response data but overall behavior. So, we relied on higher-level metrics-based testing: AB Testing and Sticky Canaries.

Let’s discuss the three testing strategies in further detail.

Tool: AB Testing

Netflix traditionally uses AB Testing to evaluate whether new product features resonate with customers. In Phase 1, we leveraged the AB testing framework to isolate a user segment into two groups totaling 1 million users. The control group’s traffic utilized the legacy Falcor stack, while the experiment population leveraged the new GraphQL client and was directed to the GraphQL Shim. To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render.

We set up a client-side AB experiment that tested Falcor versus GraphQL and reported coarse-grained quality of experience metrics (QoE). The AB experiment results hinted that GraphQL’s correctness was not up to par with the legacy system. We spent the next few months diving into these high-level metrics and fixing issues such as cache TTLs, flawed client assumptions, etc.

Wins

High-Level Health Metrics: AB Testing provided the assurance we needed in our overall client-side GraphQL implementation. This helped us successfully migrate 100% of the traffic on the mobile homepage canvas to GraphQL in 6 months.

Gotchas

Error Diagnosis: With an AB test, we could see coarse-grained metrics which pointed to potential issues, but it was challenging to diagnose the exact issues.

Tool: Replay Testing — Validation at Scale!

The next phase in the migration was to reimplement our existing Falcor API in a GraphQL-first server (Video API Service). The Falcor API had become a logic-heavy monolith with over a decade of tech debt. So we had to ensure that the reimplemented Video API server was bug-free and identical to the already productized Shim service.

We developed a Replay Testing tool to verify that idempotent APIs were migrated correctly from the GraphQL Shim to the Video API service.

How does it work?

The Replay Testing framework leverages the @override directive available in GraphQL Federation. This directive tells the GraphQL Gateway to route to one GraphQL server over another. Take, for instance, the following two GraphQL schemas defined by the Shim Service and the Video Service:
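
The schema images are not reproduced here, but conceptually the two subgraphs looked something like the following sketch. The type name, key field, and subgraph name are illustrative placeholders rather than Netflix’s actual schema; only the certificationRating field and the @override directive are taken from the description below.

# GraphQL Shim subgraph (Phase 1)
type Video @key(fields: "videoId") {
  videoId: ID!
  certificationRating: String
}

# Video Service subgraph (Phase 2): takes over resolution of the field
type Video @key(fields: "videoId") {
  videoId: ID!
  certificationRating: String @override(from: "shim")
}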

The GraphQL Shim first defined the certificationRating field (things like Rated R or PG-13) in Phase 1. In Phase 2, we stood up the VideoService and defined the same certificationRating field marked with the @override directive. The presence of the identical field with the @override directive informed the GraphQL Gateway to route the resolution of this field to the new Video Service rather than the old Shim Service.

The Replay Tester tool samples raw traffic streams from Mantis. With these sampled events, the tool can capture a live request from production and run an identical GraphQL query against both the GraphQL Shim and the new Video API service. The tool then compares the results and outputs any differences in response payloads.

Note: We do not replay test Personally Identifiable Information; Replay Testing is used only for non-sensitive product features on the Netflix UI.

Once the test is completed, the engineer can view the diffs displayed as a flattened JSON node. You can see the control value on the left side of the comma in parentheses and the experiment value on the right.

/data/videos/0/tags/3/id: (81496962, null)
/data/videos/0/tags/5/displayName: (Série, value: “S\303\251rie”)

We captured two diffs above: the first had missing data for an ID field in the experiment, and the second had an encoding difference. We also saw differences in localization, date precision, and floating point accuracy. Replay Testing gave us confidence in the replicated business logic, where subscriber plans and user geographic location determined the customer’s catalog availability.

Wins

  • Confidence in parity between the two GraphQL Implementations
  • Enabled tuning configs in cases where data was missing due to over-eager timeouts
  • Tested business logic that required many (unknown) inputs and where correctness can be hard to eyeball

Gotchas

  • PII and non-idempotent APIs should not be tested using Replay Tests, and it would be valuable to have a mechanism to prevent that.
  • Manually constructed queries are only as good as the features the developer remembers to test. We ended up with untested fields simply because we forgot about them.
  • Correctness: The idea of correctness can be confusing too. For example, is it more correct for an array to be empty or null, or is it just noise? Ultimately, we matched the existing behavior as much as possible because verifying the robustness of the client’s error handling was difficult.

Despite these shortcomings, Replay Testing was a key indicator that we had achieved functional correctness of most idempotent queries.

Tool: Sticky Canary

While Replay Testing validates the functional correctness of the new GraphQL APIs, it does not provide any performance or business metric insight, such as the overall perceived health of user interaction. Are users clicking play at the same rates? Are things loading in time before the user loses interest? Replay Testing also cannot be used for non-idempotent API validation. We reached for a Netflix tool called the Sticky Canary to build confidence.

A Sticky Canary is an infrastructure experiment where customers are assigned either to a canary or baseline host for the entire duration of an experiment. All incoming traffic is allocated to an experimental or baseline host based on the customer’s device and profile, similar to a bucket hash. The experimental host deployment serves all the customers assigned to the experiment. Watch our Chaos Engineering talk from AWS re:Invent to learn more about Sticky Canaries.

In the case of our GraphQL APIs, we used a Sticky Canary experiment to run two instances of our GraphQL gateway. The baseline gateway used the existing schema, which routes all traffic to the GraphQL Shim. The experimental gateway used the new proposed schema, which routes traffic to the latest Video API service. Zuul, our primary edge gateway, assigns traffic to either cluster based on the experiment parameters.

We then collect and analyze the performance of the two clusters. Some KPIs we monitor closely include:

  • Median and tail latencies
  • Error rates
  • Logs
  • Resource utilization–CPU, network traffic, memory, disk
  • Device QoE (Quality of Experience) metrics
  • Streaming health metrics

We started small, with tiny customer allocations for hour-long experiments. After validating performance, we slowly built up scope. We increased the percentage of customer allocations, introduced multi-region tests, and eventually 12-hour or day-long experiments. Validating along the way is essential since Sticky Canaries impact live production traffic and are assigned persistently to a customer.

After several sticky canary experiments, we had assurance that phase 2 of the migration improved all core metrics, and we could dial up GraphQL globally with confidence.

Wins

Sticky Canaries was essential to build confidence in our new GraphQL services.

  • Non-Idempotent APIs: these tests are compatible with mutating or non-idempotent APIs
  • Business metrics: Sticky Canaries validated our core Netflix business metrics had improved after the migration
  • System performance: Insights into latency and resource usage help us understand how scaling profiles change after migration

Gotchas

  • Negative Customer Impact: Sticky Canaries can impact real users. We needed confidence in our new services before persistently routing some customers to them. This is partially mitigated by real-time impact detection, which will automatically cancel experiments.
  • Short-lived: Sticky Canaries are meant for short-lived experiments. For longer-lived tests, a full-blown AB test should be used.

In Summary

Technology is constantly changing, and we, as engineers, spend a large part of our careers performing migrations. The question is not whether we are migrating but whether we are migrating safely, with zero downtime, in a timely manner.

At Netflix, we have developed tools that ensure confidence in these migrations, targeted toward each specific use case being tested. We covered the three tools we used for the GraphQL migration: AB Testing, Replay Testing, and Sticky Canaries.

This blog post is part of our Migrating Critical Traffic series. Also, check out: Migrating Critical Traffic at Scale (part 1, part 2) and Ensuring the Successful Launch of Ads.



Let’s Architect! Streamlining business with migration and modernization

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-streamlining-business-with-migration-and-modernization/

Many customers migrate their systems to Amazon Web Services (AWS) to increase their competitive edge and drive business value. To maximize the benefits of a cloud migration, companies tend to move their applications in conjunction with modernization initiatives. These joint efforts help your applications gain more agility, scalability, and resilience. Modernizing the portfolio of workloads with AWS means that you can re-platform, refactor, or replace these workloads by using containers, serverless technologies, purpose-built data stores, and software automation. These capabilities let you take advantage of AWS agility and total cost of ownership (TCO) benefits.

In this edition of Let’s Architect! we share hands-on activities, customer stories, and tips and tricks to migrate and modernize your applications with AWS.

Migrating to the cloud: What is the cost of doing nothing?

Would you think that small companies always migrate faster than large enterprises? Actually, cloud migration speed doesn’t necessarily depend on the size of the business! Company size is not a clear indicator of migration and modernization success, but a shift of culture and mindset is essential for successful company evolution.

When it comes to migration, the cost of doing nothing is not just financial: Businesses can also expect a slower pace of innovation and a higher security burden. This video analyzes the financial benefits of migration and shares mental models for approaching an AWS cloud migration, and Marriott team members explain how they planned their migration and the lessons learned along the way.

Take me to this re:Invent 2022 video!

Benefits of an early migration start

Modernization pathways for a legacy .NET Framework monolithic application on AWS

Organizations aim to deliver the best technological solutions based on customer needs. At any stage in their cloud adoption journey, businesses often end up managing and building monolithic applications. Let’s explore a migration path for a monolithic .NET Framework application to a modern microservices-based stack on AWS, and discuss AWS tools to break the monolith into microservices and containerize applications.

Cost optimization is another key factor for modernizing your workloads; solutions include moving to Linux-based systems or using open-source database engines. This Migrate and Modernize enterprise workloads with AWS video walks you through the process of migrating and modernizing enterprise workloads with AWS.

Take me to this blog post with more detail!

A modernized microservices-based rearchitecture

Implementing a serverless-first strategy in an enterprise

Organizations of all sizes want to benefit from the agility, cost savings, and developer experience that serverless architectures can provide on AWS. For large enterprises, the return on investment (ROI) can be massive, but overcoming architecture inertia while ensuring security best practices and governance stay in place is a hurdle that many struggle with. In this lightning talk, learn how your organization can implement a serverless-first strategy to overcome these obstacles. Delta Air Lines shares the story of making serverless-first a reality as part of their AWS journey.

Take me to this video

Benefits of serverless

Application Migration with AWS

This workshop shows you how to migrate and modernize a fictional application to the AWS Cloud by:

  1. Performing a database migration
  2. Migrating and modernizing your web server using different migration strategies (for example, breaking down the monolith into containers)
  3. Improving operational excellence, security, performance efficiency, and cost optimization of the deployed architecture by following these pillars of the AWS Well-Architected Framework.

Take me to this workshop!

Different migration strategies for web servers

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about distributed systems with containers.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

How BookMyShow saved 80% in costs by migrating to an AWS modern data architecture

Post Syndicated from Mahesh Vandi Chalil original https://aws.amazon.com/blogs/big-data/how-bookmyshow-saved-80-in-costs-by-migrating-to-an-aws-modern-data-architecture/

This is a guest post co-authored by Mahesh Vandi Chalil, Chief Technology Officer of BookMyShow.

BookMyShow (BMS), a leading entertainment company in India, provides an online ticketing platform for movies, plays, concerts, and sporting events. Selling up to 200 million tickets on an annual run rate basis (pre-COVID) to customers in India, Sri Lanka, Singapore, Indonesia, and the Middle East, BookMyShow also offers an online media streaming service and end-to-end management for virtual and on-ground entertainment experiences across all genres.

The pandemic gave BMS the opportunity to migrate and modernize our 15-year-old analytics solution to a modern data architecture on AWS. This architecture is modern, secure, governed, and cost-optimized, with the ability to scale to petabytes. BMS migrated and modernized from on-premises and other cloud platforms to AWS in just four months. This project was run in parallel with our application migration project and achieved 90% cost savings in storage and 80% cost savings in analytics spend.

The BMS analytics platform caters to business needs for sales and marketing, finance, and business partners (e.g., cinemas and event owners), and provides application functionality for audience, personalization, pricing, and data science teams. The prior analytics solution had multiple copies of data, for a total of over 40 TB, with approximately 80 TB of data in other cloud storage. Data was stored on‑premises and in the cloud in various data stores. Growing organically, the teams had the freedom to choose their technology stack for individual projects, which led to the proliferation of various tools, technology, and practices. Individual teams for personalization, audience, data engineering, data science, and analytics used a variety of products for ingestion, data processing, and visualization.

This post discusses BMS’s migration and modernization journey, and how BMS, AWS, and the AWS Partner Minfy Technologies team worked together to successfully complete the migration in four months and save costs. The migration tenets using the AWS modern data architecture made the project a huge success.

Challenges in the prior analytics platform

  • Varied Technology: Multiple teams used various products, languages, and versions of software.
  • Larger Migration Project: Because the analytics modernization was a parallel project with application migration, planning was crucial in order to consider the changes in core applications and project timelines.
  • Resources: Experienced resource churn from the application migration project, and had very little documentation of current systems.
  • Data: Had multiple copies of data and no single source of truth; each data store provided a view for the business unit.
  • Ingestion Pipelines: Complex data pipelines moved data across various data stores at varied frequencies. We had multiple approaches in place to ingest data to Cloudera, via over 100 Kafka consumers from transaction systems and MQTT (Message Queue Telemetry Transport) for clickstreams, stored procedures, and Spark jobs. We had approximately 100 jobs for data ingestion across Spark, Alteryx, Beam, NiFi, and more.
  • Hadoop Clusters: The Hadoop clusters were configured on large, dedicated hardware, incurring fixed costs. The on-premises Cloudera setup catered to most of the data engineering, audience, and personalization batch processing workloads. Teams had their own implementations of HBase and Hive for our audience and personalization applications.
  • Data warehouse: The data engineering team used TiDB as their on-premises data warehouse. However, each consumer team had their own perspective of data needed for analysis. As this siloed architecture evolved, it resulted in expensive storage and operational costs to maintain these separate environments.
  • Analytics Database: The analytics team used data sourced from other transactional systems and denormalized data. The team had their own extract, transform, and load (ETL) pipeline, using Alteryx with a visualization tool.

The following migration tenets led to project success:

  • Prioritize by business functionality.
  • Apply best practices when building a modern data architecture from Day 1.
  • Move only required data, canonicalize the data, and store it in the most optimal format in the target. Remove data redundancy as much as possible. Mark the scope for future optimization when changes are intrusive.
  • Build the data architecture while keeping data formats, volumes, governance, and security in mind.
  • Simplify ELT and processing jobs by categorizing the jobs as rehosted, rewritten, and retired. Finalize canonical data format, transformation, enrichment, compression, and storage format as Parquet.
  • Rehost machine learning (ML) jobs that were critical for business.
  • Work backward to achieve our goals, and clear roadblocks and alter decisions to move forward.
  • Use serverless options as the first choice and pay per use. Assess the cost and effort of rearchitecting to select the right approach. Run a proof of concept to validate this for each component and service.

Strategies applied to succeed in this migration:

  • Team – We created a unified team with people from data engineering, analytics, and data science as part of the analytics migration project. Site reliability engineering (SRE) and application teams were involved when critical decisions were needed regarding data or timeline for alignment. The analytics, data engineering, and data science teams spent considerable time planning, understanding the code, and iteratively looking at the existing data sources, data pipelines, and processing jobs. AWS team with partner team from Minfy Technologies helped BMS arrive at a migration plan after a proof of concept for each of the components in data ingestion, data processing, data warehouse, ML, and analytics dashboards.
  • Workshops – The AWS team conducted a series of workshops and immersion days, and coached the BMS team on the technology and best practices to deploy the analytics services. The AWS team helped BMS explore the configuration and benefits of the migration approach for each scenario (data migration, data pipeline, data processing, visualization, and machine learning) via proof-of-concepts (POCs). The team captured the changes required in the existing code for migration. BMS team also got acquainted with the following AWS services:
  • Proof of concept – The BMS team, with help from the partner and AWS team, implemented multiple proofs of concept to validate the migration approach:
    • Performed batch processing of Spark jobs in Amazon EMR, in which we checked the runtime, required code changes, and cost.
    • Ran clickstream analysis jobs in Amazon EMR, testing the end-to-end pipeline. Team conducted proofs of concept on AWS IoT Core for MQTT protocol and streaming to Amazon S3.
    • Migrated ML models to Amazon SageMaker and orchestrated with Amazon MWAA.
    • Created sample QuickSight reports and dashboards, in which features and time to build were assessed.
    • Configured for key scenarios for Amazon Redshift, in which time for loading data, query performance, and cost were assessed.
  • Effort vs. cost analysis – Team performed the following assessments:
    • Compared the ingestion pipelines, the difference in data structure in each store, the basis of the current business need for the data source, the activity for preprocessing the data before migration, data migration to Amazon S3, and change data capture (CDC) from the migrated applications in AWS.
    • Assessed the effort to migrate approximately 200 jobs, determined which jobs were redundant or needed improvement from a functional perspective, and completed a migration list for the target state. Because modernizing the MQTT workflow code to serverless was time-consuming, we decided to rehost it on Amazon Elastic Compute Cloud (Amazon EC2) and defer modernization to Amazon Kinesis to the next phase.
    • Reviewed over 400 reports and dashboards, prioritized development in phases, and reassessed business user needs.

AWS cloud services chosen for proposed architecture:

  • Data lake – We used Amazon S3 as the data lake to store the single truth of information for all raw and processed data, thereby reducing the copies of data storage and storage costs.
  • Ingestion – Because we had multiple sources of truth in the current architecture, we arrived at a common structure before migration to Amazon S3, and existing pipelines were modified to do preprocessing. These one-time preprocessing jobs were run in Cloudera, because the source data was on-premises, and on Amazon EMR for data in the cloud. We designed new data pipelines for ingestion from transactional systems on the AWS cloud using AWS Glue ETL.
  • Processing – Processing jobs were segregated based on runtime into two categories: batch and near-real time. Batch processes were further divided into transient Amazon EMR clusters with varying runtimes and Hadoop application requirements like HBase. Near-real-time jobs were provisioned in an Amazon EMR permanent cluster for clickstream analytics, and a data pipeline from transactional systems. We adopted a serverless approach using AWS Glue ETL for new data pipelines from transactional systems on the AWS cloud.
  • Data warehouse – We chose Amazon Redshift as our data warehouse, and planned on how the data would be distributed based on query patterns.
  • Visualization – We built the reports in Amazon QuickSight in phases and prioritized them based on business demand. We discussed with business users their current needs and identified the immediate reports required. We defined the phases of report and dashboard creation and built the reports in Amazon QuickSight. We plan to use embedded reports for external users in the future.
  • Machine learning – Custom ML models were deployed on Amazon SageMaker. Existing Airflow DAGs were migrated to Amazon MWAA.
  • Governance, security, and compliance – Governance with AWS Lake Formation was adopted from Day 1. We configured the AWS Glue Data Catalog to reference data used as sources and targets. We had to comply with Payment Card Industry (PCI) guidelines because payment information was in the data lake, so we ensured the necessary security policies.

Solution overview

BMS modern data architecture

The following diagram illustrates our modern data architecture.

The architecture includes the following components:

  1. Source systems – These include the following:
    • Data from transactional systems stored in MariaDB (booking and transactions).
    • User interaction clickstream data via Kafka consumers to DataOps MariaDB.
    • Members and seat allocation information from MongoDB.
    • SQL Server for specific offers and payment information.
  2. Data pipeline – Spark jobs on an Amazon EMR permanent cluster process the clickstream data from Kafka clusters.
  3. Data lake – Data from source systems was stored in their respective Amazon S3 buckets, with prefixes for optimized data querying. For Amazon S3, we followed a hierarchy to store raw, summarized, and team- or service-related data in different parent folders as per the source and type of data. Lifecycle policies were added to logs and temp folders of different services as per the teams’ requirements.
  4. Data processing – Transient Amazon EMR clusters are used for processing data into a curated format for the audience, personalization, and analytics teams. Small file merger jobs merge the clickstream data to a larger file size, which saved costs for one-time queries.
  5. Governance – AWS Lake Formation enables the usage of AWS Glue crawlers to capture the schema of data stored in the data lake and version changes in the schema. The Data Catalog and security policy in AWS Lake Formation enable access to data for roles and users in Amazon Redshift, Amazon Athena, Amazon QuickSight, and data science jobs. AWS Glue ETL jobs load the processed data to Amazon Redshift at scheduled intervals.
  6. Queries – The analytics team used Amazon Athena to perform one-time queries raised from business teams on the data lake. Because report development is in phases, Amazon Athena was used for exporting data.
  7. Data warehouse – Amazon Redshift was used as the data warehouse, where the reports for the sales teams, management, and third parties (i.e., theaters and events) are processed and stored for quick retrieval. Views to analyze the total sales, movie sale trends, member behavior, and payment modes are configured here. We use materialized views for denormalized tables, different schemas for metadata, and transactional and behavior data.
  8. Reports – We used Amazon QuickSight reports for various business, marketing, and product use cases.
  9. Machine learning – Some of the models deployed on Amazon SageMaker are as follows:
    • Content popularity – Decides the recommended content for users.
    • Live event popularity – Calculates the popularity of live entertainment events in different regions.
    • Trending searches – Identifies trending searches across regions.

Walkthrough

Migration execution steps

We standardized tools, services, and processes for data engineering, analytics, and data science:

  • Data lake
    • Identified the source data to be migrated from Archival DB, BigQuery, TiDB, and the analytics database.
    • Built a canonical data model that catered to multiple business teams and reduced the copies of data, and therefore storage and operational costs. Modified existing jobs to facilitate migration to a canonical format.
    • Identified the source systems, capacity required, anticipated growth, owners, and access requirements.
    • Ran the bulk data migration to Amazon S3 from various sources.
  • Ingestion
    • Transaction systems – Retained the existing Kafka queues and consumers.
    • Clickstream data – Successfully conducted a proof of concept to use AWS IoT Core for MQTT protocol. But because we needed to make changes in the application to publish to AWS IoT Core, we decided to implement it as part of mobile application modernization at a later time. We decided to rehost the MQTT server on Amazon EC2.
  • Processing
  • Listed the data pipelines relevant to business and migrated them with minimal modification.
  • Categorized workloads into critical jobs, redundant jobs, or jobs that can be optimized:
    • Spark jobs were migrated to Amazon EMR.
    • HBase jobs were migrated to Amazon EMR with HBase.
    • Metadata stored in Hive-based jobs were modified to use the AWS Glue Data Catalog.
    • NiFi jobs were simplified and rewritten in Spark run in Amazon EMR.
  • Amazon EMR clusters were configured with one persistent cluster for streaming the clickstream and personalization workloads. We used multiple transient clusters for running all other Spark ETL or processing jobs. We used Spot Instances for task nodes to save costs. We optimized data storage with specific jobs that merge small files and convert them to compressed file formats.
  • AWS Glue crawlers identified new data in Amazon S3. AWS Glue ETL jobs transformed and uploaded processed data to the Amazon Redshift data warehouse.
  • Data warehouse
    • Defined the data warehouse schema by categorizing the critical reports required by the business, keeping in mind the workload and reports required in future.
    • Defined the staging area for incremental data loaded into Amazon Redshift, created materialized views, and tuned queries based on usage. The transaction and primary metadata are stored in Amazon Redshift to cater to all data analysis and reporting requirements. We created materialized views and denormalized tables in Amazon Redshift to use as data sources for Amazon QuickSight dashboards and segmentation jobs, respectively.
    • Optimally used the Amazon Redshift cluster by loading the last two years of data into Amazon Redshift, and used Amazon Redshift Spectrum to query historical data through external tables. This helped balance the usage and cost of the Amazon Redshift cluster.
  • Visualization
    • Amazon QuickSight dashboards were created for the sales and marketing team in Phase 1:
      • Sales summary report – An executive summary dashboard to get an overview of sales across the country by region, city, movie, theatre, genre, and more.
      • Live entertainment – A dedicated report for live entertainment vertical events.
      • Coupons – A report for coupons purchased and redeemed.
      • BookASmile – A dashboard to analyze the data for BookASmile, a charity initiative.
  • Machine learning
    • Listed the ML workloads to be migrated based on current business needs.
    • Priority ML processing jobs were deployed on Amazon EMR. Models were modified to use Amazon S3 as source and target, and new APIs were exposed to use the functionality. ML models were deployed on Amazon SageMaker for movies, live event clickstream analysis, and personalization.
    • Existing artifacts in Airflow orchestration were migrated to Amazon MWAA.
  • Security
    • AWS Lake Formation was the foundation of the data lake, with the AWS Glue Data Catalog as the foundation for the central catalog for the data stored in Amazon S3. This provided access to the data by various functionalities, including the audience, personalization, analytics, and data science teams.
    • Personally identifiable information (PII) and payment data were stored in the data lake and data warehouse, so we had to comply with PCI guidelines. Encryption of data at rest and in transit was considered and configured at each service level (Amazon S3, AWS Glue Data Catalog, Amazon EMR, AWS Glue, Amazon Redshift, and QuickSight). Clear roles, responsibilities, and access permissions for different user groups and privileges were listed and configured in AWS Identity and Access Management (IAM) and individual services.
    • Existing single sign-on (SSO) integration with Microsoft Active Directory was used for Amazon QuickSight user access.
  • Automation
    • We used AWS CloudFormation for the creation and modification of all the core and analytics services.
    • AWS Step Functions was used to orchestrate Spark jobs on Amazon EMR.
    • Scheduled jobs were configured in AWS Glue for uploading data in Amazon Redshift based on business needs.
    • Monitoring of the analytics services was done using Amazon CloudWatch metrics, and right-sizing of instances and configuration was achieved. Spark job performance on Amazon EMR was analyzed using the native Spark logs and Spark user interface (UI).
    • Lifecycle policies were applied to the data lake to optimize the data storage costs over time.

Benefits of a modern data architecture

A modern data architecture offered us the following benefits:

  • Scalability – We moved from a fixed infrastructure to the minimal infrastructure required, with configuration to scale on demand. Services like Amazon EMR and Amazon Redshift enable us to do this with just a few clicks.
  • Agility – We use purpose-built managed services instead of reinventing the wheel. Automation and monitoring were key considerations, which enable us to make changes quickly.
  • Serverless – Adoption of serverless services like Amazon S3, AWS Glue, Amazon Athena, AWS Step Functions, and AWS Lambda support us when our business has sudden spikes with new movies or events launched.
  • Cost savings – Our storage size was reduced by 90%. Our overall spend on analytics and ML was reduced by 80%.

Conclusion

In this post, we showed you how a modern data architecture on AWS helped BMS to easily share data across organizational boundaries. This allowed BMS to make decisions with speed and agility at scale; ensure compliance via unified data access, security, and governance; and to scale systems at a low cost without compromising performance. Working with the AWS and Minfy Technologies teams helped BMS choose the correct technology services and complete the migration in four months. BMS achieved the scalability and cost-optimization goals with this updated architecture, which has set the stage for innovation using graph databases and enhanced our ML projects to improve customer experience.


About the Authors

Mahesh Vandi Chalil is Chief Technology Officer at BookMyShow, India’s leading entertainment destination. Mahesh has over two decades of global experience and is passionate about building scalable products that delight customers, keeping innovation as the top goal and motivating his team to constantly aspire to it. Mahesh invests his energies in creating and nurturing the next generation of technology leaders and entrepreneurs, both within the organization and outside of it. He is a proud husband and father of two daughters, and plays cricket during his leisure time.

Priya Jathar is a Solutions Architect working in the Digital Native Business segment at AWS. She has more than two decades of IT experience, with expertise in application development, databases, and analytics. She is a builder who enjoys innovating with new technologies to achieve business goals, and currently helps customers migrate, modernize, and innovate in the cloud. In her free time she likes to paint, and hone her gardening and cooking skills.

Vatsal Shah is a Senior Solutions Architect at AWS based out of Mumbai, India. He has more than nine years of industry experience, including leadership roles in product engineering, SRE, and cloud architecture. He currently focuses on enabling large startups to streamline their cloud operations and help them scale on the cloud. He also specializes in AI and Machine Learning use cases.

Migrate Google BigQuery to Amazon Redshift using AWS Schema Conversion tool (SCT)

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/migrate-google-bigquery-to-amazon-redshift-using-aws-schema-conversion-tool-sct/

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. Using Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers. The following are just some of the notable capabilities:

  • Amazon Redshift seamlessly integrates with broader analytics services on AWS. This enables you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. Amazon Redshift lets you build lake house architectures and then perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning (ML), and more.
  • You don’t need to worry about workloads, such as ETL, dashboards, ad-hoc queries, and so on, interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
  • When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get one hour of free concurrency scaling capacity for 24 hours of usage. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
  • Amazon Redshift is easy-to-use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
  • Fault Tolerance is inbuilt. All of the data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Any hardware failures are automatically replaced.
  • Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, and serverless web services-based or event-driven applications and so on.
  • Redshift ML makes it easy for data scientists to create, train, and deploy ML models using familiar SQL. They can also run predictions using SQL.
  • Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row and column level security controls on sensitive data, and so on.
  • Amazon Redshift integrates seamlessly with other AWS services and third-party tools. You can move, transform, load, and query large datasets quickly and reliably.

In this post, we provide a walkthrough for migrating a data warehouse from Google BigQuery to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT is a service that makes heterogeneous database migrations predictable by automatically converting the majority of the database code and storage objects to a format that is compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. Furthermore, AWS SCT can scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your BigQuery project. First, we create an Amazon Redshift database into which BigQuery data is migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert BigQuery schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from BigQuery, upload it into the S3 bucket, and then copy to Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

  1. A workstation with AWS SCT, Amazon Corretto 11, and Amazon Redshift drivers.
    1. You can use an Amazon Elastic Compute Cloud (Amazon EC2) instance or your local desktop as a workstation. In this walkthrough, we’re using an Amazon EC2 Windows instance. To create it, use this guide.
    2. To download and install AWS SCT on the EC2 instance that you previously created, use this guide.
    3. Download the Amazon Redshift JDBC driver from this location.
    4. Download and install Amazon Corretto 11.
  2. A GCP service account that AWS SCT can use to connect to your source BigQuery project.
    1. Grant BigQuery Admin and Storage Admin roles to the service account.
    2. Copy the Service account key file, which was created in the Google cloud management console, to the EC2 instance that has AWS SCT.
    3. Create a Cloud Storage bucket in GCP to store your source data during migration.

This walkthrough covers the following steps:

  • Create an Amazon Redshift Serverless Workgroup and Namespace
  • Create the AWS S3 Bucket and Folder
  • Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT
    • Connecting to the Google BigQuery Source
    • Connect to the Amazon Redshift Target
    • Convert BigQuery schema to an Amazon Redshift
    • Analyze the assessment report and address the action items
    • Apply converted schema to target Amazon Redshift
  • Migrate data using AWS SCT data extraction agents
    • Generating Trust and Key Stores (Optional)
    • Install and start data extraction agent
    • Register data extraction agent
    • Add virtual partitions for large tables (Optional)
    • Create a local migration task
    • Start the Local Data Migration Task
  • View Data in Amazon Redshift

Create an Amazon Redshift Serverless Workgroup and Namespace

In this step, we create an Amazon Redshift Serverless workgroup and namespace. A workgroup is a collection of compute resources, and a namespace is a collection of database objects and users. To isolate workloads and manage different resources in Amazon Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create the Amazon Redshift Serverless workgroup and namespace (a programmatic alternative using the AWS SDK for Python is sketched after these steps):

  • Navigate to the Amazon Redshift console.
  • In the upper right, choose the AWS Region that you want to use.
  • Expand the Amazon Redshift pane on the left and choose Redshift Serverless.
  • Choose Create Workgroup.
  • For Workgroup name, enter a name that describes the compute resources.
  • Verify that the VPC is the same VPC as the EC2 instance with AWS SCT.
  • Choose Next.

  • For Namespace name, enter a name that describes your dataset.
  • In the Database name and password section, select the Customize admin user credentials checkbox.
    • For Admin user name, enter a username of your choice, for example awsuser.
    • For Admin user password, enter a password of your choice, for example MyRedShiftPW2022.
  • Choose Next. Note that data in Amazon Redshift Serverless namespace is encrypted by default.
  • In the Review and Create page, choose Create.
  • Create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace, as described in the following steps. Note that there can only be one default IAM role.
    • Navigate to the Amazon Redshift Serverless Dashboard.
    • Under Namespaces / Workgroups, choose the namespace that you just created.
    • Navigate to Security and encryption.
    • Under Permissions, choose Manage IAM roles.
    • Navigate to Manage IAM roles. Then, choose the Manage IAM roles drop-down and choose Create IAM role.
    • Under Specify an Amazon S3 bucket for the IAM role to access, choose one of the following methods:
      • Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with a name starting with redshift.
      • Choose Any Amazon S3 bucket to allow the created IAM role to access all of the S3 buckets.
      • Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
    • Choose Create IAM role as default. Amazon Redshift automatically creates and sets the IAM role as default.
  • Capture the Endpoint for the Amazon Redshift Serverless workgroup that you just created.
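
If you prefer to script this setup instead of using the console, the following is a minimal sketch using the AWS SDK for Python (Boto3). It reuses the example credentials from the walkthrough; the namespace and workgroup names, Region, and base capacity are placeholder assumptions, and depending on your networking you may also need to pass subnetIds and securityGroupIds so the workgroup lands in the same VPC as the AWS SCT workstation.

import boto3

# Placeholder names for illustration only
NAMESPACE = "bq-migration-namespace"
WORKGROUP = "bq-migration-workgroup"

client = boto3.client("redshift-serverless", region_name="us-east-1")

# Create the namespace (database objects and users)
client.create_namespace(
    namespaceName=NAMESPACE,
    adminUsername="awsuser",
    adminUserPassword="MyRedShiftPW2022",
    dbName="dev",
)

# Create the workgroup (compute resources); pass subnetIds/securityGroupIds
# here if you need to pin it to a specific VPC
client.create_workgroup(
    workgroupName=WORKGROUP,
    namespaceName=NAMESPACE,
    baseCapacity=32,  # RPUs; adjust to your workload
    publiclyAccessible=False,
)

# Capture the endpoint once the workgroup has finished creating
# (you may need to poll until the workgroup status is AVAILABLE)
endpoint = client.get_workgroup(workgroupName=WORKGROUP)["workgroup"].get("endpoint")
print(endpoint)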

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create the S3 bucket (a scripted alternative is sketched after these steps):

  • Navigate to the Amazon S3 console
  • Choose Create bucket. The Create bucket wizard opens.
  • For Bucket name, enter a unique DNS-compliant name for your bucket (e.g., uniquename-bq-rs). See rules for bucket naming when choosing a name.
  • For AWS Region, choose the region in which you created the Amazon Redshift Serverless workgroup.
  • Choose Create bucket.
  • In the Amazon S3 console, navigate to the S3 bucket that you just created (e.g., uniquename-bq-rs).
  • Choose Create folder to create a new folder.
  • For Folder name, enter incoming and choose Create Folder.
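
The same staging bucket and folder can also be created with a few lines of Boto3, as sketched below. The bucket name and Region are placeholders; replace them with your own values.

import boto3

REGION = "us-east-1"         # same Region as the Redshift Serverless workgroup
BUCKET = "uniquename-bq-rs"  # replace with a globally unique, DNS-compliant name

s3 = boto3.client("s3", region_name=REGION)

# Create the staging bucket (us-east-1 must omit the LocationConstraint)
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET)
else:
    s3.create_bucket(
        Bucket=BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# S3 has no real folders; a zero-byte object with a trailing slash acts as one
s3.put_object(Bucket=BUCKET, Key="incoming/")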

Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT

To convert BigQuery schema to the Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that we created previously, and then launch AWS SCT.

Follow these steps using AWS SCT:

Connect to the BigQuery Source

  • From the File Menu choose Create New Project.
  • Choose a location to store your project files and data.
  • Provide a meaningful but memorable name for your project, such as BigQuery to Amazon Redshift.
  • To connect to the BigQuery source data warehouse, choose Add source from the main menu.
  • Choose BigQuery and choose Next. The Add source dialog box appears.
  • For Connection name, enter a name to describe the BigQuery connection. AWS SCT displays this name in the tree in the left panel.
  • For Key path, provide the path of the service account key file that was previously created in the Google cloud management console.
  • Choose Test Connection to verify that AWS SCT can connect to your source BigQuery project.
  • Once the connection is successfully validated, choose Connect.

Connect to the Amazon Redshift Target

Follow these steps to connect to Amazon Redshift:

  • In AWS SCT, choose Add Target from the main menu.
  • Choose Amazon Redshift, then choose Next. The Add Target dialog box appears.
  • For Connection name, enter a name to describe the Amazon Redshift connection. AWS SCT displays this name in the tree in the right panel.
  • For Server name, enter the Amazon Redshift Serverless workgroup endpoint captured previously.
  • For Server port, enter 5439.
  • For Database, enter dev.
  • For User name, enter the username chosen when creating the Amazon Redshift Serverless workgroup.
  • For Password, enter the password chosen when creating the Amazon Redshift Serverless workgroup.
  • Uncheck the Use AWS Glue box.
  • Choose Test Connection to verify that AWS SCT can connect to your target Amazon Redshift workgroup.
  • Choose Connect to connect to the Amazon Redshift target.

Alternatively, you can use connection values that are stored in AWS Secrets Manager.

Convert the BigQuery schema to the Amazon Redshift format

After the source and target connections are successfully made, you see the source BigQuery object tree on the left pane and target Amazon Redshift object tree on the right pane.

Follow these steps to convert BigQuery schema to the Amazon Redshift format:

  • On the left pane, right-click on the schema that you want to convert.
  • Choose Convert Schema.
  • A dialog box appears with a question, The objects might already exist in the target database. Replace?. Choose Yes.

Once the conversion is complete, you see a new schema created on the Amazon Redshift pane (right pane) with the same name as your BigQuery schema.

The sample schema that we used has 16 tables, 3 views, and 3 procedures. You can see these objects in the Amazon Redshift format in the right pane. AWS SCT converts all of the BigQuery code and data objects to the Amazon Redshift format. Furthermore, you can use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects. However, some of the objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from the Main view to the Assessment Report view.

The Summary tab shows objects that were converted automatically, and objects that weren’t converted automatically. Green represents automatically converted or with simple action items. Blue represents medium and complex action items that require manual intervention.

The Action Items tab shows the recommended actions for each conversion issue. If you select an action item from the list, AWS SCT highlights the object to which the action item applies.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database/schema show you the effort required to design and implement the recommendations for converting action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. AWS SCT takes some actions automatically while converting the schema to Amazon Redshift; objects with these actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it from the right pane, and you can also edit it as needed. In the following example, AWS SCT modifies the RECORD and JSON datatype columns in BigQuery table ncaaf_referee_data to the SUPER datatype in Amazon Redshift. The partition key in the ncaaf_referee_data table is converted to the distribution key and sort key in Amazon Redshift.

Apply converted schema to target Amazon Redshift

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and then choose Apply to database.

Migrate data from BigQuery to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this walkthrough, we show how to configure AWS SCT extraction agents to extract data from BigQuery and migrate to Amazon Redshift.

First, install the AWS SCT extraction agent on the same Windows instance that has AWS SCT installed. For better performance, we recommend that you use a separate Linux instance to install extraction agents if possible. For big datasets, you can use several data extraction agents to increase the data migration speed.

Generating trust and key stores (optional)

You can use Secure Socket Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all of the data passed between the applications remains private and integral. To use SSL communication, you must generate trust and key stores using AWS SCT. You can skip this step if you don’t want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

  1. In AWS SCT, navigate to Settings → Global Settings → Security.
  2. Choose Generate trust and key store.
  3. Enter the name and password for trust and key stores and choose a location where you would like to store them.
  4. Choose Generate.

Install and configure Data Extraction Agent

In the installation package for AWS SCT, you will find a sub-folder named agents (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

  1. For Listening port, enter the port number on which the agent listens. It is 8192 by default.
  2. For Add a source vendor, enter no, as you don’t need drivers to connect to BigQuery.
  3. For Add the Amazon Redshift driver, enter YES.
  4. For Enter Redshift JDBC driver file or files, enter the location where you downloaded Amazon Redshift JDBC drivers.
  5. For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data. The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.
  6. For Enable SSL communication, enter yes. Choose No here if you don’t want to use SSL.
  7. For Key store, enter the storage location chosen when creating the trust and key store.
  8. For Key store password, enter the password for the key store.
  9. For Enable client SSL authentication, enter yes.
  10. For Trust store, enter the storage location chosen when creating the trust and key store.
  11. For Trust store password, enter the password for the trust store.
*************************************************
*                                               *
*     AWS SCT Data Extractor Configuration      *
*              Version 2.0.1.666                *
*                                               *
*************************************************
User name: Administrator
User home: C:\Windows\system32\config\systemprofile
*************************************************
Listening port [8192]: 8192
Add a source vendor [YES/no]: no
No one source data warehouse vendors configured. AWS SCT Data Extractor cannot process data extraction requests.
Add the Amazon Redshift driver [YES/no]: YES
Enter Redshift JDBC driver file or files: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\redshift-jdbc42-2.1.0.9.jar
Working folder [C:\Windows\system32\config\systemprofile]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject
Enable SSL communication [YES/no]: YES
Setting up a secure environment at "C:\Windows\system32\config\systemprofile". This process will take a few seconds...
Key store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftKeyStore
Key store password:
Re-enter the key store password:
Enable client SSL authentication [YES/no]: YES
Trust store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftTrustStore
Trust store password:
Re-enter the trust store password:

Starting Data Extraction Agent(s)

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, double-click C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

  • On the computer that has the extraction agent installed, run the command for your operating system from a command prompt or terminal window.
  • To check the status of the agent, run the same command but replace start with status.
  • To stop an agent, run the same command but replace start with stop.
  • To restart an agent, run the RestartAgent.bat file.

Register the Data Extraction Agent

Follow these steps to register the Data Extraction Agent:

  1. In AWS SCT, change the view to Data Migration view (other) and choose + Register.
  2. In the connection tab:
    1. For Description, enter a name to identify the Data Extraction Agent.
    2. For Host name, if you installed the Data Extraction Agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT Data Extraction Agent is installed. It’s recommended to install the Data Extraction Agents on Linux for better performance.
    3. For Port, enter the number entered for the Listening Port when installing the AWS SCT Data Extraction Agent.
    4. Select the checkbox to use SSL (if using SSL) to encrypt the AWS SCT connection to the Data Extraction Agent.
  3. If you’re using SSL, then in the SSL Tab:
    1. For Trust store, choose the trust store name created when generating Trust and Key Stores (optionally, you can skip this if SSL connectivity isn’t needed).
    2. For Key Store, choose the key store name created when generating Trust and Key Stores (optionally, you can skip this if SSL connectivity isn’t needed).
  4. Choose Test Connection.
  5. Once the connection is validated successfully, choose Register.

Add virtual partitions for large tables (optional)

You can use AWS SCT to create virtual partitions to optimize migration performance. When virtual partitions are created, AWS SCT extracts the data in parallel for partitions. We recommend creating virtual partitions for large tables.

Follow these steps to create virtual partitions:

  1. Deselect all objects on the source database view in AWS SCT.
  2. Choose the table for which you would like to add virtual partitioning.
  3. Right-click on the table, and choose Add Virtual Partitioning.
  4. You can use List, Range, or Auto Split partitions. To learn more about virtual partitioning, refer to Use virtual partitioning in AWS SCT. In this example, we use Auto split partitioning, which generates range partitions automatically: you specify the start value, the end value, and the partition size, and AWS SCT determines the partitions. For a demonstration, on the Lineorder table:
    1. For Start Value, enter 1000000.
    2. For End Value, enter 3000000.
    3. For Interval, enter 1000000 to indicate partition size.
    4. Choose Ok.

You can see the partitions automatically generated under the Virtual Partitions tab. In this example, AWS SCT automatically created the following five partitions for the field:

    1. <1000000
    2. >=1000000 and <=2000000
    3. >2000000 and <=3000000
    4. >3000000
    5. IS NULL

Create a local migration task

To migrate data from BigQuery to Amazon Redshift, create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local migration task:

  1. In AWS SCT, under the schema name in the left pane, right-click on Standard tables.
  2. Choose Create Local task.
  3. There are three migration modes from which you can choose:
    1. Extract the source data and store it on a local PC or virtual machine (VM) where the agent runs.
    2. Extract the data and upload it to an S3 bucket.
    3. Extract, upload, and copy, which extracts the data to an S3 bucket and then copies it to Amazon Redshift. Choose this mode for this walkthrough.
  4. In the Advanced tab, for Google CS bucket folder enter the Google Cloud Storage bucket/folder that you created earlier in the GCP Management Console. AWS SCT stores the extracted data in this location.
  5. In the Amazon S3 Settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket that you created earlier. The AWS SCT data extraction agent uploads the data into the S3 bucket/folder before copying to Amazon Redshift.
  6. Choose Test Task.
  7. Once the task is successfully validated, choose Create.

Start the Local Data Migration Task

To start the task, choose the Start button in the Tasks tab.

  • First, the Data Extraction Agent extracts data from BigQuery into the GCP storage bucket.
  • Then, the agent uploads data to Amazon S3 and launches a copy command to move the data to Amazon Redshift.
  • At this point, AWS SCT has successfully migrated data from the source BigQuery table to the Amazon Redshift table.

View data in Amazon Redshift

After the data migration task executes successfully, you can connect to Amazon Redshift and validate the data.

Follow these steps to validate the data in Amazon Redshift:

  1. Navigate to the Amazon Redshift query editor v2.
  2. Double-click on the Amazon Redshift Serverless workgroup name that you created.
  3. Choose the Federated User option under Authentication.
  4. Choose Create Connection.
  5. Create a new editor by choosing the + icon.
  6. In the editor, write a query that selects from the schema name and table name/view name you would like to verify. Explore the data, run ad hoc queries, and create visualizations and charts.

The following is a side-by-side comparison between source BigQuery and target Amazon Redshift for the sports dataset that we used in this walkthrough.
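
If you prefer to script the validation instead of using the query editor, the following sketch runs a row count through the Amazon Redshift Data API against the serverless workgroup. The workgroup name and the schema prefix are hypothetical placeholders; the ncaaf_referee_data table is the one shown earlier in the conversion example.

import time
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# Placeholder identifiers for illustration
WORKGROUP = "bq-migration-workgroup"
DATABASE = "dev"
SQL = "SELECT COUNT(*) FROM mydataset.ncaaf_referee_data;"  # mydataset is a placeholder schema

# Submit the query to the Amazon Redshift Serverless workgroup
run = client.execute_statement(WorkgroupName=WORKGROUP, Database=DATABASE, Sql=SQL)

# Poll until the statement finishes
while True:
    desc = client.describe_statement(Id=run["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(2)

if desc["Status"] == "FINISHED":
    result = client.get_statement_result(Id=run["Id"])
    print("Row count:", result["Records"][0][0]["longValue"])
else:
    print("Query did not finish:", desc.get("Error"))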

Clean up any AWS resources that you created for this exercise

Follow these steps to terminate the EC2 instance:

  1. Navigate to the Amazon EC2 console.
  2. In the navigation pane, choose Instances.
  3. Select the check-box for the EC2 instance that you created.
  4. Choose Instance state, and then Terminate instance.
  5. Choose Terminate when prompted for confirmation.

Follow these steps to delete the Amazon Redshift Serverless workgroup and namespace:

  1. Navigate to Amazon Redshift Serverless Dashboard.
  2. Under Namespaces / Workgroups, choose the workgroup that you created.
  3. Under Actions, choose Delete workgroup.
  4. Select the checkbox Delete the associated namespace.
  5. Uncheck Create final snapshot.
  6. Enter delete in the delete confirmation text box and choose Delete.

Follow these steps to delete the S3 bucket

  1. Navigate to Amazon S3 console.
  2. Choose the bucket that you created.
  3. Choose Delete.
  4. To confirm deletion, enter the name of the bucket in the text input field.
  5. Choose Delete bucket.

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. By following this walkthrough, you learned how a data migration task extracts, downloads, and then migrates data from BigQuery to Amazon Redshift. The solution that we presented in this post performs a one-time migration of database objects and data. Data changes made in BigQuery while the migration is in progress won’t be reflected in Amazon Redshift, so put your BigQuery ETL jobs on hold during the migration, or replay them against Amazon Redshift afterward. Consider using the best practices for AWS SCT.

AWS SCT has some limitations when using BigQuery as a source. For example, AWS SCT can’t convert subqueries in analytic functions, geography functions, statistical aggregate functions, and so on. Find the full list of limitations in the AWS SCT user guide. We plan to address these limitations in future releases. Despite these limitations, you can use AWS SCT to automatically convert most of your BigQuery code and storage objects.

Download and install AWS SCT, sign in to the AWS Management Console, check out Amazon Redshift Serverless, and start migrating!


About the authors

Cedrick Hoodye is a Solutions Architect with a focus on database migrations using the AWS Database Migration Service (DMS) and the AWS Schema Conversion Tool (SCT) at AWS. He works on challenges related to database migrations and works closely with EdTech, Energy, and ISV business sector customers to help them realize the true potential of the DMS service. He has helped migrate hundreds of databases into the AWS Cloud using DMS and SCT.

Amit Arora is a Solutions Architect with a focus on Database and Analytics at AWS. He works with our Financial Technology and Global Energy customers and AWS certified partners to provide technical assistance and design customer solutions on cloud migration projects, helping customers migrate and modernize their existing databases to the AWS Cloud.

Jagadish Kumar is an Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and enabling customers to achieve success with their large-scale data projects.

Migrate from Snowflake to Amazon Redshift using AWS Glue Python shell

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/migrate-from-snowflake-to-amazon-redshift-using-aws-glue-python-shell/

As the most widely used cloud data warehouse, Amazon Redshift makes it simple and cost-effective to analyze your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to analyze exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics without having to manage the data warehouse infrastructure. It natively integrates with other AWS services, facilitating the process of building enterprise-grade analytics applications in a manner that is not only cost-effective, but also avoids point solutions.

We are continuously innovating and releasing new features of Amazon Redshift, enabling the implementation of a wide range of data use cases and meeting requirements with performance and scale. For example, Amazon Redshift Serverless allows you to run and scale analytics workloads without having to provision and manage data warehouse clusters. Other features that help power analytics at scale with Amazon Redshift include automatic concurrency scaling for read and write queries, automatic workload management (WLM) for concurrency scaling, automatic table optimization, the new RA3 instances with managed storage to scale cloud data warehouses and reduce costs, cross-Region data sharing, data exchange, and the SUPER data type to store semi-structured data or documents as values. For the latest feature releases for Amazon Redshift, see Amazon Redshift What’s New. In addition to improving performance and scale, you can also gain up to three times better price performance with Amazon Redshift than other cloud data warehouses.

To take advantage of the performance, security, and scale of Amazon Redshift, customers are looking to migrate their data from their existing cloud warehouse in a way that is both cost optimized and performant. This post describes how to migrate a large volume of data from Snowflake to Amazon Redshift using AWS Glue Python shell in a manner that meets both these goals.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all the capabilities needed for data integration, allowing you to analyze your data in minutes instead of weeks or months. AWS Glue supports the ability to use a Python shell job to run Python scripts as a shell, enabling you to author ETL processes in a familiar language. In addition, AWS Glue allows you to manage ETL jobs using AWS Glue workflows, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), and AWS Step Functions, automating and facilitating the orchestration of ETL steps.

Solution overview

The following architecture shows how an AWS Glue Python shell job migrates the data from Snowflake to Amazon Redshift in this solution.

Architecture

The solution is comprised of two stages:

  • Extract – The first part of the solution extracts data from Snowflake into an Amazon Simple Storage Service (Amazon S3) data lake
  • Load – The second part of the solution reads the data from the same S3 bucket and loads it into Amazon Redshift

For both stages, we connect the AWS Glue Python shell jobs to Snowflake and Amazon Redshift using database connectors for Python. The first AWS Glue Python shell job reads a SQL file from an S3 bucket to run the relevant COPY commands on the Snowflake database using Snowflake compute capacity and parallelism to migrate the data to Amazon S3. When this is complete, the second AWS Glue Python shell job reads another SQL file, and runs the corresponding COPY commands on the Amazon Redshift database using Redshift compute capacity and parallelism to load the data from the same S3 bucket.
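
As an illustration of this pattern (not the script shipped with the CloudFormation template), the first job’s logic might look like the following sketch. It assumes the snowflake-connector-python package is available to the job and that the connection details shown as placeholders are resolved at run time, for example from Secrets Manager.

import boto3
import snowflake.connector

# Placeholder values for illustration; the actual job resolves these from
# job parameters and Secrets Manager
SQL_SCRIPT_S3_BUCKET = "my-script-bucket"
SQL_SCRIPT_S3_KEY = "scripts/snowflake_unload.sql"

# Read the generated COPY INTO @stage commands from Amazon S3
s3 = boto3.client("s3")
script = s3.get_object(Bucket=SQL_SCRIPT_S3_BUCKET, Key=SQL_SCRIPT_S3_KEY)["Body"].read().decode("utf-8")

# Connect to Snowflake; the unload work runs on Snowflake compute
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
    database="SNOWFLAKE_SAMPLE_DATA",
    schema="TPCH_SF1",
)

try:
    cur = conn.cursor()
    # Run each COPY command; Snowflake parallelizes the unload to the stage
    for statement in filter(None, (s.strip() for s in script.split(";"))):
        cur.execute(statement)
finally:
    conn.close()

The second job follows the same shape, except that it connects to Amazon Redshift and runs the corresponding COPY commands to load the staged files from Amazon S3.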

Both jobs are orchestrated using AWS Glue workflows, as shown in the following screenshot. The workflow pushes data processing logic down to the respective data warehouses by running COPY commands on the databases themselves, minimizing the processing capacity required by AWS Glue to just the resources needed to run the Python scripts. The COPY commands load data in parallel both to and from Amazon S3, providing one of the fastest and most scalable mechanisms to transfer data from Snowflake to Amazon Redshift.

Because all heavy lifting around data processing is pushed down to the data warehouses, this solution is designed to provide a cost-optimized and highly performant mechanism to migrate a large volume of data from Snowflake to Amazon Redshift with ease.

Glue Workflow

The entire solution is packaged in an AWS CloudFormation template for simplicity of deployment and automatic provisioning of most of the required resources and permissions.

The high-level steps to implement the solution are as follows:

  1. Generate the Snowflake SQL file.
  2. Deploy the CloudFormation template to provision the required resources and permissions.
  3. Provide Snowflake access to newly created S3 bucket.
  4. Run the AWS Glue workflow to migrate the data.

Prerequisites

Before you get started, you can optionally build the latest version of the Snowflake Connector for Python package locally and generate the wheel (.whl) package. For instructions, refer to How to build.

If you don’t provide the latest version of the package, the CloudFormation template uses a pre-built .whl file that may not be on the most current version of Snowflake Connector for Python.

By default, the CloudFormation template migrates data from all tables in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database, which is a sample dataset provided by Snowflake when an account is created. The following stored procedure is used to dynamically generate the Snowflake COPY commands required to migrate the dataset to Amazon S3. It accepts the database name, schema name, and stage name as the parameters.

CREATE OR REPLACE PROCEDURE generate_copy(db_name VARCHAR, schema_name VARCHAR, stage_name VARCHAR)
   returns varchar not null
   language javascript
   as
   $$
var return_value = "";
/* List every table in the requested schema */
var sql_query = "select table_catalog, table_schema, lower(table_name) as table_name from " + DB_NAME + ".information_schema.tables where table_schema = '" + SCHEMA_NAME + "'";
   var sql_statement = snowflake.createStatement(
          {
          sqlText: sql_query
          }
       );
/* Creates result set */
var result_scan = sql_statement.execute();
/* Build one COPY INTO @stage command per table */
while (result_scan.next())  {
       return_value += "\n";
       return_value += "COPY INTO @";
       return_value += STAGE_NAME;
       return_value += "/";
       return_value += result_scan.getColumnValue(3);
       return_value += "/";
       return_value += "\n";
       return_value += "FROM ";
       return_value += result_scan.getColumnValue(1);
       return_value += "." + result_scan.getColumnValue(2);
       return_value += "." + result_scan.getColumnValue(3);
       return_value += "\n";
       return_value += "FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP)";
       return_value += "\n";
       return_value += "OVERWRITE = TRUE;";
       return_value += "\n";
       }
return return_value;
$$
;
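
For illustration, the generated unload script could be captured with a short helper like the following sketch. It assumes the procedure above was created in the session’s current database and schema and that the unload_to_s3 stage already exists; the connection details are placeholders.

import snowflake.connector

# Placeholder connection details for illustration
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
)

cur = conn.cursor()
# Call the stored procedure to build the COPY INTO commands for every table
# in the schema; the single-column result holds the generated script
cur.execute("CALL generate_copy('SNOWFLAKE_SAMPLE_DATA', 'TPCH_SF1', 'unload_to_s3')")
copy_script = cur.fetchone()[0]

# Persist the script so it can be uploaded to S3 for the AWS Glue job to read
with open("snowflake_unload.sql", "w") as f:
    f.write(copy_script)

conn.close()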

Deploy the required resources and permissions using AWS CloudFormation

You can use the provided CloudFormation template to deploy this solution. This template automatically provisions an Amazon Redshift cluster with your desired configuration in a private subnet, maintaining a high standard of security.

  1. Sign in to the AWS Management Console, preferably as admin user.
  2. Select your desired Region, preferably the same Region where your Snowflake instance is provisioned.
  3. Choose Launch Stack:
  4. Choose Next.
  5. For Stack name, enter a meaningful name for the stack, for example, blog-resources.

The Parameters section is divided into two subsections: Source Snowflake Infrastructure and Target Redshift Configuration.

  1. For Snowflake Unload SQL Script, keep the default value, which is the S3 location (URI) of a SQL file that migrates the sample data in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database.
  2. For Data S3 Bucket, enter a prefix for the name of the S3 bucket that is automatically provisioned to stage the Snowflake data, for example, sf-migrated-data.
  3. For Snowflake Driver, if applicable, enter the S3 location (URI) of the .whl package built earlier as a prerequisite. By default, it uses a pre-built .whl file.
  4. For Snowflake Account Name, enter your Snowflake account name.

You can use the following query in Snowflake to return your Snowflake account name:

SELECT CURRENT_ACCOUNT();
  1. For Snowflake Username, enter your user name to connect to the Snowflake account.
  2. For Snowflake Password, enter the password for the preceding user.
  3. For Snowflake Warehouse Name, enter the warehouse name for running the SQL queries.

Make sure the aforementioned user has access to the warehouse.

  1. For Snowflake Database Name, enter the database name. The default is SNOWFLAKE_SAMPLE_DATA.
  2. For Snowflake Schema Name, enter the schema name. The default is TPCH_SF1.

CFN Param Snowflake

  1. For VPC CIDR Block, enter the desired CIDR block for the VPC of the Redshift cluster. The default is 10.0.0.0/16.
  2. For Subnet 1 CIDR Block, enter the CIDR block of the first subnet. The default is 10.0.0.0/24.
  3. For Subnet 2 CIDR Block, enter the CIDR block of the second subnet. The default is 10.0.1.0/24.
  4. For Redshift Load SQL Script, keep the default value, which is the S3 location (URI) of a SQL file that loads the sample data from Amazon S3 into Amazon Redshift.

The following database view in Redshift is used to dynamically generate the Redshift COPY commands required to migrate the dataset from Amazon S3. It accepts the schema name as the filter criterion.

CREATE OR REPLACE VIEW v_generate_copy
AS
SELECT
    schemaname ,
    tablename  ,
    seq        ,
    ddl
FROM
    (
        SELECT
            table_id   ,
            schemaname ,
            tablename  ,
            seq        ,
            ddl
        FROM
            (
                --COPY TABLE
                SELECT
                    c.oid::bigint  as table_id   ,
                    n.nspname      AS schemaname ,
                    c.relname      AS tablename  ,
                    0              AS seq        ,
                    'COPY ' + n.nspname + '.' + c.relname + ' FROM ' AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r'
                --COPY TABLE continued                
                UNION                
                SELECT
                    c.oid::bigint as table_id   ,
                    n.nspname     AS schemaname ,
                    c.relname     AS tablename  ,
                    2             AS seq        ,
                    '''${' + '2}' + c.relname + '/'' iam_role ''${' + '1}'' gzip delimiter ''|'' EMPTYASNULL REGION ''us-east-1''' AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r'
                --END SEMICOLON                
                UNION                
                SELECT
                    c.oid::bigint as table_id  ,
                    n.nspname     AS schemaname,
                    c.relname     AS tablename ,
                    600000005     AS seq       ,
                    ';'           AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r' 
             )
        ORDER BY
            table_id  ,
            schemaname,
            tablename ,
            seq 
    );

SELECT ddl
FROM v_generate_copy
WHERE schemaname = 'tpch_sf1';
  1. For Redshift Database Name, enter your desired database name, for example, dev.
  2. For Number of Redshift Nodes, enter the desired compute nodes, for example, 2.
  3. For Redshift Node Type, choose the desired node type, for example, ra3.4xlarge.
  4. For Redshift Password, enter your desired password with the following constraints: it must be 8–64 characters in length, and contain at least one uppercase letter, one lowercase letter, and one number.
  5. For Redshift Port, enter the Amazon Redshift port number to connect to. The default port is 5439.

CFN Param Redshift 1 CFN Param Redshift 2

  1. Choose Next.
  2. Review and choose Create stack.

It takes around 5 minutes for the template to finish creating all resources and permissions. Most of the resources have the prefix of the stack name you specified for easy identification of the resources later. For more details on the deployed resources, see the appendix at the end of this post.

Create an IAM role and external Amazon S3 stage for Snowflake access to the data S3 bucket

For Snowflake to access the TargetDataS3Bucket created earlier by the CloudFormation template, you must create an AWS Identity and Access Management (IAM) role and an external Amazon S3 stage that grant Snowflake access to the S3 bucket. For instructions, refer to Configuring Secure Access to Amazon S3.

When you create an external stage in Snowflake, use the value for TargetDataS3Bucket on the Outputs tab of your deployed CloudFormation stack for the Amazon S3 URL of your stage.

CF Output

Make sure to name the external stage unload_to_s3 if you’re migrating the sample data using the default scripts provided in the CloudFormation template.
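
As an illustration, once the IAM role and a Snowflake storage integration are in place, the stage could be created from Python as sketched below. The connection details, the database and schema that hold the stage, and the storage integration name s3_int are placeholder assumptions; the URL comes from the TargetDataS3Bucket output of the CloudFormation stack.

import snowflake.connector

# Placeholder connection and object names for illustration
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
    database="my_database",   # a writable database; SNOWFLAKE_SAMPLE_DATA is read-only
    schema="public",
)

# s3_int is a hypothetical storage integration configured per the Snowflake
# documentation; replace <TargetDataS3Bucket> with the CloudFormation output value
conn.cursor().execute(
    """
    CREATE OR REPLACE STAGE unload_to_s3
      URL = 's3://<TargetDataS3Bucket>/'
      STORAGE_INTEGRATION = s3_int
    """
)
conn.close()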

Convert Snowflake tables to Amazon Redshift

You can simply run the following DDL statements to create TPCH_SF1 schema objects in Amazon Redshift. You can also use AWS Schema Conversion Tool (AWS SCT) to convert Snowflake custom objects to Amazon Redshift. For instructions on converting your schema, refer to Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool.

CREATE SCHEMA TPCH_SF1;
SET SEARCH_PATH to TPCH_SF1;
CREATE TABLE customer (
  c_custkey int8 not null ,
  c_name varchar(25) not null,
  c_address varchar(40) not null,
  c_nationkey int4 not null,
  c_phone char(15) not null,
  c_acctbal numeric(12,2) not null,
  c_mktsegment char(10) not null,
  c_comment varchar(117) not null,
  Primary Key(C_CUSTKEY)
) ;

CREATE TABLE lineitem (
  l_orderkey int8 not null ,
  l_partkey int8 not null,
  l_suppkey int4 not null,
  l_linenumber int4 not null,
  l_quantity numeric(12,2) not null,
  l_extendedprice numeric(12,2) not null,
  l_discount numeric(12,2) not null,
  l_tax numeric(12,2) not null,
  l_returnflag char(1) not null,
  l_linestatus char(1) not null,
  l_shipdate date not null ,
  l_commitdate date not null,
  l_receiptdate date not null,
  l_shipinstruct char(25) not null,
  l_shipmode char(10) not null,
  l_comment varchar(44) not null,
  Primary Key(L_ORDERKEY, L_LINENUMBER)
)  ;

CREATE TABLE nation (
  n_nationkey int4 not null,
  n_name char(25) not null ,
  n_regionkey int4 not null,
  n_comment varchar(152) not null,
  Primary Key(N_NATIONKEY)                                
) ;

CREATE TABLE orders (
  o_orderkey int8 not null,
  o_custkey int8 not null,
  o_orderstatus char(1) not null,
  o_totalprice numeric(12,2) not null,
  o_orderdate date not null,
  o_orderpriority char(15) not null,
  o_clerk char(15) not null,
  o_shippriority int4 not null,
  o_comment varchar(79) not null,
  Primary Key(O_ORDERKEY)
) ;

CREATE TABLE part (
  p_partkey int8 not null ,
  p_name varchar(55) not null,
  p_mfgr char(25) not null,
  p_brand char(10) not null,
  p_type varchar(25) not null,
  p_size int4 not null,
  p_container char(10) not null,
  p_retailprice numeric(12,2) not null,
  p_comment varchar(23) not null,
  PRIMARY KEY (P_PARTKEY)
) ;

CREATE TABLE partsupp (
  ps_partkey int8 not null,
  ps_suppkey int4 not null,
  ps_availqty int4 not null,
  ps_supplycost numeric(12,2) not null,
  ps_comment varchar(199) not null,
  Primary Key(PS_PARTKEY, PS_SUPPKEY)
) ;

CREATE TABLE region (
  r_regionkey int4 not null,
  r_name char(25) not null ,
  r_comment varchar(152) not null,
  Primary Key(R_REGIONKEY)                             
) ;

CREATE TABLE supplier (
  s_suppkey int4 not null,
  s_name char(25) not null,
  s_address varchar(40) not null,
  s_nationkey int4 not null,
  s_phone char(15) not null,
  s_acctbal numeric(12,2) not null,
  s_comment varchar(101) not null,
  Primary Key(S_SUPPKEY)
);

Run an AWS Glue workflow for data migration

When you’re ready to start the data migration, complete the following steps:

  1. On the AWS Glue console, choose Workflows in the navigation pane.
  2. Select the workflow to run (<stack name>snowflake-to-redshift-migration).
  3. On the Actions menu, choose Run.
  4. To check the status of the workflow, choose the workflow and on the History tab, select the Run ID and choose View run details.
    Glue Workflow Status
  5. When the workflow is complete, navigate to the Amazon Redshift console and launch the Amazon Redshift query editor v2 to verify the successful migration of data.
  6. Run the following query in Amazon Redshift to get row counts of all tables migrated from Snowflake to Amazon Redshift. Make sure to adjust the table_schema value accordingly if you’re not migrating the sample data.
SELECT tab.table_schema,
       tab.table_name,
       nvl(tinf.tbl_rows,0) tbl_rows,
       nvl(tinf.size,0) size
FROM svv_tables tab
LEFT JOIN svv_table_info tinf 
          on tab.table_schema = tinf.schema 
          and tab.table_name = tinf."table"
WHERE tab.table_type = 'BASE TABLE'
      and tab.table_schema in ('tpch_sf1')
ORDER BY tbl_rows;

Redshift Editor

  1. Run the following query in Snowflake to compare and validate the data (a sketch that automates the comparison follows the query):
USE DATABASE snowflake_sample_data;
SELECT  TABLE_CATALOG,
        TABLE_SCHEMA,
        TABLE_NAME,
        ROW_COUNT,
        BYTES AS SIZE,
        COMMENT
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'TPCH_SF1'
ORDER BY ROW_COUNT;
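
If you would rather automate the comparison, the following sketch pulls the row counts from both systems table by table and reports any mismatch. All connection details are placeholders, and the script must run from a host that can reach the Amazon Redshift cluster in its private subnet.

import redshift_connector
import snowflake.connector

# Placeholder connection details for illustration
rs = redshift_connector.connect(
    host="<redshift-cluster-endpoint>", port=5439,
    database="dev", user="awsuser", password="<password>",
)
sf = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_warehouse", database="SNOWFLAKE_SAMPLE_DATA", schema="TPCH_SF1",
)

TABLES = ["customer", "lineitem", "nation", "orders", "part", "partsupp", "region", "supplier"]

for table in TABLES:
    rs_cur = rs.cursor()
    rs_cur.execute(f"SELECT COUNT(*) FROM tpch_sf1.{table}")
    rs_count = rs_cur.fetchone()[0]

    sf_cur = sf.cursor()
    sf_cur.execute(f"SELECT COUNT(*) FROM {table}")
    sf_count = sf_cur.fetchone()[0]

    status = "OK" if rs_count == sf_count else "MISMATCH"
    print(f"{table}: redshift={rs_count} snowflake={sf_count} {status}")

rs.close()
sf.close()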

Snowflake Editor

Clean up

To avoid incurring future charges, delete the resources you created as part of the CloudFormation stack by navigating to the AWS CloudFormation console, selecting the stack blog-resources, and choosing Delete.

Conclusion

In this post, we discussed how to perform an efficient, fast, and cost-effective migration from Snowflake to Amazon Redshift. Migrations from one data warehouse environment to another can typically be very time-consuming and resource-intensive; this solution uses the power of cloud-based compute by pushing down the processing to the respective warehouses. Orchestrating this migration with the AWS Glue Python shell provides additional cost optimization.

With this solution, you can facilitate your migration from Snowflake to Amazon Redshift. If you’re interested in further exploring the potential of using Amazon Redshift, please reach out to your AWS Account Team for a proof of concept.

Appendix: Resources deployed by AWS CloudFormation

The CloudFormation stack deploys the following resources in your AWS account:

  • Networking resources – Amazon Virtual Private Cloud (Amazon VPC), subnets, ACL, and security group.
  • Amazon S3 bucket – This is referenced as TargetDataS3Bucket on the Outputs tab of the CloudFormation stack. This bucket holds the data being migrated from Snowflake to Amazon Redshift.
  • AWS Secrets Manager secrets – Two secrets in AWS Secrets Manager store credentials for Snowflake and Amazon Redshift.
  • VPC endpoints – The two VPC endpoints are deployed to establish a private connection from VPC resources like AWS Glue to services that run outside of the VPC, such as Secrets Manager and Amazon S3.
  • IAM roles – IAM roles for AWS Glue, Lambda, and Amazon Redshift. If the CloudFormation template is to be deployed in a production environment, you need to adjust the IAM policies so they’re not as permissive as presented in this post (which were set for simplicity and demonstration). Particularly, AWS Glue and Amazon Redshift don’t require all the actions granted in the *FullAccess policies, which would be considered overly permissive.
  • Amazon Redshift cluster – An Amazon Redshift cluster is created in a private subnet, which isn’t publicly accessible.
  • AWS Glue connection – The connection for Amazon Redshift makes sure that the AWS Glue job runs within the same VPC as Amazon Redshift. This also ensures that AWS Glue can access the Amazon Redshift cluster in a private subnet.
  • AWS Glue jobs – Two AWS Glue Python shell jobs are created (a sketch of how a job resolves its parameters follows this list):
    • <stack name>-glue-snowflake-unload – The first job runs the SQL scripts in Snowflake to copy data from the source database to Amazon S3. The Python script is available in S3. The Snowflake job accepts two parameters:
      • SQLSCRIPT – The Amazon S3 location of the SQL script to run in Snowflake to migrate data to Amazon S3. This is referenced as the Snowflake Unload SQL Script parameter in the input section of the CloudFormation template.
      • SECRET – The Secrets Manager ARN that stores Snowflake connection details.
    • <stack name>-glue-redshift-load – The second job runs another SQL script in Amazon Redshift to copy data from Amazon S3 to the target Amazon Redshift database. The Python script link is available in S3. The Amazon Redshift job accepts three parameters:
      • SQLSCRIPT – The Amazon S3 location of the SQL script to run in Amazon Redshift to migrate data from Amazon S3. If you provide custom SQL script to migrate the Snowflake data to Amazon S3 (as mentioned in the prerequisites), the file location is referenced as LoadFileLocation on the Outputs tab of the CloudFormation stack.
      • SECRET – The Secrets Manager ARN that stores Amazon Redshift connection details.
      • PARAMS – This includes any additional parameters required for the SQL script, including the Amazon Redshift IAM role used in the COPY commands and the S3 bucket staging the Snowflake data. Multiple parameter values can be provided separated by a comma.
  • AWS Glue workflow – The orchestration of Snowflake and Amazon Redshift AWS Glue Python shell jobs is managed via an AWS Glue workflow. The workflow <stack name>snowflake-to-redshift-migration runs later for actual migration of data.
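
The job scripts themselves are provided in Amazon S3 by the template. Purely as an illustration of the pattern, a Glue Python shell job might resolve its parameters and the stored secret like the following sketch; the secret’s key names (such as username) are assumptions, not the template’s exact schema.

import json
import sys

import boto3
from awsglue.utils import getResolvedOptions

# Resolve the job arguments passed by the AWS Glue workflow
args = getResolvedOptions(sys.argv, ["SQLSCRIPT", "SECRET"])

# Fetch the connection details stored by the CloudFormation template
secret_value = boto3.client("secretsmanager").get_secret_value(SecretId=args["SECRET"])
credentials = json.loads(secret_value["SecretString"])  # key names depend on the secret's schema

# Download the SQL script to run against the target warehouse
bucket, key = args["SQLSCRIPT"].replace("s3://", "").split("/", 1)
sql_text = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

print(f"Loaded {len(sql_text)} characters of SQL for user {credentials.get('username', '<unknown>')}")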

About the Authors

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She supports customers in validating analytics solutions by architecting proof of concept workloads designed to meet their specific needs.

Seamlessly migrate on-premises legacy workloads using a strangler pattern

Post Syndicated from Arnab Ghosh original https://aws.amazon.com/blogs/architecture/seamlessly-migrate-on-premises-legacy-workloads-using-a-strangler-pattern/

Replacing a complex workload can be a huge job. Sometimes you need to gradually migrate complex workloads but still keep parts of the on-premises system to handle features that haven’t been migrated yet. Gradually replacing specific functions with new applications and services is known as a “strangler pattern.”

When you use a strangler pattern, monolithic workloads are broken down and individual services are scheduled for rehosting, replatforming, and even retirement. As you do this, there is value in having a uniform point of access for the various services, as well as a uniform level of security and a way to manage workloads in the cloud and on-premises.

This blog post covers how to implement a strangler architecture pattern for on-premises legacy workloads to create uniform access and security across your workloads. We walk you through how to implement this pattern, which uses an API facade to ensure your customers continue to see and use the same interface while you “strangle” the monolith by incrementally creating and deploying new microservices in the cloud.

Solution overview

Figure 1. API facade with connectivity to an on-premises monolith

This solution uses Amazon API Gateway to create an API facade for your on-premises monolith application. As you deploy new microservices on AWS, you can create new API resources/methods under the same API Gateway endpoint (to learn more about creating REST APIs, see Creating a REST API in Amazon API Gateway).

AWS Direct Connect, along with API Gateway private integrations that use virtual private cloud (VPC) links, provide secure network connectivity to your on-premises services.

The following sections provide more detail on these services and their functions.

On-premises connectivity

Direct Connect provides a dedicated connection between the on-premises services and AWS. This allows you to implement a hybrid workload by securely connecting the API Gateway and the application deployed on your on-premises environment.

You can use an AWS Site-to-Site VPN to connect to on-premises environments, but Direct Connect is preferred for its reduced latency and dedicated bandwidth.

API facade

API Gateway creates the facade for customer APIs/services (the monolith and the new microservices) deployed in the on-premises environment as well as the ones migrated to AWS.

API Gateway uses private integrations to securely connect to on-premises services and resources launched into Amazon Virtual Private Cloud (Amazon VPC) like re-hosted microservices running on Amazon Elastic Compute Cloud (Amazon EC2) or modernized applications running on container services like Amazon Elastic Container Service (Amazon ECS).

The Network Load Balancer is part of the private integration for API Gateway. It acts as a high throughput, high availability resource that fronts the API backends deployed either in the on-premises environment or Amazon VPC. Network Load Balancers support different target types. Use the IP target type to target on-premises servers hosting legacy workloads and use the instance and Application Load Balancer target types for applications hosted within AWS environments.
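
To make the wiring concrete, here is a hedged Boto3 sketch of creating a VPC link to the Network Load Balancer and attaching a private integration to a REST API method. The ARN, API ID, resource ID, and backend URI are placeholders, and the GET method is assumed to already exist on the resource.

import boto3

apigw = boto3.client("apigateway", region_name="us-east-1")

# Placeholder identifiers for illustration
NLB_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/legacy-nlb/abc123"
REST_API_ID = "a1b2c3d4e5"
RESOURCE_ID = "f6g7h8"

# Create a VPC link that targets the Network Load Balancer fronting the service
vpc_link = apigw.create_vpc_link(
    name="onprem-monolith-link",
    targetArns=[NLB_ARN],
)

# Route GET requests on the resource through the private integration;
# assumes put_method has already defined GET on this resource
apigw.put_integration(
    restApiId=REST_API_ID,
    resourceId=RESOURCE_ID,
    httpMethod="GET",
    type="HTTP_PROXY",
    integrationHttpMethod="GET",
    connectionType="VPC_LINK",
    connectionId=vpc_link["id"],
    uri="http://legacy.example.internal/orders",
)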

Security

Use AWS WAF (Web Application Firewall) for API Gateway REST endpoints. It provides the ability to monitor and block HTTP and HTTPS traffic according to stateless and stateful rule groups.

Amazon GuardDuty provides threat detection across your microservices.

(Optional) Enable AWS Shield Advanced for Amazon CloudFront distributions that are configured for regional API Gateway endpoints. This provides added distributed denial of service (DDoS) protection beyond AWS Shield Standard, which is automatically included.

Logging and monitoring

AWS X-Ray and Amazon CloudWatch give you visibility into your requests and assorted service metrics.

AWS CloudTrail allows you to track interactions with your infrastructure through the AWS control plane APIs.

Strangler process

The strangler pattern allows you to smoothly migrate resources from on-premises environments by placing a cloud-based API facade in front of them. The next sections show an example scenario of what a strangler pattern-based migration process could look like for a given workload.

Putting a facade in front of the monolith

First, we add our API Gateway facade in front of our on-premises monolith. The API Gateway acts as a facade to the customer APIs/services (the monolith and the new microservices) deployed in the on-premises environment as well as the ones migrated to AWS. This means that as the on-premises monolith application is strangled and new microservices are created, the new services are added to the API Gateway so that they can be consumed along with the monolith services, as shown in Figure 2.

Figure 2. API facade with connectivity to an on-premises monolith

Breaking up the monolith behind the facade

Next, let’s break up our monolith into component microservices, as shown in Figure 3. This allows us more flexibility in deciding how best to migrate individual services. With the strangler pattern, we can incrementally update sections of code and functionality of the monolith (extract as a microservice with minimum dependency to the monolith application) without needing to completely refactor the entire application. Eventually, all the monolith’s services and components will be migrated, and the legacy system can be retired. Monoliths can be decomposed by business capability, subdomain, transactions, or based on the teams that maintain them.

Figure 3. Microservices A and B being decomposed from a legacy monolith; component C, which is scheduled for retirement, is not broken out into a microservice

Migrating microservices into the cloud

With our monolith broken up into its component microservices, we can begin moving the microservices into the cloud.

In our example, we rehost microservice A and refactor microservice B.

  • Rehosting a microservice: Here, we take microservice A and rehost it from on-premises virtual machines onto EC2 instances in AWS. We have deployed the microservice across multiple Availability Zones with Amazon EC2 Auto Scaling group. As you see from Figure 4, even after deployment to AWS, microservice A continues to have limited dependency on the monolith application. This dependency will eventually be removed as the strangling process is completed and the monolith is completely decomposed.
Figure 4. Microservice A being rehosted onto EC2 instances within an Amazon EC2 Auto Scaling Group

  • Refactoring a microservice: With functionality broken out across microservices, we can opt to refactor certain services using containerization and orchestration platforms like Amazon ECS. Here, we take microservice B and containerize it using Docker and then use Amazon ECS to deploy it.
Figure 5. Microservice B after being refactored, containerized, and moved onto Amazon ECS

Retire the monolith

Finally, when you’re ready (that is, application users have all been migrated to the new microservice endpoints), you can retire the legacy monolith application. Figure 6 shows the end state, where the monolith application is retired along with the hybrid connectivity and the API facade now serves the migrated microservices. At this point, you can also decide to retire any remaining on-premises application components.

Figure 6. Microservices A and B after the legacy monolith is retired and on-premises connectivity has ceased

Conclusion

In this blog post, we showed you how to use a strangler pattern to smoothly transition on-premises workloads through a hybrid migration process with a uniform entry point in AWS. We walked you through the process of strangling a legacy monolith by decomposing it into microservices and bringing microservices into the cloud one by one with migration approaches that best fit each service.

Ready to get started? Learn how to implement private integration for API Gateway. See how to further integrate mediation layers to support legacy XML and other non-JSON-based API responses. Get hands-on with the Break a Monolith Application into Microservices project.

A guide to migrating to Zabbix 6.0 LTS by Edgars Melveris / Zabbix Summit Online 2021

Post Syndicated from Edgars Melveris original https://blog.zabbix.com/a-guide-to-migrating-to-zabbix-6-0-lts-by-edgars-melveris-zabbix-summit-online-2021/18569/

Upgrading to a new software version can be an intimidating process, especially if you are upgrading your Zabbix instance for the first time. In this blog post, we will take a look at the upgrade process itself, the necessary pre-requisites, and also what changes you can expect to the existing functionality when you’ve migrated to Zabbix 6.0 LTS.

The full recording of the speech is available on the official Zabbix YouTube channel.

Pre-upgrade checklist

Database versions

The first step before performing the upgrade to a new Zabbix version is ensuring that your underlying infrastructure is ready for the upgrade process. There are some changes in Zabbix that you should be aware of and address before the upgrade. One of these changes is the list of supported database engines and their versions for Zabbix 6.0 LTS:

  • MySQL/Percona 8.0.x
  • MariaDB 10.5.0 – 10.6.x
  • PostgreSQL 13.x
  • Oracle 19c – 21c

And if you’re using PostgreSQL + TimescaleDB or Zabbix proxies:

  • TimescaleDB 2.0.1-2.3
  • SQLite 3.3.5 – 3.34.x

You may have noticed that we have increased the version requirements for the Zabbix backend databases. The reason for this is Zabbix utilizes the features that only these newer database versions provide, thus ensuring optimal Zabbix performance. If you’re using an unsupported database version, Zabbix will not start. There will be a configuration parameter to override this behavior, but that is not recommended since we cannot ensure that your Zabbix version will work without encountering any performance issues or crashes. A database upgrade to a supported version should be performed first before moving to Zabbix 6.0 LTS.

Supported operating systems

Zabbix supports all Linux distributions and many other Unix-like operating systems. Unfortunately, it is not feasible to provide Zabbix packages for each and every distribution out there. One of the major changes, made back in Zabbix 5.2, is that packages are no longer provided for RHEL/CentOS 7. This is because some of the libraries included in these distributions are outdated, and it becomes more and more complicated to build Zabbix on these OS versions. It is still possible, though, to build Zabbix from source if you provide the correct versions of the required libraries.

Some of the officially supported operating systems for Zabbix 6.0 LTS:

  • RHEL/CentOS/Oracle Linux 8
  • Ubuntu 18.04+
  • Debian 10+
  • SLES 12+

Other installation options

There are additional Zabbix deployment options:

  • Docker – all of the dependencies are already provided in the official Docker images
  • Cloud image – the image includes all of the required dependencies
  • Zabbix appliance – all of the available Zabbix appliance images contain the required dependencies

Environment review

Before upgrading between major Zabbix versions, it is highly recommended to review your environment, take a look at any pending maintenance tasks, and perform a health check. Here are some of the things to consider before performing the upgrade to Zabbix 6.0 LTS (a few quick commands for these checks follow the list):

  • Apply any required OS or DB upgrades before upgrading Zabbix and check for any issues before moving on
  • Check for any customizations in your installation – are there any DB schema changes? Any custom modules or patches?
    • The best way to test this is to make a copy of your existing Zabbix instance and test the upgrade in a QA environment.
  • Are the required packages available for all of the Zabbix components?
  • Are all proxies installed on the supported OS versions?
  • Check the documentation for any known issues in the version to which you are upgrading
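On RPM-based systems, the following commands cover most of these checks (the package names and binaries assume the official Zabbix packages):

zabbix_server -V      # currently installed server version
zabbix_proxy -V       # currently installed proxy version (run on each proxy)
rpm -qa 'zabbix*'     # all installed Zabbix packages
cat /etc/os-release   # OS distribution and version of the host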

Important changes that affect the upgrade process

There are some changes in Zabbix 6.0 LTS that could potentially affect the upgrade process or your existing Zabbix workflows.

API Changes

Below is a list of documentation pages related to API changes between versions 5.0 and 6.0:

  • https://www.zabbix.com/documentation/6.0/manual/api/changes_5.4_-_6.0
  • https://www.zabbix.com/documentation/current/manual/api/changes_5.2_-_5.4
  • https://www.zabbix.com/documentation/5.2/manual/api/changes_5.0_-_5.2

Some of the more important API changes:

  • The trigger and calculated/aggregated item syntax changes introduced in Zabbix 5.4 also affect the API calls responsible for creating triggers (ZBXNEXT-6451)
    • You will need to change the trigger syntax in your API calls to avoid any issues
  • For the user.create and user.update methods, the user_medias parameter was renamed to medias (ZBX-17955) – see the example after this list
    • The user_medias parameter is now deprecated
  • The type property is no longer supported for user.create, user.update, and user.get methods (ZBXNEXT-6148)
    • The type property is not supported for the user object since we now define it in user roles
  • Items no longer support applications. Applications have been replaced with tags (ZBXNEXT-2976)
  • Since value maps can no longer be defined globally, the valuemap.create and valuemap.get methods now require a hostid field (ZBXNEXT-5868)
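As a sketch of the user_medias to medias rename mentioned above, a user.update call against the Zabbix 6.0 API could look like the following (the endpoint URL, auth token, and IDs are placeholders):

curl -s -X POST https://zabbix.example.com/api_jsonrpc.php \
  -H 'Content-Type: application/json-rpc' \
  -d '{
        "jsonrpc": "2.0",
        "method": "user.update",
        "params": {
          "userid": "1",
          "medias": [
            {
              "mediatypeid": "1",
              "sendto": ["admin@example.com"],
              "active": 0,
              "severity": 63,
              "period": "1-7,00:00-24:00"
            }
          ]
        },
        "auth": "<api-token>",
        "id": 1
      }'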

Other important changes

There are a couple of important changes that users should be aware of when migrating to Zabbix 6.0 LTS:

  • Previously, trailing spaces in passwords – both when setting a password and when entering it – were trimmed. This has been changed, and trailing spaces in passwords are no longer trimmed.
  • Global value maps that remain unused will be removed
  • Existing audit log records will be removed due to major changes in the audit log design.

Upgrade steps

Next, let's discuss the steps that you should take to perform the upgrade procedure in a correct and safe manner:

  • Backup your database, as well as any customizations (external scripts, alert scripts) and configuration files
  • Update the Zabbix server and Zabbix frontend
    • Once the new Zabbix server process is started, it will check the database schema version and automatically upgrade the schema
    • Depending on the database size and the version from which you are migrating – this can take a while
    • Once the automatic database schema upgrade is done, the Zabbix server will be started automatically
  • Update your proxies. Proxies are required to have the same major version as the Zabbix server
  • Check that there are no issues and that your Zabbix instance is up and running
    • Check that metrics are being collected by your Zabbix server and Zabbix proxies
    • Check that triggers are detecting problems and that you're receiving notifications about them

Backup

Let’s take a more in-depth look at the backup process and discuss the required steps with some examples:

  • Backup the database – methods depend on the DB type
    • In most cases, you can ignore the history and trends tables and back up only your configuration data
    • History and trends tables tend to be extremely large, which is why this approach is a lot faster
    • If at some point you need to restore from this backup, the history and trends tables will have to be recreated manually
  • Backup the Zabbix configuration files
  • Optionally – backup any custom alert scripts, external scripts, and any other customizations

Example MySQL database backup with history and trends tables ignored:

mysqldump -uroot -p --single-transaction --ignore-table=zabbix.history --ignore-table=zabbix.history_uint --ignore-table=zabbix.history_text --ignore-table=zabbix.history_log --ignore-table=zabbix.history_str --ignore-table=zabbix.trends --ignore-table=zabbix.trends_uint zabbix | gzip > zabbix_backup.sql.gz
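Should you ever need to restore from this dump, the reverse operation is a sketch along these lines (remember that the history and trends tables will then have to be recreated manually, as noted above):

gunzip < zabbix_backup.sql.gz | mysql -uroot -p zabbix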

Backing up the configuration

At the very least, you should back up the configuration files in the following locations (a single tar command covering these paths is shown after the list):

  • /etc/zabbix/*
  • external scripts from /usr/lib/zabbix/externalscripts/
  • alert scripts from /usr/lib/zabbix/alertscripts
  • /etc/httpd/conf.d/zabbix.conf
  • /etc/php-fpm.d/zabbix.conf
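For example, one tar command can capture all of the paths listed above (adjust the paths if your distribution or web server uses different locations):

tar czf zabbix-config-backup-$(date +%F).tar.gz \
  /etc/zabbix \
  /usr/lib/zabbix/externalscripts \
  /usr/lib/zabbix/alertscripts \
  /etc/httpd/conf.d/zabbix.conf \
  /etc/php-fpm.d/zabbix.conf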

Upgrade process with Docker

There are multiple approaches to running Zabbix in Docker. For this example, we will assume that you're running the Zabbix server and Zabbix frontend in Docker, using the official Zabbix Docker images with a MySQL backend database and an Apache web frontend.

Stop the Zabbix server and frontend containers:

docker stop my-zabbix-server
docker stop my-zabbix-frontend

Start the Zabbix 6.0 LTS container and point it at the same backend database:

docker run --name my-zabbix-server-6.0 -e DB_SERVER_HOST="some-mysql-server" -e MYSQL_USER="some-user" -e MYSQL_PASSWORD="some-password" -d zabbix/zabbix-server-mysql:6.0-latest

Once again, an automatic DB schema upgrade will be started.

Lastly, start the Zabbix frontend container:

docker run --name my-zabbix-web-apache-6.0 -e DB_SERVER_HOST="some-mysql-server" -e MYSQL_USER="some-user" -e MYSQL_PASSWORD="some-password" -e ZBX_SERVER_HOST="my-zabbix-server-6.0" -d zabbix/zabbix-web-apache-mysql:6.0-latest

Upgrade process with Zabbix packages

Upgrading the main Zabbix components

If you’re using the official Zabbix packages, then the upgrade process will take a few more steps and can seem a bit more complicated. Let’s take a look at the required upgrade steps in detail. For our example, we will use a CentOS 8 OS distribution.

Install the Zabbix 6.0 LTS release package. This will add the necessary Zabbix 6.0 LTS repository information:

rpm -Uvh https://repo.zabbix.com/zabbix/6.0/rhel/8/x86_64/zabbix-release-6.0-1.el8.noarch.rpm

Clear the DNF package manager cache:

dnf clean all

Install all of the required packages:

dnf install zabbix-server-mysql zabbix-web zabbix-web-mysql zabbix-web-deps
zabbix-apache-conf zabbix-selinux-policy

Start the Zabbix components and observe the log file. You should see that the database schema upgrade is in progress; once it has finished, all of the internal Zabbix processes should be started without any issues.
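On a systemd-based installation, this typically means the following (the unit names and log path are assumptions based on the default RHEL packages):

systemctl start zabbix-server httpd php-fpm
tail -f /var/log/zabbix/zabbix_server.log

The database upgrade progress then appears in the log: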

17602:20210921:131335.333 completed 96% of database upgrade
17602:20210921:131335.355 completed 97% of database upgrade
17602:20210921:131335.379 completed 98% of database upgrade
17602:20210921:131335.606 completed 99% of database upgrade
17602:20210921:131335.711 completed 100% of database upgrade
17602:20210921:131335.711 database upgrade fully completed
17602:20210921:131335.804 server #0 started [main process]
17602:20210921:131335.808 server #2 started [configuration syncer #1]
17602:20210921:131335.810 server #1 started [service manager #1]

Upgrading the Zabbix proxies

In addition, we are also required to update our Zabbix proxies. The procedure is very similar to what we did in our previous steps:

Install the Zabbix 6.0 LTS release package:

rpm -Uvh https://repo.zabbix.com/zabbix/6.0/rhel/8/x86_64/zabbix-release-6.0-1.el8.noarch.rpm

Clear the DNF package manager cache:

dnf clean all

Update the Zabbix proxy packages:

dnf update zabbix-proxy-mysql (or zabbix-proxy-pgsql / zabbix-proxy-sqlite3, depending on your proxy database backend)

For MySQL, PostgreSQL, and Oracle proxy backend databases, the DB schema upgrade is performed automatically.

For Zabbix proxies using a SQLite3 backend database, an automatic database schema upgrade is not supported. Simply remove the old SQLite3 database file – it will be recreated automatically once the Zabbix proxy is started.

rm -rf /tmp/proxy.sqlite

Post-upgrade tasks

After the upgrade to Zabbix 6.0 LTS, there are a few additional tasks that we should take care of. Let’s take a look at what needs to be done.

History table primary key

The Zabbix 6.0 LTS backend database history table schema has been changed: these tables now contain primary keys. The upgrade of these history tables is not done automatically, since it can cause additional downtime. Depending on the size of the database, executing the required changes can be extremely slow, because every record in the history tables needs to be altered. In addition, duplicate entries in the history tables could potentially cause this manual database schema upgrade to fail. The history table schema changes bring multiple benefits, with one caveat:

  • All history tables will now have primary keys
  • Decreased history table storage size
  • Increased history table query performance
  • Not recommended when upgrading an existing instance

For new Zabbix 6.0 LTS installations, this change is included by default, while for existing installations, it is recommended to thoroughly test the history table schema change procedure and evaluate the potential downtime. The exact history table upgrade steps will be documented with the release of Zabbix 6.0 LTS.

Check new processes

There are some new Zabbix processes that have been added in Zabbix 6.0 LTS that you should be aware of:

  • StartHistoryPollers
    • The process responsible for handling calculated, aggregated, and internal checks requiring a database connection
    • The default value is 5. Consider increasing this number if you have many such items
  • If migrating from 4.0: StartLLDProcessors
    • Worker process for low-level discovery tasks
    • The default value is 2. Consider increasing if you have many low-level discovery rules.

Update the existing templates

If you’ve performed a Zabbix upgrade before, you will be aware of the fact that Zabbix does not update your existing templates automatically since we assume there could be some custom changes performed by the end-users on said templates. Therefore, to see the aforementioned new processes in the Zabbix server internal monitoring graphs, you should download and import the latest Zabbix server template.

You can download the templates from the official Zabbix git page. You can read the release notes to see the full list of the updated templates and changes that have been performed on said templates.

Update the Zabbix agents

You may also consider upgrading your Zabbix agents. This is not mandatory, since Zabbix agents are backward compatible, so you can use an older version of the Zabbix agent with Zabbix 6.0 LTS. All of the previous functionality will continue to work, but you may still want to update the agents, since the updates can contain bug fixes or support for a brand new set of items.

Upgrading the Zabbix agent:

dnf install zabbix-agent

Upgrading the Zabbix agent 2:

dnf install zabbix-agent2

New Zabbix packages

You may have noticed that in Zabbix 6.0 LTS, there are multiple new packages. Most of these packages are repackagings of some of the old components for better package management, but there are exceptions:

  • zabbix-selinux-policy – basic SELinux policy for Zabbix
  • zabbix-sql-scripts – All of the .sql backend database scripts
    • These used to be a part of the zabbix-server package
    • This package is required to, for example, deploy the initial Zabbix database schema or data during the Zabbix install process.
  • zabbix-web-service – The service responsible for the scheduled report generation

Questions

Q: Are my custom templates going to be affected in any way by the upgrade process?

A: They will be affected only in the sense that changes we have made – to trigger syntax, for example – are applied automatically to your existing entities. All of your templates will continue to work.

 

Q: How long will the migration process take? How can I estimate the downtime?

A: Unfortunately, it’s impossible to estimate a precise downtime duration without creating a QA copy of your existing Zabbix instance with the same exact hardware and checking the downtime duration there with a test upgrade. At the end of the day, this will depend not only on the size of the database but also on the size of individual tables, the version from which you are upgrading, and how optimized your software and hardware are.

 

Q: What about migrating from a very old version – say Zabbix 3.0 or older?

A: It should work, but there can be some caveats and additional prerequisites for upgrades from older versions. I would recommend going through our previous Summit recordings, since we have covered the upgrade process for older versions in previous years. Those should provide you with a checklist of prerequisites to work through before upgrading to Zabbix 6.0 LTS.

The post A guide to migrating to Zabbix 6.0 LTS by Edgars Melveris / Zabbix Summit Online 2021 appeared first on Zabbix Blog.

Summary of Zabbix Summit Online 2021, Zabbix 6.0 LTS release date and Zabbix Workshops

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/summary-of-zabbix-summit-online-2021-zabbix-6-0-lts-release-date-and-zabbix-workshops/17155/

Now that the Zabbix Summit Online 2021 has concluded, we are thrilled to report we hosted attendees from over 3000 organizations from more than 130 countries all across the globe.

This year, the main focus of the speeches was the upcoming Zabbix 6.0 LTS release, along with speeches focused on automating Zabbix data collection and configuration, integrating Zabbix within existing company infrastructures, and migrating from legacy tools to Zabbix. In total, 21 speakers presented their use cases and talked about new Zabbix features during the Summit, with over 8 hours of content.

In case you missed the Summit or wish to come back to some of the speeches – both the presentations (in PDF format) and the videos of the speeches are available on the Zabbix Summit Online 2021 Event page.

Zabbix 6.0 LTS release date

As for Zabbix 6.0 LTS – as per our statement during the event, you can expect Zabbix 6.0 LTS to release in early 2022. At the time of this post, the latest pre-release version is Zabbix 6.0 Alpha 7, with the first Beta version scheduled for release VERY soon. Feel free to deploy the latest pre-release version and take a look at features such as Geomaps, Business Service monitoring, improved Audit log, UX improvements, Anomaly detection with Machine Learning, and more! The list of the latest released Zabbix 6.0 versions as well as the improvements and fixes they contain is available in the Release notes section of our website.

Zabbix 6.0 LTS Workshops

The workshops will focus on particular Zabbix 6.0 LTS features and will be available once the Zabbix 6.0 LTS is released. The workshops will provide a unique chance to learn and practice the configuration of specific Zabbix 6.0 LTS features under the guidance of a certified Zabbix trainer at absolutely no cost! Some of the topics covered in the workshops will include – Deploying Zabbix server HA cluster, Creating triggers for Baseline monitoring and Anomaly detection, Displaying your infrastructure status on Geomaps, Deploying Business Service monitoring with root cause analysis, and more!

Upcoming events

But there's more! On December 9, 2021, Zabbix will host PostgreSQL Monitoring Day with Zabbix & Postgres Pro. The speeches will focus on monitoring PostgreSQL databases, running Zabbix on PostgreSQL DB backends with TimescaleDB, and securing your Zabbix + PostgreSQL instances. If you're currently using PostgreSQL DB backends or plan to do so in the future – you definitely don't want to miss out!

As for 2022 – you can expect multiple meetups regarding Zabbix 6.0 LTS features and use cases, as well as events focused on specific monitoring use cases. More information will be publicly available with the release of Zabbix 6.0 LTS.

New – AWS Migration Hub Refactor Spaces Helps to Incrementally Refactor Your Applications

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/new-aws-migration-hub-refactor-spaces-helps-to-incrementally-refactor-your-applications/

I am excited to announce the preview of AWS Migration Hub Refactor Spaces, a new capability of AWS Migration Hub to let you refactor existing applications into distributed applications, typically based on microservices.

There are multiple reasons why you want to refactor existing applications. You might want to make your code more modular, use more modern frameworks, use different data storage, etc. In general, when refactoring, your objective is to make your application easier to maintain and evolve over time. Other benefits might include handling larger workloads, increasing resiliency, or lowering costs. But let’s face it, refactoring is hard. I usually compare refactoring to changing the engines, cabin seats, and entertainment system of a plane while keeping the plane in the air, fully loaded with passengers, and without having them notice any change.

When talking with customers who have successfully been through these refactoring projects, we noticed a common pattern: the Strangler Fig design pattern.

strangler fig

A strangler fig is a family of plants that grow their roots from the top of the trees that host them, eventually enveloping or replacing their host. Author Martin Fowler first coined the term to describe a migration design pattern. The idea is “to gradually create a new system around the edges of the old, letting it grow slowly over several years until the old system is strangled”.

How Can I Apply This Plant Behavior To My Application Migration?
Inspired by this family of plants, I might want to extract capabilities from a monolithic application and rewrite them as microservices. Then, I incrementally route traffic away from the old to the new. Over time, all of the requests are routed to microservices, and the existing application is retired.

While effective, this approach to application transformation creates hurdles. I must create the required infrastructure to separate the existing applications and the microservices. In the AWS cloud, this often involves creating multiple AWS accounts, so teams or services can more easily operate independently. Having multiple accounts is the most efficient way to separate concerns and billing across teams. When dealing with multiple AWS accounts, I must maintain networking infrastructure to connect my existing application and the new services. Furthermore, I must create a routing control system to route traffic gradually from the old application to the new services in different accounts. Creating and managing that infrastructure at scale is complex. It introduces additional risks and costs to the refactoring project.

How Refactor Spaces Helps
AWS Migration Hub Refactor Spaces takes care of the heavy lifting for me. First, it lays down the networking infrastructure to enable connectivity between multiple AWS accounts. Second, it creates and manages a mechanism to route API calls away from my legacy application.

Let’s imagine I have a monolithic application that I want to refactor. The application is made of a web-based front-end using ReactJS. The front-end application is hosted on Amazon Simple Storage Service (Amazon S3) and distributed through Amazon CloudFront. The front-end makes API calls to a monolithic application developed in NodeJS or Python and deployed on several EC2 instances. The API uses a relational database, because that is how we have stored data since the company was founded.

The architecture of this application is illustrated by the following diagram.

refactor spaces - monolith

Each API has a distinct URI. For example the /cart API handles the shopping basket, the /order API handles the ordering system, etc. I apply the strangler fig pattern and decide to extract the /cart capabilities to a set of new microservices. I create an AWS account for these microservices. I develop and deploy a set of AWS Lambda functions to implement the cart management functionalities. I chose to use Amazon DynamoDB for the shopping basket data storage because of its low latency at scale.

The schema of my new architecture is shown in the following diagram:

Refactor Space - target architecture

But now I have two challenges. First, I have to design, code, and deploy a routing mechanism to route API calls made by the front-end application to the correct back-end: either the monolith, or the new microservices. This service will likely be deployed into a distinct AWS account. Then, I have to configure network connectivity between these multiple AWS accounts.

This is where Refactor Spaces comes into the picture.

Introducing AWS Migration Hub Refactor Spaces
Refactor Spaces makes it easy to manage application refactoring by taking care of the two challenges I just described: the routing of the API calls and the network connectivity between AWS accounts. It is made of Environments, Services, and an Application proxy. Let’s see it in action.

I open the AWS Management Console, navigate to AWS Migration Hub, and select Refactor Spaces.

I first create a Refactor Spaces Environment. An Environment is a multi-account network fabric consisting of peered VPCs. This lets AWS resources in service VPCs added to the environment communicate directly across AWS accounts. It also provides a unified view of networking and services across accounts.

In Create environment, I give my environment a name and a description, and then select Next.

Refactor Spaces - Create environment

Then, I define my application. I give my application a name, and select the VPC where the proxy will be deployed.

An application is a services container. It has a proxy that defines routes. The proxy lets your front-end application use a single endpoint to contact multiple services. All of the traffic hits the single proxy endpoint, and then it’s sent to multiple services based on your rules.

Refactor Space - Create application

You may want to use multiple AWS accounts as explained before. Typically, an application is made of one AWS account that hosts the Refactor Spaces Application proxy, one or multiple AWS accounts that host the legacy application, and one AWS account for the first microservice. Therefore, I invite the other AWS account owners to join this Refactor Spaces environment. I add one principal per AWS account. Refactor Spaces doesn't reinvent the wheel here; it leverages AWS Resource Access Manager (RAM) to share the environment.

This step is optional. Refactor Spaces may work within one AWS Account. It is possible to share the environments with other AWS accounts at a later stage.

I enter the AWS account IDs as Principals, and then select Next.

Refactor Space - Shared Accounts

Finally, I review my choices and select Create & share environment (not shown here).

Assuming that the microservices are ready to use, the next step is registering them as Refactor Spaces Services. Refactor Spaces Services are entities that provide business capabilities, typically microservices. These services are reachable through unique endpoints, and they can interoperate across accounts in a Refactor Spaces Environment. In this example, there are four services:

  • The monolithic app. This is the default service where Refactor Spaces routes all API calls initiated by the front-end.
  • Three microservices to implement the /cart capability. I decided to refactor this capability with three distinct services: AddItem, RemoveItem, and ListItems.

A Refactor Spaces Service may target any compute resource type: EC2, containers deployed on AWS Fargate, an Application Load Balancer, an AWS Lambda function, etc.

I select Create service from the left menu. The service configuration is in three steps. First, I select the Refactor Spaces Environment and Application where I want to define this service. Second, I give my service a name and a description. And third, I select the service endpoint: either an HTTP/HTTPS URL in a VPC, or a Lambda function.

The monolithic application is the default route where the Refactor Spaces Application proxy routes all of the API calls, unless otherwise specified. I enter / as Source path and select Include child paths. Then, I make sure Match all is selected for HTTP verbs.

When finished, I select Create service. I repeat this process for each of my microservices. For this demo, I create four Refactor Spaces Services in total.

AWS Migration Hub refactor Spaces - create service

The last step defines the routing rules for the Refactor Spaces Application proxy. When configured, the proxy becomes the new API endpoint for my front-end application. The sole change that I have to make in my front-end application is to point it at the Refactor Spaces Application proxy URI. The proxy routes API calls to Services, according to a route definition. An Application proxy supports routing to all compute platforms with public or private visibility. At the moment, private endpoints must be referenced through a public DNS name or their private IP address. Each API call is evaluated against the set of routes configured in the proxy. When a path matches a rule, the request is sent to the target service configured for that path. Proxies have a default route that forwards requests to a default service if they don't match any of the path rules.

I select the service that I just created. Then, I enter the route Source path and the HTTP Verb to support. When my service expects subpaths (such as /cart/123), I make sure to select Include child paths, as well.

Refactor Spaces - Define route

I repeat this process for the ListItems and RemoveItem microservices. They are invoked with different HTTP verbs: GET and DELETE, respectively.

Based on this configuration, Refactor Spaces creates and manages the following architecture for me. The Refactor Spaces Application proxy and network fabric are deployed in a separate AWS account. I might further configure the Amazon API Gateway based on the needs of my monolithic application or microservices.

Refactor Spaces - final architecture

The ultimate change is for the application front-end. I modify its configuration to point to the Refactor Spaces Application proxy endpoint, instead of the monolith’s endpoint. From now on, Refactor Spaces routes API calls to the monolith by default. It routes the /cart calls for GET, POST, and DELETE verbs to my new microservices implemented as Lambda functions.

Over time, I will repeat this process to move other capabilities out of the monolithic application, one by one, until the old monolith is strangled and replaced by the new microservices architecture.

Pricing and Availability
AWS Migration Hub Refactor Spaces is available today in the following ten AWS Regions: US East (N. Virginia), US West (Oregon), US East (Ohio), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Ireland), Europe (Frankfurt), Europe (London), and Europe (Stockholm). As per usual, we're looking forward to expanding to additional Regions in the future.

This new capability is available today as an open preview, and no registration is necessary. You can start to use it today. There is no charge for using Refactor Spaces during the preview period. However, you may be charged for the resources that it provisions in your AWS accounts: Amazon API Gateway, AWS Transit Gateway, and Network Load Balancer. The pricing details are available on the AWS Migration Hub pricing page. Billing will start when Refactor Spaces becomes generally available.

Go and start refactoring your applications today!

— seb

Copy large datasets from Google Cloud Storage to Amazon S3 using Amazon EMR

Post Syndicated from Andrew Lee original https://aws.amazon.com/blogs/big-data/copy-large-datasets-from-google-cloud-storage-to-amazon-s3-using-amazon-emr/

Many organizations have data sitting in various data sources in a variety of formats. Even though data is a critical component of decision-making, for many organizations this data is spread across multiple public clouds. Organizations are looking for tools that make it easy and cost-effective to copy large datasets across cloud vendors. With Amazon EMR and the Hadoop file copy tools Apache DistCp and S3DistCp, we can migrate large datasets from Google Cloud Storage (GCS) to Amazon Simple Storage Service (Amazon S3).

Apache DistCp is an open-source tool for Hadoop clusters that you can use to perform data transfers and inter-cluster or intra-cluster file transfers. AWS provides an extension of that tool called S3DistCp, which is optimized to work with Amazon S3. Both these tools use Hadoop MapReduce to parallelize the copy of files and directories in a distributed manner. Data migration between GCS and Amazon S3 is possible by utilizing Hadoop’s native support for S3 object storage and using a Google-provided Hadoop connector for GCS. This post demonstrates how to configure an EMR cluster for DistCp and S3DistCP, goes over the settings and parameters for both tools, performs a copy of a test 9.4 TB dataset, and compares the performance of the copy.

Prerequisites

The following are the prerequisites for configuring the EMR cluster:

  1. Install the AWS Command Line Interface (AWS CLI) on your computer or server. For instructions, see Installing, updating, and uninstalling the AWS CLI.
  2. Create an Amazon Elastic Compute Cloud (Amazon EC2) key pair for SSH access to your EMR nodes. For instructions, see Create a key pair using Amazon EC2.
  3. Create an S3 bucket to store the configuration files, bootstrap shell script, and the GCS connector JAR file. Make sure that you create a bucket in the same Region as where you plan to launch your EMR cluster.
  4. Create a shell script (sh) to copy the GCS connector JAR file and the Google Cloud Platform (GCP) credentials to the EMR cluster’s local storage during the bootstrapping phase. Upload the shell script to your bucket location: s3://<S3 BUCKET>/copygcsjar.sh. The following is an example shell script:
#!/bin/bash
sudo aws s3 cp s3://<S3 BUCKET>/gcs-connector-hadoop3-latest.jar /tmp/gcs-connector-hadoop3-latest.jar
sudo aws s3 cp s3://<S3 BUCKET>/gcs.json /tmp/gcs.json
  5. Download the GCS connector JAR file for Hadoop 3.x (if using a different version, you need to find the JAR file for your version) to allow reading of files from GCS.
  6. Upload the file to s3://<S3 BUCKET>/gcs-connector-hadoop3-latest.jar.
  7. Create GCP credentials for a service account that has access to the source GCS bucket. The credentials file should be named gcs.json and be in JSON format.
  8. Upload the key to s3://<S3 BUCKET>/gcs.json. The following is a sample key:
{
   "type":"service_account",
   "project_id":"project-id",
   "private_key_id":"key-id",
   "private_key":"-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n",
   "client_email":"service-account-email",
   "client_id":"client-id",
   "auth_uri":"https://accounts.google.com/o/oauth2/auth",
   "token_uri":"https://accounts.google.com/o/oauth2/token",
   "auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs",
   "client_x509_cert_url":"https://www.googleapis.com/robot/v1/metadata/x509/service-account-email"
}
  9. Create a JSON file named gcsconfiguration.json to enable the GCS connector in Amazon EMR. Make sure the file is in the same directory as where you plan to run your AWS CLI commands. The following is an example configuration file:
[
   {
      "Classification":"core-site",
      "Properties":{
         "fs.AbstractFileSystem.gs.impl":"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
         "google.cloud.auth.service.account.enable":"true",
         "google.cloud.auth.service.account.json.keyfile":"/tmp/gcs.json",
         "fs.gs.status.parallel.enable":"true"
      }
   },
   {
      "Classification":"hadoop-env",
      "Configurations":[
         {
            "Classification":"export",
            "Properties":{
               "HADOOP_USER_CLASSPATH_FIRST":"true",
               "HADOOP_CLASSPATH":"$HADOOP_CLASSPATH:/tmp/gcs-connector-hadoop3-latest.jar"
            }
         }
      ]
   },
   {
      "Classification":"mapred-site",
      "Properties":{
         "mapreduce.application.classpath":"/tmp/gcs-connector-hadoop3-latest.jar"
      }
   }
]

Launch and configure Amazon EMR

For our test dataset, we start with a basic cluster consisting of one primary node and four core nodes for a total of five c5n.xlarge instances. You should iterate on your copy workload by adding more core nodes and check on your copy job timings in order to determine the proper cluster sizing for your dataset.

  1. We use the AWS CLI to launch and configure our EMR cluster (see the following basic create-cluster command):
aws emr create-cluster \
--name "My First EMR Cluster" \
--release-label emr-6.3.0 \
--applications Name=Hadoop \
--ec2-attributes KeyName=myEMRKeyPairName \
--instance-type c5n.xlarge \
--instance-count 5 \
--use-default-roles

  2. Create a custom bootstrap action to be performed at cluster creation to copy the GCS connector JAR file and GCP credentials to the EMR cluster’s local storage. You can add the following parameter to the create-cluster command to configure your custom bootstrap action:
--bootstrap-actions Path="s3://<S3 BUCKET>/copygcsjar.sh"

Refer to Create bootstrap actions to install additional software for more details about this step.

  3. To override the default configurations for your cluster, you need to supply a configuration object. You can add the following parameter to the create-cluster command to specify the configuration object:
--configurations file://gcsconfiguration.json

Refer to Configure applications when you create a cluster for more details on how to supply this object when creating the cluster.

Putting it all together, the following code is an example of a command to launch and configure an EMR cluster that can perform migrations from GCS to Amazon S3:

aws emr create-cluster \
--name "My First EMR Cluster" \
--release-label emr-6.3.0 \
--applications Name=Hadoop \
--ec2-attributes KeyName=myEMRKeyPairName \
--instance-type c5n.xlarge \
--instance-count 5 \
--use-default-roles \
--bootstrap-actions Path="s3://<S3 BUCKET>/copygcsjar.sh" \
--configurations file://gcsconfiguration.json

Submit S3DistCp or DistCp as a step to an EMR cluster

You can run the S3DistCp or DistCp tool in several ways.

When the cluster is up and running, you can SSH to the primary node and run the command in a terminal window, as mentioned in this post.
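For example, connecting to the primary node with the key pair created in the prerequisites might look like this (the public DNS name is a placeholder):

ssh -i myEMRKeyPairName.pem hadoop@<primary-node-public-dns>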

You can also start the job as part of the cluster launch. After the job finishes, the cluster can either continue running or be stopped. You can do this by submitting a step directly via the AWS Management Console when creating a cluster. Provide the following details:

  • Step type – Custom JAR
  • Name – S3DistCp Step
  • JAR location – command-runner.jar
  • Arguments – s3-dist-cp --src=gs://<GCS BUCKET>/ --dest=s3://<S3 BUCKET>/
  • Action on failure – Continue

We can always submit a new step to the existing cluster. The syntax here is slightly different than in previous examples. We separate arguments by commas. In the case of a complex pattern, we shield the whole step option with single quotation marks:

aws emr add-steps \
--cluster-id j-ABC123456789Z \
--steps 'Name=LoadData,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Type=CUSTOM_JAR,Args=s3-dist-cp,--src=gs://<GCS BUCKET>/,--dest=s3://<S3 BUCKET>/'

DistCp settings and parameters

In this section, we optimize the cluster copy throughput by adjusting the number of maps or reducers and other related settings.

Memory settings

We use the following memory settings:

-Dmapreduce.map.memory.mb=1536
-Dyarn.app.mapreduce.am.resource.mb=1536

Both parameters determine the size of the map containers that are used to parallelize the transfer. Setting this value in line with the cluster resources and the number of maps defined is key to ensuring efficient memory usage. You can calculate the number of launched containers by using the following formula:

Total number of launched containers = Total memory of cluster / Map container memory
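As an illustration only (the memory figure below is an assumption, not a measured value), a cluster exposing roughly 192 GiB of YARN memory with 1,536 MB map containers would run about 128 map containers concurrently:

echo $(( 192 * 1024 / 1536 ))    # = 128 concurrent map containers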

Dynamic strategy settings

We use the following dynamic strategy settings:

-Ddistcp.dynamic.max.chunks.tolerable=4000
-Ddistcp.dynamic.split.ratio=3 -strategy dynamic

The dynamic strategy settings determine how DistCp splits up the copy task into dynamic chunk files. Each of these chunks is a subset of the source file listing. The map containers then draw from this pool of chunks. If a container finishes early, it can get another unit of work. This makes sure that containers finish the copy job faster and perform more work than slower containers. The two tunable settings are split ratio and max chunks tolerable. The split ratio determines how many chunks are created from the number of maps. The max chunks tolerable setting determines the maximum number of chunks to allow. The setting is determined by the ratio and the number of maps defined:

Number of chunks = Split ratio * Number of maps
Max chunks tolerable must be > Number of chunks
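Plugging in the values used in the sample command later in this section (split ratio 3 and 640 maps) yields 1,920 chunks, comfortably below the configured maximum of 4,000:

echo $(( 3 * 640 ))    # 1920 chunks < 4000 max chunks tolerable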

Map settings

We use the following map setting:

-m 640

This determines the number of map containers to launch.

List status settings

We use the following list status setting:

-numListstatusThreads 15

This determines the number of threads to perform the file listing of the source GCS bucket.

Sample command

The following is a sample command when running with 96 core or task nodes in the EMR cluster:

hadoop distcp
-Dmapreduce.map.memory.mb=1536 \
-Dyarn.app.mapreduce.am.resource.mb=1536 \
-Ddistcp.dynamic.max.chunks.tolerable=4000 \
-Ddistcp.dynamic.split.ratio=3 \
-strategy dynamic \
-update \
-m 640 \
-numListstatusThreads 15 \
gs://<GCS BUCKET>/ s3://<S3 BUCKET>/

S3DistCp settings and parameters

When running large copies from GCS using S3DistCP, make sure you have the parameter fs.gs.status.parallel.enable (also shown earlier in the sample Amazon EMR application configuration object) set in core-site.xml. This helps parallelize getFileStatus and listStatus methods to reduce latency associated with file listing. You can also adjust the number of reducers to maximize your cluster utilization. The following is a sample command when running with 24 core or task nodes in the EMR cluster:

s3-dist-cp -Dmapreduce.job.reduces=48 --src=gs://<GCS BUCKET>/ --dest=s3://<S3 BUCKET>/

Testing and performance

To compare the performance of DistCp and S3DistCp, we used a test dataset of 9.4 TB (157,000 files) stored in a multi-Region GCS bucket. Both the EMR cluster and the S3 bucket were located in us-west-2. The number of core nodes that we used in our testing varied from 24 to 120.

The following are the results of the DistCp test:

  • Workload – 9.4 TB and 157,098 files
  • Instance types – 1x c5n.4xlarge (primary), c5n.xlarge (core)
Nodes | Throughput | Transfer Time | Maps
24    | 1.5 GB/s   | 100 mins      | 168
48    | 2.9 GB/s   | 53 mins       | 336
96    | 4.4 GB/s   | 35 mins       | 640
120   | 5.4 GB/s   | 29 mins       | 840

The following are the results of the S3DistCp test:

  • Workload – 9.4 TB and 157,098 files
  • Instance types – 1x c5n.4xlarge (primary), c5n.xlarge (core)
Nodes | Throughput | Transfer Time | Reducers
24    | 1.9 GB/s   | 82 mins       | 48
48    | 3.4 GB/s   | 45 mins       | 120
96    | 5.0 GB/s   | 31 mins       | 240
120   | 5.8 GB/s   | 27 mins       | 240

The results show that S3DistCp performed slightly better than DistCp for our test dataset. In terms of node count, we stopped at 120 nodes because we were satisfied with the performance of the copy. Increasing the node count might yield better performance if required for your dataset. You should iterate through node counts to determine the proper number for your dataset.

Using Spot Instances for task nodes

Amazon EMR supports the capacity-optimized allocation strategy for EC2 Spot Instances for launching Spot Instances from the most available Spot Instance capacity pools by analyzing capacity metrics in real time. You can now specify up to 15 instance types in your EMR task instance fleet configuration. For more information, see Optimizing Amazon EMR for resilience and cost with capacity-optimized Spot Instances.

Clean up

Make sure to delete the cluster when the copy job is complete unless the copy job was a step at the cluster launch and the cluster was set up to stop automatically after the completion of the copy job.
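If you launched the cluster with the AWS CLI as shown earlier, a single command terminates it (replace the cluster ID with your own):

aws emr terminate-clusters --cluster-ids j-ABC123456789Z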

Conclusion

In this post, we showed how you can copy large datasets from GCS to Amazon S3 using an EMR cluster and two Hadoop file copy tools: DistCp and S3DistCp.

We also compared the performance of DistCp and S3DistCp using a test dataset stored in a multi-Region GCS bucket. As a follow-up to this post, we will run the same test on Graviton instances to compare the performance and cost of the latest x86-based instances vs. Graviton2 instances.

You should conduct your own tests to evaluate both tools and find the best one for your specific dataset. Try copying a dataset using this solution and let us know your experience by submitting a comment or starting a new thread on one of our forums.


About the Authors

Hammad Ausaf is a Sr Solutions Architect in the M&E space. He is a passionate builder and strives to provide the best solutions to AWS customers.

Andrew Lee is a Solutions Architect on the Snap Account, and is based in Los Angeles, CA.

New Strategy Recommendations Service Helps Streamline AWS Cloud Migration and Modernization

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/new-strategy-recommendations-service-helps-streamline-aws-cloud-migration-and-modernization/

Determining viable strategies for successful application migration and modernization to the cloud takes time. It can also require significant effort, depending on the size and complexity of the application portfolio to analyze. To date, the analysis process has been largely manual and nonstandard in nature, making it difficult to apply at scale on large portfolios. Limited time to make decisions, a lack of domain knowledge and cloud expertise, and low awareness of the available modernization tools and services can compound the effort and complexity.

Today, I’m pleased to announce AWS Migration Hub Strategy Recommendations to help automate the analysis of your application portfolios. Strategy Recommendations analyzes your running applications to determine runtime environments and process dependencies, optionally analyzes source code and databases, and more. The data collected from analysis is assessed against a set of business objectives that you prioritize, such as license cost reduction, speed of migration, reducing operational overhead from using managed services, or modernizing infrastructure using cloud-native technologies. Then, it produces recommendations of viable paths to migrate and modernize your applications.

Any given application could have multiple paths for migration and modernization, including rehosting, replatforming, or refactoring. You’ll get recommendations on all viable paths, and you can elect to override the recommendations as you see fit. Everyone can use Strategy Recommendations, regardless of experience, to lower the effort, time, and complexity involved in assessing application portfolios, whether they’re on premises awaiting migration or already in the AWS Cloud pending further modernization.

Taking as an example a typical N-tier application, an ASP.NET web application with a Microsoft SQL Server database, Strategy Recommendations helps you analyze the various components such as the servers hosting the web front end, the backend servers, and the database itself to determine viable paths and tools you can use to migrate and modernize onto the AWS Cloud. For instance, if your goal is to reduce licensing costs for the application, Strategy Recommendations may recommend you to refactor your application to .NET on Linux using the Porting Assistant for .NET.

Registering your Application Servers for Strategy Recommendations
Registration of the servers hosting your application portfolio with AWS Application Discovery Service is a prerequisite for Strategy Recommendations. The servers to register can be running on-premises as physical servers or virtual machines (VMs), or they can be Amazon Elastic Compute Cloud (Amazon EC2) instances for applications you’ve already migrated with a “lift-and-shift” process. You can find details on the different options for registering your application servers in the AWS Application Discovery Service User Guide.

Automated Data Collection for Analysis
With your servers registered in AWS Application Discovery Service, you can set up automated collection of the process level analysis of your application portfolio using an agentless data collector provided by Strategy Recommendations. The agentless collector can be downloaded as an Open Virtualization Appliance (OVA) for VMWare vCenter environments. If you’ve already migrated some or all of your applications to EC2, there’s also an EC2 Amazon Machine Image (AMI), which includes the collector, to help further analyze these applications for modernization opportunities.

If you don’t want to, or cannot, use automated collection methods, or you’ve already collected this data using another tool or service, you can instead manually import the data for analysis. However, the recommendations you obtain for manually imported data won’t be as in-depth as those originating from automated data collection. One additional benefit of automated collection is that it’s much easier to refresh the data as you progress.

Application and process discovery on your servers is language-agnostic. For .NET and Java applications in GitHub and GitHub Enterprise repositories and Microsoft SQL Server databases, you can optionally include detection of cloud anti-patterns. It’s important to note that if you elect to have source code or database analysis performed, no actual code or data is uploaded to Strategy Recommendations; only the results of the analysis are sent. By the way, if you elect to manually import your data for analysis, the option to perform deeper source code and database analysis is not supported.

Analyzing your Application Portfolio
Full details on how to set up automated data collection, the analysis options, and other important prerequisites can be found in the Strategy Recommendations User Guide, so I won’t go into further detail here. Instead, I want to look at how you can start analyzing an application portfolio that’s already been migrated to EC2, with an intent to modernize further, using the agentless collector. As mentioned earlier, Strategy Recommendations supports analysis of application portfolios hosted on physical on-premises servers or virtual machines, or (as shown in this post) on EC2 instances.

To start collection of data for analysis, I need to follow a small number of steps:

  1. Start and configure the Strategy Recommendations agentless collector, using either the downloadable OVA or the provided EC2 AMI.
  2. Configure each of the Windows and Linux instances hosting my applications to allow access from the collector.
  3. Configure my initial business priorities and other application and database preferences to get my initial recommendations. I can fine-tune these options later.

My first stop is at the Migration Hub console, where I click Strategy in the navigation panel to take me to the Get started page. On clicking any of the Download data collector, Download import template, or Get recommendations buttons, I’m first asked to agree to the creation of a service-linked role, granting Strategy Recommendations the necessary permissions to access other services on my behalf. Once I agree, I start at the Configure data sources page of a short wizard. Here, I can view a list of any previously registered collectors. I can also download the OVA version of the data collector and an import template for any application data I want to import manually, outside of automated collection.

Initial data source configuration page

I’m going to use the EC2 AMI-based collector so, before proceeding with this wizard, I open the EC2 console in a new browser tab to launch it. To find the image for the Strategy Recommendations data collector I can either go to the AMIs page, select Public images, and filter by owner 703163444405, or, from the Launch Instances wizard, enter the name AWSMHubApplicationDataCollector in the Search field. Once I’ve found the image, I proceed through the launch wizard as I would for any other AMI.
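If you prefer the AWS CLI, a query along these lines should locate the collector image (the owner ID and image name come from the console steps above; the query and output format are just one convenient way to display the results):

aws ec2 describe-images \
  --owners 703163444405 \
  --filters "Name=name,Values=AWSMHubApplicationDataCollector*" \
  --query 'Images[].{Name:Name,ImageId:ImageId}' \
  --output table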

Configuration of the collector is a simple process, and I’m guided by a series of questions. As I mentioned earlier, full information is in the user guide that I linked to, so I won’t go into every detail here. To start the configuration process, I first use SSH to connect to my collector instance and then run a Docker container, using the command docker exec -it application-data-collector bash. In the running container, I start the configuration Q&A with the command collector setup. During the process, I’m asked to supply the following information:

  1. Usage agreement and confirmation that all required roles have been set up, followed by a set of AWS access and secret keys.
  2. For on-premises Windows application servers that are not managed by vCenter, or EC2 Windows instances, I need to provide a user ID and password that will allow the collector to connect to my servers using WinRM.
  3. If I have any Linux application servers, I can choose whether the collector connects using SSH or certificate-based authentication.
  4. Finally, I can configure source code analysis for .NET and Java applications in repositories on GitHub or GitHub Enterprise. These require a Git username and personal access token (PAT). I can also configure additional, deeper, source code analysis for C# applications. This does, however, require a separate server running Windows, on which I’ve installed the Porting Assistant for .NET.

Once I have completed these steps, my data collector is registered and ready to start inspecting my servers. Back on the Strategy Recommendations Configure data sources page, I refresh the page and can now see my collector listed.

Registered data collector

The second step is to enable access from the collector to my application servers, details for which can be found in the Step 4: Set up the Strategy Recommendations collector topic of the user guide. For my Windows Server, I used RDP to connect and then downloaded and ran two PowerShell scripts from links provided in the guide to configure WinRM. For larger server fleets, you might consider using AWS Systems Manager Automation to perform this task. For my Linux servers, having chosen to use SSH authentication for the collector, I needed to copy public key material generated during collector configuration process to each server.
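For the Linux servers, copying the collector's public key can be as simple as the following (the key file name, user, and host are placeholders; adjust them to your environment):

ssh-copy-id -f -i collector-public-key.pub ec2-user@<linux-application-server>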

At this point, the servers to be analyzed are known to AWS Application Discovery Service, the Strategy Recommendations data collector is configured, and each server is configured to allow access from the collector. It’s now time for my third and final step; namely, to set my business and other priorities for the analysis and let the service get to work to generate my recommendations.

Back in the Get started page in Strategy Recommendations, since my collector is registered and I have no manual application data to import, I just choose Next. This takes me to the Specify Preferences page, where I set my business priorities and other preferences. I can revise these and reanalyze at any time, but for now, I use drag and drop to set License cost reduction, Modernizing infrastructure using cloud-native technologies, and Reduce operational overhead with managed services as my highest priorities. I leave the remaining options, for application and database preferences, unchanged.

Configuring my business priorities and other settings for the analysis recommendations

Choosing Next, I reach the Review page, which summarizes my choices, and then I choose Start data analysis. One item of note: the analysis runs against all servers that you’ve configured in Application Discovery Service, so you may see more servers being processed than you imported in the earlier step (servers not configured to allow access by the collector show up in the results with a collection status of “data collection failed”).

With analysis complete, my recommendations are summarized (no anti-pattern analysis has been run yet).

Initial recommendations from server analysis

One of my servers is running Windows and hosts an older version of nopCommerce, originally a .NET Framework-based application, and a related SQL Server database. As my highest business priority was license cost reduction, I start my inspection at that server. The recommendations available so far are based on inspection of just the server itself. Analysis of the source code and components comprising the application may likely influence those recommendations, so I request further analysis of the application source code by drilling down to the server and application of interest.

Adding source code analysis for the server application

Code analysis creates a JSON-format report file in Amazon Simple Storage Service (Amazon S3), which when I open it, shows anti-patterns such as accessing log files using Windows file system paths instead of a cloud-based service such as Amazon CloudWatch, fixed IP addresses, a server-specific database connection, and more.

Following code analysis, the suggested recommendations update slightly from those based on just inspection of the servers. One application component that was originally recommended for a replatforming approach is now a candidate for refactoring.

Revised recommendations

Returning to my server of interest, clicking the Strategy options tab shows me the recommendations. The results of the code analysis have played a part in the weightings, along with my business priorities. The image below shows the initial recommendations, which are based on just analysis of the server itself.

The initial recommendations

Below are the revised recommendations for the server, following source code analysis.

Revised recommendations following source code analysis

The recommendations for the server also include replatforming the application’s SQL Server database to MySQL on Amazon Relational Database Service (RDS). This is suggested because in my priorities, I requested consideration of managed services. Before following this recommendation I may want to perform an additional anti-pattern analysis of the database, which I can do after creating a secret in AWS Secrets Manager to hold the database credentials (check the user guide topic on database analysis for more details). Analysis of databases, which is currently only available for SQL Server, identifies migration incompatibilities such as unsupported data types.
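Creating that secret is a single AWS CLI call; the secret name and the exact JSON shape expected by Strategy Recommendations are described in the user guide, so treat the values below as placeholders:

aws secretsmanager create-secret \
  --name strategy-recommendations/sqlserver-credentials \
  --secret-string '{"username":"<db-user>","password":"<db-password>"}'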

In the screenshots, you’ll notice additional viable paths for migration and modernization. This applies to both servers and application components. I can choose a viable path over the recommended strategy if I so want by selecting the viable strategy option and clicking Set preferred. In the screenshot below, for the nopCommerce application component, I’ve chosen to prefer the replatform route to containers for the application, using AWS App2Container. And of course, I can always rewind to the start and adjust my business priorities and other options and reanalyze my data.

Setting a preferred approach for a recommendation

Taking the initial recommendations, then using code and database analysis, or revising your priorities for analysis and the suggested recommendations, provides scope to experiment with multiple “what if” options to discover the optimal strategy for migrating and modernizing application portfolios to the cloud. Once that optimal strategy is determined, you can communicate it to downstream teams to begin the migration and modernization process for your application portfolio.

Get Recommendations for Migration and Modernization Today
You can get started analyzing your servers and application portfolios today with AWS Migration Hub Strategy Recommendations, at no extra charge, in the US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (London) Regions. You can, of course, deploy the applications you choose to migrate and modernize based on recommendations from the tool to all Regions. As I noted earlier, you can find more details on prerequisites, getting started with the collector, and working with recommendations in the user guide.

— Steve

Migrate Resources Between AWS Accounts

Post Syndicated from Ashok Srirama original https://aws.amazon.com/blogs/architecture/migrate-resources-between-aws-accounts/

Have you ever wondered how to move resources between Amazon Web Services (AWS) accounts? You can really view this as a migration of resources. Migrating resources from one AWS account to another may be desired or required due to your business needs. Following are a few scenarios where this may be of benefit:

  1. When you acquire, sell, or merge overseas operations with other businesses.
  2. When you move regional operations from one Managed Service Provider (MSP) to another.
  3. When you reorganize your AWS account and organizational structure.

This process may involve migrating the infrastructure either partially or completely.

In this blog, we will discuss various approaches to migrating resources based on type, configuration, and workload needs. Usually, the first consideration is infrastructure. What’s in your environment? What are the interdependencies? How will you migrate each resource?

Using this information, you can outline a plan on how you will approach migrating each of the resources in your portfolio, and in what order.

Here are some considerations to address for a typical migration: migrating infrastructure, compute resources, storage resources, and database resources.

Let’s look at each of these considerations in detail.

Migrating infrastructure

To migrate infrastructure that includes ephemeral resources, you can use one of the following Infrastructure as Code (IaC) approaches, shown in Figure 1. IaC templates are like programming scripts that automate the provisioning of IT resources.

Figure 1. Approaches to migrate infrastructure using IaC

1. If you are already using AWS CloudFormation templates, you can easily import the existing templates to the target AWS account.

AWS CloudFormation simplifies provisioning and management on AWS. You can create templates for quick and reliable provisioning of services or applications (called “stacks”). A minimal sketch of deploying an exported template in the target account follows this list.

2. You can use tools like Former2 to templatize your existing resources in the source AWS account and deploy them in the target account.

Former2 is an open-source project that generates IaC templates, such as AWS CloudFormation or HashiCorp Terraform templates, from the existing resources within your AWS account. Read Accelerate infrastructure as code development with open source Former2 for step-by-step guidance.
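For illustration, here is a minimal, hedged boto3 sketch of deploying an exported or generated template into the target account. The stack name, template file name, and capabilities are assumptions for this example, not values from the post, and the code assumes it runs with credentials for the target account.

```python
import boto3

# Hedged sketch: deploy an exported or generated CloudFormation template into
# the target account. Assumes exported-template.yaml exists locally and that
# the active credentials belong to the target account; names are placeholders.
cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("exported-template.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="migrated-app-stack",          # hypothetical stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],   # only needed if the template creates IAM resources
)

# Block until the stack has finished creating.
cfn.get_waiter("stack_create_complete").wait(StackName="migrated-app-stack")
```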

Migrating compute resources

To migrate compute resources that have a persistent state, such as Amazon EC2 instances, you can use one of the following approaches, shown in Figure 2. EC2 provides a virtual computing environment, allowing you to launch instances with a variety of operating systems.

Figure 2. Approaches to migrate compute resources

  1. If you are already using the AWS Backup service and AWS Organizations to centrally manage backup policies, you can enable the AWS Backup cross-account management feature. This lets you manage and monitor your backup, restore, and copy jobs across AWS accounts. Ensure both accounts are in the same AWS Organization. Once the backups are available in the target account, you can restore EC2 instances. Follow the detailed instructions at Creating backup copies across AWS accounts.

AWS Backup is a fully managed data protection service that centralizes and automates data protection across AWS services, in the cloud, and on premises. You can configure backup policies and monitor activity for your AWS resources, and automate and consolidate backup tasks that were previously performed service by service. This removes the need to create custom scripts and use manual processes.

2. Create an Amazon Machine Image of your EC2 instances and share it with the target account. You can launch new EC2 instances using the shared AMI. Follow step-by-step instructions: How do I transfer an Amazon EC2 instance or AMI to a different AWS account?

An Amazon Machine Image (AMI) provides the information required to launch an instance. You can launch multiple instances with the same configuration from a single AMI, or use different AMIs to launch instances with different configurations.
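For illustration, here is a minimal, hedged boto3 sketch of approach 2: creating an AMI from an instance and sharing it with the target account. The instance ID, AMI name, and account ID are placeholders.

```python
import boto3

# Hedged sketch: create an AMI from an instance and share it with the target
# account. The instance ID, AMI name, and account ID are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")
TARGET_ACCOUNT_ID = "111122223333"

image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="app-server-migration-ami",
    NoReboot=True,
)
ami_id = image["ImageId"]

# Wait for the AMI to become available before sharing it.
ec2.get_waiter("image_available").wait(ImageIds=[ami_id])

# Grant the target account permission to launch instances from this AMI.
ec2.modify_image_attribute(
    ImageId=ami_id,
    LaunchPermission={"Add": [{"UserId": TARGET_ACCOUNT_ID}]},
)
# Note: the AMI's backing EBS snapshots must also be shared
# (modify_snapshot_attribute), and encrypted AMIs require sharing the KMS key.
```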

For migrating non-persistent compute resources, refer to the Migrating infrastructure section.

Migrating storage resources

AWS offers various storage services, including object, file, and block storage. To migrate objects from an S3 bucket, you can take the following approaches, shown in Figure 3a.

Figure 3a. Approaches to migrate S3 buckets

  1. Use Amazon S3 command line interface (CLI) commands to copy the initial load of objects from the source account to the target account (an equivalent boto3 sketch follows this list). Read How can I copy S3 objects from another AWS account? After the initial copy, you can enable the Amazon S3 replication feature to continuously replicate object changes across accounts. Add a bucket policy that grants the source account permission to replicate objects into the destination bucket. Read this walkthrough on how to configure replication.

  2. If the S3 bucket contains a large number of objects, use Amazon S3 Batch Operations to copy objects across AWS accounts in bulk. Read Cross-account bulk transfer of files using Amazon S3 Batch Operations.
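For illustration, here is a minimal, hedged boto3 equivalent of the initial copy described in approach 1. The bucket names are placeholders, and the sketch assumes the credentials in use can read the source bucket and write to the destination bucket.

```python
import boto3

# Hedged sketch: initial cross-account copy of objects. Bucket names are
# placeholders; the active credentials are assumed to be able to read the
# source bucket and write to the destination bucket.
s3 = boto3.client("s3")

SOURCE_BUCKET = "source-account-bucket"
DEST_BUCKET = "destination-account-bucket"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET):
    for obj in page.get("Contents", []):
        # copy_object handles objects up to 5 GB; larger objects need a
        # multipart copy.
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=obj["Key"],
            CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            # Hand ownership of the copied object to the destination bucket owner.
            ACL="bucket-owner-full-control",
        )
```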

To migrate files from an Amazon EFS file system, you can take the following approach, shown in Figure 3b.

Figure 3b. Approach to migrate EFS file systems

Use an AWS DataSync agent to transfer data from one EFS file system to another. AWS DataSync is an online transfer service that simplifies moving, copying, and synchronizing large amounts of data between on-premises storage systems and AWS storage services. Read Transferring file data across AWS Regions and accounts using AWS DataSync for step-by-step guidance.
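For illustration, here is a minimal, hedged boto3 sketch of the DataSync pieces involved: two EFS locations and a task between them. All ARNs are placeholders, and the network, mount-target, and cross-account IAM prerequisites described in the linked walkthrough are assumed to already be in place.

```python
import boto3

# Hedged sketch: create DataSync locations for the source and destination EFS
# file systems and a task to copy between them. All ARNs are placeholders.
datasync = boto3.client("datasync", region_name="us-east-1")

source = datasync.create_location_efs(
    EfsFilesystemArn="arn:aws:elasticfilesystem:us-east-1:111122223333:file-system/fs-source",
    Ec2Config={
        "SubnetArn": "arn:aws:ec2:us-east-1:111122223333:subnet/subnet-aaaa",
        "SecurityGroupArns": ["arn:aws:ec2:us-east-1:111122223333:security-group/sg-aaaa"],
    },
)
destination = datasync.create_location_efs(
    EfsFilesystemArn="arn:aws:elasticfilesystem:us-east-1:444455556666:file-system/fs-target",
    Ec2Config={
        "SubnetArn": "arn:aws:ec2:us-east-1:444455556666:subnet/subnet-bbbb",
        "SecurityGroupArns": ["arn:aws:ec2:us-east-1:444455556666:security-group/sg-bbbb"],
    },
)

task = datasync.create_task(
    SourceLocationArn=source["LocationArn"],
    DestinationLocationArn=destination["LocationArn"],
    Name="efs-cross-account-migration",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```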

Migrating database resources

AWS offers various purpose-built database engines. These include relational, key-value, document, in-memory, graph, time series, wide column, and ledger databases. To migrate relational databases, you can take one of the following approaches, shown in Figure 4.

Figure 4. Approaches to migrate relational database resources

  1. If you want to continuously replicate data changes, use AWS Database Migration Service (AWS DMS) to replicate your data across AWS accounts with high availability. The source database remains fully operational during the migration, minimizing downtime for applications that rely on it. You can set up a DMS task for either a one-time migration or ongoing replication. An ongoing replication task keeps your source and target databases in sync, continuously applying source changes to the target with minimal latency. Learn how to Set Up AWS DMS for Cross-Account Migration.

AWS DMS is a cloud service that streamlines the migration of relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the AWS Cloud or between combinations of cloud and on-premises setups.

  2. Use RDS snapshots to create and share database backups across AWS accounts, then use the shared snapshots to launch new Amazon Relational Database Service (Amazon RDS) instances in the target account (see the sketch after this list). Read the step-by-step instructions: How can I share an encrypted Amazon RDS DB snapshot with another account?

  3. Use AWS Backup to create backup policies that automatically back up your AWS resources. Use the AWS Backup cross-account management feature to manage and monitor your backup, restore, and copy jobs across AWS accounts. Once the backups are available in the target account, you can restore RDS instances. Learn about Creating backup copies across AWS accounts.
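For illustration, here is a minimal, hedged boto3 sketch of approach 2: creating a manual snapshot and sharing it with the target account. The identifiers and account ID are placeholders, and encrypted snapshots additionally require sharing the KMS key used to encrypt them.

```python
import boto3

# Hedged sketch: create a manual snapshot and share it with the target account.
# Identifiers and the account ID are placeholders.
rds = boto3.client("rds", region_name="us-east-1")
TARGET_ACCOUNT_ID = "111122223333"

rds.create_db_snapshot(
    DBInstanceIdentifier="source-app-db",
    DBSnapshotIdentifier="source-app-db-migration",
)
rds.get_waiter("db_snapshot_available").wait(
    DBSnapshotIdentifier="source-app-db-migration"
)

# Grant the target account permission to restore from the snapshot.
rds.modify_db_snapshot_attribute(
    DBSnapshotIdentifier="source-app-db-migration",
    AttributeName="restore",
    ValuesToAdd=[TARGET_ACCOUNT_ID],
)
```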

In this section, we discussed relational database migration. You can also use AWS DMS to migrate other types of databases. Read about supported AWS DMS source and target databases.

Conclusion

In this blog post, we discussed various approaches you can take to migrate resources from one account to another, depending on the resource type and configuration. You can also explore CloudEndure Migration for continuous data replication. Learn more about Migrating workloads across AWS Regions with CloudEndure Migration.

Middleware-assisted Zero-downtime Live Database Migration to AWS

Post Syndicated from Seif Elharaki original https://aws.amazon.com/blogs/architecture/middleware-assisted-zero-downtime-live-database-migration-to-aws/

When trying to figure out how to refactor your applications to leverage AWS Managed Services, you have some decisions to make. You may have decided to move your storage layer to AWS before the computational layer. This may help with using advanced database features, in addition to reducing costs associated with writing and reading data. AWS Professional Services recently helped a large customer with this implementation.

With more than a quarter of a billion daily users, this customer uses highly transactional NoSQL databases that are a few hundred GBs in size, and the volume of data is growing rapidly. The downtime requirements for the migration were stringent, as their applications are used globally, 24-7. The source data layer was Cloud Datastore, which runs outside of AWS. The destination was Amazon DynamoDB. Several hundred globally distributed applications (writing to and reading from the database) had little or no room for refactoring in the initial phase. While the go-to solution for this scenario is usually AWS Database Migration Service (AWS DMS), Cloud Datastore is not yet supported by AWS DMS as a source.

Using a configurable middleware to migrate in-use data layer

The architectural approach chosen was to develop a custom middleware that the applications would communicate with rather than calling the database directly. Common database operations such as Get, Put, Delete, conditional updates and deletes, and transactions would be issued against this middleware, which is loaded as an in-memory library. Data would be read from and written to this middleware layer, which would then issue reads and writes to multiple databases with configurable load factors. The solution was developed and tested in stages (Dual-Write Single-Read, Dual-Write Dual-Read, and Single-Write Single-Read) as shown in the following diagrams.

Architecture: routing database traffic to multiple storage targets

Figure 1 shows the initial state when the application layer communicates directly with the source database.

Figure 1. Architecture of initial state: Generic 2 or 3 tier application with application and storage layers running inside or outside AWS Cloud

Figures 2 and 3 illustrate the intermediate and final states, respectively, with database traffic moved to DynamoDB progressively.

Figure 2. Architecture of intermediate state: The middleware layer introduced to switch database traffic between source and target databases

Figure 3. Architecture of post-migration final state

A closer look into the configurable stages of the migration

Initially, the middleware should be tested with the source database alone. It can then be configured to work with DynamoDB in Dual-Write mode. Reads will still continue from the source database. The target database is synchronized by copying old data in parallel.

In the next stage, reads are expanded to the target database. Reading from two sources allows in-memory comparison of the final result set. This ensures consistency of the data being returned. Upon successful validation, the system is finally configured as Single-Write Single-Read, operating solely on the target database. This is the “Point of no Return,” where the target database surges ahead with new data. In this mode, the migration is deemed complete, and the older database is ready to be taken offline.

This multi-stage approach results in a “live migration” of the data layer to DynamoDB with zero downtime. Higher-level applications are also left intact. This increases the speed and accuracy of the overall migration.

Configurable stages of migration load-balance traffic to the underlying databases

The middleware layer acts as a valve or switch regulating traffic from the applications to one or more databases. This allows canary-style load balancing, where a certain percentage of traffic can be diverted in either direction. We can visualize this behavior with the analogy of a 3-stage dial, as shown in Figures 4 through 6. These stages are developed and tested in a non-production environment with production-like data. All related sets of tables should be migrated together.

Dual-Write Single-Read stage

Figure 4. Migration Stage 1: Dual-Write Single-Read mode

In this stage, shown in Figure 4, data is written to both the source database and the target database (DynamoDB). At this point, data is read only from the source, because the target is not ready to handle reads yet. While new data is being written to the target database, older data is copied and backfilled by background processes.

To avoid data corruption, older data must not change while it is being copied to bring the target database on par with the source. The middleware can implement a locking mechanism for write operations based on primary keys. One way to monitor the movement of older data is a temporary tracking table that the copying process updates; the middleware reads this table to allow or deny a write operation. In most use cases, writes to older data taper off with time, making it easier to copy that data without running into contention.
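As a hedged illustration of this idea, the sketch below shows a Dual-Write Single-Read wrapper that consults a backfill tracker before accepting a write. The class and method names are invented for this example and are not taken from the actual middleware.

```python
# Hedged sketch: Dual-Write Single-Read routing with a primary-key lock check.
# The client wrappers and the backfill tracker are illustrative stand-ins, not
# the customer's actual middleware.
class KeyTemporarilyLockedError(Exception):
    """Raised when a write targets a key that the backfill process is copying."""


class DualWriteSingleReadMiddleware:
    def __init__(self, source_db, target_db, backfill_tracker):
        self.source_db = source_db                # e.g. a Cloud Datastore wrapper
        self.target_db = target_db                # e.g. a DynamoDB wrapper
        self.backfill_tracker = backfill_tracker  # tracks keys currently being copied

    def get(self, key):
        # Stage 1: reads are served from the source database only.
        return self.source_db.get(key)

    def put(self, key, item):
        # Deny (or let the caller retry) writes to keys that are currently
        # being backfilled, so older data is not corrupted mid-copy.
        if self.backfill_tracker.is_locked(key):
            raise KeyTemporarilyLockedError(key)
        self.source_db.put(key, item)
        self.target_db.put(key, item)
```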

Dual-Write Dual-Read stage

Figure 5. Migration Stage 2: Dual-Write Dual-Read mode

The prerequisite for this stage is parity of content between the two databases. As shown in Figure 5, both reads and writes are routed to both databases. In this stage, the middleware layer activates data validation: the records read from both data layers are compared for accuracy and consistency. This allows any discrepancies in the data to be fixed and the solution redeployed.

Single-Write Single-Read stage

Figure 6. Migration Stage 3: Single-Write Single-Read mode

In this stage, shown in Figure 6, all reads and writes are directed only to the target database on AWS. This is the “Point of no Return”, as new data is written to the target database alone. The source database falls behind and can be taken offline for eventual retirement.

Dealing with differences in database features

Apart from acting as a switch, the job of the middleware layer in this design pattern is to accept, translate, and forward the generic database call. For example, when it receives a “Put” call, it must invoke the “Put” API on the specific underlying target and, after due translation, follow the rules governing the corresponding service. The middleware layer does this twice, once per underlying database, when operated in Dual-Write or Dual-Read modes.

You must deal with differences between the databases in terms of specific features, limits, and behaviors. The following is a non-exhaustive list of such areas:

  1. Specific quantitative limits: DynamoDB imposes size and item-count limits on transactions (for example, at most 4 MB of aggregate data per transaction). These limits are likely different for the source database.
  2. Behavioral differences for features like indexes: Cloud Datastore supports writing empty values to indexed fields, which DynamoDB does not.
  3. Behavioral differences for primary and secondary keys: Other databases might not treat keys the same way DynamoDB treats its hash and sort keys.
  4. Differences in capacity, throughput, and latency: The middleware may need to throttle or even decline requests. This can happen if it operates in an Availability Zone where one underlying database is able to scale but the other is not.

An object-oriented approach is an efficient way to deal with such differences. Create a base class encapsulating the features that are common to the different databases, then use inheritance and polymorphism to account for the differences. This ensures reusability, readability, and maintainability.
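As a hedged illustration of that object-oriented approach, the sketch below defines a small base class and a DynamoDB-specific subclass. The class and method names are invented for this example; only the boto3 DynamoDB calls are real APIs.

```python
from abc import ABC, abstractmethod

import boto3


class DatastoreClient(ABC):
    """Base class for the generic operations the middleware forwards."""

    @abstractmethod
    def put(self, table, item):
        ...

    @abstractmethod
    def get(self, table, key):
        ...


class DynamoDBClient(DatastoreClient):
    """Target-specific subclass: translates generic calls into DynamoDB calls."""

    def __init__(self, region_name="us-east-1"):
        self._ddb = boto3.resource("dynamodb", region_name=region_name)

    def put(self, table, item):
        self._ddb.Table(table).put_item(Item=item)

    def get(self, table, key):
        return self._ddb.Table(table).get_item(Key=key).get("Item")


# A subclass for the source database (for example, Cloud Datastore) would
# implement the same contract, letting the middleware treat both targets
# polymorphically and apply service-specific limits in one place.
```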

As the AWS Professional Services team has experienced, the resulting tool can be reused several times in a large organization to migrate different application suites. It can potentially be applied to other use cases. These include, but are not limited to:

  1. Support for more storage configurations and databases, and abstraction of application code bases by making them largely agnostic of the underlying database technology
  2. Extensive database compatibility testing using granular migration stages
  3. Modularization and containerization with computational platforms such as Amazon Elastic Kubernetes Service (EKS) or Amazon Elastic Container Service (ECS)

Conclusion

This design pattern showcases the power of abstraction in enabling live database migrations. Several optimizations are possible based on the rate of writes and the pre-existing size of the database. The key benefit of this approach is that it eliminates the need for extensive changes in the application layer. This can result in significant savings in effort, time, and cost, especially if different applications are managed by different teams in an organization. In addition, the migration to DynamoDB alone can yield significant savings, depending on the size and access pattern of the data and whether the solution is architected for cost savings. Refer to the Cost Optimization Pillar of the Well-Architected Framework for further best practices.


Zabbix migration in a mid-sized bank environment

Post Syndicated from Angelo Porta original https://blog.zabbix.com/zabbix-migration-in-a-mid-sized-bank-environment/13040/

A real CheckMK/LibreNMS-to-Zabbix migration for a mid-sized Italian bank (1,700 branches, many thousands of servers and switches). The customer needed a very robust architecture and ancillary services around the Zabbix engine to manage an error-free configuration.

Content

I. Bank monitoring landscape (1:45)
II. Zabbix monitoring project
III. Questions & Answers (19:40)

Bank monitoring landscape

The bank is one of the 25 largest European banks by market capitalization and one of the 10 largest banks in Italy by:

  • branch network,
  • loans to customers,
  • direct funding from customers,
  • total assets.

At the end of 2019, at least 20 different monitoring tools were in use at the bank:

  • LibreNMS for networking,
  • CheckMK for servers other than Microsoft,
  • Zabbix for some limited areas inside DCs,
  • Oracle Enterprise Monitor,
  • Microsoft SCCM,
  • custom monitoring tools (periodic plain counters, direct HTML page access, complex dashboards, etc.)

For each alert, hundreds of emails were sent to different people, which made it impossible to really monitor the environment. There was no central monitoring, and monitoring efforts were scattered.

The bank’s requirements:

  • Single pane of glass for two Data Centers and branches.
  • Increased monitoring capabilities.
  • Secured environment (end-to-end encryption).
  • More automation and audit features.
  • Separate monitoring of two DCs and branches.
  • No direct monitoring: all traffic via Zabbix Proxy.
  • Revised and improved alerting schema/escalation.
  • Parallel operation with CheckMK and LibreNMS for a certain period of time.

Why Zabbix?

The bank chose Zabbix over its competitors for many reasons:

  • better feature coverage across the network/server/software environment;
  • opportunity to integrate with other internal bank software;
  • continuous enhancements on every Zabbix release;
  • the best integration with automation software (Ansible); and
  • personnel’s previous experience and skills.

Zabbix central infrastructure — DCs

First, we had to design a single infrastructure able to monitor many thousands of devices across the two data centers and the branches, collecting many items and thousands of values per second.

The architecture is now based on two database servers clustered using Patroni and etcd, as well as several Zabbix proxies (one for each environment: preproduction, production, test, and so on). There are two Zabbix servers, one for the DCs and another for the branches. We also suggested deploying a third Zabbix server to monitor the two main Zabbix servers. The DC database is replicated on the branches DB server, while the branches DB is replicated on the server handling the DCs using Patroni, so two copies of each database are available at any point in time. The two data centers are located more than 50 kilometers apart. In this picture, the focus is on DC monitoring:

Zabbix central infrastructure — DCs

Zabbix central infrastructure — branches

In this picture, the focus is on the branches.

Before starting the project, we planned one proxy for each branch, that is, roughly 1,500 proxies. We changed this initial choice during implementation, reducing the branch proxies to four.

Zabbix central infrastructure — branches

Zabbix monitoring project

New infrastructure

Hardware

  • A two-node bare-metal cluster for the PostgreSQL DB.
  • Two bare-metal Zabbix engines, each with two Intel Xeon Gold 5120 2.2 GHz 14C/28T processors, 4 NVMe disks, and 256 GB RAM.
  • A single VM for the Zabbix MoM.
  • Another bare-metal server for database backups.

Software

  • OS RHEL 7.
  • PostgreSQL 12 with TimeScaleDB 1.6 extension.
  • Patroni Cluster 1.6.5 for managing Postgres/TimeScaleDB.
  • Zabbix Server 5.0.
  • Proxies for metrics collection (5 for each DC and 4 for the branches).

Zabbix templates customization

We started from the official Zabbix 5.0 templates, then deleted many metrics and adjusted the templates with the large number of servers and devices to monitor in mind. We have:

  • added throttling and keepalive tuning for massive monitoring;
  • relaxed some triggers and their recovery expressions to avoid false positives and false negatives;
  • developed a new custom template module for Linux multipath monitoring;
  • developed a new custom template for NFS/CIFS monitoring (ZBXNEXT-6257);
  • developed a new custom webhook for event ingestion into third-party software (CMS/ticketing).

Zabbix configuration and provisioning

  • An essential part of the project was Zabbix configuration and provisioning, which was handled using Ansible tasks and playbooks. This allowed us to distribute and automate agent installation and to associate templates and host groups with hosts according to their role in the environment, using the CMDB.
  • We also developed some custom scripts, for instance, to keep Zabbix users aligned with Active Directory (see the sketch after this list).
  • We implemented single sign-on using Active Directory Federation Services and Zabbix SAML 2.0 support in order to interface with Microsoft Active Directory.
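As a hedged illustration of what such a user-alignment script might look like, the sketch below uses the community pyzabbix library to compare Zabbix users against an Active Directory export. The URL, credentials, and load_ad_usernames helper are placeholders, and the user field is named alias on Zabbix 5.0 (username on newer releases).

```python
from pyzabbix import ZabbixAPI


def load_ad_usernames():
    # Placeholder: in a real script this would query Active Directory via LDAP.
    return {"jdoe", "asmith"}


# Hedged sketch: report Zabbix users that no longer exist in Active Directory.
zapi = ZabbixAPI("https://zabbix.example.internal")
zapi.login("api_user", "api_password")

ad_users = load_ad_usernames()
zabbix_users = zapi.user.get(output=["userid", "alias"])  # "alias" on Zabbix 5.0

for user in zabbix_users:
    if user["alias"] not in ad_users:
        print(f"User {user['alias']} ({user['userid']}) not found in AD; flag for removal")
```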


Issues found and solved

During the implementation, we found and solved many issues.

  • A dedicated proxy for each of the 1,500 branches turned out to be too expensive to maintain and support, so we decided to deploy fewer proxies and managed to connect all the devices in the branches using only four of them.
  • Following deployment of all the metrics and templates associated with over 10,000 devices, the data center database exceeded 3.5 TB. To decrease the size of the database, we worked on throttling and keep-alive settings: we had to increase the keep-alive from 15 to 60 minutes and relax the sample interval to 5 minutes.
  • There is no official Zabbix agent for the Solaris 10 operating system, so we needed to recompile and test this agent extensively.
  • The preprocessing step is not available for NFS stale status (ZBXNEXT-6257).
  • We needed to increase the maximum length of user macro to 2,048 characters on the server-side (ZBXNEXT-2603).
  • We needed to ask for JavaScript preprocessing user macros support (ZBXNEXT-5185).

Project deliverables

  • The project was started in April 2020, and massive deployment followed in July/August.
  • At the moment, we have over 5,000 monitored servers in two data centers and over 8,000 monitored devices in branches — servers, ATMs, switches, etc.
  • Currently, each data center database is less than 3.5 TB, and the branches’ database is about 0.5 TB.
  • We monitor the two data centers with over 3,800 NVPS (new values per second).
  • Decommissioning of LibreNMS and CheckMK is planned for the end of 2020.

Next steps

  • To complete data center monitoring for other devices, expanding monitoring to networking equipment.
  • To complete branch monitoring for switches and Wi-Fi APs.
  • To implement custom periodic reporting.
  • To integrate with a C-level dashboard.
  • To tune alerting and escalation to send the right messages to the right people, so that messages are not discarded.

Questions & Answers

Question. Have you considered upgrading to Zabbix 5.0 and using TimeScaleDB compression? What TimeScaleDB features are you interested in the most — partitioning or compression?

Answer. We plan to upgrade to Zabbix 5.0 later. First, we need to run our infrastructure stress testing, so we might wait for a minor release and then activate compression.

We use Postgres solutions for database, backup, and cluster management (Patroni), and TimeScaleDB is important to manage all this data efficiently.

Question. What is the expected NVPS for this environment?

Answer. Nearly 4,000 for the main DC and about 500 for the branches, which makes this a medium-to-large instance.

Question. What methods did you use to migrate from your numerous different solutions to Zabbix?

Answer. We used the easy method: we installed everything from scratch, as it was too complex a task to migrate from so many different solutions. For most of the transition period, we ran all the monitoring solutions in parallel to check that Zabbix could collect the same monitoring information.