All posts by Manikanta Gona

Automatically detect Personally Identifiable Information in Amazon Redshift using AWS Glue

Post Syndicated from Manikanta Gona original https://aws.amazon.com/blogs/big-data/automatically-detect-personally-identifiable-information-in-amazon-redshift-using-aws-glue/

With the exponential growth of data, companies are handling huge volumes and a wide variety of data including personally identifiable information (PII). PII is a legal term pertaining to information that can identify, contact, or locate a single person. Identifying and protecting sensitive data at scale has become increasingly complex, expensive, and time-consuming. Organizations have to adhere to data privacy, compliance, and regulatory requirements such as GDPR and CCPA, and it’s important to identify and protect PII to maintain compliance. You need to identify sensitive data, including PII such as name, Social Security Number (SSN), address, email, driver’s license, and more. Even after identification, it’s cumbersome to implement redaction, masking, or encryption of sensitive data at scale.

Many companies identify and label PII through manual, time-consuming, and error-prone reviews of their databases, data warehouses, and data lakes, leaving their sensitive data unprotected and vulnerable to regulatory penalties and breach incidents.

In this post, we provide an automated solution to detect PII data in Amazon Redshift using AWS Glue.

Solution overview

With this solution, we detect PII in the data stored in our Redshift data warehouse so that we can take the appropriate actions to protect it. We use the following services:

  • Amazon Redshift is a cloud data warehousing service that uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning (ML) to deliver the best price/performance at any scale. For our solution, we use Amazon Redshift to store the data.
  • AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, and combine data for analytics, ML, and application development. We use AWS Glue to discover the PII data that is stored in Amazon Redshift.
  • Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance.

The following diagram illustrates our solution architecture.

The solution includes the following high-level steps:

  1. Set up the infrastructure using an AWS CloudFormation template.
  2. Load data from Amazon S3 to the Redshift data warehouse.
  3. Run an AWS Glue crawler to populate the AWS Glue Data Catalog with tables.
  4. Run an AWS Glue job to detect the PII data.
  5. Analyze the output using Amazon CloudWatch.

Prerequisites

The resources created in this post assume that a VPC with a private subnet is already in place and that you have the identifiers of both. This way, the solution doesn't change your existing VPC and subnet configuration; instead, the VPC endpoints are created in the VPC and subnet you choose.

Before you get started, create the following resources as prerequisites:

  • An existing VPC
  • A private subnet in that VPC
  • A VPC gateway endpoint for Amazon S3
  • A VPC interface endpoint for AWS Security Token Service (AWS STS)

Set up the infrastructure with AWS CloudFormation

To create your infrastructure with a CloudFormation template, complete the following steps:

  1. Open the AWS CloudFormation console in your AWS account.
  2. Choose Launch Stack:
  3. Choose Next.
  4. Provide the following information:
    1. Stack name
    2. Amazon Redshift user name
    3. Amazon Redshift password
    4. VPC ID
    5. Subnet ID
    6. Availability Zones for the subnet ID
  5. Choose Next.
  6. On the next page, choose Next.
  7. Review the details and select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Create stack.
  9. Note the values for S3BucketName and RedshiftRoleArn on the stack’s Outputs tab.

Load data from Amazon S3 to the Redshift data warehouse

With the COPY command, we can load data from files located in one or more S3 buckets. We use the FROM clause to indicate how the COPY command locates the files in Amazon S3. You can provide the object path to the data files as part of the FROM clause, or you can provide the location of a manifest file that contains a list of S3 object paths. COPY from Amazon S3 uses an HTTPS connection.
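
For reference, the manifest variant mentioned above looks like the following. This is only a minimal sketch and not part of this post's walkthrough (the COPY commands in this post are run from Query Editor v2); it assumes the open-source redshift_connector driver, and the endpoint, credentials, manifest object key, and IAM role are placeholders.

# Sketch only: run a COPY that loads the files listed in an S3 manifest.
# The connection details, manifest object key, and IAM role are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="<your-redshift-endpoint>",
    database="<your-database>",
    user="<your-user>",
    password="<your-password>",
)

# The MANIFEST keyword tells COPY that the FROM path points to a manifest file
# listing the S3 objects to load, rather than directly to the data files.
copy_sql = """
COPY personal_health_identifiable_information
FROM 's3://<S3BucketName>/manifests/phi_files.manifest'
IAM_ROLE '<RedshiftRoleArn>'
MANIFEST
CSV
IGNOREHEADER 1;
"""

cursor = conn.cursor()
cursor.execute(copy_sql)
conn.commit()
conn.close()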

For this post, we use a sample personal health dataset. Load the data with the following steps:

  1. On the Amazon S3 console, navigate to the S3 bucket created from the CloudFormation template and check the dataset.
  2. Connect to the Redshift data warehouse using Query Editor v2 by establishing a connection to the database you created with the CloudFormation stack, using the user name and password you provided.

After you’re connected, you can use the following commands to create the table in the Redshift data warehouse and copy the data.

  1. Create a table with the following query:
    CREATE TABLE personal_health_identifiable_information (
        mpi char (10),
        firstName VARCHAR (30),
        lastName VARCHAR (30),
        email VARCHAR (75),
        gender CHAR (10),
        mobileNumber VARCHAR(20),
        clinicId VARCHAR(10),
        creditCardNumber VARCHAR(50),
        driverLicenseNumber VARCHAR(40),
        patientJobTitle VARCHAR(100),
        ssn VARCHAR(15),
        geo VARCHAR(250),
        mbi VARCHAR(50)    
    );

  2. Load the data from the S3 bucket:
    COPY personal_health_identifiable_information
    FROM 's3://<S3BucketName>/personal_health_identifiable_information.csv'
    IAM_ROLE '<RedshiftRoleArn>'
    CSV
    delimiter ','
    region '<aws region>'
    IGNOREHEADER 1;

Provide values for the following placeholders:

  • RedshiftRoleArn – Locate the ARN on the CloudFormation stack’s Outputs tab
  • S3BucketName – Replace with the bucket name from the CloudFormation stack
  • aws region – Change to the Region where you deployed the CloudFormation template
  3. To verify the data was loaded, run the following command:
    SELECT * FROM personal_health_identifiable_information LIMIT 10;

Run an AWS Glue crawler to populate the Data Catalog with tables

On the AWS Glue console, select the crawler that you deployed as part of the CloudFormation stack with the name crawler_pii_db, then choose Run crawler.

When the crawler is complete, the tables in the database with the name pii_db are populated in the AWS Glue Data Catalog, and the table schema looks like the following screenshot.
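
If you prefer to start the crawler programmatically instead of from the console, the following is a minimal boto3 sketch. It assumes the crawler name crawler_pii_db created by the CloudFormation template and simply polls until the crawler returns to the READY state.

# Sketch: start the crawler created by the CloudFormation template and wait for it to finish.
import time
import boto3

glue = boto3.client("glue")
crawler_name = "crawler_pii_db"

glue.start_crawler(Name=crawler_name)

# Wait for the crawler to return to the READY state before querying the Data Catalog.
while True:
    state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
    if state == "READY":
        break
    time.sleep(30)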

Run an AWS Glue job to detect PII data and mask the corresponding columns in Amazon Redshift

On the AWS Glue console, choose ETL Jobs in the navigation pane and locate the detect-pii-data job to understand its configuration. The basic and advanced properties are configured using the CloudFormation template.

The basic properties are as follows:

  • Type – Spark
  • Glue version – Glue 4.0
  • Language – Python

For demonstration purposes, the job bookmarks option is disabled, along with the auto scaling feature.

We also configure advanced properties for connections and job parameters.
To access the data in Amazon Redshift, the job uses an AWS Glue connection that connects over JDBC.

We also provide custom parameters as key-value pairs. For this post, we group the PII entity types into five detection categories:

  • universal – PERSON_NAME, EMAIL, CREDIT_CARD
  • hipaa – PERSON_NAME, PHONE_NUMBER, USA_SSN, USA_ITIN, BANK_ACCOUNT, USA_DRIVING_LICENSE, USA_HCPCS_CODE, USA_NATIONAL_DRUG_CODE, USA_NATIONAL_PROVIDER_IDENTIFIER, USA_DEA_NUMBER, USA_HEALTH_INSURANCE_CLAIM_NUMBER, USA_MEDICARE_BENEFICIARY_IDENTIFIER
  • networking – IP_ADDRESS, MAC_ADDRESS
  • united_states – PHONE_NUMBER, USA_PASSPORT_NUMBER, USA_SSN, USA_ITIN, BANK_ACCOUNT
  • custom – Coordinates

If you’re trying this solution from other countries, you can specify the custom PII fields using the custom category, because this solution is created based on US regions.

For demonstration purposes, we use a single table and pass it as the following parameter:

--table_name: table_name

For this post, we name the table personal_health_identifiable_information.

You can customize these parameters based on the individual business use case.
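
The job script itself isn't reproduced in this post, but as a rough illustration of how an AWS Glue Python job typically reads such key-value parameters, the following sketch uses getResolvedOptions. The --table_name parameter matches the one described above; --detection_categories is an illustrative name, not necessarily the parameter the deployed job uses.

# Sketch of how an AWS Glue Python job can read custom key-value parameters.
# JOB_NAME is the standard Glue parameter; detection_categories is illustrative.
import sys

from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "table_name", "detection_categories"])

table_name = args["table_name"]                       # e.g. personal_health_identifiable_information
categories = args["detection_categories"].split(",")  # e.g. ["universal", "hipaa", "custom"]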

Run the job and wait for the Success status.

The job has two goals. The first is to identify the columns in the Redshift table that contain PII and produce a list of those column names. The second is to obfuscate the data in those columns: the job reads the table data, applies a user-defined masking function to the identified columns, and updates the data in the target table using a Redshift staging table (stage_personal_health_identifiable_information) for the upserts.
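
The job script isn't included in this post, but the masking step can be pictured roughly as follows. This sketch assumes the detection step has already produced a list of column names flagged as PII and that the table has been read into a Spark DataFrame; the masking function and the column list are illustrative, not the job's actual implementation.

# Rough sketch of the masking step. Assumes `df` is a Spark DataFrame read from the
# Redshift table and `pii_columns` is the list of column names flagged by detection.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def mask_pii_columns(df: DataFrame, pii_columns: list) -> DataFrame:
    """Replace every character in the flagged columns with '#', keeping the value length."""
    for column in pii_columns:
        df = df.withColumn(column, F.regexp_replace(F.col(column).cast("string"), r".", "#"))
    return df


masked_df = mask_pii_columns(df, ["ssn", "email", "creditcardnumber"])  # illustrative column list
# The masked DataFrame is then written to the staging table
# (stage_personal_health_identifiable_information) and merged into the target table.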

Alternatively, you can also use dynamic data masking (DDM) in Amazon Redshift to protect sensitive data in your data warehouse.

Analyze the output using CloudWatch

When the job is complete, let’s review the CloudWatch logs to understand how the AWS Glue job ran. We can navigate to the CloudWatch logs by choosing Output logs on the job details page on the AWS Glue console.

The job identified every column that contains PII data, including the custom fields passed to the AWS Glue sensitive data detection transform through the job parameters.
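
If you'd rather search the logs programmatically than browse them in the console, the following boto3 sketch pulls recent events from the default AWS Glue output log group. The filter pattern is illustrative; adjust it to match the messages you see in your own job run.

# Sketch: pull recent events from the Glue job's output log group.
# /aws-glue/jobs/output is the Glue default; the filter pattern is illustrative.
import boto3

logs = boto3.client("logs")

response = logs.filter_log_events(
    logGroupName="/aws-glue/jobs/output",
    filterPattern='"PII"',   # match whatever the job actually logs in your run
    limit=50,
)

for event in response["events"]:
    print(event["message"])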

Clean up

To clean up the infrastructure and avoid additional charges, complete the following steps:

  1. Empty the S3 buckets.
  2. Delete the endpoints you created.
  3. Delete the CloudFormation stack via the AWS CloudFormation console to delete the remaining resources.

Conclusion

With this solution, you can automatically scan the data located in your Redshift clusters using an AWS Glue job, identify PII, and take the necessary actions. This can help your organization meet its security, compliance, governance, and data protection requirements.


About the Authors

Manikanta Gona is a Data and ML Engineer at AWS Professional Services. He joined AWS in 2021 with over 6 years of experience in IT. At AWS, he is focused on data lake implementations and on search and analytics workloads using Amazon OpenSearch Service. In his spare time, he loves to garden, hike, and bike with his husband.

Denys Novikov is a Senior Data Lake Architect with the Professional Services team at Amazon Web Services. He is specialized in the design and implementation of Analytics, Data Management and Big Data systems for Enterprise customers.

Anjan Mukherjee is a Data Lake Architect at AWS, specializing in big data and analytics solutions. He helps customers build scalable, reliable, secure and high-performance applications on the AWS platform.

Implement alerts in Amazon OpenSearch Service with PagerDuty

Post Syndicated from Manikanta Gona original https://aws.amazon.com/blogs/big-data/implement-alerts-in-amazon-opensearch-service-with-pagerduty/

In today’s fast-paced digital world, businesses rely heavily on their data to make informed decisions. This data is often stored and analyzed using various tools, such as Amazon OpenSearch Service, a powerful search and analytics service offered by AWS. OpenSearch Service provides real-time insights into your data to support use cases like interactive log analytics, real-time application monitoring, website search, and more. Analyzing logs can help businesses quickly identify and troubleshoot issues.

However, with the increasing amount of data, it can be challenging to monitor everything manually. Manual monitoring consumes a lot of resources and is hard to maintain as the application landscape changes. We need a sustainable and automated approach to monitor critical applications and infrastructure.

With automated alerting with a third-party service like PagerDuty, an incident management platform, combined with the robust and powerful alerting plugin provided by OpenSearch Service, businesses can proactively manage and respond to critical events. You can use this proactive alerting to monitor data patterns for existing data, monitor clusters, detect patterns, and more.

OpenSearch Dashboards provides an alerting plugin that you can use to set up various types of monitors and alerts. You can use the plugin to set up monitors on cluster health, individual documents, custom queries, or aggregated data, and use those monitors to send alerts to users.

In this post, we demonstrate how to implement PagerDuty as the notification mechanism to get notified based on cluster health status. These notifications can be delivered via various channels, including email, SMS, or custom webhooks (like PagerDuty). The OpenSearch Service alerting plugin supports complex alert rules and provides a user interface to manage them.

Solution overview

PagerDuty is a cloud-based incident management platform that helps businesses handle their alerts and incidents in real time. PagerDuty works by consolidating alerts from various monitoring tools and routing them to the right team member, ensuring that issues are addressed promptly. Many businesses are using PagerDuty for real-time incident notifications via multiple channels, ensuring that the right team members are alerted quickly.

In this post, we describe how to set up PagerDuty and integrate it with an OpenSearch Service custom webhook for alert notifications when a threshold is met.

The following diagram illustrates OpenSearch Service running within an Amazon VPC, using monitors and triggers to send a notification to the PagerDuty service through an Events API custom webhook.

We need to set up a service and integration on PagerDuty to begin receiving incident notifications from OpenSearch Service. A service in PagerDuty represents an application, component, or team that we can trigger the notification against.

Prerequisites

Before you get started, create the following resources, if not already available:

  • An Amazon OpenSearch Service domain
  • A PagerDuty account

Create a service on PagerDuty

To create a service on PagerDuty, complete the following steps:

  1. Log in to PagerDuty using your personal or enterprise account that is being used to enable the integration with OpenSearch Service.
  2. On the Services tab, choose New Service.
  3. Enter a name and optional description, then choose Next.

In the next step, we create or assign an escalation policy for the service. An escalation policy represents the order of responsibility for reacting to the issues detected on a service.

  4. If you already have an escalation policy defined within the organization or team, select Select an existing Escalation Policy and specify your policy. Otherwise, select Generate a new Escalation Policy, then choose Next.

In the next step, we can group the alerts based on time or content:

    • To group alerts together based on the alert content, select Content-Based grouping.
    • To group them based on a specific time duration, select Time-Based grouping.
    • Selecting the Intelligent grouping option will group the alerts intelligently based on content or time.
  5. Leave the defaults and choose Next.
  6. On the Integrations page, select the Events API V2 integration (this will be used for integration with OpenSearch Service) and choose Create Service.

If you don’t select the integration during this step, you can add it later.

  7. Take note of the integration key on the Integrations tab.

Create a notification channel on OpenSearch Service with a custom webhook

Custom webhooks provide the ability to send these notifications to third-party services like PagerDuty using a REST API. After we configure the notification channel, we can use it for other monitors beyond this use case, for example to detect patterns in the data stored within the cluster.
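
Before wiring up the channel, you can optionally verify the integration key by sending a test event directly to the PagerDuty Events API v2 endpoint (https://events.pagerduty.com/v2/enqueue), which is the same endpoint the custom webhook targets. The following minimal sketch uses the Python requests library; the routing key is a placeholder for the integration key you noted earlier.

# Sketch: send a test trigger event to the PagerDuty Events API v2.
# Replace the routing key placeholder with the integration key from your PagerDuty service.
import requests

event = {
    "routing_key": "<your-integration-key>",
    "event_action": "trigger",
    "payload": {
        "summary": "Test alert from OpenSearch Service integration",
        "source": "opensearch-alerting-test",
        "severity": "critical",
    },
}

response = requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=10)
print(response.status_code, response.text)

When the notification channel sends the event, the routing key is supplied through the X-Routing-Key header instead of the request body, as configured in the following steps.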

Complete the following steps to configure the notification channel:

  1. On the OpenSearch Dashboards page, choose Notifications under Amazon OpenSearch Plugins in the navigation pane.
  2. On the Channels tab, choose Create channel.
  3. Enter a name for the channel and an optional description.
  4. For Channel type, choose Custom webhook.
  5. For Method, choose POST.
  6. For Define endpoints by, select Custom attributes URL.
  7. For Host, enter events.pagerduty.com.
  8. For Path, enter v2/enqueue.
  9. Under Webhook headers, choose Add header.
  10. Enter X-Routing-Key as the key and the integration key you obtained earlier as the value.
  11. Choose Create and ensure the channel is successfully created.

Configure OpenSearch Service alerts to send notifications to PagerDuty

We can monitor OpenSearch cluster health in two different ways:

  • Using the OpenSearch Dashboards alerting plugin to set up a per cluster metrics monitor, which runs a query that retrieves metrics related to cluster health (see the sketch after this list).
  • Integrating with Amazon CloudWatch, a monitoring and observability service.
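
For reference, the per cluster metrics monitor retrieves the same status field that the cluster health API exposes, which is also what the trigger condition later in this post evaluates. The following sketch queries it directly; it assumes a domain with fine-grained access control and basic authentication, and the endpoint and credentials are placeholders.

# Sketch: query the cluster health API directly to see the status field
# that the monitor's trigger condition evaluates.
import requests

domain_endpoint = "https://<your-opensearch-domain-endpoint>"
response = requests.get(
    f"{domain_endpoint}/_cluster/health",
    auth=("<master-user>", "<master-password>"),
    timeout=10,
)

health = response.json()
print(health["status"])  # "green", "yellow", or "red"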

In this use case, we use the alerting plugin. Complete the following steps:

  1. On the OpenSearch Dashboards page, choose Alerting under Amazon OpenSearch Plugins in the navigation pane.
  2. On the Monitors tab, choose Create monitor.
  3. For Monitor name, enter a name (for example, Monitor Cluster Health).
  4. For Monitor type, select Per cluster metrics monitor.
  5. Under Schedule, configure the monitor to run every minute.
  6. In the Query section, for Request type, choose Cluster health.
  7. Choose Preview query.
  8. Create a trigger by choosing Add trigger.
  9. For Trigger name, enter a name (for example, Cluster Health Status is Red).
  10. Leave Severity level at 1 (Highest).
  11. Under Trigger condition, delete the default code and enter the following:
ctx.results[0].status == "red"
  12. Choose Preview condition response to confirm that Trigger condition response shows as false, indicating that the cluster is healthy.
  13. Under Actions, choose Add action.
  14. For Action name, enter a name (for example, Send a PagerDuty notification).
  15. For Channels, choose the channel you created earlier.
  16. For Message, enter the following code:
{ "event_action": "trigger",
"payload" :
	{	"summary": "{{ctx.trigger.name}}",
		"source": " {{ctx.monitor.name}}",
		"severity": "critical",
		"custom_details":
			{ 
				"-Severity" : "{{ctx.trigger.severity}}",
				"-Period start" : "{{ctx.periodStart}}",
				"-Period end": "{{ctx.periodEnd}}"
			}
	}
}

Note that apart from the custom_details section in the code, the rest of the fields are required by the PagerDuty Events API.

  17. Choose Send test message and verify that you receive an alert on the PagerDuty service.
  18. Choose Create and ensure the monitor was created successfully.

A notification is sent to the PagerDuty service as part of the test, which triggers a notification via a phone call or text message to the person who is on call based on the escalation policy defined earlier. This notification can be safely acknowledged and resolved in PagerDuty because it was only a test.

Clean up

To clean up the infrastructure and avoid additional charges, complete the following steps:

  1. Delete the PagerDuty service.
  2. Delete the OpenSearch Service domain that was created as part of the prerequisites.

Conclusion

The integration of OpenSearch Service alerts with PagerDuty provides a powerful and efficient solution for managing and responding to critical events in real time. With this integration, you can easily set up alerts and notifications to stay informed about potential issues within your OpenSearch Service clusters or issues related to data and documents stored within the cluster, and proactively take action to resolve any problems that arise. Additionally, the integration allows for seamless collaboration between teams, enabling them to work together to identify and troubleshoot issues as they occur.

For more information about anomaly detection and alerts in OpenSearch Service, refer to Anomaly Detection in Amazon OpenSearch and Configuring Alerts in Amazon OpenSearch.


About the Authors

Manikanta Gona is a Data and ML Engineer at AWS Professional Services. He joined AWS in 2021 with over 6 years of experience in IT. At AWS, he is focused on data lake implementations and on search and analytics workloads using Amazon OpenSearch Service. In his spare time, he loves to garden, hike, and bike with his husband.

Vivek Shrivastava is a Principal Data Architect, Data Lake, in AWS Professional Services. He is a big data enthusiast and holds 14 AWS certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finding areas for home automation.

Ravikiran Rao is a Data Architect at AWS and is passionate about solving complex data challenges for various customers. Outside of work, he is a theatre enthusiast and an amateur tennis player.

Hari Krishna KC is a Data Architect with the AWS Professional Services team. He specializes in AWS data lakes and Amazon OpenSearch Service, and has helped numerous clients migrate their workloads to data lakes and search data stores.