Process price transparency data using AWS Glue

Originally published by Hari Thatavarthy on the AWS Big Data Blog: https://aws.amazon.com/blogs/big-data/process-price-transparency-data-using-aws-glue/

The Transparency in Coverage rule is a federal regulation in the United States that was finalized by the Centers for Medicare & Medicaid Services (CMS) in October 2020. The rule requires health insurers to provide clear and concise information to consumers about their health plan benefits, including costs and coverage details. Under the rule, health insurers must make available to their members a list of negotiated rates for in-network providers, as well as an estimate of the member’s out-of-pocket costs for specific health care services. This information must be made available to members through an online tool that is accessible and easy to use. The Transparency in Coverage rule also requires insurers to make available data files that contain detailed information on the prices they negotiate with health care providers. This information can be used by employers, researchers, and others to compare prices across different insurers and health care providers. Phase 1 implementation of this regulation, which went into effect on July 1, 2022, requires that payors publish machine-readable files publicly for each plan that they offer. CMS has published a technical implementation guide with file formats, file structure, and standards for producing these machine-readable files.

This post walks you through the preprocessing and processing steps required to prepare data published by health insurers in light of this federal regulation using AWS Glue. We also show how to query and derive insights using Amazon Athena.

AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data.

Challenges processing these machine-readable files

The machine-readable files published by these payors vary in size. A single file can range from a few megabytes to hundreds of gigabytes. These files contain large JSON objects that are deeply nested. Unlike NDJSON and JSONL formats, where each line in the file is a JSON object, these files contain a single large JSON object that can span multiple lines. The following figure represents the schema of an in_network rate file published by a major health insurer on their website for public access, printed using the Spark printSchema() function. This file, when uncompressed, is about 20 GB in size, contains a single JSON object, and is deeply nested. Each box highlighted in red is a nested array structure.

JSON Schema
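
To make the structure concrete, the following abridged snippet sketches the general shape of such a file. The attribute names follow the CMS in-network schema used by the processing code later in this post; the values are purely illustrative.

{
  "reporting_entity_name": "Example Payor",
  "reporting_entity_type": "Third-Party Administrator",
  "last_updated_on": "2023-03-01",
  "version": "1.0.0",
  "provider_references": [
    {
      "provider_group_id": 1001,
      "provider_groups": [
        { "npi": [1111111111, 2222222222], "tin": { "type": "ein", "value": "12-3456789" } }
      ]
    }
  ],
  "in_network": [
    {
      "negotiation_arrangement": "ffs",
      "name": "Office visit",
      "billing_code_type": "CPT",
      "billing_code_type_version": "2023",
      "billing_code": "99213",
      "description": "Office or other outpatient visit",
      "negotiated_rates": [
        {
          "provider_references": [1001],
          "negotiated_prices": [
            {
              "negotiated_type": "negotiated",
              "negotiated_rate": 123.45,
              "expiration_date": "9999-12-31",
              "service_code": ["11"],
              "billing_class": "professional"
            }
          ]
        }
      ]
    }
  ]
}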

Loading a 20 GB deeply nested JSON object requires a machine with a large memory footprint. Data, when loaded into memory, is 4–10 times its size on disk, so a 20 GB JSON object may need a machine with up to 200 GB of memory. To process workloads larger than 20 GB, these machines need to be scaled vertically, thereby significantly increasing hardware costs. Vertical scaling has its limits, and it’s not possible to scale beyond a certain point. Analyzing this data requires unnesting and flattening of deeply nested array structures. These transformations explode the data at an exponential rate, thereby adding to the need for more memory and disk space.

You can use an in-memory distributed processing framework such as Apache Spark to process and analyze such large volumes of data. However, to load this single large JSON object as a Spark DataFrame and perform an action on it, a worker node needs enough memory to load this object in full. When a worker node tries to load this large deeply nested JSON object and there isn’t enough memory to load it in full, the processing job will fail with out-of-memory issues. This calls for splitting the large JSON object into smaller chunks using some form of preprocessing logic. Once preprocessed, these smaller files can then be further processed in parallel by worker nodes without running into out-of-memory issues.
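
As a minimal sketch of the streaming approach used by the preprocessing step described later in this post, ijson can iterate over one element of the in_network array at a time, so only a single item is ever held in memory (the local file name here is a placeholder):

import ijson

# Stream one in_network item at a time instead of loading the whole
# multi-gigabyte JSON object with json.load()
with open("in-network-rates.json", "r") as f:
    for i, item in enumerate(ijson.items(f, "in_network.item")):
        print(i, item["billing_code"])  # inspect items as they stream by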

Solution overview

The solution involves a two-step approach. The first is a preprocessing step, which takes the large JSON object as input and splits it into multiple manageable chunks. This is required to address the challenges we mentioned earlier. The second is a processing step, which prepares and publishes data for analysis.

The preprocessing step uses an AWS Glue Python shell job to split the large JSON object into smaller JSON files. The processing step unnests and flattens the array items from these smaller JSON files in parallel. It then partitions and writes the output as Parquet on Amazon Simple Storage Service (Amazon S3). The partitioned data is cataloged and analyzed using Athena. The following diagram illustrates this workflow.

Solution Overview

Prerequisites

To implement the solution in your own AWS account, you need to create or configure the following AWS resources in advance:

  • An S3 bucket to persist the source and processed data. Download the input file and upload it to the path s3://yourbucket/ptd/2023-03-01_United-HealthCare-Services--Inc-_Third-Party-Administrator_PS1-50_C2_in-network-rates.json.gz (see the upload sketch after this list).
  • An AWS Identity and Access Management (IAM) role for your AWS Glue extract, transform, and load (ETL) job. For instructions, refer to Setting up IAM permissions for AWS Glue. Adjust the permissions to ensure AWS Glue has read/write access to Amazon S3 locations.
  • An IAM role for Athena with AWS Glue Data Catalog permissions to create and query tables.
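
If you prefer to script the upload, the following minimal sketch uses boto3; the local file name and bucket name are placeholders:

import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="2023-03-01_United-HealthCare-Services--Inc-_Third-Party-Administrator_PS1-50_C2_in-network-rates.json.gz",  # downloaded input file
    Bucket="yourbucket",  # replace with your bucket
    Key="ptd/2023-03-01_United-HealthCare-Services--Inc-_Third-Party-Administrator_PS1-50_C2_in-network-rates.json.gz",
)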

Create an AWS Glue preprocessing job

The preprocessing step uses ijson, an open-source iterative JSON parser, to extract items in the outermost array of top-level attributes. By streaming and iteratively parsing the large JSON file, the preprocessing step loads only a portion of the file into memory, thereby avoiding out-of-memory issues. It also uses s3pathlib, an open-source Python interface to Amazon S3, which makes it straightforward to work with files on Amazon S3.

To create and run the AWS Glue job for preprocessing, complete the following steps:

  1. On the AWS Glue console, choose Jobs under Glue Studio in the navigation pane.
  2. Create a new job.
  3. Select Python shell script editor.
  4. Select Create a new script with boilerplate code.
    Python Shell Script Editor
  5. Enter the following code into the editor (adjust the S3 bucket names and paths to point to the input and output locations in Amazon S3):
import ijson
import json
import decimal
from s3pathlib import S3Path
import boto3

class JSONEncoder(json.JSONEncoder):
    #Convert Decimal values produced by ijson into floats so they can be JSON-serialized
    def default(self, obj):
        if isinstance(obj, decimal.Decimal):
            return float(obj)
        return json.JSONEncoder.default(self, obj)

def upload_to_s3(data, upload_path):
    #Serialize the object and write it to the target key on the S3 bucket
    data = json.dumps(data, cls=JSONEncoder).encode('utf-8')
    s3_client.put_object(Body=data, Bucket=bucket, Key=upload_path)
        
s3_client = boto3.client('s3')

#Replace with your bucket and path to JSON object on your bucket
bucket = 'yourbucket'
largefile_key = 'ptd/2023-03-01_United-HealthCare-Services--Inc-_Third-Party-Administrator_PS1-50_C2_in-network-rates.json.gz'
p = S3Path(bucket, largefile_key)


#Replace the paths to suit your needs
upload_path_base = 'ptd/preprocessed/base/base.json'
upload_path_in_network = 'ptd/preprocessed/in_network/'
upload_path_provider_references = 'ptd/preprocessed/provider_references/'

#Extract the values of the following top-level attributes and persist them on your S3 bucket
# -- reporting_entity_name
# -- reporting_entity_type
# -- last_updated_on
# -- version

base ={
    'reporting_entity_name' : '',
    'reporting_entity_type' : '',
    'last_updated_on' :'',
    'version' : ''
}

with p.open("r") as f:
    obj = ijson.items(f, 'reporting_entity_name')
    for evt in obj:
        base['reporting_entity_name'] = evt
        break
    
with p.open("r") as f:
    obj = ijson.items(f, 'reporting_entity_type')
    for evt in obj:
        base['reporting_entity_type'] = evt
        break
    
with p.open("r") as f:
    obj = ijson.items(f, 'last_updated_on')
    for evt in obj:
        base['last_updated_on'] = evt
        break
    
with p.open("r") as f:
    obj = ijson.items(f,'version')
    for evt in obj:
        base['version'] = evt
        break
        
upload_to_s3(base,upload_path_base)

#Stream the items of the provider_references array
#For every 1000 items, create a JSON file on the S3 bucket
with p.open("r") as f:
    provider_references = ijson.items(f, 'provider_references.item')
    fk = 0
    lst = []
    path = ''
    for rowcnt,row in enumerate(provider_references):
        if rowcnt % 1000 == 0:
            if fk > 0:
                dest = upload_path_provider_references + path
                upload_to_s3(lst,dest)
            lst = []
            path = 'provider_references_{0}.json'.format(fk)
            fk = fk + 1

        lst.append(row)

    #Upload the final (possibly partial) chunk
    if lst:
        dest = upload_path_provider_references + path
        upload_to_s3(lst,dest)
    
#Stream the items of the in_network array
#For every 25 items, create a JSON file on the S3 bucket
with p.open("r") as f:
    in_network = ijson.items(f, 'in_network.item')
    fk = 0
    lst = []
    path = ''
    for rowcnt,row in enumerate(in_network):
        if rowcnt % 25 == 0:
            if fk > 0:
                dest = upload_path_in_network + path
                upload_to_s3(lst,dest)
            lst = []
            path = 'in_network_{0}.json'.format(fk)
            fk = fk + 1

        lst.append(row)

    #Upload the final (possibly partial) chunk
    if lst:
        dest = upload_path_in_network + path
        upload_to_s3(lst,dest)
  6. Update the properties of your job on the Job details tab:
    1. For Type, choose Python Shell.
    2. For Python version, choose Python 3.9.
    3. For Data processing units, choose 1 DPU.

For Python shell jobs, you can allocate either 0.0625 or 1 DPU. The default is 0.0625 DPU. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.

python shell job config

The Python libraries ijson and s3pathlib are available in pip and can be installed using the AWS Glue job parameter --additional-python-modules. You can also choose to package these libraries, upload them to Amazon S3, and refer to them from your AWS Glue job. For instructions on packaging your library, refer to Providing your own Python library.

  7. To install the Python libraries, set the following job parameters:
    • Key: --additional-python-modules
    • Value: ijson,s3pathlib
    install python modules
  8. Run the job.
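
If you prefer to create and configure this job programmatically, the following boto3 sketch shows one way to do it; the job name, role, and script location are placeholders, not values from this post:

import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="ptd-preprocess",  # placeholder job name
    Role="YourGlueJobRole",  # IAM role created in the prerequisites
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://yourbucket/scripts/preprocess.py",  # where you stored the script
        "PythonVersion": "3.9",
    },
    MaxCapacity=1.0,  # 1 DPU
    DefaultArguments={"--additional-python-modules": "ijson,s3pathlib"},
)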

The preprocessing step creates three folders in the S3 bucket: base, in_network and provider_references.

s3_folder_1

Files in the in_network and provider_references folders contain arrays of JSON objects. Each of these JSON objects represents an element in the outermost array of the original large JSON object.

s3_folder_2

Create an AWS Glue processing job

The processing job uses the output of the preprocessing step to create a denormalized view of data by extracting and flattening elements and attributes from nested arrays. The extent of unnesting depends on the attributes we need for analysis. For example, attributes such as negotiated_rate, npi, and billing_code are essential for analysis, and extracting the values associated with these attributes requires multiple levels of unnesting. The denormalized data is then partitioned by the billing_code column, persisted as Parquet on Amazon S3, and registered as a table in the AWS Glue Data Catalog for querying.

The following code sample guides you through the implementation using PySpark. The columns used to partition the data depend on the query patterns used to analyze the data. Arriving at a partitioning strategy that is in line with those query patterns will improve overall query performance during analysis. This post assumes that the queries used for analyzing the data will always use the column billing_code to filter and fetch data of interest. Data in each partition is also bucketed by npi to improve query performance.

To create your AWS Glue job, complete the following steps:

  1. On the AWS Glue console, choose Jobs under Glue Studio in the navigation pane.
  2. Create a new job.
  3. Select Spark script editor.
  4. Select Create a new script with boilerplate code.
  5. Enter the following code into the editor (adjust the S3 bucket names and paths to point to the input and output locations in Amazon S3):
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

#create a dataframe of base objects - reporting_entity_name, reporting_entity_type, version, last_updated_on
#using the output of preprocessing step

base_df = spark.read.json('s3://yourbucket/ptd/preprocessed/base/')

#create a dataframe over provider_references objects using the output of preprocessing step
prvd_df = spark.read.json('s3://yourbucket/ptd/preprocessed/provider_references/')

#cross join dataframe of base objects with dataframe of provider_references 
prvd_df = prvd_df.crossJoin(base_df)

#create a dataframe over in_network objects using the output of preprocessing step
in_ntwrk_df = spark.read.json('s3://yourbucket/ptd/preprocessed/in_network/')

#unnest and flatten negotiated_rates and provider_references from in_network objects
in_ntwrk_df2 = in_ntwrk_df.select(
 in_ntwrk_df.billing_code, in_ntwrk_df.billing_code_type, in_ntwrk_df.billing_code_type_version,
 in_ntwrk_df.covered_services, in_ntwrk_df.description, in_ntwrk_df.name,
 explode(in_ntwrk_df.negotiated_rates).alias('exploded_negotiated_rates'),
 in_ntwrk_df.negotiation_arrangement)


in_ntwrk_df3 = in_ntwrk_df2.select(
 in_ntwrk_df2.billing_code, in_ntwrk_df2.billing_code_type, in_ntwrk_df2.billing_code_type_version,
 in_ntwrk_df2.covered_services, in_ntwrk_df2.description, in_ntwrk_df2.name,
 in_ntwrk_df2.exploded_negotiated_rates.negotiated_prices.alias(
 'exploded_negotiated_rates_negotiated_prices'),
 explode(in_ntwrk_df2.exploded_negotiated_rates.provider_references).alias(
 'exploded_negotiated_rates_provider_references'),
 in_ntwrk_df2.negotiation_arrangement)

#join the exploded in_network dataframe with provider_references dataframe
jdf = prvd_df.join(
 in_ntwrk_df3,
 prvd_df.provider_group_id == in_ntwrk_df3.exploded_negotiated_rates_provider_references,"fullouter")

#unnest and flatten attributes from the rest of the nested arrays
jdf2 = jdf.select(
 jdf.reporting_entity_name,jdf.reporting_entity_type,jdf.last_updated_on,jdf.version,
 jdf.provider_group_id, jdf.provider_groups, jdf.billing_code,
 jdf.billing_code_type, jdf.billing_code_type_version, jdf.covered_services,
 jdf.description, jdf.name,
 explode(jdf.exploded_negotiated_rates_negotiated_prices).alias(
 'exploded_negotiated_rates_negotiated_prices'),
 jdf.exploded_negotiated_rates_provider_references,
 jdf.negotiation_arrangement)

jdf3 = jdf2.select(
 jdf2.reporting_entity_name,jdf2.reporting_entity_type,jdf2.last_updated_on,jdf2.version,
 jdf2.provider_group_id,
 explode(jdf2.provider_groups).alias('exploded_provider_groups'),
 jdf2.billing_code, jdf2.billing_code_type, jdf2.billing_code_type_version,
 jdf2.covered_services, jdf2.description, jdf2.name,
 jdf2.exploded_negotiated_rates_negotiated_prices.additional_information.
 alias('additional_information'),
 jdf2.exploded_negotiated_rates_negotiated_prices.billing_class.alias(
 'billing_class'),
 jdf2.exploded_negotiated_rates_negotiated_prices.billing_code_modifier.
 alias('billing_code_modifier'),
 jdf2.exploded_negotiated_rates_negotiated_prices.expiration_date.alias(
 'expiration_date'),
 jdf2.exploded_negotiated_rates_negotiated_prices.negotiated_rate.alias(
 'negotiated_rate'),
 jdf2.exploded_negotiated_rates_negotiated_prices.negotiated_type.alias(
 'negotiated_type'),
 jdf2.exploded_negotiated_rates_negotiated_prices.service_code.alias(
 'service_code'), jdf2.exploded_negotiated_rates_provider_references,
 jdf2.negotiation_arrangement)

jdf4 = jdf3.select(jdf3.reporting_entity_name,jdf3.reporting_entity_type,jdf3.last_updated_on,jdf3.version,
 jdf3.provider_group_id,
 explode(jdf3.exploded_provider_groups.npi).alias('npi'),
 jdf3.exploded_provider_groups.tin.type.alias('tin_type'),
 jdf3.exploded_provider_groups.tin.value.alias('tin'),
 jdf3.billing_code, jdf3.billing_code_type,
 jdf3.billing_code_type_version, jdf3.covered_services,
 jdf3.description, jdf3.name, jdf3.additional_information,
 jdf3.billing_class, jdf3.billing_code_modifier,
 jdf3.expiration_date, jdf3.negotiated_rate,
 jdf3.negotiated_type, jdf3.service_code,
 jdf3.negotiation_arrangement)

#repartition by billing_code
#Repartitioning changes the distribution of data across the Spark cluster
#By repartitioning the data, we avoid writing too many small files
jdf5=jdf4.repartition("billing_code")
 
datasink_path = "s3://yourbucket/ptd/processed/billing_code_npi/parquet/"

#persist the dataframe as Parquet on S3 and catalog it
#Partition the data by billing_code. This enables analytical queries to skip data and improves query performance
#Data is also bucketed and sorted by npi to improve query performance during analysis

jdf5.write.format('parquet').mode("overwrite").partitionBy('billing_code').bucketBy(2, 'npi').sortBy('npi').saveAsTable('ptdtable', path = datasink_path)
  6. Update the properties of your job on the Job details tab:
    1. For Type, choose Spark.
    2. For Glue version, choose Glue 4.0.
    3. For Language, choose Python 3.
    4. For Worker type, choose G.2X.
    5. For Requested number of workers, enter 20.

The number of workers and the worker type to use for your processing job depend on factors such as the amount of data being processed, the speed at which it needs to be processed, and the partitioning strategy used. Repartitioning data can result in out-of-memory issues, especially when the data is heavily skewed on the column used to repartition. It’s also possible to reach Amazon S3 service limits if too many workers are assigned to the job, because tasks running on these worker nodes may try to read/write from the same S3 prefix, causing Amazon S3 to throttle the incoming requests. For more details, refer to Best practices design patterns: optimizing Amazon S3 performance.
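
One quick way to gauge skew on the partition column is to add a diagnostic aggregation to the processing script (jdf4 is the DataFrame built in the script above) before the repartition and write steps:

from pyspark.sql.functions import desc

# Count rows per billing_code; a few very large counts indicate skew that
# can lead to out-of-memory errors when repartitioning on that column.
jdf4.groupBy("billing_code").count().orderBy(desc("count")).show(20, truncate=False)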

processing job config

Exploding array elements creates new rows and columns, thereby exponentially increasing the amount of data that needs to be processed. Apache Spark splits this data into multiple Spark partitions on different worker nodes so that it can process large amounts of data in parallel. In Apache Spark, shuffling happens when data needs to be redistributed across the cluster. Shuffle operations are commonly triggered by wide transformations such as join, reduceByKey, groupByKey, and repartition. In case of exceptions due to local storage limitations, it helps to supplement or replace local disk storage capacity with Amazon S3 for large shuffle operations. This is possible with the AWS Glue Spark shuffle plugin with Amazon S3. With the cloud shuffle storage plugin for Apache Spark, you can avoid disk space-related failures.

  7. To use the Spark shuffle plugin, set the following job parameters:
    • Key: --write-shuffle-files-to-s3
    • Value: true
      spark shuffle plugin
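
For reference, the equivalent programmatic configuration for this Spark job, including the shuffle parameter above, might look like the following boto3 sketch; the job name, role, and script location are placeholders:

import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="ptd-process",  # placeholder job name
    Role="YourGlueJobRole",  # IAM role created in the prerequisites
    GlueVersion="4.0",
    WorkerType="G.2X",
    NumberOfWorkers=20,
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://yourbucket/scripts/process.py",  # where you stored the script
        "PythonVersion": "3",
    },
    DefaultArguments={"--write-shuffle-files-to-s3": "true"},
)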

Query the data

You can query the cataloged data using Athena. For instructions on setting up Athena, refer to Setting up.

On the Athena console, choose Query editor in the navigation pane to run your query, and specify your data source and database.

sql query

To find the minimum, maximum, and average negotiated rates for procedure codes, run the following query:

SELECT
    billing_code,
    round(min(negotiated_rate), 2) AS min_price,
    round(avg(negotiated_rate), 2) AS avg_price,
    round(max(negotiated_rate), 2) AS max_price,
    description
FROM "default"."ptdtable"
GROUP BY billing_code, description
LIMIT 10;

The following screenshot shows the query results.

sql query results
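
Because the table is partitioned by billing_code and bucketed by npi, queries that filter on those columns can skip data. For example, the following query (the billing code value is illustrative) retrieves negotiated rates for a single procedure code:

SELECT
    npi,
    tin,
    billing_class,
    negotiated_type,
    negotiated_rate
FROM "default"."ptdtable"
WHERE billing_code = '99213'
ORDER BY negotiated_rate
LIMIT 100;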

Clean up

To avoid incurring future charges, delete the AWS resources you created:

  1. Delete the S3 objects and bucket.
  2. Delete the IAM policies and roles.
  3. Delete the AWS Glue jobs for preprocessing and processing.

Conclusion

This post guided you through the necessary preprocessing and processing steps to query and analyze price transparency-related machine-readable files. Although it’s possible to use other AWS services to process such data, this post focused on preparing and publishing data using AWS Glue.

To learn more about the Transparency in Coverage rule, refer to Transparency in Coverage. For best practices for scaling Apache Spark jobs and partitioning data with AWS Glue, refer to Best practices to scale Apache Spark jobs and partition data with AWS Glue. To learn how to monitor AWS Glue jobs, refer to Monitoring AWS Glue Spark jobs.

We look forward to hearing any feedback or questions.


About the Authors

Hari Thatavarthy is a Senior Solutions Architect on the AWS Data Lab team. He helps customers design and build solutions in the data and analytics space. He believes in data democratization and loves to solve complex data processing-related problems. In his spare time, he loves to play table tennis.

Krishna Maddileti is a Senior Solutions Architect on the AWS Data Lab team. He partners with customers on their AWS journey and helps them with data engineering, data lakes, and analytics. In his spare time, he enjoys spending time with his family and playing video games with his 7-year-old.

Yadukishore Tatavarthi is a Senior Partner Solutions Architect at AWS. He works closely with global system integrator partners to enable and support customers moving their workloads to AWS.

Manish Kola is a Solutions Architect on the AWS Data Lab team. He partners with customers on their AWS journey.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.