Tag Archives: Healthcare

Improve healthcare services through patient 360: A zero-ETL approach to enable near real-time data analytics

2024-03-27 Saeed Barghi

Post Syndicated from Saeed Barghi original https://aws.amazon.com/blogs/big-data/improve-healthcare-services-through-patient-360-a-zero-etl-approach-to-enable-near-real-time-data-analytics/

Healthcare providers have an opportunity to improve the patient experience by collecting and analyzing broader and more diverse datasets. This includes patient medical history, allergies, immunizations, family disease history, and individuals’ lifestyle data such as workout habits. Having access to those datasets and forming a 360-degree view of patients allows healthcare providers such as claim analysts to see a broader context about each patient and personalize the care they provide for every individual. This is underpinned by building a complete patient profile that enables claim analysts to identify patterns, trends, potential gaps in care, and adherence to care plans. They can then use the result of their analysis to understand a patient’s health status, treatment history, and past or upcoming doctor consultations to make more informed decisions, streamline the claim management process, and improve operational outcomes. Achieving this will also improve general public health through better and more timely interventions, identify health risks through predictive analytics, and accelerate the research and development process.

AWS has invested in a zero-ETL (extract, transform, and load) future so that builders can focus more on creating value from data, instead of having to spend time preparing data for analysis. The solution proposed in this post follows a zero-ETL approach to data integration to facilitate near real-time analytics and deliver a more personalized patient experience. The solution uses AWS services such as AWS HealthLake, Amazon Redshift, Amazon Kinesis Data Streams, and AWS Lake Formation to build a 360 view of patients. These services enable you to collect and analyze data in near real time and put a comprehensive data governance framework in place that uses granular access control to secure sensitive data from unauthorized users.

Zero-ETL refers to a set of features on the AWS Cloud that enable integrating different data sources with Amazon Redshift:

Integration between Amazon Redshift and Amazon Simple Storage Service (Amazon S3) via Amazon Redshift Spectrum and auto-copy features
Integration between Amazon Redshift and Amazon Aurora, Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB via the zero-ETL feature
Integration between Amazon Redshift and streaming sources like Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK) via streaming ingestion

Solution overview

Organizations in the healthcare industry are currently spending a significant amount of time and money on building complex ETL pipelines for data movement and integration. This means data will be replicated across multiple data stores via bespoke and in some cases hand-written ETL jobs, resulting in data inconsistency, latency, and potential security and privacy breaches.

With support for querying cross-account Apache Iceberg tables via Amazon Redshift, you can now build a more comprehensive patient-360 analysis by querying all patient data from one place. This means you can seamlessly combine information such as clinical data stored in HealthLake with data stored in operational databases such as a patient relationship management system, together with data produced from wearable devices in near real-time. Having access to all this data enables healthcare organizations to form a holistic view of patients, improve care coordination across multiple organizations, and provide highly personalized care for each individual.

The following diagram depicts the high-level solution we build to achieve these outcomes.

Deploy the solution

You can use the following AWS CloudFormation template to deploy the solution components:

This stack creates the following resources and necessary permissions to integrate the services:

A Kinesis data stream. You can send data from your streaming source to this resource for ingesting the data into a Redshift data warehouse. We use on-demand capacity mode.
An Amazon Aurora MySQL-Compatible Edition cluster version 8.0. This will be your online transaction processing (OLTP) data store for transactional data. To set up zero-ETL integration for ingesting transaction data to the Redshift data warehouse, see Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift. The required parameter groups for source and target are already created as part of the CloudFormation stack.
An Amazon Redshift Serverless workgroup and associated namespace. The CloudFormation stack also deploys a provisioned Redshift cluster. If you would like to work with Redshift Serverless, you can remove the provisioned cluster from the template and vice versa.
An AWS Identity and Access Management (IAM) role with required policies and trust relationships.
Network components, including VPC, subnets, route table, and associations. You can customize these resources as per your organization’s rules.

AWS Solution setup

AWS HealthLake

AWS HealthLake enables organizations in the health industry to securely store, transform, transact, and analyze health data. It stores data in HL7 FHIR format, which is an interoperability standard designed for quick and efficient exchange of health data. When you create a HealthLake data store, a Fast Healthcare Interoperability Resources (FHIR) data repository is made available via a RESTful API endpoint. Simultaneously and as part of AWS HealthLake managed service, the nested JSON FHIR data undergoes an ETL process and is stored in Apache Iceberg open table format in Amazon S3.

To create an AWS HealthLake data store, refer to Getting started with AWS HealthLake. Make sure to select the option Preload sample data when creating your data store.

In real-world scenarios and when you use AWS HealthLake in production environments, you don’t need to load sample data into your AWS HealthLake data store. Instead, you can use FHIR REST API operations to manage and search resources in your AWS HealthLake data store.

We use two tables from the sample data stored in HealthLake: patient and allergyintolerance.

Query AWS HealthLake tables with Redshift Serverless

Amazon Redshift is the data warehousing service available on the AWS Cloud that provides up to six times better price-performance than any other cloud data warehouses in the market, with a fully managed, AI-powered, massively parallel processing (MPP) data warehouse built for performance, scale, and availability. With continuous innovations added to Amazon Redshift, it is now more than just a data warehouse. It enables organizations of different sizes and in different industries to access all the data they have in their AWS environments and analyze it from one single location with a set of features under the zero-ETL umbrella. Amazon Redshift integrates with AWS HealthLake and data lakes through Redshift Spectrum and Amazon S3 auto-copy features, enabling you to query data directly from files on Amazon S3.

Query AWS HealthLake data with Amazon Redshift

Amazon Redshift makes it straightforward to query the data stored in S3-based data lakes with automatic mounting of an AWS Glue Data Catalog in the Redshift query editor v2. This means you no longer have to create an external schema in Amazon Redshift to use the data lake tables cataloged in the Data Catalog. To get started with this feature, see Querying the AWS Glue Data Catalog. After it is set up and you’re connected to the Redshift query editor v2, complete the following steps:

Validate that your tables are visible in the query editor V2. The Data Catalog objects are listed under the awsdatacatalog database.

FHIR data stored in AWS HealthLake is highly nested. To learn about how to un-nest semi-structured data with Amazon Redshift, see Tutorial: Querying nested data with Amazon Redshift Spectrum.

Use the following query to un-nest the allergyintolerance and patient tables, join them together, and get patient details and their allergies:

WITH patient_allergy AS 
(
    SELECT
        resourcetype, 
        c AS allery_category,
        a."patient"."reference",
        SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id,
        a.recordeddate AS allergy_record_date,
        NVL(cd."code", 'NA') AS allergy_code,
        NVL(cd.display, 'NA') AS allergy_description

    FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a
            LEFT JOIN a.category c ON TRUE
            LEFT JOIN a.reaction r ON TRUE
            LEFT JOIN r.manifestation m ON TRUE
            LEFT JOIN m.coding cd ON TRUE
), patinet_info AS
(
    SELECT id,
            gender,
            g as given_name,
            n.family as family_name,
            pr as prefix

    FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p
            LEFT JOIN p.name n ON TRUE
            LEFT JOIN n.given g ON TRUE
            LEFT JOIN n.prefix pr ON TRUE
)
SELECT DISTINCT p.id, 
        p.gender, 
        p.prefix,
        p.given_name,
        p.family_name,
        pa.allery_category,
        pa.allergy_code,
        pa.allergy_description
from patient_allergy pa
    JOIN patinet_info p
        ON pa.patient_id = p.id
ORDER BY p.id, pa.allergy_code
;

To eliminate the need for Amazon Redshift to un-nest data every time a query is run, you can create a materialized view to hold un-nested and flattened data. Materialized views are an effective mechanism to deal with complex and repeating queries. They contain a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database.

Use the following SQL to create a materialized view. You use it later to build a complete view of patients:

CREATE MATERIALIZED VIEW patient_allergy_info AUTO REFRESH YES AS
WITH patient_allergy AS 
(
    SELECT
        resourcetype, 
        c AS allery_category,
        a."patient"."reference",
        SUBSTRING(a."patient"."reference", 9, LEN(a."patient"."reference")) AS patient_id,
        a.recordeddate AS allergy_record_date,
        NVL(cd."code", 'NA') AS allergy_code,
        NVL(cd.display, 'NA') AS allergy_description

    FROM
        "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."allergyintolerance" a
            LEFT JOIN a.category c ON TRUE
            LEFT JOIN a.reaction r ON TRUE
            LEFT JOIN r.manifestation m ON TRUE
            LEFT JOIN m.coding cd ON TRUE
), patinet_info AS
(
    SELECT id,
            gender,
            g as given_name,
            n.family as family_name,
            pr as prefix

    FROM "awsdatacatalog"."datastore_01_179674d36391d68926a8d74c12599306_healthlake_view"."patient" p
            LEFT JOIN p.name n ON TRUE
            LEFT JOIN n.given g ON TRUE
            LEFT JOIN n.prefix pr ON TRUE
)
SELECT DISTINCT p.id, 
        p.gender, 
        p.prefix,
        p.given_name,
        p.family_name,
        pa.allery_category,
        pa.allergy_code,
        pa.allergy_description
from patient_allergy pa
    JOIN patinet_info p
        ON pa.patient_id = p.id
ORDER BY p.id, pa.allergy_code
;

You have confirmed you can query data in AWS HealthLake via Amazon Redshift. Next, you set up zero-ETL integration between Amazon Redshift and Amazon Aurora MySQL.

Set up zero-ETL integration between Amazon Aurora MySQL and Redshift Serverless

Applications such as front-desk software, which are used to schedule appointments and register new patients, store data in OLTP databases such as Aurora. To get data out of OLTP databases and have them ready for analytics use cases, data teams might have to spend a considerable amount of time to build, test, and deploy ETL jobs that are complex to maintain and scale.

With the Amazon Redshift zero-ETL integration with Amazon Aurora MySQL, you can run analytics on the data stored in OLTP databases and combine them with the rest of the data in Amazon Redshift and AWS HealthLake in near real time. In the next steps in this section, we connect to a MySQL database and set up zero-ETL integration with Amazon Redshift.

Connect to an Aurora MySQL database and set up data

Connect to your Aurora MySQL database using your editor of choice using AdminUsername and AdminPassword that you entered when running the CloudFormation stack. (For simplicity, it is the same for Amazon Redshift and Aurora.)

When you’re connected to your database, complete the following steps:

Create a new database by running the following command:
```
CREATE DATABASE front_desk_app_db;
```
Create a new table. This table simulates storing patient information as they visit clinics and other healthcare centers. For simplicity and to demonstrate specific capabilities, we assume that patient IDs are the same in AWS HealthLake and the front-of-office application. In real-world scenarios, this can be a hashed version of a national health care number:
```
CREATE TABLE patient_appointment ( 
      patient_id varchar(250), 
      gender varchar(1), 
      date_of_birth date, 
      appointment_datetime datetime, 
      phone_number varchar(15), 
      PRIMARY KEY (patient_id, appointment_datetime) 
);
```

Having a primary key in the table is mandatory for zero-ETL integration to work.

Insert new records into the source table in the Aurora MySQL database. To demonstrate the required functionalities, make sure the patient_id of the sample records inserted into the MySQL database match the ones in AWS HealthLake. Replace [patient_id_1] and [patient_id_2] in the following query with the ones from the Redshift query you ran previously (the query that joined allergyintolerance and patient):

INSERT INTO front_desk_app_db.patient_appointment (patient_id, gender, date_of_birth, appointment_datetime, phone_number)

VALUES([PATIENT_ID_1], 'F', '1988-7-04', '2023-12-19 10:15:00', '0401401401'),
([PATIENT_ID_1], 'F', '1988-7-04', '2023-09-19 11:00:00', '0401401401'),
([PATIENT_ID_1], 'F', '1988-7-04', '2023-06-06 14:30:00', '0401401401'),
([PATIENT_ID_2], 'F', '1972-11-14', '2023-12-19 08:15:00', '0401401402'),
([PATIENT_ID_2], 'F', '1972-11-14', '2023-01-09 12:15:00', '0401401402');

Now that your source table is populated with sample records, you can set up zero-ETL and have data ingested into Amazon Redshift.

Set up zero-ETL integration between Amazon Aurora MySQL and Amazon Redshift

Complete the following steps to create your zero-ETL integration:

On the Amazon RDS console, choose Databases in the navigation pane.
Choose the DB identifier of your cluster (not the instance).
On the Zero-ETL Integration tab, choose Create zero-ETL integration.
Follow the steps to create your integration.

Create a Redshift database from the integration

Next, you create a target database from the integration. You can do this by running a couple of simple SQL commands on Amazon Redshift. Log in to the query editor V2 and run the following commands:

Get the integration ID of the zero-ETL you set up between your source database and Amazon Redshift:
```
SELECT * FROM svv_integration;
```

Create a database using the integration ID:

CREATE DATABASE ztl_demo FROM INTEGRATION '[INTEGRATION_ID ';

Query the database and validate that a new table is created and populated with data from your source MySQL database:
```
SELECT * FROM ztl_demo.front_desk_app_db.patient_appointment;
```

It might take a few seconds for the first set of records to appear in Amazon Redshift.

This shows that the integration is working as expected. To validate it further, you can insert a new record in your Aurora MySQL database, and it will be available in Amazon Redshift for querying in near real time within a few seconds.

Set up streaming ingestion for Amazon Redshift

Another aspect of zero-ETL on AWS, for real-time and streaming data, is realized through Amazon Redshift Streaming Ingestion. It provides low-latency, high-speed ingestion of streaming data from Kinesis Data Streams and Amazon MSK. It lowers the effort required to have data ready for analytics workloads, lowers the cost of running such workloads on the cloud, and decreases the operational burden of maintaining the solution.

In the context of healthcare, understanding an individual’s exercise and movement patterns can help with overall health assessment and better treatment planning. In this section, you send simulated data from wearable devices to Kinesis Data Streams and integrate it with the rest of the data you already have access to from your Redshift Serverless data warehouse.

For step-by-step instructions, refer to Real-time analytics with Amazon Redshift streaming ingestion. Note the following steps when you set up streaming ingestion for Amazon Redshift:

Select wearables_stream and use the following template when sending data to Amazon Kinesis Data Streams via Kinesis Data Generator, to simulate data generated by wearable devices. Replace [PATIENT_ID_1] and [PATIENT_ID_2] with the patient IDs you earlier when inserting new records into your Aurora MySQL table:
```
{
   "patient_id": "{{random.arrayElement(["[PATIENT_ID_1]"," [PATIENT_ID_2]"])}}",
   "steps_increment": "{{random.arrayElement(
      [0,1]
   )}}",
   "heart_rate": {{random.number( 
      {
         "min":45,
         "max":120}
   )}}
}
```
Create an external schema called from_kds by running the following query and replacing [IAM_ROLE_ARN] with the ARN of the role created by the CloudFormation stack (Patient360BlogRole):
```
CREATE EXTERNAL SCHEMA from_kds
FROM KINESIS
IAM_ROLE '[IAM_ROLE_ARN]';
```

Use the following SQL when creating a materialized view to consume data from the stream:

CREATE MATERIALIZED VIEW patient_wearable_data AUTO REFRESH YES AS 
SELECT approximate_arrival_timestamp, 
      JSON_PARSE(kinesis_data) as Data FROM from_kds."wearables_stream" 
WHERE CAN_JSON_PARSE(kinesis_data);

To validate that streaming ingestion works as expected, refresh the materialized view to get the data you already sent to the data stream and query the table to make sure data has landed in Amazon Redshift:
```
REFRESH MATERIALIZED VIEW patient_wearable_data;

SELECT *
FROM patient_wearable_data
ORDER BY approximate_arrival_timestamp DESC;
```

Query and analyze patient wearable data

The results in the data column of the preceding query are in JSON format. Amazon Redshift makes it straightforward to work with semi-structured data in JSON format. It uses PartiQL language to offer SQL-compatible access to relational, semi-structured, and nested data. Use the following query to flatten data:

SELECT data."patient_id"::varchar AS patient_id,       
      data."steps_increment"::integer as steps_increment,       
      data."heart_rate"::integer as heart_rate, 
      approximate_arrival_timestamp 
FROM patient_wearable_data 
ORDER BY approximate_arrival_timestamp DESC;

The result looks like the following screenshot.

Now that you know how to flatten JSON data, you can analyze it further. Use the following query to get the number of minutes a patient has been physically active per day, based on their heart rate (greater than 80):

WITH patient_wearble_flattened AS
(
   SELECT data."patient_id"::varchar AS patient_id,
      data."steps_increment"::integer as steps_increment,
      data."heart_rate"::integer as heart_rate,
      approximate_arrival_timestamp,
      DATE(approximate_arrival_timestamp) AS date_received,
      extract(hour from approximate_arrival_timestamp) AS    hour_received,
      extract(minute from approximate_arrival_timestamp) AS minute_received
   FROM patient_wearable_data
), patient_active_minutes AS
(
   SELECT patient_id,
      date_received,
      hour_received,
      minute_received,
      avg(heart_rate) AS heart_rate
   FROM patient_wearble_flattened
   GROUP BY patient_id,
      date_received,
      hour_received,
      minute_received
   HAVING avg(heart_rate) > 80
)
SELECT patient_id,
      date_received,
      COUNT(heart_rate) AS active_minutes_count
FROM patient_active_minutes
GROUP BY patient_id,
      date_received
ORDER BY patient_id,
      date_received;

Create a complete patient 360

Now that you are able to query all patient data with Redshift Serverless, you can combine the three datasets you used in this post and form a comprehensive patient 360 view with the following query:

WITH patient_appointment_info AS
(
      SELECT "patient_id",
         "gender",
         "date_of_birth",
         "appointment_datetime",
         "phone_number"
      FROM ztl_demo.front_desk_app_db.patient_appointment
),
patient_wearble_flattened AS
(
      SELECT data."patient_id"::varchar AS patient_id,
         data."steps_increment"::integer as steps_increment,
         data."heart_rate"::integer as heart_rate,
         approximate_arrival_timestamp,
         DATE(approximate_arrival_timestamp) AS date_received,
         extract(hour from approximate_arrival_timestamp) AS hour_received,
         extract(minute from approximate_arrival_timestamp) AS minute_received
      FROM patient_wearable_data
), patient_active_minutes AS
(
      SELECT patient_id,
         date_received,
         hour_received,
         minute_received,
         avg(heart_rate) AS heart_rate
      FROM patient_wearble_flattened
      GROUP BY patient_id,
         date_received,
         hour_received,
         minute_received
         HAVING avg(heart_rate) > 80
), patient_active_minutes_count AS
(
      SELECT patient_id,
         date_received,
         COUNT(heart_rate) AS active_minutes_count
      FROM patient_active_minutes
      GROUP BY patient_id,
         date_received
)
SELECT pai.patient_id,
      pai.gender,
      pai.prefix,
      pai.given_name,
      pai.family_name,
      pai.allery_category,
      pai.allergy_code,
      pai.allergy_description,
      ppi.date_of_birth,
      ppi.appointment_datetime,
      ppi.phone_number,
      pamc.date_received,
      pamc.active_minutes_count
FROM patient_allergy_info pai
      LEFT JOIN patient_active_minutes_count pamc
            ON pai.patient_id = pamc.patient_id
      LEFT JOIN patient_appointment_info ppi
            ON pai.patient_id = ppi.patient_id
GROUP BY pai.patient_id,
      pai.gender,
      pai.prefix,
      pai.given_name,
      pai.family_name,
      pai.allery_category,
      pai.allergy_code,
      pai.allergy_description,
      ppi.date_of_birth,
      ppi.appointment_datetime,
      ppi.phone_number,
      pamc.date_received,
      pamc.active_minutes_count
ORDER BY pai.patient_id,
      pai.gender,
      pai.prefix,
      pai.given_name,
      pai.family_name,
      pai.allery_category,
      pai.allergy_code,
      pai.allergy_description,
      ppi.date_of_birth DESC,
      ppi.appointment_datetime DESC,
      ppi.phone_number DESC,
      pamc.date_received,
      pamc.active_minutes_count

You can use the solution and queries used here to expand the datasets used in your analysis. For example, you can include other tables from AWS HealthLake as needed.

Clean up

To clean up resources you created, complete the following steps:

Delete the zero-ETL integration between Amazon RDS and Amazon Redshift.
Delete the CloudFormation stack.
Delete AWS HealthLake data store

Conclusion

Forming a comprehensive 360 view of patients by integrating data from various different sources offers numerous benefits for organizations operating in the healthcare industry. It enables healthcare providers to gain a holistic understanding of a patient’s medical journey, enhances clinical decision-making, and allows for more accurate diagnosis and tailored treatment plans. With zero-ETL features for data integration on AWS, it is effortless to build a view of patients securely, cost-effectively, and with minimal effort.

You can then use visualization tools such as Amazon QuickSight to build dashboards or use Amazon Redshift ML to enable data analysts and database developers to train machine learning (ML) models with the data integrated through Amazon Redshift zero-ETL. The result is a set of ML models that are trained with a broader view into patients, their medical history, and their lifestyle, and therefore enable you make more accurate predictions about their upcoming health needs.

About the Authors

Saeed Barghi is a Sr. Analytics Specialist Solutions Architect specializing in architecting enterprise data platforms. He has extensive experience in the fields of data warehousing, data engineering, data lakes, and AI/ML. Based in Melbourne, Australia, Saeed works with public sector customers in Australia and New Zealand.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 17 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

AWS completes the first cloud audit by the Ingelheim Kreis Initiative Joint Audits group for the pharmaceutical and life sciences sector

2024-01-22 Janice Leung

Post Syndicated from Janice Leung original https://aws.amazon.com/blogs/security/aws-completes-the-first-cloud-audit-by-the-ingelheim-kreis-initiative-joint-audits-group-for-the-pharmaceutical-and-life-sciences-sector/

We’re excited to announce that Amazon Web Services (AWS) has completed the first cloud service provider (CSP) audit by the Ingelheim Kreis (IK) Initiative Joint Audits group. The audit group represents quality and compliance professionals from some of our largest pharmaceutical and life sciences customers who collectively perform audits on their key suppliers.

As customers embrace the scalability and flexibility of AWS, we’re helping them evolve security, identity, and compliance into key business enablers. At AWS, we’re obsessed with earning and maintaining customer trust. We work hard to provide our pharmaceutical and life sciences customers and their regulatory bodies with the assurance that AWS has the necessary controls in place to help protect their most sensitive data and regulated workloads.

Our collaboration with the IK Joint Audits Group to complete the first CSP audit is a good example of how we support your risk management and regulatory efforts. Regulated pharmaceutical and life sciences customers are required by GxP to employ a risk-based approach to design, develop, and maintain computerized systems. GxP is a collection of quality guidelines and regulations that are designed to ensure safe development and manufacturing of medical devices, pharmaceuticals, biologic, and other food and medical products. Currently, no specific certifications for GxP compliance exist for CSPs. Pharmaceutical companies must do their own supplier assessment to determine the adequacy of their development and support processes.

The joint audit thoroughly assessed the AWS controls that are designed to protect your data and material workloads and help satisfy your regulatory requirements. As more pharmaceutical and life sciences companies use cloud technology for their operations, the industry is experiencing greater regulatory oversight. Because the joint audit of independent auditors represented a group of companies, both AWS and our customers were able to streamline common controls and increase transparency, and use audit resources more efficiently to help decrease the organizational burden on both the companies and the supplier (in this case, AWS).

Audit results

The IK audit results provide IK members with assurance regarding the AWS controls environment, enabling members to work to remove compliance blockers, accelerate their adoption of AWS services, and obtain confidence and trust in the security controls of AWS.

As stated by the IK auditors, “…in the course of the Audit it became obvious that AWS within their service development, their data center operation and with their employees acts highly professional with a clear customer focus. The amount of control that AWS implemented and continuously extends exceeds our expectations of a qualified CSP”.

The report is confidential and only available to IK members who signed the NDA with AWS prior to the start of the audit in 2023. Members can access the report and assess their own residual risk. To participate in a future audit cycle, contact [email protected].

To learn more about our commitment to safeguard customer data, see AWS Cloud Security. For more information about the robust controls that are in place at AWS, see the AWS Compliance Program page. By integrating governance-focused, audit-friendly service features with applicable compliance or audit standards, AWS Compliance helps you set up and operate in an AWS control environment. Customers can also access AWS Artifact to download other compliance reports that independent auditors have evaluated.

Pharmacies Giving Patient Records to Police without Warrants

2024-01-11 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/01/pharmacies-giving-patient-records-to-police-without-warrants.html

Add pharmacies to the list of industries that are giving private data to the police without a warrant.

Police Get Medical Records without a Warrant

2023-12-18 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/12/police-get-medical-records-without-a-warrant.html

More unconstrained surveillance:

Lawmakers noted the pharmacies’ policies for releasing medical records in a letter dated Tuesday to the Department of Health and Human Services (HHS) Secretary Xavier Becerra. The letter—signed by Sen. Ron Wyden (D-Ore.), Rep. Pramila Jayapal (D-Wash.), and Rep. Sara Jacobs (D-Calif.)—said their investigation pulled information from briefings with eight big prescription drug suppliers.

They include the seven largest pharmacy chains in the country: CVS Health, Walgreens Boots Alliance, Cigna, Optum Rx, Walmart Stores, Inc., The Kroger Company, and Rite Aid Corporation. The lawmakers also spoke with Amazon Pharmacy.

All eight of the pharmacies said they do not require law enforcement to have a warrant prior to sharing private and sensitive medical records, which can include the prescription drugs a person used or uses and their medical conditions. Instead, all the pharmacies hand over such information with nothing more than a subpoena, which can be issued by government agencies and does not require review or approval by a judge.

Three pharmacies—CVS Health, The Kroger Company, and Rite Aid Corporation—told lawmakers they didn’t even require their pharmacy staff to consult legal professionals before responding to law enforcement requests at pharmacy counters. According to the lawmakers, CVS, Kroger, and Rite Aid said that “their pharmacy staff face extreme pressure to immediately respond to law enforcement demands and, as such, the companies instruct their staff to process those requests in store.”

The rest of the pharmacies—Amazon, Cigna, Optum Rx, Walmart, and Walgreens Boots Alliance—at least require that law enforcement requests be reviewed by legal professionals before pharmacists respond. But, only Amazon said it had a policy of notifying customers of law enforcement demands for pharmacy records unless there were legal prohibitions to doing so, such as a gag order.

AWS Clean Rooms ML helps customers and partners apply ML models without sharing raw data (preview)

2023-11-29 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-clean-rooms-ml-helps-customers-and-partners-apply-ml-models-without-sharing-raw-data-preview/

Today, we’re introducing AWS Clean Rooms ML (preview), a new capability of AWS Clean Rooms that helps you and your partners apply machine learning (ML) models on your collective data without copying or sharing raw data with each other. With this new capability, you can generate predictive insights using ML models while continuing to protect your sensitive data.

During this preview, AWS Clean Rooms ML introduces its ﬁrst model specialized to help companies create lookalike segments for marketing use cases. With AWS Clean Rooms ML lookalike, you can train your own custom model, and you can invite partners to bring a small sample of their records to collaborate and generate an expanded set of similar records while protecting everyone’s underlying data.

In the coming months, AWS Clean Rooms ML will release a healthcare model. This will be the first of many models that AWS Clean Rooms ML will support next year.

AWS Clean Rooms ML helps you to unlock various opportunities for you to generate insights. For example:

Airlines can take signals about loyal customers, collaborate with online booking services, and offer promotions to users with similar characteristics.
Auto lenders and car insurers can identify prospective auto insurance customers who share characteristics with a set of existing lease owners.
Brands and publishers can model lookalike segments of in-market customers and deliver highly relevant advertising experiences.
Research institutions and hospital networks can find candidates similar to existing clinical trial participants to accelerate clinical studies (coming soon).

AWS Clean Rooms ML lookalike modeling helps you apply an AWS managed, ready-to-use model that is trained in each collaboration to generate lookalike datasets in a few clicks, saving months of development work to build, train, tune, and deploy your own model.

How to use AWS Clean Rooms ML to generate predictive insights
Today I will show you how to use lookalike modeling in AWS Clean Rooms ML and assume you have already set up a data collaboration with your partner. If you want to learn how to do that, check out the AWS Clean Rooms Now Generally Available — Collaborate with Your Partners without Sharing Raw Data post.

With your collective data in the AWS Clean Rooms collaboration, you can work with your partners to apply ML lookalike modeling to generate a lookalike segment. It works by taking a small sample of representative records from your data, creating a machine learning (ML) model, then applying the particular model to identify an expanded set of similar records from your business partner’s data.

The following screenshot shows the overall workflow for using AWS Clean Rooms ML.

By using AWS Clean Rooms ML, you don’t need to build complex and time-consuming ML models on your own. AWS Clean Rooms ML trains a custom, private ML model, which saves months of your time while still protecting your data.

Eliminating the need to share data
As ML models are natively built within the service, AWS Clean Rooms ML helps you protect your dataset and customer’s information because you don’t need to share your data to build your ML model.

You can specify the training dataset using the AWS Glue Data Catalog table, which contains user-item interactions.

Under Additional columns to train, you can define numerical and categorical data. This is useful if you need to add more features to your dataset, such as the number of seconds spent watching a video, the topic of an article, or the product category of an e-commerce item.

Applying custom-trained AWS-built models
Once you have defined your training dataset, you can now create a lookalike model. A lookalike model is a machine learning model used to find similar profiles in your partner’s dataset without either party having to share their underlying data with each other.

When creating a lookalike model, you need to specify the training dataset. From a single training dataset, you can create many lookalike models. You also have the flexibility to define the date window in your training dataset using Relative range or Absolute range. This is useful when you have data that is constantly updated within AWS Glue, such as articles read by users.

Easy-to-tune ML models
After you create a lookalike model, you need to configure it to use in AWS Clean Rooms collaboration. AWS Clean Rooms ML provides flexible controls that enable you and your partners to tune the results of the applied ML model to garner predictive insights.

On the Configure lookalike model page, you can choose which Lookalike model you want to use and define the Minimum matching seed size you need. This seed size deﬁnes the minimum number of profiles in your seed data that overlap with profiles in the training data.

You also have the flexibility to choose whether the partner in your collaboration receives metrics in Metrics to share with other members.

With your lookalike models properly configured, you can now make the ML models available for your partners by associating the configured lookalike model with a collaboration.

Creating lookalike segments
Once the lookalike models have been associated, your partners can now start generating insights by selecting Create lookalike segment and choosing the associated lookalike model for your collaboration.

Here on the Create lookalike segment page, your partners need to provide the Seed profiles. Examples of seed profiles include your top customers or all customers who purchased a specific product. The resulting lookalike segment will contain profiles from the training data that are most similar to the profiles from the seed.

Lastly, your partner will get the Relevance metrics as the result of the lookalike segment using the ML models. At this stage, you can use the Score to make a decision.

Export data and use programmatic API
You also have the option to export the lookalike segment data. Once it’s exported, the data is available in JSON format and you can process this output by integrating with AWS Clean Rooms API and your applications.

Join the preview
AWS Clean Rooms ML is now in preview and available via AWS Clean Rooms in US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Seoul, Singapore, Sydney, Tokyo), and Europe (Frankfurt, Ireland, London). Support for additional models is in the works.

Learn how to apply machine learning with your partners without sharing underlying data on the AWS Clean Rooms ML page.

Happy collaborating!
— Donnie

How healthcare organizations can analyze and create insights using price transparency data

2023-10-11 Gokhul Srinivasan

Post Syndicated from Gokhul Srinivasan original https://aws.amazon.com/blogs/big-data/how-healthcare-organizations-can-analyze-and-create-insights-using-price-transparency-data/

In recent years, there has been a growing emphasis on price transparency in the healthcare industry. Under the Transparency in Coverage (TCR) rule, hospitals and payors to publish their pricing data in a machine-readable format. With this move, patients can compare prices between different hospitals and make informed healthcare decisions. For more information, refer to Delivering Consumer-friendly Healthcare Transparency in Coverage On AWS.

The data in the machine-readable files can provide valuable insights to understand the true cost of healthcare services and compare prices and quality across hospitals. The availability of machine-readable files opens up new possibilities for data analytics, allowing organizations to analyze large amounts of pricing data. Using machine learning (ML) and data visualization tools, these datasets can be transformed into actionable insights that can inform decision-making.

In this post, we explain how healthcare organizations can use AWS services to ingest, analyze, and generate insights from the price transparency data created by hospitals. We use sample data from three different hospitals, analyze the data, and create comparative trends and insights from the data.

Solution overview

As part of the Centers for Medicare and Medicaid Services (CMS) mandate, all hospitals now have their machine-readable file containing the pricing data. As hospitals generate this data, they can use their organization data or ingest data from other hospitals to derive analytics and competitive comparison. This comparison can help hospitals do the following:

Derive a price baseline for all medical services and perform gap analysis
Analyze pricing trends and identify services where competitors don’t participate
Evaluate and identify the services where cost difference is above a specific threshold

The size of the machine-readable files from hospitals is smaller than those generated by the payors. This is due to the complexity of the JSON structure, contracts, and the risk evaluation process on the payor side. Due to this low complexity, the solution uses AWS serverless services to ingest the data, transform it, and make it available for analytics. The analysis of the machine-readable files from payors requires advanced computational capabilities due to the complexity and the interrelationship in the JSON file.

Prerequisites

As a prerequisite, evaluate the hospitals for which the pricing analysis will be performed and identify the machine-readable files for analysis. Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Create separate folders for each hospital inside the S3 bucket.

Architecture overview

The architecture uses AWS serverless technology for the implementation. The serverless architecture features auto scaling, high availability, and a pay-as-you-go billing model to increase agility and optimize costs. The architecture approach is split into a data intake layer, a data analysis layer, and a data visualization layer.

The architecture contains three independent stages:

File ingestion – Hospitals negotiate their contract and pricing with the payors one time a year with periodical revisions on a quarterly or monthly basis. The data ingestion process copies the machine-readable files from the hospitals, validates the data, and keeps the validated files available for analysis.
Data analysis – In this stage, the files are transformed using AWS Glue and stored in the AWS Glue Data Catalog. AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, ML, and application development. Then you can use Amazon Athena V3 to query the tables in the Data Catalog.
Data visualization – Amazon QuickSight is a cloud-powered business analytics service that makes it straightforward to build visualizations, perform ad hoc analysis, and quickly get business insights from the pricing data. This stage uses QuickSight to visually analyze the data in the machine-readable file using Athena queries.

File ingestion

The file ingestion process works as defined in the following figure. The architecture uses AWS Lambda, a serverless, event-driven compute service that lets you run code without provisioning or managing servers.

TCR Intake Architecture

The following flow defines the process to ingest and analyze the data:

Copy the machine-readable files from the hospitals into the respective raw data S3 bucket.
The file upload to the S3 bucket triggers an S3 event, which invokes a format Lambda function.
The Lambda function triggers a notification when it identifies issues in the file.
The Lambda function ingests the file, transforms the data, and stores the clean file in a new clean data S3 bucket.

Organizations can create new Lambda functions depending on the difference in the file formats.

Data analysis

The file intake and data analysis processes are independent of each other. Whereas the file intake happens on a scheduled or periodical basis, the data analysis happens regularly based on the business operation needs. The architecture for the data analysis is shown in the following figure.

TCR Data Analysis

This stage uses an AWS Glue crawler, the AWS Glue Data Catalog, and Athena v3 to analyze the data from the machine-readable files.

An AWS Glue crawler scans the clean data in the S3 bucket and creates or updates the tables in the AWS Glue Data Catalog. The crawler can run on demand or on a schedule, and can crawl multiple machine-readable files in a single run.
The Data Catalog now contains references to the machine-readable data. The Data Catalog contains the table definition, which contains metadata about the data in the machine-readable file. The tables are written to a database, which acts as a container.
Use the Data Catalog and transform the hospital price transparency data.
When the data is available in the Data Catalog, you can develop the analytics query using Athena. Athena is a serverless, interactive analytics service that provides a simplified, flexible way to analyze petabytes of data using SQL queries.
Any failure during the process will be captured in the Amazon CloudWatch logs, which can be used for troubleshooting and analysis. The Data Catalog needs to be refreshed only when there is a change in the machine-readable file structure or a new machine-readable file is uploaded to the clean S3 bucket. When the crawler runs periodically, it automatically identifies the changes and updates the Data Catalog.

Data visualization

When the data analysis is complete and queries are developed using Athena, we can visually analyze the results and gain insights using QuickSight. As shown in the following figure, once the data ingestion and data analysis are complete, the queries are built using Athena.

TCR Visualization

In this stage, we use QuickSight to create datasets using the Athena queries, build visualizations, and deploy dashboards for visual analysis and insights.

Create a QuickSight dataset

Complete the following steps to create a QuickSight dataset:

On the QuickSight console, choose Manage data.
On the Datasets page, choose New data set.
In the Create a Data Set page, choose the connection profile icon for the existing Athena data source that you want to use.
Choose Create data set.
On the Choose your table page, choose Use custom SQL and enter the Athena query.

After the dataset is created, you can add visualizations and analyze the data from the machine-readable file. With the QuickSight dashboard, organizations can easily perform price comparisons across different hospitals, identify high-cost services, and find other price outliers. In addition, you can use ML in QuickSight to gain ML-driven insights, detect pricing anomalies, and create forecasts based on historical files.

The following figure shows an illustrative QuickSight dashboard with insights comparing the machine-readable files from three different hospitals. With these visuals, you compare the pricing data across hospitals, create price benchmarks, determine cost-effective hospitals, and identify opportunities for competitive advantage.

Performance, operational, and cost considerations

The solution recommends QuickSight Enterprise for visualization and insights. For QuickSight dashboards, the Athena query results can be stored within the SPICE database for better performance.

The approach uses Athena V3, which offers performance improvements, reliability enhancements, and newer features. Using the Athena query result reuse feature enables caching and query result reuse. When multiple identical queries are run with the query result reuse option, repeat queries run up to five times faster, giving you increased productivity for interactive data analysis. Because you don’t scan the data, you get improved performance at a lower cost.

Cost

Hospitals create the machine-readable files on a monthly basis. This approach uses a serverless architecture that keeps the cost low and takes away the challenge of maintenance overhead. The analysis can begin with the machine-readable files for a few hospitals, and they can add new hospitals as they scale. The following example helps understand the cost for different hospital based on the data size:

A typical hospital with 100 GB storage/month, querying 20 GB data with 2 authors and 5 readers, costs around $2,500/year

AWS offers you a pay-as-you-go approach for pricing for the vast majority of our cloud services. With AWS you pay only for the individual services you need, for as long as you use them, and without requiring long-term contracts or complex licensing.

TCR Monthly cost

Conclusion

This post illustrated how to collect and analyze hospital-created price transparency data and generate insights using AWS services. This type of analysis and the visualizations provide the framework to analyze the machine-readable files. Hospitals, payors, brokers, underwriters, and other healthcare stakeholders can use this architecture to analyze and draw insights from pricing data published by hospitals of their choice. Our AWS teams can assist you to identify the correct strategy by offering thought leadership and prescriptive technical support for price transparency analysis.

Contact your AWS account team for more help on design and to explore private pricing. If you don’t have a contact with AWS yet, please reach out to be connected with an AWS representative.

About the Authors

Gokhul Srinivasan is a Senior Partner Solutions Architect leading AWS Healthcare and Life Sciences (HCLS) Global Startup Partners. Gokhul has over 19 years of Healthcare experience helping organizations with digital transformation, platform modernization, and deliver business outcomes.

Laks Sundararajan is a seasoned Enterprise Architect helping companies reset, transform and modernize their IT, digital, cloud, data and insight strategies. A proven leader with significant expertise around Generative AI, Digital, Cloud and Data/Analytics Transformation, Laks is a Sr. Solutions Architect with Healthcare and Life Sciences (HCLS).

Anil Chinnam is a Solutions Architect in the Digital Native Business Segment at Amazon Web Services(AWS). He enjoys working with customers to understand their challenges and solve them by creating innovative solutions using AWS services. Outside of work, Anil enjoys being a father, swimming and traveling.

Improving medical imaging workflows with AWS HealthImaging and SageMaker

2023-07-31 Sukhomoy Basak

Post Syndicated from Sukhomoy Basak original https://aws.amazon.com/blogs/architecture/improving-medical-imaging-workflows-with-aws-healthimaging-and-sagemaker/

Medical imaging plays a critical role in patient diagnosis and treatment planning in healthcare. However, healthcare providers face several challenges when it comes to managing, storing, and analyzing medical images. The process can be time-consuming, error-prone, and costly.

There’s also a radiologist shortage across regions and healthcare systems, making the demand for this specialty increases due to an aging population, advances in imaging technology, and the growing importance of diagnostic imaging in healthcare.

As the demand for imaging studies continues to rise, the limited number of available radiologists results in delays in available appointments and timely diagnoses. And while technology enables healthcare delivery improvements for clinicians and patients, hospitals seek additional tools to solve their most pressing challenges, including:

Professional burnout due to an increasing demand for imaging and diagnostic services
Labor-intensive tasks, such as volume measurement or structural segmentation of images
Increasing expectations from patients expecting high-quality healthcare experiences that match retail and technology in terms of convenience, ease, and personalization

To improve clinician and patient experiences, run your picture archiving and communication system (PACS) with an artificial intelligence (AI)-enabled diagnostic imaging cloud solution to securely gain critical insights and improve access to care.

AI helps reduce the radiologist burndown rate through automation. For example, AI saves radiologist chest x-ray interpretation time. It is also a powerful tool to identify areas that need closer inspection, and helps capture secondary findings that weren’t initially identified. The advancement of interoperability and analytics gives radiologist a 360-degree, longitudinal view of patient health records to provide better healthcare at potentially lower costs.

AWS offers services to address these challenges. This blog post discusses AWS HealthImaging (AWS AHI) and Amazon SageMaker, and how they are used together to improve healthcare providers’ medical imaging workflows. This ultimately accelerates imaging diagnostics and increases radiology productivity. AWS AHI enables developers to deliver performance, security, and scale to cloud-native medical imaging applications. It allows ingestion of Digital Imaging and Communication in Medicine (DICOM) images. Amazon SageMaker provides end-to-end solution for AI and machine learning.

Let’s explore an example use case involving X-rays after an auto accident. In this diagnostic medical imaging workflow, a patient is in the emergency room. From there:

The patient undergoes an X-ray to check for fractures.
The scanned acquisition device images flow to the PACS system.
The radiologist reviews the information gathered from this procedure and authors the report.
The patient workflow continues as the reports are made available to the referring physician.

Next-generation imaging solutions and workflows

Healthcare providers can use AWS AHI and Amazon SageMaker together to enable next-generation imaging solutions and improve medical imaging workflows. The following architecture illustrates this example.

Figure 1: X-ray images are sent to AWS HealthImaging and an Amazon SageMaker endpoint extracts insights.

Let’s review the architecture and the key components:

1. Imaging Scanner: Captures the images from a patient’s body. Depending on the modality, this can be an X-ray detector; a series of detectors in a CT scanner; a magnetic field and radio frequency coils in an MRI scanner; or an ultrasound transducer. This example uses an X-ray device.

AWS IoT Greengrass: Edge runtime and cloud service configured with DICOM C-Store SCP that receives the images and sends it to Amazon Simple Storage Service (Amazon S3). The images along with the related metadata are sent to Amazon S3 and Amazon Simple Queue Service (Amazon SQS) respectively, that triggers the workflow.

2. Amazon SQS message queue: Consumes event from S3 bucket and triggers an AWS Step Functions workflow orchestration.

3. AWS Step Functions runs the transform and import jobs to further process and import the images into AWS AHI data store instance.

4. The final diagnostic image—along with any relevant patient information and metadata—is stored in the AWS AHI datastore. This allows for efficient imaging date retrieval and management. It also enables medical imaging data access with sub-second image retrieval latencies at scale, powered by cloud-native APIs and applications from AWS partners.

5. Radiologists responsible for ground truth for ML images perform medical image annotations using Amazon SageMaker Ground Truth. They visualize and label DICOM images using a custom data labeling workflow—a fully managed data labeling service that supports built-in or custom data labeling workflows. They also leverage tools like 3D Slicer for interactive medical image annotations.

6. Data scientists build or leverage built-in deep learning models using the annotated images on Amazon SageMaker. SageMaker offers a range of deployment options that vary from low latency and high throughput to long-running inference jobs. These options include considerations for batch, real-time, or near real-time inference.

7. Healthcare providers use AWS AHI and Amazon SageMaker to run AI-assisted detection and interpretation workflow. This workflow is used to identify hard-to-see fractures, dislocations, or soft tissue injuries to allow surgeons and radiologist to be more confident in their treatment choices.

8. Finally, the image stored in AWS AHI is displayed on a monitor or other visual output device where it can be analyzed and interpreted by a radiologist or other medical professional.

The Open Health Imaging Foundation (OHIF) Viewer is an open source, web-based, medical imaging platform. It provides a core framework for building complex imaging applications.
Radical Imaging or Arterys are AWS partners that provide OHIF-based medical imaging viewer.

Each of these components plays a critical role in the overall performance and accuracy of the medical imaging system as well as ongoing research and development focused on improving diagnostic outcomes and patient care. AWS AHI uses efficient metadata encoding, lossless compression, and progressive resolution data access to provide industry leading performance for loading images. Efficient metadata encoding enables image viewers and AI algorithms to understand the contents of a DICOM study without having to load the image data.

Security

The AWS shared responsibility model applies to data protection in AWS AHI and Amazon SageMaker.

Amazon SageMaker is HIPAA-eligible and can operate with data containing Protected Health Information (PHI). Encryption of data in transit is provided by SSL/TLS and is used when communicating both with the front-end interface of Amazon SageMaker (to the Notebook) and whenever Amazon SageMaker interacts with any other AWS services.

AWS AHI is also HIPAA-eligible service and provides access control at the metadata level, ensuring that each user and application can only see the images and metadata fields that are required based upon their role. This prevents the proliferation of Patient PHI. All access to AWS AHI APIs is logged in detail in AWS CloudTrail.

Both of these services leverage AWS Key Management service (AWS KMS) to satisfy the requirement that PHI data is encrypted at rest.

Conclusion

In this post, we reviewed a common use case for early detection and treatment of conditions, resulting in better patient outcomes. We also covered an architecture that can transform the radiology field by leveraging the power of technology to improve accuracy, efficiency, and accessibility of medical imaging.

Process price transparency data using AWS Glue

2023-05-04 Hari Thatavarthy

Post Syndicated from Hari Thatavarthy original https://aws.amazon.com/blogs/big-data/process-price-transparency-data-using-aws-glue/

The Transparency in Coverage rule is a federal regulation in the United States that was finalized by the Center for Medicare and Medicaid Services (CMS) in October 2020. The rule requires health insurers to provide clear and concise information to consumers about their health plan benefits, including costs and coverage details. Under the rule, health insurers must make available to their members a list of negotiated rates for in-network providers, as well as an estimate of the member’s out-of-pocket costs for specific health care services. This information must be made available to members through an online tool that is accessible and easy to use. The Transparency in Coverage rule also requires insurers to make available data files that contain detailed information on the prices they negotiate with health care providers. This information can be used by employers, researchers, and others to compare prices across different insurers and health care providers. Phase 1 implementation of this regulation, which went into effect on July 1, 2022, requires that payors publish machine-readable files publicly for each plan that they offer. CMS (Center for Medicare and Medicaid Services) has published a technical implementation guide with file formats, file structure, and standards on producing these machine-readable files.

This post walks you through the preprocessing and processing steps required to prepare data published by health insurers in light of this federal regulation using AWS Glue. We also show how to query and derive insights using Amazon Athena.

AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data.

Challenges processing these machine-readable files

The machine-readable files published by these payors vary in size. A single file can range from a few megabytes to hundreds of gigabytes. These files contain large JSON objects that are deeply nested. Unlike NDJSON and JSONL formats, where each line in the file is a JSON object, these files contain a single large JSON object that can span across multiple lines. The following figure represents the schema of an in_network rate file published by a major health insurer on their website for public access. This file, when uncompressed, is about 20 GB in size, contains a single JSON object, and is deeply nested. The following figure represents the schema of this JSON object when printed using the Spark printSchema() function. Each highlighted box in red is a nested array structure.

JSON Schema

Loading a 20 GB deeply nested JSON object requires a machine with a large memory footprint. Data when loaded into memory is 4–10 times its size on disk. A 20 GB JSON object may need a machine with up to 200 GB memory. To process workloads larger than 20 GB, these machines need to be scaled vertically, thereby significantly increasing hardware costs. Vertical scaling has its limits, and it’s not possible to scale beyond a certain point. Analyzing this data requires unnesting and flattening of deeply nested array structures. These transformations explode the data at an exponential rate, thereby adding to the need for more memory and disk space.

You can use an in-memory distributed processing framework such as Apache Spark to process and analyze such large volumes of data. However, to load this single large JSON object as a Spark DataFrame and perform an action on it, a worker node needs enough memory to load this object in full. When a worker node tries to load this large deeply nested JSON object and there isn’t enough memory to load it in full, the processing job will fail with out-of-memory issues. This calls for splitting the large JSON object into smaller chunks using some form of preprocessing logic. Once preprocessed, these smaller files can then be further processed in parallel by worker nodes without running into out-of-memory issues.

Solution overview

The solution involves a two-step approach. The first is a preprocessing step, which takes the large JSON object as input and splits it to multiple manageable chunks. This is required to address the challenges we mentioned earlier. The second is a processing step, which prepares and publishes data for analysis.

The preprocessing step uses an AWS Glue Python shell job to split the large JSON object into smaller JSON files. The processing step unnests and flattens the array items from these smaller JSON files in parallel. It then partitions and writes the output as Parquet on Amazon Simple Storage Service (Amazon S3). The partitioned data is cataloged and analyzed using Athena. The following diagram illustrates this workflow.

Solution Overview

Prerequisites

To implement the solution in your own AWS account, you need to create or configure the following AWS resources in advance:

An S3 bucket to persist the source and processed data. Download the input file and upload it to the path s3://yourbucket/ptd/2023-03-01_United-HealthCare-Services—Inc-_Third-Party-Administrator_PS1-50_C2_in-network-rates.json.gz.
An AWS Identity and Access Management (IAM) role for your AWS Glue extract, transform, and load (ETL) job. For instructions, refer to Setting up IAM permissions for AWS Glue. Adjust the permissions to ensure AWS Glue has read/write access to Amazon S3 locations.
An IAM role for Athena with AWS Glue Data Catalog permissions to create and query tables.

Create an AWS Glue preprocessing job

The preprocessing step uses ijson, an open-source iterative JSON parser to extract items in the outermost array of top-level attributes. By streaming and iteratively parsing the large JSON file, the preprocessing step loads only a portion of the file into memory, thereby avoiding out-of-memory issues. It also uses s3pathlib, an open-source Python interface to Amazon S3. This makes it easy to work with S3 file systems.

To create and run the AWS Glue job for preprocessing, complete the following steps:

On the AWS Glue console, choose Jobs under Glue Studio in the navigation pane.
Create a new job.
Select Python shell script editor.
Select Create a new script with boilerplate code.
Enter the following code into the editor (adjust the S3 bucket names and paths to point to the input and output locations in Amazon S3):

import ijson
import json
import decimal
from s3pathlib import S3Path
from s3pathlib import context
import boto3
from io import StringIO

class JSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, decimal.Decimal):
            return float(obj)
        return json.JSONEncoder.default(self, obj)
        
def upload_to_s3(data, upload_path):
    data = bytes(StringIO(json.dumps(data,cls=JSONEncoder)).getvalue(),encoding='utf-8')
    s3_client.put_object(Body=data, Bucket=bucket, Key=upload_path)
        
s3_client = boto3.client('s3')

#Replace with your bucket and path to JSON object on your bucket
bucket = 'yourbucket'
largefile_key = 'ptd/2023-03-01_United-HealthCare-Services--Inc-_Third-Party-Administrator_PS1-50_C2_in-network-rates.json.gz'
p = S3Path(bucket, largefile_key)


#Replace the paths to suit your needs
upload_path_base = 'ptd/preprocessed/base/base.json'
upload_path_in_network = 'ptd/preprocessed/in_network/'
upload_path_provider_references = 'ptd/preprocessed/provider_references/'

#Extract top the values of the following top level attributes and persist them on your S3 bucket
# -- reporting_entity_name
# -- reporting_entity_type
# -- last_updated_on
# -- version

base ={
    'reporting_entity_name' : '',
    'reporting_entity_type' : '',
    'last_updated_on' :'',
    'version' : ''
}

with p.open("r") as f:
    obj = ijson.items(f, 'reporting_entity_name')
    for evt in obj:
        base['reporting_entity_name'] = evt
        break
    
with p.open("r") as f:
    obj = ijson.items(f, 'reporting_entity_type')
    for evt in obj:
        base['reporting_entity_type'] = evt
        break
    
with p.open("r") as f:
    obj = ijson.items(f, 'last_updated_on')
    for evt in obj:
        base['last_updated_on'] = evt
        break
    
with p.open("r") as f:
    obj = ijson.items(f,'version')
    for evt in obj:
        base['version'] = evt
        break
        
upload_to_s3(base,upload_path_base)

#Seek the position of JSON key provider_references 
#Iterate through items in provider_references array, and for every 1000 items create a JSON file on S3 bucket
with p.open("r") as f:
    provider_references = ijson.items(f, 'provider_references.item')
    fk = 0
    lst = []
    for rowcnt,row in enumerate(provider_references):
        if rowcnt % 1000 == 0:
            if fk > 0:
                dest = upload_path_provider_references + path
                upload_to_s3(lst,dest)
            lst = []
            path = 'provider_references_{0}.json'.format(fk)
            fk = fk + 1

        lst.append(row)

    path = 'provider_references_{0}.json'.format(fk)
    dest = upload_path_provider_references + path
    upload_to_s3(lst,dest)
    
#Seek the position of JSON key in_network
#Iterate through items in in_network array, and for every 25 items create a JSON file on S3 bucket
with p.open("r") as f:
    in_network = ijson.items(f, 'in_network.item')
    fk = 0
    lst = []
    for rowcnt,row in enumerate(in_network):
        if rowcnt % 25 == 0:
            if fk > 0:
                dest = upload_path_in_network + path
                upload_to_s3(lst,dest)
            lst = []
            path = 'in_network_{0}.json'.format(fk)
            fk = fk + 1

        lst.append(row)


    path = 'in_network_{0}.json'.format(fk)
    dest = upload_path_in_network + path
    upload_to_s3(lst,dest)

Update the properties of your job on the Job details tab:
1. For Type, choose Python Shell.
2. For Python version, choose Python 3.9.
3. For Data processing units, choose 1 DPU.

For Python shell jobs, you can allocate either 0.0625 or 1 DPU. The default is 0.0625 DPU. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.

python shell job config

The Python libraries ijson and s3pathlib are available in pip and can be installed using the AWS Glue job parameter --additional-python-modules. You can also choose to package these libraries, upload them to Amazon S3, and refer to them from your AWS Glue job. For instructions on packaging your library, refer to Providing your own Python library.

To install the Python libraries, set the following job parameters:
- Key – --additional-python-modules
- Value – ijson,s3pathlib
Run the job.

The preprocessing step creates three folders in the S3 bucket: base, in_network and provider_references.

s3_folder_1

Files in in_network and provider_references folders contains array of JSON objects. Each of these JSON objects represents an element in the outermost array of the original large JSON object.

s3_folder_2

Create an AWS Glue processing job

The processing job uses the output of the preprocessing step to create a denormalized view of data by extracting and flattening elements and attributes from nested arrays. The extent of unnesting depends on the attributes we need for analysis. For example, attributes such as negotiated_rate, npi, and billing_code are essential for analysis and extracting values associated with these attributes requires multiple levels of unnesting. The denormalized data is then partitioned by the billing_code column, persisted as Parquet on Amazon S3, and registered as a table on the AWS Glue Data Catalog for querying.

The following code sample guides you through the implementation using PySpark. The columns used to partition the data depends on query patterns used to analyze the data. Arriving at a partitioning strategy that is in line with the query patterns will improve overall query performance during analysis. This post assumes that the queries used for analyzing data will always use the column billing_code to filter and fetch data of interest. Data in each partition is bucketed by npi to improve query performance.

To create your AWS Glue job, complete the following steps:

On the AWS Glue console, choose Jobs under Glue Studio in the navigation pane.
Create a new job.
Select Spark script editor.
Select Create a new script with boilerplate code.
Enter the following code into the editor (adjust the S3 bucket names and paths to point to the input and output locations in Amazon S3):

import sys
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
from pyspark.sql.functions import explode

#create a dataframe of base objects - reporting_entity_name, reporting_entity_type, version, last_updated_on
#using the output of preprocessing step

base_df = spark.read.json('s3://yourbucket/ptd/preprocessed/base/')

#create a dataframe over provider_references objects using the output of preprocessing step
prvd_df = spark.read.json('s3://yourbucket/ptd/preprocessed/provider_references/')

#cross join dataframe of base objects with dataframe of provider_references 
prvd_df = prvd_df.crossJoin(base_df)

#create a dataframe over in_network objects using the output of preprocessing step
in_ntwrk_df = spark.read.json('s3://yourbucket/ptd/preprocessed/in_network/')

#unnest and flatten negotiated_rates and provider_references from in_network objects
in_ntwrk_df2 = in_ntwrk_df.select(
 in_ntwrk_df.billing_code, in_ntwrk_df.billing_code_type, in_ntwrk_df.billing_code_type_version,
 in_ntwrk_df.covered_services, in_ntwrk_df.description, in_ntwrk_df.name,
 explode(in_ntwrk_df.negotiated_rates).alias('exploded_negotiated_rates'),
 in_ntwrk_df.negotiation_arrangement)


in_ntwrk_df3 = in_ntwrk_df2.select(
 in_ntwrk_df2.billing_code, in_ntwrk_df2.billing_code_type, in_ntwrk_df2.billing_code_type_version,
 in_ntwrk_df2.covered_services, in_ntwrk_df2.description, in_ntwrk_df2.name,
 in_ntwrk_df2.exploded_negotiated_rates.negotiated_prices.alias(
 'exploded_negotiated_rates_negotiated_prices'),
 explode(in_ntwrk_df2.exploded_negotiated_rates.provider_references).alias(
 'exploded_negotiated_rates_provider_references'),
 in_ntwrk_df2.negotiation_arrangement)

#join the exploded in_network dataframe with provider_references dataframe
jdf = prvd_df.join(
 in_ntwrk_df3,
 prvd_df.provider_group_id == in_ntwrk_df3.exploded_negotiated_rates_provider_references,"fullouter")

#un-nest and flatten attributes from rest of the nested arrays.
jdf2 = jdf.select(
 jdf.reporting_entity_name,jdf.reporting_entity_type,jdf.last_updated_on,jdf.version,
 jdf.provider_group_id, jdf.provider_groups, jdf.billing_code,
 jdf.billing_code_type, jdf.billing_code_type_version, jdf.covered_services,
 jdf.description, jdf.name,
 explode(jdf.exploded_negotiated_rates_negotiated_prices).alias(
 'exploded_negotiated_rates_negotiated_prices'),
 jdf.exploded_negotiated_rates_provider_references,
 jdf.negotiation_arrangement)

jdf3 = jdf2.select(
 jdf2.reporting_entity_name,jdf2.reporting_entity_type,jdf2.last_updated_on,jdf2.version,
 jdf2.provider_group_id,
 explode(jdf2.provider_groups).alias('exploded_provider_groups'),
 jdf2.billing_code, jdf2.billing_code_type, jdf2.billing_code_type_version,
 jdf2.covered_services, jdf2.description, jdf2.name,
 jdf2.exploded_negotiated_rates_negotiated_prices.additional_information.
 alias('additional_information'),
 jdf2.exploded_negotiated_rates_negotiated_prices.billing_class.alias(
 'billing_class'),
 jdf2.exploded_negotiated_rates_negotiated_prices.billing_code_modifier.
 alias('billing_code_modifier'),
 jdf2.exploded_negotiated_rates_negotiated_prices.expiration_date.alias(
 'expiration_date'),
 jdf2.exploded_negotiated_rates_negotiated_prices.negotiated_rate.alias(
 'negotiated_rate'),
 jdf2.exploded_negotiated_rates_negotiated_prices.negotiated_type.alias(
 'negotiated_type'),
 jdf2.exploded_negotiated_rates_negotiated_prices.service_code.alias(
 'service_code'), jdf2.exploded_negotiated_rates_provider_references,
 jdf2.negotiation_arrangement)

jdf4 = jdf3.select(jdf3.reporting_entity_name,jdf3.reporting_entity_type,jdf3.last_updated_on,jdf3.version,
 jdf3.provider_group_id,
 explode(jdf3.exploded_provider_groups.npi).alias('npi'),
 jdf3.exploded_provider_groups.tin.type.alias('tin_type'),
 jdf3.exploded_provider_groups.tin.value.alias('tin'),
 jdf3.billing_code, jdf3.billing_code_type,
 jdf3.billing_code_type_version, jdf3.covered_services,
 jdf3.description, jdf3.name, jdf3.additional_information,
 jdf3.billing_class, jdf3.billing_code_modifier,
 jdf3.expiration_date, jdf3.negotiated_rate,
 jdf3.negotiated_type, jdf3.service_code,
 jdf3.negotiation_arrangement)

#repartition by billing_code. 
#Repartition changes the distribution of data on spark cluster. 
#By repartition data we will avoid writing too many small files.
jdf5=jdf4.repartition("billing_code")
 
datasink_path = "s3://yourbucket/ptd/processed/billing_code_npi/parquet/"

#persist dataframe as parquet on S3 and catalog it
#Partition the data by billing_code. This enables analytical queries to skip data and improve performance of queries
#Data is also bucketed and sorted npi to improve query performance during analysis

jdf5.write.format('parquet').mode("overwrite").partitionBy('billing_code').bucketBy(2, 'npi').sortBy('npi').saveAsTable('ptdtable', path = datasink_path)

Update the properties of your job on the Job details tab:
1. For Type, choose Spark.
2. For Glue version, choose Glue 4.0.
3. For Language, choose Python 3.
4. For Worker type, choose G 2X.
5. For Requested number of workers, enter 20.

Arriving at the number of workers and worker type to use for your processing job depends on factors such as the amount of data being processed, the speed at which it needs to be processed, and the partitioning strategy used. Repartitioning of data can result in out-of-memory issues, especially when data is heavily skewed on the column used to repartition. It’s possible to reach Amazon S3 service limits if too many workers are assigned to the job. This is because tasks running on these worker nodes may try to read/write from the same S3 prefix, causing Amazon S3 to throttle the incoming requests. For more details, refer to Best practices design patterns: optimizing Amazon S3 performance.

processing job config

Exploding array elements creates new rows and columns, thereby exponentially increasing the amount of data that needs to be processed. Apache Spark splits this data into multiple Spark partitions on different worker nodes so that it can process large amounts of data in parallel. In Apache Spark, shuffling happens when data needs to be redistributed across the cluster. Shuffle operations are commonly triggered by wide transformations such as join, reduceByKey, groupByKey, and repartition. In case of exceptions due to local storage limitations, it helps to supplement or replace local disk storage capacity with Amazon S3 for large shuffle operations. This is possible with the AWS Glue Spark shuffle plugin with Amazon S3. With the cloud shuffle storage plugin for Apache Spark, you can avoid disk space-related failures.

To use the Spark shuffle plugin, set the following job parameters:
- Key – --write-shuffle-files-to-s3
- Value – true

Query the data

You can query the cataloged data using Athena. For instructions on setting up Athena, refer to Setting up.

On the Athena console, choose Query editor in the navigation pane to run your query, and specify your data source and database.

sql query

To find the minimum, maximum, and average negotiated rates for procedure codes, run the following query:

SELECT
billing_code,
round(min(negotiated_rate),2) as min_price,
round(avg(negotiated_rate),2) as avg_price,
round(max(negotiated_rate),2) as max_price,
description
FROM "default"."ptdtable"
group by billing_code, description
limit 10;

The following screenshot shows the query results.

sql query results

Clean up

To avoid incurring future charges, delete the AWS resources you created:

Delete the S3 objects and bucket.
Delete the IAM policies and roles.
Delete the AWS Glue jobs for preprocessing and processing.

Conclusion

This post guided you through the necessary preprocessing and processing steps to query and analyze price transparency-related machine-readable files. Although it’s possible to use other AWS services to process such data, this post focused on preparing and publishing data using AWS Glue.

To learn more about the Transparency in Coverage rule, refer to Transparency in Coverage. For best practices for scaling Apache Spark jobs and partitioning data with AWS Glue, refer to Best practices to scale Apache Spark jobs and partition data with AWS Glue. To learn how to monitor AWS Glue jobs, refer to Monitoring AWS Glue Spark jobs.

We look forward to hearing any feedback or questions.

About the Authors

Hari Thatavarthy is a Senior Solutions Architect on the AWS Data Lab team. He helps customers design and build solutions in the data and analytics space. He believes in data democratization and loves to solve complex data processing-related problems. In his spare time, he loves to play table tennis.

Krishna Maddileti is a Senior Solutions Architect on the AWS Data Lab team. He partners with customers on their AWS journey and helps them with data engineering, data lakes, and analytics. In his spare time, he enjoys spending time with his family and playing video games with his 7-year-old.

Yadukishore Tatavarthi is a Senior Partner Solutions Architect at AWS. He works closely with global system integrator partners to enable and support customers moving their workloads to AWS.

Manish Kola is a Solutions Architect on the AWS Data Lab team. He partners with customers on their AWS journey.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Cloud Security Strategies for Healthcare

2023-03-14 Marla Rosner

Post Syndicated from Marla Rosner original https://blog.rapid7.com/2023/03/14/cloud-security-strategies-for-healthcare/

How to Stay Secure in the Cloud While Driving Innovation and Discovery

Cloud Security Strategies for Healthcare

The healthcare industry is undergoing a transformational shift. Health organizations are traditionally entrenched in an on-prem way of life, but the past three years have plunged them into a digital revolution. A heightened demand for improved healthcare services—like distributed care and telehealth—ignited a major push for health orgs to move to the cloud, and as a result, implement new cloud security strategies.

But the processes and tools that worked well to secure healthcare organizations’ traditional data centers do not directly translate to the public cloud. Resource and budget strain, priority negotiation with leadership, and challenges with regulatory compliance only exacerbate a daunting digital maturity gap. These challenges are why many healthcare organizations have approached public cloud adoption tentatively.

The healthcare industry must innovate in the cloud to meet patient and business needs, but they need to do so without creating unnecessary or unmanaged risk. Most importantly, they must move to and adopt cloud solutions securely to protect patients in a new world of digital threats.

Major Challenges

Modern technologies bring modern challenges. Here are the main obstacles healthcare organizations face when it comes to securing the cloud.

Resource Strain

Like most industries, healthcare organizations face major obstacles when finding qualified security talent. That means hospitals, clinics, and other healthcare businesses must compete with tech giants, startups, and other more traditionally cybersecurity-savvy companies for the best and brightest minds on the market.

What’s more challenging is that the typical day of a security professional in healthcare tends to be disproportionately focused on time-consuming and often monotonous tasks. These duties are often related to maintaining and reporting on compliance with a sea of regulatory standards and requirements. Carrying out these repetitive but necessary tasks can quickly lead to burnout—and, as a result, turnover.

Moreover, those security professionals who do end up working within healthcare organizations can quickly find themselves inundated with more work than any one person is capable of handling. Small teams are tasked with securing massive amounts of sensitive data—both on-prem and as it moves into the cloud. And sometimes, cybersecurity departments at healthcare orgs can be as small as a CISO and a few analysts.

Those challenges with resource strain can lead to worse problems for security teams, including:

Burnout and rapid turnover, as discussed above
Slow MTTR, exacerbating the impact of breaches
Shadow IT, letting vulnerable assets fall through the cracks

Balancing Priorities With Leadership

It’s up to cybersecurity professionals to connect the dots for leadership on how investing in cloud security leads to greater ROI and less risk. Decision-makers in the healthcare industry are already juggling a great deal—and those concerns can be, quite literally, a matter of life or death.

In the modern threat landscape, poor cybersecurity health also has the potential to mean life or death. As medical science tools become more sophisticated, they’re also becoming more digitally connected. That means malicious actors who manage to infiltrate and shut down servers could also possibly shut down life-saving technology.

Tech professionals must illustrate to stakeholders how cybersecurity risk is interconnected with business risk and—perhaps most importantly—patient risk. To do that, they must regularly engage with and educate leadership to effectively balance priorities.

Achieving that perfect balance includes meeting leadership where they’re at. In healthcare, what is typically the biggest security concern for leaders? The answer: Meeting the necessary compliance standards with every new technology investment.

HIPAA Compliance and Protected Health Information

For stakeholders, achieving, maintaining, and substantiating legal and regulatory compliance is of critical importance. When it comes to the healthcare industry, one compliance standard often reigns supreme over all business decisions: HIPAA.

HIPAA provides data privacy and security provisions for safeguarding Protected Health Information (PHI). It addresses the use and disclosure of individuals’ health information and requires that sensitive information be governed by strict data security and confidentiality. It also obligates organizations to provide PHI to patients upon request.

When migrating to the cloud, healthcare organizations need a centralized approach to protecting sensitive data. InsightCloudSec allows you to automate compliance with HIPAA. Through our HIPAA Compliance Pack, InsightCloudSec provides dozens of out-of-the-box checks that map back to specific HIPAA requirements. For example, InsightCloudSec’s “Snapshot With PHI Unencrypted” policy supports compliance with HIPAA §164.312(a)(2)(iv), Encryption Controls.

Experience Gap

An evolving threat landscape and growing attack surface are challenging enough to deal with for even the most experienced security professionals. Add the health industry’s talent gap into the mix, and those challenges are multiplied.

Cloud security in the healthcare space is still relatively new. That means internal cybersecurity teams are not only playing a relentless game of catch-up—they also might consist of more traditional network engineers and IT pros who have historically been tasked with securing on-premises environments.

This makes it critical that the cloud security solutions healthcare industries implement be user-friendly, low-maintenance, and ultra-reliable.

Cloud Security Solutions and Services

As health organizations dive into work in the cloud, their digital footprints will likely grow far faster than their teams can keep up with. Visibility into these cloud environments is essential to an organization’s ability to identify, assess, prioritize, and remediate risk. Without a clear picture of what they have and where they have it, companies can be vulnerable to malicious attacks.

To avoid biting off more than they can chew, security professionals in healthcare must leverage cloud security strategies and solutions that grant them complete real-time visibility in the cloud over all their most sensitive assets. Enterprise cloud security tools like InsightCloudSec can enable automated discovery and inventory assessment. That unlocks visibility across all their CSPs and containers.

InsightCloudSec also makes it easier for teams, regardless of their cloud security expertise, to effectively define, implement, and enforce security guardrails. With resource normalization, InsightCloudSec removes the need for security teams to learn and keep track of an ever-expanding list of cloud resources and services. Security teams can make use of InsightCloudSec’s native, no-code automation to enable hands-off enforcement of their organization’s security practices and policies when a non-compliant resource is created or a risk configuration change is made.

The fact of the matter is that many healthcare security teams will need to build their cybersecurity programs from the ground up. With limited resources, strained budgets, and patients’ lives on the line, they can’t afford to make big mistakes. That’s why, for many organizations, partnering with a managed service provider is the right approach.

Rapid7’s managed services relieve security teams from the strain of running and building cloud security frameworks. They can also help healthcare security pros better connect lack of investment with risks to stakeholders—acting as an external set of experts.

The Bottom Line

Staying continuously secure in the cloud can be daunting, particularly for those responsible for not only sensitive medical, patient, and research data, but also the digitally connected machines and tools that ensure top-of-the-line patient care. Protecting the health of patients is paramount in the healthcare industry.

With the right tools (and teams) to support continuous security and compliance, this responsibility becomes manageable—and even, dare we say, easy.

InsightCloudSec

A complete cloud security toolbox in a single solution.

LEARN MORE

Uptick in healthcare organizations experiencing targeted DDoS attacks

2023-02-02 Cat Allen

Post Syndicated from Cat Allen original https://blog.cloudflare.com/uptick-in-healthcare-organizations-experiencing-targeted-ddos-attacks/

Healthcare in the crosshairs

Uptick in healthcare organizations experiencing targeted DDoS attacks

Over the past few days, Cloudflare, as well as other sources, have observed healthcare organizations targeted by a pro-Russian hacktivist group claiming to be Killnet. There has been an increase in the amount of healthcare organizations coming to us to help get out from under these types of attacks. Multiple healthcare organizations behind Cloudflare have also been targeted by HTTP DDoS attacks and Cloudflare has helped them successfully mitigate these attacks. The United States Department of Health and Human Services issued an Analyst Note detailing the threat of Killnet-related cyberattacks to the healthcare industry.

A rise in political tensions and escalation of the conflict in Ukraine are all factors that play into the current cybersecurity threat landscape. Unlike traditional warfare, the Internet has enabled and empowered groups of individuals to carry out targeted attacks regardless of their location or involvement. Distributed-denial-of-Service (DDoS) attacks have the unfortunate advantage of not requiring an intrusion or a foothold to be launched and have, unfortunately, become more accessible than ever before.

The attacks observed by the Cloudflare global network do not show a clear indication that they are originating from a single botnet and the attack methods and sources seem to vary. This could indicate the involvement of multiple threat actors acting on behalf of Killnet or it could indicate a more sophisticated, coordinated attack.

Cloudflare application services customers are protected against the attacks. Cloudflare systems have been automatically detecting and mitigating the attacks on behalf of our customers. Our team continues to monitor the situation closely and is prepared to deploy countermeasures, if needed.

As an extra precaution, customers in the Healthcare industry are advised to follow the mitigation recommendations in the “How to Prepare” section below.

Who is Killnet?

Killnet is a group of pro-Russian individuals that gather and communicate on a Telegram channel. The channel provides a space for pro-Russian sympathizers to volunteer their expertise by participating in cyberattacks against Western interests. Previously, in the fourth quarter of 2022, Killnet called to attack US airport websites.

Why DDoS attacks?

DDoS attacks, unlike ransomware, do not require an intrusion or foothold in the target network to be launched. Much like how physical addresses are publicly available via directories or for services like mail delivery, IP addresses and domain names are also publicly available. Unfortunately, this means that every domain name (layer 7) and every network that connects to the Internet (layers 3 & 4) must proactively prepare to defend against DDoS attacks. DDoS attacks are not new threats, but they have become larger, more sophisticated, and more frequent in recent years.

How to prepare

While Cloudflare’s systems have been automatically detecting and mitigating these DDoS attacks, we recommend additional precautionary measures to improve your security posture:

Ensure all other DDoS Managed Rules are set to default settings (High sensitivity level and mitigation actions) for optimal DDoS activation
Cloudflare Enterprise customers with Advanced DDoS should consider enabling Adaptive DDoS Protection, which mitigates traffic that deviates based on your traffic profiles
Deploy firewall rules and rate-limiting rules to enforce a combined positive and negative security model. Reduce the traffic allowed to your website based on your known usage.
Ensure your origin is not exposed to the public Internet (i.e. only enable access to Cloudflare IP addresses)
Customers with access to Managed IP Lists should consider leveraging those lists in firewall rules
Enable caching as much as possible to reduce the strain on your origin servers
Enable DDoS alerting to improve your response time

Though attacks are launched by humans, they are carried out by bots. Defenders who do not leverage automated defenses are at a disadvantage. Cloudflare has helped, and will continue to help, our customers in the healthcare industry prepare for and respond to these attacks.

Under attack? We can help. Visit this webpage or call us at +1 (888) 99 FLARE

How Novo Nordisk built a modern data architecture on AWS

2023-01-04 Jonatan Selsing

Post Syndicated from Jonatan Selsing original https://aws.amazon.com/blogs/big-data/how-novo-nordisk-built-a-modern-data-architecture-on-aws/

Novo Nordisk is a leading global pharmaceutical company, responsible for producing life-saving medicines that reach more than 34 million patients each day. They do this following their triple bottom line—that they must strive to be environmentally sustainable, socially sustainable, and financially sustainable. The combination of using AWS and data supports all these targets.

Data is pervasive throughout the entire value chain of Novo Nordisk. From foundational research, manufacturing lines, sales and marketing, clinical trials, pharmacovigilance, through patient-facing data-driven applications. Therefore, getting the foundation around how data is stored, safeguarded, and used in a way that provides the most value is one of the central drivers of improved business outcomes.

Together with AWS Professional Services, we’re building a data and analytics solution using a modern data architecture. The collaboration between Novo Nordisk and AWS Professional Services is a strategic and long-term close engagement, where developers from both organizations have worked together closely for years. The data and analytics environments are built around of the core tenets of the data mesh—decentralized domain ownership of data, data as a product, self-service data infrastructure, and federated computational governance. This enables the users of the environment to work with data in the way that drives the best business outcomes. We have combined this with elements from evolutionary architectures that will allow us to adapt different functionalities as AWS continuously develops new services and capabilities.

In this series of posts, you will learn how Novo Nordisk and AWS Professional Services built a data and analytics ecosystem to speed up innovation at petabyte scale:

In this first post, you will learn how the overall design has enabled the individual components to come together in a modular way. We dive deep into how we built a data management solution based on the data mesh architecture.
The second post discusses how we built a trust network between the systems that comprise the entire solution. We show how we use event-driven architectures, coupled with the use of attribute-based access controls, to ensure permission boundaries are respected at scale.
In the third post, we show how end-users can consume data from their tool of choice, without compromising data governance. This includes how to configure Okta, AWS Lake Formation, and Microsoft Power BI to enable SAML-based federated use of Amazon Athena for an enterprise business intelligence (BI) activity.

Pharma-compliant environment

As a pharmaceutical industry, GxP compliance is a mandate for Novo Nordisk. GxP is a general abbreviation for the “Good x Practice” quality guidelines and regulations defined by regulators such as European Medicines Agency, U.S. Food and Drug Administration, and others. These guidelines are designed to ensure that medicinal products are safe and effective for their intended use. In the context of a data environment, GxP compliance involves implementing integrity controls for data used to in decision making and processes and is used to guide how change management processes are implemented to continuously ensure compliance over time.

Because this data environment supports teams across the whole organization, each individual data owner must retain accountability on their data. Features were designed to provide data owners autonomy and transparency when managing their data, enabling them to take this responsibility. This includes the capability to handle personally identifiable information (PII) data and other sensitive workloads. To provide traceability on the environment, audit capabilities were added, which we describe more in this post.

Solution overview

The full solution is a sprawling landscape of independent services that work together to enable data and analytics with a decentralized data governance model at petabyte scale. Schematically, it can be represented as in the following figure.

The architecture is split into three independent layers: data management, virtualization, and consumption. The end-user sits in the consumption layer and works with their tool of choice. It’s meant to abstract as much of the AWS-native resources to application primitives. The consumption layer is integrated into the virtualization layer, which abstracts the access to data. The purpose of the virtualization layer is to translate between data consumption and data management solutions. The access to data is managed by what we refer to as data management solutions. We discuss one of our versatile data management solutions later in this post. Each layer in this architecture is independent of each other and instead only relies on well-defined interfaces.

Central to this architecture is that access is encapsulated in an AWS Identity and Access Management (IAM) role session. The data management layer focuses on providing the IAM role with the right permissions and governance, the virtualization layer provides access to the role, and the consumption layer abstracts the use of the roles in the tools of choice.

Technical architecture

Each of the three layers in the overall architecture has a distinct responsibility, but no singular implementation. Think of them as abstract classes. They can be implemented in concrete classes, and in our case they rely on foundational AWS services and capabilities. Let’s go through each of the three layers.

Data management layer

The data management layer is responsible for providing access to and governance of data. As illustrated in the following diagram, a minimal construct in the data management layer is the combination of an Amazon Simple Storage Service (Amazon S3) bucket and an IAM role that gives access to the S3 bucket. This construct can be expanded to include granular permission with Lake Formation, auditing with AWS CloudTrail, and security response capabilities from AWS Security Hub. The following diagram also shows that a single data management solution has no singular span. It can cross many AWS accounts and be comprised of any number of IAM role combinations.

We have purposely not illustrated the trust policy of these roles in this figure, because those are a collaborative responsibility between the virtualization layer and the data management layer. We go into detail of how that works in the next post in this series. Data engineering professionals often interface directly with the data management layer, where they curate and prepare data for consumption.

Virtualization layer

The purpose of the virtualization layer is to keep track of who can do what. It doesn’t have any capabilities in itself, but translates the requirements from the data management ecosystems to the consumption layers and vice versa. It enables end-users on the consumption layer to access and manipulate data on one or more data management ecosystems, according to their permissions. This layer abstracts from end-users the technical details on data access, such as permission model, role assumptions, and storage location. It owns the interfaces to the other layers and enforces the logic of the abstraction. In the context of hexagonal architectures (see Developing evolutionary architecture with AWS Lambda), the interface layer plays the role of the domain logic, ports, and adapters. The other two layers are actors. The data management layer communicates the state of the layer to the virtualization layer and conversely receives information about the service landscape to trust. The virtualization layer architecture is shown in the following diagram.

Consumption layer

The consumption layer is where the end-users of the data products are sitting. This can be data scientists, business intelligence analysts, or any third party that generates value from consuming the data. It’s important for this type of architecture that the consumption layer has a hook-based sign-in flow, where the authorization into the application can be modified at sign-in time. This is to translate the AWS-specific requirement into the target applications. After the session in the client-side application has successfully been started, it’s up to the application itself to instrument for data layer abstraction, because this will be application specific. And this is an additional important decoupling, where some responsibility is pushed to the decentralized units. Many modern software as a service (SaaS) applications support these built-in mechanisms, such as Databricks or Domino Data Lab, whereas more traditional client-side applications like RStudio Server have more limited native support for this. In the case where native support is missing, a translation down to the OS user session can be done to enable the abstraction. The consumption layer is shown schematically in the following diagram.

When using the consumption layer as intended, the users don’t know that the virtualization layer exists. The following diagram illustrates the data access patterns.

Modularity

One of the main advantages of adopting the hexagonal architecture pattern, and delegating both the consuming layer and the data management layer to primary and secondary actors, means that they can be changed or replaced as new functionalities are released that require new solutions. This gives a hub-and-spoke type pattern, where many different types of producer/consumer type systems can be connected and work simultaneously in union. An example of this is that the current solution running in Novo Nordisk supports multiple, simultaneous data management solutions and are exposed in a homogenous way in the consuming layer. This includes both a data lake, the data mesh solution presented in this post, and several independent data management solutions. And these are exposed to multiple types of consuming applications, from custom managed, self-hosted applications, to SaaS offerings.

Data management ecosystem

To scale the usage of the data and increase the freedom, Novo Nordisk, jointly with AWS Professional Services, built a data management and governance environment, named Novo Nordisk Enterprise DataHub (NNEDH). NNEDH implements a decentralized distributed data architecture, and data management capabilities such as an enterprise business data catalog and data sharing workflow. NNEDH is an example of a data management ecosystem in the conceptual framework introduced earlier.

Decentralized architecture: From a centralized data lake to a distributed architecture

Novo Nordisk’s centralized data lake consists of 2.3 PB of data from more than 30 business data domains worldwide serving over 2000+ internal users throughout the value chain. It has been running successfully for several years. It is one of the data management ecosystems currently supported.

Within the centralized data architecture, data from each data domain is copied, stored, and processed in one central location: a central data lake hosted in one data storage. This pattern has challenges at scale because it retains the data ownership with the central team. At scale, this model slows down the journey toward a data-driven organization, because ownership of the data isn’t sufficiently anchored with the professionals closest to the domain.

The monolithic data lake architecture is shown in the following diagram.

Within the decentralized distributed data architecture, the data from each domain is kept within the domain on its own data storage and compute account. In this case, the data is kept close to domain experts, because they’re the ones who know their own data best and are ultimately the owner of any data products built around their data. They often work closely with business analysts to build the data product and therefore know what good data means to consumers of their data products. In this case, the data responsibility is also decentralized, where each domain has its own data owner, putting the accountability onto the true owners of the data. Nevertheless, this model might not work at small scale, for example an organization with only one business unit and tens of users, because it would introduce more overhead on the IT team to manage the organization data. It better suits large organizations, or small and medium ones that would like to grow and scale.

The Novo Nordisk data mesh architecture is shown in the following diagram.

Data domains and data assets

To enable the scalability of data domains across the organization, it’s mandatory to have a standard permission model and data access pattern. This standard must not be too restrictive in such a way that it may be a blocker for specific use cases, but it should be standardized in such a way to use the same interface between the data management and virtualization layers.

The data domains on NNEDH are implemented by a construct called an environment. An environment is composed of at least one AWS account and one AWS Region. It’s a workplace where data domain teams can work and collaborate to build data products. It links the NNEDH control plane to the AWS accounts where the data and compute of the domain reside. The data access permissions are also defined at the environment level, managed by the owner of the data domain. The environments have three main components: a data management and governance layer, data assets, and optional blueprints for data processing.

For data management and governance, the data domains rely on Lake Formation, AWS Glue, and CloudTrail. The deployment method and setup of these components is standardized across data domains. This way, the NNEDH control plane can provide connectivity and management to data domains in a standardized way.

The data assets of each domain residing in an environment are organized in a dataset, which is a collection of related data used for building a data product. It includes technical metadata such as data format, size, and creation time, and business metadata such as the producer, data classification, and business definition. A data product can use one or several datasets. It is implemented through managed S3 buckets and the AWS Glue Data Catalog.

Data processing can be implemented in different ways. NNEDH provides blueprints for data pipelines with predefined connectivity to data assets to speed up the delivery of data products. Data domain users have the freedom to use any other compute capability on their domain, for example using AWS services not predefined on the blueprints or accessing the datasets from other analytics tools implemented in the consumption layer, as mentioned earlier in this post.

Data domain personas and roles

On NNEDH, the permission levels on data domains are managed through predefined personas, for example data owner, data stewards, developers, and readers. Each persona is associated with an IAM role that has a predefined permission level. These permissions are based on the typical needs of users on these roles. Nevertheless, to give more flexibility to data domains, these permissions can be customized and extended as needed.

The permissions associated with each persona are related only to actions allowed on the AWS account of the data domain. For the accountability on data assets, the data access to the assets is managed by specific resource policies instead of IAM roles. Only the owner of each dataset, or data stewards delegated by the owner, can grant or revoke data access.

On the dataset level, a required persona is the data owner. Typically, they work closely with one or many data stewards as data products managers. The data steward is the data subject matter expert of the data product domain, responsible for interpreting collected data and metadata to derive deep business insights and build the product. The data steward bridges between business users and technical teams on each data domain.

Enterprise business data catalog

To enable freedom and make the organization data assets discoverable, a web-based portal data catalog is implemented. It indexes in a single repository metadata from datasets built on data domains, breaking data silos across the organization. The data catalog enables data search and discovery across different domains, as well as automation and governance on data sharing.

The business data catalog implements data governance processes within the organization. It ensures the data ownership—someone in the organization is responsible for the data origin, definition, business attributes, relationships, and dependencies.

The central construct of a business data catalog is a dataset. It’s the search unit within the business catalog, having both technical and business metadata. To collect technical metadata from structured data, it relies on AWS Glue crawlers to recognize and extract data structures from the most popular data formats, including CSV, JSON, Avro, and Apache Parquet. It provides information such as data type, creation date, and format. The metadata can be enriched by business users by adding a description of the business context, tags, and data classification.

The dataset definition and related metadata are stored in an Amazon Aurora Serverless database and Amazon OpenSearch Service, enabling you to run textual queries on the data catalog.

Data sharing

NNEDH implements a data sharing workflow, enabling peer-to-peer data sharing across AWS accounts using Lake Formation. The workflow is as follows:

A data consumer requests access to the dataset.
The data owner grants access by approving the access request. They can delegate the approval of access requests to the data steward.
Upon the approval of an access request, a new permission is added to the specific dataset in Lake Formation of the producer account.

The data sharing workflow is shown schematically in the following figure.

Security and audit

The data in the Novo Nordisk data mesh lies in AWS accounts owned by Novo Nordisk business accounts. The configuration and the states of the data mesh are stored in Amazon Relational Database Service (Amazon RDS). The Novo Nordisk security architecture is shown in the following figure.

Access and edits to the data in NNEDH needs to be logged for audit purposes. We need to be able to tell who modified data, when the modification happened, and what modifications were applied. In addition, we need to be able to answer why the modification was allowed by that person at that time.

To meet these requirements, we use the following components:

CloudTrail to log API calls. We specifically enable CloudTrail data event logging for S3 buckets and objects. By activating the logging, we can trace back any modification to any files in the data lake to the person who made the modification. We enforce usage of source identity for IAM role sessions to ensure user traceability.
We use Amazon RDS to store the configuration of the data mesh. We log queries against the RDS database. Together with CloudTrail, this log allows us to answer the question of why a modification to a file in Amazon S3 at a specific time by a specific person is possible.
Amazon CloudWatch to log activities across the mesh.

In addition to those logging mechanisms, the S3 buckets are created using the following properties:

The bucket is encrypted using server-side encryption with AWS Key Management Service (AWS KMS) and customer managed keys
Amazon S3 versioning is activated by default

Access to the data in NNEDH is controlled at the group level instead of individual users. The group corresponds to the group defined in the Novo Nordisk directory group. To keep track of the person who modified the data in the data lakes, we use the source identity mechanism explained in the post How to relate IAM role activity to corporate identity.

Conclusion

In this post, we showed how Novo Nordisk built a modern data architecture to speed up the delivery of data-driven use cases. It includes a distributed data architecture, to scale the usage to petabyte scale for over 2,000 internal users throughout the value chain, as well as a distributed security and audit architecture handling data accountability and traceability on the environment to meet their compliance requirements.

The next post in this series describes the implementation of distributed data governance and control at scale of Novo Nordisk’s modern data architecture.

About the Authors

Jonatan Selsing is former research scientist with a PhD in astrophysics that has turned to the cloud. He is currently the Lead Cloud Engineer at Novo Nordisk, where he enables data and analytics workloads at scale. With an emphasis on reducing the total cost of ownership of cloud-based workloads, while giving full benefit of the advantages of cloud, he designs, builds, and maintains solutions that enable research for future medicines.

Hassen Riahi is a Sr. Data Architect at AWS Professional Services. He holds a PhD in Mathematics & Computer Science on large-scale data management. He works with AWS customers on building data-driven solutions.

Anwar Rizal is a Senior Machine Learning consultant based in Paris. He works with AWS customers to develop data and AI solutions to sustainably grow their business.

Moses Arthur comes from a mathematics and computational research background and holds a PhD in Computational Intelligence specialized in Graph Mining. He is currently a Cloud Product Engineer at Novo Nordisk building GxP-compliant enterprise data lakes and analytics platforms for Novo Nordisk global factories producing digitalized medical products.

Alessandro Fior is a Sr. Data Architect at AWS Professional Services. With over 10 years of experience delivering data and analytics solutions, he is passionate about designing and building modern and scalable data platforms that accelerate companies to get value from their data.

Kumari Ramar is an Agile certified and PMP certified Senior Engagement Manager at AWS Professional Services. She delivers data and AI/ML solutions that speed up cross-system analytics and machine learning models, which enable enterprises to make data-driven decisions and drive new innovations.

Introducing Amazon Omics – A Purpose-Built Service to Store, Query, and Analyze Genomic and Biological Data at Scale

2022-11-29 Channy Yun

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/introducing-amazon-omics-a-purpose-built-service-to-store-query-and-analyze-genomic-and-biological-data-at-scale/

You might learn in high school biology class that the human genome is composed of over three billion letters of code using adenine (A), guanine (G), cytosine (C), and thymine (T) paired in the deoxyribonucleic acid (DNA). The human genome acts as the biological blueprint of every human cell. And that’s only the foundation for what makes us human.

Healthcare and life sciences organizations collect myriad types of biological data to improve patient care and drive scientific research. These organizations map an individual’s genetic predisposition to disease, identify new drug targets based on protein structure and function, profile tumors based on what genes are expressed in a specific cell, or investigate how gut bacteria can influence human health. Collectively, these studies are often known as “omics”.

AWS has helped healthcare and life sciences organizations accelerate the translation of this data into actionable insights for over a decade. Industry leaders such as as Ancestry, AstraZeneca, Illumina, DNAnexus, Genomics England, and GRAIL leverage AWS to accelerate time to discovery while concurrently reducing costs and enhancing security.

The scale these customers, and others, operate at continues to increase rapidly. When omics data across thousand or hundreds of thousands (or more!) of individuals are compared and analyzed, new insights for predicting disease and the efficacy of different drug treatments are possible.

However, this scale, which can be many petabytes of data, can add complexity. When I studied medical informatics in my Ph.D course, I experienced this complexity in data access, processing, and tooling. You need a way to store omics data that is cost-efficient and easy to access. You need to scale compute across millions of biological samples while preserving accuracy and reliability. You also need specialized tools to analyze genetic patterns across populations and train machine learning (ML) models to predict diseases.

Today I’m excited to announce the general availability of Amazon Omics, a purpose-built service to help bioinformaticians, researchers, and scientists store, query, and analyze genomic, transcriptomic, and other omics data and then generate insights from that data to improve health and advance scientific discoveries.

With just a few clicks in the Omics console, you can import and normalize petabytes of data into formats optimized for analysis. Amazon Omics provides scalable workflows and integrated tools for preparing and analyzing omics data and automatically provisions and scales the underlying cloud infrastructure. So, you can focus on advancing science and translate discoveries into diagnostics and therapies.

Amazon Omics has three primary components:

Omics-optimized object storage that helps customers store and share their data efficiently and at low cost.
Managed compute for bioinformatics workflows that allows customers to run the exact analysis they specify, without worrying about provisioning underlying infrastructure.
Optimized data stores for population-scale variant analysis.

Now let’s learn more about each component of Amazon Omics. Generally, it follows the steps to create a data store and import data files, such as genome sequencing raw data, set up a basic bioinformatics workflow, and analyze results using existing AWS analytics and ML services.

The Getting Started page in the Omics console contains tutorial examples using Amazon SageMaker notebooks with the Python SDK. I will demonstrate Amazon Omics features through an example using a human genome reference.

Omics Data Storage
The Omics data storage helps you store and share petabytes of omics data efficiently. You can create data stores and import sample data in the Omics console and also do the same job in the AWS Command Line Interface (AWS CLI).

Let’s make a reference store and import a reference genome. This example uses Genome Reference Consortium Human Reference 38 (hg38), which is open access and available from the following Amazon S3 bucket: s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.

As prerequisites, you need to create Amazon S3 bucket in your preferred Region and the necessary IAM permissions to access S3 buckets. In the Omics console, you can easily create and select IAM role during the Omics storage setup.

Use the following AWS CLI command to create your reference store, copy the genome data to your S3 bucket, and import it data into your reference store.

// Create your reference store
$ aws omics create-reference-store --name "Reference Store"

// Import your reference data into your data store
$ aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta,name=hg38 s3://channy-omics
$ aws omics start-reference-import-job --sources sourceFile=s3://channy-omics/Homo_sapiens_assembly38.fasta,name=hg38 --reference-store-id 123456789 --role-arn arn:aws:iam::01234567890:role/OmicsImportRole

You can see the result in your console too.

Now you can create a sequence store. A sequence store is similar to an S3 bucket. Each object in a sequence store is known as a “read set”. A read set is an abstraction of a set of genomics file types:

FASTQ – A text-based file format that stores information about a base (sequence letter) from a sequencer and the corresponding quality information.
BAM – The compressed binary version of raw reads and their mapping to a reference genome.
CRAM – Similar to BAM, but uses the reference genome information to aid in compression.

Amazon Omics allows you to specify domain-specific metadata to your read sets you import. These are searchable and defined when you start a read set import job.

As an example, we will use the 1000 Genomes Project, a highly detailed catalogue of more than 80 million human genetic variants for more than 400 billions data points from over 2500 individuals. Let’s make a sequence store and then import genome sequence files into it.

// Create your sequence store 
$ aws omics create-sequence-store --name "MySequenceStore"

// Import your reference data into your data store
$ aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_1.filt.fastq.gz s3://channy-omics
$ aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_2.filt.fastq.gz s3://channy-omics

$ aws omics start-read-set-import-job --cli-input-json ‘
{
    "sourceFiles":
    {
        "source1": "s3://channy-omics/SRR233106_1.filt.fastq.gz",
        "source2": "s3://channy-omics/SRR233106_2.filt.fastq.gz"

    },
    "sourceFileType": "FASTQ",
    "subjectId": "mySubject2",
    "sampleId": "mySample2",
    "referenceArn": "arn:aws:omics:us-east-1:123456789012:referenceStore/123467890",
    "name": "HG00100"
}’

You can see the result in your console again.

Analytics Transformations
You can store variant data referring to a mutation, a difference between what the sequencer read at a position compared to the known reference and annotation data, known information about a location or variant in a genome, such as whether it may cause disease.

A variant store supports both variant call format files (VCF) where there is a called variant and gVCF inputs with records covering every position in a genome. An annotation store supports either a generic feature format (GFF3), tab-separated values (TSV), or VCF file. An annotation store can be mapped to the same coordinate system as variant stores during an import.

Once you’ve imported your data, you can now run queries like as followings which search for Single Nucleotide Variants (SNVs), the most common type of genetic variation among people, on human chromosome 1.

SELECT
    sampleid,
    contigname,
    start,
    referenceallele,
    alternatealleles
FROM "myvariantstore"."myvariantstore"
WHERE
    contigname = 'chr1'
    and cardinality(alternatealleles) = 1
    and length(alternatealleles[1]) = 1
    and length(referenceallele) = 1
LIMIT 10

You can see the output of this query:

#	sampleid	contigname	start	referenceallele	alternatealleles
1	NA20858	chr1	10096	T	[A]
2	NA19347	chr1	10096	T	[A]
3	NA19735	chr1	10096	T	[A]
4	NA20827	chr1	10102	T	[A]
5	HG04132	chr1	10102	T	[A]
6	HG01961	chr1	10102	T	[A]
7	HG02314	chr1	10102	T	[A]
8	HG02837	chr1	10102	T	[A]
9	HG01111	chr1	10102	T	[A]
10	NA19205	chr1	10108	A	[T]

You can view, manage, and query those data by integrating with existing analytics engines such as Amazon Athena. These query results can be used to train ML models in Amazon SageMaker.

Bioinformatics Workflows
Amazon Omics allows you to perform bioinformatics workflow, such as variant calling or gene expression, analysis on AWS. These compute workloads are defined using workflow languages like Workflow Description Language (WDL) and Nextflow, domain-specific languages that specify multiple compute tasks and their input and output dependencies.

You can define and execute a workflow using a few simple CLI commands. As an example, create a main.wdl file with the following WDL codes to create a simple WDL workflow with one task that creates a copy of a file.

version 1.0
workflow Test {
	input {
		File input_file
	}
	call FileCopy {
		input:
			input_file = input_file,
	}
	output {
		File output_file = FileCopy.output_file
	}
}
task FileCopy {
	input {
		File input_file
	}
	command {
		echo "copying ~{input_file}" >&2
		cat ~{input_file} > output
	}
	output {
		File output_file = "output"
	}
}

Then zip up your workflow and create your workflow with Amazon Omics using the AWS CLI:

$ zip my-wdl-workflow-zip main.wdl
$ aws omics create-workflow \
    --name MyWDLWorkflow \
    --description "My WDL Workflow" \
    --definition-zip file://my-wdl-workflow.zip \
    --parameter-template '{"input_file": "input test file to copy"}'

To run the workflow we just created, you can use the following command:

aws omics start-run \
  --workflow-id // id of the workflow we just created  \
  --role-arn // arn of the IAM role to run the workflow with  \
  --parameters '{"input_file": "s3://bucket/path/to/file"}' \
  --output-uri s3://bucket/path/to/results

Once the workflow completes, you could use these results in s3://bucket/path/to/results for downstream analyses in the Omics variant store.

You can execute a run, a single invocation of a workflow with a task and defined compute specifications. An individual run acts on your defined input data and produces an output. Runs also can have priorities associated with them, which allow specific runs to take execution precedence over other submitted and concurrent runs. For example, you can specify that a run that is high priority will be run before one that is lower priority.

You can optionally use a run group, a group of runs that you can set the max vCPU and max duration runs to help limit the compute resources used per run. This can help you partition users who may need access to different workflows to run on different data. It can also be used as a budget control/resource fairness mechanism by isolating users to specific run groups.

As you saw, Amazon Omics gives you a managed service with a couple of clicks and simple commands, and APIs in analyzing large-scale omic data, such as human genome samples so you can derive meaningful insights from this data, in hours rather than weeks. We also provide more tutorial SageMaker notebooks that you can use in Amazon SageMaker to help you get started.

In terms of data security, Amazon Omics helps ensure that your data remains secure and patient privacy is protected with customer-managed encryption keys, and HIPAA eligibility.

Customer and Partner Voices
Customers and partners in the healthcare and life science industry have shared how they are using Amazon Omics to accelerate scientific insights.

Children’s Hospital of Philadelphia (CHOP) is the oldest hospital in the United States dedicated exclusively to pediatrics and strives to advance healthcare for children with the integration of excellent patient care and innovative research. AWS has worked with the CHOP Research Institute for many years as they’ve led the way in utilizing data and technology to solve challenging problems in child health.

“At Children’s Hospital of Philadelphia, we know that getting a comprehensive view of our patients is crucial to delivering the best possible care, based on the most innovative research. Combining multiple clinical modalities is foundational to achieving this. With Amazon Omics, we can expand our understanding of our patients’ health, all the way down to their DNA.” – Jeff Pennington, Associate Vice President & Chief Research Informatics Officer, Children’s Hospital of Philadelphia

G42 Healthcare enables AI-powered healthcare that uses data and emerging technologies to personalize preventative care.

“Amazon Omics allows G42 to accelerate a competitive and deployable end-to-end service with globally leading data governance. We’re able to leverage the extensive omics data management and bioinformatics solutions hosted globally on AWS, at our customers’ fingertips. Our collaboration with AWS is much more than data – it’s about value.” – Ashish Koshi, CEO, G42 Healthcare

C2i Genomics brings together researchers, physicians and patients to utilize ultra-sensitive whole-genome cancer detection to personalize medicine, reduce cancer treatment costs, and accelerate drug development.

“In C2i Genomics, we empower our data scientists by providing them cloud-based computational solutions to run high-scale, customizable genomic pipelines, allowing them to focus on method development and clinical performance, while the company’s engineering teams are responsible for the operations, security and privacy aspects of the workloads. Amazon Omics allows researchers to use tools and languages from their own domain, and considerably reduces the engineering maintenance effort while taking care of cost and resource allocation considerations, which in turn reduce time-to-market and NRE costs of new features and algorithmic improvements.” – Ury Alon, VP Engineering, C2i Genomics

We are excited to work hand in hand with our AWS partners to build scalable, multi-modal solutions that enable the conversion of raw sequencing data into insights.

Lifebit builds enterprise data platforms for organizations with complex and sensitive biomedical datasets, empowering customers across the life sciences sector to transform how they use sensitive biomedical data.

“At Lifebit, we’re on a mission to connect the world’s biomedical data to obtain novel therapeutic insights. Our customers work with vast cohorts of linked genomic, multi-omics and clinical data – and these data volumes are expanding rapidly. With Amazon Omics they will have access to optimised analytics and storage for this large-scale data, allowing us to provide even more scalable bioinformatics solutions. Our customers will benefit from significantly lower cost per gigabase of data, essentially achieving hot storage performance at cold storage prices, removing cost as a barrier to generating insights from their population-scale biomedical data.” – Thorben Seeger, Chief Business Development Officer, Lifebit

To hear more customers and partner voices, see Amazon Omics Customers page.

Now Available
Amazon Omics is now available in the US East (N. Virginia), US West (Oregon), Europe (Ireland), Europe (London), Europe (Frankfurt), and Asia Pacific (Singapore) Regions.

To learn more, see the Amazon Omics page, Amazon Omics User Guide, Genomics on AWS, and Healthcare & Life Sciences on AWS. Give it a try, and please contact AWS genomics team and send feedback through your usual AWS support contacts.

– Channy

A Digital Red Cross

2022-11-14 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/11/a-digital-red-cross.html

The International Committee of the Red Cross wants some digital equivalent to the iconic red cross, to alert would-be hackers that they are accessing a medical network.

The emblem wouldn’t provide technical cybersecurity protection to hospitals, Red Cross infrastructure or other medical providers, but it would signal to hackers that a cyberattack on those protected networks during an armed conflict would violate international humanitarian law, experts say, Tilman Rodenhäuser, a legal adviser to the International Committee of the Red Cross, said at a panel discussion hosted by the organization on Thursday.

I can think of all sorts of problems with this idea and many reasons why it won’t work, but those also apply to the physical red cross on buildings, vehicles, and people’s clothing. So let’s try it.

EDITED TO ADD: Original reference.

Let’s Architect! Architecting in health tech

2022-10-05 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-architecting-in-health-tech/

Healthcare technology, commonly referred to as “health tech,” is the use of technologies developed for the purpose of improving any and all aspects of the healthcare system. For example, IT tools or software designed to boost hospital/administrative productivity, give insights into new and existing treatments, or improve the overall quality of care.

Also known as “digital health”, health tech uses databases, applications, mobile devices, and wearables to facilitate the delivery, payment, and/or consumption of healthcare. The increased accessibility to these technologies can further increase the development and launch of additional healthcare products.

In this post, we explore how to build and manage health tech architectures using Amazon Web Services (AWS).

HIPAA Reference Architecture on the AWS Cloud

This Quick Start provides guidance for deploying a U.S. Health Insurance Portability and Accountability Act (HIPAA) architecture on the AWS Cloud. Specifically, this aims to help those in the healthcare industry build and implement HIPAA-ready environments that fit with an organization’s larger HIPAA compliance program. It includes AWS CloudFormation templates that are customizable, plus automatically deploy the environment and configure AWS resources.

Using AppStream 2.0 to Deliver PACS and Image Analysis in Clinical Trials

Amazon AppStream 2.0 is a fully managed, non-persistent desktop and application service for remotely accessing your work. This means that clinical staff can now access the medical applications and data they need from anywhere. Benefits of using AppStream 2.0 include reduced overhead cost. This Architecture Blog post examines how to construct the AWS architecture for an image analysis application used in clinical trials, while keeping cost down. Furthermore, it demonstrates how something seemingly complex can be built with ease using both AWS services and image analysis applications already in place.

How fEMR Delivers Cryptographically Secure and Verifiable Medical Data with Amazon QLDB

Data veracity is fundamental. Patient data is confidential, and when a system deals with sensitive data, there needs to be a clear chain of ownership.

This blog post depicts an architecture based on the use of Amazon Quantum Ledger Database (Amazon QLDB), which addresses the need for data integrity and verifiability in healthcare. By using Amazon QLDB, the team can take advantage of an append-only journal to create a verifiable electronic medical record.

Also explored are the challenges architects face while working on these types of systems, as well as considerations about security, operational efficiency, processes for repeatable deployments using infrastructure as code, and data replication across multiple databases. The design choices architects make when developing a system depend on the context; read more about the mental models adopted in this use case.

Service Workbench on AWS

Service Workbench on AWS is a cloud solution that enables IT teams to provide secure, repeatable, and federated control of access to data, tooling, and compute power. Service Workbench can help redirect researchers’ focus from technical duties back to the research itself, by allowing them to automate of the creation of baseline research setups and providing data access. It gives researchers the ability to build research environments in minutes without having to know about cloud infrastructure or wait for research IT to respond. It is fully HIPAA-compliant and allows for secure peer-to-peer collaboration, including with individuals from other institutions.

See you next time!

Thanks for joining our discussion on health tech architectures! See you in two weeks for more architecture best practices and guidance.

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

AWS achieves HDS certification to three additional Regions

2022-07-15 Janice Leung

Post Syndicated from Janice Leung original https://aws.amazon.com/blogs/security/aws-achieves-hds-certification-to-three-additional-regions/

We’re excited to announce that three additional AWS Regions—Asia Pacific (Korea), Europe (London), and Europe (Stockholm)—have been granted the Health Data Hosting (Hébergeur de Données de Santé, HDS) certification. This alignment with the HDS requirements demonstrates our continued commitment to adhere to the heightened expectations for cloud service providers. AWS customers who handle personal health data can be hosted in the AWS Cloud certified Regions with confidence.

The following 16 Regions are now in scope of this certification:

US East (Ohio)
US East (Northern Virginia)
US West (Northern California)
US West (Oregon)
Asia Pacific (Mumbai)
Asia Pacific (Korea)
Asia Pacific (Singapore)
Asia Pacific (Sydney)
Asia Pacific (Tokyo)
Canada (Central)
Europe (Frankfurt)
Europe (Ireland)
Europe (London)
Europe (Paris)
Europe (Stockholm)
South America (Sao Paulo)

Introduced by the French governmental agency for health, Agence Française de la Santé Numérique (ASIP Santé), HDS certification aims to strengthen the security and protection of personal health data. Achieving this certification demonstrates that AWS provides a framework for technical and governance measures to secure and protect personal health data, governed by French law.

AWS was evaluated and certified by independent third-party auditors on June 30, 2022. The Certificate of Compliance demonstrating the AWS compliance status is available on the Agence du Numérique en Santé (ANS) website and through AWS Artifact. AWS Artifact is a self-service portal for on-demand access to AWS compliance reports. Sign in to AWS Artifact in the AWS Management Console, or learn more at Getting Started with AWS Artifact.

For up-to-date information, including when additional Regions are added, see the AWS Compliance Program, and choose HDS.

AWS strives to continuously bring services into scope of its compliance programs to help you meet your architectural and regulatory needs. Please reach out to your AWS account team if you have questions or feedback about HDS compliance.

To learn more about our compliance and security programs, see AWS Compliance Programs. As always, we value your feedback and questions; reach out to the AWS Compliance team through the Contact Us page.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Building a serverless cloud-native EDI solution with AWS

2022-04-12 Ripunjaya Pattnaik

Post Syndicated from Ripunjaya Pattnaik original https://aws.amazon.com/blogs/architecture/building-a-serverless-cloud-native-edi-solution-with-aws/

Electronic data interchange (EDI) is a technology that exchanges information between organizations in a structured digital form based on regulated message formats and standards. EDI has been used in healthcare for decades on the payer side for determination of coverage and benefits verification. There are different standards for exchanging electronic business documents, like American National Standards Institute X12 (ANSI), Electronic Data Interchange for Administration, Commerce and Transport (EDIFACT), and Health Level 7 (HL7).

HL7 is the standard to exchange messages between enterprise applications, like a Patient Administration System and a Pathology Laboratory Information. However, HL7 messages are embedded in Health Insurance Portability and Accountability Act (HIPAA) X12 for transactions between enterprises, like hospital and insurance companies.

HIPAA is a federal law that required the creation of national standards to protect sensitive patient health information from being disclosed without the patient’s consent or knowledge. It also mandates healthcare organizations to follow a standardized mechanism of EDI to submit and process insurance claims.

In this blog post, we will discuss how you can build a serverless cloud-native EDI implementation on AWS using the Edifecs XEngine Server.

EDI implementation challenges

Due to its structured format, EDI facilitates the consistency of business information for all participants in the exchange process. The primary EDI software that is used processes the information and then translates it into a more readable format. This can be imported directly and automatically into your integration systems. Figure 1 shows a high-level transaction for a healthcare EDI process.

Figure 1. EDI Transaction Sets exchanges between healthcare provider and payer

Along with the implementation itself, the following are some of the common challenges encountered in EDI system development:

Scaling. Despite the standard protocols of EDI, the document types and business rules differ across healthcare providers. You must scale the scope of your EDI judiciously to handle a diverse set of data rules with multiple EDI protocols.
Flexibility in EDI integration. As standards evolve, your EDI system development must reflect those changes.
Data volumes and handling bad data. As the volume of data increases, so does the chance for errors. Your storage plans must adjust as well.
Agility. In healthcare, EDI handles business documents promptly, as real-time document delivery is critical.
Compliance. State Medicaid and Medicare rules and compliance can be difficult to manage. HIPAA compliance and CAQH CORE certifications can be difficult to acquire.

Solution overview and architecture data flow

Providers and Payers can send requests as enrollment inquiry, certification request, or claim encounter to one another. This architecture uses these as source data requests coming from the Providers and Payers as flat files (.txt and .csv), Active Message Queues, and API calls (submitters).

The steps for the solution shown in Figure 2 are as follows:

1. Flat, on-premises files are transferred to Amazon Simple Storage Service (S3) buckets using AWS Transfer Family (2).
3. AWS Fargate on Amazon Elastics Container Service (Amazon ECS) runs Python packages to convert the transactions into JSON messages, then queues it on Amazon MQ (4).
5. Java Message Service (JMS) Bridge, which runs Apache Camel on Fargate, pulls the messages from the on-premises messaging systems and queues them on Amazon MQ (6).
7. Fargate also runs programs to call the on-premises API or web services to get the transactions and queues it on Amazon MQ (8).
9. Amazon CloudWatch monitors the queue depth. If queue depth goes beyond a set threshold, CloudWatch sends notifications to the containers through Amazon Simple Notification Service (SNS) (10).
11. Amazon SNS triggers AWS Lambda, which adds tasks to Fargate (12), horizontally scaling it to handle the spike.
13. Fargate runs Python programs to read the messages on Amazon MQ and uses PYX12 packages to convert the JSON messages to EDI file formats, depending on the type of transactions.
14. The container also may queue the EDI requests on different queues, as the solution uses multiple trading partners for these requests.
15. The solution runs Edifecs XEngine Server on Fargate with Docker image. This polls the messages from the queues previously mentioned and converts them to EDI specification by the trading partners that are registered with Edifecs.
16. Python module running on Fargate converts the response from the trading partners to JSON.
17. Fargate sends JSON payload as a POST request using Amazon API Gateway, which updates requestors’ backend systems/databases (12) that are running microservices on Amazon ECS (11).
18. The solution also runs Elastic Load Balancing to balance the load across the Amazon ECS cluster to take care of any spikes.
19. Amazon ECS runs microservices that uses Amazon RDS (20) for domain specific data.

Figure 2. EDI transaction-processing system architecture on AWS

Handling PII/PHI data

The EDI request and response file includes protected health information (PHI)/personal identifiable information (PII) data related to members, claims, and financial transactions. The solution leverages all AWS services that are HIPAA eligible and encrypts data at rest and in-transit. The file transfers are through FTP, and the on-premises request/response files are Pretty Good Privacy (PGP) encrypted. The Amazon S3 buckets are secured through bucket access policies and are AES-256 encrypted.

Amazon ECS tasks that are hosted in Fargate use ephemeral storage that is encrypted with AES-256 encryption, using an encryption key managed by Fargate. User data stored in Amazon MQ is encrypted at rest. Amazon MQ encryption at rest provides enhanced security by encrypting data using encryption keys stored in the AWS Key Management Service. All connections between Amazon MQ brokers use Transport Layer Security to provide encryption in transit. All APIs are accessed through API gateways secured through Amazon Cognito. Only authorized users can access the application.

The architecture provides many benefits to EDI processing:

Scalability. Because the solution is highly scalable, it can speed integration of new partner/provider requirements.
Compliance. Use the architecture to run sensitive, HIPAA-regulated workloads. If you plan to include PHI (as defined by HIPAA) on AWS services, first accept the AWS Business Associate Addendum (AWS BAA). You can review, accept, and check the status of your AWS BAA through a self-service portal available in AWS Artifact. Any AWS service can be used with a healthcare application, but only services covered by the AWS BAA can be used to store, process, and transmit protected health information under HIPAA.
Cost effective. Though serverless cost is calculated by usage, with this architecture you save as your traffic grows.
Visibility. Visualize and understand the flow of your EDI processing using Amazon CloudWatch to monitor your databases, queues, and operation portals.
Ownership. Gain ownership of your EDI and custom or standard rules for rapid change management and partner onboarding.

Conclusion

In this healthcare use case, we demonstrated how a combination of AWS services can be used to increase efficiency and reduce cost. This architecture provides a scalable, reliable, and secure foundation to develop your EDI solution, while using dependent applications. We established how to simplify complex tasks in order to manage and scale your infrastructure for a high volume of data. Finally, the solution provides for monitoring your workflow, services, and alerts.

For further reading:

Are Fake COVID Testing Sites Harvesting Data?

2022-01-19 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/01/are-fake-covid-testing-sites-harvesting-data.html

Over the past few weeks, I’ve seen a bunch of writing about what seems to be fake COVID-19 testing sites. They take your name and info, and do a nose swab, but you never get test results. Speculation centered around data harvesting, but that didn’t make sense because it was far too labor intensive for that and — sorry to break it to you — your data isn’t worth all that much.

It seems to be multilevel marketing fraud instead:

The Center for COVID Control is a management company to Doctors Clinical Laboratory. It provides tests and testing supplies, software, personal protective equipment and marketing services — online and printed — to testing sites, said a person who was formerly associated with the Center for COVID Control. Some of the sites are owned independently but operate in partnership with the chain under its name and with its guidance.

[…]

Doctors Clinical Lab, the lab Center for COVID Control uses to process tests, makes money by billing patients’ insurance companies or seeking reimbursement from the federal government for testing. Insurance statements reviewed by Block Club show the lab has, in multiple instances, billed insurance companies $325 for a PCR test, $50 for a rapid test, $50 for collecting a person’s sample and $80 for a “supplemental fee.”

In turn, the testing sites are paid for providing samples to the lab to be processed, said a person formerly associated with the Center for COVID Control.

In a January video talking to testing site operators, Syed said the Center for COVID Control will no longer provide them with PCR tests, but it will continue supplying them with rapid tests at a cost of $5 per test. The companies will keep making money for the rapid tests they collect, he said.

“You guys will continue making the $28.50 you’re making for the rapid test,” Syed said in the video.

Read the article for the messy details. Or take a job and see for yourself.

A Death Due to Ransomware

2021-10-01 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/10/a-death-due-to-ransomware.html

The Wall Street Journal is reporting on a baby’s death at an Alabama hospital in 2019, which they argue was a direct result of the ransomware attack the hospital was undergoing.

Amid the hack, fewer eyes were on the heart monitors — normally tracked on a large screen at the nurses’ station, in addition to inside the delivery room. Attending obstetrician Katelyn Parnell texted the nurse manager that she would have delivered the baby by caesarean section had she seen the monitor readout. “I need u to help me understand why I was not notified.” In another text, Dr. Parnell wrote: “This was preventable.”

[The mother] Ms. Kidd has sued Springhill [Medical Center], alleging information about the baby’s condition never made it to Dr. Parnell because the hack wiped away the extra layer of scrutiny the heart rate monitor would have received at the nurses’ station. If proven in court, the case will mark the first confirmed death from a ransomware attack.

What will be interesting to see is whether the courts rule that the hospital was negligent in its security, contributing to the success of the ransomware and by extension the death of the infant.

Springhill declined to name the hackers, but Allan Liska, a senior intelligence analyst at Recorded Future, said it was likely the Russianbased Ryuk gang, which was singling out hospitals at the time.

They’re certainly never going to be held accountable.

Another article.

Alaska’s Department of Health and Social Services Hack

2021-09-21 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/09/alaskas-department-of-health-and-social-services-hack.html

Apparently, a nation-state hacked Alaska’s Department of Health and Social Services.

Not sure why Alaska’s Department of Health and Social Services is of any interest to a nation-state, but that’s probably just my failure of imagination.

Ensure Workload Resiliency and Comply with Data Residency Requirements with AWS Outposts

2021-09-07 Pedram Jahangiri

Post Syndicated from Pedram Jahangiri original https://aws.amazon.com/blogs/architecture/ensure-workload-resiliency-and-comply-with-data-residency-requirements-with-aws-outposts/

Personally Identifiable Information (PII) and Personal Health Information (PHI) data are critical to the continued operation of healthcare organizations. Disaster recovery (DR) strategies must ensure that healthcare provider workloads with PII and PHI data operate with near-zero data loss recovery point objective (RPO) and recovery time objective (RTO). In most cases, an organization would typically use a multi-Region active/active strategy for DR to achieve this goal.

However, in a country like the UK where there is only one AWS Region available, you can’t use the typical approach to set up a multi-Region DR strategy. Additionally, laws like the UK General Data Protection Regulation (UK GDPR) apply to many businesses and organizations. These data residency requirements protect the privacy of data and workloads by storing and processing data within a particular country or geographic location.

In this post, we show you a use case from a healthcare organization in the UK. This example shows you how to set up a modified multi-Region DR strategy that uses one Region and AWS Outposts to ensure the resiliency of your workload and address data residency requirements. While our use case applies to healthcare, this solution can be applied to any industry where you need a highly available workload that must meet data residency requirements.

Overview of AWS Outposts

With Outposts, you get same reliable, secure, and high-performance infrastructure as AWS Cloud. It also provides services, APIs, tools for automation, deployment pipelines, and security controls.

Outposts are racks (as shown in Figure 1) moved from a Region and placed in your local data center. They are typically connected to a Region using AWS Direct Connect or an AWS VPN. These racks reside in a particular country, state, or municipality for regulatory, contractual, or security reasons. They provide a secondary data store in the same country or geographic area, which can be independent from the Region.

Outposts offers similar AWS infrastructure, AWS services, APIs, and tools to almost any data center, co-location space, or on-premises facility for a consistent integrated experience. They can run in any facility. AWS will deliver, install, and maintain the hardware infrastructure. If you do not want to host Outposts long term, you can migrate your applications to an AWS Local Zone or a Region.

Figure 1. Outpost (open standard 42U rack)

Addressing UK GDPR requirements and maintaining workload RTO/RPO

For our use case, we worked with a healthcare organization in the UK. Their workload presented us with a couple of uncommon requirements. We had to set up a multi-Region DR strategy to comply with UK GDPR data residency requirements for PHI and PII data, and only one Region (eu-west-2 in London) was available.

Implementing a multi-Region DR strategy with AWS Outposts

To maintain business continuity in the event of a disaster, we physically installed an Outposts rack in the customer’s on-premises location in the UK. We selected the Outpost’s parent Region as another Region (Ireland). This way, the Outpost’s control plane stays accessible if the London Region is impacted by an outage.

All customer and workload data is stored locally in the UK in Outposts and provisioned by Amazon Elastic Block Store (Amazon EBS) and Amazon Simple Storage Service (Amazon S3). It is then synced between the workload Region (London Region) and the Outpost.

With the architecture shown in Figure 2, only control plane traffic such as instance health, instance activity (launched, stopped), and the underlying hypervisor system is sent back to the parent Region (Ireland). This information allows us to provide alerts on instance health and capacity and apply patches and updates to the Outpost.

Figure 2. Architecture for a multi-Region setup using a Region and Outpost

Meeting data residency requirements

In our architecture, no customer data gets transferred out of the country. Also, through Direct Connect, all IP addressing is private. No public IPs are required.

The data plane path for the Outposts local gateway travels from the Outpost, through the local gateway, and to a private local gateway LAN segment. It then follows a private path back to the AWS service endpoints in the London Region through Direct Connect.

Figure 3 shows how to connect Outposts to a local network:

The local network equipment is connected via ports provided in the Outpost’s top of rack (TOR) switches.
Virtual interfaces (VIFs) are mapped to VLANs using Link Aggregation Control Protocol (LACP).
The new local gateway (LGW) on the Outpost routes traffic to and from the local network using these VIFs.

Figure 3. How to connect Outposts to your local network

Summary

This post outlines our solution for running mission-critical workloads that need multi-Region environments and have data residency requirements. We showed you how to configure an Outpost to use a control plane in an external Region. This limits workload traffic to the country with the data residency requirement.

Our use case applies to healthcare. However, this solution is widely applicable. It can be used for other industries, such as power and utilities, that have workloads that need a multi-Region DR strategy and must meet data residency requirements. To deploy an Outpost in your facility, follow the steps in Getting Started with AWS Outposts.

Solution overview

Deploy the solution

AWS Solution setup

AWS HealthLake

Query AWS HealthLake tables with Redshift Serverless

Query AWS HealthLake data with Amazon Redshift

Set up zero-ETL integration between Amazon Aurora MySQL and Redshift Serverless

Connect to an Aurora MySQL database and set up data

Set up zero-ETL integration between Amazon Aurora MySQL and Amazon Redshift

Create a Redshift database from the integration

Set up streaming ingestion for Amazon Redshift

Query and analyze patient wearable data

Create a complete patient 360

Clean up

Conclusion

About the Authors

Audit results

Further reading

Solution overview

Prerequisites

Architecture overview

File ingestion

Data analysis

Data visualization

Create a QuickSight dataset

Performance, operational, and cost considerations

Cost

Conclusion

About the Authors

Next-generation imaging solutions and workflows

Security

Conclusion

Further reading

Challenges processing these machine-readable files

Solution overview

Prerequisites

Create an AWS Glue preprocessing job

Create an AWS Glue processing job

Query the data

Clean up

Conclusion

About the Authors

How to Stay Secure in the Cloud While Driving Innovation and Discovery

Major Challenges

Resource Strain

Balancing Priorities With Leadership

HIPAA Compliance and Protected Health Information

Experience Gap

Cloud Security Solutions and Services

The Bottom Line

InsightCloudSec

Healthcare in the crosshairs

Who is Killnet?

Why DDoS attacks?

How to prepare

Pharma-compliant environment

Solution overview

Technical architecture

Data management layer

Virtualization layer

Consumption layer

Modularity

Data management ecosystem

Decentralized architecture: From a centralized data lake to a distributed architecture

Data domains and data assets

Data domain personas and roles

Enterprise business data catalog

Data sharing

Security and audit

Conclusion

About the Authors

See you next time!

Other posts in this series

Looking for more architecture content?

EDI implementation challenges

Solution overview and architecture data flow

Handling PII/PHI data

Conclusion

Overview of AWS Outposts

Addressing UK GDPR requirements and maintaining workload RTO/RPO