Tag Archives: GDPR

Build a pseudonymization service on AWS to protect sensitive data: Part 2

Post Syndicated from Edvin Hallvaxhiu original https://aws.amazon.com/blogs/big-data/build-a-pseudonymization-service-on-aws-to-protect-sensitive-data-part-2/

Part 1 of this two-part series described how to build a pseudonymization service that converts plain text data attributes into a pseudonym or vice versa. A centralized pseudonymization service provides a unique and universally recognized architecture for generating pseudonyms. Consequently, an organization can achieve a standard process to handle sensitive data across all platforms. Additionally, this takes away any complexity and expertise needed to understand and implement various compliance requirements from development teams and analytical users, allowing them to focus on their business outcomes.

Following a decoupled service-based approach means that, as an organization, you are unbiased towards the use of any specific technologies to solve your business problems. No matter which technology is preferred by individual teams, they are able to call the pseudonymization service to pseudonymize sensitive data.

In this post, we focus on common extract, transform, and load (ETL) consumption patterns that can use the pseudonymization service. We discuss how to use the pseudonymization service in your ETL jobs on Amazon EMR (using Amazon EMR on EC2) for streaming and batch use cases. Additionally, you can find an Amazon Athena and AWS Glue based consumption pattern in the GitHub repo of the solution.

Solution overview

The following diagram describes the solution architecture.

The account on the right hosts the pseudonymization service, which you can deploy using the instructions provided in the Part 1 of this series.

The account on the left is the one that you set up as part of this post, representing the ETL platform based on Amazon EMR using the pseudonymization service.

You can deploy the pseudonymization service and the ETL platform on the same account.

Amazon EMR empowers you to create, operate, and scale big data frameworks such as Apache Spark quickly and cost-effectively.

In this solution, we show how to consume the pseudonymization service on Amazon EMR with Apache Spark for batch and streaming use cases. The batch application reads data from an Amazon Simple Storage Service (Amazon S3) bucket, and the streaming application consumes records from Amazon Kinesis Data Streams.

PySpark code used in batch and streaming jobs

Both applications use a common utility function that makes HTTP POST calls against the API Gateway that is linked to the pseudonymization AWS Lambda function. The REST API calls are made per Spark partition using the Spark RDD mapPartitions function. The POST request body contains the list of unique values for a given input column. The POST request response contains the corresponding pseudonymized values. The code swaps the sensitive values with the pseudonymized ones for a given dataset. The result is saved to Amazon S3 and the AWS Glue Data Catalog, using Apache Iceberg table format.

Iceberg is an open table format that supports ACID transactions, schema evolution, and time travel queries. You can use these features to implement the right to be forgotten (or data erasure) solutions using SQL statements or programming interfaces. Iceberg is supported by Amazon EMR starting with version 6.5.0, AWS Glue, and Athena. Batch and streaming patterns use Iceberg as their target format. For an overview of how to build an ACID compliant data lake using Iceberg, refer to Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR.

Prerequisites

You must have the following prerequisites:

  • An AWS account.
  • An AWS Identity and Access Management (IAM) principal with privileges to deploy the AWS CloudFormation stack and related resources.
  • The AWS Command Line Interface (AWS CLI) installed on the development or deployment machine that you will use to run the provided scripts.
  • An S3 bucket in the same account and AWS Region where the solution is to be deployed.
  • Python3 installed in the local machine where the commands are run.
  • PyYAML installed using pip.
  • A bash terminal to run bash scripts that deploy CloudFormation stacks.
  • An additional S3 bucket containing the input dataset in Parquet files (only for batch applications). Copy the sample dataset to the S3 bucket.
  • A copy of the latest code repository in the local machine using git clone or the download option.

Open a new bash terminal and navigate to the root folder of the cloned repository.

The source code for the proposed patterns can be found in the cloned repository. It uses the following parameters:

  • ARTEFACT_S3_BUCKET – The S3 bucket where the infrastructure code will be stored. The bucket must be created in the same account and Region where the solution lives.
  • AWS_REGION – The Region where the solution will be deployed.
  • AWS_PROFILE – The named profile that will be applied to the AWS CLI command. This should contain credentials for an IAM principal with privileges to deploy the CloudFormation stack of related resources.
  • SUBNET_ID – The subnet ID where the EMR cluster will be spun up. The subnet is pre-existing and for demonstration purposes, we use the default subnet ID of the default VPC.
  • EP_URL – The endpoint URL of the pseudonymization service. Retrieve this from the solution deployed as Part 1 of this series.
  • API_SECRET – An Amazon API Gateway key that will be stored in AWS Secrets Manager. The API key is generated from the deployment depicted in Part 1 of this series.
  • S3_INPUT_PATH – The S3 URI pointing to the folder containing the input dataset as Parquet files.
  • KINESIS_DATA_STREAM_NAMEThe Kinesis data stream name deployed with the CloudFormation stack.
  • BATCH_SIZEThe number of records to be pushed to the data stream per batch.
  • THREADS_NUM The number of parallel threads used in the local machine to upload data to the data stream. More threads correspond to a higher message volume.
  • EMR_CLUSTER_ID – The EMR cluster ID where the code will be run (the EMR cluster was created by the CloudFormation stack).
  • STACK_NAME – The name of the CloudFormation stack, which is assigned in the deployment script.

Batch deployment steps

As described in the prerequisites, before you deploy the solution, upload the Parquet files of the test dataset to Amazon S3. Then provide the S3 path of the folder containing the files as the parameter <S3_INPUT_PATH>.

We create the solution resources via AWS CloudFormation. You can deploy the solution by running the deploy_1.sh script, which is inside the deployment_scripts folder.

After the deployment prerequisites have been satisfied, enter the following command to deploy the solution:

sh ./deployment_scripts/deploy_1.sh \
-a <ARTEFACT_S3_BUCKET> \
-r <AWS_REGION> \
-p <AWS_PROFILE> \
-s <SUBNET_ID> \
-e <EP_URL> \
-x <API_SECRET> \
-i <S3_INPUT_PATH>

The output should look like the following screenshot.

The required parameters for the cleanup command are printed out at the end of the run of the deploy_1.sh script. Make sure to note down these values.

Test the batch solution

In the CloudFormation template deployed using the deploy_1.sh script, the EMR step containing the Spark batch application is added at the end of the EMR cluster setup.

To verify the results, check the S3 bucket identified in the CloudFormation stack outputs with the variable SparkOutputLocation.

You can also use Athena to query the table pseudo_table in the database blog_batch_db.

Clean up batch resources

To destroy the resources created as part of this exercise,

in a bash terminal, navigate to the root folder of the cloned repository. Enter the cleanup command shown as the output of the previously run deploy_1.sh script:

sh ./deployment_scripts/cleanup_1.sh \
-a <ARTEFACT_S3_BUCKET> \
-s <STACK_NAME> \
-r <AWS_REGION> \
-e <EMR_CLUSTER_ID>

The output should look like the following screenshot.

Streaming deployment steps

We create the solution resources via AWS CloudFormation. You can deploy the solution by running the deploy_2.sh script, which is inside the deployment_scripts folder. The CloudFormation stack template for this pattern is available in the GitHub repo.

After the deployment prerequisites have been satisfied, enter the following command to deploy the solution:

sh deployment_scripts/deploy_2.sh \
-a <ARTEFACT_S3_BUCKET> \
-r <AWS_REGION> \
-p <AWS_PROFILE> \
-s <SUBNET_ID> \
-e <EP_URL> \
-x <API_SECRET>

The output should look like the following screenshot.

The required parameters for the cleanup command are printed out at the end of the output of the deploy_2.sh script. Make sure to save these values to use later.

Test the streaming solution

In the CloudFormation template deployed using the deploy_2.sh script, the EMR step containing the Spark streaming application is added at the end of the EMR cluster setup. To test the end-to-end pipeline, you need to push records to the deployed Kinesis data stream. With the following commands in a bash terminal, you can activate a Kinesis producer that will continuously put records in the stream, until the process is manually stopped. You can control the producer’s message volume by modifying the BATCH_SIZE and the THREADS_NUM variables.

python3 -m pip install kiner
python3 \
consumption-patterns/emr/1_pyspark-streaming/kinesis_producer/producer.py \
<KINESIS_DATA_STREAM_NAME> \
<BATCH_SIZE> \
<THREADS_NUM>

To verify the results, check the S3 bucket identified in the CloudFormation stack outputs with the variable SparkOutputLocation.

In the Athena query editor, check the results by querying the table pseudo_table in the database blog_stream_db.

Clean up streaming resources

To destroy the resources created as part of this exercise, complete the following steps:

  1. Stop the Python Kinesis producer that was launched in a bash terminal in the previous section.
  2. Enter the following command:
sh ./deployment_scripts/cleanup_2.sh \
-a <ARTEFACT_S3_BUCKET> \
-s <STACK_NAME> \
-r <AWS_REGION> \
-e <EMR_CLUSTER_ID>

The output should look like the following screenshot.

Performance details

Use cases might differ in requirements with respect to data size, compute capacity, and cost. We have provided some benchmarking and factors that may influence performance; however, we strongly advise you to validate the solution in lower environments to see if it meets your particular requirements.

You can influence the performance of the proposed solution (which aims to pseudonymize a dataset using Amazon EMR) by the maximum number of parallel calls to the pseudonymization service and the payload size for each call. In terms of parallel calls, factors to consider are the GetSecretValue calls limit from Secrets Manager (10.000 per second, hard limit) and the Lambda default concurrency parallelism (1,000 by default; can be increased by quota request). You can control the maximum parallelism adjusting the number of executors, the number of partitions composing the dataset, and the cluster configuration (number and type of nodes). In terms of payload size for each call, factors to consider are the API Gateway maximum payload size (6 MB) and the Lambda function maximum runtime (15 minutes). You can control the payload size and the Lambda function runtime by adjusting the batch size value, which is a parameter of the PySpark script that determines the number of items to be pseudonymized per each API call. To capture the influence of all these factors and assess the performance of the consumption patterns using Amazon EMR, we have designed and monitored the following scenarios.

Batch consumption pattern performance

To assess the performance for the batch consumption pattern, we ran the pseudonymization application with three input datasets composed of 1, 10, and 100 Parquet files of 97.7 MB each. We generated the input files using the dataset_generator.py script.

The cluster capacity nodes were 1 primary (m5.4xlarge) and 15 core (m5d.8xlarge). This cluster configuration remained the same for all three scenarios, and it allowed the Spark application to use up to 100 executors. The batch_size, which was also the same for the three scenarios, was set to 900 VINs per API call, and the maximum VIN size was 5 bytes.

The following table captures the information of the three scenarios.

Execution ID Repartition Dataset Size Number of Executors Cores per Executor Executor Memory Runtime
A 800 9.53 GB 100 4 4 GiB 11 minutes, 10 seconds
B 80 0.95 GB 10 4 4 GiB 8 minutes, 36 seconds
C 8 0.09 GB 1 4 4 GiB 7 minutes, 56 seconds

As we can see, properly parallelizing the calls to our pseudonymization service enables us to control the overall runtime.

In the following examples, we analyze three important Lambda metrics for the pseudonymization service: Invocations, ConcurrentExecutions, and Duration.

The following graph depicts the Invocations metric, with the statistic SUM in orange and RUNNING SUM in blue.

By calculating the difference between the starting and ending point of the cumulative invocations, we can extract how many invocations were made during each run.

Run ID Dataset Size Total Invocations
A 9.53 GB 1.467.000 – 0 = 1.467.000
B 0.95 GB 1.467.000 – 1.616.500 = 149.500
C 0.09 GB 1.616.500 – 1.631.000 = 14.500

As expected, the number of invocations increases proportionally by 10 with the dataset size.

The following graph depicts the total ConcurrentExecutions metric, with the statistic MAX in blue.

The application is designed such that the maximum number of concurrent Lambda function runs is given by the amount of Spark tasks (Spark dataset partitions), which can be processed in parallel. This number can be calculated as MIN (executors x executor_cores, Spark dataset partitions).

In the test, run A processed 800 partitions, using 100 executors with four cores each. This makes 400 tasks processed in parallel so the Lambda function concurrent runs can’t be above 400. The same logic was applied for runs B and C. We can see this reflected in the preceding graph, where the amount of concurrent runs never surpasses the 400, 40, and 4 values.

To avoid throttling, make sure that the amount of Spark tasks that can be processed in parallel is not above the Lambda function concurrency limit. If that is the case, you should either increase the Lambda function concurrency limit (if you want to keep up the performance) or reduce either the amount of partitions or the number of available executors (impacting the application performance).

The following graph depicts the Lambda Duration metric, with the statistic AVG in orange and MAX in green.

As expected, the size of the dataset doesn’t affect the duration of the pseudonymization function run, which, apart from some initial invocations facing cold starts, remains constant to an average of 3 milliseconds throughout the three scenarios. This because the maximum number of records included in each pseudonymization call is constant (batch_size value).

Lambda is billed based on the number of invocations and the time it takes for your code to run (duration). You can use the average duration and invocations metrics to estimate the cost of the pseudonymization service.

Streaming consumption pattern performance

To assess the performance for the streaming consumption pattern, we ran the producer.py script, which defines a Kinesis data producer that pushes records in batches to the Kinesis data stream.

The streaming application was left running for 15 minutes and it was configured with a batch_interval of 1 minute, which is the time interval at which streaming data will be divided into batches. The following table summarizes the relevant factors.

Repartition Cluster Capacity Nodes Number of Executors Executor’s Memory Batch Window Batch Size VIN Size
17

1 Primary (m5.xlarge),

3 Core (m5.2xlarge)

6 9 GiB 60 seconds 900 VINs/API call. 5 Bytes / VIN

The following graphs depict the Kinesis Data Streams metrics PutRecords (in blue) and GetRecords (in orange) aggregated with 1-minute period and using the statistic SUM. The first graph shows the metric in bytes, which peaks 6.8 MB per minute. The second graph shows the metric in record count peaking at 85,000 records per minute.

We can see that the metrics GetRecords and PutRecords have overlapping values for almost the entire application’s run. This means that the streaming application was able to keep up with the load of the stream.

Next, we analyze the relevant Lambda metrics for the pseudonymization service: Invocations, ConcurrentExecutions, and Duration.

The following graph depicts the Invocations metric, with the statistic SUM (in orange) and RUNNING SUM in blue.

By calculating the difference between the starting and ending point of the cumulative invocations, we can extract how many invocations were made during the run. In specific, in 15 minutes, the streaming application invoked the pseudonymization API 977 times, which is around 65 calls per minute.

The following graph depicts the total ConcurrentExecutions metric, with the statistic MAX in blue.

The repartition and the cluster configuration allow the application to process all Spark RDD partitions in parallel. As a result, the concurrent runs of the Lambda function are always equal to or below the repartition number, which is 17.

To avoid throttling, make sure that the amount of Spark tasks that can be processed in parallel is not above the Lambda function concurrency limit. For this aspect, the same suggestions as for the batch use case are valid.

The following graph depicts the Lambda Duration metric, with the statistic AVG in blue and MAX in orange.

As expected, aside the Lambda function’s cold start, the average duration of the pseudonymization function was more or less constant throughout the run. This because the batch_size value, which defines the number of VINs to pseudonymize per call, was set to and remained constant at 900.

The ingestion rate of the Kinesis data stream and the consumption rate of our streaming application are factors that influence the number of API calls made against the pseudonymization service and therefore the related cost.

The following graph depicts the Lambda Invocations metric, with the statistic SUM in orange, and the Kinesis Data Streams GetRecords.Records metric, with the statistic SUM in blue. We can see that there is correlation between the amount of records retrieved from the stream per minute and the amount of Lambda function invocations, thereby impacting the cost of the streaming run.

In addition to the batch_interval, we can control the streaming application’s consumption rate using Spark streaming properties like spark.streaming.receiver.maxRate and spark.streaming.blockInterval. For more details, refer to Spark Streaming + Kinesis Integration and Spark Streaming Programming Guide.

Conclusion

Navigating through the rules and regulations of data privacy laws can be difficult. Pseudonymization of PII attributes is one of many points to consider while handling sensitive data.

In this two-part series, we explored how you can build and consume a pseudonymization service using various AWS services with features to assist you in building a robust data platform. In Part 1, we built the foundation by showing how to build a pseudonymization service. In this post, we showcased the various patterns to consume the pseudonymization service in a cost-efficient and performant manner. Check out the GitHub repository for additional consumption patterns.


About the Authors

Edvin Hallvaxhiu is a Senior Global Security Architect with AWS Professional Services and is passionate about cybersecurity and automation. He helps customers build secure and compliant solutions in the cloud. Outside work, he likes traveling and sports.

Rahul Shaurya is a Principal Big Data Architect with AWS Professional Services. He helps and works closely with customers building data platforms and analytical applications on AWS. Outside of work, Rahul loves taking long walks with his dog Barney.

Andrea Montanari is a Senior Big Data Architect with AWS Professional Services. He actively supports customers and partners in building analytics solutions at scale on AWS.

María Guerra is a Big Data Architect with AWS Professional Services. Maria has a background in data analytics and mechanical engineering. She helps customers architecting and developing data related workloads in the cloud.

Pushpraj Singh is a Senior Data Architect with AWS Professional Services. He is passionate about Data and DevOps engineering. He helps customers build data driven applications at scale.

Reflecting on the GDPR to celebrate Privacy Day 2024

Post Syndicated from Emily Hancock http://blog.cloudflare.com/author/emily-hancock/ original https://blog.cloudflare.com/reflecting-on-the-gdpr-to-celebrate-privacy-day-2024


Just in time for Data Privacy Day 2024 on January 28, the EU Commission is calling for evidence to understand how the EU’s General Data Protection Regulation (GDPR) has been functioning now that we’re nearing the 6th anniversary of the regulation coming into force.

We’re so glad they asked, because we have some thoughts. And what better way to celebrate privacy day than by discussing whether the application of the GDPR has actually done anything to improve people’s privacy?

The answer is, mostly yes, but in a couple of significant ways – no.

Overall, the GDPR is rightly seen as the global gold standard for privacy protection. It has served as a model for what data protection practices should look like globally, it enshrines data subject rights that have been copied across jurisdictions, and when it took effect, it created a standard for the kinds of privacy protections people worldwide should be able to expect and demand from the entities that handle their personal data. On balance, the GDPR has definitely moved the needle in the right direction for giving people more control over their personal data and in protecting their privacy.

In a couple of key areas, however, we believe the way the GDPR has been applied to data flowing across the Internet has done nothing for privacy and in fact may even jeopardize the protection of personal data. The first area where we see this is with respect to cross-border data transfers. Location has become a proxy for privacy in the minds of many EU data protection regulators, and we think that is the wrong result. The second area is an overly broad interpretation of what constitutes “personal data” by some regulators with respect to Internet Protocol or “IP” addresses. We contend that IP addresses should not always count as personal data, especially when the entities handling IP addresses have no ability on their own to tie those IP addresses to individuals. This is important because the ability to implement a number of industry-leading cybersecurity measures relies on the ability to do threat intelligence on Internet traffic metadata, including IP addresses.  

Location should not be a proxy for privacy

Fundamentally, good data security and privacy practices should be able to protect personal data regardless of where that processing or storage occurs. Nevertheless, the GDPR is based on the idea that legal protections should attach to personal data based on the location of the data – where it is generated, processed, or stored. Articles 44 to 49 establish the conditions that must be in place in order for data to be transferred to a jurisdiction outside the EU, with the idea that even if the data is in a different location, the privacy protections established by the GDPR should follow the data. No doubt this approach was influenced by political developments around government surveillance practices, such as the revelations in 2013 of secret documents describing the relationship between the US NSA (and its Five Eyes partners) and large Internet companies, and that intelligence agencies were scooping up data from choke points on the Internet. And once the GDPR took effect, many data regulators in the EU were of the view that as a result of the GDPR’s restrictions on cross-border data transfers, European personal data simply could not be processed in the United States in a way that would be consistent with the GDPR.

This issue came to a head in July 2020, when the European Court of Justice (CJEU), in its “Schrems II” decision1, invalidated the EU-US Privacy Shield adequacy standard and questioned the suitability of the EU standard contractual clauses (a mechanism entities can use to ensure that GDPR protections are applied to EU personal data even if it is processed outside the EU). The ruling in some respects left data protection regulators with little room to maneuver on questions of transatlantic data flows. But while some regulators were able to view the Schrems II ruling in a way that would still allow for EU personal data to be processed in the United States, other data protection regulators saw the decision as an opportunity to double down on their view that EU personal data cannot be processed in the US consistent with the GDPR, therefore promoting the misconception that data localization should be a proxy for data protection.

In fact, we would argue that the opposite is the case. From our own experience and according to recent research2, we know that data localization threatens an organization’s ability to achieve integrated management of cybersecurity risk and limits an entity’s ability to employ state-of-the-art cybersecurity measures that rely on cross-border data transfers to make them as effective as possible. For example, Cloudflare’s Bot Management product only increases in accuracy with continued use on the global network: it detects and blocks traffic coming from likely bots before feeding back learnings to the models backing the product. A diversity of signal and scale of data on a global platform is critical to help us continue to evolve our bot detection tools. If the Internet were fragmented – preventing data from one jurisdiction being used in another – more and more signals would be missed. We wouldn’t be able to apply learnings from bot trends in Asia to bot mitigation efforts in Europe, for example. And if the ability to identify bot traffic is hampered, so is the ability to block those harmful bots from services that process personal data.

The need for industry-leading cybersecurity measures is self-evident, and it is not as if data protection authorities don’t realize this. If you look at any enforcement action brought against an entity that suffered a data breach, you see data protection regulators insisting that the impacted entities implement ever more robust cybersecurity measures in line with the obligation GDPR Article 32 places on data controllers and processors to “develop appropriate technical and organizational measures to ensure a level of security appropriate to the risk”, “taking into account the state of the art”. In addition, data localization undermines information sharing within industry and with government agencies for cybersecurity purposes, which is generally recognized as vital to effective cybersecurity.

In this way, while the GDPR itself lays out a solid framework for securing personal data to ensure its privacy, the application of the GDPR’s cross-border data transfer provisions has twisted and contorted the purpose of the GDPR. It’s a classic example of not being able to see the forest for the trees. If the GDPR is applied in such a way as to elevate the priority of data localization over the priority of keeping data private and secure, then the protection of ordinary people’s data suffers.

Applying data transfer rules to IP addresses could lead to balkanization of the Internet

The other key way in which the application of the GDPR has been detrimental to the actual privacy of personal data is related to the way the term “personal data” has been defined in the Internet context – specifically with respect to Internet Protocol or “IP” addresses. A world where IP addresses are always treated as personal data and therefore subject to the GDPR’s data transfer rules is a world that could come perilously close to requiring a walled-off European Internet. And as noted above, this could have serious consequences for data privacy, not to mention that it likely would cut the EU off from any number of global marketplaces, information exchanges, and social media platforms.

This is a bit of a complicated argument, so let’s break it down. As most of us know, IP addresses are the addressing system for the Internet. When you send a request to a website, send an email, or communicate online in any way, IP addresses connect your request to the destination you’re trying to access. These IP addresses are the key to making sure Internet traffic gets delivered to where it needs to go. As the Internet is a global network, this means it’s entirely possible that Internet traffic – which necessarily contains IP addresses – will cross national borders. Indeed, the destination you are trying to access may well be located in a different jurisdiction altogether. That’s just the way the global Internet works. So far, so good.

But if IP addresses are considered personal data, then they are subject to data transfer restrictions under the GDPR. And with the way those provisions have been applied in recent years, some data regulators were getting perilously close to saying that IP addresses cannot transit jurisdictional boundaries if it meant the data might go to the US. The EU’s recent approval of the EU-US Data Privacy Framework established adequacy for US entities that certify to the framework, so these cross-border data transfers are not currently an issue. But if the Data Privacy Framework were to be invalidated as the EU-US Privacy Shield was in the Schrems II decision, then we could find ourselves in a place where the GDPR is applied to mean that IP addresses ostensibly linked to EU residents can’t be processed in the US, or potentially not even leave the EU.

If this were the case, then providers would have to start developing Europe-only networks to ensure IP addresses never cross jurisdictional boundaries. But how would people in the EU and US communicate if EU IP addresses can’t go to the US? Would EU citizens be restricted from accessing content stored in the US? It’s an application of the GDPR that would lead to the absurd result – one surely not intended by its drafters. And yet, in light of the Schrems II case and the way the GDPR has been applied, here we are.

A possible solution would be to consider that IP addresses are not always “personal data” subject to the GDPR. In 2016 – even before the GDPR took effect – the Court of Justice of the European Union (CJEU) established the view in Breyer v. Bundesrepublik Deutschland that even dynamic IP addresses, which change with every new connection to the Internet, constituted personal data if an entity processing the IP address could link the IP addresses to an individual. While the court’s decision did not say that dynamic IP addresses are always personal data under European data protection law, that’s exactly what EU data regulators took from the decision, without considering whether an entity actually has a way to tie the IP address to a real person3.

The question of when an identifier qualifies as “personal data” is again before the CJEU: In April 2023, the lower EU General Court ruled in SRB v EDPS4 that transmitted data can be considered anonymised and therefore not personal data if the data recipient does not have any additional information reasonably likely to allow it to re-identify the data subjects and has no legal means available to access such information. The appellant – the European Data Protection Supervisor (EDPS) – disagrees. The EDPS, who mainly oversees the privacy compliance of EU institutions and bodies, is appealing the decision and arguing that a unique identifier should qualify as personal data if that identifier could ever be linked to an individual, regardless of whether the entity holding the identifier actually had the means to make such a link.

If the lower court’s common-sense ruling holds, one could argue that IP addresses are not personal data when those IP addresses are processed by entities like Cloudflare, which have no means of connecting an IP address to an individual. If IP addresses are then not always personal data, then IP addresses will not always be subject to the GDPR’s rules on cross-border data transfers.

Although it may seem counterintuitive, having a standard whereby an IP address is not necessarily “personal data” would actually be a positive development for privacy. If IP addresses can flow freely across the Internet, then entities in the EU can use non-EU cybersecurity providers to help them secure their personal data. Advanced Machine Learning/predictive AI techniques that look at IP addresses to protect against DDoS attacks, prevent bots, or otherwise guard against personal data breaches will be able to draw on attack patterns and threat intelligence from around the world to the benefit of EU entities and residents. But none of these benefits can be realized in a world where IP addresses are always personal data under the GDPR and where the GDPR’s data transfer rules are interpreted to mean IP addresses linked to EU residents can never flow to the United States.

Keeping privacy in focus

On this Data Privacy Day, we urge EU policy makers to look closely at how the GDPR is working in practice, and to take note of the instances where the GDPR is applied in ways that place privacy protections above all other considerations – even appropriate security measures mandated by the GDPR’s Article 32 that take into account the state of the art of technology. When this happens, it can actually be detrimental to privacy. If taken to the extreme, this formulaic approach would not only negatively impact cybersecurity and data protection, but even put into question the functioning of the global Internet infrastructure as a whole, which depends on cross-border data flows. So what can be done to avert this?

First, we believe EU policymakers could adopt guidelines (if not legal clarification) for regulators that IP addresses should not be considered personal data when they cannot be linked by an entity to a real person. Second, policymakers should clarify that the GDPR’s application should be considered with the cybersecurity benefits of data processing in mind. Building on the GDPR’s existing recital 49, which rightly recognizes cybersecurity as a legitimate interest for processing, personal data that needs to be processed outside the EU for cybersecurity purposes should be exempted from GDPR restrictions to international data transfers. This would avoid some of the worst effects of the mindset that currently views data localization as a proxy for data privacy. Such a shift would be a truly pro-privacy application of the GDPR.

1 Case C-311/18, Data Protection Commissioner v Facebook Ireland and Maximillian Schrems.
2 Swire, Peter and Kennedy-Mayo, DeBrae and Bagley, Andrew and Modak, Avani and Krasser, Sven and Bausewein, Christoph, Risks to Cybersecurity from Data Localization, Organized by Techniques, Tactics, and Procedures (2023).
3 Different decisions by the European data protection authorities, namely the Austrian DSB (December 2021), the French CNIL (February 2022) and the Italian Garante (June 2022), while analyzing the use of Google Analytics, have rejected the relative approach used by the Breyer case and considered that an IP address should always be considered as personal data. Only the decision issued by the Spanish AEPD (December 2022) followed the same interpretation of the Breyer case. In addition, see paragraphs 109 and 136 in Guidelines by Supervisory Authorities for Tele-Media Providers, DSK (2021).
4 Single Resolution Board v EDPS, Court of Justice of the European Union, April 2023.

CISPE Code of Conduct Public Register now has 107 compliant AWS services

Post Syndicated from Gokhan Akyuz original https://aws.amazon.com/blogs/security/cispe-code-of-conduct-public-register-now-has-107-compliant-aws-services/

We continue to expand the scope of our assurance programs at Amazon Web Services (AWS) and are pleased to announce that 107 services are now certified as compliant with the Cloud Infrastructure Services Providers in Europe (CISPE) Data Protection Code of Conduct. This alignment with the CISPE requirements demonstrates our ongoing commitment to adhere to the heightened expectations for data protection by cloud service providers. AWS customers who use AWS certified services can be confident that their data is processed in adherence with the European Union’s General Data Protection Regulation (GDPR).

The CISPE Code of Conduct is the first pan-European, sector-specific code for cloud infrastructure service providers, which received a favorable opinion that it complies with the GDPR. It helps organizations across Europe accelerate the development of GDPR compliant, cloud-based services for consumers, businesses, and institutions.

The accredited monitoring body EY CertifyPoint evaluated AWS on January 26, 2023, and successfully audited 100 certified services. AWS added seven additional services to the current scope in June 2023. As of the date of this blog post, 107 services are in scope of this certification. The Certificate of Compliance that illustrates AWS compliance status is available on the CISPE Public Register. For up-to-date information, including when additional services are added, search the CISPE Public Register by entering AWS as the Seller of Record; or see the AWS CISPE page.

AWS strives to bring additional services into the scope of its compliance programs to help you meet your architectural and regulatory needs. If you have questions or feedback about AWS compliance with CISPE, reach out to your AWS account team.

To learn more about our compliance and security programs, see AWS Compliance Programs, AWS General Data Protection Regulation (GDPR) Center, and the EU data protection section of the AWS Cloud Security site. As always, we value your feedback and questions; reach out to the AWS Compliance team through the Contact Us page.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Gokhan Akyuz

Gokhan Akyuz

Gokhan is a Security Audit Program Manager at AWS based in Amsterdam, Netherlands. He leads security audits, attestations, and certification programs across Europe and the Middle East. Gokhan has more than 15 years of experience in IT and cybersecurity audits, and controls implementation in a wide range of industries.

New Global AWS Data Processing Addendum

Post Syndicated from Chad Woolf original https://aws.amazon.com/blogs/security/new-global-aws-data-processing-addendum/

Navigating data protection laws around the world is no simple task. Today, I’m pleased to announce that AWS is expanding the scope of the AWS Data Processing Addendum (Global AWS DPA) so that it applies globally whenever customers use AWS services to process personal data, regardless of which data protection laws apply to that processing. AWS is proud to be the first major cloud provider to adopt this global approach to help you meet your compliance needs for data protection.

The Global AWS DPA is designed to help you satisfy requirements under data protection laws worldwide, without having to create separate country-specific data processing agreements for every location where you use AWS services. By introducing this global, one-stop addendum, we are simplifying contracting procedures and helping to reduce the time that you spend assessing contractual data privacy requirements on a country-by-country basis.

If you have signed a copy of the previous AWS General Data Protection Regulation (GDPR) DPA, then you do not need to take any action and can continue to rely on that addendum to satisfy data processing requirements. AWS is always innovating to help you meet your compliance obligations wherever you operate. We’re confident that this expanded Global AWS DPA will help you on your journey. If you have questions or need more information, see Data Protection & Privacy at AWS and GDPR Center.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Chad Woolf

Chad joined Amazon in 2010 and built the AWS compliance functions from the ground up, including audit and certifications, privacy, contract compliance, control automation engineering and security process monitoring. Chad’s work also includes enabling public sector and regulated industry adoption of the AWS cloud and leads the AWS trade and product compliance team.

Helping protect personal information in the cloud, all across the world

Post Syndicated from Rory Malone original https://blog.cloudflare.com/cloudflare-official-gdpr-code-of-conduct/

Helping protect personal information in the cloud, all across the world

Helping protect personal information in the cloud, all across the world

Cloudflare has achieved a new EU Cloud Code of Conduct privacy validation, demonstrating GDPR compliance to strengthen trust in cloud services

Internet privacy laws around the globe differ, and in recent years there’s been much written about cross-border data transfers. Many regulations require adequate protections to be in place before personal information flows around the world, as with the European General Data Protection Regulation (GDPR). The law rightly sets a high bar for how organizations must carefully handle personal information, and in drafting the regulation lawmakers anticipated personal data crossing-borders: Chapter V of the regulation covers those transfers specifically.

Whilst transparency on where personal information is stored is important, it’s also critically important how personal information is handled, and how it is kept safe and secure. At Cloudflare, we believe in protecting the privacy of personal information across the world, and we give our customers the tools and the choice on how and where to process their data. Put simply, we require that data is handled and protected in the same, secure, and careful way, whether our customers choose to transfer data across the world, or for it to remain in one country.

And today we are proud to announce that we have successfully completed our assessment journey and received the EU Cloud Code of Conduct compliance mark as a demonstration of our compliance with the GDPR, protecting personal data in the cloud, all across the world.

It matters how personal information is handled – not just where in the world it is saved

The same GDPR lawmakers also anticipated that organizations would want to handle and protect personal information in a consistent, transparent, and safe way too. Article 40, called ‘Codes of Conduct’ starts:

“The Member States, the supervisory authorities, the Board and the Commission shall encourage the drawing up of codes of conduct intended to contribute to the proper application of this Regulation, taking account of the specific features of the various processing sectors and the specific needs of micro, small and medium-sized enterprises.”

Using codes of conduct to demonstrate compliance with privacy law has a longer history, too. Like the GDPR, the pioneering 1995 EU Data Protection Directive, officially Directive 95/46/EC, also included provision for draft community codes to be submitted to national authorities, and for those codes to be formally approved by an official body of the European Union.

An official GDPR Code of Conduct

It took a full five years after the GDPR was adopted in 2016 for the first code of conduct to be officially approved. Finally in May 2021, the European Data Protection Board, a group composed of representatives of all the national data protection authorities across the union, approved the “EU Data Protection Code of Conduct for Cloud Service Providers” – the EU Cloud Code of Conduct (or ‘EU Cloud CoC’ for short) as the first official GDPR code of conduct. The EU Cloud CoC was brought to the board by the Belgian supervisory authority on behalf of SCOPE Europe, the organization who collaborated to develop the code over a number of years, including with input from the European Commission, members of the cloud computing community, and European data protection authorities.

The code is a framework for buyers and providers of cloud services. Buyers can understand in a straightforward way how a provider of cloud services will handle personal information. Providers of cloud services undergo an independent assessment to demonstrate to the buyers of their cloud services that they will handle personal information in a safe and codified way. In the case of the EU Cloud CoC and only because the code has received formal approval, buyers of cloud services compliant with code will know that the cloud provider handled customer personal information in a way that is compliant with the GDPR.

What the code covers

The code defines clear requirements for providers of cloud services to implement Article 28 of the GDPR (“Processor”) and related articles. The framework covers data protection policies, as well as technical and organizational security measures. There are sections covering providers’ terms and conditions, confidentiality and recordkeeping, the audit rights of the customer, how to handle potential data breaches, and how the provider approaches subprocessing – when a third-party is subcontracted to process personal data alongside the main data processor – and more.

The framework also covers how personal data may be legitimately transferred internationally, although whilst the EU Cloud CoC covers ensuring this is done in a legally-compliant way, the code itself is not a ‘safeguard’ or a tool for third country transfers. A future update to the code may expand into that with an additional module, but as of March 2023 that is still under development.

Let us do a deeper dive into some of the requirements of the EU Cloud CoC and how it can demonstrate compliance with the GDPR

Example one
One requirement in the code is to have documented procedures in place to assist customers with their ‘data protection impact assessments’. According to the GDPR, these are:

“…an assessment of the impact of the envisaged processing operations
on the protection of personal data.” – Article 35.1, GDPR

So a cloud service provider should have a written process in place to support customers as they undertake their own assessments. In supporting the customer, the service provider is demonstrating their commitment to the rigorous data protection standards of the GDPR too. Cloudflare meets this requirement, and further supports transparency by publishing details of sub-processors used in the processing of personal data, and directing customers to audit reports available in the Cloudflare dashboard.

There’s also another reference in the GDPR to codes of conduct in the context of data protection impact assessments too:

Compliance with approved codes of conduct… shall be taken into due account in assessing the impact of the processing operations performed… in particular for the purposes of a data protection impact assessment.” – Article 35.8, GDPR

So when preparing an impact assessment, a cloud customer shall take into account that a service provider complies with an approved code of conduct. Another way that both customers and cloud providers benefit from using codes of conduct!

Example two
Another example of a requirement of the code is that when cloud service providers provide encryption capabilities, they shall be implemented effectively. The requirement clarifies further that this should be undertaken by following strong and trusted encryption techniques, by taking into account the state-of-the-art, and by adequately preventing abusive access to customer personal data. Encryption is critical to protecting personal data in the cloud; without encryption, or with weakened or outdated encryption, privacy and security are not possible. So in using and reviewing encryption appropriately, cloud services providers help meet the requirements of the GDPR in protecting their customers’ personal data.

At Cloudflare, we are particularly proud of our track record: we make effective encryption available, for free, to all our customers. We help our customers understand encryption, and most importantly, we use strong and trusted encryption algorithms and techniques ourselves to protect customer personal data. We have a formal Research Team, including academic researchers and cryptographers who design and deploy state-of-the-art encryption protocols designed to provide effective protection against active and passive attacks, including with resources known to be available to public authorities; and we use trustworthy public-key certification authorities and infrastructure. Most recently this month, we announced that post-quantum crypto should be free, and so we are including it for free, forever.

More information
The code contains requirements described in 87 statements, called controls. You can find more about the EU Cloud CoC, download a full copy of the code, and keep up to date with news at https://eucoc.cloud/en/home

Why this matters to Cloudflare customers

Cloudflare joined the EU Cloud Code of Conduct’s General Assembly last May. Members of the General Assembly undertake an assessment journey which includes declaring named cloud services compliant with the EU Cloud Code, and after completing an independent assessment process by SCOPE Europe, the accredited monitoring body, receive the EU Cloud Code of Conduct compliance mark.

Cloudflare has completed the assessment process and been verified for 47 cloud services.

Cloudflare services that are in scope for EU Cloud Code of Conduct:

Helping protect personal information in the cloud, all across the world
EU Cloud CoC Verification-ID: 2023LVL02SCOPE4316.

Services are verified compliant with the EU Cloud Code of Conduct,
Verification-ID: 2023LVL02SCOPE4316.
For further information please visit https://eucoc.cloud/en/public-register

And we’re not done yet…

The EU Cloud Code of Conduct is the newest privacy validation to add to our growing list of privacy certifications. Two years ago, Cloudflare was one of the first organisations in our industry to have received the new ISO privacy certification, ISO/IEC 27701:2019, and the first Internet performance & security company to be certified to it. Last year, Cloudflare certified to a second international privacy standard related to the processing of personal data, ISO/IEC 27018:2019. Most recently, in January this year Cloudflare completed our annual ISO audit with third-party auditor Schellman; and our new certificate, covering ISO 27001:2013, ISO 27018:2019, and ISO 27701:2019 is now available for customers to download from the Cloudflare dashboard.

And there’s more to come! As we blogged about in January for Data Privacy Day, we’re following the progress of the emerging Global Cross Border Privacy Rules (CBPR) certification with interest. This proposed single global certification could suffice for participating companies to safely transfer personal data between participating countries worldwide, and having already been supported by several governments from North America and Asia, looks very promising in this regard.

Cloudflare certifications

Find out how existing customers may download a copy of Cloudflare’s certifications and reports from the Cloudflare dashboard; new customers may also request these from your sales representative.

For the latest information about our certifications and reports, please visit our Trust Hub.

Facebook Fined $276M under GDPR

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/11/facebook-fined-276m-under-gdpr.html

Facebook—Meta—was just fined $276 million (USD) for a data leak that included full names, birth dates, phone numbers, and location.

Meta’s total fine by the Data Protection Commission is over $700 million. Total GDPR fines are over €2 billion (EUR) since 2018.

Build a pseudonymization service on AWS to protect sensitive data, part 1

Post Syndicated from Rahul Shaurya original https://aws.amazon.com/blogs/big-data/part-1-build-a-pseudonymization-service-on-aws-to-protect-sensitive-data/

According to an article in MIT Sloan Management Review, 9 out of 10 companies believe their industry will be digitally disrupted. In order to fuel the digital disruption, companies are eager to gather as much data as possible. Given the importance of this new asset, lawmakers are keen to protect the privacy of individuals and prevent any misuse. Organizations often face challenges as they aim to comply with data privacy regulations like Europe’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). These regulations demand strict access controls to protect sensitive personal data.

This is a two-part post. In part 1, we walk through a solution that uses a microservice-based approach to enable fast and cost-effective pseudonymization of attributes in datasets. The solution uses the AES-GCM-SIV algorithm to pseudonymize sensitive data. In part 2, we will walk through useful patterns for dealing with data protection for varying degrees of data volume, velocity, and variety using Amazon EMR, AWS Glue, and Amazon Athena.

Data privacy and data protection basics

Before diving into the solution architecture, let’s look at some of the basics of data privacy and data protection. Data privacy refers to the handling of personal information and how data should be handled based on its relative importance, consent, data collection, and regulatory compliance. Depending on your regional privacy laws, the terminology and definition in scope of personal information may differ. For example, privacy laws in the United States use personally identifiable information (PII) in their terminology, whereas GDPR in the European Union refers to it as personal data. Techgdpr explains in detail the difference between the two. Through the rest of the post, we use PII and personal data interchangeably.

Data anonymization and pseudonymization can potentially be used to implement data privacy to protect both PII and personal data and still allow organizations to legitimately use the data.

Anonymization vs. pseudonymization

Anonymization refers to a technique of data processing that aims to irreversibly remove PII from a dataset. The dataset is considered anonymized if it can’t be used to directly or indirectly identify an individual.

Pseudonymization is a data sanitization procedure by which PII fields within a data record are replaced by artificial identifiers. A single pseudonym for each replaced field or collection of replaced fields makes the data record less identifiable while remaining suitable for data analysis and data processing. This technique is especially useful because it protects your PII data at record level for analytical purposes such as business intelligence, big data, or machine learning use cases.

The main difference between anonymization and pseudonymization is that the pseudonymized data is reversible (re-identifiable) to authorized users and is still considered personal data.

Solution overview

The following architecture diagram provides an overview of the solution.

Solution overview

This architecture contains two separate accounts:

  • Central pseudonymization service: Account 111111111111 – The pseudonymization service is running in its own dedicated AWS account (right). This is a centrally managed pseudonymization API that provides access to two resources for pseudonymization and reidentification. With this architecture, you can apply authentication, authorization, rate limiting, and other API management tasks in one place. For this solution, we’re using API keys to authenticate and authorize consumers.
  • Compute: Account 222222222222 – The account on the left is referred to as the compute account, where the extract, transform, and load (ETL) workloads are running. This account depicts a consumer of the pseudonymization microservice. The account hosts the various consumer patterns depicted in the architecture diagram. These solutions are covered in detail in part 2 of this series.

The pseudonymization service is built using AWS Lambda and Amazon API Gateway. Lambda enables the serverless microservice features, and API Gateway provides serverless APIs for HTTP or RESTful and WebSocket communication.

We create the solution resources via AWS CloudFormation. The CloudFormation stack template and the source code for the Lambda function are available in GitHub Repository.

We walk you through the following steps:

  1. Deploy the solution resources with AWS CloudFormation.
  2. Generate encryption keys and persist them in AWS Secrets Manager.
  3. Test the service.

Demystifying the pseudonymization service

Pseudonymization logic is written in Java and uses the AES-GCM-SIV algorithm developed by codahale. The source code is hosted in a Lambda function. Secret keys are stored securely in Secrets Manager. AWS Key Management System (AWS KMS) makes sure that secrets and sensitive components are protected at rest. The service is exposed to consumers via API Gateway as a REST API. Consumers are authenticated and authorized to consume the API via API keys. The pseudonymization service is technology agnostic and can be adopted by any form of consumer as long as they’re able to consume REST APIs.

As depicted in the following figure, the API consists of two resources with the POST method:

API Resources

  • Pseudonymization – The pseudonymization resource can be used by authorized users to pseudonymize a given list of plaintexts (identifiers) and replace them with a pseudonym.
  • Reidentification – The reidentification resource can be used by authorized users to convert pseudonyms to plaintexts (identifiers).

The request response model of the API utilizes Java string arrays to store multiple values in a single variable, as depicted in the following code.

Request/Response model

The API supports a Boolean type query parameter to decide whether encryption is deterministic or probabilistic.

The implementation of the algorithm has been modified to add the logic to generate a nonce, which is dependent on the plaintext being pseudonymized. If the incoming query parameters key deterministic has the value True, then the overloaded version of the encrypt function is called. This generates a nonce using the HmacSHA256 function on the plaintext, and takes 12 sub-bytes from a predetermined position for nonce. This nonce is then used for the encryption and prepended to the resulting ciphertext. The following is an example:

  • IdentifierVIN98765432101234
  • NonceNjcxMDVjMmQ5OTE5
  • PseudonymNjcxMDVjMmQ5OTE5q44vuub5QD4WH3vz1Jj26ZMcVGS+XB9kDpxp/tMinfd9

This approach is useful especially for building analytical systems that may require PII fields to be used for joining datasets with other pseudonymized datasets.

The following code shows an example of deterministic encryption.Deterministic Encryption

If the incoming query parameters key deterministic has the value False, then the encrypt method is called without the deterministic parameter and the nonce generated is a random 12 bytes. This generates a different ciphertext for the same incoming plaintext.

The following code shows an example of probabilistic encryption.

Probabilistic Encryption

The Lambda function utilizes a couple of caching mechanisms to boost the performance of the function. It uses Guava to build a cache to avoid generation of the pseudonym or identifier if it’s already available in the cache. For the probabilistic approach, the cache isn’t utilized. It also uses SecretCache, an in-memory cache for secrets requested from Secrets Manager.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Deploy the solution resources with AWS CloudFormation

The deployment is triggered by running the deploy.sh script. The script runs the following phases:

  1. Checks for dependencies.
  2. Builds the Lambda package.
  3. Builds the CloudFormation stack.
  4. Deploys the CloudFormation stack.
  5. Prints to standard out the stack output.

The following resources are deployed from the stack:

  • An API Gateway REST API with two resources:
    • /pseudonymization
    • /reidentification
  • A Lambda function
  • A Secrets Manager secret
  • A KMS key
  • IAM roles and policies
  • An Amazon CloudWatch Logs group

You need to pass the following parameters to the script for the deployment to be successful:

  • STACK_NAME – The CloudFormation stack name.
  • AWS_REGION – The Region where the solution is deployed.
  • AWS_PROFILE – The named profile that applies to the AWS Command Line Interface (AWS CLI). command
  • ARTEFACT_S3_BUCKET – The S3 bucket where the infrastructure code is stored. The bucket must be created in the same account and Region where the solution lives.

Use the following commands to run the ./deployments_scripts/deploy.sh script:

chmod +x ./deployment_scripts/deploy.sh ./deployment_scripts/deploy.sh -s STACK_NAME -b ARTEFACT_S3_BUCKET -r AWS_REGION -p AWS_PROFILE AWS_REGION

Upon successful deployment, the script displays the stack outputs, as depicted in the following screenshot. Take note of the output, because we use it in subsequent steps.

Stack Output

Generate encryption keys and persist them in Secrets Manager

In this step, we generate the encryption keys required to pseudonymize the plain text data. We generate those keys by calling the KMS key we created in the previous step. Then we persist the keys in a secret. Encryption keys are encrypted at rest and in transit, and exist in plain text only in-memory when the function calls them.

To perform this step, we use the script key_generator.py. You need to pass the following parameters for the script to run successfully:

  • KmsKeyArn – The output value from the previous stack deployment
  • AWS_PROFILE – The named profile that applies to the AWS CLI command
  • AWS_REGION – The Region where the solution is deployed
  • SecretName – The output value from the previous stack deployment

Use the following command to run ./helper_scripts/key_generator.py:

python3 ./helper_scripts/key_generator.py -k KmsKeyArn -s SecretName -p AWS_PROFILE -r AWS_REGION

Upon successful deployment, the secret value should look like the following screenshot.

Encryption Secrets

Test the solution

In this step, we configure Postman and query the REST API, so you need to make sure Postman is installed in your machine. Upon successful authentication, the API returns the requested values.

The following parameters are required to create a complete request in Postman:

  • PseudonymizationUrl – The output value from stack deployment
  • ReidentificationUrl – The output value from stack deployment
  • deterministic – The value True or False for the pseudonymization call
  • API_Key – The API key, which you can retrieve from API Gateway console

Follow these steps to set up Postman:

  1. Start Postman in your machine.
  2. On the File menu, choose Import.
  3. Import the Postman collection.
  4. From the collection folder, navigate to the pseudonymization request.
  5. To test the pseudonymization resource, replace all variables in the sample request with the parameters mentioned earlier.

The request template in the body already has some dummy values provided. You can use the existing one or exchange with your own.

  1. Choose Send to run the request.

The API returns in the body of the response a JSON data type.

Reidentification

  1. From the collection folder, navigate to the reidentification request.
  2. To test the reidentification resource, replace all variables in the sample request with the parameters mentioned earlier.
  3. Pass to the response template in the body the pseudonyms output from earlier.
  4. Choose Send to run the request.

The API returns in the body of the response a JSON data type.

Pseudonyms

Cost and performance

There are many factors that can determine the cost and performance of the service. Performance especially can be influenced by payload size, concurrency, cache hit, and managed service limits on the account level. The cost is mainly influenced by how much the service is being used. For our cost and performance exercise, we consider the following scenario:

The REST API is used to pseudonymize Vehicle Identification Numbers (VINs). On average, consumers request pseudonymization of 1,000 VINs per call. The service processes on average 40 requests per second, or 40,000 encryption or decryption operations per second. The average process time per request is as follows:

  • 15 milliseconds for deterministic encryption
  • 23 milliseconds for probabilistic encryption
  • 6 milliseconds for decryption

The number of calls hitting the service per month is distributed as follows:

  • 50 million calls hitting the pseudonymization resource for deterministic encryption
  • 25 million calls hitting the pseudonymization resource for probabilistic encryption
  • 25 million calls hitting the reidentification resource for decryption

Based on this scenario, the average cost is $415.42 USD per month. You may find the detailed cost breakdown in the estimate generated via the AWS Pricing Calculator.

We use Locust to simulate a similar load to our scenario. Measurements from Amazon CloudWatch metrics are depicted in the following screenshots (network latency isn’t considered during our measurement).

The following screenshot shows API Gateway latency and Lambda duration for deterministic encryption. Latency is high at the beginning due to the cold start, and flattens out over time.

API Gateway Latency & Lamdba Duration for deterministic encryption. Latency is high at the beginning due to the cold start and flattens out over time.

The following screenshot shows metrics for probabilistic encryption.

metrics for probabilistic encryption

The following shows metrics for decryption.

metrics for decryption

Clean up

To avoid incurring future charges, delete the CloudFormation stack by running the destroy.sh script. The following parameters are required to run the script successfully:

  • STACK_NAME – The CloudFormation stack name
  • AWS_REGION – The Region where the solution is deployed
  • AWS_PROFILE – The named profile that applies to the AWS CLI command

Use the following commands to run the ./deployment_scripts/destroy.sh script:

chmod +x ./deployment_scripts/destroy.sh ./deployment_scripts/destroy.sh -s STACK_NAME -r AWS_REGION -p AWS_PROFILE

Conclusion

In this post, we demonstrated how to build a pseudonymization service on AWS. The solution is technology agnostic and can be adopted by any form of consumer as long as they’re able to consume REST APIs. We hope this post helps you in your data protection strategies.

Stay tuned for part 2, which will cover consumption patterns of the pseudonymization service.


About the authors

Edvin Hallvaxhiu is a Senior Global Security Architect with AWS Professional Services and is passionate about cybersecurity and automation. He helps customers build secure and compliant solutions in the cloud. Outside work, he likes traveling and sports.

Rahul Shaurya is a Senior Big Data Architect with AWS Professional Services. He helps and works closely with customers building data platforms and analytical applications on AWS. Outside of work, Rahul loves taking long walks with his dog Barney.

Andrea Montanari is a Big Data Architect with AWS Professional Services. He actively supports customers and partners in building analytics solutions at scale on AWS.

María Guerra is a Big Data Architect with AWS Professional Services. Maria has a background in data analytics and mechanical engineering. She helps customers architecting and developing data related workloads in the cloud.

Pushpraj is a Data Architect with AWS Professional Services. He is passionate about Data and DevOps engineering. He helps customers build data driven applications at scale.

Cloudflare achieves key cloud computing certifications — and there’s more to come

Post Syndicated from Rory Malone original https://blog.cloudflare.com/iso-27018-second-privacy-certification-and-c5/

Cloudflare achieves key cloud computing certifications — and there’s more to come

Cloudflare achieves key cloud computing certifications — and there’s more to come

Back in the early days of the Internet, you could physically see the hardware where your data was stored. You knew where your data was and what kind of locks and security protections you had in place. Fast-forward a few decades, and data is all “in the cloud”. Now, you have to trust that your cloud services provider is putting security precautions in place just as you would have if your data was still sitting on your hardware. The good news is, you don’t have to merely trust your provider anymore. There are a number of ways a cloud services provider can prove it has robust privacy and security protections in place.

Today, we are excited to announce that Cloudflare has taken three major steps forward in proving the security and privacy protections we provide to customers of our cloud services: we achieved a key cloud services certification, ISO/IEC 27018:2019; we completed our independent audit and received our Cloud Computing Compliance Criteria Catalog (“C5”) attestation; and we have joined the EU Cloud Code of Conduct General Assembly to help increase the impact of the trusted cloud ecosystem and encourage more organizations to adopt GDPR-compliant cloud services.

Cloudflare has been committed to data privacy and security since our founding, and it is important to us that we can demonstrate these commitments. Certification provides assurance to our customers that a third party has independently verified that Cloudflare meets the requirements set out in the standard.

ISO/IEC 27018:2019 – Cloud Services Certification

2022 has been a big year for people who like the number ‘two’. February marked the second when the 22nd Feb 2022 20:22:02 passed: the second second of the twenty-second minute of the twentieth hour of the twenty-second day of the second month, of the year twenty-twenty-two! As well as the date being a palindrome — something that reads the same forwards and backwards — on an vintage ‘80s LCD clock, the date and time could be written as an ambigram too — something that can be read upside down as well as the right way up:

Cloudflare achieves key cloud computing certifications — and there’s more to come

When we hit 2022 02 22, our team was busy completing our second annual audit to certify to ISO/IEC 27701:2019, having been one of the first organizations in our industry to have achieved this new ISO privacy certification in 2021, and the first Internet performance & security company to be certified to it. And now Cloudflare has now been certified to a second international privacy standard related to the processing of personal data — ISO/IEC 27018:2019.1

ISO 27018 is a privacy extension to the widespread industry standards ISO/IEC 27001 and ISO/IEC 27002, which describe how to establish and run an Information Security Management System. ISO 27018 extends the standards into a code of practice for how any personal information should be protected when processed in a public cloud, such as Cloudflare’s.

What does ISO 27018 mean for Cloudflare customers?

Put simply, with Cloudflare’s certifications to both ISO 27701 and ISO 27018, customers can be assured that Cloudflare both has a privacy program that meets GDPR-aligned industry standards and also that Cloudflare protects the personal data processed in our network as part of that privacy program.

These certifications, in addition to the Data Processing Addendum (“DPA”) we make available to our customers, offer our customers multiple layers of assurance that any personal data that Cloudflare processes on their behalf will be handled in a way that meets the GDPR’s requirements.

The ISO 271018 standard contains enhancements to existing ISO 27002 controls and an additional set of 25 controls identified for organizations that are personal data processors. Controls are essentially a set of best practices that processors must meet in terms of data handling practices and transparency about those practices, protecting and encrypting the personal data processed, and handling data subject rights, among others. As an example, one of the ISO 27018 requirements is:

Where the organization is contracted to process personal data, that personal data may not be used for the purpose of marketing and advertising without establishing that prior consent was obtained from the appropriate data subject. Such consent shall not be a condition for receiving the service.

When Cloudflare acts as a data processor for our customers’ data, that data (and any personal data it may contain) belongs to our customers, not to us. Cloudflare does not track our customers’ end users for marketing or advertising purposes, and we never will. We even went beyond what the ISO control required and added this commitment to our customer DPA:

“… Cloudflare shall not use the Personal Data for the purposes of marketing or advertising…”
– 3.1(b), Cloudflare Data Protection Addendum

Cloudflare achieves ISO 27018:2019 Certification

For ISO 27018, Cloudflare was assessed by a third-party auditor, Schellman, between December 2021 and February 2022. Certifying to an ISO privacy standard is a multi-step process that includes an internal and an external audit, before finally being certified against the standard by the independent auditor. Cloudflare’s new single joint certificate, covering ISO 27001:2013, ISO 27018:2019, and ISO 27701:2019 is now available to download from the Cloudflare Dashboard.

Cloudflare achieves key cloud computing certifications — and there’s more to come

C5:2020 – Cloud Computing Compliance Criteria Catalog

ISO 27108 isn’t all we’re announcing: as we blogged in February, Cloudflare has also been undergoing a separate independent audit for the Cloud Computing Compliance Criteria Catalog certification — also known as C5 — which was introduced by the German government’s Federal Office for Information Security (“BSI”) in 2016 and updated in 2020. C5 evaluates an organization’s security program against a standard of robust cloud security controls. Both German government agencies and private companies place a high level of importance on aligning their cloud computing requirements with these standards. Learn more about C5 here.

Today, we’re excited to announce that we have completed our independent audit and received our C5 attestation from our third-party auditors. The C5 attestation report is now available  to download from the Cloudflare Dashboard.

And we’re not done yet…

When the European Union’s benchmark-setting General Data Protection Regulation (“GDPR”) was adopted four years ago this week, Article 40 encouraged:

“…the drawing up of codes of conduct intended to contribute to the proper application of this Regulation, taking account of the specific features of the various processing sectors and the specific needs of micro, small and medium-sized enterprises.”

The first code officially approved as GDPR-compliant by the EU one year ago this past weekend is ‘The EU Cloud Code of Conduct’. This code is designed to help cloud service providers demonstrate the protections they provide for the personal data they process on behalf of their customers. It covers all cloud service layers, and its compliance is overseen by accredited monitoring body SCOPE Europe. Initially, cloud service providers join as members of the code’s General Assembly, and then the second step is to undergo an audit to validate their adherence to the code.

Today, we are pleased to announce today that Cloudflare has joined the General Assembly of the EU Cloud Code of Conduct. We look forward to the second stage in this process, undertaking our audit and publicly affirming our compliance to the GDPR as a processor of personal data.

Cloudflare Certifications

Customers may now download a copy of Cloudflare’s certifications and reports from the Cloudflare Dashboard; new customers may request these from your sales representative. For the latest information about our certifications and reports, please visit our Trust Hub.


1The International Organization for Standardization (“ISO”) is an international, nongovernmental organization made up of national standards bodies that develops and publishes a wide range of proprietary, industrial, and commercial standards.

New IDC whitepaper released – Trusted Cloud: Overcoming the Tension Between Data Sovereignty and Accelerated Digital Transformation

Post Syndicated from Marta Taggart original https://aws.amazon.com/blogs/security/new-idc-whitepaper-released-trusted-cloud-overcoming-the-tension-between-data-sovereignty-and-accelerated-digital-transformation/

A new International Data Corporation (IDC) whitepaper sponsored by AWS, Trusted Cloud: Overcoming the Tension Between Data Sovereignty and Accelerated Digital Transformation, examines the importance of the cloud in building the future of digital EU organizations. IDC predicts that 70% of CEOs of large European organizations will be incentivized to generate at least 40% of their revenues from digital by 2025, which means they have to accelerate their digital transformation. In a 2022 IDC survey of CEOs across Europe, 46% of European CEOs will accelerate the shift to cloud as their most strategic IT initiative in 2022.

In the whitepaper, IDC offers perspectives on how operational effectiveness, digital investment, and ultimately business growth need to be balanced with data sovereignty requirements. IDC defines data sovereignty as “a subset of digital sovereignty. It is the concept of data being subject to the laws and governance structures within the country it is collected or pertains to.”

IDC provides a perspective on some of the current discourse on cloud data sovereignty, including extraterritorial reach of foreign intelligence under national security laws, and the level of protection for individuals’ privacy in-country or with cross-border data transfer. The Schrems II decision and its implications with respect to personal data transfers between the EU and US has left many organizations grappling with how to comply with their legal requirements when transferring data outside the EU.

IDC provides the following background on controls in the cloud:

  • Cloud providers do not have unrestricted access to customer data in the cloud. Organizations retain all ownership and control of their data. Through credential and permission settings, the customer is the controller of who has access to their data.
  • Cloud providers use a rigorous set of organizational and technical controls based on least privilege to protect data from unauthorized access and inappropriate use.
  • Most cloud service operations, including maintenance and trouble-shooting, are fully automated. Should human access to customer data be required, it is temporary and limited to what is necessary to provide the contracted service to the customer. All access should be strictly logged, monitored, and audited to verify that activity is valid and compliant.
  • Technical controls such as encryption and key management assume greater importance. Encryption is considered fundamental to data protection best practices and highly recommended by regulators. Encrypted data processed in memory within hardware-based trusted execution environment (TEEs), also known as enclaves, can alleviate these regulatory concerns by rendering sensitive information invisible to host operating systems and cloud providers. The AWS Nitro System, the underlying platform that runs Amazon EC2 instances, is an industry example that provides such protection capability.
  • Independent accreditation against official standards are a recognized basis for assessing adherence to privacy and security practices. Approved by the European Data Protection Board, the EU Cloud Code of Conduct and CISPE’s Code of Conduct for Cloud Infrastructure Service Providers provide an accountability framework to help demonstrate compliance with processor obligations under GDPR Article 28. Whilst not required for GDPR compliance, CISPE requires accredited cloud providers to offer customers the option to retain all personal data in their customer content in the European Economic Area (EEA).
  • Greater data control and security is often cited as a driver to hosting data in-country. However, IDC notes that the physical location of the data has no bearing on mitigating data risk to cyber threats. Data residency can run counter to an organization’s objectives for security and resilience. More and more European organizations now are trusting the cloud for their security needs, as many organizations simply do not have the resource and expertise to provide the same security benefits as large cloud providers can.

For more information about how to translate your data sovereignty requirements into an actionable business and IT strategy, read the full IDC whitepaper Trusted Cloud: Overcoming the Tension Between Data Sovereignty and Accelerated Digital Transformation. You can also read more about AWS commitments to protect EU customers’ data on our EU data protection webpage.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Marta Taggart

Marta is a Seattle-native and Senior Product Marketing Manager in AWS Security Product Marketing, where she focuses on data protection services. Outside of work you’ll find her trying to convince Jack, her rescue dog, not to chase squirrels and crows (with limited success).

Orlando Scott-Cowley

Orlando Scott-Cowley

Orlando is Amazon Web Services’ Worldwide Public Sector Lead for Security & Compliance in EMEA. Orlando customers with their security and compliance and adopting AWS. Orlando specialises in Cyber Security, with a background in security consultancy, penetration testing and compliance; he holds a CISSP, CCSP and CCSK.

LGPD workbook for AWS customers managing personally identifiable information in Brazil

Post Syndicated from Rodrigo Fiuza original https://aws.amazon.com/blogs/security/lgpd-workbook-for-aws-customers-managing-personally-identifiable-information-in-brazil/

Portuguese version

AWS is pleased to announce the publication of the Brazil General Data Protection Law Workbook.

The General Data Protection Law (LGPD) in Brazil was first published on 14 August 2018, and started its applicability on 18 August 2020. Companies that manage personally identifiable information (PII) in Brazil as defined by LGPD will have to comply with and attend to the law.

To better help customers prepare and implement controls that focus on LGPD Chapter VII Security and Best Practices, AWS created a workbook based on industry best practices, AWS service offerings, and controls.

Amongst other topics, this workbook covers information security and AWS controls from:

In combination with Brazil General Data Protection Law Workbook, customers can use the detailed Navigating LGPD Compliance on AWS whitepaper.

AWS adheres to a shared responsibility model. Customers will have to observe which services offer privacy features and determine their applicability to their specific compliance requirements. Further information about data privacy at AWS can be found at our Data Privacy Center. Specific information about LGPD and data privacy at AWS in Brazil can be found on our Brazil Data Privacy page.

To learn more about our compliance and security programs, see AWS Compliance Programs. As always, we value your feedback and questions; reach out to the AWS Compliance team through the Contact Us page.

If you have feedback about this post, submit comments in the Comments section below.
Want more AWS Security news? Follow us on Twitter.
 


Portuguese

Workbook da LGPD para Clientes AWS que gerenciam Informações de Identificação Pessoal no Brasil

A AWS tem o prazer de anunciar a publicação do Workbook Lei Geral de Proteção de Dados do Brasil.

A Lei Geral de Proteção de Dados (LGPD) teve sua primeira publicação em 14 de agosto de 2018 no Brasil e iniciou sua aplicabilidade em 18 de agosto de 2020. Empresas que gerenciam informações pessoais identificáveis (PII) conforme definido na LGPD terão que cumprir e atender às cláusulas da lei.

Para ajudar melhor os clientes a preparar e implementar controles que se concentram no Capítulo VII da LGPD “da Segurança e Boas Práticas”, a AWS criou uma pasta de trabalho com base nas melhores práticas do setor, ofertas de serviços e controles da AWS.

Entre outros tópicos, esta pasta de trabalho aborda a segurança da informação e os controles da AWS de:

Em combinação com o Workbook Lei Geral de Proteção de Dados do Brasil, os clientes podem usar o whitepaper detalhado Navegando na conformidade com a LGPD na AWS.

A AWS adere a um modelo de responsabilidade compartilhada. Clientes terão que observar quais serviços oferecem recursos de privacidade e determinar sua aplicabilidade aos seus requisitos específicos de compliance. Mais informações sobre a privacidade de dados na AWS podem ser encontradas em nosso Centro de Privacidade de Dados. Informações adicionais sobre LGPD e Privacidade de dados na AWS no Brasil podem ser encontradas em nossa página de Privacidade de Dados no Brasil.

Para saber mais sobre nossos programas de conformidade e segurança, consulte Programas de conformidade da AWS. Como sempre, valorizamos seus comentários e perguntas; entre em contato com a equipe de conformidade da AWS por meio da página Fale conosco.

Se você tiver feedback sobre esta postagem, envie comentários na seção Comentários abaixo.

Quer mais notícias sobre segurança da AWS? Siga-nos no Twitter.

Author

Rodrigo Fiuza

Rodrigo is a Security Audit Manager at AWS, based in São Paulo. He leads audits, attestations, certifications, and assessments across Latin America, Caribbean and Europe. Rodrigo has previously worked in risk management, security assurance, and technology audits for the past 12 years.

AWS welcomes new Trans-Atlantic Data Privacy Framework

Post Syndicated from Michael Punke original https://aws.amazon.com/blogs/security/aws-welcomes-new-trans-atlantic-data-privacy-framework/

Amazon Web Services (AWS) welcomes the new Trans-Atlantic Data Privacy Framework (Data Privacy Framework) that was agreed to, in principle, between the European Union (EU) and the United States (US) last month. This announcement demonstrates the common will between the US and EU to strengthen privacy protections in trans-Atlantic data flows, and will supplement the safeguards AWS and other companies already offer today. AWS commits to undertaking certification in accordance with the Data Privacy Framework as it is adopted, and we look forward to our customers and their end users benefiting from the new safeguards.

The Data Privacy Framework, once finalized, will re-establish a mechanism for certified businesses to conduct trans-Atlantic data transfers between the US and EU. According to the announcement, the new Data Privacy Framework will address the concerns raised by the Court of Justice of the European Union (CJEU) when it invalidated the EU-US Privacy Shield in its Schrems II decision in uly 2020. The Data Privacy Framework will adopt new safeguards to ensure that US intelligence activities are limited to what is necessary and proportionate to protect national security, and also create a new redress system to address the complaints of EU citizens.

As one of the architects of the Trusted Cloud Principles (a cloud-industry initiative to help safeguard the interests of organizations and the basic rights of individuals using cloud), AWS fully supports improved rules and regulations that advance privacy and security protections for any organization that wants to use cloud technologies and maintain control of their data.

While organizations using AWS technology have been able to conduct trans-Atlantic data transfers even after Schrems II, the new Data Privacy Framework will ensure further clarity and agility for our customers in their data transfer assessments. This will help our customers unlock value in terms of growth, digital transformation, and global competitive advantage.

Organizations that want to trade with speed and agility to and from the European Economic Area (EEA) need certainty that their goals to innovate and invest in the best technology for growth is supported by international frameworks promoting privacy across borders. Once finalized, the new Data Privacy Framework, coupled with our continued commitment to privacy at AWS, will provide even more simplicity and confidence for customers who choose to transfer data to and from Europe when using AWS services.

More than ever, our collective security requires mutual trust across both sides of the Atlantic and beyond. We therefore look forward to participating in, and remain committed to, the finalization of the Data Privacy Framework. We also support efforts to build broad consensus around the appropriate balance between privacy and security in forums such as the OECD’s workstream on trusted government access to data held by the private sector.

About AWS privacy and security

AWS is committed to protecting customer data. We continue to help customers successfully meet evolving European laws and standards, and achieve the highest levels of security, privacy, and resilience. AWS already offers comprehensive technical, operational, and contractual measures to protect and transfer customer content outside of Europe in compliance with the General Data Protection Regulation (GDPR) and the Schrems II ruling. Customers can also choose to store their content in the European Union by selecting any one or more of our regions in France, Germany, Ireland, Italy, Sweden, and later in 2022, Spain, with the confidence that their data stays in the AWS Region they select. In addition, customers can use an advanced set of access, encryption, and logging features to maintain full control of their content.

Today, AWS customers can also transfer their data outside of the European Economic Area (EEA) by relying on the new Standard Contractual Clauses (SCCs) included in the AWS Data Processing Addendum (DPA), which is supplemented by our strengthened contractual commitments to protect customer data, such as challenging law enforcement requests that conflict with EU law.

We also have a wide variety of tools available to enhance the security of cross-border data transfers for customers with global services. For example, AWS CloudHSM and AWS Key Management Service (AWS KMS) allow customers to encrypt data in transit and at rest, and securely generate and manage control of encryption keys. By building on top of the AWS Nitro System, our answer to confidential computing, which includes the use of specialized hardware and associated firmware to protect customer code and data during processing from outside access, customers can further secure data during processing, and thereby enhance confidentiality and privacy.

AWS has achieved internationally recognized certifications and attestations that demonstrate compliance with rigorous international privacy and security standards, including the Cloud Infrastructure Services in Europe (CISPE) Data Protection Code of Conduct, Cloud Computing Compliance Controls Catalog (C5), ISO27018, and the Esquema National de Securidad (ENS, Spain).

As well as benefitting from these existing measures, our extensive online resources can help customers more easily complete data-transfer assessments and fulfill their GDPR compliance requirements, in accordance with the European Data Protection Board (EDPB) recommendations. This includes regular Information Request Reports showing requests to access data from governments and our responses.

Further information

Our technical paper Navigating Compliance with EU Data Transfer Requirements and AWS’s Privacy Features for AWS Services provide further information to help customers assess the right services for their individual needs.

If you have questions or need more information, visit our EU Data Protection page.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Michael Punke

Michael Punke

Michael Punke is Vice President for Global Public Policy, Amazon Web Services, and lives with his family in Montana. He has more than 25 years of experience in international trade and regulatory issues. Punke served from 2010 to 2017 as Deputy US Trade Representative and US Ambassador to the World Trade Organization (WTO) in Geneva.

Enabling data classification for Amazon RDS database with Macie

Post Syndicated from Bruno Silveira original https://aws.amazon.com/blogs/security/enabling-data-classification-for-amazon-rds-database-with-amazon-macie/

Customers have been asking us about ways to use Amazon Macie data discovery on their Amazon Relational Database Service (Amazon RDS) instances. This post presents how to do so using AWS Database Migration Service (AWS DMS) to extract data from Amazon RDS, store it on Amazon Simple Storage Service (Amazon S3), and then classify the data using Macie. Macie’s resulting findings will also be made available to be queried with Amazon Athena by appropriate teams.

The challenge

Let’s suppose you need to find sensitive data in an RDS-hosted database using Macie, which currently only supports S3 as a data source. Therefore, you will need to extract and store the data from RDS in S3. In addition, you will need an interface for audit teams to audit these findings.

Solution overview

Figure 1: Solution architecture workflow

Figure 1: Solution architecture workflow

The architecture of the solution in Figure 1 can be described as:

  1. A MySQL engine running on RDS is populated with the Sakila sample database.
  2. A DMS task connects to the Sakila database, transforms the data into a set of Parquet compressed files, and loads them into the dcp-macie bucket.
  3. A Macie classification job analyzes the objects in the dcp-macie bucket using a combination of techniques such as machine learning and pattern matching to determine whether the objects contain sensitive data and to generate detailed reports on the findings.
  4. Amazon EventBridge routes the Macie findings reports events to Amazon Kinesis Data Firehose.
  5. Kinesis Data Firehose stores these reports in the dcp-glue bucket.
  6. S3 event notification triggers an AWS Lambda function whenever an object is created in the dcp-glue bucket.
  7. The Lambda function named Start Glue Workflow starts a Glue Workflow.
  8. Glue Workflow transforms the data from JSONL to Apache Parquet file format and places it in the dcp-athena bucket. This provides better performance during data query and optimized storage usage using a binary optimized columnar storage.
  9. Athena is used to query and visualize data generated by Macie.

Note: For better readability, the S3 bucket nomenclature omits the suffix containing the AWS Region and AWS account ID used to meet the global uniqueness naming requirement (for example, dcp-athena-us-east-1-123456789012).

The Sakila database schema consists of the following tables:

  • actor
  • address
  • category
  • city
  • country
  • customer

Building the solution

Prerequisites

Before configuring the solution, the AWS Identity and Access Management (IAM) user must have appropriate access granted for the following services:

You can find an IAM policy with the required permissions here.

Step 1 – Deploying the CloudFormation template

You’ll use CloudFormation to provision quickly and consistently the AWS resources illustrated in Figure 1. Through a pre-built template file, it will create the infrastructure using an Infrastructure-as-Code (IaC) approach.

  1. Download the CloudFormation template.
  2. Go to the CloudFormation console.
  3. Select the Stacks option in the left menu.
  4. Select Create stack and choose With new resources (standard).
  5. On Step 1 – Specify template, choose Upload a template file, select Choose file, and select the file template.yaml downloaded previously.
  6. On Step 2 – Specify stack details, enter a name of your preference for Stack name. You might also adjust the parameters as needed, like the parameter CreateRDSServiceRole to create a service role for RDS if it does not exist in the current account.
  7. On Step 3 – Configure stack options, select Next.
  8. On Step 4 – Review, check the box for I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then select Create Stack.
  9. Wait for the stack to show status CREATE_COMPLETE.

Note: It is expected that provisioning will take around 10 minutes to complete.

Step 2 – Running an AWS DMS task

To extract the data from the Amazon RDS instance, you need to run an AWS DMS task. This makes the data available for Amazon Macie in an S3 bucket in Parquet format.

  1. Go to the AWS DMS console.
  2. In the left menu, select Database migration tasks.
  3. Select the task Identifier named rdstos3task.
  4. Select Actions.
  5. Select the option Restart/Resume.

When the Status changes to Load Complete the task has finished and you will be able to see migrated data in your target bucket (dcp-macie).

Inside each folder you can see parquet file(s) with names similar to LOAD00000001.parquet. Now you can use Macie to discover if you have sensitive data in your database contents as exported to S3.

Step 3 – Running a classification job with Amazon Macie

Now you need to create a data classification job so you can assess the contents of your S3 bucket. The job you create will run once and evaluate the complete contents of your S3 bucket to determine whether it can identify PII among the data. As mentioned earlier, this job only uses the managed identifiers available with Macie – you could also add your own custom identifiers.

  1. Go to the Macie console.
  2. Select the S3 buckets option in the left menu.
  3. Choose the S3 bucket dcp-macie containing the output data from the DMS task. You may need to wait a minute and select the Refresh icon if the bucket names do not display.

  4. Select Create job.
  5. Select Next to continue.
  6. Create a job with the following scope.
    1. Sensitive data Discovery options: One-time job
    2. Sampling Depth: 100%
    3. Leave all other settings with their default values
  7. Select Next to continue.
  8. Select Next again to skip past the Custom data identifiers section.
  9. Give the job a name and description.
  10. Select Next to continue.
  11. Verify the details of the job you have created and select Submit to continue.

You will see a green banner stating that The Job was successfully created. The job can take up to 15 minutes to conclude and the Status will change from Active to Complete. To open the findings from the job, select the job’s check box, choose Show results, and select Show findings.
 

Figure 2: Macie Findings screen

Figure 2: Macie Findings screen

Note: You can navigate in the findings and select each checkbox to see the details.

Step 4 – Enabling querying on classification job results with Amazon Athena

  1. Go to the Athena console and open the Query editor.
  2. If it’s your first-time using Athena you will see a message Before you run your first query, you need to set up a query result location in Amazon S3. Learn more. Select the link presented with this message.
  3. In the Settings window, choose Select and then choose the bucket dcp-assets to store the Athena query results.
  4. (Optional) To store the query results encrypted, check the box for Encrypt query results and select your preferred encryption type. To learn more about Amazon S3 encryption types, see Protecting data using encryption.
  5. Select Save.

Step 5 – Query Amazon Macie results with Amazon Athena.

It might take a few minutes for the data to complete the flow between Amazon Macie and AWS Glue. After it’s finished, you’ll be able to see in Athena’s console the table dcp_athena within the database dcp.

Then, select the three dots next to the table dcp_athena and select the Preview table option to see a data preview, or run your own custom queries.
 

Figure 3: Athena table preview

Figure 3: Athena table preview

As your environment grows, this blog post on Top 10 Performance Tuning Tips for Amazon Athena can help you apply partitioning of data and consolidate your data into larger files if needed.

Clean up

After you finish, to clean up the solution and avoid unnecessary expenses, complete the following steps:

  1. Go to the Amazon S3 console.
  2. Navigate to each of the buckets listed below and delete all its objects:
    • dcp-assets
    • dcp-athena
    • dcp-glue
    • dcp-macie
  3. Go to the CloudFormation console.
  4. Select the Stacks option in the left menu.
  5. Choose the stack you created in Step 1 – Deploying the CloudFormation template.
  6. Select Delete and then select Delete Stack in the pop-up window.

Conclusion

In this blog post, we show how you can find Personally Identifiable Information (PII), and other data defined as sensitive, in Macie’s Managed Data Identifiers in an RDS-hosted MySQL database. You can use this solution with other relational databases like PostgreSQL, SQL Server, or Oracle, whether hosted on RDS or EC2. If you’re using Amazon DynamoDB, you may also find useful the blog post Detecting sensitive data in DynamoDB with Macie.

By classifying your data, you can inform your management of appropriate data protection and handling controls for the use of that data.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Bruno Silveira

Bruno is a Solutions Architect Manager in the Public Sector team with a focus on educational institutions in Brazil. His previous career was in government, financial services, utilities, and nonprofit institutions. Bruno is an enthusiast of cloud security and an appreciator of good rock’n roll with a good beer.

Author

Thiago Pádua

Thiago is Solutions Architect in the AWS Worldwide Public Sector team working in the development and support of partners. He is experienced in software development and systems integration, mainly in the telecommunications industry. He has a special interest in microservices, serverless, and containers.

New Standard Contractual Clauses now part of the AWS GDPR Data Processing Addendum for customers

Post Syndicated from Stéphane Ducable original https://aws.amazon.com/blogs/security/new-standard-contractual-clauses-now-part-of-the-aws-gdpr-data-processing-addendum-for-customers/

Today, we’re happy to announce an update to our online AWS GDPR Data Processing Addendum (AWS GDPR DPA) and our online Service Terms to include the new Standard Contractual Clauses (SCCs) that the European Commission (EC) adopted in June 2021. The EC-approved SCCs give our customers the ability to comply with the General Data Protection Regulation (GDPR) when they transfer personal data subject to GDPR to countries outside the European Economic Area (EEA) that haven’t received an EC adequacy decision (third countries). The new SCCs are now better adapted to how our customers operate their applications or run their workloads in the cloud, because they cover different transfer scenarios, and also provide enhanced safeguards for data transfers.

Achieving compliance with GDPR is critical for hundreds of thousands of AWS customers and AWS. The new SCCs allow all AWS customers that are controllers or processors under GDPR to continue to transfer personal data from their AWS accounts in compliance with GDPR. As part of the online Service Terms, the new SCCs will apply automatically whenever an AWS customer uses AWS services to transfer customer data to third countries.

The updated AWS GDPR DPA incorporating the new SCCs supplements our announcement in February 2021 of strengthened commitments to protect customer data, such as challenging law enforcement requests that conflict with EU law. We have also published the blog post How AWS is helping EU customers navigate the new normal for data protection, and the whitepaper Navigating Compliance with EU Data Transfer Requirements to help AWS customers conduct their data transfer assessments and comply with GDPR, the Schrems II ruling, and the recommendations issued by the European Data Protection Board. AWS is constantly working to ensure that our customers can enjoy the benefits of AWS everywhere they operate, and we welcome the new SCCs because they enable our customers to continue using AWS services in compliance with GDPR. If you have questions or need more information, visit our EU Data Protection page and our GDPR Center.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Stéphane Ducable

Stéphane is Vice President of Public Policy – EMEA at AWS. He is focused on increasing awareness of the benefits of adopting cloud computing technology across the EMEA region.

How AWS is helping EU customers navigate the new normal for data protection

Post Syndicated from Stephen Schmidt original https://aws.amazon.com/blogs/security/how-aws-is-helping-eu-customers-navigate-the-new-normal-for-data-protection/

Achieving compliance with the European Union’s data protection regulations is critical for hundreds of thousands of Amazon Web Services (AWS) customers. Many of them are subject to the EU’s General Data Protection Regulation (GDPR), which ensures individuals’ fundamental right to privacy and the protection of personal data. In February, we announced strengthened commitments to protect customer data, such as challenging law enforcement requests for customer data that conflict with EU law.

Today, we’re excited to announce that we’ve launched two new online resources to help customers more easily complete data transfer assessments and comply with the GDPR, taking into account the European Data Protection Board (EDPB) recommendations. These resources will also assist AWS customers in other countries to understand whether their use of AWS services involves a data transfer.

Using AWS’s new “Privacy Features for AWS Services,” customers can determine whether their use of an individual AWS service involves the transfer of customer data (the personal data they’ve uploaded to their AWS account). Knowing this information enables customers to choose the right action for their applications, such as opting out of the data transfer or creating an appropriate disclosure of the transfer for end user transparency.

We’re also providing additional information on the processing activities and locations of the limited number of sub-processors that AWS engages to provide services that involve the processing of customer data. AWS engages three types of sub-processors:

  • Local AWS entities that provide the AWS infrastructure.
  • AWS entities that process customer data for specific AWS services.
  • Third parties that AWS contracts with to provide processing activities for specific AWS services.

The enhanced information available on our updated Sub-processors page enables customers to assess if a sub-processor is relevant to their use of AWS services and AWS Regions.

These new resources make it easier for AWS customers to conduct their data transfer assessments as set out in the EDPB recommendations and, as a result, comply with GDPR. After completing their data transfer assessments, customers will also be able to determine whether they need to implement supplemental measures in line with the EDPB’s recommendations.

These resources support our ongoing commitment to giving customers control over where their data is stored, how it’s stored, and who has access to it.

Since we opened our first region in the EU in 2007, customers have been able to choose to store customer data with AWS in the EU. Today, customers can store their data in our AWS Regions in France, Germany, Ireland, Italy, and Sweden, and we’re adding Spain in 2022. AWS will never transfer data outside a customer’s selected AWS Region without the customer’s agreement.

AWS customers control how their data is stored, and we have a variety of tools at their disposal to enhance security. For example, AWS CloudHSM and AWS Key Management Service (AWS KMS) allow customers to encrypt data in transit and at rest and securely generate and manage encryption keys that they control.

Finally, our customers control who can access their data. We never use customer data for marketing or advertising purposes. We also prohibit, and our systems are designed to prevent, remote access by AWS personnel to customer data for any purpose, including service maintenance, unless requested by a customer, required to prevent fraud and abuse, or to comply with the law.

As previously mentioned, we challenge law enforcement requests for customer data from governmental bodies, whether inside or outside the EU, where the request conflicts with EU law, is overbroad, or we otherwise have any appropriate grounds to do so.

Earning customer trust is the foundation of our business at AWS, and we know protecting customer data is key to achieving this. We also know that helping customers protect their data in a world with constantly changing regulations, technology, and risks takes teamwork. We would never expect our customers to go it alone.

As we continue to enhance the capabilities customers have at their fingertips, they can be confident that choosing AWS will ensure they have the tools necessary to help them meet the most stringent security, privacy, and compliance requirements.

If you have questions or need more information, visit our EU Data Protection page.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Steve Schmidt

Steve is Vice President and Chief Information Security Officer for AWS. His duties include leading product design, management, and engineering development efforts focused on bringing the competitive, economic, and security benefits of cloud computing to business and government customers. Prior to AWS, he had an extensive career at the Federal Bureau of Investigation, where he served as a senior executive and section chief. He currently holds 11 patents in the field of cloud security architecture. Follow Steve on Twitter.

Author

Donna Dodson

Donna is a Senior Principal Scientist at AWS focusing on security and privacy capabilities including cryptography, risk management, standards, and assessments. Before joining AWS, Donna was the Chief Cybersecurity Advisor at the National Institute of Standards and Technology (NIST). She led NIST’s comprehensive cybersecurity research and development to cultivate trust in technology for stakeholders nationally and internationally.

Creating a notification workflow from sensitive data discover with Amazon Macie, Amazon EventBridge, AWS Lambda, and Slack

Post Syndicated from Bruno Silviera original https://aws.amazon.com/blogs/security/creating-a-notification-workflow-from-sensitive-data-discover-with-amazon-macie-amazon-eventbridge-aws-lambda-and-slack/

Following the example of the EU in implementing the General Data Protection Regulation (GDPR), many countries are implementing similar data protection laws. In response, many companies are forming teams that are responsible for data protection. Considering the volume of information that companies maintain, it’s essential that these teams are alerted when sensitive data is at risk.

This post shows how to deploy a solution that uses Amazon Macie to discover sensitive data. This solution enables you to set up automatic notification to your company’s designated data protection team via a Slack channel when sensitive data that needs to be protected is discovered by Amazon EventBridge and AWS Lambda.

The challenge

Let’s imagine that you’re part of a team that’s responsible for classifying your organization’s data but the data structure isn’t documented. Amazon Macie provides you the ability to run a scheduled classification job that examines your data, and you want to notify the data protection team when there’s new sensitive data to classify. Let’s build a solution to automatically notify the data protection team.

Solution overview

To be scalable and cost-effective, this solution uses serverless technologies and managed AWS services, including:

  • Macie – A fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in Amazon Web Services (AWS).
  • EventBridge – A serverless event bus that connects application data from your apps, SaaS, and AWS services. EventBridge can respond to specific events or run according to a schedule. The solution presented in this post uses EventBridge to initiate a custom Lambda function in response to a specific event.
  • Lambda – Runs code in response to events such as changes in data, changes in application state, or user actions. In this solution, a Lambda function is initiated by EventBridge.

Solution architecture

The architecture workflow is shown in Figure 1 and includes the following steps:

  1. Macie runs a classification job and publishes its findings to EventBridge as a JSON object.
  2. The EventBridge rule captures the findings and invokes a Lambda function as a target.
  3. The Lambda function parses the JSON object. The function then sends a custom message to a Slack channel with the sensitive data finding for the data protection team to evaluate and respond to.

 

Figure 1: Solution architecture workflow

Figure 1: Solution architecture workflow

Set up Slack

For this solution, you need a Slack workspace and an incoming webhook. The workspace must be in place before you create the webhook.

Create a Slack workspace

If you already have a Slack workspace in your environment, you can skip forward, to creating the webhook.

If you don’t have a Slack workspace, follow the steps in Create a Slack Workspace to create one.

Create an incoming webhook in Slack API

  1. Go to your Slack API.
  2. Choose Start Building to create an app.
  3. Enter the following details for your app:
    • App Namemacie-to-slack.
    • Development Slack Workspace – Choose the Slack workspace—either an existing workspace or one you created for this solution—to receive the Macie findings.
  4. Choose the Create App button.
  5. In the left menu, choose Incoming Webhooks.
  6. At the Activate Incoming Webhooks screen, move the slider from OFF to ON.
  7. Scroll down and choose Add New Webhook to Workspace.
  8. In the screen asking where your app should post, enter the name of the Slack channel from your Workspace that you want to send notification to and choose Authorize.
  9. On the next screen, scroll down to the Webhook URL section. Make a note of the URL to use later.

Deploy the CloudFormation template with the solution

The deployment of the CloudFormation template automatically creates the following resources:

  • A Lambda function that begins with the name named macie-to-slack-lambdafindingsToSlack-.
  • An EventBridge rule named MacieFindingsToSlack.
  • An IAM role named MacieFindingsToSlackkRole.
  • A permission to invoke the Lambda function named LambdaInvokePermission.

Note: Before you proceed, make sure you’re deploying the template to the same Region that your production Macie is running.

To deploy the Cloudformation template

  1. Download the YAML template to your computer.

    Note: To save the template, you can right click the Raw button at the top of the code and then select Save link as if you’re using Chrome, or the equivalent in your browser. This file is used in Step 4.

  2. Open CloudFormation in the AWS Management Console.
  3. On the Welcome page, choose Create stack and then choose With new resources.
  4. On Step 1 — Specify template, choose Upload a template file, select Choose file and then select the file template.yaml (the file extension might be .YML), then choose Next.
  5. On Step 2 — Specify stack details:
    1. Enter macie-to-slack as the Stack name.
    2. At the Slack Incoming Web Hook URL, paste the webhook URL you copied earlier.
    3. At Slack channel, enter the name of the channel in your workspace that will receive the alerts and choose Next.
    Figure 2: Defining stack details

    Figure 2: Defining stack details

  6. On Step 3 – Configure Stack options, you can leave the default settings, or change them for your environment. Choose Next to continue.
  7. At the bottom of Step 4 – Review, select I acknowledge that AWS CloudFormation might create IAM resources, and choose Create stack.

    Figure 3: Confirmation before stack creation

    Figure 3: Confirmation before stack creation

  8. Wait for the stack to reach status CREATE_COMPLETE.

Running the solution

At this point, you’ve deployed the solution and your resources are created.

To test the solution, you can schedule a Macie job targeting a bucket that contains a file with sensitive information that Macie can detect.

Note: You can check the Amazon Macie documentation to see the list of supported managed data identifiers.

When the Macie job is complete, any findings are sent to the Slack channel.

Figure 4: Macie finding delivered to Slack channel

Figure 4: Macie finding delivered to Slack channel

Select the link in the message sent to the Slack channel to open that finding in the Macie console, as shown in Figure 5.

Figure 5: Finding details

Figure 5: Finding details

And you’re done!

Now your Macie finding results are delivered to your Slack channel where they can be easily monitored, reducing response time and risk exposure.

If you deployed this for testing purposes, or want to clean this up and move to your production account, you can delete the Cloudformation stack:

  1. Open the CloudFormation console.
  2. Select the stack and choose Delete.

Conclusion

In this blog post we walked through the steps to configure a notification workflow using Macie, Lambda, and EventBridge to send sensitive data findings to your data protection team via a Slack channel.

Your data protection team will appreciate the timely notifications of sensitive data findings, giving you the ability to focus on creating controls to improve data security and compliance with regulations related to protection and treatment of personal data.

For more information about data privacy on AWS, see Data Privacy FAQ.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Bruno Silveira

Bruno is a Solutions Architect Manager in the Public Sector team with focus on educational institutions in Brazil. His previous career was in government, financial services, utilities, and nonprofit institutions. Bruno is an enthusiast of cloud security and an appreciator of good rock’n roll with a good beer.

Author

Julio Carvalho

Julio is a Principal Security Solutions Architect at AWS for the Latin American financial market. As a security specialist, he helps customers solve protection and compliance challenges on their cloud journey.

How to protect sensitive data for its entire lifecycle in AWS

Post Syndicated from Raj Jain original https://aws.amazon.com/blogs/security/how-to-protect-sensitive-data-for-its-entire-lifecycle-in-aws/

Many Amazon Web Services (AWS) customer workflows require ingesting sensitive and regulated data such as Payments Card Industry (PCI) data, personally identifiable information (PII), and protected health information (PHI). In this post, I’ll show you a method designed to protect sensitive data for its entire lifecycle in AWS. This method can help enhance your data security posture and be useful for fulfilling the data privacy regulatory requirements applicable to your organization for data protection at-rest, in-transit, and in-use.

An existing method for sensitive data protection in AWS is to use the field-level encryption feature offered by Amazon CloudFront. This CloudFront feature protects sensitive data fields in requests at the AWS network edge. The chosen fields are protected upon ingestion and remain protected throughout the entire application stack. The notion of protecting sensitive data early in its lifecycle in AWS is a highly desirable security architecture. However, CloudFront can protect a maximum of 10 fields and only within HTTP(S) POST requests that carry HTML form encoded payloads.

If your requirements exceed CloudFront’s native field-level encryption feature, such as a need to handle diverse application payload formats, different HTTP methods, and more than 10 sensitive fields, you can implement field-level encryption yourself using the Lambda@Edge feature in CloudFront. In terms of choosing an appropriate encryption scheme, this problem calls for an asymmetric cryptographic system that will allow public keys to be openly distributed to the CloudFront network edges while keeping the corresponding private keys stored securely within the network core. One such popular asymmetric cryptographic system is RSA. Accordingly, we’ll implement a Lambda@Edge function that uses asymmetric encryption using the RSA cryptosystem to protect an arbitrary number of fields in any HTTP(S) request. We will discuss the solution using an example JSON payload, although this approach can be applied to any payload format.

A complex part of any encryption solution is key management. To address that, I use AWS Key Management Service (AWS KMS). AWS KMS simplifies the solution and offers improved security posture and operational benefits, detailed later.

Solution overview

You can protect data in-transit over individual communications channels using transport layer security (TLS), and at-rest in individual storage silos using volume encryption, object encryption or database table encryption. However, if you have sensitive workloads, you might need additional protection that can follow the data as it moves through the application stack. Fine-grained data protection techniques such as field-level encryption allow for the protection of sensitive data fields in larger application payloads while leaving non-sensitive fields in plaintext. This approach lets an application perform business functions on non-sensitive fields without the overhead of encryption, and allows fine-grained control over what fields can be accessed by what parts of the application.

A best practice for protecting sensitive data is to reduce its exposure in the clear throughout its lifecycle. This means protecting data as early as possible on ingestion and ensuring that only authorized users and applications can access the data only when and as needed. CloudFront, when combined with the flexibility provided by Lambda@Edge, provides an appropriate environment at the edge of the AWS network to protect sensitive data upon ingestion in AWS.

Since the downstream systems don’t have access to sensitive data, data exposure is reduced, which helps to minimize your compliance footprint for auditing purposes.

The number of sensitive data elements that may need field-level encryption depends on your requirements. For example:

  • For healthcare applications, HIPAA regulates 18 personal data elements.
  • In California, the California Consumer Privacy Act (CCPA) regulates at least 11 categories of personal information—each with its own set of data elements.

The idea behind field-level encryption is to protect sensitive data fields individually, while retaining the structure of the application payload. The alternative is full payload encryption, where the entire application payload is encrypted as a binary blob, which makes it unusable until the entirety of it is decrypted. With field-level encryption, the non-sensitive data left in plaintext remains usable for ordinary business functions. When retrofitting data protection in existing applications, this approach can reduce the risk of application malfunction since the data format is maintained.

The following figure shows how PII data fields in a JSON construction that are deemed sensitive by an application can be transformed from plaintext to ciphertext with a field-level encryption mechanism.

Figure 1: Example of field-level encryption

Figure 1: Example of field-level encryption

You can change plaintext to ciphertext as depicted in Figure 1 by using a Lambda@Edge function to perform field-level encryption. I discuss the encryption and decryption processes separately in the following sections.

Field-level encryption process

Let’s discuss the individual steps involved in the encryption process as shown in Figure 2.

Figure 2: Field-level encryption process

Figure 2: Field-level encryption process

Figure 2 shows CloudFront invoking a Lambda@Edge function while processing a client request. CloudFront offers multiple integration points for invoking Lambda@Edge functions. Since you are processing a client request and your encryption behavior is related to requests being forwarded to an origin server, you want your function to run upon the origin request event in CloudFront. The origin request event represents an internal state transition in CloudFront that happens immediately before CloudFront forwards a request to the downstream origin server.

You can associate your Lambda@Edge with CloudFront as described in Adding Triggers by Using the CloudFront Console. A screenshot of the CloudFront console is shown in Figure 3. The selected event type is Origin Request and the Include Body check box is selected so that the request body is conveyed to Lambda@Edge.

Figure 3: Configuration of Lambda@Edge in CloudFront

Figure 3: Configuration of Lambda@Edge in CloudFront

The Lambda@Edge function acts as a programmable hook in the CloudFront request processing flow. You can use the function to replace the incoming request body with a request body with the sensitive data fields encrypted.

The process includes the following steps:

Step 1 – RSA key generation and inclusion in Lambda@Edge

You can generate an RSA customer managed key (CMK) in AWS KMS as described in Creating asymmetric CMKs. This is done at system configuration time.

Note: You can use your existing RSA key pairs or generate new ones externally by using OpenSSL commands, especially if you need to perform RSA decryption and key management independently of AWS KMS. Your choice won’t affect the fundamental encryption design pattern presented here.

The RSA key creation in AWS KMS requires two inputs: key length and type of usage. In this example, I created a 2048-bit key and assigned its use for encryption and decryption. The cryptographic configuration of an RSA CMK created in AWS KMS is shown in Figure 4.

Figure 4: Cryptographic properties of an RSA key managed by AWS KMS

Figure 4: Cryptographic properties of an RSA key managed by AWS KMS

Of the two encryption algorithms shown in Figure 4— RSAES_OAEP_SHA_256 and RSAES_OAEP_SHA_1, this example uses RSAES_OAEP_SHA_256. The combination of a 2048-bit key and the RSAES_OAEP_SHA_256 algorithm lets you encrypt a maximum of 190 bytes of data, which is enough for most PII fields. You can choose a different key length and encryption algorithm depending on your security and performance requirements. How to choose your CMK configuration includes information about RSA key specs for encryption and decryption.

Using AWS KMS for RSA key management versus managing the keys yourself eliminates that complexity and can help you:

  • Enforce IAM and key policies that describe administrative and usage permissions for keys.
  • Manage cross-account access for keys.
  • Monitor and alarm on key operations through Amazon CloudWatch.
  • Audit AWS KMS API invocations through AWS CloudTrail.
  • Record configuration changes to keys and enforce key specification compliance through AWS Config.
  • Generate high-entropy keys in an AWS KMS hardware security module (HSM) as required by NIST.
  • Store RSA private keys securely, without the ability to export.
  • Perform RSA decryption within AWS KMS without exposing private keys to application code.
  • Categorize and report on keys with key tags for cost allocation.
  • Disable keys and schedule their deletion.

You need to extract the RSA public key from AWS KMS so you can include it in the AWS Lambda deployment package. You can do this from the AWS Management Console, through the AWS KMS SDK, or by using the get-public-key command in the AWS Command Line Interface (AWS CLI). Figure 5 shows Copy and Download options for a public key in the Public key tab of the AWS KMS console.

Figure 5: RSA public key available for copy or download in the console

Figure 5: RSA public key available for copy or download in the console

Note: As we will see in the sample code in step 3, we embed the public key in the Lambda@Edge deployment package. This is a permissible practice because public keys in asymmetric cryptography systems aren’t a secret and can be freely distributed to entities that need to perform encryption. Alternatively, you can use Lambda@Edge to query AWS KMS for the public key at runtime. However, this introduces latency, increases the load against your KMS account quota, and increases your AWS costs. General patterns for using external data in Lambda@Edge are described in Leveraging external data in Lambda@Edge.

Step 2 – HTTP API request handling by CloudFront

CloudFront receives an HTTP(S) request from a client. CloudFront then invokes Lambda@Edge during origin-request processing and includes the HTTP request body in the invocation.

Step 3 – Lambda@Edge processing

The Lambda@Edge function processes the HTTP request body. The function extracts sensitive data fields and performs RSA encryption over their values.

The following code is sample source code for the Lambda@Edge function implemented in Python 3.7:

import Crypto
import base64
import json
from Crypto.Cipher import PKCS1_OAEP
from Crypto.PublicKey import RSA

# PEM-formatted RSA public key copied over from AWS KMS or your own public key.
RSA_PUBLIC_KEY = "-----BEGIN PUBLIC KEY-----<your key>-----END PUBLIC KEY-----"
RSA_PUBLIC_KEY_OBJ = RSA.importKey(RSA_PUBLIC_KEY)
RSA_CIPHER_OBJ = PKCS1_OAEP.new(RSA_PUBLIC_KEY_OBJ, Crypto.Hash.SHA256)

# Example sensitive data field names in a JSON object. 
PII_SENSITIVE_FIELD_NAMES = ["fname", "lname", "email", "ssn", "dob", "phone"]

CIPHERTEXT_PREFIX = "#01#"
CIPHERTEXT_SUFFIX = "#10#"

def lambda_handler(event, context):
    # Extract HTTP request and its body as per documentation:
    # https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-event-structure.html
    http_request = event['Records'][0]['cf']['request']
    body = http_request['body']
    org_body = base64.b64decode(body['data'])
    mod_body = protect_sensitive_fields_json(org_body)
    body['action'] = 'replace'
    body['encoding'] = 'text'
    body['data'] = mod_body
    return http_request


def protect_sensitive_fields_json(body):
    # Encrypts sensitive fields in sample JSON payload shown earlier in this post.
    # [{"fname": "Alejandro", "lname": "Rosalez", … }]
    person_list = json.loads(body.decode("utf-8"))
    for person_data in person_list:
        for field_name in PII_SENSITIVE_FIELD_NAMES:
            if field_name not in person_data:
                continue
            plaintext = person_data[field_name]
            ciphertext = RSA_CIPHER_OBJ.encrypt(bytes(plaintext, 'utf-8'))
            ciphertext_b64 = base64.b64encode(ciphertext).decode()
            # Optionally, add unique prefix/suffix patterns to ciphertext
            person_data[field_name] = CIPHERTEXT_PREFIX + ciphertext_b64 + CIPHERTEXT_SUFFIX 
    return json.dumps(person_list)

The event structure passed into the Lambda@Edge function is described in Lambda@Edge Event Structure. Following the event structure, you can extract the HTTP request body. In this example, the assumption is that the HTTP payload carries a JSON document based on a particular schema defined as part of the API contract. The input JSON document is parsed by the function, converting it into a Python dictionary. The Python native dictionary operators are then used to extract the sensitive field values.

Note: If you don’t know your API payload structure ahead of time or you’re dealing with unstructured payloads, you can use techniques such as regular expression pattern searches and checksums to look for patterns of sensitive data and target them accordingly. For example, credit card primary account numbers include a Luhn checksum that can be programmatically detected. Additionally, services such as Amazon Comprehend and Amazon Macie can be leveraged for detecting sensitive data such as PII in application payloads.

While iterating over the sensitive fields, individual field values are encrypted using the standard RSA encryption implementation available in the Python Cryptography Toolkit (PyCrypto). The PyCrypto module is included within the Lambda@Edge zip archive as described in Lambda@Edge deployment package.

The example uses the standard optimal asymmetric encryption padding (OAEP) and SHA-256 encryption algorithm properties. These properties are supported by AWS KMS and will allow RSA ciphertext produced here to be decrypted by AWS KMS later.

Note: You may have noticed in the code above that we’re bracketing the ciphertexts with predefined prefix and suffix strings:

person_data[field_name] = CIPHERTEXT_PREFIX + ciphertext_b64 + CIPHERTEXT_SUFFIX

This is an optional measure and is being implemented to simplify the decryption process.

The prefix and suffix strings help demarcate ciphertext embedded in unstructured data in downstream processing and also act as embedded metadata. Unique prefix and suffix strings allow you to extract ciphertext through string or regular expression (regex) searches during the decryption process without having to know the data body format or schema, or the field names that were encrypted.

Distinct strings can also serve as indirect identifiers of RSA key pair identifiers. This can enable key rotation and allow separate keys to be used for separate fields depending on the data security requirements for individual fields.

You can ensure that the prefix and suffix strings can’t collide with the ciphertext by bracketing them with characters that don’t appear in cyphertext. For example, a hash (#) character cannot be part of a base64 encoded ciphertext string.

Deploying a Lambda function as a Lambda@Edge function requires specific IAM permissions and an IAM execution role. Follow the Lambda@Edge deployment instructions in Setting IAM Permissions and Roles for Lambda@Edge.

Step 4 – Lambda@Edge response

The Lambda@Edge function returns the modified HTTP body back to CloudFront and instructs it to replace the original HTTP body with the modified one by setting the following flag:

http_request['body']['action'] = 'replace'

Step 5 – Forward the request to the origin server

CloudFront forwards the modified request body provided by Lambda@Edge to the origin server. In this example, the origin server writes the data body to persistent storage for later processing.

Field-level decryption process

An application that’s authorized to access sensitive data for a business function can decrypt that data. An example decryption process is shown in Figure 6. The figure shows a Lambda function as an example compute environment for invoking AWS KMS for decryption. This functionality isn’t dependent on Lambda and can be performed in any compute environment that has access to AWS KMS.

Figure 6: Field-level decryption process

Figure 6: Field-level decryption process

The steps of the process shown in Figure 6 are described below.

Step 1 – Application retrieves the field-level encrypted data

The example application retrieves the field-level encrypted data from persistent storage that had been previously written during the data ingestion process.

Step 2 – Application invokes the decryption Lambda function

The application invokes a Lambda function responsible for performing field-level decryption, sending the retrieved data to Lambda.

Step 3 – Lambda calls the AWS KMS decryption API

The Lambda function uses AWS KMS for RSA decryption. The example calls the KMS decryption API that inputs ciphertext and returns plaintext. The actual decryption happens in KMS; the RSA private key is never exposed to the application, which is a highly desirable characteristic for building secure applications.

Note: If you choose to use an external key pair, then you can securely store the RSA private key in AWS services like AWS Systems Manager Parameter Store or AWS Secrets Manager and control access to the key through IAM and resource policies. You can fetch the key from relevant vault using the vault’s API, then decrypt using the standard RSA implementation available in your programming language. For example, the cryptography toolkit in Python or javax.crypto in Java.

The Lambda function Python code for decryption is shown below.

import base64
import boto3
import re

kms_client = boto3.client('kms')
CIPHERTEXT_PREFIX = "#01#"
CIPHERTEXT_SUFFIX = "#10#"

# This lambda function extracts event body, searches for and decrypts ciphertext 
# fields surrounded by provided prefix and suffix strings in arbitrary text bodies 
# and substitutes plaintext fields in-place.  
def lambda_handler(event, context):    
    org_data = event["body"]
    mod_data = unprotect_fields(org_data, CIPHERTEXT_PREFIX, CIPHERTEXT_SUFFIX)
    return mod_data

# Helper function that performs non-greedy regex search for ciphertext strings on
# input data and performs RSA decryption of them using AWS KMS 
def unprotect_fields(org_data, prefix, suffix):
    regex_pattern = prefix + "(.*?)" + suffix
    mod_data_parts = []
    cursor = 0

    # Search ciphertexts iteratively using python regular expression module
    for match in re.finditer(regex_pattern, org_data):
        mod_data_parts.append(org_data[cursor: match.start()])
        try:
            # Ciphertext was stored as Base64 encoded in our example. Decode it.
            ciphertext = base64.b64decode(match.group(1))

            # Decrypt ciphertext using AWS KMS  
            decrypt_rsp = kms_client.decrypt(
                EncryptionAlgorithm="RSAES_OAEP_SHA_256",
                KeyId="<Your-Key-ID>",
                CiphertextBlob=ciphertext)
            decrypted_val = decrypt_rsp["Plaintext"].decode("utf-8")
            mod_data_parts.append(decrypted_val)
        except Exception as e:
            print ("Exception: " + str(e))
            return None
        cursor = match.end()

    mod_data_parts.append(org_data[cursor:])
    return "".join(mod_data_parts)

The function performs a regular expression search in the input data body looking for ciphertext strings bracketed in predefined prefix and suffix strings that were added during encryption.

While iterating over ciphertext strings one-by-one, the function calls the AWS KMS decrypt() API. The example function uses the same RSA encryption algorithm properties—OAEP and SHA-256—and the Key ID of the public key that was used during encryption in Lambda@Edge.

Note that the Key ID itself is not a secret. Any application can be configured with it, but that doesn’t mean any application will be able to perform decryption. The security control here is that the AWS KMS key policy must allow the caller to use the Key ID to perform the decryption. An additional security control is provided by Lambda execution role that should allow calling the KMS decrypt() API.

Step 4 – AWS KMS decrypts ciphertext and returns plaintext

To ensure that only authorized users can perform decrypt operation, the KMS is configured as described in Using key policies in AWS KMS. In addition, the Lambda IAM execution role is configured as described in AWS Lambda execution role to allow it to access KMS. If both the key policy and IAM policy conditions are met, KMS returns the decrypted plaintext. Lambda substitutes the plaintext in place of ciphertext in the encapsulating data body.

Steps three and four are repeated for each ciphertext string.

Step 5 – Lambda returns decrypted data body

Once all the ciphertext has been converted to plaintext and substituted in the larger data body, the Lambda function returns the modified data body to the client application.

Conclusion

In this post, I demonstrated how you can implement field-level encryption integrated with AWS KMS to help protect sensitive data workloads for their entire lifecycle in AWS. Since your Lambda@Edge is designed to protect data at the network edge, data remains protected throughout the application execution stack. In addition to improving your data security posture, this protection can help you comply with data privacy regulations applicable to your organization.

Since you author your own Lambda@Edge function to perform standard RSA encryption, you have flexibility in terms of payload formats and the number of fields that you consider to be sensitive. The integration with AWS KMS for RSA key management and decryption provides significant simplicity, higher key security, and rich integration with other AWS security services enabling an overall strong security solution.

By using encrypted fields with identifiers as described in this post, you can create fine-grained controls for data accessibility to meet the security principle of least privilege. Instead of granting either complete access or no access to data fields, you can ensure least privileges where a given part of an application can only access the fields that it needs, when it needs to, all the way down to controlling access field by field. Field by field access can be enabled by using different keys for different fields and controlling their respective policies.

In addition to protecting sensitive data workloads to meet regulatory and security best practices, this solution can be used to build de-identified data lakes in AWS. Sensitive data fields remain protected throughout their lifecycle, while non-sensitive data fields remain in the clear. This approach can allow analytics or other business functions to operate on data without exposing sensitive data.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Raj Jain

Raj is a Senior Cloud Architect at AWS. He is passionate about helping customers build well-architected applications in AWS. Raj is a published author in Bell Labs Technical Journal, has authored 3 IETF standards, and holds 12 patents in internet telephony and applied cryptography. In his spare time, Raj enjoys outdoors, cooking, reading, and travel.

AWS and EU data transfers: strengthened commitments to protect customer data

Post Syndicated from Stephen Schmidt original https://aws.amazon.com/blogs/security/aws-and-eu-data-transfers-strengthened-commitments-to-protect-customer-data/

Last year we published a blog post describing how our customers can transfer personal data in compliance with both GDPR and the new “Schrems II” ruling. In that post, we set out some of the robust and comprehensive measures that AWS takes to protect customers’ personal data.

Today, we are announcing strengthened contractual commitments that go beyond what’s required by the Schrems II ruling and currently provided by other cloud providers to protect the personal data that customers entrust AWS to process (customer data). Significantly, these new commitments apply to all customer data subject to GDPR processed by AWS, whether it is transferred outside the European Economic Area (EEA) or not. These commitments are automatically available to all customers using AWS to process their customer data, with no additional action required, through a new supplementary addendum to the AWS GDPR Data Processing Addendum.

Our strengthened contractual commitments include:

  • Challenging law enforcement requests: We will challenge law enforcement requests for customer data from governmental bodies, whether inside or outside the EEA, where the request conflicts with EU law, is overbroad, or where we otherwise have any appropriate grounds to do so.
  • Disclosing the minimum amount necessary: We also commit that if, despite our challenges, we are ever compelled by a valid and binding legal request to disclose customer data, we will disclose only the minimum amount of customer data necessary to satisfy the request.

These strengthened commitments to our customers build on our long track record of challenging law enforcement requests. AWS rigorously limits – or rejects outright – law enforcement requests for data coming from any country, including the United States, where they are overly broad or we have any appropriate grounds to do so.

These commitments further demonstrate AWS’s dedication to securing our customers’ data: it is AWS’s highest priority. We implement rigorous contractual, technical, and organizational measures to protect the confidentiality, integrity, and availability of customer data regardless of which AWS Region a customer selects. Customers have complete control over their data through powerful AWS services and tools that allow them to determine where data will be stored, how it is secured, and who has access.

For example, customers using our latest generation of EC2 instances automatically gain the protection of the AWS Nitro System. Using purpose-built hardware, firmware, and software, AWS Nitro provides unique and industry-leading security and isolation by offloading the virtualization of storage, security, and networking resources to dedicated hardware and software. This enhances security by minimizing the attack surface and prohibiting administrative access while improving performance. Nitro was designed to operate in the most hostile network we could imagine, building in encryption, secure boot, a hardware-based root of trust, a decreased Trusted Computing Base (TCB) and restrictions on operator access. The newly announced AWS Nitro Enclaves feature enables customers to create isolated compute environments with cryptographic controls to assure the integrity of code that is processing highly sensitive data.

AWS encrypts all data in transit, including secure and private connectivity between EC2 instances of all types. Customers can rely on our industry leading encryption features and take advantage of AWS Key Management Services to control and manage their own keys within FIPS-140-2 certified hardware security modules. Regardless of whether data is encrypted or unencrypted, we will always work vigilantly to protect data from any unauthorized access. Find out more about our approach to data privacy.

AWS is constantly working to ensure that our customers can enjoy the benefits of AWS everywhere they operate. We will continue to update our practices to meet the evolving needs and expectations of customers and regulators, and fully comply with all applicable laws in every country in which we operate. With these changes, AWS continues our customer obsession by offering tooling, capabilities, and contractual rights that nobody else does.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Steve Schmidt

Steve is Vice President and Chief Information Security Officer for AWS. His duties include leading product design, management, and engineering development efforts focused on bringing the competitive, economic, and security benefits of cloud computing to business and government customers. Prior to AWS, he had an extensive career at the Federal Bureau of Investigation, where he served as a senior executive and section chief. He currently holds 11 patents in the field of cloud security architecture. Follow Steve on Twitter.