Improve your security investigations with Detective finding groups visualizations

2023-08-29 Rich Vorwaller

Post Syndicated from Rich Vorwaller original https://aws.amazon.com/blogs/security/improve-your-security-investigations-with-detective-finding-groups-visualizations/

At AWS, we often hear from customers that they want expanded security coverage for the multiple services that they use on AWS. However, alert fatigue is a common challenge that customers face as we introduce new security protections. The challenge becomes how to operationalize, identify, and prioritize alerts that represent real risk.

In this post, we highlight recent enhancements to Amazon Detective finding groups visualizations. We show you how Detective automatically consolidates multiple security findings into a single security event (called a finding group) and how finding groups visualizations help reduce noise and prioritize findings that present true risk. We incorporate additional services like Amazon GuardDuty, Amazon Inspector, and AWS Security Hub to highlight how effective finding groups are at consolidating findings from different AWS security services.

Overview of solution

This post uses several different services. The purpose is twofold: to show how you can enable these services for broader protection, and to show how Detective can help you investigate findings from multiple services without spending a lot of time sifting through logs or querying multiple data sources to find the root cause of a security event. These are the services and their use cases:

GuardDuty – a threat detection service that continuously monitors your AWS accounts and workloads for malicious activity. If potential malicious activity, such as anomalous behavior, credential exfiltration, or command and control (C2) infrastructure communication is detected, GuardDuty generates detailed security findings that you can use for visibility and remediation. Recently, GuardDuty released the following threat detections for specific services that we’ll show you how to enable for this walkthrough: GuardDuty RDS Protection, EKS Runtime Monitoring, and Lambda Protection.
Amazon Inspector – an automated vulnerability management service that continually scans your AWS workloads for software vulnerabilities and unintended network exposure. Like GuardDuty, Amazon Inspector sends a finding for alerting and remediation when it detects a software vulnerability or a compute instance that’s publicly available.
Security Hub – a cloud security posture management service that performs automated, continuous security best practice checks against your AWS resources to help you identify misconfigurations, and aggregates your security findings from integrated AWS security services.
Detective – a security service that helps you investigate potential security issues. It does this by collecting log data from AWS CloudTrail, Amazon Virtual Private Cloud (Amazon VPC) flow logs, and other services. Detective then uses machine learning, statistical analysis, and graph theory to build a linked set of data called a security behavior graph that you can use to conduct faster and more efficient security investigations.

The following diagram shows how each service delivers findings along with log sources to Detective.

Figure 1: Amazon Detective log source diagram

Enable the required services

If you’ve already enabled the services needed for this post (GuardDuty, Amazon Inspector, Security Hub, and Detective), skip to the next section. For instructions on how to enable these services, see the getting started documentation for each service.

Each of these services offers a free 30-day trial and provides estimates on charges after your trial expires. You can also use the AWS Pricing Calculator to get an estimate.

To enable the services across multiple accounts, consider using a delegated administrator account in AWS Organizations. With a delegated administrator account, you can automatically enable services for multiple accounts and manage settings for each account in your organization. You can view other accounts in the organization and add them as member accounts, making central management simpler. For instructions on how to enable the services with AWS Organizations, see the AWS Organizations integration documentation for each service.

Enable GuardDuty protections

The next step is to enable the latest detections in GuardDuty and learn how Detective can identify multiple threats that are related to a single security event.

If you’ve already enabled the different GuardDuty protection plans, skip to the next section. If you recently enabled GuardDuty, the protection plans are enabled by default, except for EKS Runtime Monitoring, which is a two-step process.

For the next steps, we use the delegated administrator account in GuardDuty to make sure that the protection plans are enabled for each AWS account. When you use GuardDuty (or Security Hub, Detective, and Amazon Inspector) with AWS Organizations, you can designate an account to be the delegated administrator. This lets you configure these security services for multiple accounts at the same time. For instructions on how to enable a delegated administrator account for GuardDuty, see Managing GuardDuty accounts with AWS Organizations.

To enable EKS Protection

Sign in to the GuardDuty console using the delegated administrator account, choose Protection plans, and then choose EKS Protection.
In the Delegated administrator section, choose Edit and then choose Enable for each scope or protection. For this post, select EKS Audit Log Monitoring, EKS Runtime Monitoring, and Manage agent automatically, as shown in Figure 2. For more information on each feature, see the following resources:
- EKS Audit Log Monitoring
- EKS Runtime Monitoring
To enable these protections for current accounts, in the Active member accounts section, choose Edit and Enable for each scope of protection.
To enable these protections for new accounts, in the New account default configuration section, choose Edit and Enable for each scope of protection.
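
If you prefer to script the console steps above, the following is a minimal sketch of the request you might send for the delegated administrator's own detector. It assumes the AWS SDK for Python (Boto3) and the GuardDuty UpdateDetector operation with its Features parameter. The function name eks_protection_request is hypothetical, and the feature names should be checked against the current GuardDuty API reference before use.

```python
from typing import Any


def eks_protection_request(detector_id: str, manage_agent: bool = True) -> dict[str, Any]:
    """Build keyword arguments for GuardDuty's UpdateDetector operation."""
    return {
        "DetectorId": detector_id,
        "Features": [
            {"Name": "EKS_AUDIT_LOGS", "Status": "ENABLED"},
            {
                "Name": "EKS_RUNTIME_MONITORING",
                "Status": "ENABLED",
                # Corresponds to the "Manage agent automatically" console option.
                "AdditionalConfiguration": [
                    {
                        "Name": "EKS_ADDON_MANAGEMENT",
                        "Status": "ENABLED" if manage_agent else "DISABLED",
                    }
                ],
            },
        ],
    }
```

With Boto3 installed and credentials for the delegated administrator account, the request could be sent with boto3.client("guardduty").update_detector(**eks_protection_request(detector_id)). Keeping the payload in a plain function makes it easy to review before anything is changed in your account.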

To enable RDS Protection

The next step is to enable RDS Protection. GuardDuty RDS Protection works by analyzing RDS login activity for potential threats to your Amazon Aurora databases (Aurora MySQL-Compatible Edition and Aurora PostgreSQL-Compatible Edition). Using this feature, you can identify potentially suspicious login behavior and then use Detective to investigate CloudTrail logs, VPC flow logs, and other useful information around those events.

Navigate to the RDS Protection menu and under Delegated administrator (this account), select Enable and Confirm.
In the Enabled for section, select Enable all if you want RDS Protection enabled on all of your accounts. If you want to select a specific account, choose Manage Accounts and then select the accounts for which you want to enable RDS Protection. With the accounts selected, choose Edit Protection Plans, RDS Login Activity, and Enable for X selected account.
(Optional) For new accounts, turn on Auto-enable RDS Login Activity Monitoring for new member accounts as they join your organization.
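
For a specific set of member accounts, the same choice can be sketched as a series of UpdateMemberDetectors requests. This is an illustration only: the helper name is hypothetical, and the batch size of 50 accounts per request is an assumption that you should confirm in the GuardDuty API reference.

```python
from typing import Any, Iterator

# Assumed per-request limit on AccountIds; confirm in the API reference.
MAX_ACCOUNTS_PER_CALL = 50


def rds_protection_requests(
    detector_id: str, account_ids: list[str]
) -> Iterator[dict[str, Any]]:
    """Yield UpdateMemberDetectors keyword arguments, one batch at a time."""
    for start in range(0, len(account_ids), MAX_ACCOUNTS_PER_CALL):
        yield {
            "DetectorId": detector_id,
            "AccountIds": account_ids[start:start + MAX_ACCOUNTS_PER_CALL],
            "Features": [{"Name": "RDS_LOGIN_EVENTS", "Status": "ENABLED"}],
        }
```

Each yielded dictionary could be passed to the Boto3 GuardDuty client as update_member_detectors(**request) from the delegated administrator account.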

Figure 2: Enable EKS Runtime Monitoring

To enable Lambda Protection

The final step is to enable Lambda Protection. Lambda Protection helps detect potential security threats during the invocation of AWS Lambda functions. By monitoring network activity logs, GuardDuty can generate findings when Lambda functions are involved with malicious activity, such as communicating with command and control servers.

Navigate to the Lambda Protection menu and under Delegated administrator (this account), select Enable and Confirm.
In the Enabled for section, select Enable all if you want Lambda Protection enabled on all of your accounts. If you want to select a specific account, choose Manage Accounts and select the accounts for which you want to enable Lambda Protection. With the accounts selected, choose Edit Protection Plans, Lambda Network Activity Monitoring, and Enable for X selected account.
(Optional) For new accounts, turn on Auto-enable Lambda Network Activity Monitoring for new member accounts as they join your organization.
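
The optional auto-enable setting could also be expressed as an UpdateOrganizationConfiguration request. The sketch below assumes the Features parameter of that operation accepts an AutoEnable value of NEW, ALL, or NONE per feature. The function name is hypothetical.

```python
from typing import Any


def lambda_auto_enable_request(detector_id: str, scope: str = "NEW") -> dict[str, Any]:
    """Build UpdateOrganizationConfiguration keyword arguments for Lambda Protection."""
    if scope not in {"NEW", "ALL", "NONE"}:
        raise ValueError(f"unsupported auto-enable scope: {scope!r}")
    return {
        "DetectorId": detector_id,
        # "NEW" applies the feature to accounts as they join the organization.
        "Features": [{"Name": "LAMBDA_NETWORK_LOGS", "AutoEnable": scope}],
    }
```

The resulting dictionary could be passed to the Boto3 GuardDuty client as update_organization_configuration(**request).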

Figure 4: Enable Lambda Network Activity Monitoring

Now that you’ve enabled these new protections, GuardDuty will start monitoring EKS audit logs, EKS runtime activity, RDS login activity, and Lambda network activity. If GuardDuty detects suspicious or malicious activity for these log sources or services, it will generate a finding for the activity, which you can review in the GuardDuty console. In addition, you can automatically forward these findings to Security Hub for consolidation, and to Detective for security investigation.

Detective data sources

If you have Security Hub and other AWS security services such as GuardDuty or Amazon Inspector enabled, findings from these services are forwarded to Security Hub. With the exception of sensitive data findings from Amazon Macie, you’re automatically opted in to other AWS service integrations when you enable Security Hub. For the full list of services that forward findings to Security Hub, see Available AWS service integrations.

With each service enabled and forwarding findings to Security Hub, the next step is to enable the data source in Detective called AWS security findings, which are the findings forwarded to Security Hub. Again, we’re going to use the delegated administrator account for these steps to make sure that AWS security findings are being ingested for your accounts.

To enable AWS security findings

Sign in to the Detective console using the delegated administrator account and navigate to Settings and then General.
Choose Optional source packages, Edit, select AWS security findings, and then choose Save.
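
As a rough programmatic equivalent, the sketch below turns a Detective ListGraphs response into UpdateDatasourcePackages requests. It assumes that the AWS security findings source package is identified as ASFF_SECURITYHUB_FINDING in the Detective API. The helper name is hypothetical.

```python
from typing import Any

# Assumed API identifier for the "AWS security findings" source package.
SECURITY_FINDINGS_PACKAGE = "ASFF_SECURITYHUB_FINDING"


def security_findings_requests(list_graphs_response: dict[str, Any]) -> list[dict[str, Any]]:
    """Build one UpdateDatasourcePackages request per behavior graph."""
    return [
        {"GraphArn": graph["Arn"], "DatasourcePackages": [SECURITY_FINDINGS_PACKAGE]}
        for graph in list_graphs_response.get("GraphList", [])
    ]
```

With a Boto3 Detective client named detective, each request could be applied with detective.update_datasource_packages(**request), using detective.list_graphs() as the input.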

Figure 5: Enable AWS security findings

When you enable Detective, it immediately starts creating a security behavior graph for AWS security findings to build a linked dataset between findings and entities, such as RDS login activity from Aurora databases, EKS runtime activity, and suspicious network activity for Lambda functions. For GuardDuty to detect potential threats that affect your database instances, it first needs to undertake a learning period of up to two weeks to establish a baseline of normal behavior. For more information, see How RDS Protection uses RDS login activity monitoring. For the other protections, after suspicious activity is detected, you can start to see findings in both GuardDuty and Security Hub consoles. This is where you can start using Detective to better understand which findings are connected and where to prioritize your investigations.

Detective behavior graph

As Detective ingests data from GuardDuty, Amazon Inspector, and Security Hub, as well as CloudTrail logs, VPC flow logs, and Amazon Elastic Kubernetes Service (Amazon EKS) audit logs, it builds a behavior graph database. Graph databases are purpose-built to store and navigate relationships. Relationships are first-class citizens in graph databases, which means that they’re not computed out-of-band or by inferring relationships through querying foreign keys. Because Detective stores information on relationships in your graph database, you can effectively answer questions such as “Are these security findings related?” In Detective, you can use the search menu and profile panels to view these connections, but a quicker way to see this information is by using finding groups visualizations.
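
To make the idea of stored relationships concrete, here is a toy sketch in Python. It is not how Detective is implemented. It only shows why a graph makes the question "are these findings related?" a simple traversal. The class name and the node labels are hypothetical.

```python
from collections import deque
from typing import Optional


class BehaviorGraph:
    """Toy undirected graph whose nodes are findings and entities."""

    def __init__(self) -> None:
        self._neighbors: dict[str, set[str]] = {}

    def link(self, a: str, b: str) -> None:
        """Store a relationship between two nodes in both directions."""
        self._neighbors.setdefault(a, set()).add(b)
        self._neighbors.setdefault(b, set()).add(a)

    def path(self, start: str, goal: str) -> Optional[list[str]]:
        """Return a shortest chain of relationships, or None if unrelated."""
        previous: dict[str, Optional[str]] = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                route = []
                while node is not None:
                    route.append(node)
                    node = previous[node]
                return route[::-1]
            for neighbor in self._neighbors.get(node, ()):
                if neighbor not in previous:
                    previous[neighbor] = node
                    queue.append(neighbor)
        return None
```

For example, after linking two findings to the same IAM role, path() returns the chain finding, role, finding, whereas a finding attached only to an unrelated instance returns None.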

Finding groups visualizations

Finding groups extract additional information out of the behavior graph to highlight findings that are highly connected. Detective does this by running several machine learning algorithms across your behavior graph to identify related findings and then statistically weighs the relationships between those findings and entities. The result is a finding group that shows GuardDuty and Amazon Inspector findings that are connected, along with entities like Amazon Elastic Compute Cloud (Amazon EC2) instances, AWS accounts, and AWS Identity and Access Management (IAM) roles and sessions that were impacted by these findings. With finding groups, you can more quickly understand the relationships between multiple findings and their causes because you don’t need to connect the dots on your own. Detective automatically does this and presents a visualization so that you can see the relationships between various entities and findings.
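
The grouping idea can be illustrated with a much simpler stand-in than the machine learning that Detective uses: treat findings that share an entity as connected and collect the connected components. The sketch below is an assumption-laden simplification using a union-find structure, with hypothetical finding and entity names. It does not weigh relationships the way Detective does.

```python
from collections import defaultdict


def group_findings(finding_entities: dict[str, set[str]]) -> list[list[str]]:
    """Group findings that share entities, directly or transitively."""
    parent = {finding: finding for finding in finding_entities}

    def find(item: str) -> str:
        # Follow parent links to the group root, compressing the path.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    first_seen: dict[str, str] = {}  # entity -> first finding that touched it
    for finding, entities in finding_entities.items():
        for entity in entities:
            if entity in first_seen:
                parent[find(finding)] = find(first_seen[entity])
            else:
                first_seen[entity] = finding

    groups: dict[str, list[str]] = defaultdict(list)
    for finding in finding_entities:
        groups[find(finding)].append(finding)
    # Largest groups first, since they are the most connected.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))
```

Given three GuardDuty findings that share an IAM role or an IP address and one unrelated Amazon Inspector finding, this returns one group of three followed by a group of one.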

Enhanced visualizations

Recently, we released several enhancements to finding groups visualizations to aid your understanding of security connections and root causes. These enhancements include:

Dynamic legend – the legend now shows icons for entities that you have in the finding group instead of showing all available entities. This helps reduce noise to only those entities that are relevant to your investigation.
Aggregated evidence and finding icons – these icons provide a count of similar evidence and findings. Instead of seeing the same finding or evidence repeated multiple times, you’ll see one icon with a counter to help reduce noise.
More descriptive side panel information – when you choose a finding or entity, the side panel shows additional information, such as the service that identified the finding and the finding title, in addition to the finding type, to help you understand the action that invoked the finding.
Label titles – you can now turn on or off titles for entities and findings in the visualization so that you don’t have to choose each to get a summary of what the different icons mean.

To use the finding groups visualization

Open the Detective console, choose Summary, and then choose View all finding groups.
Choose the title of an available finding group and scroll down to Visualization.
Under the Select layout menu, choose one of the layouts available, or choose and drag each icon to rearrange the layout according to how you’d like to see connections.
For a complete list of involved entities and involved findings, scroll down below the visualization.

Figure 6 shows an example of how you can use finding groups visualization to help identify the root cause of findings quickly. In this example, an IAM role was connected to newly observed geolocations, multiple GuardDuty findings detected malicious API calls, and there were newly observed user agents from the IAM session. The visualization can give you high confidence that the IAM role is compromised. It also provides other entities that you can search against, such as the IP address, S3 bucket, or new user agents.

Figure 6: Finding groups visualization

Now that you have the new GuardDuty protections enabled along with the data source of AWS security findings, you can use finding groups to more quickly visualize which IAM sessions have had multiple findings associated with unauthorized access, or which EC2 instances are publicly exposed with a software vulnerability and active GuardDuty finding—these patterns can help you determine if there is an actual risk.

Conclusion

In this blog post, you learned how to enable new GuardDuty protections and use Detective, finding groups, and visualizations to better identify, operationalize, and prioritize AWS security findings that represent real risk. We also highlighted the new enhancements to visualizations that can help reduce noise and provide summaries of detailed information to help reduce the time it takes to triage findings. If you’d like to see an investigation scenario using Detective, watch the video Amazon Detective Security Scenario Investigation.

If you have feedback about this post, submit comments in the Comments section below. You can also start a new thread on Amazon Detective re:Post or contact AWS Support.

Automate the archive and purge data process for Amazon RDS for PostgreSQL using pg_partman, Amazon S3, and AWS Glue

2023-08-22 Anand Komandooru

Post Syndicated from Anand Komandooru original https://aws.amazon.com/blogs/big-data/automate-the-archive-and-purge-data-process-for-amazon-rds-for-postgresql-using-pg_partman-amazon-s3-and-aws-glue/

The post Archive and Purge Data for Amazon RDS for PostgreSQL and Amazon Aurora with PostgreSQL Compatibility using pg_partman and Amazon S3 proposes data archival as a critical part of data management and shows how to efficiently use PostgreSQL’s native range partition to partition current (hot) data with pg_partman and archive historical (cold) data in Amazon Simple Storage Service (Amazon S3). Customers need a cloud-native automated solution to archive historical data from their databases. Customers want the business logic to be maintained and run from outside the database to reduce the compute load on the database server. This post proposes an automated solution by using AWS Glue for automating the PostgreSQL data archiving and restoration process, thereby streamlining the entire procedure.

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. There is no need to pre-provision, configure, or manage infrastructure. It can also automatically scale resources to meet the requirements of your data processing job, providing a high level of abstraction and convenience. AWS Glue integrates seamlessly with AWS services like Amazon S3, Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Amazon DynamoDB, Amazon Kinesis Data Streams, and Amazon DocumentDB (with MongoDB compatibility) to offer a robust, cloud-native data integration solution.

The features of AWS Glue, which include a scheduler for automating tasks, code generation for ETL (extract, transform, and load) processes, notebook integration for interactive development and debugging, as well as robust security and compliance measures, make it a convenient and cost-effective solution for archival and restoration needs.

Solution overview

The solution combines PostgreSQL’s native range partitioning feature with pg_partman, the Amazon S3 export and import functions in Amazon RDS, and AWS Glue as an automation tool.

The solution involves the following steps:

Provision the required AWS services and workflows using the provided AWS Cloud Development Kit (AWS CDK) project.
Set up your database.
Archive the older table partitions to Amazon S3 and purge them from the database with AWS Glue.
Restore the archived data from Amazon S3 to the database with AWS Glue when there is a business need to reload the older table partitions.

The solution is based on AWS Glue, which takes care of archiving and restoring databases with Availability Zone redundancy. The solution consists of the following technical components:

An Amazon RDS for PostgreSQL Multi-AZ database runs in two private subnets.
AWS Secrets Manager stores database credentials.
An S3 bucket stores Python scripts and database archives.
An S3 Gateway endpoint allows Amazon RDS and AWS Glue to communicate privately with Amazon S3.
AWS Glue uses a Secrets Manager interface endpoint to retrieve database secrets from Secrets Manager.
AWS Glue ETL jobs run in either private subnet. They use the S3 endpoint to retrieve Python scripts. The AWS Glue jobs read the database credentials from Secrets Manager to establish JDBC connections to the database.
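
As an illustration of the last component, the sketch below turns a Secrets Manager secret string into JDBC connection settings. It assumes the secret is a JSON document with host, port, dbname, username, and password keys, which is a common layout for RDS secrets but is not confirmed by this post. The function name is hypothetical.

```python
import json


def jdbc_settings(secret_string: str) -> dict[str, str]:
    """Derive PostgreSQL JDBC settings from a JSON secret string."""
    secret = json.loads(secret_string)
    return {
        "url": f"jdbc:postgresql://{secret['host']}:{secret['port']}/{secret['dbname']}",
        "user": secret["username"],
        "password": secret["password"],
    }
```

Inside a job, the secret string would typically come from the SecretString field returned by the Secrets Manager get_secret_value call in Boto3.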

You can create an AWS Cloud9 environment in one of the private subnets available in your AWS account to set up test data in Amazon RDS. The following diagram illustrates the solution architecture.

Solution Architecture

Prerequisites

For instructions to set up your environment for implementing the solution proposed in this post, refer to Deploy the application in the GitHub repo.

Provision the required AWS resources using AWS CDK

Complete the following steps to provision the necessary AWS resources:

Clone the repository to a new folder on your local desktop.
Create a virtual environment and install the project dependencies.
Deploy the stacks to your AWS account.

The CDK project includes three stacks: vpcstack, dbstack, and gluestack, implemented in the vpc_stack.py, db_stack.py, and glue_stack.py modules, respectively.

These stacks have preconfigured dependencies to simplify the process for you. app.py declares Python modules as a set of nested stacks. It passes a reference from vpcstack to dbstack, and a reference from both vpcstack and dbstack to gluestack.

gluestack reads the following attributes from the parent stacks:

The S3 bucket, VPC, and subnets from vpcstack
The secret, security group, database endpoint, and database name from dbstack
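
The references described above imply a deployment order. The following sketch is not taken from the CDK project. It only models the stated dependencies with Python's standard graphlib module to show the order in which the stacks would need to be created.

```python
from graphlib import TopologicalSorter

# Each stack maps to the stacks it reads attributes from.
STACK_DEPENDENCIES = {
    "vpcstack": set(),
    "dbstack": {"vpcstack"},
    "gluestack": {"vpcstack", "dbstack"},
}


def deployment_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the stacks so that every dependency comes before its dependents."""
    return list(TopologicalSorter(dependencies).static_order())
```

Calling deployment_order(STACK_DEPENDENCIES) places vpcstack first, then dbstack, then gluestack.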

The deployment of the three stacks creates the technical components listed earlier in this post.

Set up your database

Prepare the database using the information provided in Populate and configure the test data on GitHub.

Archive the historical table partition to Amazon S3 and purge it from the database with AWS Glue

The “Maintain and Archive” AWS Glue workflow created in the first step consists of two jobs: “Partman run maintenance” and “Archive Cold Tables.”

The “Partman run maintenance” job runs the partman.run_maintenance_proc() procedure to create new partitions and detach old partitions based on the retention setup in the previous step for the configured table. The “Archive Cold Tables” job identifies the detached old partitions and exports the historical data to an Amazon S3 destination using aws_s3.query_export_to_s3. Finally, the job drops the archived partitions from the database, freeing up storage space. The following screenshot shows the results of running this workflow on demand from the AWS Glue console.
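
The SQL that these two jobs issue might look roughly like the statements built below. This is a sketch rather than the code in the GitHub repo: the function names, the S3 key layout (prefix/partition), and the identifier check are assumptions. Only partman.run_maintenance_proc, aws_s3.query_export_to_s3, and aws_commons.create_s3_uri come from the extensions themselves.

```python
import re

MAINTENANCE_SQL = "CALL partman.run_maintenance_proc()"

_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def _identifier(name: str) -> str:
    """Reject anything that is not a plain lowercase SQL identifier."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def _literal(value: str) -> str:
    """Quote a value as a SQL string literal."""
    return "'" + value.replace("'", "''") + "'"


def archive_statements(
    schema: str, partition: str, bucket: str, prefix: str, region: str
) -> list[str]:
    """Build the export and drop statements for one detached partition."""
    table = f"{_identifier(schema)}.{_identifier(partition)}"
    query = _literal(f"SELECT * FROM {table}")
    s3_uri = (
        "aws_commons.create_s3_uri("
        f"{_literal(bucket)}, {_literal(prefix + '/' + partition)}, {_literal(region)})"
    )
    return [
        f"SELECT * FROM aws_s3.query_export_to_s3({query}, {s3_uri})",
        # Drop only after the export statement has succeeded.
        f"DROP TABLE {table}",
    ]
```

Running the export and the drop in that order, and stopping if the export fails, helps avoid purging data that was never archived.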

Archive job run result

Additionally, you can set up this AWS Glue workflow to be triggered on a schedule, on demand, or with an Amazon EventBridge event. Choose the trigger that best matches your business requirements.

Restore archived data from Amazon S3 to the database

The “Restore from S3” AWS Glue workflow created in the first step consists of one job: “Restore from S3.”

This job initiates the run of the partman.create_partition_time procedure to create a new table partition based on your specified month. It subsequently calls aws_s3.table_import_from_s3 to restore the matched data from Amazon S3 to the newly created table partition.
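
A sketch of the two statements this job might run is shown below. Several details are assumptions rather than facts from the repo: create_partition_time is invoked with SELECT on the assumption that pg_partman exposes it as a function, the partition suffix _pYYYY_MM mirrors the table name used later in this post, the S3 key layout is prefix/partition, and the import uses the text format. The function names are hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def _identifier(name: str) -> str:
    """Reject anything that is not a plain lowercase SQL identifier."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def _literal(value: str) -> str:
    """Quote a value as a SQL string literal."""
    return "'" + value.replace("'", "''") + "'"


def restore_statements(
    schema: str, parent: str, year: int, month: int,
    bucket: str, prefix: str, region: str,
) -> list[str]:
    """Build the statements that recreate and reload one monthly partition."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    parent_table = f"{_identifier(schema)}.{_identifier(parent)}"
    partition = f"{parent}_p{year:04d}_{month:02d}"  # assumed naming scheme
    month_start = f"{year:04d}-{month:02d}-01"
    s3_uri = (
        "aws_commons.create_s3_uri("
        f"{_literal(bucket)}, {_literal(prefix + '/' + partition)}, {_literal(region)})"
    )
    return [
        "SELECT partman.create_partition_time("
        f"{_literal(parent_table)}, ARRAY[{_literal(month_start)}::timestamptz])",
        "SELECT aws_s3.table_import_from_s3("
        f"{_literal(schema + '.' + partition)}, '', '(format text)', {s3_uri})",
    ]
```

The empty second argument to table_import_from_s3 means that all columns are loaded in table order, which suits data that was exported with SELECT * from the same table definition.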

To start the “Restore from S3” workflow, navigate to the workflow on the AWS Glue console and choose Run.
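
If you would rather start the workflow from a script, the following sketch uses the AWS Glue StartWorkflowRun and GetWorkflowRun operations through a Boto3-style client that you pass in. The function name, the polling interval, and the choice of terminal states are assumptions.

```python
import time
from typing import Any, Callable

TERMINAL_STATES = {"COMPLETED", "STOPPED", "ERROR"}


def run_workflow(
    glue_client: Any,
    name: str,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, int]:
    """Start a workflow run and wait for it; return (status, failed actions)."""
    run_id = glue_client.start_workflow_run(Name=name)["RunId"]
    while True:
        run = glue_client.get_workflow_run(Name=name, RunId=run_id)["Run"]
        if run["Status"] in TERMINAL_STATES:
            # A completed run can still contain failed jobs, so report both.
            failed = run.get("Statistics", {}).get("FailedActions", 0)
            return run["Status"], failed
        sleep(poll_seconds)
```

A call such as run_workflow(boto3.client("glue"), "Restore from S3") would block until the run finishes. Checking the failed action count as well as the status is a deliberate choice, because the run status alone might not reveal a failed job.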

The following screenshot shows the “Restore from S3” workflow run details.

Restore job run result

Validate the results

The solution provided in this post automated the PostgreSQL data archival and restoration process using AWS Glue.

You can use the following steps to confirm that the historical data in the database is successfully archived after running the “Maintain and Archive” AWS Glue workflow:

On the Amazon S3 console, navigate to your S3 bucket.
Confirm the archived data is stored in an S3 object as shown in the following screenshot.
From a psql command line tool, use the \dt command to list the available tables and confirm the archived table ticket_purchase_hist_p2020_01 does not exist in the database.

You can use the following steps to confirm that the archived data is restored to the database successfully after running the “Restore from S3” AWS Glue workflow.

From a psql command line tool, use the \dt command to list the available tables and confirm the archived table ticket_purchase_hist_p2020_01 is restored to the database.

Clean up

Use the information provided in Cleanup to clean up your test environment created for testing the solution proposed in this post.

Summary

This post showed how to use AWS Glue workflows to automate the archive and restore process for RDS for PostgreSQL database table partitions using Amazon S3 as archive storage. The automation runs on demand but can be set up to be triggered on a recurring schedule. AWS Glue workflows let you define the sequence and dependencies of jobs, track the progress of each workflow job, view run logs, and monitor the overall health and performance of your tasks. Although we used Amazon RDS for PostgreSQL as an example, the same solution works for Amazon Aurora PostgreSQL-Compatible Edition as well. Modernize your database cron jobs using AWS Glue by using this post and the GitHub repo. Gain a high-level understanding of AWS Glue and its components by using the following hands-on workshop.

About the Authors

Anand Komandooru is a Senior Cloud Architect at AWS. He joined AWS Professional Services organization in 2021 and helps customers build cloud-native applications on AWS cloud. He has over 20 years of experience building software and his favorite Amazon leadership principle is “Leaders are right a lot.”

Li Liu is a Senior Database Specialty Architect with the Professional Services team at Amazon Web Services. She helps customers migrate traditional on-premises databases to the AWS Cloud. She specializes in database design, architecture, and performance tuning.

Neil Potter is a Senior Cloud Application Architect at AWS. He works with AWS customers to help them migrate their workloads to the AWS Cloud. He specializes in application modernization and cloud-native design and is based in New Jersey.

Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 14 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finding areas for home automation.

How AWS built the Security Guardians program, a mechanism to distribute security ownership

2023-08-18 Ana Malhotra

Post Syndicated from Ana Malhotra original https://aws.amazon.com/blogs/security/how-aws-built-the-security-guardians-program-a-mechanism-to-distribute-security-ownership/

Product security teams play a critical role to help ensure that new services, products, and features are built and shipped securely to customers. However, since security teams are in the product launch path, they can form a bottleneck if organizations struggle to scale their security teams to support their growing product development teams. In this post, we will share how Amazon Web Services (AWS) developed a mechanism to scale security processes and expertise by distributing security ownership between security teams and development teams. This mechanism has many names in the industry — Security Champions, Security Advocates, and others — and it’s often part of a shift-left approach to security. At AWS, we call this mechanism Security Guardians.

In many organizations, there are fewer security professionals than product developers. Our experience is that it takes much more time to hire a security professional than other technical job roles, and research conducted by (ISC)² shows that the cybersecurity industry is short 3.4 million workers. When product development teams continue to grow at a faster rate than security teams, the disparity between security professionals and product developers continues to increase as well. Although most businesses understand the importance of security, frustration and tensions can arise when it becomes a bottleneck for the business and its ability to serve customers.

At AWS, we require the teams that build products to undergo an independent security review with an AWS application security engineer before launching. This is a mechanism to verify that new services, features, solutions, vendor applications, and hardware meet our high security bar. This intensive process impacts how quickly product teams can ship to customers. As shown in Figure 1, we found that as the product teams scaled, so did the problem: there were more products being built than the security teams could review and approve for launch. Because security reviews are required and non-negotiable, this could potentially lead to delays in the shipping of products and features.

Figure 1: More products are being developed than can be reviewed and shipped

How AWS builds a culture of security

Because of its size and scale, many customers look to AWS to understand how we scale our own security teams. To tell our story and provide insight, let’s take a look at the culture of security at AWS.

Security is a business priority

At AWS, security is a business priority. Business leaders prioritize building products and services that are designed to be secure, and they consider security to be an enabler of the business rather than an obstacle.

Leaders also strive to create a safe environment by encouraging employees to identify and escalate potential security issues. Escalation is the process of making sure that the right people know about the problem at the right time. Escalation encompasses “Dive Deep”, which is one of our corporate values at Amazon, because it requires owners and leaders to dive into the details of the issue. If you don’t know the details, you can’t make good decisions about what’s going on and how to run your business effectively.

This aspect of the culture goes beyond intention — it’s embedded in our organizational structure:

CISOs and IT leaders play a key role in demystifying what security and compliance represent for the business. At AWS, we made an intentional choice for the security team to report directly to the CEO. The goal was to build security into the structural fabric of how AWS makes decisions, and every week our security team spends time with AWS leadership to ensure we’re making the right choices on tactical and strategic security issues.

– Stephen Schmidt, Chief Security Officer, Amazon, on Building a Culture of Security

Everyone owns security

Because our leadership supports security, it’s understood within AWS that security is everyone’s job. Security teams and product development teams work together to help ensure that products are built and shipped securely. Despite this collaboration, the product teams own the security of their product. They are responsible for making sure that security controls are built into the product and that customers have the tools they need to use the product securely.

On the other hand, central security teams are responsible for helping developers to build securely and verifying that security requirements are met before launch. They provide guidance to help developers understand what security controls to build, provide tools to make it simpler for developers to implement and test controls, provide support in threat modeling activities, use mechanisms to help ensure that customers’ security expectations are met before launch, and so on.

This responsibility model highlights how security ownership is distributed between the security and product development teams. At AWS, we learned that without this distribution, security doesn’t scale. Regardless of the number of security experts we hire, product teams always grow faster. Although the culture around security and the need to distribute ownership is now well understood, without the right mechanisms in place, this model would have collapsed.

Mechanisms compared to good intentions

Mechanisms are the final pillar of AWS culture that has allowed us to successfully distribute security across our organization. A mechanism is a complete process, or virtuous cycle, that reinforces and improves itself as it operates. As shown in Figure 2, a mechanism takes controllable inputs and transforms them into ongoing outputs to address a recurring business challenge. At AWS, the business challenge that we’re facing is that security teams create bottlenecks for the business. The culture of security at AWS provides support to help address this challenge, but we needed a mechanism to actually do it.

Figure 2: AWS sees mechanisms as a complete process, or virtuous cycle

“Often, when we find a recurring problem, something that happens over and over again, we pull the team together, ask them to try harder, do better – essentially, we ask for good intentions. This rarely works… When you are asking for good intentions, you are not asking for a change… because people already had good intentions. But if good intentions don’t work, what does? Mechanisms work.”

– Jeff Bezos, February 1, 2008 All Hands.

At AWS, we’ve learned that we can help solve the challenge of scaling security by distributing security ownership with a mechanism we call the Security Guardians program. Like other mechanisms, it has inputs and outputs, and transforms over time.

AWS distributes security ownership with the Security Guardians program

At AWS, the Security Guardians program trains, develops, and empowers developers to be security ambassadors, or Guardians, within the product teams. At a high level, Guardians make sure that security considerations for a product are made earlier and more often, helping their peers build and ship their product faster. They also work closely with the central security team to help ensure that the security bar at AWS is rising and the Security Guardians program is improving over time. As shown in Figure 3, embedding security expertise within the product teams helps products with Guardian involvement move through security review faster.

Figure 3: Security expertise is embedded in the product teams by Guardians

Guardians are informed, security-minded product builders who volunteer to be consistent champions of security on their teams and are deeply familiar with the security processes and tools. They provide security guidance throughout the development lifecycle and are stakeholders in the security of the products being shipped, helping their teams make informed decisions that lead to more secure, on-time launches. Guardians are the security points-of-contact for their product teams.

In this distributed security ownership model, accountability for product security sits with the product development teams. However, the Guardians are responsible for performing the first evaluation of a development team’s security review submission. They confirm the quality and completeness of the new service’s resources, design documents, threat model, automated findings, and penetration test readiness. The development teams, supported by the Guardian, submit their security review to AWS Application Security (AppSec) engineers for the final pre-launch review.

In practice, as part of this development journey, Guardians help ensure that security considerations are made early, when teams are assessing customer requests and the feature or product design. This can be done by starting the threat modeling processes. Next, they work to make sure that mitigations identified during threat modeling are developed. Guardians also play an active role in software testing, including security scans such as static application security testing (SAST) and dynamic application security testing (DAST). To close out the security review, security engineers work with Guardians to make sure that findings are resolved and the product is ready to ship.

Figure 4: Expedited security review process supported by Guardians

Guardians are, after all, Amazonians. Therefore, Guardians exemplify a number of the Amazon Leadership Principles and often have the following characteristics:

They are exemplary practitioners for security ownership and empower their teams to own the security of their service.
They hold a high security bar and exercise strong security judgement, don’t accept quick or easy answers, and drive continuous improvement.
They advocate for security needs in internal discussions with the product team.
They are thoughtful yet assertive to make customer security a top priority on their team.
They maintain and showcase their security knowledge to their peers, continuously building knowledge from many different sources to gain perspective and to stay up to date on the constantly evolving threat landscape.
They aren’t afraid to have their work independently validated by the central security team.

Expected outcomes

AWS has benefited greatly from the Security Guardians program. We’ve had 22.5 percent fewer medium and high severity security findings generated during the security review process and have taken about 26.9 percent less time to review a new service or feature. This data demonstrates that with Guardians involved we’re identifying fewer issues late in the process, reducing remediation work, and as a result securely shipping services faster for our customers. To help both builders and Guardians improve over time, our security review tool captures feedback from security engineers on their inputs. This helps ensure that our security ownership mechanism reinforces and improves itself over time.

AWS and other organizations have benefited from this mechanism because it generates specialized security resources and distributes security knowledge that scales without needing to hire additional staff.

A program such as this could help your business build and ship faster, as it has for AWS, while maintaining an appropriately high security bar that rises over time. By training builders to be security practitioners and advocates within your development cycle, you can increase the chances of identifying risks and security findings earlier. Surfacing these findings earlier in the development lifecycle can reduce the likelihood of having to patch security bugs or even start over after the product has already been built. We also believe that a consistent security experience for your product teams is an important aspect of successfully distributing your security ownership. An experience with less confusion and friction will help build trust between the product and security teams.

To learn more about building positive security culture for your business, watch this spotlight interview with Stephen Schmidt, Chief Security Officer, Amazon.

If you’re an AWS customer and want to learn more about how AWS built the Security Guardians program, reach out to your local AWS solutions architect or account manager for more information.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Introducing AWS Glue crawler and create table support for Apache Iceberg format

2023-08-17 Sandeep Adwankar

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-crawler-and-create-table-support-for-apache-iceberg-format/

Apache Iceberg is an open table format for large datasets in Amazon Simple Storage Service (Amazon S3) and provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. Iceberg captures metadata information on the state of datasets as they evolve and change over time.

AWS Glue crawlers now support Iceberg tables, enabling you to use the AWS Glue Data Catalog and migrate from other Iceberg catalogs more easily. AWS Glue crawlers will extract schema information and update the location of Iceberg metadata and schema updates in the Data Catalog. You can then query the Data Catalog Iceberg tables across all analytics engines and apply AWS Lake Formation fine-grained permissions.

The Iceberg catalog helps you manage a collection of Iceberg tables and tracks the table’s current metadata. Iceberg provides several implementation options for the Iceberg catalog, including the AWS Glue Data Catalog, Hive Metastore, and JDBC catalogs. Customers prefer using or migrating to the AWS Glue Data Catalog because of its integrations with AWS analytical services such as Amazon Athena, AWS Glue, Amazon EMR, and Lake Formation.

With today’s launch, you can create and schedule an AWS Glue crawler to register existing Iceberg tables in the Data Catalog. You can then provide one or multiple S3 paths where the Iceberg tables are located. You have the option to provide the maximum depth of S3 paths that the crawler can traverse. With each crawler run, the crawler inspects each of the S3 paths and catalogs the schema information, such as new tables, deletes, and updates to schemas in the Data Catalog. Crawlers support schema merging across all snapshots and update the latest metadata file location in the Data Catalog that AWS analytical engines can directly use.

Additionally, AWS Glue is launching support for creating new (empty) Iceberg tables in the Data Catalog using the AWS Glue console or AWS Glue CreateTable API. Before the launch, customers who wanted to adopt Iceberg table format were required to generate Iceberg’s metadata.json file on Amazon S3 using PutObject separately in addition to CreateTable. Often, customers have used the create table statement on analytics engines such as Athena, AWS Glue, and so on. The new CreateTable API eliminates the need to create the metadata.json file separately, and automates generating metadata.json based on the given API input. Also, customers who manage deployments using AWS CloudFormation templates can now create Iceberg tables using the CreateTable API. For more details, refer to Creating Apache Iceberg tables.

For accessing the data using Athena, you can also use Lake Formation to secure your Iceberg table using fine-grained access control permissions when you register the Amazon S3 data location with Lake Formation. For source data in Amazon S3 and metadata that is not registered with Lake Formation, access is determined by AWS Identity and Access Management (IAM) permissions policies for Amazon S3 and AWS Glue actions.

Solution overview

For our example use case, a customer uses Amazon EMR for data processing and Iceberg format for the transactional data. They store their product data in Iceberg format on Amazon S3 and host the metadata of their datasets in Hive Metastore on the EMR primary node. The customer wants to make product data accessible to analyst personas for interactive analysis using Athena. Many AWS analytics services don’t integrate natively with Hive Metastore, so we use an AWS Glue crawler to populate the metadata in the AWS Glue Data Catalog. Athena supports Lake Formation permissions on Iceberg tables, so we apply fine-grained access for data access.

We configure the crawler to onboard the Iceberg schema to the Data Catalog and use Lake Formation access control for crawling. We apply Lake Formation grants on the database and crawled table to enable analyst users to query the data and verify using Athena.

After we populate the schema of the existing Iceberg dataset to the Data Catalog, we onboard new Iceberg tables to the Data Catalog and load data into the newly created table using Athena. We apply Lake Formation grants on the database and newly created table to enable analyst users to query the data and verify using Athena.

The following diagram illustrates the solution architecture.

Set up resources with AWS CloudFormation

To set up the solution resources using AWS CloudFormation, complete the following steps:

Log in to the AWS Management Console as an IAM administrator.
Choose Launch Stack to deploy a CloudFormation template.
Choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

The CloudFormation template generates the following resources:

VPC, subnet, and security group for the EMR cluster
Data lake bucket to store Iceberg table data and metadata
IAM roles for the crawler and Lake Formation registration
EMR cluster and steps to create an Iceberg table with Hive Metastore
Analyst role for data access
Athena bucket path for results

When the stack is complete, on the AWS CloudFormation console, navigate to the Resources tab of the stack.
Note down the values of EmrClusterId, DataLakeBucketName, LFRegisterLocationServiceRole, AWSGlueServiceRole, AthenaBucketName, and LFBusinessAnalystRole.
Navigate to the Amazon EMR console and choose the EMR cluster you created.
Navigate to the Steps tab and verify that the steps were run.

This script run creates the database icebergcrawlerblogdb using Hive and the Iceberg table product. It uses the Hive Metastore server on Amazon EMR as the metastore and stores the data on Amazon S3.

Navigate to the S3 bucket you created and verify that the data and metadata are created for the Iceberg table.

Some of the resources that this stack deploys incur costs when in use.

Now that the data is on Amazon S3, we can register the bucket with Lake Formation to implement access control and centralize the data governance.

Set up Lake Formation permissions

To use the AWS Glue Data Catalog in Lake Formation, complete the following steps to update the Data Catalog settings to use Lake Formation permissions to control Data Catalog resources instead of IAM-based access control:

Sign in to the Lake Formation console as admin.
- If this is the first time accessing the Lake Formation console, add yourself as the data lake administrator.
In the navigation pane, under Data catalog, choose Settings.
Deselect Use only IAM access control for new databases.
Deselect Use only IAM access control for new tables in new databases.
Choose Version 3 for Cross account version settings.
Choose Save.
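The same settings change could be scripted. The following is a minimal sketch, assuming boto3 and the Lake Formation GetDataLakeSettings and PutDataLakeSettings APIs; the function names are hypothetical, and the mapping of the two check boxes to empty default-permission lists and of the cross account version to the CROSS_ACCOUNT_VERSION parameter is an assumption to verify against the Lake Formation documentation.

```python
import copy


def use_lake_formation_permissions(settings: dict) -> dict:
    """Return a copy of the Data Catalog settings without IAM-only defaults."""
    updated = copy.deepcopy(settings)
    # Assumption: empty default-permission lists correspond to deselecting
    # both "Use only IAM access control" check boxes in the console.
    updated["CreateDatabaseDefaultPermissions"] = []
    updated["CreateTableDefaultPermissions"] = []
    # Assumption: cross account version 3 is expressed as this parameter.
    updated.setdefault("Parameters", {})["CROSS_ACCOUNT_VERSION"] = "3"
    return updated


def apply_settings() -> None:
    """Read the current settings, adjust them, and write them back."""
    import boto3  # imported lazily so the helper above works without AWS access

    lf = boto3.client("lakeformation")
    current = lf.get_data_lake_settings()["DataLakeSettings"]
    lf.put_data_lake_settings(
        DataLakeSettings=use_lake_formation_permissions(current)
    )
```

Reading the current settings first matters because the put call replaces the whole settings object, including the list of data lake administrators.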

Now you can set up Lake Formation permissions.

Register the data lake S3 bucket with Lake Formation

To register the data lake S3 bucket, complete the following steps:

On the Lake Formation console, in the navigation pane, choose Data lake locations.
Choose Register location.
For Amazon S3 path, enter the data lake bucket path.
For IAM role, choose the role noted from the CloudFormation template for LFRegisterLocationServiceRole.
Choose Register location.
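As an alternative to the console, the registration could be done with the Lake Formation RegisterResource API. This is a minimal sketch, assuming boto3; the helper names are hypothetical, and the bucket name and role ARN are the values noted from the CloudFormation stack.

```python
def build_register_request(bucket_name: str, role_arn: str) -> dict:
    """Build the RegisterResource request for the data lake bucket."""
    return {
        # Lake Formation expects the S3 location as an ARN, not an s3:// URI.
        "ResourceArn": f"arn:aws:s3:::{bucket_name}",
        # The role noted for LFRegisterLocationServiceRole.
        "RoleArn": role_arn,
    }


def register_location(request: dict) -> None:
    """Send the request to Lake Formation."""
    import boto3  # imported lazily so the builder works without AWS access

    boto3.client("lakeformation").register_resource(**request)
```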

Grant crawler role access to the data location

To grant access to the crawler, complete the following steps:

On the Lake Formation console, in the navigation pane, choose Data locations.
Choose Grant.
For IAM users and roles, choose the role for the crawler.
For Storage locations, enter the data lake bucket path.
Choose Grant.

Create database and grant access to the crawler role

Complete the following steps to create your database and grant access to the crawler role:

On the Lake Formation console, in the navigation pane, choose Databases.
Choose Create database.
Provide the name icebergcrawlerblogdb for the database.
Make sure Use only IAM access control for new tables in this database option is not selected.
Choose Create database.
On the Action menu, choose Grant.
For IAM users and roles, choose the role for the crawler.
Leave the database specified as icebergcrawlerblogdb.
Select Create table, Describe, and Alter for Database permissions.
Choose Grant.
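The two grants for the crawler role (data location access and database permissions) can be expressed as Lake Formation GrantPermissions requests. The following is a minimal sketch, assuming boto3; the function names are hypothetical, and the role ARN and bucket name come from your own stack.

```python
def crawler_grants(role_arn: str, bucket_name: str, database: str) -> list:
    """Build the GrantPermissions requests for the crawler role."""
    principal = {"DataLakePrincipalIdentifier": role_arn}
    return [
        {
            # Access to the registered S3 data location.
            "Principal": principal,
            "Resource": {
                "DataLocation": {"ResourceArn": f"arn:aws:s3:::{bucket_name}"}
            },
            "Permissions": ["DATA_LOCATION_ACCESS"],
        },
        {
            # Create table, Describe, and Alter on the database.
            "Principal": principal,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["CREATE_TABLE", "DESCRIBE", "ALTER"],
        },
    ]


def apply_grants(grants: list) -> None:
    """Send each grant to Lake Formation."""
    import boto3  # imported lazily so the builder works without AWS access

    lf = boto3.client("lakeformation")
    for grant in grants:
        lf.grant_permissions(**grant)
```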

Configure the crawler for Iceberg

To configure your crawler for Iceberg, complete the following steps:

On the AWS Glue console, in the navigation pane, choose Crawlers.
Choose Create crawler.
Enter a name for the crawler. For this post, we use icebergcrawler.
Under Data source configuration, choose Add data source.
For Data source, choose Iceberg.
For S3 path, enter s3://<datalakebucket>/icebergcrawlerblogdb.db/.
Choose Add an Iceberg data source.

Support for Iceberg tables is available through the CreateCrawler and UpdateCrawler APIs by adding an IcebergTarget as a target, with the following properties:

connectionId – If your Iceberg tables are stored in buckets that require VPC authorization, you can set your connection properties here
icebergTables – This is an array of icebergPaths strings, each indicating the folder in which the metadata files for an Iceberg table reside

See the following code:

{
    "IcebergTarget": {
        "connectionId": "iceberg-connection-123",
        "icebergMetaDataPaths": [
            "s3://bucketA/",
            "s3://bucketB/",
            "s3://bucket3/financedb/financetable/"
        ],
        "exclusions": ["departments/**", "employees/folder/**"],
        "maximumDepth": 5
    }
}

Choose Next.
For Existing IAM role, enter the crawler role created by the stack.
Under Lake Formation configuration, select Use Lake Formation credentials for crawling S3 data source.
Choose Next.
Under Set output and scheduling, specify the target database as icebergcrawlerblogdb.
Choose Next.
Choose Create crawler.
Run the crawler.
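The console configuration above could also be created through the CreateCrawler API. This is a minimal sketch, assuming boto3; the function names are hypothetical. The IcebergTargets field names used here (Paths, MaximumTraversalDepth) follow the boto3 Glue client request shape, which differs from the illustrative JSON shown earlier, so check them against the current API reference before relying on them.

```python
def build_crawler_request(name, role_arn, database, paths, max_depth=None):
    """Build a CreateCrawler request with a single Iceberg target."""
    target = {"Paths": list(paths)}
    if max_depth is not None:
        # Optional limit on how deep the crawler traverses each S3 path.
        target["MaximumTraversalDepth"] = max_depth
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"IcebergTargets": [target]},
        # Mirrors "Use Lake Formation credentials for crawling S3 data source".
        "LakeFormationConfiguration": {"UseLakeFormationCredentials": True},
    }


def create_and_run(request: dict) -> None:
    """Create the crawler, then start it."""
    import boto3  # imported lazily so the builder works without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```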

During each crawl, for each icebergTable path provided, the crawler calls the Amazon S3 List API to find the most recent metadata file under that Iceberg table metadata folder and updates the metadata_location parameter to point to that latest metadata file.
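To make the selection step concrete, the following sketch picks the newest metadata file from an S3 listing. It is an illustration only: the crawler's actual selection logic is not described in this post, and the sketch assumes that "most recent" means the latest LastModified timestamp among keys ending in .metadata.json. The function name and the sample keys are hypothetical.

```python
from datetime import datetime, timezone


def latest_metadata_key(objects):
    """Return the key of the newest *.metadata.json object, or None."""
    candidates = [o for o in objects if o["Key"].endswith(".metadata.json")]
    if not candidates:
        return None
    # Assumption: the newest file is the one with the latest LastModified.
    return max(candidates, key=lambda o: o["LastModified"])["Key"]


# Hypothetical listing, shaped like the "Contents" entries of an S3 list call.
EXAMPLE_OBJECTS = [
    {
        "Key": "icebergcrawlerblogdb.db/product/metadata/00000-a.metadata.json",
        "LastModified": datetime(2023, 8, 1, tzinfo=timezone.utc),
    },
    {
        "Key": "icebergcrawlerblogdb.db/product/metadata/00001-b.metadata.json",
        "LastModified": datetime(2023, 8, 2, tzinfo=timezone.utc),
    },
    {
        "Key": "icebergcrawlerblogdb.db/product/metadata/snap-1.avro",
        "LastModified": datetime(2023, 8, 3, tzinfo=timezone.utc),
    },
]

print(latest_metadata_key(EXAMPLE_OBJECTS))
```

The .avro file is newer but is ignored because only metadata JSON files are candidates for metadata_location.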

The following screenshot shows the details after a successful run.

The crawler was able to crawl the S3 data source and successfully populate the schema for Iceberg data in the Data Catalog.

You can now start using the Data Catalog as your primary metastore and create new Iceberg tables directly in the Data Catalog, either on the console or by using the CreateTable API.

Create a new Iceberg table

To create an Iceberg table in the Data Catalog using the console, complete the steps in this section. Alternatively, you can use a CloudFormation template to create an Iceberg table using the following code:

Type: AWS::Glue::Table
Properties: 
  CatalogId: "<account_id>"
  DatabaseName: "icebergcrawlerblogdb"
  TableInput:
    Name: "product_details"
    StorageDescriptor:
       Columns:
         - Name: "product_id"
           Type: "string"
         - Name: "manufacture_name"
           Type: "string"
         - Name: "product_rating"
           Type: "int"
       Location: "s3://<datalakebucket>/icebergcrawlerblogdb.db/"
    TableType: "EXTERNAL_TABLE"
  OpenTableFormatInput:
    IcebergInput:
      MetadataOperation: "CREATE"
      Version: "2"
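The same table could be created by calling the CreateTable API directly. The following is a minimal sketch, assuming boto3 and a request that mirrors the template above; the function names are hypothetical, and the bucket name is a placeholder.

```python
def build_create_table_request(database, table, location, columns):
    """Build a Glue CreateTable request for a new, empty Iceberg table."""
    return {
        "DatabaseName": database,
        "TableInput": {
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Columns": [{"Name": n, "Type": t} for n, t in columns],
                "Location": location,
            },
        },
        # Asks Glue to generate the Iceberg metadata.json for the new table.
        "OpenTableFormatInput": {
            "IcebergInput": {"MetadataOperation": "CREATE", "Version": "2"}
        },
    }


def create_table(request: dict) -> None:
    """Send the request to AWS Glue."""
    import boto3  # imported lazily so the builder works without AWS access

    boto3.client("glue").create_table(**request)


PRODUCT_DETAILS_COLUMNS = [
    ("product_id", "string"),
    ("manufacture_name", "string"),
    ("product_rating", "int"),
]
```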

Grant the IAM role access to the data location

First, grant the IAM role access to the data location:

On the Lake Formation console, in the navigation pane, choose Data locations.
Choose Grant.
Select Admin IAM role for IAM users and roles.
For Storage location, enter the data lake bucket path.
Choose Grant.

Create the Iceberg table

Complete the following steps to create the Iceberg table:

On the Lake Formation console, in the navigation pane, choose Tables.
Choose Create table.
For Name, enter product_details.
Choose icebergcrawlerblogdb for Database.
Select Apache Iceberg table for Table format.
For Table location, provide the path <datalakebucket>/icebergcrawlerblogdb.db/.

Provide the following schema and choose Upload schema:

[
     {
         "Name": "product_id",
         "Type": "string"
     },
     {
         "Name": "manufacture_name",
         "Type": "string"
     },
     {
         "Name": "product_rating",
         "Type": "int"
     }
 ]

Choose Submit to create the table.

Add a record to the new Iceberg table

Complete the following steps to add a record to the Iceberg table:

On the Athena console, navigate to the query editor.
Choose Edit settings to configure the Athena query results bucket using the value noted from the CloudFormation output for AthenaBucketName.
Choose Save.

Run the following query to add a record to the table:

insert into icebergcrawlerblogdb.product_details values('00001','ABC Company',10)
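If you prefer to run the statement outside the console, it could be submitted through the Athena API. This is a minimal sketch, assuming boto3; the function names are hypothetical, and the results bucket is the value noted for AthenaBucketName.

```python
import time

INSERT_SQL = (
    "insert into icebergcrawlerblogdb.product_details "
    "values('00001','ABC Company',10)"
)


def build_query_request(sql: str, results_bucket: str) -> dict:
    """Build a StartQueryExecution request that writes results to S3."""
    return {
        "QueryString": sql,
        "ResultConfiguration": {"OutputLocation": f"s3://{results_bucket}/"},
    }


def run_query(request: dict, poll_seconds: float = 1.0) -> str:
    """Start the query and poll until it reaches a terminal state."""
    import boto3  # imported lazily so the builder works without AWS access

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)
```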

Configure Lake Formation permissions on the Iceberg table in the Data Catalog

Athena supports Lake Formation permission on Iceberg tables, so for this post, we show you how to set up fine-grained access on the tables and query them using Athena.

Now the data lake admin can delegate permissions on the database and table to the LFBusinessAnalystRole-IcebergBlog IAM role via the Lake Formation console.

Grant the role access to the database and describe permissions

To grant the LFBusinessAnalystRole-IcebergBlog IAM role access to the database with describe permissions, complete the following steps:

On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
Choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases.
Select Describe for Database permissions.
Choose Grant to apply the permissions.

Grant column access to the role

Next, grant column access to the LFBusinessAnalystRole-IcebergBlog IAM role:

On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
Choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases and product for Tables.
Choose Select for Table permissions.
Under Data permissions, select Column-based access.
Select Include columns and choose product_name and price.
Choose Grant to apply the permissions.

Grant table access to the role

Lastly, grant table access to the LFBusinessAnalystRole-IcebergBlog IAM role:

On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
Choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases and product_details for Tables.
Choose Select and Describe for Table permissions.
Choose Grant to apply the permissions.
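The three grants for the analyst role (database Describe, column-level Select on product, and table-level Select and Describe on product_details) differ mainly in the resource they target. The following is a minimal sketch of the corresponding GrantPermissions requests, assuming boto3; the function name is hypothetical, and the role ARN is a placeholder for the analyst role in your account.

```python
def analyst_grants(role_arn: str, database: str = "icebergcrawlerblogdb") -> list:
    """Build the three GrantPermissions requests for the analyst role."""
    principal = {"DataLakePrincipalIdentifier": role_arn}
    return [
        {
            # Describe on the database.
            "Principal": principal,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["DESCRIBE"],
        },
        {
            # Column-based access: only product_name and price are included.
            "Principal": principal,
            "Resource": {
                "TableWithColumns": {
                    "DatabaseName": database,
                    "Name": "product",
                    "ColumnNames": ["product_name", "price"],
                }
            },
            "Permissions": ["SELECT"],
        },
        {
            # Full-table Select and Describe on the new Iceberg table.
            "Principal": principal,
            "Resource": {
                "Table": {"DatabaseName": database, "Name": "product_details"}
            },
            "Permissions": ["SELECT", "DESCRIBE"],
        },
    ]
```

Each dictionary can be passed to the Lake Formation client as `grant_permissions(**grant)`.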

Verify the tables using Athena

To verify the tables using Athena, switch to the LFBusinessAnalystRole-IcebergBlog role and complete the following steps:

On the Athena console, navigate to the query editor.
Choose Edit settings to configure the Athena query results bucket using the value noted from the CloudFormation output for AthenaBucketName.
Choose Save.
Run the queries on product and product_details to validate access.

The following screenshot shows column permissions on product.

The following screenshot shows table permissions on product_details.

We have successfully crawled the Iceberg dataset created from Hive Metastore with data on Amazon S3 and created an AWS Glue Data Catalog table with the schema populated. We registered the data lake bucket with Lake Formation and enabled crawling access to the data lake using Lake Formation permissions. We granted Lake Formation permissions on the database and table to the analyst user and validated access to the data using Athena.

Clean up

To avoid unwanted charges to your AWS account, delete the AWS resources:

Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
Delete the CloudFormation stack you created.

Conclusion

With the support for Iceberg crawlers, you can quickly move to using the AWS Glue Data Catalog as your primary Iceberg table catalog. You can automatically register Iceberg tables into the Data Catalog by running an AWS Glue crawler, which doesn’t require any DDL or manual schema definition. You can start building your serverless transactional data lake on AWS using the AWS Glue crawler, create a new table using the Data Catalog, and use Lake Formation fine-grained access controls for querying Iceberg tables with Athena.

Refer to Working with other AWS services for Lake Formation support for Iceberg tables across various AWS analytical services.

Special thanks to everyone who contributed to this crawler and CreateTable feature launch: Theo Xu, Kyle Duong, Anshuman Sharma, Atreya Srivathsan, Eric Wu, Jack Ye, Himani Desai, Masoud Shahamiri and Sachet Saurabh.

If you have questions or suggestions, submit them in the comments section.

About the authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Mahesh Mishra is a Principal Product Manager with the AWS Lake Formation team. He works with many of AWS’s largest customers on emerging technology needs, and leads several data and analytics initiatives within AWS, including strong support for transactional data lakes.

How to use AWS Verified Access logs to write and troubleshoot access policies

2023-08-14 Ankush Goyal

Post Syndicated from Ankush Goyal original https://aws.amazon.com/blogs/security/how-to-use-aws-verified-access-logs-to-write-and-troubleshoot-access-policies/

On June 19, 2023, AWS Verified Access introduced improved logging functionality; Verified Access now logs more extensive user context information received from the trust providers. This improved logging feature simplifies administration and troubleshooting of application access policies while adhering to zero-trust principles.

In this blog post, we will show you how to manage the Verified Access logging configuration and how to use Verified Access logs to write and troubleshoot access policies faster. We provide an example showing the user context information that was logged before and after the improved logging functionality and how you can use that information to transform a high-level policy into a fine-grained policy.

Overview of AWS Verified Access

AWS Verified Access helps enterprises to provide secure access to their corporate applications without using a virtual private network (VPN). Using Verified Access, you can configure fine-grained access policies to help limit application access only to users who meet the specified security requirements (for example, user identity and device security status). These policies are written in Cedar, a new policy language developed and open-sourced by AWS.

Verified Access validates each request based on access policies that you set. You can use user context—such as user, group, and device risk score—from your existing third-party identity and device security services to define access policies. In addition, Verified Access provides you an option to log every access attempt to help you respond quickly to security incidents and audit requests. These logs also contain user context sent from your identity and device security services and can help you to match the expected outcomes with the actual outcomes of your policies. To capture these logs, you need to enable logging from the Verified Access console.

Figure 1: Overview of AWS Verified Access architecture showing Verified Access connected to an application

After a Verified Access administrator attaches a trust provider to a Verified Access instance, they can write policies using the user context information from the trust provider. This user context information is custom to an organization, and you need to gather it from different sources when writing or troubleshooting policies that require more extensive user context.

Now, with the improved logging functionality, the Verified Access logs record more extensive user context information from the trust providers. This eliminates the need to gather information from different sources. With the detailed context available in the logs, you have more information to help validate and troubleshoot your policies.

Let’s walk through an example of how this detailed context can help you improve your Verified Access policies. For this example, we set up a Verified Access instance using AWS IAM Identity Center (successor to AWS Single Sign-on) and CrowdStrike as trust providers. To learn more about how to set up a Verified Access instance, see Getting started with Verified Access. To learn how to integrate Verified Access with CrowdStrike, see Integrating AWS Verified Access with device trust providers.

Then we wrote the following simple policy, where users are allowed only if their email matches the corporate domain.

permit(principal,action,resource)
when {
    context.sso.user.email.address like "*@example.com"
};

Before improved logging, Verified Access logged basic information only, as shown in the following example log.

    "identity": {
        "authorizations": [
            {
                "decision": "Allow",
                "policy": {
                    "name": "inline"
                }
            }
        ],
        "idp": {
            "name": "user",
            "uid": "vatp-09bc4cbce2EXAMPLE"
        },
        "user": {
            "email_addr": "[email protected]",
            "name": "Test User Display",
            "uid": "[email protected]",
            "uuid": "00u6wj48lbxTAEXAMPLE"
        }
    }

Modify an existing Verified Access instance

To improve the preceding policy and make it more granular, you can include checks for various user and device details. For example, you can check whether the user belongs to a particular group, has a verified email address, and is logging in from a device whose OS assessment score is greater than 50 and whose overall device score is greater than 15.

Modify the Verified Access instance logging configuration

You can modify the instance logging configuration of an existing Verified Access instance by using either the AWS Management Console or AWS Command Line Interface (AWS CLI).

Open the Verified Access console and select Verified Access instances.
Select the instance that you want to modify, and then, on the Verified Access instance logging configuration tab, select Modify Verified Access instance logging configuration.

Figure 2: Modify Verified Access logging configuration

Under Update log version, select ocsf-1.0.0-rc.2, turn on Include trust context, and select where the logs should be delivered.

Figure 3: Verified Access log version and trust context

After you’ve completed the preceding steps, Verified Access will start logging more extensive user context information from the trust providers for every request that Verified Access receives. This context information can contain sensitive data. To learn more about how to protect this sensitive information, see Protect Sensitive Data with Amazon CloudWatch Logs.
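The same logging configuration could be applied programmatically. The following is a minimal sketch, assuming boto3 and delivery to Amazon CloudWatch Logs; the function names, the instance ID, and the log group name are hypothetical placeholders.

```python
def build_logging_request(instance_id: str, log_group: str) -> dict:
    """Build a request that enables the newer log version with trust context."""
    return {
        "VerifiedAccessInstanceId": instance_id,
        "AccessLogs": {
            # Assumption for this sketch: logs are delivered to CloudWatch Logs.
            "CloudWatchLogs": {"Enabled": True, "LogGroup": log_group},
            "LogVersion": "ocsf-1.0.0-rc.2",
            "IncludeTrustContext": True,
        },
    }


def apply_logging(request: dict) -> None:
    """Send the request through the EC2 API, which hosts Verified Access."""
    import boto3  # imported lazily so the builder works without AWS access

    ec2 = boto3.client("ec2")
    ec2.modify_verified_access_instance_logging_configuration(**request)
```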

The following example log shows information received from the IAM Identity Center identity provider (IdP) and the device provider CrowdStrike.

"data": {
    "context": {
        "crowdstrike": {
            "assessment": {
                "overall": 21,
                "os": 53,
                "sensor_config": 4,
                "version": "3.6.1"
            },
            "cid": "7545bXXXXXXXXXXXXXXX93cf01a19b",
            "exp": 1692046783,
            "iat": 1690837183,
            "jwk_url": "https://assets-public.falcon.crowdstrike.com/zta/jwk.json",
            "platform": "Windows 11",
            "serial_number": "ec2dXXXXb-XXXX-XXXX-XXXX-XXXXXX059f05",
            "sub": "99c185e69XXXXXXXXXX4c34XXXXXX65a",
            "typ": "crowdstrike-zta+jwt"
        },
        "sso": {
            "user": {
                "user_id": "24a80468-XXXX-XXXX-XXXX-6db32c9f68fc",
                "user_name": "XXXX",
                "email": {
                    "address": "[email protected]",
                    "verified": false
                }
            },
            "groups": {
                "04c8d4d8-e0a1-XXXX-383543e07f11": {
                    "group_name": "XXXX"
                }
            }
        },
        "http_request": {
            "hostname": "sales.example.com",
            "http_method": "GET",
            "x_forwarded_for": "52.XX.XX.XXXX",
            "port": 80,
            "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
            "client_ip": "52.XX.XX.XXXX"
        }
    }
}

The following example log shows the user context information received from the OpenID Connect (OIDC) trust provider Okta. You can see the difference in the information provided by the two different trust providers: IAM Identity Center and Okta.

"data": {
    "context": {
        "http_request": {
            "hostname": "sales.example.com",
            "http_method": "GET",
            "x_forwarded_for": "99.X.XX.XXX",
            "port": 80,
            "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
            "client_ip": "99.X.XX.XXX"
        },
        "okta": {
            "sub": "00uXXXXXXXJNbWyRI5d7",
            "name": "XXXXXX",
            "locale": "en_US",
            "preferred_username": "[email protected]",
            "given_name": "XXXX",
            "family_name": "XXXX",
            "zoneinfo": "America/Los_Angeles",
            "groups": [
                "Everyone",
                "Sales",
                "Finance",
                "HR"
            ],
            "exp": 1690835175,
            "iss": "https://example.okta.com"
        }
    }
}

The following is a sample policy written using the information received from the trust providers.

permit(principal,action,resource)
when {
  context.idcpolicy.groups has "<hr-group-id>" &&
  context.idcpolicy.user.email.address like "*@example.com" &&
  context.idcpolicy.user.email.verified == true &&
  context has "crdstrikepolicy" &&
  context.crdstrikepolicy.assessment.os > 50 &&
  context.crdstrikepolicy.assessment.overall > 15
};

This policy only grants access to users who belong to a particular group, have a verified email address, and have a corporate email domain. Also, users can only access the application from a device with an OS that has an assessment score greater than 50, and has an overall device score greater than 15.
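To make the conditions concrete, the following Python sketch mirrors the policy's checks against a plain dictionary. It only illustrates the logic and is not how Verified Access evaluates policies. The function name is_permitted, the group ID, and the sample values are hypothetical; the keys idcpolicy and crdstrikepolicy are taken from the policy above.

```python
def is_permitted(context: dict, hr_group_id: str) -> bool:
    """Mirror the sample policy's conditions (illustration only)."""
    # `context has "crdstrikepolicy"`: the device trust data must be present.
    if "crdstrikepolicy" not in context:
        return False
    idc = context.get("idcpolicy", {})
    email = idc.get("user", {}).get("email", {})
    assessment = context["crdstrikepolicy"].get("assessment", {})
    return (
        hr_group_id in idc.get("groups", {})                    # group membership
        and email.get("address", "").endswith("@example.com")   # like "*@example.com"
        and email.get("verified") is True                       # verified email
        and assessment.get("os", 0) > 50                        # OS score
        and assessment.get("overall", 0) > 15                   # overall device score
    )


# Hypothetical context; all values are made up for illustration.
sample_context = {
    "idcpolicy": {
        "user": {"email": {"address": "jane@example.com", "verified": True}},
        "groups": {"hr-group-id": {"group_name": "HR"}},
    },
    "crdstrikepolicy": {"assessment": {"overall": 21, "os": 53}},
}

print(is_permitted(sample_context, "hr-group-id"))  # True
```

A context whose email has verified set to false, or that lacks the crdstrikepolicy key entirely, fails the check in this sketch.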

Conclusion

In this post, you learned how to manage Verified Access logging configuration from the Verified Access console and how to use improved logging information to write AWS Verified Access policies. To get started with Verified Access, see the Amazon VPC console.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Introducing Apache Airflow version 2.6.3 support on Amazon MWAA

2023-08-10 Hernan Garcia

Post Syndicated from Hernan Garcia original https://aws.amazon.com/blogs/big-data/introducing-apache-airflow-version-2-6-3-support-on-amazon-mwaa/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it simple to set up and operate end-to-end data pipelines in the cloud. Trusted across various industries, Amazon MWAA helps organizations like Siemens, ENGIE, and Choice Hotels International enhance and scale their business workflows, while significantly improving security and reducing infrastructure management overhead.

Today, we are announcing the availability of Apache Airflow version 2.6.3 environments. If you’re currently running Apache Airflow version 2.x, you can seamlessly upgrade to v2.6.3 using in-place version upgrades, thereby retaining your workflow run history and environment configurations.

In this post, we delve into some of the new features and capabilities of Apache Airflow v2.6.3 and how you can set up or upgrade your Amazon MWAA environment to accommodate this version as you orchestrate your workflows in the cloud at scale.

New feature: Notifiers

Airflow now gives you an efficient way to create reusable and standardized notifications to handle systemic errors and failures. Notifiers introduce a new object in Airflow, designed to be an extensible layer for adding notifications to DAGs. This framework can send messages to external systems when a task instance or an individual DAG run changes its state. You can build notification logic from a new base object and call it directly from your DAG files. The BaseNotifier is an abstract class that provides a basic structure for sending notifications in Airflow using the various on_*_callback parameters. It is intended for providers to extend and customize this for their specific needs.

Using this framework, you can build custom notification logic directly within your DAG files. For instance, notifications can be sent through email, Slack, or Amazon Simple Notification Service (Amazon SNS) based on the state of a DAG (on_failure, on_success, and so on). You can also create your own custom notifier that updates an API or posts a file to your storage system of choice.

For details on how to create and use a notifier, refer to Creating a notifier.
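As a minimal sketch of what a custom notifier can look like, the following class extends BaseNotifier and implements notify. The class name LogNotifier, its message parameter, and the fake context are hypothetical. The stand-in base class in the except branch exists only so the sketch can run where Airflow is not installed.

```python
from types import SimpleNamespace

try:
    from airflow.notifications.basenotifier import BaseNotifier
except ImportError:
    # Stand-in so the sketch can run without Airflow installed (assumption for illustration).
    class BaseNotifier:
        template_fields = ()

        def __init__(self):
            pass


class LogNotifier(BaseNotifier):
    """Hypothetical notifier that collects messages instead of calling an external system."""

    template_fields = ("message",)

    def __init__(self, message: str):
        super().__init__()
        self.message = message
        self.sent = []

    def notify(self, context):
        # A real notifier would call Slack, Amazon SNS, an internal API, and so on.
        dag_id = context["dag"].dag_id
        self.sent.append(f"{dag_id}: {self.message}")


# Fake context for illustration; Airflow supplies the real context to the callback.
notifier = LogNotifier(message="run failed")
notifier.notify({"dag": SimpleNamespace(dag_id="demo_dag")})
print(notifier.sent)  # ['demo_dag: run failed']
```

In a DAG file, an instance of such a class would typically be passed to a callback parameter, for example on_failure_callback=LogNotifier(message="run failed").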

New feature: Managing tasks stuck in a queued state

Apache Airflow v2.6.3 brings a significant improvement to address the long-standing issue of tasks getting stuck in the queued state when using the CeleryExecutor. In a typical Apache Airflow workflow, tasks progress through a lifecycle, moving from the scheduled state to the queued state, and eventually to the running state. However, tasks can occasionally remain in the queued state longer than expected due to communication issues among the scheduler, the executor, and the worker. In Amazon MWAA, customers have experienced such tasks being queued for up to 12 hours due to how it utilizes the native integration of Amazon Simple Queue Service (Amazon SQS) with CeleryExecutor.

To mitigate this issue, Apache Airflow v2.6.3 introduced a mechanism that checks the Airflow database for tasks that have remained in the queued state beyond a specified timeout, defaulting to 600 seconds. This default can be modified using the environment configuration parameter scheduler.task_queued_timeout. The system then retries such tasks if retries are still available or fails them otherwise, ensuring that your data pipelines continue to function smoothly.

Notably, this update deprecates the previously used celery.stalled_task_timeout and celery.task_adoption_timeout settings, and consolidates their functionalities into a single configuration, scheduler.task_queued_timeout. This enables more effective management of tasks that remain in the queued state. Operators can also configure scheduler.task_queued_timeout_check_interval, which controls how frequently the system checks for tasks that have stayed in the queued state beyond the defined timeout.

For details on how to use task_queued_timeout, refer to the official Airflow documentation.
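The retry-or-fail decision described above can be sketched in a few lines of Python. This is a simplified illustration of the behavior, not Airflow's implementation; the QueuedTask class, its fields, and the resolve_stuck_tasks function are hypothetical.

```python
from dataclasses import dataclass

TASK_QUEUED_TIMEOUT = 600.0  # seconds; the default mentioned above


@dataclass
class QueuedTask:
    task_id: str
    queued_at: float  # time (in seconds) when the task entered the queued state
    tries_left: int   # remaining retries


def resolve_stuck_tasks(tasks, now, timeout=TASK_QUEUED_TIMEOUT):
    """Return {task_id: action} for tasks queued longer than `timeout`."""
    actions = {}
    for task in tasks:
        if now - task.queued_at <= timeout:
            continue  # still within the allowed queued time
        actions[task.task_id] = "retry" if task.tries_left > 0 else "fail"
    return actions


tasks = [
    QueuedTask("extract", queued_at=0.0, tries_left=2),
    QueuedTask("transform", queued_at=0.0, tries_left=0),
    QueuedTask("load", queued_at=500.0, tries_left=1),
]
print(resolve_stuck_tasks(tasks, now=700.0))
# {'extract': 'retry', 'transform': 'fail'}
```

In this sketch, the load task has only been queued for 200 seconds, so it is left alone.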

New feature: A new continuous timetable and support for continuous schedule

With prior versions of Airflow, to run a DAG continuously in a loop, you had to use the TriggerDagRunOperator to rerun the DAG after the last task finished. With Apache Airflow v2.6.3, you can now run a DAG continuously with a predefined timetable. This simplifies scheduling for continual DAG runs. The new ContinuousTimetable construct will create one continuous DAG run, respecting start_date and end_date, with the new run starting as soon as the previous run has completed, regardless of whether the previous run has succeeded or failed. Using a continuous timetable is especially useful when sensors are used to wait for highly irregular events from external data tools.

You can bound the degree of parallelism to ensure that only one DAG is running at any given time with the max_active_runs parameter:

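from datetime import datetime
from airflow.decorators import dag

@dag(
    start_date=datetime(2023, 5, 9),
    schedule="@continuous",
    max_active_runs=1,  # only one DAG run is active at any given time
    catchup=False,
)
def continuous_dag():  # hypothetical DAG name
    ...  # define tasks here
continuous_dag()  # instantiate so that Airflow registers the DAG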

New feature: Trigger the DAG UI extension with flexible user form concept

Prior to Apache Airflow v2.6.3, you could provide parameters in JSON structure through the Airflow UI for custom workflow runs. You had to model, check, and understand the JSON and enter parameters manually without the option to validate them before triggering a workflow. With Apache Airflow v2.6.3, when you choose Trigger DAG w/ config, a trigger UI form is rendered based on the predefined DAG Params. For your ad hoc, testing, or custom runs, this simplifies the DAG’s parameter entry. If the DAG has no parameters defined, a JSON entry mask is shown. The form elements can be defined with the Param class and attributes define how a form field is displayed.

For an example DAG, the following form is generated by DAG Params.

Set up a new Apache Airflow v2.6.3 environment

You can set up a new Apache Airflow v2.6.3 environment in your account and preferred Region using the AWS Management Console, API, or AWS Command Line Interface (AWS CLI). If you’re adopting infrastructure as code (IaC), you can automate the setup using either AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or Terraform scripts.

When you have successfully created an Apache Airflow v2.6.3 environment in Amazon MWAA, the following packages are automatically installed on the scheduler and worker nodes along with other provider packages:

apache-airflow-providers-amazon==8.2.0

python==3.10.8

For a complete list of provider packages installed, refer to Apache Airflow provider packages installed on Amazon MWAA environments.

Upgrade from older versions of Apache Airflow to Apache Airflow v2.6.3

You can perform in-place version upgrades of your existing Amazon MWAA environments to update your older Apache Airflow v2.x-based environments to v2.6.3. To learn more about in-place version upgrades, refer to Upgrading the Apache Airflow version or Introducing in-place version upgrades with Amazon MWAA.

Conclusion

In this post, we talked about some of the new features of Apache Airflow v2.6.3 and how you can get started using them in Amazon MWAA. Try out these new features like notifiers and continuous timetables, and other enhancements to improve your data orchestration pipelines.

For additional details and code examples on Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.

Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Authors

Hernan Garcia is a Senior Solutions Architect at AWS, based out of Amsterdam, working in the Financial Services Industry since 2018. He specializes in application modernization and supports his customers in the adoption of cloud operating models and serverless technologies.

Parnab Basak is a Solutions Architect and a Serverless Specialist at AWS. He specializes in creating new solutions that are cloud native using modern software development practices like serverless, DevOps, and analytics. Parnab works closely in the analytics and integration services space helping customers adopt AWS services for their workflow orchestration needs.

Shubham Mehta is an experienced product manager with over eight years of experience and a proven track record of delivering successful products. In his current role as a Senior Product Manager at AWS, he oversees Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and spearheads the Apache Airflow open-source contributions to further enhance the product’s functionality.

Perform Amazon Kinesis load testing with Locust

2023-08-10 Luis Morales

Post Syndicated from Luis Morales original https://aws.amazon.com/blogs/big-data/perform-amazon-kinesis-load-testing-with-locust/

Building a streaming data solution requires thorough testing at the scale it will operate in a production environment. Streaming applications operating at scale often handle data volumes of up to gigabytes per second, and it's challenging for developers to easily generate such load when simulating high-traffic Amazon Kinesis-based applications.

Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose are capable of capturing and storing terabytes of data per hour from numerous sources. Creating Kinesis data streams or Firehose delivery streams is straightforward through the AWS Management Console, AWS Command Line Interface (AWS CLI), or Kinesis API. However, generating a continuous stream of test data requires a custom process or script to run continuously. Although the Amazon Kinesis Data Generator (KDG) provides a user-friendly UI for this purpose, it has some limitations, such as bandwidth constraints and increased round trip latency. (For more information on the KDG, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.)

To overcome these limitations, this post describes how to use Locust, a modern load testing framework, to conduct large-scale load testing for a more comprehensive evaluation of the streaming data solution.

Overview

This project emits temperature sensor readings via Locust to Kinesis. We set up the Amazon Elastic Compute Cloud (Amazon EC2) Locust instance via the AWS Cloud Development Kit (AWS CDK) to load test Kinesis-based applications. You can access the Locust dashboard to perform and observe the load test and connect via Session Manager, a capability of AWS Systems Manager, for configuration changes. The following diagram illustrates this architecture.

Architecture overview

In our testing with the largest recommended instance (c7g.16xlarge), the setup was capable of emitting over 1 million events per second to Kinesis data streams in on-demand capacity mode, with a batch size (simulated users per Locust user) of 500. You can find more details on what this means and how to configure the load test later in this post.
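To put these numbers in perspective, the following back-of-the-envelope sketch computes how many batched calls per second are needed to reach a target event rate. The function name calls_per_second is hypothetical; the only assumptions are that each call carries one full batch and that a batch cannot exceed the 500-record limit of the Kinesis PutRecords API.

```python
import math

MAX_RECORDS_PER_PUT_RECORDS = 500  # PutRecords accepts at most 500 records per call


def calls_per_second(target_events_per_second: int, batch_size: int) -> int:
    """Number of batched calls per second needed to reach the target event rate."""
    if not 1 <= batch_size <= MAX_RECORDS_PER_PUT_RECORDS:
        raise ValueError("batch_size must be between 1 and 500")
    return math.ceil(target_events_per_second / batch_size)


print(calls_per_second(1_000_000, 500))  # 2000
print(calls_per_second(1_000_000, 1))    # 1000000
```

With a batch size of 500, 1 million events per second corresponds to about 2,000 calls per second across all Locust workers, compared with 1 million calls per second without batching.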

Locust overview

Locust is an open-source, scriptable, and scalable performance testing tool that allows you to define user behavior using Python code. It offers an easy-to-use interface, making it developer-friendly and highly expandable. With its distributed and scalable design, Locust can simulate millions of simultaneous users to mimic real user behavior during a performance test.

Each Locust user represents a scenario or a specific set of actions that a real user might perform on your system. When you run a performance test with Locust, you can specify the number of concurrent Locust users you want to simulate, and Locust will create an instance for each user, allowing you to assess the performance and behavior of your system under different user loads.

For more information on Locust, refer to the Locust documentation.

Prerequisites

To get started, clone or download the code from the GitHub repository.

Test locally

To test Locust out locally first before deploying it to the cloud, you have to install the necessary Python dependencies. If you’re new to Python, refer to the README for more information on getting started.

Navigate to the load-test directory and run the following code:

pip install -r requirements.txt

To send events to a Kinesis data stream from your local machine, you will need to have AWS credentials. For more information, refer to Configuration and credential file settings.

To perform the test locally, stay in the load-test directory and run the following code:

locust -f locust-load-test.py

You can now access the Locust dashboard via http://0.0.0.0:8089/. Enter the number of Locust users, the spawn rate (users added per second), and the target Amazon Kinesis data stream name for Host. By default, it deploys the Kinesis data stream DemoStream that you can use for testing.

Locust Dashboard - Enter details

To see the generated events logged, run the following command, which filters only Locust and root logs (for example, no Botocore logs):

locust -f locust-load-test.py --loglevel DEBUG 2>&1 | grep -E "(locust|root)"

Set up resources with the AWS CDK

The GitHub repository contains the AWS CDK code to create all the necessary resources for the load test. This removes opportunities for manual error, increases efficiency, and ensures consistent configurations over time. To deploy the resources, complete the following steps:

If not already downloaded, clone the GitHub repository to your local computer using the following command:

git clone https://github.com/aws-samples/amazon-kinesis-load-testing-with-locust

Download and install the latest Node.js.
Navigate to the root folder of the project and run the following command to install the latest version of AWS CDK:

npm install -g aws-cdk

Install the necessary dependencies:

npm install

Run cdk bootstrap to initialize the AWS CDK environment in your AWS account. Replace your AWS account ID and Region before running the following command:

cdk bootstrap

To learn more about the bootstrapping process, refer to Bootstrapping.

After the dependencies are installed, you can run the following command to deploy the stack of the AWS CDK template, which sets up the infrastructure within 5 minutes:

cdk deploy

The template sets up the Locust EC2 test instance, which by default is a c7g.xlarge instance. At the time of publishing, this instance type costs approximately $0.145 per hour in us-east-1. To find the most accurate pricing information, see Amazon EC2 On-Demand Pricing. You can find more details on how to change your instance size according to your scale of load testing later in this post.

It’s crucial to consider that the expenses incurred during load testing are not solely attributed to EC2 instance costs, but also heavily influenced by data transfer costs.

Accessing the Locust dashboard

You can access the dashboard by using the AWS CDK output KinesisLocustLoadTestingStack.locustdashboardurl to open the dashboard, for example http://1.2.3.4:8089.

The Locust dashboard is password protected. By default, it’s set to user name locust-user and password locust-dashboard-pwd.

With the default configuration, you can achieve up to 15,000 emitted events per second. Enter the number of Locust users (times the batch size), the spawn rate (users added per second), and the target Kinesis data stream name for Host.

Locust Dashboard - Enter details

After you have started the load test, you can look at the load test on the Charts tab.

Locust Dashboard - Charts

You can also monitor the load test on the Kinesis Data Streams console by navigating to the stream that you are load testing. If you used the default settings, navigate to DemoStream. On the detail page, choose the Monitoring tab to see the ingested load.

Kinesis Data Streams - Monitoring

Adapt workloads

By default, this project generates random temperature sensor readings for every sensor with the following format:

{
    "sensorId": "bfbae19c-2f0f-41c2-952b-5d5bc6e001f1_1",
    "temperature": 147.24,
    "status": "OK",
    "timestamp": 1675686126310
}

The project comes packaged with Faker, which you can use to adapt the payload to your needs. You just have to update the generate_sensor_reading function in the locust-load-test.py file:

class SensorAPIUser(KinesisBotoUser):
    # ...

    def generate_sensor_reading(self, sensor_id, sensor_reading):
        current_temperature = round(10 + random.random() * 170, 2)

        if current_temperature > 160:
            status = "ERROR"
        elif current_temperature > 140 or random.randrange(1, 100) > 80:
            status = random.choice(["WARNING", "ERROR"])
        else:
            status = "OK"

        return {
            'sensorId': f"{sensor_id}_{sensor_reading}",
            'temperature': current_temperature,
            'status': status,
            'timestamp': round(time.time()*1000)
        }

    # ...
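As a rough illustration of adapting the payload, the following standalone variant adds an extra field using only the Python standard library. The humidity field and its value range are hypothetical, and the status logic is simplified. In the project, the function is a method of SensorAPIUser rather than a free function.

```python
import random
import time
import uuid


def generate_sensor_reading(sensor_id: str, sensor_reading: int) -> dict:
    """Variant of the default payload with a hypothetical `humidity` field."""
    current_temperature = round(10 + random.random() * 170, 2)
    status = "ERROR" if current_temperature > 160 else "OK"  # simplified status rule
    return {
        "sensorId": f"{sensor_id}_{sensor_reading}",
        "temperature": current_temperature,
        "humidity": round(random.uniform(0, 100), 1),  # hypothetical extra field
        "status": status,
        "timestamp": round(time.time() * 1000),  # milliseconds since the epoch
    }


reading = generate_sensor_reading(str(uuid.uuid4()), 1)
print(reading)
```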

Change configurations

After the initial deployment of the load testing tool, you can change configuration in two ways:

Connect to the EC2 instance, make any configuration and code changes, and restart the Locust process
Change the configuration and load testing code locally and redeploy it via cdk deploy

The first option helps you iterate more quickly on the remote instance without a need to redeploy. The latter uses the infrastructure as code (IaC) approach and makes sure that your configuration changes can be committed to your source control system. For a fast development cycle, it’s recommended to test your load test configuration locally first, connect to your instance to apply the changes, and after successful implementation, codify it as part of your IaC repository and then redeploy.

Locust is created on the EC2 instance as a systemd service and can therefore be controlled with systemctl. If you want to change the configuration of Locust without redeploying the stack, you can connect to the instance via Systems Manager, navigate to the project directory at /usr/local/load-test, change the locust.env file, and restart the service by running sudo systemctl restart locust.

Large-scale load testing

This setup is capable of emitting over 1 million events per second to a Kinesis data stream, with a batch size of 500 and 64 secondaries on a c7g.16xlarge instance.

To achieve peak performance with Locust and Kinesis, keep the following in mind:

Instance size – Your performance is bound by the underlying EC2 instance, so refer to EC2 instance type for more information about scaling. To set the correct instance size, you can configure the instance size in the file kinesis-locust-load-testing.ts.
Number of secondaries – Locust benefits from a distributed setup. Therefore, the setup spins up a primary, which does the coordination, and multiple secondaries, which do the actual work. To fully take advantage of the cores, you should specify one secondary per core. You can configure the number in the locust.env file.
Batch size – The number of Kinesis data stream events you can send per Locust user is limited due to the resource overhead of switching Locust users and threads. To overcome this, you can configure a batch size to define how many users are simulated per Locust user. These are sent as a Kinesis data stream put_records call. You can configure the number in the locust.env file.
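The batching idea can be sketched as follows: several readings are converted into the entry format that a put_records call expects (a Data blob and a PartitionKey per record) and split into groups of at most 500, the per-call limit of the API. The helper names and the choice of sensorId as the partition key are assumptions for illustration, not necessarily what the project does.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords limit per request


def build_put_records_entries(readings):
    """Convert reading dicts into the entry format expected by PutRecords."""
    return [
        {
            "Data": json.dumps(reading).encode("utf-8"),
            "PartitionKey": reading["sensorId"],  # assumed partition key
        }
        for reading in readings
    ]


def chunk(entries, size=MAX_RECORDS_PER_CALL):
    """Yield successive slices of at most `size` entries."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


# Hypothetical readings, only to show the batching.
readings = [{"sensorId": f"sensor_{i}", "temperature": 20.0 + i} for i in range(1200)]
batches = list(chunk(build_put_records_entries(readings)))
print([len(batch) for batch in batches])  # [500, 500, 200]
```

Each batch could then be passed as the Records argument of a boto3 Kinesis client's put_records call, together with StreamName.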

This setup is capable of emitting over 1 million events per second to the Kinesis data stream, with a batch size of 500 and 64 secondaries on a c7g.16xlarge instance.

Locust Dashboard - Large Scale Load Test Charts

You can observe this on the Monitoring tab for the Kinesis data stream as well.

Kinesis Data Stream - Large Scale Load Test Monitoring

Clean up

In order to not incur any unnecessary costs, delete the stack by running the following code:

cdk destroy

Summary

Kinesis is already popular for its ease of use among users building streaming applications. With this load testing capability using Locust, you can now test your workloads in a more straightforward and faster way. Visit the GitHub repo to embark on your testing journey.

The project is licensed under the Apache 2.0 license, providing the freedom to clone and modify it according to your needs. Furthermore, you can contribute to the project by submitting issues or pull requests via GitHub, fostering collaboration and improvement in the testing ecosystem.

About the author

Luis Morales works as a Senior Solutions Architect with digital native businesses to support them in constantly reinventing themselves in the cloud. He is passionate about software engineering, cloud-native distributed systems, test-driven development, and all things code and security.

Monitor data pipelines in a serverless data lake

2023-08-09 Virendhar Sivaraman

Post Syndicated from Virendhar Sivaraman original https://aws.amazon.com/blogs/big-data/monitor-data-pipelines-in-a-serverless-data-lake/

AWS serverless services, including but not limited to AWS Lambda, AWS Glue, AWS Fargate, Amazon EventBridge, Amazon Athena, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Storage Service (Amazon S3), have become the building blocks for any serverless data lake, providing key mechanisms to ingest and transform data without fixed provisioning and the persistent need to patch the underlying servers. Building a data lake on a serverless paradigm brings significant cost and performance benefits. However, the rapid adoption of serverless data lake architectures, with ever-growing datasets that need to be ingested from a variety of sources, followed by complex data transformation and machine learning (ML) pipelines, can present a challenge. Similarly, in a serverless paradigm, application logs in Amazon CloudWatch are sourced from a variety of participating services, and traversing the lineage across logs can also present challenges. To successfully manage a serverless data lake, you require mechanisms to perform the following actions:

Reinforce data accuracy with every data ingestion
Holistically measure and analyze ETL (extract, transform, and load) performance at the individual processing component level
Proactively capture log messages and notify failures as they occur in near-real time

In this post, we will walk you through a solution to efficiently track and analyze ETL jobs in a serverless data lake environment. By monitoring application logs, you can gain insights into job execution and troubleshoot issues promptly to ensure the overall health and reliability of data pipelines.

Overview of solution

The serverless monitoring solution focuses on achieving the following goals:

Capture state changes across all steps and tasks in the data lake
Measure service reliability across a data lake
Quickly notify operations of failures as they happen

To illustrate the solution, we create a serverless data lake with a monitoring solution. For simplicity, we create a serverless data lake with the following components:

Storage layer – Amazon S3 is the natural choice, in this case with the following buckets:
- Landing – Where raw data is stored
- Processed – Where transformed data is stored
Ingestion layer – For this post, we use Lambda and AWS Glue for data ingestion, with the following resources:
- Lambda functions – Two Lambda functions that run to simulate a success state and failure state, respectively
- AWS Glue crawlers – Two AWS Glue crawlers that run to simulate a success state and failure state, respectively
- AWS Glue jobs – Two AWS Glue jobs that run to simulate a success state and failure state, respectively
Reporting layer – An Athena database to persist the tables created via the AWS Glue crawlers and AWS Glue jobs
Alerting layer – Slack is used to notify stakeholders

The serverless monitoring solution is devised to be loosely coupled as plug-and-play components that complement an existing data lake. State changes for the Lambda-based ETL tasks are tracked using AWS Lambda Destinations. We have used an SNS topic for routing both success and failure states for the Lambda-based tasks. In the case of AWS Glue-based tasks, we have configured EventBridge rules to capture state changes. These state change events are also routed to the same SNS topic. For demonstration purposes, this post only provides state monitoring for Lambda and AWS Glue, but you can extend the solution to other AWS services.

The following figure illustrates the architecture of the solution.

The architecture contains the following components:

EventBridge rules – EventBridge rules that capture the state change for the ETL tasks—in this case AWS Glue tasks. This can be extended to other supported services as the data lake grows.
SNS topic – An SNS topic that serves to catch all state events from the data lake.
Lambda function – The Lambda function is the subscriber to the SNS topic. It’s responsible for analyzing the state of the task run to do the following:
- Persist the status of the task run.
- Notify any failures to a Slack channel.
Athena database – The database where the monitoring metrics are persisted for analysis.
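The role of the subscriber Lambda function can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the repository: the function names, the returned field values, and the alert text are hypothetical, and only AWS Glue job events and Lambda Destinations records are handled. The field names service_type and event_type follow the Athena query used later in this post.

```python
import json


def classify_event(message: dict) -> dict:
    """Reduce a state-change message to a service type, task name, and status."""
    if message.get("detail-type") == "Glue Job State Change":
        detail = message["detail"]
        succeeded = detail["state"] == "SUCCEEDED"
        return {
            "service_type": "glue_job",
            "task": detail["jobName"],
            "event_type": "success" if succeeded else "failed",
        }
    if "requestContext" in message:  # record produced by Lambda Destinations
        request_context = message["requestContext"]
        succeeded = request_context["condition"] == "Success"
        return {
            "service_type": "lambda",
            # functionArn: arn:aws:lambda:<region>:<account>:function:<name>:<version>
            "task": request_context["functionArn"].split(":")[6],
            "event_type": "success" if succeeded else "failed",
        }
    return {"service_type": "unknown", "task": "unknown", "event_type": "unknown"}


def handler(event, context=None):
    """Classify each SNS record and attach an alert text to failures."""
    results = []
    for record in event.get("Records", []):
        status = classify_event(json.loads(record["Sns"]["Message"]))
        if status["event_type"] == "failed":
            status["alert"] = f"{status['service_type']} task {status['task']} failed"
        results.append(status)
    return results


# Hypothetical SNS event carrying one Glue job failure and one Lambda success.
sample_event = {
    "Records": [
        {"Sns": {"Message": json.dumps({
            "detail-type": "Glue Job State Change",
            "detail": {"jobName": "glue-job-fail", "state": "FAILED"},
        })}},
        {"Sns": {"Message": json.dumps({
            "requestContext": {
                "condition": "Success",
                "functionArn": "arn:aws:lambda:us-east-1:111122223333:function:lambda-success:$LATEST",
            },
        })}},
    ]
}
for result in handler(sample_event):
    print(result)
```

A full implementation would additionally write each result to the monitoring bucket and post the alert text to the Slack webhook.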

Deploy the solution

The source code to implement this solution uses AWS Cloud Development Kit (AWS CDK) and is available on the GitHub repo monitor-serverless-datalake. This AWS CDK stack provisions required network components and the following:

Three S3 buckets (the bucket names are prefixed with the AWS account number and Region; for example, the landing bucket is <aws-account-number>-<aws-region>-landing):
- Landing
- Processed
- Monitor
Three Lambda functions:
- datalake-monitoring-lambda
- lambda-success
- lambda-fail
Two AWS Glue crawlers:
- glue-crawler-success
- glue-crawler-fail
Two AWS Glue jobs:
- glue-job-success
- glue-job-fail
An SNS topic named datalake-monitor-sns
Three EventBridge rules:
- glue-monitor-rule
- event-rule-lambda-fail
- event-rule-lambda-success
An AWS Secrets Manager secret named datalake-monitoring
Athena artifacts:
- monitor database
- monitor-table table

You can also follow the instructions in the GitHub repo to deploy the serverless monitoring solution. It takes about 10 minutes to deploy this solution.

Connect to a Slack channel

We still need a Slack channel to which the alerts are delivered. Complete the following steps:

Set up a workflow automation to route messages to the Slack channel using webhooks.
Note the webhook URL.

The following screenshot shows the field names to use.

The following is a sample message for the preceding template.

On the Secrets Manager console, navigate to the datalake-monitoring secret.
Add the webhook URL to the slack_webhook secret.

Load sample data

The next step is to load some sample data. Copy the sample data files to the landing bucket using the following command:

aws s3 cp --recursive s3://awsglue-datasets/examples/us-legislators s3://<AWS_ACCOUNT>-<AWS_REGION>-landing/legislators

In the next sections, we show how Lambda functions, AWS Glue crawlers, and AWS Glue jobs work for data ingestion.

Test the Lambda functions

On the EventBridge console, enable the rules that trigger the lambda-success and lambda-fail functions every 5 minutes:

event-rule-lambda-fail
event-rule-lambda-success

After a few minutes, the failure events are relayed to the Slack channel. The following screenshot shows an example message.

Disable the rules after testing to avoid repeated messages.

Test the AWS Glue crawlers

On the AWS Glue console, navigate to the Crawlers page. Here you can start the following crawlers:

glue-crawler-success
glue-crawler-fail

In a minute, the glue-crawler-fail crawler’s status changes to Failed, which triggers a notification in Slack in near-real time.

Test the AWS Glue jobs

On the AWS Glue console, navigate to the Jobs page, where you can start the following jobs:

glue-job-success
glue-job-fail

In a few minutes, the glue-job-fail job status changes to Failed, which triggers a notification in Slack in near-real time.

Analyze the monitoring data

The monitoring metrics are persisted in Amazon S3 and can be used for historical analysis.

On the Athena console, navigate to the monitor database and run the following query to find the service that failed the most often:

SELECT service_type, COUNT(*) AS "fail_count"
FROM "monitor"."monitor"
WHERE event_type = 'failed'
GROUP BY service_type
ORDER BY fail_count DESC;

Over time, as observability data accumulates, time series-based analysis of the monitoring data will yield interesting findings.
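For readers who want to see what the query computes, the same aggregation can be expressed over a handful of in-memory rows. The rows below are made up for illustration; only the column names service_type and event_type come from the query.

```python
from collections import Counter

# Hypothetical monitoring rows; real data lives in the monitor database.
rows = [
    {"service_type": "glue_job", "event_type": "failed"},
    {"service_type": "lambda", "event_type": "failed"},
    {"service_type": "glue_job", "event_type": "failed"},
    {"service_type": "glue_crawler", "event_type": "success"},
]

# Equivalent of: WHERE event_type = 'failed' GROUP BY service_type ORDER BY fail_count DESC
fail_counts = Counter(
    row["service_type"] for row in rows if row["event_type"] == "failed"
)
print(fail_counts.most_common())  # [('glue_job', 2), ('lambda', 1)]
```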

Clean up

The overall cost of the solution is less than one dollar, but to avoid future costs, make sure to clean up the resources created as part of this post.

Summary

The post provided an overview of a serverless data lake monitoring solution that you can configure and deploy to integrate with enterprise serverless data lakes in just a few hours. With this solution, you can monitor a serverless data lake, send alerts in near-real time, and analyze performance metrics for all ETL tasks operating in the data lake. The design was intentionally kept simple to demonstrate the idea; you can further extend this solution with Athena and Amazon QuickSight to generate custom visuals and reporting. Check out the GitHub repo for a sample solution and further customize it for your monitoring needs.

About the Authors

Virendhar (Viru) Sivaraman is a strategic Senior Big Data & Analytics Architect with Amazon Web Services. He is passionate about building scalable big data and analytics solutions in the cloud. Besides work, he enjoys spending time with family, hiking & mountain biking.

Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 14 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finding areas for home automation.

Developing with Java and Spring Boot using Amazon CodeWhisperer

2023-08-04 Rajdeep Banerjee

Post Syndicated from Rajdeep Banerjee original https://aws.amazon.com/blogs/devops/developing-with-java-and-spring-boot-using-amazon-codewhisperer/

Developers often have to work with multiple programming languages depending on the task at hand. Sometimes, this is a result of choosing the right tool for a specific problem, or it is mandated by adhering to a specific technology adopted by a team. Within a specific programming language, developers may have to work with frameworks, software libraries, and popular cloud services from providers such as Amazon Web Services (AWS). This must be done while adhering to secure and best programming practices. Despite these challenges, developers must continue to release code at a sufficiently high velocity.

Amazon CodeWhisperer is a real-time, AI coding companion that provides code suggestions in your IDE code editor. Developers can simply write a comment that outlines a specific task in plain English, such as “method to upload a file to S3.” Based on this, CodeWhisperer automatically determines which cloud services and public libraries are best suited to accomplish the task and recommends multiple code snippets directly in the IDE. The code is generated based on the context of your file, such as comments as well as surrounding source code and import statements. CodeWhisperer is available as part of the AWS Toolkit for Visual Studio Code and the JetBrains family of IDEs. CodeWhisperer is also available for AWS Cloud9, the AWS Lambda console, JupyterLab, Amazon SageMaker Studio, and AWS Glue Studio. CodeWhisperer supports popular programming languages like Java, Python, C#, TypeScript, Go, JavaScript, Rust, PHP, Kotlin, C, C++, Shell scripting, SQL, and Scala.

In this post, we will explore how to leverage CodeWhisperer in Java applications specifically using the Spring Boot framework. Spring Boot is an extension of the Spring framework that makes it easier to develop Java applications and microservices. Using CodeWhisperer, you will be spending less time creating boilerplate and repetitive code and more time focusing on business logic. You can generate entire Java Spring Boot functions and logical code blocks without having to search for code snippets from the web and customize them according to your requirements. CodeWhisperer will enable you to responsibly use AI to create syntactically correct and secure Java Spring Boot applications. To enable CodeWhisperer in your IDE, please see Setting up CodeWhisperer for VS Code or Setting up Amazon CodeWhisperer for JetBrains depending on which IDE you are using.

Note: CodeWhisperer uses artificial intelligence to provide code recommendations, and its output is non-deterministic. The code you get from Amazon CodeWhisperer might differ from the code shown in this post.

Creating Data Transfer Objects (DTO)

Amazon CodeWhisperer makes it easier to develop the classes as you include import statements and provide brief comments on the purpose of the class. Let’s start with the basics and develop a simple DTO or Plain Old Java Object (POJO). This class will contain properties representing a product. This DTO will be referenced later as part of a REST controller we generate to serialize the output to JSON. CodeWhisperer will create a DTO class by using the class name and comments provided in plain language. Detailed and contextual comments will enable CodeWhisperer to generate code suggestions ranging from snippets to full functions in real time. For this use case, you are going to create a product class with id, name, price, description and rating properties.

Type the following or a similar comment in the class:

package com.amazonws.demo.cart.dto;

//create a Product class with id, name, price, description and rating properties.

Quickly develop a Java class using Amazon CodeWhisperer

After you enter the comment and press ENTER, CodeWhisperer will start providing code suggestions. You can use the Tab key to accept a suggestion based on the context or use the left/right arrow keys to see more suggestions. As shown below, the product class is auto-generated with five properties (id, name, price, rating, and description), default getter/setter methods, and two constructors. If you need more properties, you can either update the comment to include the new properties or manually add them to the file:

package com.amazonws.demo.cart.dto;

//create a Product class with id, name, price, description and rating properties. 

public class Product {
    private String id;
    private String name;
    private Double price;
    private String description;
    private Integer rating;
    
    public Product() {
    }
  
    public Product(String id, String name, Double price) {
      this.id = id;
      this.name = name;
      this.price = price;
    }
  
    public String getId() {
      return id;
    }
  
    public void setId(String id) {
      this.id = id;
    }
  
    public String getName() {
      return name;
    }
  
    public void setName(String name) {
      this.name = name;
    }
  
    public Double getPrice() {
      return price;
    }
  
    public void setPrice(Double price) {
      this.price = price;
    }
    
    public String getDescription(){
      return description;
    }

    public void setDescription(String description){
      this.description = description;
    }

    public Integer getRating(){
      return rating;
    }
    
    public void setRating(Integer rating){
      this.rating = rating;
    }
}

Implementing Data Access Object (DAO) pattern

Next, we implement the DAO pattern in this case for Amazon DynamoDB. The DAO pattern allows you to decouple the application/business layers from the persistence layer. It contains all the implementation logic for interacting with the persistence layer. We will create an entity class which represents the data to persist in DynamoDB along with the DAO class which contains the persistence logic.

First, create a ProductDaoEntity class which maps to the Amazon DynamoDB table. Create a blank ProductDaoEntity class and import DynamoDB packages for annotations, attributes, and partition key, as shown below. Notice that the class has a comment about the class structure and use of the DynamoDB enhanced client, so that CodeWhisperer can provide meaningful suggestions. The enhanced client allows you to map client-side classes to DynamoDB tables.

package com.amazonws.demo.cart.dao.entity;

import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbAttribute;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbBean;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbPartitionKey;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbSortKey;

//create a dynamo db ProductDaoEntity class with partition Key as id, name, price, description, rating attributes using dynamo db enhanced mapper annotations

CodeWhisperer can now use the comments, context, and import statements to start generating the class implementation. You can accept or reject suggestions based on your requirements. Below, you can see the complete class generated by CodeWhisperer.

@DynamoDbBean
public class ProductDaoEntity {

    private String id;
    private String name;
    private double price;
    private String description;
    private int rating;
    
    public ProductDaoEntity() {
    }

    public ProductDaoEntity(String id, String name, double price, String description, int rating) {
        this.id = id;
        this.name = name;
        this.price = price;
        this.description = description;
        this.rating = rating;
    }

    @DynamoDbPartitionKey
    @DynamoDbAttribute("id")
    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    @DynamoDbSortKey
    @DynamoDbAttribute("name")
    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @DynamoDbAttribute("price")
    public double getPrice() {
        return price;
    }

    public void setPrice(double price) {
        this.price = price;
    }

    @DynamoDbAttribute("description")
    public String getDescription() {
        return description;
    }

    public void setDescription(String description) {
        this.description = description;
    }

    @DynamoDbAttribute("rating")
    public int getRating() {
        return rating;
    }

    public void setRating(int rating) {
        this.rating = rating;
    }
    
    @Override
    public String toString() {
        return "ProductDaoEntity [id=" + id + ", name=" + name + ", price=" + price + ", description=" + description
                + ", rating=" + rating + "]";
    }

}

Notice how CodeWhisperer includes the appropriate DynamoDB-related annotations such as @DynamoDbBean, @DynamoDbPartitionKey, @DynamoDbSortKey and @DynamoDbAttribute. These annotations are used to generate a TableSchema for mapping classes to tables.

Now that you have the mapper methods completed, you can create the actual persistence logic that is specific to DynamoDB. Create a class named ProductDaoImpl. (Note: it’s a best practice for DAOImpl class to implement a DAO interface class. We left that out for brevity.) Using the import statements and comments, CodeWhisperer can auto-generate most of the DynamoDB persistence logic for you. Create a ProductDaoImpl class which uses a DynamoDbEnhancedClient object as shown below.

package com.amazonws.demo.cart.dao;

import javax.annotation.PostConstruct;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.amazonws.demo.cart.dao.Mapper.ProductMapper;
import com.amazonws.demo.cart.dao.entity.ProductDaoEntity;
import com.amazonws.demo.cart.dto.Product;

import software.amazon.awssdk.core.internal.waiters.ResponseOrException;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbTable;
import software.amazon.awssdk.enhanced.dynamodb.Key;
import software.amazon.awssdk.enhanced.dynamodb.TableSchema;


@Component
public class ProductDaoImpl{
    private static final Logger logger = LoggerFactory.getLogger(ProductDaoImpl.class);
    private static final String PRODUCT_TABLE_NAME = "Products";
    private final DynamoDbEnhancedClient enhancedClient;

    @Autowired
    public ProductDaoImpl(DynamoDbEnhancedClient enhancedClient){
        this.enhancedClient = enhancedClient;

    }

Rather than providing comments that describe the functionality of the entire class, you can provide comments for each specific method here. You will use CodeWhisperer to generate the implementation details for interacting with DynamoDB. If the Products table doesn’t already exist, you will need to create it. Based on the comment, CodeWhisperer will generate a method to create a Products table if one does not exist. As you can see, you don’t have to memorize or search through the DynamoDB API documentation to implement this logic. CodeWhisperer will save you time and effort by giving contextualized suggestions.

//Create the DynamoDB table through enhancedClient object from ProductDaoEntity. If the table already exists, log the error.
    @PostConstruct
    public void createTable() {
        try {
            DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
            productTable.createTable();
        } catch (Exception e) {
            logger.error("Error creating table: ", e);
        }
    }

Now, you can create the CRUD operations for the Product object. You can start with the createProduct operation to insert a new product entity to the DynamoDB table. Provide a comment about the purpose of the method along with relevant implementation details.

    // Create the createProduct() method 
    // Insert the ProductDaoEntity object into the DynamoDB table
    // Return the Product object

CodeWhisperer will start auto-generating the Create operation as shown below. You can accept or reject the suggestions as needed. You can also select from alternate suggestions, if available, using the left/right arrow keys.

   // Create the createProduct() method
   // Insert the ProductDaoEntity object into the DynamoDB table
   // Return the Product object
    public ProductDaoEntity createProduct(ProductDaoEntity productDaoEntity) {
        DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
        productTable.putItem(productDaoEntity);
        return productDaoEntity;
    }

Similarly, you can generate a method to return a specific product by id. Provide a contextual comment, as shown below.

// Get a particular ProductDaoEntity object from the DynamoDB table using the
 // product id and return the Product object

Below is the auto-generated code. CodeWhisperer has correctly analyzed the comments and generated the method to get a Product by its id.

    //Get a particular ProductDaoEntity object from the DynamoDB table using the
    // product id and return the Product object
    
    public ProductDaoEntity getProduct(String productId) {
        DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
        ProductDaoEntity productDaoEntity = productTable.getItem(Key.builder().partitionValue(productId).build());
        return productDaoEntity;
    }

Similarly, you can implement the DAO layer methods to delete and update products in the DynamoDB table.
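The note above mentions that a DAO implementation would normally implement a DAO interface, which the post leaves out. The following is a minimal sketch of what that could look like, including the update and delete operations. The ProductDao interface, its method names, and the InMemoryProductDao class are assumptions made for illustration, and the entity is a trimmed-down stand-in for the ProductDaoEntity class shown earlier. The in-memory map replaces DynamoDB so that the example runs with only the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Trimmed-down stand-in for the ProductDaoEntity class shown earlier.
class ProductDaoEntity {
    private final String id;
    private final String name;
    private final double price;

    ProductDaoEntity(String id, String name, double price) {
        this.id = id;
        this.name = name;
        this.price = price;
    }

    String getId() { return id; }
    String getName() { return name; }
    double getPrice() { return price; }
}

// Hypothetical DAO contract; the method names are assumptions, not from the post.
interface ProductDao {
    ProductDaoEntity createProduct(ProductDaoEntity entity);
    Optional<ProductDaoEntity> getProduct(String id);
    ProductDaoEntity updateProduct(ProductDaoEntity entity);
    void deleteProduct(String id);
    List<ProductDaoEntity> getAllProducts();
}

// In-memory implementation intended for tests or local runs only.
class InMemoryProductDao implements ProductDao {
    private final Map<String, ProductDaoEntity> store = new ConcurrentHashMap<>();

    @Override
    public ProductDaoEntity createProduct(ProductDaoEntity entity) {
        store.put(entity.getId(), entity);
        return entity;
    }

    @Override
    public Optional<ProductDaoEntity> getProduct(String id) {
        return Optional.ofNullable(store.get(id));
    }

    @Override
    public ProductDaoEntity updateProduct(ProductDaoEntity entity) {
        // Replaces the stored item with the same id, similar to a full-item put.
        store.put(entity.getId(), entity);
        return entity;
    }

    @Override
    public void deleteProduct(String id) {
        store.remove(id);
    }

    @Override
    public List<ProductDaoEntity> getAllProducts() {
        return new ArrayList<>(store.values());
    }

    public static void main(String[] args) {
        ProductDao dao = new InMemoryProductDao();
        dao.createProduct(new ProductDaoEntity("p1", "Pen", 1.5));
        dao.updateProduct(new ProductDaoEntity("p1", "Pen", 2.0));
        System.out.println(dao.getProduct("p1").map(ProductDaoEntity::getPrice));
        dao.deleteProduct("p1");
        System.out.println(dao.getAllProducts().size());
    }
}
```

Coding the service layer against an interface like this lets you swap the DynamoDB-backed class for an in-memory one in unit tests without touching the callers.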

Creating a Service Object

Next, you will generate the ProductService class, which retrieves the Product using ProductDao. In Spring Boot, annotating a class with @Service allows it to be detected through classpath scanning.

Let’s provide a comment to generate the ProductService class:

package com.amazonws.demo.cart.service;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import com.amazonws.demo.cart.dto.Product;
import com.amazonws.demo.cart.dao.ProductDao;

//Create a class called ProductService with methods: getProductById(string id),
//getAllProducts(), updateProduct(Product product), 
//deleteProduct(string id), createProduct(Product product)

CodeWhisperer will create the following class implementation. Note, you may have to adjust return types or method parameter types as needed. Notice the @Service annotation for this class along with the productDao property being @Autowired.

@Service
public class ProductService {

   @Autowired
   ProductDao productDao;

   public Product getProductById(String id) {
      return productDao.getProductById(id);
   }

   public List<Product> getAllProducts() {
      return productDao.getAllProducts();
   }

   public void updateProduct(Product product) {
      productDao.updateProduct(product);
   }

   public void deleteProduct(String id) {
      productDao.deleteProduct(id);
   }

   public void createProduct(Product product) {
      productDao.createProduct(product);
   }

}
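The service works with Product DTOs, while the DAO methods shown earlier return ProductDaoEntity objects, so something has to convert between the two. The post imports a ProductMapper class but does not show it. The following is a minimal sketch of what such a mapper might look like. The toDto and toEntity method names, the null handling, and the default price are assumptions, and both data classes are trimmed-down stand-ins for the classes shown earlier.

```java
// Trimmed-down stand-in for the Product DTO shown earlier.
class Product {
    private final String id;
    private final String name;
    private final Double price;

    Product(String id, String name, Double price) {
        this.id = id;
        this.name = name;
        this.price = price;
    }

    String getId() { return id; }
    String getName() { return name; }
    Double getPrice() { return price; }
}

// Trimmed-down stand-in for the ProductDaoEntity class shown earlier.
class ProductDaoEntity {
    private final String id;
    private final String name;
    private final double price;

    ProductDaoEntity(String id, String name, double price) {
        this.id = id;
        this.name = name;
        this.price = price;
    }

    String getId() { return id; }
    String getName() { return name; }
    double getPrice() { return price; }
}

// Hypothetical mapper; the real ProductMapper is not shown in the post.
final class ProductMapper {
    private ProductMapper() {
    }

    // Converts a persistence entity into the DTO returned to callers.
    static Product toDto(ProductDaoEntity entity) {
        if (entity == null) {
            return null;
        }
        return new Product(entity.getId(), entity.getName(), entity.getPrice());
    }

    // Converts a DTO into the entity that is written to the table.
    static ProductDaoEntity toEntity(Product product) {
        if (product == null) {
            return null;
        }
        // Assumption: a missing price is stored as 0.0.
        double price = product.getPrice() == null ? 0.0 : product.getPrice();
        return new ProductDaoEntity(product.getId(), product.getName(), price);
    }

    public static void main(String[] args) {
        Product dto = toDto(new ProductDaoEntity("p1", "Pen", 1.5));
        System.out.println(dto.getId() + " " + dto.getName() + " " + dto.getPrice());
    }
}
```

Keeping the conversion in one place means the DTO and the entity can evolve separately, for example when the table gains attributes that the REST API should not expose.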

Creating a REST Controller

The REST controller typically handles incoming client HTTP requests and responses, and its output is typically serialized into JSON or XML formats. Using annotations, Spring Boot maps the HTTP methods such as GET, PUT, POST, and DELETE to appropriate methods within the controller. It also binds the HTTP request data to parameters defined within the controller methods.

Provide a comment as shown below specifying that the class is a REST controller that should support CORS along with the required methods.

package com.amazonws.demo.product.controller;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.CrossOrigin;
import org.springframework.web.bind.annotation.DeleteMapping;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.PutMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import com.amazonws.demo.cart.dto.Product;
import com.amazonws.demo.cart.service.ProductService;

//create a RestController called ProductController to get all
//products, get a product by id, create a product, update a product,
//and delete a product. support cross origin requests from all origins.

Notice how the appropriate annotations are added to support CORS along with the mapping annotations that correspond with the GET, PUT, POST and DELETE HTTP methods. The @RestController annotation is used to specify that this controller returns an object serialized as XML or JSON rather than a view.

@RestController
@RequestMapping("/product")
@CrossOrigin(origins = "*")
public class ProductController {

    @Autowired
    private ProductService productService;
    
    @GetMapping("/getAllProducts")
    public List<Product> getAllProducts() {
        return productService.getAllProducts();
    }

    @GetMapping("/getProductById/{id}")
    public Product getProductById(@PathVariable String id) {
        return productService.getProductById(id);
    }

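    @PostMapping("/createProduct")
    public Product createProduct(@RequestBody Product product) {
        // ProductService.createProduct returns void, so return the submitted product.
        productService.createProduct(product);
        return product;
    }

    @PutMapping("/updateProduct")
    public Product updateProduct(@RequestBody Product product) {
        // ProductService.updateProduct returns void, so return the submitted product.
        productService.updateProduct(product);
        return product;
    }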

    @DeleteMapping("/deleteProduct/{id}")
    public void deleteProduct(@PathVariable String id) {
        productService.deleteProduct(id);
    }

}
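To try the endpoints defined by the controller, you could build requests with the JDK's java.net.http API (Java 11 or later). The sketch below only constructs the requests. It assumes the application runs locally on port 8080, which is Spring Boot's default, and the ProductApiRequests class name and sample JSON payload are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class ProductApiRequests {
    // Assumption: the application runs locally on Spring Boot's default port.
    static final String BASE_URL = "http://localhost:8080/product";

    static HttpRequest getAllProducts() {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/getAllProducts"))
                .GET()
                .build();
    }

    static HttpRequest getProductById(String id) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/getProductById/" + id))
                .GET()
                .build();
    }

    static HttpRequest createProduct(String json) {
        // @RequestBody expects a JSON body, so set the content type explicitly.
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/createProduct"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    static HttpRequest deleteProduct(String id) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/deleteProduct/" + id))
                .DELETE()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = createProduct("{\"id\":\"p1\",\"name\":\"Pen\",\"price\":1.5}");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send a request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) while the application is running.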

Conclusion

In this post, you have used CodeWhisperer to generate DTOs, controllers, service objects, and persistence classes. By interpreting your natural language comments, CodeWhisperer provides contextual code snippets to accelerate your development. CodeWhisperer also has additional features, such as the reference tracker, which detects whether a code suggestion might resemble open-source training data and can flag such suggestions with the open-source project’s repository URL, file reference, and license information for your review before you decide whether to incorporate the suggested code.

Try out Amazon CodeWhisperer today to get a head start on your coding projects.

Configure fine-grained access to your resources shared using AWS Resource Access Manager

2023-08-03 Fabian Labat

Post Syndicated from Fabian Labat original https://aws.amazon.com/blogs/security/configure-fine-grained-access-to-your-resources-shared-using-aws-resource-access-manager/

You can use AWS Resource Access Manager (AWS RAM) to securely, simply, and consistently share supported resource types within your organization or organizational units (OUs) and across AWS accounts. This means you can provision your resources once and use AWS RAM to share them with accounts. With AWS RAM, the accounts that receive the shared resources can list those resources alongside the resources they own.

When you share your resources by using AWS RAM, you can specify the actions that an account can perform and the access conditions on the shared resource. AWS RAM provides AWS managed permissions, which are created and maintained by AWS and which grant permissions for common customer scenarios. Now, you can further tailor resource access by authoring and applying fine-grained customer managed permissions in AWS RAM. A customer managed permission is a managed permission that you create to precisely specify who can do what under which conditions for the resource types included in your resource share.

This blog post walks you through how to use customer managed permissions to tailor your resource access to meet your business and security needs. Customer managed permissions help you follow the best practice of least privilege for your resources that are shared using AWS RAM.

Considerations

Before you start, review the considerations for using customer managed permissions for supported resource types in the AWS RAM User Guide.

Solution overview

Many AWS customers share infrastructure services with accounts in an organization from a centralized infrastructure OU. The networking account in the infrastructure OU follows the best practice of least privilege and grants only the permissions that accounts receiving these resources, such as development accounts, require to perform a specific task. The solution in this post demonstrates how you can share an Amazon Virtual Private Cloud (Amazon VPC) IP Address Manager (IPAM) pool with the accounts in a Development OU. IPAM makes it simpler for you to plan, track, and monitor IP addresses for your AWS workloads.

You’ll use a networking account that owns an IPAM pool to share the pool with the accounts in a Development OU. You’ll do this by creating a resource share and a customer managed permission through AWS RAM. In this example, shown in Figure 1, both the networking account and the Development OU are in the same organization. The accounts in the Development OU only need the permissions that are required to allocate a classless inter-domain routing (CIDR) range and not to view the IPAM pool details. You’ll further refine access to the shared IPAM pool so that only AWS Identity and Access Management (IAM) users or roles tagged with team = networking can perform actions on the IPAM pool that’s shared using AWS RAM.

Figure 1: Multi-account diagram for sharing your IPAM pool from a networking account in the Infrastructure OU to accounts in the Development OU

Prerequisites

For this walkthrough, you must have the following prerequisites:

An AWS account (the networking account) with an IPAM pool already provisioned. For this example, create an IPAM pool in a networking account named ipam-vpc-pool-use1-dev. Because you share resources across accounts in the same AWS Region using AWS RAM, provision the IPAM pool in the same Region where your development accounts will access the pool.
An AWS OU with the associated development accounts to share the IPAM pool with. In this example, these accounts are in your Development OU.
An IAM role or user with permissions to perform IPAM and AWS RAM operations in the networking account and the development accounts.

Share your IPAM pool with your Development OU with least privilege permissions

In this section, you share an IPAM pool from your networking account to the accounts in your Development OU and grant least-privilege permissions. To do that, you create a resource share that contains your IPAM pool, your customer managed permission for the IPAM pool, and the OU principal you want to share the IPAM pool with. A resource share contains resources you want to share, the principals you want to share the resources with, and the managed permissions that grant resource access to the account receiving the resources. You can add the IPAM pool to an existing resource share, or you can create a new resource share. Depending on your workflow, you can start creating a resource share either in the Amazon VPC IPAM or in the AWS RAM console.

To initiate a new resource share from the Amazon VPC IPAM console

Sign in to the AWS Management Console as your networking account. For Features, select Amazon VPC IP Address Manager console.
Select ipam-vpc-pool-use1-dev, which was provisioned as part of the prerequisites.
On the IPAM pool detail page, choose the Resource sharing tab.
Choose Create resource share.

Figure 2: Create resource share to share your IPAM pool

Alternatively, you can initiate a new resource share from the AWS RAM console.

To initiate a new resource share from the AWS RAM console

Sign in to the AWS Management Console as your networking account. For Services, select Resource Access Manager console.
Choose Create resource share.

Next, specify the resource share details, including the name, the resource type, and the specific resource you want to share. Note that the steps of the resource share creation process are located on the left side of the AWS RAM console.

To specify the resource share details

For Name, enter ipam-shared-dev-pool.
For Select resource type, choose IPAM pools.
For Resources, select the Amazon Resource Name (ARN) of the IPAM pool you want to share from a list of the IPAM pool ARNs you own.
Choose Next.

Figure 3: Specify the resources to share in your resource share

Configure customer managed permissions

In this example, the accounts in the Development OU need the permissions required to allocate a CIDR range, but not the permissions to view the IPAM pool details. The existing AWS managed permission grants both read and write permissions. Therefore, you need to create a customer managed permission to refine the resource access permissions for your accounts in the Development OU. With a customer managed permission, you can select and tailor the actions that the development accounts can perform on the IPAM pool, such as write-only actions.

In this section, you create a customer managed permission, configure the managed permission name, select the resource type, and choose the actions that are allowed with the shared resource.

To create and author a customer managed permission

On the Associate managed permissions page, choose Create customer managed permission. This will bring up a new browser tab with a Create a customer managed permission page.
On the Create a customer managed permission page, enter my-ipam-cmp for the Customer managed permission name.
Confirm the Resource type as ec2:IpamPool.
On the Visual editor tab of the Policy template section, select the Write checkbox only. This will automatically check all the available write actions.
Choose Create customer managed permission.

Figure 4: Create a customer managed permission with only write actions

Now that you’ve created your customer managed permission, you must associate it to your resource share.

To associate your customer managed permission

Go back to the previous Associate managed permissions page. This is most likely located in a separate browser tab.
Choose the refresh icon.
Select my-ipam-cmp from the dropdown menu.
Review the policy template, and then choose Next.

Next, select the IAM roles, IAM users, AWS accounts, AWS OUs, or organization you want to share your IPAM pool with. In this example, you share the IPAM pool with an OU in your organization.

To grant access to principals

On the Grant access to principals page, select Allow sharing only with your organization.
For Select principal type, choose Organizational unit (OU).
Enter the Development OU’s ID.
Select Add, and then choose Next.
Choose Create resource share to complete creation of your resource share.

Figure 5: Grant access to principals in your resource share

Verify the customer managed permissions

Now let’s verify that the customer managed permission is working as expected. In this section, you verify that the development account cannot view the details of the IPAM pool and that you can use that same account to create a VPC with the IPAM pool.

To verify that an account in your Development OU can’t view the IPAM pool details

Sign in to the AWS Management Console as an account in your Development OU. For Features, select Amazon VPC IP Address Manager console.
In the left navigation pane, choose Pools.
Select ipam-shared-dev-pool. You won’t be able to view the IPAM pool details.

To verify that an account in your Development OU can create a new VPC with the IPAM pool

Sign in to the AWS Management Console as an account in your Development OU. For Services, select VPC console.
On the VPC dashboard, choose Create VPC.
On the Create VPC page, select VPC only.
For name, enter my-dev-vpc.
Select IPAM-allocated IPv4 CIDR block.
Choose the ARN of the IPAM pool that’s shared with your development account.
For Netmask, select /24 256 IPs.
Choose Create VPC. You’ve successfully created a VPC with the IPAM pool shared with your account in your Development OU.

Figure 6: Create a VPC
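The "/24 256 IPs" label in the Netmask step follows from IPv4 prefix arithmetic: a block with prefix length n spans 2^(32 - n) addresses. A short sketch of that calculation follows. The CidrSize class name is made up for illustration.

```java
class CidrSize {
    // Number of IPv4 addresses in a CIDR block with the given prefix length (0 to 32).
    static long addressCount(int prefixLength) {
        if (prefixLength < 0 || prefixLength > 32) {
            throw new IllegalArgumentException("IPv4 prefix length must be between 0 and 32");
        }
        // 32 address bits minus the prefix bits leaves the host bits.
        return 1L << (32 - prefixLength);
    }

    public static void main(String[] args) {
        System.out.println("/24 -> " + addressCount(24) + " addresses");
    }
}
```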

Update customer managed permissions

You can create a new version of your customer managed permission to rescope and update the access granularity of your resources that are shared using AWS RAM. For example, you can add a condition in your customer managed permissions so that only IAM users or roles tagged with a particular principal tag can access and perform the actions allowed on resources shared using AWS RAM. If you need to update your customer managed permission (for example, after testing or as your business and security needs evolve), you can create and save a new version of the same customer managed permission rather than creating an entirely new customer managed permission. For example, you might want to adjust your access configurations to read-only actions for your development accounts and to rescope to read-write actions for your testing accounts. The new version of the permission won’t apply automatically to your existing resource shares, and you must explicitly apply it to those shares for it to take effect.

To create a version of your customer managed permission

Sign in to the AWS Management Console as your networking account. For Services, select Resource Access Manager console.
In the left navigation pane, choose Managed permissions library.
For Filter by text, enter my-ipam-cmp and select my-ipam-cmp. You can also select the Any type dropdown menu and then select Customer managed to narrow the list of managed permissions to only your customer managed permissions.
On the my-ipam-cmp page, choose Create version.
You can make the customer managed permission more fine-grained by adding a condition. On the Create a customer managed permission for my-ipam-cmp page, under the Policy template section, choose JSON editor.

Add a condition with aws:PrincipalTag that allows only the users or roles tagged with team = networking to access the shared IPAM pool.

"Condition": {
                "StringEquals": {
                    "aws:PrincipalTag/team": "networking"
                }
            }

Choose Create version. This new version will be automatically set as the default version of your customer managed permission. As a result, new resource shares that use the customer managed permission will use the new version.

Figure 7: Update your customer managed permissions and add a condition statement with aws:PrincipalTag

Note: Now that you have the new version of your customer managed permission, you must explicitly apply it to your existing resource shares for it to take effect.

To apply the new version of the customer managed permission to existing resource shares

On the my-ipam-cmp page, under the Managed permission versions, select Version 1.
Choose the Associated resource shares tab.
Find ipam-shared-dev-pool and next to the current version number, select Update to default version. This will update your ipam-shared-dev-pool resource share with the new version of your my-ipam-cmp customer managed permission.

To verify your updated customer managed permission, see the Verify the customer managed permissions section earlier in this post. Make sure that you sign in with an IAM role or user tagged with team = networking, and then repeat the steps of that section to verify your updated customer managed permission. If you use an IAM role or user that is not tagged with team = networking, you won’t be able to allocate a CIDR from the IPAM pool and you won’t be able to create the VPC.
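The effect of the aws:PrincipalTag condition can be pictured as a simple check on the caller's tags. The sketch below is a conceptual model only and is not how AWS evaluates policies internally. The PrincipalTagCondition class and isAllowed method are made up for illustration. It mirrors the StringEquals operator, which is an exact, case-sensitive match.

```java
import java.util.Map;

class PrincipalTagCondition {
    // Models "StringEquals": {"aws:PrincipalTag/team": "networking"} for illustration only.
    static boolean isAllowed(Map<String, String> principalTags) {
        // A missing tag yields null, which never equals "networking".
        return "networking".equals(principalTags.get("team"));
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(Map.of("team", "networking")));
        System.out.println(isAllowed(Map.of("team", "development")));
        System.out.println(isAllowed(Map.of()));
    }
}
```

In this model, a role tagged team = networking passes the check, while a role with a different value or no team tag does not, which matches the behavior described above.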

Cleanup

To remove the resources created by the preceding example:

Delete the resource share from the AWS RAM console.
Deprovision the CIDR from the IPAM pool.
Delete the IPAM pool you created.

Summary

This blog post presented an example of using customer managed permissions in AWS RAM. AWS RAM brings simplicity, consistency, and confidence when sharing your resources across accounts. In the example, you used AWS RAM to share an IPAM pool to accounts in a Development OU, configured fine-grained resource access controls, and followed the best practice of least privilege by granting only the permissions required for the accounts in the Development OU to perform a specific task with the shared IPAM pool. In the example, you also created a new version of your customer managed permission to rescope the access granularity of your resources that are shared using AWS RAM.

To learn more about AWS RAM and customer managed permissions, see the AWS RAM documentation and watch the AWS RAM Introduces Customer Managed Permissions demo.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

2023-08-01 Tom Romano

Post Syndicated from Tom Romano original https://aws.amazon.com/blogs/big-data/empower-your-jira-data-in-a-data-lake-with-amazon-appflow-and-aws-glue/

In the world of software engineering and development, organizations use project management tools like Atlassian Jira Cloud. Managing projects with Jira leads to rich datasets, which can provide historical and predictive insights about project and development efforts.

Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications. Companies often take a data lake approach to their analytics, bringing data from many different systems into one place to simplify how the analytics are done.

This post shows you how to use Amazon AppFlow and AWS Glue to create a fully automated data ingestion pipeline that will synchronize your Jira data into your data lake. Amazon AppFlow provides software as a service (SaaS) integration with Jira Cloud to load the data into your AWS account. AWS Glue is a serverless data discovery, load, and transformation service that will prepare data for consumption in BI and AI/ML activities. The solution is low-code and serverless to keep operational overhead and cost down, and it supports incremental loading so that only changes are downloaded after the initial full load.

Solution overview

This solution uses Amazon AppFlow to retrieve data from the Jira Cloud. The data is synchronized to an Amazon Simple Storage Service (Amazon S3) bucket using an initial full download and subsequent incremental downloads of changes. When new data arrives in the S3 bucket, an AWS Step Functions workflow is triggered that orchestrates extract, transform, and load (ETL) activities using AWS Glue crawlers and AWS Glue DataBrew. The data is then available in the AWS Glue Data Catalog and can be queried by services such as Amazon Athena, Amazon QuickSight, and Amazon Redshift Spectrum. The solution is completely automated and serverless, resulting in low operational overhead. When this setup is complete, your Jira data will be automatically ingested and kept up to date in your data lake!

The following diagram illustrates the solution architecture.

The Jira AppFlow architecture is shown. The Jira Cloud data is retrieved by Amazon AppFlow and is stored in Amazon S3. This triggers an Amazon EventBridge event that runs an AWS Step Functions workflow. The workflow uses AWS Glue to catalog and transform the data. The data is then queried with QuickSight.

The Step Functions workflow orchestrates the following ETL activities, resulting in two tables:

An AWS Glue crawler collects all downloads into a single AWS Glue table named jira_raw. This table contains a mix of full and incremental downloads from Jira, with many versions of the same records representing changes over time.
A DataBrew job prepares the data for reporting by unpacking key-value pairs in the fields, as well as removing superseded records as they are updated in subsequent change data captures. This reporting-ready data will be available in an AWS Glue table named jira_data.

The following figure shows the Step Functions workflow.

A diagram represents the AWS Step Functions workflow. It contains the steps to run an AWS Glue crawler, wait for its completion, and then run an AWS Glue DataBrew data transformation job.
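The workflow shape described above (start the crawler, wait for it to finish, then run the DataBrew job) can be sketched as an Amazon States Language definition built from a Python dictionary. This is an illustrative sketch, not the definition that the CloudFormation stack deploys: the state names, the polling interval, and the function `build_state_machine` are assumptions.

```python
import json


def build_state_machine(crawler_name, databrew_job_name, wait_seconds=30):
    """Return an Amazon States Language definition as a dict (illustrative only)."""
    return {
        "Comment": "Crawl new Jira files, then run the DataBrew job",
        "StartAt": "StartCrawler",
        "States": {
            "StartCrawler": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": crawler_name},
                "Next": "WaitForCrawler",
            },
            # Poll instead of blocking: Glue crawlers have no .sync integration.
            "WaitForCrawler": {
                "Type": "Wait",
                "Seconds": wait_seconds,
                "Next": "GetCrawler",
            },
            "GetCrawler": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
                "Parameters": {"Name": crawler_name},
                "Next": "IsCrawlerDone",
            },
            # A crawler returns to the READY state when its run has finished.
            "IsCrawlerDone": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.Crawler.State",
                        "StringEquals": "READY",
                        "Next": "RunDataBrewJob",
                    }
                ],
                "Default": "WaitForCrawler",
            },
            "RunDataBrewJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::databrew:startJobRun.sync",
                "Parameters": {"Name": databrew_job_name},
                "End": True,
            },
        },
    }


# The crawler and job names here are placeholders, not names from the stack.
print(json.dumps(build_state_machine("example-crawler", "example-job"), indent=2))
```

Polling with a Wait and Choice loop is one common way to wait for a crawler; the deployed workflow may use a different mechanism.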

Prerequisites

This solution requires the following:

Administrative access to your Jira Cloud instance, and an associated Jira Cloud developer account.
An AWS account and a login with access to the AWS Management Console. Your login will need AWS Identity and Access Management (IAM) permissions to create and access the resources in your AWS account.
Basic knowledge of AWS and working knowledge of Jira administration.

Configure the Jira Instance

After logging in to your Jira Cloud instance, you establish a Jira project with associated epics and issues to download into a data lake. If you’re starting with a new Jira instance, it helps to have at least one project with a sampling of epics and issues for the initial data download, because it allows you to create an initial dataset without errors or missing fields. Note that you may have multiple projects as well.

An image shows a Jira Cloud example, with several issues arranged in a Kanban board.

After you have established your Jira project and populated it with epics and issues, ensure you also have access to the Jira developer portal. In later steps, you use this developer portal to establish authentication and permissions for the Amazon AppFlow connection.

Provision resources with AWS CloudFormation

For the initial setup, you launch an AWS CloudFormation stack to create an S3 bucket to store data, IAM roles for data access, and the AWS Glue crawler and Data Catalog components. Complete the following steps:

Sign in to your AWS account.
Click Launch Stack:
For Stack name, enter a name for the stack (the default is aws-blog-jira-datalake-with-AppFlow).
For GlueDatabaseName, enter a unique name for the Data Catalog database to hold the Jira data table metadata (the default is jiralake).
For InitialRunFlag, choose Setup. This mode scans all data and disables the change data capture (CDC) features of the stack, because the stack needs an initial full data load before you configure CDC in later steps.
Under Capabilities and transforms, select the acknowledgement check boxes to allow IAM resources to be created within your AWS account.
Review the parameters and choose Create stack to deploy the CloudFormation stack. This process will take around 5–10 minutes to complete.
After the stack is deployed, review the Outputs tab for the stack and collect the following values to use when you set up Amazon AppFlow:
- Amazon AppFlow destination bucket (o01AppFlowBucket)
- Amazon AppFlow destination bucket path (o02AppFlowPath)
- Role for Amazon AppFlow Jira connector (o03AppFlowRole)
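If you prefer to collect these values programmatically, the outputs list returned by the CloudFormation `describe_stacks` API can be turned into a lookup table. The sketch below works on a response-shaped dictionary so that it runs without AWS credentials; the sample bucket, path, and role values are made up, and `stack_outputs` is a hypothetical helper.

```python
def stack_outputs(describe_stacks_response):
    """Map OutputKey to OutputValue for the first stack in the response."""
    stacks = describe_stacks_response.get("Stacks", [])
    if not stacks:
        return {}
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stacks[0].get("Outputs", [])
    }


# Sample response with invented values, shaped like a describe_stacks result.
sample_response = {
    "Stacks": [
        {
            "StackName": "aws-blog-jira-datalake-with-AppFlow",
            "Outputs": [
                {"OutputKey": "o01AppFlowBucket", "OutputValue": "example-bucket"},
                {"OutputKey": "o02AppFlowPath", "OutputValue": "example/path"},
                {"OutputKey": "o03AppFlowRole", "OutputValue": "example-role"},
            ],
        }
    ]
}

outputs = stack_outputs(sample_response)
print(outputs["o01AppFlowBucket"], outputs["o02AppFlowPath"], outputs["o03AppFlowRole"])
```

With boto3 installed and credentials configured, the same function could be given the result of `boto3.client("cloudformation").describe_stacks(StackName=...)`.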

Configure Jira Cloud

Next, you configure your Jira Cloud instance for access by Amazon AppFlow. For full instructions, refer to Jira Cloud connector for Amazon AppFlow. The following steps summarize these instructions and discuss the specific configuration to enable OAuth in the Jira Cloud:

Open the Jira developer portal.
Create the OAuth 2 integration from the developer application console by choosing Create an OAuth 2.0 Integration. This will provide a login mechanism for AppFlow.
Enable fine-grained permissions. See Recommended scopes for the permission settings to grant AppFlow appropriate access to your Jira instance.
Add the following permission scopes to your OAuth app:
1. manage:jira-configuration
2. read:field-configuration:jira
Under Authorization, set the Callback URL to return to Amazon AppFlow with the URL https://us-east-1.console.aws.amazon.com/AppFlow/oauth.
Under Settings, note the client ID and secret to use in later steps to set up authentication from Amazon AppFlow.

Create the Amazon AppFlow Jira Cloud connection

In this step, you configure Amazon AppFlow to run a one-time full data fetch of all your data, establishing the initial data lake:

On the Amazon AppFlow console, choose Connectors in the navigation pane.
Search for the Jira Cloud connector.
Choose Create flow on the connector tile to create the connection to your Jira instance.
For Flow name, enter a name for the flow (for example, JiraLakeFlow).
Leave the Data encryption setting as the default.
Choose Next.
For Source name, keep the default of Jira Cloud.
Choose Create new connection under Jira Cloud connection.
In the Connect to Jira Cloud section, enter the values for Client ID, Client secret, and Jira Cloud Site that you collected earlier. This provides the authentication from AppFlow to Jira Cloud.
For Connection Name, enter a connection name (for example, JiraLakeCloudConnection).
Choose Connect. You will be prompted to allow your OAuth app to access your Atlassian account to verify authentication.
In the Authorize App window that pops up, choose Accept.
With the connection created, return to the Configure flow section on the Amazon AppFlow console.
For API version, choose V2 to use the latest Jira query API.
For Jira Cloud object, choose Issue to query and download all issues and associated details.
For Destination Name in the Destination Details section, choose Amazon S3.
For Bucket details, choose the S3 bucket name that matches the Amazon AppFlow destination bucket value that you collected from the outputs of the CloudFormation stack.
Enter the Amazon AppFlow destination bucket path to complete the full S3 path. This will send the Jira data to the S3 bucket created by the CloudFormation script.
Leave Catalog your data in the AWS Glue Data Catalog unselected. The CloudFormation script uses an AWS Glue crawler to update the Data Catalog in a different manner, grouping all the downloads into a common table, so we disable the update here.
For File format settings, select Parquet format and select Preserve source data types in Parquet output. Parquet is a columnar format to optimize subsequent querying.
Select Add a timestamp to the file name for Filename preference. This will allow you to easily find data files downloaded at a specific date and time.
For now, select Run on Demand for the Flow trigger to run the full load flow manually. You will schedule downloads in a later step when implementing CDC.
Choose Next.
On the Map data fields page, select Manually map fields.
For Source to destination field mapping, choose the drop-down box under Source field name and select Map all fields directly. This will bring down all fields as they are received, because data preparation is handled in later steps.
Under Partition and aggregation settings, you can set up the partitions in a way that works for your use case. For this example, we use a daily partition, so select Date and time and choose Daily.
For Aggregation settings, leave it as the default of Don’t aggregate.
Choose Next.
On the Add filters page, you can create filters to only download specific data. For this example, you download all the data, so choose Next.
Review and choose Create flow.
When the flow is created, choose Run flow to start the initial data seeding. After some time, you should receive a banner indicating the run finished successfully.

Review seed data

At this stage in the process, you have data in your S3 bucket. When new data files are created in the S3 bucket, the Step Functions workflow automatically runs an AWS Glue crawler to catalog the new data. You can see if it’s complete by reviewing the Step Functions state machine for a Succeeded run status. There is a link to the state machine on the CloudFormation stack’s Resources tab, which will redirect you to the Step Functions state machine.

An image showing the CloudFormation resources tab of the stack, with a link to the AWS Step Functions workflow.

When the state machine is complete, it’s time to review the raw Jira data with Athena. The database is as you specified in the CloudFormation stack (jiralake by default), and the table name is jira_raw. If you kept the default AWS Glue database name of jiralake, the Athena SQL is as follows:

SELECT * FROM "jiralake"."jira_raw" limit 10;

If you explore the data, you’ll notice that most of the data you would want to work with is actually packed into a column called fields. This means the data is not available as columns in your Athena queries, making it harder to select, filter, and sort individual fields within an Athena SQL query. This will be addressed in the next steps.

An image demonstrating the Amazon Athena query SELECT * FROM "jiralake"."jira_raw" limit 10;
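To make the problem concrete, the following sketch shows what unpacking a packed fields column into top-level columns looks like. It assumes, for illustration only, that each raw record is a dictionary whose `fields` entry is itself a dictionary; the sample issue and the helper `unpack_fields` are hypothetical and are not the DataBrew recipe used by the solution.

```python
def unpack_fields(record, column="fields"):
    """Return a flat copy of record with the nested column promoted to top level."""
    flat = {key: value for key, value in record.items() if key != column}
    for name, value in (record.get(column) or {}).items():
        # Prefix the promoted keys so they cannot collide with existing columns.
        flat[f"{column}_{name}"] = value
    return flat


# Invented sample row; real Jira rows contain many more entries.
raw_row = {
    "id": "10001",
    "key": "PROJ-1",
    "fields": {"summary": "Set up data lake", "status": "In Progress"},
}

print(unpack_fields(raw_row))
```

Once the values are top-level columns, they can be selected, filtered, and sorted directly in SQL.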

Set up CDC and unpack the fields columns

To add the ongoing CDC and reformat the data for analytics, we introduce a DataBrew job to transform the data and filter to the most recent version of each record as changes come in. You can do this by updating the CloudFormation stack with a flag that includes the CDC and data transformation steps.

On the AWS CloudFormation console, return to the stack.
Choose Update.
Select Use current template and choose Next.
For SetupOrCDC, choose CDC, then choose Next. This will enable both the CDC steps and the data transformation steps for the Jira data.
Continue choosing Next until you reach the Review section.
Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
Return to the Amazon AppFlow console and open your flow.
On the Actions menu, choose Edit flow. We will now edit the flow trigger to run an incremental load on a periodic basis.
Select Run flow on schedule.
Configure the desired repeats, as well as start time and date. For this example, we choose Daily for Repeats and enter 1 for the number of days between flow runs. For Starting at, enter 01:00.
Select Incremental transfer for Transfer mode.
Choose Updated on the drop-down menu so that changes will be captured based on when the records were updated.
Choose Save. With these settings in our example, the run will happen nightly at 1:00 AM.

Review the analytics data

When the next incremental load occurs that results in new data, the Step Functions workflow will start the DataBrew job and populate a new staged analytical data table named jira_data in your Data Catalog database. If you don’t want to wait, you can trigger the Step Functions workflow manually.

The DataBrew job performs data transformation and filtering tasks. The job unpacks the key-values from the Jira JSON data and the raw Jira data, resulting in a tabular data schema that facilitates use with BI and AI/ML tools. As Jira items are changed, the changed item’s data is resent, resulting in multiple versions of an item in the raw data feed. The DataBrew job filters the raw data feed so that the resulting data table only contains the most recent version of each item. You could enhance this DataBrew job to further customize the data for your needs, such as renaming the generic Jira custom field names to reflect their business meaning.
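The filtering step can be pictured as keeping, for each item, only the version with the latest update time. The sketch below is a plain Python illustration of that idea, not the DataBrew recipe itself. It assumes each record carries an `id` and an `updated` timestamp in a consistent ISO 8601 format (so string comparison orders them correctly); both field names and the sample data are assumptions.

```python
def latest_versions(records, key="id", version="updated"):
    """Keep only the most recently updated record for each key."""
    latest = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record[version] > current[version]:
            latest[record[key]] = record
    return list(latest.values())


# Invented raw feed: item 10001 appears twice because it changed after the first load.
raw_feed = [
    {"id": "10001", "updated": "2023-07-01T09:00:00", "status": "To Do"},
    {"id": "10002", "updated": "2023-07-01T09:30:00", "status": "To Do"},
    {"id": "10001", "updated": "2023-07-02T14:15:00", "status": "Done"},
]

for row in latest_versions(raw_feed):
    print(row["id"], row["status"])
```

Because the result is keyed by item, the output has one row per item regardless of how many incremental loads have delivered updates for it.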

When the Step Functions workflow is complete, we can query the data in Athena again using the following query:

SELECT * FROM "jiralake"."jira_data" limit 10;

You can see that in our transformed jira_data table, the nested JSON fields are broken out into their own columns for each field. You will also notice that we’ve filtered out obsolete records that have been superseded by more recent record updates in later data loads so the data is fresh. If you want to rename custom fields, remove columns, or restructure what comes out of the nested JSON, you can modify the DataBrew recipe to accomplish this. At this point, the data is ready to be used by your analytics tools, such as Amazon QuickSight.

An image demonstrating the Amazon Athena query SELECT * FROM "jiralake"."jira_data" limit 10;

Clean up

If you would like to discontinue this solution, you can remove it with the following steps:

On the Amazon AppFlow console, deactivate the flow for Jira, and optionally delete it.
On the Amazon S3 console, select the S3 bucket for the stack, and empty the bucket to delete the existing data.
On the AWS CloudFormation console, delete the CloudFormation stack that you deployed.

Conclusion

In this post, we created a serverless incremental data load process for Jira that will synchronize data while handling custom fields using Amazon AppFlow, AWS Glue, and Step Functions. The approach uses Amazon AppFlow to incrementally load the data into Amazon S3. We then use AWS Glue and Step Functions to manage the extraction of the Jira custom fields and load them in a format to be queried by analytics services such as Athena, QuickSight, or Redshift Spectrum, or AI/ML services like Amazon SageMaker.

To learn more about AWS Glue and DataBrew, refer to Getting started with AWS Glue DataBrew. With DataBrew, you can take the sample data transformation in this project and customize the output to meet your specific needs. This could include renaming columns, creating additional fields, and more.

To learn more about Amazon AppFlow, refer to Getting started with Amazon AppFlow. Note that Amazon AppFlow supports integrations with many SaaS applications in addition to the Jira Cloud.

To learn more about orchestrating flows with Step Functions, see Create a Serverless Workflow with AWS Step Functions and AWS Lambda. The workflow could be enhanced to load the data into a data warehouse, such as Amazon Redshift, or trigger a refresh of a QuickSight dataset for analytics and reporting.

In future posts, we will cover how to unnest parent-child relationships within the Jira data using Athena and how to visualize the data using QuickSight.

About the Authors

Tom Romano is a Sr. Solutions Architect for AWS World Wide Public Sector from Tampa, FL, and assists GovTech and EdTech customers as they create new solutions that are cloud native, event driven, and serverless. He is an enthusiastic Python programmer for both application development and data analytics, and is an Analytics Specialist. In his free time, Tom flies remote control model airplanes and enjoys vacationing with his family around Florida and the Caribbean.

Shane Thompson is a Sr. Solutions Architect based out of San Luis Obispo, California, working with AWS Startups. He works with customers who use AI/ML in their business model and is passionate about democratizing AI/ML so that all customers can benefit from it. In his free time, Shane loves to spend time with his family and travel around the world.

Perform continuous vulnerability scanning of AWS Lambda functions with Amazon Inspector

2023-07-31 Manjunath Arakere

Post Syndicated from Manjunath Arakere original https://aws.amazon.com/blogs/security/perform-continuous-vulnerability-scanning-of-aws-lambda-functions-with-amazon-inspector/

This blog post demonstrates how you can activate Amazon Inspector within one or more AWS accounts and be notified when a vulnerability is detected in an AWS Lambda function.

Amazon Inspector is an automated vulnerability management service that continually scans workloads for software vulnerabilities and unintended network exposure. Amazon Inspector scans mixed workloads like Amazon Elastic Compute Cloud (Amazon EC2) instances and container images located in Amazon Elastic Container Registry (Amazon ECR). At re:Invent 2022, we announced Amazon Inspector support for Lambda functions and Lambda layers to provide a consolidated solution for compute types.

Only scanning your functions for vulnerabilities before deployment might not be enough since vulnerabilities can appear at any time, like the widespread Apache Log4j vulnerability. So it’s essential that workloads are continuously monitored and rescanned in near real time as new vulnerabilities are published or workloads are changed.

Amazon Inspector scans are intelligently initiated based on the updates to Lambda functions or when new Common Vulnerabilities and Exposures (CVEs) are published that are relevant to your function. No agents are needed for Amazon Inspector to work, which means you don’t need to install a library or agent in your Lambda functions or layers. When Amazon Inspector discovers a software vulnerability or network configuration issue, it creates a finding which describes the vulnerability, identifies the affected resource, rates the severity of the vulnerability, and provides remediation guidance.

In addition, Amazon Inspector integrates with several AWS services, such as Amazon EventBridge and AWS Security Hub. You can use EventBridge to build automation workflows like getting notified for a specific vulnerability finding or performing an automatic remediation with the help of Lambda or AWS Systems Manager.

In this blog post, you will learn how to do the following:

Activate Amazon Inspector in a single AWS account and AWS Region.
See how Amazon Inspector automated discovery and continuous vulnerability scanning works by deploying a new Lambda function with a vulnerable package dependency.
Receive a near real-time notification when a vulnerability with a specific severity is detected in a Lambda function with the help of EventBridge and Amazon Simple Notification Service (Amazon SNS).
Remediate the vulnerability by using the recommendation provided in the Amazon Inspector dashboard.
Activate Amazon Inspector in multiple accounts or Regions through AWS Organizations.

Solution architecture

Figure 1 shows the AWS services used in the solution and how they are integrated.

Figure 1: Solution architecture overview

The workflow for the solution is as follows:

Deploy a new Lambda function by using the AWS Serverless Application Model (AWS SAM).
Amazon Inspector scans when a new vulnerability is published, when an existing Lambda function is updated, or when a new Lambda function is deployed. Vulnerabilities are identified in the deployed Lambda function.
Amazon EventBridge receives the events from Amazon Inspector and checks against the rules for specific events or filter conditions.
In this case, an EventBridge rule exists for the Amazon Inspector findings, and the target is defined as an SNS topic to send an email to the system operations team.
The EventBridge rule invokes the target SNS topic with the event data, and an email is sent to the confirmed subscribers in the SNS topic.
The system operations team receives an email with detailed information on the vulnerability, the fixed package versions, the Amazon Inspector score to prioritize, and the impacted Lambda functions. By using the remediation information from Amazon Inspector, the team can now prioritize actions and remediate.

Prerequisites

To follow along with this demo, we recommend that you have the following in place:

An AWS account.
A command line interface: AWS CloudShell or AWS CLI. In this post, we recommend the use of CloudShell because it already has Python and AWS SAM. However, you can also use your own terminal with the AWS CLI, AWS SAM, and Python installed.
An AWS Region where Amazon Inspector Lambda code scanning is available.
An IAM role in that account with administrator privileges.

The solution in this post includes the following AWS services: Amazon Inspector, AWS Lambda, Amazon EventBridge, AWS Identity and Access Management (IAM), Amazon SNS, AWS CloudShell, and AWS Organizations (for activating Amazon Inspector at scale across multiple accounts).

Step 1: Activate Amazon Inspector in a single account in the Region

The first step is to activate Amazon Inspector in your account in the Region you are using.

To activate Amazon Inspector

Sign in to the AWS Management Console.
Open AWS CloudShell. CloudShell inherits the credentials and permissions of the IAM principal who is signed in to the AWS Management Console. CloudShell comes with the CLIs and runtimes that are needed for this demo (AWS CLI, AWS SAM, and Python).
Use the following command in CloudShell to get the status of the Amazon Inspector activation.
```
aws inspector2 batch-get-account-status
```
Use the following command to activate Amazon Inspector in the default Region for resource type LAMBDA. Other allowed values for resource types are EC2, ECR, and LAMBDA_CODE.
```
aws inspector2 enable --resource-types '["LAMBDA"]'
```
Use the following command to verify the status of the Amazon Inspector activation.
```
aws inspector2 batch-get-account-status
```

You should see a response that shows that Amazon Inspector is enabled for Lambda resources, as shown in Figure 2.

Figure 2: Amazon Inspector status after you enable Lambda scanning
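If you want to check the status in a script rather than by reading the output, the JSON response can be inspected programmatically. The sketch below assumes the response contains an `accounts` list whose entries have a `resourceState` map with a `status` per resource type; the sample response is invented, and `scanning_enabled` is a hypothetical helper.

```python
def scanning_enabled(status_response, resource="lambda"):
    """Return True if every account in the response reports ENABLED for the resource."""
    accounts = status_response.get("accounts", [])
    if not accounts:
        return False
    return all(
        account.get("resourceState", {}).get(resource, {}).get("status") == "ENABLED"
        for account in accounts
    )


# Invented sample shaped like the batch-get-account-status output.
sample_status = {
    "accounts": [
        {
            "accountId": "111122223333",
            "resourceState": {
                "ec2": {"status": "DISABLED"},
                "ecr": {"status": "DISABLED"},
                "lambda": {"status": "ENABLED"},
            },
        }
    ]
}

print(scanning_enabled(sample_status))
print(scanning_enabled(sample_status, resource="ec2"))
```

You could feed this function the parsed output of the CLI command, for example by piping the command output into a script that calls `json.load` on standard input.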

Step 2: Create an SNS topic and subscription for notification

Next, create the SNS topic and the subscription so that you will be notified of each new Amazon Inspector finding.

To create the SNS topic and subscription

Use the following command in CloudShell to create the SNS topic and its subscription. Replace <REGION_NAME>, <AWS_ACCOUNTID>, and <EMAIL_ADDRESS> with the relevant values.

aws sns create-topic --name amazon-inspector-findings-notifier

aws sns subscribe \
--topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier \
--protocol email --notification-endpoint <EMAIL_ADDRESS>

Check the inbox of the email address you entered for <EMAIL_ADDRESS>, and in the email from Amazon SNS, choose Confirm subscription.
In the CloudShell console, use the following command to list the subscriptions, to verify the topic and email subscription.
```
aws sns list-subscriptions
```
You should see a response that shows subscription details like the email address and ARN, as shown in Figure 3.

Figure 3: Subscribed email address and SNS topic

Use the following command to send a test message to your subscribed email address, and verify that you receive the message. Replace <REGION_NAME> and <AWS_ACCOUNTID> with the relevant values.

aws sns publish \
    --topic-arn "arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier" \
    --message "Hello from Amazon Inspector2"

Step 3: Set up Amazon EventBridge with a custom rule and the SNS topic as target

Create an EventBridge rule that will invoke your previously created SNS topic whenever Amazon Inspector finds a new vulnerability with a critical severity.

To set up the EventBridge custom rule

In the CloudShell console, use the following command to create an EventBridge rule named amazon-inspector-findings with filters InspectorScore greater than 8 and severity state set to CRITICAL.
```
aws events put-rule \
    --name "amazon-inspector-findings" \
    --event-pattern "{\"source\": [\"aws.inspector2\"],\"detail-type\": [\"Inspector2 Finding\"],\"detail\": {\"inspectorScore\": [ { \"numeric\": [ \">\", 8] } ],\"severity\": [\"CRITICAL\"]}}"
```
Refer to the topic Amazon EventBridge event schema for Amazon Inspector events to customize the event pattern for your application needs.
To verify the rule creation, go to the EventBridge console and in the left navigation bar, choose Rules.
Choose the rule with the name amazon-inspector-findings. You should see the event pattern as shown in Figure 4.

Figure 4: Event pattern for the EventBridge rule to filter on CRITICAL vulnerabilities.
Add the SNS topic you previously created as the target to the EventBridge rule. Replace <REGION_NAME>, <AWS_ACCOUNTID>, and <RANDOM-UNIQUE-IDENTIFIER-VALUE> with the relevant values. For RANDOM-UNIQUE-IDENTIFIER-VALUE, create a memorable and unique string.
```
aws events put-targets \
    --rule amazon-inspector-findings \
    --targets "Id"="<RANDOM-UNIQUE-IDENTIFIER-VALUE>","Arn"="arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier"
```
Important: Save the target ID. You will need this in order to delete the target in the last step.

Use the following command to grant Amazon EventBridge permission to publish to the SNS topic amazon-inspector-findings-notifier. Replace <REGION_NAME> and <AWS_ACCOUNTID> with the relevant values.

aws sns set-topic-attributes --topic-arn "arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier" \
--attribute-name Policy \
--attribute-value "{\"Version\":\"2012-10-17\",\"Id\":\"__default_policy_ID\",\"Statement\":[{\"Sid\":\"PublishEventsToMyTopic\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"events.amazonaws.com\"},\"Action\":\"sns:Publish\",\"Resource\":\"arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier\"}]}"
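The escaped JSON strings in the commands above are easy to get wrong by hand. One option is to build the event pattern and the topic policy as Python dictionaries and let `json.dumps` produce the strings. The sketch below reproduces the same pattern and policy shown in this step; the function names and the sample Region and account ID are illustrative assumptions.

```python
import json


def sns_topic_arn(region, account_id, topic_name="amazon-inspector-findings-notifier"):
    """Build the ARN of the SNS topic created earlier in this post."""
    return f"arn:aws:sns:{region}:{account_id}:{topic_name}"


def event_pattern(min_score=8, severity="CRITICAL"):
    """Event pattern for Inspector2 findings above a score and with a given severity."""
    return {
        "source": ["aws.inspector2"],
        "detail-type": ["Inspector2 Finding"],
        "detail": {
            "inspectorScore": [{"numeric": [">", min_score]}],
            "severity": [severity],
        },
    }


def topic_policy(topic_arn):
    """Resource policy that lets EventBridge publish to the topic."""
    return {
        "Version": "2012-10-17",
        "Id": "__default_policy_ID",
        "Statement": [
            {
                "Sid": "PublishEventsToMyTopic",
                "Effect": "Allow",
                "Principal": {"Service": "events.amazonaws.com"},
                "Action": "sns:Publish",
                "Resource": topic_arn,
            }
        ],
    }


# Placeholder Region and account ID; substitute your own values.
arn = sns_topic_arn("us-east-1", "111122223333")
print(json.dumps(event_pattern()))
print(json.dumps(topic_policy(arn)))
```

The printed strings can be passed to the `--event-pattern` and `--attribute-value` options, wrapped in single quotes in the shell so that no backslash escaping is needed.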

Step 4: Deploy the Lambda function to the AWS account by using AWS SAM

In this step, you will use AWS Serverless Application Model (AWS SAM) quick start templates to build and deploy a Lambda function with a vulnerable library, in order to generate findings. Learn more about AWS SAM.

To deploy the Lambda function with a vulnerable library

In the CloudShell console, use a prebuilt “hello-world” AWS SAM template to deploy the Lambda function.
```
sam init --runtime python3.7 --dependency-manager pip --app-template hello-world --name sam-app
```
Use the following command to add the vulnerable package python-jwt==3.3.3 to the Lambda function.
```
cd sam-app;
echo -e 'requests\npython-jwt==3.3.3' > hello_world/requirements.txt
```
Use the following command to build the application.
```
sam build
```
Use the following command to deploy the application with the guided option.
```
sam deploy --guided
```
This command packages and deploys the application to your AWS account. It provides a series of prompts. Respond to the prompts as follows:
1. Enter the stack name you want.
2. Accept the default options, except for the following prompts:
  1. HelloWorldFunction may not have authorization defined, Is this okay? [y/N]: Input y and press Enter.
  2. Deploy this changeset? [y/N]: Input y and press Enter.

Step 5: View Amazon Inspector findings

Amazon Inspector will automatically generate findings when scanning the Lambda function previously deployed. To view those findings, follow the steps below.

To view Amazon Inspector findings for the vulnerability

Navigate to the Amazon Inspector console.
In the left navigation menu, choose All findings to see all of the Active findings, as shown in Figure 5.
Due to the custom event pattern rule in Amazon EventBridge, even though there are multiple findings for the vulnerable package python-jwt==3.3.3, you will be notified only for the finding that has InspectorScore greater than 8 and severity CRITICAL.
Choose the title of each finding to see detailed information about the vulnerability.

Figure 5: Example of findings from the Amazon Inspector console
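To see why only some findings trigger a notification, it can help to simulate how the event pattern is evaluated. The following sketch implements a small subset of EventBridge matching (exact-value lists and `numeric` comparisons) and is not the EventBridge implementation. The sample findings and their scores are invented for illustration.

```python
import operator

# The pattern used for the amazon-inspector-findings rule in this post.
PATTERN = {
    "source": ["aws.inspector2"],
    "detail-type": ["Inspector2 Finding"],
    "detail": {
        "inspectorScore": [{"numeric": [">", 8]}],
        "severity": ["CRITICAL"],
    },
}

_NUMERIC_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def _value_matches(candidates, value):
    """True if value satisfies any candidate (a literal or a numeric filter)."""
    for candidate in candidates:
        if isinstance(candidate, dict) and "numeric" in candidate:
            spec = candidate["numeric"]  # for example [">", 8] or [">", 0, "<=", 5]
            if isinstance(value, (int, float)) and all(
                _NUMERIC_OPS[spec[i]](value, spec[i + 1])
                for i in range(0, len(spec), 2)
            ):
                return True
        elif candidate == value:
            return True
    return False


def matches(pattern, event):
    """True if every key in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not _value_matches(expected, event[key]):
            return False
    return True


def finding_event(score, severity):
    """Build a minimal, invented Inspector2 finding event for testing."""
    return {
        "source": "aws.inspector2",
        "detail-type": "Inspector2 Finding",
        "detail": {"inspectorScore": score, "severity": severity},
    }


print(matches(PATTERN, finding_event(9.1, "CRITICAL")))
print(matches(PATTERN, finding_event(7.5, "HIGH")))
```

Only the first sample event satisfies both the score filter and the severity filter, so only that kind of finding would reach the SNS topic.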

Step 6: Remediate the vulnerability by applying the fixed package version

Now you can remediate the vulnerability by updating the package version as suggested by Amazon Inspector.

To remediate the vulnerability

In the Amazon Inspector console, in the left navigation menu, choose All Findings.
Choose the title of the vulnerability to see the finding details and the remediation recommendations.

Figure 6: Amazon Inspector finding for python-jwt, with the associated remediation
To remediate, use the following command to update the package version to the fixed version as suggested by Amazon Inspector.
```
cd /home/cloudshell-user/sam-app;
echo -e "requests\npython-jwt==3.3.4" > hello_world/requirements.txt
```
Use the following command to build the application.
```
sam build
```
Use the following command to deploy the application with the guided option.
```
sam deploy --guided
```
This command packages and deploys the application to your AWS account. It provides a series of prompts. Respond to the prompts as follows:
1. Enter the stack name you want.
2. Accept the default options, except for the following prompts:
  1. HelloWorldFunction may not have authorization defined, Is this okay? [y/N]: Input y and press Enter.
  2. Deploy this changeset? [y/N]: Input y and press Enter.
Amazon Inspector automatically rescans the function after its deployment and reevaluates the findings. At this point, you can navigate back to the Amazon Inspector console, and in the left navigation menu, choose All findings. In the Findings area, you can see that the vulnerabilities are moved from Active to Closed status.
Due to the custom event pattern rule in Amazon EventBridge, you will be notified by email with finding status as CLOSED.

Figure 7: Inspector rescan results, showing no open findings after remediation
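Before redeploying, you might want a quick local check that requirements.txt pins the package at or above the fixed version. The sketch below is a simplified illustration: it only understands `==` pins with purely numeric dotted versions, and the helper names are hypothetical. It does not replace the Amazon Inspector rescan.

```python
def parse_pins(requirements_text):
    """Return {package: version} for lines of the form name==version."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def version_tuple(version):
    """Convert '3.3.4' to (3, 3, 4); assumes numeric dotted versions only."""
    return tuple(int(part) for part in version.split("."))


def is_remediated(requirements_text, package, fixed_version):
    """True or False if the package is pinned, None if it is not pinned at all."""
    pinned = parse_pins(requirements_text).get(package.lower())
    if pinned is None:
        return None
    return version_tuple(pinned) >= version_tuple(fixed_version)


before = "requests\npython-jwt==3.3.3"
after = "requests\npython-jwt==3.3.4"

print(is_remediated(before, "python-jwt", "3.3.4"))
print(is_remediated(after, "python-jwt", "3.3.4"))
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "3.10.0" would sort before "3.9.0".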

(Optional) Step 7: Activate Amazon Inspector in multiple accounts and Regions

To benefit from Amazon Inspector scanning capabilities across the accounts that you have in AWS Organizations and in your selected Regions, use the following steps:

To activate Amazon Inspector in multiple accounts and Regions

In the CloudShell console, use the following command to clone the code from the aws-samples inspector2-enablement-with-cli GitHub repo.

cd /home/cloudshell-user;
git clone https://github.com/aws-samples/inspector2-enablement-with-cli.git;
cd inspector2-enablement-with-cli

Follow the instructions from the README.md file.
Configure the file param_inspector2.json with the relevant values, as follows:
- inspector2_da: The delegated administrator account ID for Amazon Inspector to manage member accounts.
- scanning_type: The resource types (EC2, ECR, LAMBDA) to be enabled by Amazon Inspector.
- auto_enable: The resource types to be enabled on every account that is newly attached to the delegated administrator.
- regions: Because Amazon Inspector is a regional service, provide the list of AWS Regions to enable.
Select the AWS account that would be used as the delegated administrator account (<DA_ACCOUNT_ID>).
Delegate an account as the admin for Amazon Inspector by using the following command.
```
./inspector2_enablement_with_awscli.sh -a delegate_admin -da <DA_ACCOUNT_ID>
```

Activate the delegated admin by using the following command:

./inspector2_enablement_with_awscli.sh -a activate -t <DA_ACCOUNT_ID> -s all

Associate the member accounts by using the following command:

./inspector2_enablement_with_awscli.sh -a associate -t members

Wait five minutes.
Enable the resource types (EC2, ECR, LAMBDA) on your member accounts by using the following command:
```
./inspector2_enablement_with_awscli.sh -a activate -t members
```
Enable Amazon Inspector on the new member accounts that are associated with the organization by using the following command:
```
./inspector2_enablement_with_awscli.sh -a auto_enable
```
Check the Amazon Inspector status in your accounts and in multiple selected Regions by using the following command:
```
./inspector2_enablement_with_awscli.sh -a get_status
```

There are other options you can use to enable Amazon Inspector in multiple accounts, like AWS Control Tower and Terraform. For the reference architecture for Control Tower, see the AWS Security Reference Architecture Examples on GitHub. For more information on the Terraform option, see the Terraform aws_inspector2_enabler resource page.

Step 8: Delete the resources created in the previous steps

AWS offers a 15-day free trial for Amazon Inspector so that you can evaluate the service and estimate its cost.

To avoid potential charges, delete the AWS resources that you created in the previous steps of this solution (Lambda function, EventBridge target, EventBridge rule, and SNS topic), and deactivate Amazon Inspector.

To delete resources

In the CloudShell console, enter the sam-app folder.
```
cd /home/cloudshell-user/sam-app
```
Delete the Lambda function and confirm by typing “y” when prompted for confirmation.
```
sam delete
```
Remove the SNS target from the Amazon EventBridge rule.
```
aws events remove-targets --rule "amazon-inspector-findings" --ids <RANDOM-UNIQUE-IDENTIFIER-VALUE>
```
Note: If you don’t remember the target ID, navigate to the Amazon EventBridge console, and in the left navigation menu, choose Rules. Select the rule that you want to delete. Choose CloudFormation, and copy the ID.

Delete the EventBridge rule.

aws events delete-rule --name amazon-inspector-findings

Delete the SNS topic.

aws sns delete-topic --topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier

Disable Amazon Inspector.
```
aws inspector2 disable --resource-types '["LAMBDA"]'
```
Follow the next few steps to roll back changes only if you performed the activities listed in Step 7: Activate Amazon Inspector in multiple accounts and Regions.
In the CloudShell console, enter the folder inspector2-enablement-with-cli.
```
cd /home/cloudshell-user/inspector2-enablement-with-cli
```
Deactivate the resource types (EC2, ECR, LAMBDA) on your member accounts.
```
./inspector2_enablement_with_awscli.sh -a deactivate -t members -s all
```

Disassociate the member accounts.

./inspector2_enablement_with_awscli.sh -a disassociate -t members

Deactivate the delegated admin account.

./inspector2_enablement_with_awscli.sh -a deactivate -t <DA_ACCOUNT_ID> -s all

Remove the delegated account as the admin for Amazon Inspector.

./inspector2_enablement_with_awscli.sh -a remove_admin -da <DA_ACCOUNT_ID>

Conclusion

In this blog post, we discussed how you can use Amazon Inspector to continuously scan your Lambda functions, and how to configure an Amazon EventBridge rule and SNS to send out notification of Lambda function vulnerabilities in near real time. You can then perform remediation activities by using AWS Lambda or AWS Systems Manager. We also showed how to enable Amazon Inspector at scale, activating in both single and multiple accounts, in default and multiple Regions.

As of the writing of this post, a new feature to perform code scans for Lambda functions is available. Amazon Inspector can now also scan the custom application code within a Lambda function for code security vulnerabilities such as injection flaws, data leaks, weak cryptography, or missing encryption, based on AWS security best practices. You can use this additional scanning functionality to further protect your workloads.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Amazon Inspector forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

How to Receive Alerts When Your IAM Configuration Changes

2023-07-31 Dylan Souvage

Post Syndicated from Dylan Souvage original https://aws.amazon.com/blogs/security/how-to-receive-alerts-when-your-iam-configuration-changes/

July 27, 2023: This post was originally published February 5, 2015, and received a major update July 31, 2023.

As an Amazon Web Services (AWS) administrator, it’s crucial for you to implement robust protective controls to maintain your security configuration. Employing a detective control mechanism to monitor changes to the configuration serves as an additional safeguard in case the primary protective controls fail. Although some changes are expected, you might want to review unexpected changes or changes made by a privileged user. AWS Identity and Access Management (IAM) is a service that primarily helps manage access to AWS services and resources securely. It does provide detailed logs of its activity, but it doesn’t inherently provide real-time alerts or notifications. Fortunately, you can use a combination of AWS CloudTrail, Amazon EventBridge, and Amazon Simple Notification Service (Amazon SNS) to alert you when changes are made to your IAM configuration. In this blog post, we walk you through how to set up EventBridge to initiate SNS notifications for IAM configuration changes. You can also have SNS push messages directly to ticketing or tracking services, such as Jira, ServiceNow, or your preferred method of receiving notifications, but that is not discussed here.

In any AWS environment, many activities can take place at every moment. CloudTrail records IAM activities, EventBridge filters and routes event data, and Amazon SNS provides notification functionality. This post will guide you through identifying and setting alerts for IAM changes, modifications in authentication and authorization configurations, and more. The power is in your hands to make sure you’re notified of the events you deem most critical to your environment. Here’s a quick overview of how you can invoke a response, shown in Figure 1.

Figure 1: Simple architecture diagram of actors and resources in your account and the process for sending notifications through IAM, CloudTrail, EventBridge, and SNS.

Log IAM changes with CloudTrail

Before we dive into implementation, let’s briefly understand the function of AWS CloudTrail. It records and logs activity within your AWS environment, tracking actions such as IAM role creation, deletion, or modification, thereby offering an audit trail of changes.

With this in mind, we’ll discuss the first step in tracking IAM changes: establishing a log for each modification. In this section, we’ll guide you through using CloudTrail to create these pivotal logs.

For an in-depth understanding of CloudTrail, refer to the AWS CloudTrail User Guide.

In this post, you’re going to start by creating a CloudTrail trail with the Management events type selected, and read and write API activity selected. If you already have a CloudTrail trail set up with those attributes, you can use that CloudTrail trail instead.

To create a CloudTrail log

Open the AWS Management Console and select CloudTrail, and then choose Dashboard.
In the CloudTrail dashboard, choose Create Trail.

Figure 2: Use the CloudTrail dashboard to create a trail
In the Trail name field, enter a display name for your trail and then select Create a new S3 bucket. Leave the default settings for the remaining trail attributes.

Figure 3: Set the trail name and storage location
Under Event type, select Management events. Under API activity, select Read and Write.
Choose Next.

Figure 4: Choose which events to log

Set up notifications with Amazon SNS

Amazon SNS is a managed service that provides message delivery from publishers to subscribers. Publishers communicate asynchronously with subscribers by sending messages to a topic, which is a logical access point and communication channel. Subscribers can receive these messages using supported endpoint types, including email, which you will use in this post’s example.
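To make the publisher, topic, and subscriber roles concrete, here is a purely illustrative in-memory model in Python. It is not the Amazon SNS API: the Topic class, the topic name, and the list standing in for an email inbox are all hypothetical names chosen for this sketch.

```python
class Topic:
    """Toy stand-in for a topic: a named channel with subscribers."""

    def __init__(self, name):
        self.name = name
        self.subscribers = []  # callables that receive each published message

    def subscribe(self, endpoint):
        # Register an endpoint (here, any callable) to receive messages
        self.subscribers.append(endpoint)

    def publish(self, message):
        # Fan the message out to every subscriber and return the delivery count
        for endpoint in self.subscribers:
            endpoint(message)
        return len(self.subscribers)


inbox = []  # hypothetical email endpoint, modeled as a plain list
topic = Topic("iam-change-alerts")
topic.subscribe(inbox.append)
delivered = topic.publish("IAM configuration changed")
print(delivered, inbox)
```

The publisher only knows about the topic, not about who is subscribed, which is why you can add or remove email recipients later without touching the EventBridge rule that publishes the alerts.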

For further reading on Amazon SNS, refer to the Amazon SNS Developer Guide.

Now that you’ve set up CloudTrail to log IAM changes, the next step is to establish a mechanism to notify you about these changes in real time.

To set up notifications

Open the Amazon SNS console and choose Topics.
Create a new topic. Under Type, select Standard and enter a name for your topic. Keep the defaults for the rest of the options, and then choose Create topic.

Figure 5: Select Standard as the topic type
Navigate to your topic in the topic dashboard, choose the Subscriptions tab, and then choose Create subscription.

Figure 6: Choose Create subscription
For Topic ARN, select the topic you created previously, then under Protocol, select Email and enter the email address you want the alerts to be sent to.

Figure 7: Select the topic ARN and add an endpoint to send notifications to
After your subscription is created, go to the mailbox you designated to receive notifications and check for a verification email from the service. Open the email and select Confirm subscription to verify the email address and complete setup.

Initiate events with EventBridge

Amazon EventBridge is a serverless service that uses events to connect application components. EventBridge receives an event (an indicator of a change in environment) and applies a rule to route the event to a target. Rules match events to targets based on either the structure of the event, called an event pattern, or on a schedule.

Events that come to EventBridge are associated with an event bus. Rules are tied to a single event bus, so they can only be applied to events on that event bus. Your account has a default event bus that receives events from AWS services, and you can create custom event buses to send or receive events from a different account or AWS Region.

For a more comprehensive understanding of EventBridge, refer to the Amazon EventBridge User Guide.

In this part of our post, you’ll use EventBridge to devise a rule for initiating SNS notifications based on IAM configuration changes.

To create an EventBridge rule

Go to the EventBridge console and select EventBridge Rule, and then choose Create rule.

Figure 8: Use the EventBridge console to create a rule
Enter a name for your rule, keep the defaults for the rest of rule details, and then choose Next.

Figure 9: Rule detail screen
Under Target 1, select AWS service.
In the dropdown list for Select a target, select SNS topic, select the topic you created previously, and then choose Next.

Figure 10: Target with target type of AWS service and target topic of SNS topic selected
Under Event source, select AWS events or EventBridge partner events.

Figure 11: Event pattern with AWS events or EventBridge partner events selected
Under Event pattern, verify that you have the following selected.
1. For Event source, select AWS services.
2. For AWS service, select IAM.
3. For Event type, select AWS API Call via CloudTrail.
4. Select the radio button for Any operation.
Figure 12: Event pattern details selected

Now that you’ve set up EventBridge to monitor IAM changes, test it by creating a new user or adding a new policy to an IAM role and see if you receive an email notification.

Centralize EventBridge alerts by using cross-account alerts

If you have multiple accounts, consider using AWS Organizations. (For a deep dive into best practices for using AWS Organizations, we recommend reading this AWS blog post.)

To channel alerts from across your accounts to a primary AWS notification account in a standardized way, you can use a multi-account EventBridge architecture. This allows aggregation of notifications across your accounts through sender and receiver accounts. Figure 13 shows how this works. Separate member accounts within an AWS organizational unit (OU) have the same mechanism for monitoring changes and sending notifications as discussed earlier, but send notifications through an EventBridge instance in another account.

Figure 13: Multi-account EventBridge architecture aggregating notifications between two AWS member accounts to a primary management account

You can read more and see the implementation and deep dive of the multi-account EventBridge solution on the AWS samples GitHub, and you can also read more about sending and receiving Amazon EventBridge notifications between accounts.

Monitor calls to IAM

In this blog post example, you monitor calls to IAM.

The filter pattern you selected while setting up EventBridge matches CloudTrail events for calls to the IAM service. Calls to IAM have a CloudTrail eventSource of iam.amazonaws.com, so IAM API calls will match this pattern. You will find this simple default filter pattern useful if you have minimal IAM activity in your account or to test this example. However, as your account activity grows, you’ll likely receive more notifications than you need. This is when filtering only the relevant events becomes essential to prioritize your responses. Effectively managing your filter preferences allows you to focus on events of significance and maintain control as your AWS environment grows.
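For reference, the console selections described earlier typically produce an event pattern like the one in the following sketch. The matches function is a deliberately simplified, hypothetical matcher written for this illustration (exact string matching only), and the sample event is a heavily trimmed, made-up example rather than a real CloudTrail record.

```python
import json

# Pattern corresponding to: AWS service = IAM, event type = AWS API Call via CloudTrail, any operation
event_pattern = {
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {"eventSource": ["iam.amazonaws.com"]},
}


def matches(pattern, event):
    """Simplified matcher: every pattern key must be present in the event,
    and the event value must equal one of the values listed in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested part of the event
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical, trimmed event for an IAM CreateUser call
sample_event = {
    "source": "aws.iam",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventSource": "iam.amazonaws.com", "eventName": "CreateUser"},
}

print(json.dumps(event_pattern, indent=2))
print(matches(event_pattern, sample_event))
```

Because the pattern places no constraint on eventName, every IAM API call recorded by CloudTrail matches it, including read-only calls.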

Monitor changes to IAM

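If you’re interested only in changes to your IAM configuration, you can modify the event pattern inside EventBridge, the one you used to set up IAM notifications, by adding an eventName filter to the detail section of the pattern, as shown in the following example. Because EventBridge matches plain strings exactly, the filter uses prefix matching.

"eventName": [
      { "prefix": "Add" },
      { "prefix": "Attach" },
      { "prefix": "Change" },
      { "prefix": "Create" },
      { "prefix": "Deactivate" },
      { "prefix": "Delete" },
      { "prefix": "Detach" },
      { "prefix": "Enable" },
      { "prefix": "Put" },
      { "prefix": "Remove" },
      { "prefix": "Set" },
      { "prefix": "Update" },
      { "prefix": "Upload" }
    ]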

This filter pattern will only match events from the IAM service whose names begin with Add, Attach, Change, Create, Deactivate, Delete, Detach, Enable, Put, Remove, Set, Update, or Upload. For more information about APIs matching these patterns, see the IAM API Reference.
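As a quick sanity check of which API calls this filter is meant to catch, the following minimal Python sketch applies the same prefixes to a few IAM API names. The helper name is_iam_change is hypothetical, and the check runs locally; it does not call EventBridge.

```python
# Prefixes taken from the eventName filter pattern
CHANGE_PREFIXES = (
    "Add", "Attach", "Change", "Create", "Deactivate", "Delete", "Detach",
    "Enable", "Put", "Remove", "Set", "Update", "Upload",
)


def is_iam_change(event_name):
    """Return True if the API name starts with one of the change prefixes."""
    return event_name.startswith(CHANGE_PREFIXES)


for name in ["CreateUser", "AttachRolePolicy", "ListUsers", "GetRole"]:
    print(name, is_iam_change(name))
```

Read-only calls such as ListUsers and GetRole fall outside the filter, which is what cuts down the notification volume.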

To edit the filter pattern to monitor only changes to IAM

Open the EventBridge console, navigate to the Event pattern, and choose Edit pattern.

Figure 14: Modifying the event pattern
Add the eventName filter pattern from above to your event pattern.

Figure 15: Use the JSON editor to add the eventName filter pattern

Monitor changes to authentication and authorization configuration

Monitoring changes to authentication (security credentials) and authorization (policy) configurations is critical, because it can alert you to potential security vulnerabilities or breaches. For instance, unauthorized changes to security credentials or policies could indicate malicious activity, such as an attempt to gain unauthorized access to your AWS resources. If you’re only interested in these types of changes, use the preceding steps to implement the following filter pattern.

    "eventName": [
      { "wildcard": "Put*Policy" },
      { "prefix": "Attach" },
      { "prefix": "Detach" },
      { "prefix": "Create" },
      { "prefix": "Update" },
      { "prefix": "Upload" },
      { "prefix": "Delete" },
      { "prefix": "Remove" },
      { "prefix": "Set" }
    ]

This filter pattern matches calls to IAM that put, attach, or detach policies, or that create, update, upload, delete, remove, or set IAM elements.

Conclusion

Monitoring IAM security configuration changes gives you another layer of defense against the unexpected. Balancing productivity and security, you might grant a user broad permissions in order to facilitate their work, such as exploring new AWS services. Although preventive measures are crucial, they can potentially restrict necessary actions. For example, a developer may need to modify an IAM role for their task, an alteration that could pose a security risk. This change, while essential for their work, may be undesirable from a security standpoint. Thus, it’s critical to have monitoring systems alongside preventive measures, allowing necessary actions while maintaining security.

Create an event rule for IAM events that are important to you and have a response plan ready. You can refer to Security best practices in IAM for further reading on this topic.

If you have questions or feedback about this or any other IAM topic, please visit the IAM re:Post forum. You can also read about the multi-account EventBridge solution on the AWS samples GitHub and learn more about sending and receiving Amazon EventBridge notifications between accounts.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

A side-by-side comparison of Apache Spark and Apache Flink for common streaming use cases

2023-07-28 Deepthi Mohan

Post Syndicated from Deepthi Mohan original https://aws.amazon.com/blogs/big-data/a-side-by-side-comparison-of-apache-spark-and-apache-flink-for-common-streaming-use-cases/

Apache Flink and Apache Spark are both open-source, distributed data processing frameworks used widely for big data processing and analytics. Spark is known for its ease of use, high-level APIs, and the ability to process large amounts of data. Flink shines in its ability to handle processing of data streams in real-time and low-latency stateful computations. Both support a variety of programming languages, scalable solutions for handling large amounts of data, and a wide range of connectors. Historically, Spark started out as a batch-first framework and Flink began as a streaming-first framework.

In this post, we share a comparative study of streaming patterns that are commonly used to build stream processing applications, how they can be solved using Spark (primarily Spark Structured Streaming) and Flink, and the minor variations in their approach. Examples cover code snippets in Python and SQL for both frameworks across three major themes: data preparation, data processing, and data enrichment. If you are a Spark user looking to solve your stream processing use cases using Flink, this post is for you. We do not intend to cover the choice of technology between Spark and Flink because it’s important to evaluate both frameworks for your specific workload and how the choice fits in your architecture; rather, this post highlights key differences for use cases that both these technologies are commonly considered for.

Apache Flink offers layered APIs that offer different levels of expressiveness and control and are designed to target different types of use cases. The three layers of API are Process Functions (also known as the Stateful Stream Processing API), DataStream, and Table and SQL. The Stateful Stream Processing API requires writing verbose code but offers the most control over time and state, which are core concepts in stateful stream processing. The DataStream API supports Java, Scala, and Python and offers primitives for many common stream processing operations, as well as a balance between code verbosity or expressiveness and control. The Table and SQL APIs are relational APIs that offer support for Java, Scala, Python, and SQL. They offer the highest abstraction and intuitive, SQL-like declarative control over data streams. Flink also allows seamless transition and switching across these APIs. To learn more about Flink’s layered APIs, refer to layered APIs.

Apache Spark Structured Streaming offers the Dataset and DataFrames APIs, which provide high-level declarative streaming APIs to represent static, bounded data as well as streaming, unbounded data. Operations are supported in Scala, Java, Python, and R. Spark has a rich function set and syntax with simple constructs for selection, aggregation, windowing, joins, and more. You can also use the Streaming Table API to read tables as streaming DataFrames as an extension to the DataFrames API. Although it’s hard to draw direct parallels between Flink and Spark across all stream processing constructs, at a very high level, we could say Spark Structured Streaming APIs are equivalent to Flink’s Table and SQL APIs. Spark Structured Streaming, however, does not yet (at the time of this writing) offer an equivalent to the lower-level APIs in Flink that offer granular control of time and state.

Both Flink and Spark Structured Streaming (referenced as Spark henceforth) are evolving projects. The following table provides a simple comparison of Flink and Spark capabilities for common streaming primitives (as of this writing).

Capability	Flink	Spark
Row-based processing	Yes	Yes
User-defined functions	Yes	Yes
Fine-grained access to state	Yes, via DataStream and low-level APIs	No
Control when state eviction occurs	Yes, via DataStream and low-level APIs	No
Flexible data structures for state storage and querying	Yes, via DataStream and low-level APIs	No
Timers for processing and stateful operations	Yes, via low-level APIs	No

In the following sections, we cover the greatest common factors so that we can showcase how Spark users can relate to Flink and vice versa. To learn more about Flink’s low-level APIs, refer to Process Function. For the sake of simplicity, we cover the four use cases in this post using the Flink Table API. We use a combination of Python and SQL for an apples-to-apples comparison with Spark.

Data preparation

In this section, we compare data preparation methods for Spark and Flink.

Reading data

We first look at the simplest ways to read data from a data stream. The following sections assume the following schema for messages:

symbol: string,
price: int,
timestamp: timestamp,
company_info:
{
    name: string,
    employees_count: int
}

Reading data from a source in Spark Structured Streaming

In Spark Structured Streaming, we use a streaming DataFrame in Python that directly reads the data in JSON format:

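from pyspark.sql.functions import col, from_json

spark = ...  # spark session

# specify schema
stock_ticker_schema = ...

# Create a streaming DataFrame
# The Kafka source takes the topic name in the "subscribe" option and
# delivers the message value as bytes, so cast it to a string before parsing
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "mybroker1:port") \
    .option("subscribe", "stock_ticker") \
    .load() \
    .select(from_json(col("value").cast("string"), stock_ticker_schema).alias("ticker_data")) \
    .select(col("ticker_data.*"))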

Note that we have to supply a schema object that captures our stock ticker schema (stock_ticker_schema). Compare this to the approach for Flink in the next section.

Reading data from a source using Flink Table API

For Flink, we use the SQL DDL statement CREATE TABLE. You can specify the schema of the stream just like you would any SQL table. The WITH clause allows us to specify the connector to the data stream (Kafka in this case), the associated properties for the connector, and data format specifications. See the following code:

-- Create table using DDL
-- timestamp is a reserved keyword in Flink SQL, so the column name is quoted with backticks

CREATE TABLE stock_ticker (
  symbol STRING,
  price INT,
  `timestamp` TIMESTAMP(3),
  company_info STRING,
  WATERMARK FOR `timestamp` AS `timestamp` - INTERVAL '3' MINUTE
) WITH (
 'connector' = 'kafka',
 'topic' = 'stock_ticker',
 'properties.bootstrap.servers' = 'mybroker1:port',
 'properties.group.id' = 'testGroup',
 'format' = 'json',
 'json.fail-on-missing-field' = 'false',
 'json.ignore-parse-errors' = 'true'
)

JSON flattening

JSON flattening is the process of converting a nested or hierarchical JSON object into a flat, single-level structure. This converts multiple levels of nesting into an object where all the keys and values are at the same level. Keys are combined using a delimiter such as a period (.) or underscore (_) to denote the original hierarchy. JSON flattening is useful when you need to work with a more simplified format. In both Spark and Flink, nested JSONs can be complicated to work with and may need additional processing or user-defined functions to manipulate. Flattened JSONs can simplify processing and improve performance due to reduced computational overhead, especially with operations like complex joins, aggregations, and windowing. In addition, flattened JSONs can help in easier debugging and troubleshooting data processing pipelines because there are fewer levels of nesting to navigate.
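Independent of either framework, the flattening idea can be sketched in a few lines of plain Python. The flatten helper and the sample values below are hypothetical and only illustrate how nested keys are joined with a delimiter.

```python
def flatten(obj, parent_key="", sep="_"):
    """Flatten a nested dict, joining nested keys with `sep`."""
    flat = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key along
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Made-up message that follows the schema used in this post (timestamp omitted)
message = {
    "symbol": "AMZN",
    "price": 100,
    "company_info": {"name": "Amazon", "employees_count": 10},
}

flat_message = flatten(message)
print(flat_message)
```

A generic helper like this derives the flattened names mechanically (company_info_name), whereas the Spark and Flink queries that follow pick one nested field and give it an explicit name (company_name).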

JSON flattening in Spark Structured Streaming

JSON flattening in Spark Structured Streaming requires you to use the select method and specify the nested field name that you’d like surfaced to the top-level list of fields. In the following example, company_info is a nested field, and within company_info, there’s a field called name. With the following query, we’re flattening company_info.name to company_name:

from pyspark.sql.functions import col

stock_ticker_df = ...  # Streaming DataFrame w/ schema shown above

stock_ticker_df.select("symbol", "timestamp", "price", col("company_info.name").alias("company_name"))

JSON flattening in Flink

In Flink SQL, you can use the JSON_VALUE function. Note that this function is available only in Flink 1.14 and later. See the following code:

SELECT
   symbol,
   `timestamp`,
   price,
   JSON_VALUE(company_info, 'lax $.name' DEFAULT NULL ON EMPTY) AS company_name
FROM
   stock_ticker

The term lax in the preceding query has to do with JSON path expression handling in Flink SQL. For more information, refer to System (Built-in) Functions.

Data processing

Now that you have read the data, we can look at a few common data processing patterns.

Deduplication

Data deduplication in stream processing is crucial for maintaining data quality and ensuring consistency. It enhances efficiency by reducing the strain on the processing from duplicate data and helps with cost savings on storage and processing.
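Conceptually, keeping the first record per key looks like the following plain-Python sketch. The deduplicate function and the sample records are hypothetical, and the sketch ignores the distributed state management that Spark and Flink handle for you.

```python
def deduplicate(records, key):
    """Yield only the first record seen for each distinct value of `key`."""
    seen = set()  # grows with the number of distinct keys
    for record in records:
        value = record[key]
        if value in seen:
            continue  # duplicate key: skip the record
        seen.add(value)
        yield record


# Made-up records following the schema used in this post
stream = [
    {"symbol": "AMZN", "price": 100},
    {"symbol": "MSFT", "price": 200},
    {"symbol": "AMZN", "price": 101},
]

unique_records = list(deduplicate(stream, "symbol"))
print(unique_records)
```

The set of seen keys is state that lives as long as the job runs, so on an unbounded stream you need some way to bound it.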

Spark Streaming deduplication query

The following code snippet is related to a Spark Streaming DataFrame named stock_ticker. The code performs an operation to drop duplicate rows based on the symbol column. The dropDuplicates method is used to eliminate duplicate rows in a DataFrame based on one or more columns.

stock_ticker = ...  # Streaming DataFrame w/ schema shown above

# dropDuplicates expects a list of column names
stock_ticker.dropDuplicates(["symbol"])

Flink deduplication query

The following code shows the Flink SQL equivalent to deduplicate data based on the symbol column. The query retrieves the first row for each distinct value in the symbol column from the stock_ticker stream, based on the ascending order of proctime. It assumes that the table defines a processing-time attribute named proctime (for example, by adding proctime AS PROCTIME() to the DDL):

SELECT symbol, `timestamp`, price
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY symbol ORDER BY proctime ASC) AS row_num
  FROM stock_ticker)
WHERE row_num = 1

Windowing

Windowing is a fundamental construct in stream processing: it divides a continuous, unbounded data stream into manageable, bounded chunks called windows. Windows are commonly bounded by time, by number of records, or by other criteria. Analyses and operations are then performed on the constantly updating data within each window, which lets you gain insights in real time while maintaining processing efficiency.

There are two common time-based windows used both in Spark Streaming and Flink that we will detail in this post: tumbling and sliding windows. A tumbling window is a time-based window that is a fixed size and doesn’t have any overlapping intervals. A sliding window is a time-based window that is a fixed size and moves forward in fixed intervals that can be overlapping.
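To see how the two window types assign an event to windows, consider the following plain-Python sketch. It uses integer timestamps in seconds and windows aligned to multiples of the window size or slide, which is an assumption of this sketch; the function names are hypothetical.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window of length `size` that contains ts."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """Return every [start, end) window of length `size`, starting at a
    multiple of `slide`, that contains ts."""
    windows = []
    start = ts - (ts % slide)  # latest window start at or before ts
    while start > ts - size:   # window still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


event_time = 720  # 12 minutes, expressed in seconds

tumbling = tumbling_window(event_time, size=600)
sliding = sliding_windows(event_time, size=600, slide=300)
print(tumbling, sliding)
```

With a 10-minute size and a 5-minute slide, each event belongs to two overlapping sliding windows but to exactly one tumbling window.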

Spark Streaming tumbling window query

The following is a Spark Streaming tumbling window query with a window size of 10 minutes:

from pyspark.sql.functions import window

stock_ticker = ...  # Streaming DataFrame w/ schema shown above

# Get max stock price in tumbling window
# of size 10 minutes
max_price_by_window = stock_ticker \
   .withWatermark("timestamp", "3 minutes") \
   .groupBy(
      window(stock_ticker.timestamp, "10 minutes"),
      stock_ticker.symbol) \
   .max("price")

Flink Streaming tumbling window query

The following is an equivalent tumbling window query in Flink with a window size of 10 minutes:

SELECT window_start, window_end, symbol, MAX(price) AS max_price
  FROM TABLE(
    TUMBLE(TABLE stock_ticker, DESCRIPTOR(`timestamp`), INTERVAL '10' MINUTES))
  GROUP BY window_start, window_end, symbol;

Spark Streaming sliding window query

The following is a Spark Streaming sliding window query with a window size of 10 minutes and slide interval of 5 minutes:

from pyspark.sql.functions import window

stock_ticker = ...  # Streaming DataFrame w/ schema shown above

# Get max stock price in sliding window
# of size 10 minutes and slide interval of size
# 5 minutes

max_price_by_window = stock_ticker \
   .withWatermark("timestamp", "3 minutes") \
   .groupBy(
      window(stock_ticker.timestamp, "10 minutes", "5 minutes"),
      stock_ticker.symbol) \
   .max("price")

Flink Streaming sliding window query

The following is a Flink sliding window query with a window size of 10 minutes and slide interval of 5 minutes:

SELECT window_start, window_end, symbol, MAX(price) AS max_price
  FROM TABLE(
    HOP(TABLE stock_ticker, DESCRIPTOR(`timestamp`), INTERVAL '5' MINUTES, INTERVAL '10' MINUTES))
  GROUP BY window_start, window_end, symbol;

Handling late data

Both Spark Structured Streaming and Flink support event time processing, where a field within the payload can be used for defining time windows as distinct from the wall clock time of the machines doing the processing. Both Flink and Spark use watermarking for this purpose.

Watermarking is used in stream processing engines to handle late-arriving data. A watermark acts like a timer that sets how long the system waits for late events. If an event arrives within the allowed time (that is, its event time is not older than the watermark), the system uses it to update the result. If the event is older than the watermark, the system ignores it.

In the preceding windowing queries, you specify the lateness threshold in Spark using the following code:

.withWatermark("timestamp", "3 minutes")

This means that any records that are more than 3 minutes late, as tracked by the event time clock, are treated as late and can be discarded.

In contrast, with the Flink Table API, you can specify an analogous lateness threshold directly in the DDL:

WATERMARK FOR `timestamp` AS `timestamp` - INTERVAL '3' MINUTE

Note that Flink provides additional constructs for specifying lateness across its various APIs.
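The lateness rule described in this section can be sketched in plain Python as follows. This is a simplified model that assumes the watermark is the largest event time seen so far minus the allowed delay and that it is updated on every record; real engines advance watermarks and evaluate lateness differently (for example, per trigger or per window). The function name and sample events are hypothetical.

```python
def split_late_events(events, max_delay):
    """Split (event_time, payload) pairs, given in arrival order, into
    accepted and dropped lists based on a simple watermark."""
    accepted, dropped = [], []
    max_seen = None
    for event_time, payload in events:
        if max_seen is None or event_time > max_seen:
            max_seen = event_time
        watermark = max_seen - max_delay  # oldest event time still accepted
        if event_time >= watermark:
            accepted.append(payload)
        else:
            dropped.append(payload)
    return accepted, dropped


# Event times in seconds, listed in arrival order; allowed delay is 3 minutes
events = [(600, "a"), (900, "b"), (750, "c"), (700, "d")]
accepted, dropped = split_late_events(events, max_delay=180)
print(accepted, dropped)
```

After event b arrives, the watermark sits at 720 seconds, so c (750) is still accepted even though it arrived out of order, while d (700) is dropped.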

Data enrichment

In this section, we compare data enrichment methods with Spark and Flink.

Calling an external API

Calling external APIs from user-defined functions (UDFs) is similar in Spark and Flink. Note that your UDF will be called for every record processed, which can result in the API getting called at a very high request rate. In addition, in production scenarios, your UDF code often gets run in parallel across multiple nodes, further amplifying the request rate.

For the following code snippets, let’s assume that the external API call entails calling the function:

response = my_external_api(request)
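One option for lowering the per-record request rate described above is to cache responses for keys that repeat, such as ticker symbols. The following is a minimal sketch using only the Python standard library; the body of my_external_api is a hypothetical stand-in, and caching is only appropriate when the API response for a given key does not change during the cache's lifetime.

```python
from functools import lru_cache

call_count = 0


def my_external_api(request):
    """Hypothetical stand-in for the external API; counts how often it is called."""
    global call_count
    call_count += 1
    return f"enriched:{request}"


@lru_cache(maxsize=10_000)
def enrich(symbol):
    # Only the first lookup for each symbol reaches the API; repeats hit the cache.
    return my_external_api(symbol)


results = [enrich(s) for s in ["AMZN", "MSFT", "AMZN", "AMZN", "MSFT"]]
```

Here five records result in two API calls. In a distributed job each worker process holds its own cache, so the saving applies per worker rather than across the whole cluster.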

External API call in Spark UDF

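The following snippet shows the class-based UDF pattern with a one-time open method and a per-record eval method. Note that ScalarFunction is the PyFlink UDF base class, and that the example loads a pickled model rather than calling my_external_api:

import pickle

from pyflink.table.udf import ScalarFunction


class Predict(ScalarFunction):
    def open(self, function_context):
        # Runs one time per task: load the pickled model from the job resources.
        with open("resources.zip/resources/model.pkl", "rb") as f:
            self.model = pickle.load(f)

    def eval(self, x):
        # Runs for every record processed.
        return self.model.predict(x)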

External API call in Flink UDF

For Flink, assume we define the UDF callExternalAPIUDF, which takes as input the ticker symbol symbol and returns enriched information about the symbol via a REST endpoint. We can then register and call the UDF as follows:

callExternalAPIUDF = udf(callExternalAPIUDF(), result_type=DataTypes.STRING())

SELECT
    symbol, 
    callExternalAPIUDF(symbol) as enriched_symbol
FROM stock_ticker;

Flink UDFs provide an initialization method that gets run one time (as opposed to one time per record processed).

Note that you should use UDFs judiciously as an improperly implemented UDF can cause your job to slow down, cause backpressure, and eventually stall your stream processing application. It’s advisable to use UDFs asynchronously to maintain high throughput, especially for I/O-bound use cases or when dealing with external resources like databases or REST APIs. To learn more about how you can use asynchronous I/O with Apache Flink, refer to Enrich your data stream asynchronously using Amazon Kinesis Data Analytics for Apache Flink.
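The idea behind asynchronous enrichment can be illustrated with Python's asyncio: several requests are in flight at once, with a cap so the external service is not overwhelmed. This is a generic sketch, not Flink's async I/O operator; my_external_api is a hypothetical stand-in that simulates latency, and max_in_flight is an assumed tuning parameter.

```python
import asyncio


async def my_external_api(request):
    """Hypothetical async stand-in for the external REST call."""
    await asyncio.sleep(0.01)  # simulated network latency
    return f"enriched:{request}"


async def enrich_all(symbols, max_in_flight=10):
    # Cap the number of concurrent requests to protect the external API.
    semaphore = asyncio.Semaphore(max_in_flight)

    async def enrich(symbol):
        async with semaphore:
            return symbol, await my_external_api(symbol)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(enrich(s) for s in symbols))


enriched = asyncio.run(enrich_all(["AMZN", "MSFT", "GOOG"]))
```

Because the three calls overlap, the total wait is close to one round trip rather than three, which is the throughput benefit for I/O-bound enrichment.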

Conclusion

Apache Flink and Apache Spark are both rapidly evolving projects and provide a fast and efficient way to process big data. This post focused on the top use cases we commonly encountered when customers wanted to see parallels between the two technologies for building real-time stream processing applications. We’ve included samples that were most frequently requested at the time of this writing. Let us know if you’d like more examples in the comments section.

About the authors

Deepthi Mohan is a Principal Product Manager on the Amazon Kinesis Data Analytics team.

Karthi Thyagarajan was a Principal Solutions Architect on the Amazon Kinesis team.

Simplify external object access in Amazon Redshift using automatic mounting of the AWS Glue Data Catalog

2023-07-28 Maneesh Sharma

Post Syndicated from Maneesh Sharma original https://aws.amazon.com/blogs/big-data/simplify-external-object-access-in-amazon-redshift-using-automatic-mounting-of-the-aws-glue-data-catalog/

Amazon Redshift is a petabyte-scale, enterprise-grade cloud data warehouse service delivering the best price-performance. Today, tens of thousands of customers run business-critical workloads on Amazon Redshift to cost-effectively and quickly analyze their data using standard SQL and existing business intelligence (BI) tools.

Amazon Redshift now makes it easier for you to run queries in AWS data lakes by automatically mounting the AWS Glue Data Catalog. You no longer have to create an external schema in Amazon Redshift to use the data lake tables cataloged in the Data Catalog. Now, you can use your AWS Identity and Access Management (IAM) credentials or IAM role to browse the Glue Data Catalog and query data lake tables directly from Amazon Redshift Query Editor v2 or your preferred SQL editors.

This feature is now available in all AWS commercial and AWS GovCloud (US) Regions where Amazon Redshift RA3, Amazon Redshift Serverless, and AWS Glue are available. To learn more about auto-mounting of the Data Catalog in Amazon Redshift, refer to Querying the AWS Glue Data Catalog.

Enabling easy analytics for everyone

Amazon Redshift is helping tens of thousands of customers manage analytics at scale. Amazon Redshift offers a powerful analytics solution that provides access to insights for users of all skill levels. You can take advantage of the following benefits:

It enables organizations to analyze diverse data sources, including structured, semi-structured, and unstructured data, facilitating comprehensive data exploration
With its high-performance processing capabilities, Amazon Redshift handles large and complex datasets, ensuring fast query response times and supporting real-time analytics
Amazon Redshift provides features like Multi-AZ (preview) and cross-Region snapshot copy for high availability and disaster recovery, and provides authentication and authorization mechanisms to make it reliable and secure
With features like Amazon Redshift ML, it democratizes ML capabilities across a variety of user personas
The flexibility to utilize different table formats such as Apache Hudi, Delta Lake, and Apache Iceberg (preview) optimizes query performance and storage efficiency
Integration with advanced analytical tools empowers you to apply sophisticated techniques and build predictive models
Scalability and elasticity allow for seamless expansion as data and workloads grow

Overall, Amazon Redshift empowers organizations to uncover valuable insights, enhance decision-making, and gain a competitive edge in today’s data-driven landscape.

Amazon Redshift Top Benefits

The new automatic mounting of the AWS Glue Data Catalog feature enables you to directly query AWS Glue objects in Amazon Redshift without the need to create an external schema for each AWS Glue database you want to query. With automatic mounting of the Data Catalog, Amazon Redshift automatically mounts the cluster account’s default Data Catalog during boot or user opt-in as an external database, named awsdatacatalog.

Relevant use cases for automatic mounting of the AWS Glue Data Catalog feature

You can use tools like Amazon EMR to create new data lake schemas in various formats, such as Apache Hudi, Delta Lake, and Apache Iceberg (preview). However, when analysts want to run queries against these schemas, it requires administrators to create external schemas for each AWS Glue database in Amazon Redshift. You can now simplify this integration using automatic mounting of the AWS Glue Data Catalog.

The following diagram illustrates this architecture.

Solution overview

You can now use SQL clients like Amazon Redshift Query Editor v2 to browse and query awsdatacatalog. In Query Editor v2, to connect to the awsdatacatalog database, use the authentication method that matches your deployment:

For a Redshift provisioned cluster, you must use the authentication method Temporary credentials using your IAM identity.
For a Redshift Serverless workgroup, you must use the authentication method Federated user.

Complete the following high-level steps to integrate the automatic mounting of the Data Catalog using Query Editor V2 and a third-party SQL client:

Provision resources with AWS CloudFormation to populate Data Catalog objects.
Connect with Redshift Serverless and query the Data Catalog as a federated user using Query Editor V2.
Connect with Redshift provisioned cluster and query the Data Catalog using Query Editor V2.
Configure permissions on catalog resources using AWS Lake Formation.
Federate with Redshift Serverless and query the Data Catalog using Query Editor V2 and a third-party SQL client.
Discover the auto-mounted objects.
Connect with Redshift provisioned cluster and query the Data Catalog as a federated user using a third-party client.
Connect with Amazon Redshift and query the Data Catalog as an IAM user using third-party clients.

The following diagram illustrates the solution workflow.

Prerequisites

You should have the following prerequisites:

An AWS account. If you don’t have one, you can sign up for one.
A Redshift cluster. For setup instructions, see Create a sample Amazon Redshift cluster.
Alternatively, you could use a Redshift Serverless endpoint. For setup instructions, see Getting started with Amazon Redshift Serverless.
The latest Amazon Redshift JDBC driver version.
A SQL client such as SQL Workbench/J.

Provision resources with AWS CloudFormation to populate Data Catalog objects

In this post, we use an AWS Glue crawler to create the external table ny_pub stored in Apache Parquet format in the Amazon Simple Storage Service (Amazon S3) location s3://redshift-demos/data/NY-Pub/. In this step, we create the solution resources using AWS CloudFormation to create a stack named CrawlS3Source-NYTaxiData in either us-east-1 (use the yml download or launch stack) or us-west-2 (use the yml download or launch stack). Stack creation performs the following actions:

Creates the crawler NYTaxiCrawler along with the new IAM role AWSGlueServiceRole-RedshiftAutoMount
Creates automountdb as the AWS Glue database

When the stack is complete, perform the following steps:

On the AWS Glue console, under Data Catalog in the navigation pane, choose Crawlers.
Open NYTaxiCrawler and choose Run crawler.

After the crawler is complete, you can see a new table called ny_pub in the Data Catalog under the automountdb database.

Alternatively, you can follow the manual instructions from the Amazon Redshift labs to create the ny_pub table.

Connect with Redshift Serverless and query the Data Catalog as a federated user using Query Editor V2

In this section, we use an IAM role with principal tags to enable fine-grained federated authentication to Redshift Serverless to access auto-mounting AWS Glue objects.

Complete the following steps:

Create an IAM role named automountrole and add the following permissions. For this post, we add full AWS Glue, Amazon Redshift, and Amazon S3 permissions for demo purposes. In an actual production scenario, it’s recommended to apply more granular permissions.
On the Tags tab, create a tag with Key as RedshiftDbRoles and Value as automount.
In Query Editor V2, run the following SQL statement as an admin user to create a database role named automount:
```
CREATE ROLE automount;
```

Grant usage privileges to the database role:

GRANT USAGE ON DATABASE awsdatacatalog TO ROLE automount;

Switch the role to automountrole by passing the account number and role name.
In the Query Editor v2, choose your Redshift Serverless endpoint (right-click) and choose Create connection.
For Authentication, select Federated user.
For Database, enter the database name you want to connect to.
Choose Create connection.

You’re now ready to explore and query the automatic mounting of the Data Catalog in Redshift Serverless.

Connect with Redshift provisioned cluster and query the Data Catalog using Query Editor V2

To connect with Redshift provisioned cluster and access the Data Catalog, make sure you have completed the steps in the preceding section. Then complete the following steps:

Connect to Redshift Query Editor V2 using the database user name and password authentication method. For example, connect to the dev database using the admin user and password.
In an editor tab, run the following SQL statement to grant the IAM role automountrole access to the Data Catalog (this assumes that the corresponding IAMR:automountrole user is already present in Amazon Redshift):
```
GRANT USAGE ON DATABASE awsdatacatalog to "IAMR:automountrole";
```
As an admin user, choose the Settings icon, choose Account settings, and select Authenticate with IAM credentials.
Choose Save.
Switch roles to automountrole by passing the account number and role name.
Create or edit the connection and use the authentication method Temporary credentials using your IAM identity.

For more information about this authentication method, see Connecting to an Amazon Redshift database.

You are ready to explore and query the automatic mounting of the Data Catalog in Amazon Redshift.

Discover the auto-mounted objects

This section illustrates the SHOW commands for discovery of auto-mounted objects. See the following code:

-- Discovery of AWS Glue databases at the schema level
SHOW SCHEMAS FROM DATABASE awsdatacatalog;

-- Discovery of AWS Glue tables
-- Syntax: SHOW TABLES FROM SCHEMA awsdatacatalog.<glue_db_name>;
SHOW TABLES FROM SCHEMA awsdatacatalog.automountdb;

-- Discovery of AWS Glue table columns
-- Syntax: SHOW COLUMNS FROM TABLE awsdatacatalog.<glue_db_name>.<glue_table_name>;
SHOW COLUMNS FROM TABLE awsdatacatalog.automountdb.ny_pub;

Configure permissions on catalog resources using AWS Lake Formation

To maintain backward compatibility with AWS Glue, Lake Formation has the following initial security settings:

The Super permission is granted to the group IAMAllowedPrincipals on all existing Data Catalog resources
The Use only IAM access control setting is enabled for new Data Catalog resources

These settings effectively cause access to Data Catalog resources and Amazon S3 locations to be controlled solely by IAM policies. Individual Lake Formation permissions are not in effect.

In this step, we will configure permissions on catalog resources using AWS Lake Formation. Before you create the Data Catalog, you need to update the default settings of Lake Formation so that access to Data Catalog resources (databases and tables) is managed by Lake Formation permissions:

Change the default security settings for new resources. For instructions, see Change the default permission model.
Change the settings for existing Data Catalog resources. For instructions, see Upgrading AWS Glue data permissions to the AWS Lake Formation model.

For more information, refer to Changing the default settings for your data lake.

Federate with Redshift Serverless and query the Data Catalog using Query Editor V2 and a third-party SQL client

With Redshift Serverless, you can connect to awsdatacatalog from a third-party client as a federated user from any identity provider (IdP). In this section, we configure permissions on catalog resources for the federated IAM role in AWS Lake Formation. When you use AWS Lake Formation with Amazon Redshift, permissions can currently be applied at the IAM user or IAM role level.

To connect as a federated user, we will be using Redshift Serverless. For setup instructions, refer to Single sign-on with Amazon Redshift Serverless with Okta using Amazon Redshift Query Editor v2 and third-party SQL clients.

Additional changes are required on the following resources:

In Amazon Redshift, as an admin user, grant the usage to each federated user who needs access on awsdatacatalog:
```
GRANT USAGE ON DATABASE awsdatacatalog to "IAMR:[email protected]";
```

If the user doesn’t exist in Amazon Redshift, you may need to create the IAM user with the password disabled as shown in the following code and then grant usage on awsdatacatalog:

Create User "IAMR:[email protected]" with password disable;

On the Lake Formation console, assign permissions on the AWS Glue database to the IAM role that you created as part of the federated setup.
1. Under Principals, select IAM users and roles.
2. Choose IAM role oktarole.
3. Apply catalog resource permissions, selecting automountdb database and granting appropriate table permissions.
Update the IAM role used in the federation setup. In addition to the permissions already attached to the IAM role, you need to add AWS Glue permissions and Amazon S3 permissions to access objects from Amazon S3. For this post, we add full AWS Glue and Amazon S3 permissions for demo purposes. In an actual production scenario, it’s recommended to apply more granular permissions.

Now you’re ready to connect to Redshift Serverless using the Query Editor V2 and federated login.

Use the SSO URL from Okta and log in to your Okta account with your user credentials. For this demo, we log in with user Ethan.
In the Query Editor v2, choose your Redshift Serverless instance (right-click) and choose Create connection.
For Authentication, select Federated user.
For Database, enter the database name you want to connect to.
Choose Create connection.
Run the command select current_user to validate that you are logged in as a federated user.

User Ethan will be able to explore and access awsdatacatalog data.

To connect Redshift Serverless with a third-party client, make sure you have followed all the previous steps.

For SQLWorkbench setup, refer to the section Configure the SQL client (SQL Workbench/J) in Single sign-on with Amazon Redshift Serverless with Okta using Amazon Redshift Query Editor v2 and third-party SQL clients.

The following screenshot shows that federated user ethan is able to query the awsdatacatalog tables using three-part notation:

Connect with Redshift provisioned cluster and query the Data Catalog as a federated user using third-party clients

With Redshift provisioned cluster, you can connect with awsdatacatalog from a third-party client as a federated user from any IdP.

To connect as a federated user with the Redshift provisioned cluster, you need to follow the steps in the previous section that detailed how to connect with Redshift Serverless and query the Data Catalog as a federated user using Query Editor V2 and a third-party SQL client.

There are additional changes required in IAM policy. Update the IAM policy with the following code to use the GetClusterCredentialsWithIAM API:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:ListGroups",
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentialsWithIAM",
            "Resource": "arn:aws:redshift:us-east-2:01234567891:dbname:redshift-cluster-1/dev"
        }
    ]
}
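If you manage several clusters, you may prefer to generate this policy rather than edit the ARN by hand. The following sketch builds the same policy document with the Python standard library; the function name build_policy and its parameters are hypothetical, and the account ID is left as a placeholder for you to fill in.

```python
import json


def build_policy(region, account_id, cluster_id, db_name):
    """Build the policy document shown above for one cluster and database."""
    resource = f"arn:aws:redshift:{region}:{account_id}:dbname:{cluster_id}/{db_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "iam:ListGroups",
                "Resource": "*",
            },
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": "redshift:GetClusterCredentialsWithIAM",
                "Resource": resource,
            },
        ],
    }


# "<account-id>" is a placeholder, not a real account number.
policy = build_policy("us-east-2", "<account-id>", "redshift-cluster-1", "dev")
policy_json = json.dumps(policy, indent=4)
```

The resulting policy_json string can be pasted into the IAM console or passed to whatever tooling you use to manage policies.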

Now you’re ready to connect to Redshift provisioned cluster using a third-party SQL client as a federated user.

For SQLWorkbench setup, refer to the section Configure the SQL client (SQL Workbench/J) in the post Single sign-on with Amazon Redshift Serverless with Okta using Amazon Redshift Query Editor v2 and third-party SQL clients.

Make the following changes:

Use the latest Redshift JDBC driver, because only the latest driver supports querying the auto-mounted Data Catalog table for federated users
For URL, enter jdbc:redshift:iam://<cluster endpoint>:<port>/<databasename>?groupfederation=true. For example, jdbc:redshift:iam://redshift-cluster-1.abdef0abc0ab.us-east-2.redshift.amazonaws.com:5439/dev?groupfederation=true.

In the preceding URL, groupfederation is a mandatory parameter that allows you to authenticate with the IAM credentials.

The following screenshot shows that federated user ethan is able to query the awsdatacatalog tables using three-part notation.

Connect and query the Data Catalog as an IAM user using third-party clients

In this section, we provide instructions to set up a SQL client to query the auto-mounted awsdatacatalog.

Use three-part notation to reference the awsdatacatalog table in your SELECT statement. The first part is the database name, the second part is the AWS Glue database name, and the third part is the AWS Glue table name:

SELECT * FROM awsdatacatalog.<aws-glue-db-name>.<aws-glue-table-name>;

You can perform various scenarios that read the Data Catalog data and populate Redshift tables.
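When queries are generated by a script, a small helper can assemble the three-part name consistently. The following is a sketch under the assumption that double-quoting each identifier is acceptable in your environment; the function names quote_ident and three_part_name are hypothetical.

```python
def quote_ident(name):
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def three_part_name(glue_db, glue_table, database="awsdatacatalog"):
    """Return <database>.<AWS Glue database>.<AWS Glue table> with each part quoted."""
    return ".".join(quote_ident(part) for part in (database, glue_db, glue_table))


# automountdb and ny_pub are the database and table created earlier in this post.
query = f"SELECT * FROM {three_part_name('automountdb', 'ny_pub')} LIMIT 10;"
```

Quoting each part separately keeps a database or table name that contains unusual characters from being misread as part of the surrounding SQL.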

For this post, we use SQLWorkbench/J as the SQL client to query the Data Catalog. To set up SQL Workbench/J, complete the following steps:

Create a new connection in SQL Workbench/J and choose Amazon Redshift as the driver.
Choose Manage drivers and add all the files from the downloaded AWS JDBC driver pack .zip file (remember to unzip the .zip file).

You must use the latest Redshift JDBC driver, because only the latest driver supports querying the auto-mounted Data Catalog table.

For URL, enter jdbc:redshift:iam://<cluster endpoint>:<port>/<databasename>?profile=<profilename>&groupfederation=true. For example, jdbc:redshift:iam://redshift-cluster-1.abdef0abc0ab.us-east-2.redshift.amazonaws.com:5439/dev?profile=user2&groupfederation=true.

We are using profile-based credentials as an example. You can use any AWS profile or IAM credential-based authentication as per your requirement. For more information on IAM credentials, refer to Options for providing IAM credentials.
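The connection URLs in this post follow one pattern, with the profile parameter being optional. The following sketch assembles such a URL in Python so that the endpoint, port, and database separators are not mistyped; build_jdbc_url is a hypothetical helper, and it mirrors the form of the example URLs above rather than documenting every driver option.

```python
from urllib.parse import urlencode


def build_jdbc_url(endpoint, port, database, profile=None):
    """Build an IAM-based Redshift JDBC URL with groupfederation enabled."""
    params = {}
    if profile:
        params["profile"] = profile  # optional AWS profile for credentials
    params["groupfederation"] = "true"  # required for the auto-mounted Data Catalog
    return f"jdbc:redshift:iam://{endpoint}:{port}/{database}?{urlencode(params)}"


url = build_jdbc_url(
    "redshift-cluster-1.abdef0abc0ab.us-east-2.redshift.amazonaws.com",
    5439,
    "dev",
    profile="user2",
)
```

Omitting the profile argument yields the shorter form used for federated users.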

The following screenshot shows that IAM user johndoe is able to list the awsdatacatalog tables using the SHOW command.

The following screenshot shows that IAM user johndoe is able to query the awsdatacatalog tables using three-part notation:

If you get the following error while using groupfederation=true, you need to use the latest Redshift driver:

Something unusual has occurred to cause the driver to fail. Please report this exception:Authentication with plugin is not supported for group federation [SQL State=99999]

Clean up

Complete the following steps to clean up your resources:

Delete the IAM role automountrole.
Delete the CloudFormation stack CrawlS3Source-NYTaxiData to clean up the crawler NYTaxiCrawler, the automountdb database from the Data Catalog, and the IAM role AWSGlueServiceRole-RedshiftAutoMount.
Update the default settings of Lake Formation:
1. In the navigation pane, under Data catalog, choose Settings.
2. Select both access control options and choose Save.
3. In the navigation pane, under Permissions, choose Administrative roles and tasks.
4. In the Database creators section, choose Grant.
5. Search for IAMAllowedPrincipals and select Create database permission.
6. Choose Grant.

Considerations

Note the following considerations:

The Data Catalog auto-mount provides ease of use to analysts or database users. The security setup (setting up the permissions model or data governance) is owned by account and database administrators.
- To achieve fine-grained access control, build a permissions model in AWS Lake Formation.
- If the permissions have to be maintained at the Redshift database level, leave the AWS Lake Formation default settings as is and then run grant/revoke in Amazon Redshift.
If you are using a third-party SQL editor, and your query tool does not support browsing of multiple databases, you can use the SHOW commands to list your AWS Glue databases and tables. You can also query awsdatacatalog objects using three-part notation (SELECT * FROM awsdatacatalog.<aws-glue-db-name>.<aws-glue-table-name>;) provided you have access to the external objects based on the permission model.

Conclusion

In this post, we introduced the automatic mounting of AWS Glue Data Catalog, which makes it easier for customers to run queries in their data lakes. This feature streamlines data governance and access control, eliminating the need to create an external schema in Amazon Redshift to use the data lake tables cataloged in AWS Glue Data Catalog. We showed how you can manage permission on auto-mounted AWS Glue-based objects using Lake Formation. The permission model can be easily managed and organized by administrators, allowing database users to seamlessly access external objects they have been granted access to.

As we strive for enhanced usability in Amazon Redshift, we prioritize unified data governance and fine-grained access control. This feature minimizes manual effort while ensuring the necessary security measures for your organization are in place.

For more information about automatic mounting of the Data Catalog in Amazon Redshift, refer to Querying the AWS Glue Data Catalog.

About the Authors

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Debu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Rohit Vashishtha is a Senior Analytics Specialist Solutions Architect at AWS based in Dallas, Texas. He has 17 years of experience architecting, building, leading, and maintaining big data platforms. Rohit helps customers modernize their analytic workloads using the breadth of AWS services and ensures that customers get the best price/performance with utmost security and data governance.

Five actionable steps to GDPR compliance (Right to be forgotten) with Amazon Redshift

2023-07-28 Kishore Tata

Post Syndicated from Kishore Tata original https://aws.amazon.com/blogs/big-data/five-actionable-steps-to-gdpr-compliance-right-to-be-forgotten-with-amazon-redshift/

The GDPR (General Data Protection Regulation) right to be forgotten, also known as the right to erasure, gives individuals the right to request the deletion of their personally identifiable information (PII) data held by organizations. This means that individuals can ask companies to erase their personal data from their systems and any third parties with whom the data was shared. Organizations must comply with these requests provided that there are no legitimate grounds for retaining the personal data, such as legal obligations or contractual requirements.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It is designed for analyzing large volumes of data and performing complex queries on structured and semi-structured data. Many customers are looking for best practices to keep their Amazon Redshift analytics environment compliant and to be able to respond to GDPR right to be forgotten requests.

In this post, we discuss challenges associated with implementation and architectural patterns and actionable best practices for organizations to respond to the right to be forgotten request requirements of the GDPR for data stored in Amazon Redshift.

Who does GDPR apply to?

The GDPR applies to all organizations established in the EU and to organizations, whether or not established in the EU, that process the personal data of EU individuals in connection with either the offering of goods or services to data subjects in the EU or the monitoring of behavior that takes place within the EU.

The following are key terms we use when discussing the GDPR:

Data subject – An identifiable living person and resident in the EU or UK, on whom personal data is held by a business or organization or service provider
Processor – The entity that processes the data on the instructions of the controller (for example, AWS)
Controller – The entity that determines the purposes and means of processing personal data (for example, an AWS customer)
Personal data – Information relating to an identified or identifiable person, including names, email addresses, and phone numbers

Implementing the right to be forgotten can include the following challenges:

Data identification – One of the main challenges is identifying all instances of personal data across various systems, databases, and backups. Organizations need to have a clear understanding of where personal data is being stored and how it is processed to effectively fulfill the deletion requests.
Data dependencies – Personal data can be interconnected and intertwined with other data systems, making it challenging to remove specific data without impacting the integrity or functionality of other systems or processes. It requires careful analysis to identify data dependencies and mitigate any potential risks or disruptions.
Data replication and backups – Personal data can exist in multiple copies due to data replication and backups. Ensuring the complete removal of data from all these copies and backups can be challenging. Organizations need to establish processes to track and manage data copies effectively.
Legal obligations and exemptions – The right to be forgotten is not absolute and may be subject to legal obligations or exemptions. Organizations need to carefully assess requests, considering factors such as legal requirements, legitimate interests, freedom of expression, or public interest to determine if the request can be fulfilled or if any exceptions apply.
Data archiving and retention – Organizations may have legal or regulatory requirements to retain certain data for a specific period. Balancing the right to be forgotten with the obligation to retain data can be a challenge. Clear policies and procedures need to be established to manage data retention and deletion appropriately.

Architecture patterns

Organizations are generally required to respond to right to be forgotten requests within 30 days from when the individual submits a request. This deadline can be extended by a maximum of 2 months taking into account the complexity and the number of the requests, provided that the data subject has been informed about the reasons for the delay within 1 month of the receipt of the request.
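For tracking purposes, the response window can be computed when a request is logged. The following sketch assumes the window is counted in calendar months from the date of receipt (one month, plus up to two further months when an extension applies) and clamps to the end of shorter months; the function names are hypothetical, and how your organization counts the period may differ.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Add calendar months, clamping the day to the last day of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def response_deadlines(received):
    """Return (standard due date, latest due date if the extension is used)."""
    due = add_months(received, 1)            # one month from receipt
    extended_due = add_months(received, 3)   # one month plus a two-month extension
    return due, extended_due


due, extended_due = response_deadlines(date(2023, 1, 31))
```

Storing both dates alongside the request makes it straightforward to alert on requests that are approaching either limit.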

The following sections discuss a few commonly referenced architecture patterns, best practices, and options supported by Amazon Redshift to support your data subject’s GDPR right to be forgotten request in your organization.

Actionable Steps

Data management and governance

Addressing the challenges mentioned requires a combination of technical, operational, and legal measures. Organizations need to develop robust data governance practices, establish clear procedures for handling deletion requests, and maintain ongoing compliance with GDPR regulations.

Large organizations usually have multiple Redshift environments, databases, and tables spread across multiple Regions and accounts. To successfully respond to a data subject’s requests, organizations should have a clear strategy to determine how data is forgotten, flagged, anonymized, or deleted, and they should have clear guidelines in place for data audits.

Data mapping involves identifying and documenting the flow of personal data in an organization. It helps organizations understand how personal data moves through their systems, where it is stored, and how it is processed. By creating visual representations of data flows, organizations can gain a clear understanding of the lifecycle of personal data and identify potential vulnerabilities or compliance gaps.

Note that putting a comprehensive data strategy in place is not in scope for this post.

Audit tracking

Organizations must maintain proper documentation and audit trails of the deletion process to demonstrate compliance with GDPR requirements. A typical audit control framework should record the data subject requests (who is the data subject, when was it requested, what data, approver, due date, scheduled ETL process if any, and so on). This will help with your audit requests and provide the ability to roll back in case of accidental deletions observed during the QA process. It’s important to maintain the list of users and systems who may get impacted during this process to ensure effective communication.
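The audit fields listed above can be captured in a simple record type. The following is an illustrative sketch only; the class name ErasureRequest, its field names, and the status values are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class ErasureRequest:
    """One right to be forgotten request and its audit trail."""

    request_id: str
    data_subject: str                # who the data subject is
    requested_on: date               # when the request was received
    due_on: date                     # response deadline
    data_in_scope: List[str]         # what data is affected (for example, tables)
    approver: Optional[str] = None
    etl_job: Optional[str] = None    # scheduled ETL process, if any
    status: str = "received"
    history: List[Tuple[str, str, date]] = field(default_factory=list)

    def transition(self, new_status, on):
        # Record every status change so the trail can be audited later.
        self.history.append((self.status, new_status, on))
        self.status = new_status


request = ErasureRequest(
    request_id="REQ-0001",
    data_subject="customer-42",
    requested_on=date(2023, 7, 1),
    due_on=date(2023, 8, 1),
    data_in_scope=["Customer_PII"],
)
request.transition("approved", date(2023, 7, 3))
request.transition("deleted", date(2023, 7, 10))
```

Keeping the history as an append-only list preserves the order of events, which is what an auditor typically needs to see.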

Data discovery and findability

Findability is an important step of the process. Organizations need to have mechanisms to find the data under consideration in an efficient and quick manner for timely response. The following are some patterns and best practices you can employ to find the data in Amazon Redshift.

Tagging

Consider tagging your Amazon Redshift resources to quickly identify which clusters and snapshots contain the PII data, the owners, the data retention policy, and so on. Tags provide metadata about resources at a glance. Redshift resources, such as namespaces, workgroups, snapshots, and clusters can be tagged. For more information about tagging, refer to Tagging resources in Amazon Redshift.

Naming conventions

As a part of the modeling strategy, name the database objects (databases, schemas, tables, columns) with an indicator that they contain PII so that they can be queried using system tables (for example, make a list of the tables and columns where PII data is involved). Identifying the list of tables and users or the systems that have access to them will help streamline the communication process. The following sample SQL can help you find the databases, schemas, and tables with a name that contains PII:

-- Columns in the current database whose names contain "pii".
-- Unquoted identifiers are stored in lowercase by default, so ILIKE is used for a case-insensitive match.
SELECT
current_database() AS database_name,
pg_catalog.pg_namespace.nspname AS schema_name,
pg_catalog.pg_class.relname AS table_name,
pg_catalog.pg_attribute.attname AS column_name
FROM
pg_catalog.pg_namespace
JOIN pg_catalog.pg_class ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
JOIN pg_catalog.pg_attribute ON pg_catalog.pg_class.oid = pg_catalog.pg_attribute.attrelid
WHERE
pg_catalog.pg_attribute.attnum > 0
AND pg_catalog.pg_attribute.attname ILIKE '%pii%';

SELECT datname
FROM pg_database
WHERE datname ILIKE '%pii%';

SELECT table_schema, table_name, column_name
FROM information_schema.columns
WHERE column_name ILIKE '%pii%';

Separate PII and non-PII

Whenever possible, keep the sensitive data in a separate table, database, or schema. Isolating the data in a separate database may not always be possible. However, you can split the PII and non-PII columns into separate tables, for example, Customer_PII and Customer_NonPII, and then join them with an unintelligent (surrogate) key. This makes it clear which tables contain PII and which do not. This approach is straightforward to implement and keeps non-PII data intact, which can be useful for analysis purposes. The following figure shows an example of these tables.

PII-Non PII Example Tables
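
The split can be sketched in a few lines of Python. This is a minimal illustration that uses an in-memory SQLite database as a stand-in for Amazon Redshift; the column names (customer_key, segment, full_name, and so on) and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer_NonPII (customer_key INTEGER PRIMARY KEY, segment TEXT, lifetime_value REAL);
    CREATE TABLE Customer_PII (customer_key INTEGER PRIMARY KEY, full_name TEXT, email TEXT);
    INSERT INTO Customer_NonPII VALUES (1, 'retail', 1200.0), (2, 'enterprise', 98000.0);
    INSERT INTO Customer_PII VALUES (1, 'Ana Example', 'ana@example.com'), (2, 'Bo Example', 'bo@example.com');
""")

# Erasing a customer touches only the PII table.
conn.execute("DELETE FROM Customer_PII WHERE customer_key = ?", (1,))

# Analytics on non-PII data still see both customers.
segments = conn.execute(
    "SELECT customer_key, segment FROM Customer_NonPII ORDER BY customer_key"
).fetchall()

# A join on the surrogate key returns personal data only where it still exists.
joined = conn.execute("""
    SELECT n.customer_key, p.full_name
    FROM Customer_NonPII n
    LEFT JOIN Customer_PII p ON p.customer_key = n.customer_key
    ORDER BY n.customer_key
""").fetchall()

print(segments)
print(joined)
```

Because the key carries no meaning of its own, the non-PII rows remain usable for analysis after the matching PII row is gone.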

Flag columns

In the preceding tables, rows in bold are marked with Forgotten_flag=Yes. You can maintain a Forgotten_flag as a column with the default value as No and update this value to Yes whenever a request to be forgotten is received. The downstream and upstream systems need to respect this flag and include it in their processing. This helps identify the rows that need to be deleted. Also, as a best practice from HIPAA, do a batch deletion once a month. For our example, we can use the following code:

DELETE FROM Customer_PII WHERE forgotten_flag = 'Yes';
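
The flag workflow can be sketched end to end as follows. This is a minimal illustration that uses an in-memory SQLite database in place of Amazon Redshift; the helper names mark_forgotten and purge_forgotten, the customer_key column, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer_PII (
        customer_key INTEGER PRIMARY KEY,
        full_name TEXT,
        forgotten_flag TEXT NOT NULL DEFAULT 'No'
    )
""")
conn.executemany(
    "INSERT INTO Customer_PII (customer_key, full_name) VALUES (?, ?)",
    [(1, "Ana Example"), (2, "Bo Example"), (3, "Cy Example")],
)


def mark_forgotten(connection, customer_key):
    """Record a right-to-be-forgotten request by flipping the flag to Yes."""
    connection.execute(
        "UPDATE Customer_PII SET forgotten_flag = 'Yes' WHERE customer_key = ?",
        (customer_key,),
    )


def purge_forgotten(connection):
    """Batch-delete all flagged rows and return how many were removed."""
    cursor = connection.execute("DELETE FROM Customer_PII WHERE forgotten_flag = 'Yes'")
    return cursor.rowcount


mark_forgotten(conn, 2)
deleted = purge_forgotten(conn)
remaining = [row[0] for row in conn.execute(
    "SELECT customer_key FROM Customer_PII ORDER BY customer_key")]
print(deleted, remaining)
```

Separating the flag update from the purge lets requests be recorded as they arrive while the deletion itself runs on a schedule.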

Use a master data management system

Organizations that maintain a master data management system maintain a golden record for a customer, which acts as a single version of truth from multiple disparate systems. These systems also contain crosswalks with several peripheral systems that contain the natural key of the customer and golden record. This technique helps find customer records and related tables. The following is a representative example of a crosswalk table in a master data management system.

Example of MDM records

Use AWS Lake Formation

Some organizations need to share data across multiple departments and business units and use Amazon Redshift data sharing to do so. You can use AWS Lake Formation tags to tag the database objects and columns and define fine-grained access controls on who can access the data. Organizations can have a dedicated resource with access to all tagged resources. With Lake Formation, you can centrally define and enforce database-, table-, column-, and row-level access permissions of Redshift data shares and restrict user access to objects within a data share.

By sharing data through Lake Formation, you can define permissions in Lake Formation and apply those permissions to data shares and their objects. For example, if you have a table containing employee information, you can use column-level filters to help prevent employees who don’t work in the HR department from seeing sensitive information. Refer to AWS Lake Formation-managed Redshift shares for more details on the implementation.

Use Amazon DataZone

Amazon DataZone introduces a business metadata catalog. Business metadata provides information authored or used by businesses and gives context to organizational data. Data discovery is a key task that business metadata can support. Data discovery uses centrally defined corporate ontologies and taxonomies to classify data sources and allows you to find relevant data objects. You can add business metadata in Amazon DataZone to support data discovery.

Data erasure

By using the approaches we’ve discussed, you can find the clusters, databases, tables, columns, and snapshots that contain the data to be deleted. The following are some methods and best practices for data erasure.

Restricted backup

In some use cases, you may have to keep data backed up to align with government regulations for a certain period of time. It’s a good idea to take the backup of the data objects before deletion and keep it for an agreed-upon retention time. You can use AWS Backup to take automatic or manual backups. AWS Backup allows you to define a central backup policy to manage the data protection of your applications. For more information, refer to New – Amazon Redshift Support in AWS Backup.

Physical deletes

After we find the tables that contain the data, we can delete the data using the following code (using the flagging technique discussed earlier):

DELETE FROM Customer_PII WHERE forgotten_flag = 'Yes';

It’s a good practice to delete data at a specified schedule, such as once every 25–30 days, so that it is simpler to maintain the state of the database.

Logical deletes

You may need to keep data in a separate environment for audit purposes. You can employ Amazon Redshift row access policies and conditional dynamic masking policies to filter and anonymize the data.

You can use row access policies on Forgotten_flag=No on the tables that contain PII data so that the designated users can only see the necessary data. Refer to Achieve fine-grained data security with row-level access control in Amazon Redshift for more information about how to implement row access policies.

You can use conditional dynamic data masking policies so that designated users can see the redacted data. With dynamic data masking (DDM) in Amazon Redshift, you can help protect sensitive data in your data warehouse. You can manipulate how Amazon Redshift shows sensitive data to the user at query time without transforming it in the database. You control access to data through masking policies that apply custom obfuscation rules to a given user or role. That way, you can respond to changing privacy requirements without altering the underlying data or editing SQL queries.

Dynamic data masking policies hide, obfuscate, or pseudonymize data that matches a given format. When attached to a table, the masking expression is applied to one or more of its columns. You can further modify masking policies to only apply them to certain users or user-defined roles that you can create with role-based access control (RBAC). Additionally, you can apply DDM on the cell level by using conditional columns when creating your masking policy.

Organizations can use conditional dynamic data masking to redact sensitive columns (for example, names) in rows where the Forgotten_flag column value is Yes, while the other columns display their full values.
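
To make the difference between the two logical-delete options concrete, the following Python sketch imitates their effect on query results. It is only an analogy in plain Python, not Amazon Redshift policy syntax; the function names, the REDACTED placeholder, and the sample rows are hypothetical, and it assumes the Yes/No flag values used earlier.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_row_policy(rows: List[Row]) -> List[Row]:
    """Row access policy analogue: return only rows where forgotten_flag is 'No'."""
    return [row for row in rows if row["forgotten_flag"] == "No"]


def apply_conditional_mask(rows: List[Row], masked_columns: List[str]) -> List[Row]:
    """Conditional masking analogue: redact the given columns only on forgotten rows."""
    result = []
    for row in rows:
        shown = dict(row)  # copy, so the stored data is never altered
        if row["forgotten_flag"] == "Yes":
            for column in masked_columns:
                shown[column] = "REDACTED"
        result.append(shown)
    return result


customers = [
    {"customer_key": 1, "full_name": "Ana Example", "forgotten_flag": "No"},
    {"customer_key": 2, "full_name": "Bo Example", "forgotten_flag": "Yes"},
]
filtered = apply_row_policy(customers)
masked = apply_conditional_mask(customers, ["full_name"])
print(filtered)
print(masked)
```

The row policy hides forgotten rows entirely, whereas the conditional mask keeps the row visible and redacts only the sensitive columns. In both cases the underlying data is unchanged.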

Backup and restore

Data from Redshift clusters can be transferred, exported, or copied to different AWS services or outside of the cloud. Organizations should have an effective governance process to detect and remove data to align with the GDPR compliance requirement. However, this is beyond the scope of this post.

Amazon Redshift offers backups and snapshots of the data. After deleting the PII data, organizations should also purge the data from their backups. To do so, you need to restore the snapshot to a new cluster, remove the data, and take a fresh backup. The following figure illustrates this workflow.

It’s good practice to keep the retention period at 29 days (if applicable) so that the backups are cleared after 30 days. Organizations can also set the backup schedule to a certain date (for example, the first of every month).

Backup and Restore

Communication

It’s important to communicate to the users and processes who may be impacted by this deletion. The following query helps identify the list of users and groups who have access to the affected tables:

SELECT
nspname AS schema_name,
relname AS table_name,
attname AS column_name,
usename AS user_name,
groname AS group_name
FROM pg_namespace
JOIN pg_class ON pg_namespace.oid = pg_class.relnamespace
JOIN pg_attribute ON pg_class.oid = pg_attribute.attrelid
LEFT JOIN pg_group ON pg_attribute.attacl::text LIKE '%' || groname || '%'
LEFT JOIN pg_user ON pg_attribute.attacl::text LIKE '%' || usename || '%'
WHERE
pg_attribute.attname LIKE '%PII%'
AND (usename IS NOT NULL OR groname IS NOT NULL);

Security controls

Maintaining security is of great importance in GDPR compliance. By implementing robust security measures, organizations can help protect personal data from unauthorized access, breaches, and misuse, thereby helping maintain the privacy rights of individuals. Security plays a crucial role in upholding the principles of confidentiality, integrity, and availability of personal data. AWS offers a comprehensive suite of services and features that can support GDPR compliance and enhance security measures.

The GDPR does not change the AWS shared responsibility model, which continues to be relevant for customers. The shared responsibility model is a useful approach to illustrate the different responsibilities of AWS (as a data processor or subprocessor) and customers (as either data controllers or data processors) under the GDPR.

Under the shared responsibility model, AWS is responsible for securing the underlying infrastructure that supports AWS services (“Security of the Cloud”), and customers, acting either as data controllers or data processors, are responsible for personal data they upload to AWS services (“Security in the Cloud”).

AWS offers a GDPR-compliant AWS Data Processing Addendum (AWS DPA), which enables you to comply with GDPR contractual obligations. The AWS DPA is incorporated into the AWS Service Terms.

Article 32 of the GDPR requires that organizations must “…implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk, including …the pseudonymization and encryption of personal data[…].” In addition, organizations must “safeguard against the unauthorized disclosure of or access to personal data.” Refer to the Navigating GDPR Compliance on AWS whitepaper for more details.

Conclusion

In this post, we delved into the significance of GDPR and its impact on safeguarding privacy rights. We discussed five commonly followed best practices that organizations can reference for responding to GDPR right to be forgotten requests for data that resides in Redshift clusters. We also highlighted that the GDPR does not change the AWS shared responsibility model.

We encourage you to take charge of your data privacy today. Prioritizing GDPR compliance and data privacy will not only strengthen trust, but also build customer loyalty and safeguard personal information in the digital era. If you need assistance or guidance, reach out to an AWS representative. AWS has teams of Enterprise Support Representatives, Professional Services Consultants, and other staff to help with GDPR questions. You can contact us with questions. To learn more about GDPR compliance when using AWS services, refer to the General Data Protection Regulation (GDPR) Center. To learn more about the right to be forgotten, refer to Right to Erasure.

Disclaimer: The information provided above is not legal advice. It is intended to showcase commonly followed best practices. It is crucial to consult with your organization’s privacy officer or legal counsel and determine appropriate solutions.

About the Authors

Yadukishore Tatavarthi is a Senior Partner Solutions Architect supporting Healthcare and life science customers at Amazon Web Services. Over the last 20 years, he has helped customers build enterprise data strategies, advising them on cloud implementations, migrations, reference architecture creation, data modeling best practices, data lake and data warehouse architecture, and other technical processes.

Sudhir Gupta is a Principal Partner Solutions Architect, Analytics Specialist at AWS with over 18 years of experience in Databases and Analytics. He helps AWS partners and customers design, implement, and migrate large-scale data & analytics (D&A) workloads. As a trusted advisor to partners, he enables partners globally on AWS D&A services, builds solutions/accelerators, and leads go-to-market initiatives.

Deepak Singh is a Senior Solutions Architect at Amazon Web Services with 20+ years of experience in Data & AIA. He enjoys working with AWS partners and customers on building scalable analytical solutions for their business outcomes. When not at work, he loves spending time with family or exploring new technologies in the analytics and AI space.

Near-real-time analytics using Amazon Redshift streaming ingestion with Amazon Kinesis Data Streams and Amazon DynamoDB

2023-07-27 Poulomi Dasgupta

Post Syndicated from Poulomi Dasgupta original https://aws.amazon.com/blogs/big-data/near-real-time-analytics-using-amazon-redshift-streaming-ingestion-with-amazon-kinesis-data-streams-and-amazon-dynamodb/

Amazon Redshift is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, easy, and secure analytics at scale. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it a widely used cloud data warehouse. You can run and scale analytics in seconds on all your data without having to manage your data warehouse infrastructure.

You can use the Amazon Redshift streaming ingestion capability to update your analytics databases in near-real time. Amazon Redshift streaming ingestion simplifies data pipelines by letting you create materialized views directly on top of data streams. With this capability in Amazon Redshift, you can use SQL (Structured Query Language) to connect to and directly ingest data from data streams, such as Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) data streams, and pull data directly to Amazon Redshift.

In this post, we discuss a solution that uses Amazon Redshift streaming ingestion to provide near-real-time analytics.

Overview of solution

We walk through an example pipeline to ingest data from an Amazon DynamoDB source table in near-real time using Kinesis Data Streams in combination with Amazon Redshift streaming ingestion. We also walk through using PartiQL in Amazon Redshift to unnest nested JSON documents and build fact and dimension tables that are used in your data warehouse refresh. The solution uses Kinesis Data Streams to capture item-level changes from an application DynamoDB table.

As shown in the following reference architecture, DynamoDB table data changes are streamed into Amazon Redshift through Kinesis Data Streams and Amazon Redshift streaming ingestion for near-real-time analytics dashboard visualization using Amazon QuickSight.

The process flow includes the following steps:

Create a Kinesis data stream and turn on the data stream from DynamoDB to capture item-level changes in your DynamoDB table.
Create a streaming materialized view in your Amazon Redshift cluster to consume live streaming data from the data stream.
The streaming data gets ingested into a JSON payload. Use a combination of a PartiQL statement and dot notation to unnest the JSON document into data columns of a staging table in Amazon Redshift.
Create fact and dimension tables in the Amazon Redshift cluster and keep loading the latest data at regular intervals from the staging table using transformation logic.
Establish connectivity between a QuickSight dashboard and Amazon Redshift to deliver visualization and insights.

Prerequisites

You must have the following:

An AWS account.
An Amazon Redshift cluster if you are using Amazon Redshift Provisioned. For instructions, refer to Create a sample Amazon Redshift cluster.
An Amazon Redshift workgroup if you are using Amazon Redshift Serverless. For instructions, refer to Create a workgroup with a namespace.
An existing DynamoDB table with an active workload.

Set up a Kinesis data stream

To configure your Kinesis data stream, complete the following steps:

Create a Kinesis data stream called demo-data-stream. For instructions, refer to Step 1 in Set up streaming ETL pipelines.

Configure the stream to capture changes from the DynamoDB table.

On the DynamoDB console, choose Tables in the navigation pane.
Open your table.
On the Exports and streams tab, choose Turn on under Amazon Kinesis data stream details.

For Destination Kinesis data stream, choose demo-data-stream.
Choose Turn on stream.

Item-level changes in the DynamoDB table should now be flowing to the Kinesis data stream.

To verify if the data is entering the stream, on the Kinesis Data Streams console, open demo-data-stream.
On the Monitoring tab, find the PutRecord success – average (Percent) and PutRecord – sum (Bytes) metrics to validate record ingestion.

Set up streaming ingestion

To set up streaming ingestion, complete the following steps:

Set up the AWS Identity and Access Management (IAM) role and trust policy required for streaming ingestion. For instructions, refer to Steps 1 and 2 in Getting started with streaming ingestion from Amazon Kinesis Data Streams.
Launch the Query Editor v2 from the Amazon Redshift console or use your preferred SQL client to connect to your Amazon Redshift cluster for the next steps.
Create an external schema:

CREATE EXTERNAL SCHEMA demo_schema
FROM KINESIS
IAM_ROLE { default | 'iam-role-arn' };

To use case-sensitive identifiers, set enable_case_sensitive_identifier to true at either the session or cluster level.
Create a materialized view to consume the stream data and store stream records in semi-structured SUPER format:

CREATE MATERIALIZED VIEW demo_stream_vw AS
    SELECT approximate_arrival_timestamp,
    partition_key,
    shard_id,
    sequence_number,
    json_parse(kinesis_data) as payload    
    FROM demo_schema."demo-data-stream";

Refresh the view, which triggers Amazon Redshift to read from the stream and load data into the materialized view:

REFRESH MATERIALIZED VIEW demo_stream_vw;

You can also set your streaming materialized view to use auto refresh capabilities. This will automatically refresh your materialized view as data arrives in the stream. See CREATE MATERIALIZED VIEW for instructions on how to create a materialized view with auto refresh.

Unnest the JSON document

The following is a sample of a JSON document that was ingested from the Kinesis data stream to the payload column of the streaming materialized view demo_stream_vw:

{
  "awsRegion": "us-east-1",
  "eventID": "6d24680a-6d12-49e2-8a6b-86ffdc7306c1",
  "eventName": "INSERT",
  "userIdentity": null,
  "recordFormat": "application/json",
  "tableName": "sample-dynamoDB",
  "dynamodb": {
    "ApproximateCreationDateTime": 1657294923614,
    "Keys": {
      "pk": {
        "S": "CUSTOMER#CUST_123"
      },
      "sk": {
        "S": "TRANSACTION#2022-07-08T23:59:59Z#CUST_345"
      }
    },
    "NewImage": {
      "completionDateTime": {
        "S": "2022-07-08T23:59:59Z"
      },
      "OutofPockPercent": {
        "N": 50.00
      },
      "calculationRequirements": {
        "M": {
          "dependentIds": {
            "L": [
              {
                "M": {
                  "sk": {
                    "S": "CUSTOMER#2022-07-08T23:59:59Z#CUST_567"
                  },
                  "pk": {
                    "S": "CUSTOMER#CUST_123"
                  }
                }
              },
              {
                "M": {
                  "sk": {
                    "S": "CUSTOMER#2022-07-08T23:59:59Z#CUST_890"
                  },
                  "pk": {
                    "S": "CUSTOMER#CUST_123"
                  }
                }
              }
            ]
          }
        }
      },
      "Event": {
        "S": "SAMPLE"
      },
      "Provider": {
        "S": "PV-123"
      },
      "OutofPockAmount": {
        "N": 1000
      },
      "lastCalculationDateTime": {
        "S": "2022-07-08T00:00:00Z"
      },
      "sk": {
        "S": "CUSTOMER#2022-07-08T23:59:59Z#CUST_567"
      },
      "OutofPockMax": {
        "N": 2000
      },
      "pk": {
        "S": "CUSTOMER#CUST_123"
      }
    },
    "SizeBytes": 694
  },
  "eventSource": "aws:dynamodb"
}
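
To get a feel for the shape of this document, the following Python sketch walks a trimmed copy of it and produces one row per element of the nested dependentIds array. It is an illustration of the unnesting idea in plain Python rather than the Amazon Redshift approach; the function name unnest_dependents and the output keys are hypothetical.

```python
import json

# Trimmed copy of the sample document, keeping only the fields used below.
record = json.loads("""
{
  "eventName": "INSERT",
  "dynamodb": {
    "Keys": {"pk": {"S": "CUSTOMER#CUST_123"}},
    "NewImage": {
      "calculationRequirements": {"M": {"dependentIds": {"L": [
        {"M": {"sk": {"S": "CUSTOMER#2022-07-08T23:59:59Z#CUST_567"}}},
        {"M": {"sk": {"S": "CUSTOMER#2022-07-08T23:59:59Z#CUST_890"}}}
      ]}}}
    }
  }
}
""")


def unnest_dependents(doc):
    """Return one row per array element, repeating the parent-level fields."""
    # Keep the part of the partition key after the first '#'.
    customer_id = doc["dynamodb"]["Keys"]["pk"]["S"].split("#", 1)[1]
    dependents = (doc["dynamodb"]["NewImage"]
                  .get("calculationRequirements", {})
                  .get("M", {})
                  .get("dependentIds", {})
                  .get("L", []))
    if not dependents:
        # Keep the parent row even when the array is missing or empty.
        return [{"customer_id": customer_id, "dependent_sk": None}]
    return [{"customer_id": customer_id, "dependent_sk": d["M"]["sk"]["S"]}
            for d in dependents]


rows = unnest_dependents(record)
for row in rows:
    print(row)
```

Each array element becomes its own row while the scalar fields are repeated, which is the same flattening that the staging data needs.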

We can use dot notation to navigate the JSON document, and a PartiQL statement to unnest arrays where they occur. For example, in the preceding JSON document, there is an array under the following element:

"dynamodb"."NewImage"."calculationRequirements"."M"."dependentIds"."L".

The following SQL query uses a combination of dot notation and a PartiQL statement to unnest the JSON document:

select 
substring(a."payload"."dynamodb"."Keys"."pk"."S"::varchar, position('#' in "payload"."dynamodb"."Keys"."pk"."S"::varchar)+1) as Customer_ID,
substring(a."payload"."dynamodb"."Keys"."sk"."S"::varchar, position('#TRANSACTION' in "payload"."dynamodb"."Keys"."sk"."S"::varchar)+1) as Transaction_ID,
substring(b."M"."sk"."S"::varchar, position('#CUSTOMER' in b."M"."sk"."S"::varchar)+1) Dependent_ID,
a."payload"."dynamodb"."NewImage"."OutofPockMax"."N"::int as OutofPocket_Max,
a."payload"."dynamodb"."NewImage"."OutofPockPercent"."N"::decimal(5,2) as OutofPocket_Percent,
a."payload"."dynamodb"."NewImage"."OutofPockAmount"."N"::int as OutofPock_Amount,
a."payload"."dynamodb"."NewImage"."Provider"."S"::varchar as Provider,
a."payload"."dynamodb"."NewImage"."completionDateTime"."S"::timestamptz as Completion_DateTime,
a."payload"."eventName"::varchar Event_Name,
a.approximate_arrival_timestamp
from demo_stream_vw a
left outer join a."payload"."dynamodb"."NewImage"."calculationRequirements"."M"."dependentIds"."L" b on true;

The query unnests the JSON document to the following result set.

Precompute the result set using a materialized view

Optionally, to precompute and store the unnested result set from the preceding query, you can create a materialized view and schedule it to refresh at regular intervals. In this post, we maintain the preceding unnested data in a materialized view called mv_demo_super_unnest, which will be refreshed at regular intervals and used for further processing.

To capture the latest data from the DynamoDB table, the Amazon Redshift streaming materialized view needs to be refreshed at regular intervals, and then the incremental data should be transformed and loaded into the final fact and dimension tables. To avoid reprocessing the same data, a metadata table can be maintained in Amazon Redshift to keep track of each ELT process with status, start time, and end time, as explained in the following section.

Maintain an audit table in Amazon Redshift

The following is a sample DDL of a metadata table that is maintained for each process or job:

create table MetaData_ETL
(
JobName varchar(100),
StartDate timestamp, 
EndDate timestamp, 
Status varchar(50)
);

The following is a sample initial entry of the metadata audit table that can be maintained at job level. The insert statement is the initial entry for the ELT process to load the Customer_Transaction_Fact table:

insert into MetaData_ETL 
values
('Customer_Transaction_Fact_Load', current_timestamp, current_timestamp,'Ready' );

Build a fact table with the latest data

In this post, we demonstrate the loading of a fact table using specific transformation logic. We are skipping the dimension table load, which uses similar logic.

As a prerequisite, create the fact and dimension tables in a preferred schema. In the following example, we create the fact table Customer_Transaction_Fact in Amazon Redshift:

CREATE TABLE public.Customer_Transaction_Fact (
Transaction_ID character varying(500),
Customer_ID character varying(500),
OutofPocket_Percent numeric(5,2),
OutofPock_Amount integer,
OutofPocket_Max integer,
Provider character varying(500),
completion_datetime timestamp
);

Transform data using a stored procedure

We load this fact table from the unnested data using a stored procedure. For more information, refer to Creating stored procedures in Amazon Redshift.

Note that in this sample use case, we are using transformation logic to identify and load the latest value of each column for a customer transaction.

The stored procedure contains the following components:

In the first step of the stored procedure, the job entry in the MetaData_ETL table is updated to change the status to Running and StartDate as the current timestamp, which indicates that the fact load process is starting.
Refresh the materialized view mv_demo_super_unnest, which contains the unnested data.
In the following example, we load the fact table Customer_Transaction_Fact using the latest data from the streaming materialized view based on the column approximate_arrival_timestamp, which is available as a system column in the streaming materialized view. The value of approximate_arrival_timestamp is set when a Kinesis data stream successfully receives and stores a record.
The following logic in the stored procedure checks if the approximate_arrival_timestamp in mv_demo_super_unnest is greater than the EndDate timestamp in the MetaData_ETL audit table, so that it can only process the incremental data.
Additionally, while loading the fact table, we identify the latest non-null value of each column for every Transaction_ID depending on the order of the approximate_arrival_timestamp column using the rank and min functions.
The transformed data is loaded into the intermediate staging table.
The impacted records with the same Transaction_ID values are deleted and reloaded into the Customer_Transaction_Fact table from the staging table.
In the last step of the stored procedure, the job entry in the MetaData_ETL table is updated to change the status to Complete and EndDate as the current timestamp, which indicates that the fact load process has completed successfully.

See the following code:

CREATE OR REPLACE PROCEDURE SP_Customer_Transaction_Fact()
AS $$
BEGIN

set enable_case_sensitive_identifier to true;

--Update metadata audit table entry to indicate that the fact load process is running
update MetaData_ETL
set status = 'Running',
StartDate = getdate()
where JobName = 'Customer_Transaction_Fact_Load';

refresh materialized view mv_demo_super_unnest;

drop table if exists Customer_Transaction_Fact_Stg;

--Create latest record by Merging records based on approximate_arrival_timestamp
create table Customer_Transaction_Fact_Stg as
select 
m.Transaction_ID,
min(case when m.rank_Customer_ID =1 then m.Customer_ID end) Customer_ID,
min(case when m.rank_OutofPocket_Percent =1 then m.OutofPocket_Percent end) OutofPocket_Percent,
min(case when m.rank_OutofPock_Amount =1 then m.OutofPock_Amount end) OutofPock_Amount,
min(case when m.rank_OutofPocket_Max =1 then m.OutofPocket_Max end) OutofPocket_Max,
min(case when m.rank_Provider =1 then m.Provider end) Provider,
min(case when m.rank_Completion_DateTime =1 then m.Completion_DateTime end) Completion_DateTime
from
(
select *,
rank() over(partition by Transaction_ID order by case when mqp.Customer_ID is not null then 1 end, approximate_arrival_timestamp desc ) rank_Customer_ID,
rank() over(partition by Transaction_ID order by case when mqp.OutofPocket_Percent is not null then 1 end, approximate_arrival_timestamp desc ) rank_OutofPocket_Percent,
rank() over(partition by Transaction_ID order by case when mqp.OutofPock_Amount is not null then 1 end, approximate_arrival_timestamp  desc )  rank_OutofPock_Amount,
rank() over(partition by Transaction_ID order by case when mqp.OutofPocket_Max is not null then 1 end, approximate_arrival_timestamp desc ) rank_OutofPocket_Max,
rank() over(partition by Transaction_ID order by case when mqp.Provider is not null then 1 end, approximate_arrival_timestamp  desc ) rank_Provider,
rank() over(partition by Transaction_ID order by case when mqp.Completion_DateTime is not null then 1 end, approximate_arrival_timestamp desc )  rank_Completion_DateTime
from mv_demo_super_unnest mqp
where upper(mqp.event_Name) <> 'REMOVE' and mqp.approximate_arrival_timestamp > (select mde.EndDate from MetaData_ETL mde where mde.JobName = 'Customer_Transaction_Fact_Load') 
) m
group by m.Transaction_ID 
order by m.Transaction_ID
;

--Delete only impacted Transaction_ID from Fact table
delete from Customer_Transaction_Fact  
where Transaction_ID in ( select mqp.Transaction_ID from Customer_Transaction_Fact_Stg mqp);

--Insert latest records from staging table to Fact table
insert into Customer_Transaction_Fact
select * from Customer_Transaction_Fact_Stg; 

--Update metadata audit table entry to indicate that the fact load process is completed
update MetaData_ETL
set status = 'Complete',
EndDate = getdate()
where JobName = 'Customer_Transaction_Fact_Load';
END;
$$ LANGUAGE plpgsql;
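
The merge logic in the staging query can be easier to follow outside SQL. The following Python sketch approximates it under simplifying assumptions: events are plain dictionaries, the arrival key stands in for approximate_arrival_timestamp, and watermark stands in for the EndDate value in MetaData_ETL. The function name latest_non_null and the sample events are hypothetical, and ties between identical timestamps are not resolved the same way as rank and min.

```python
from datetime import datetime


def latest_non_null(events, id_column, columns, watermark):
    """Merge events per ID, keeping each column's most recent non-null value."""
    # Only incremental, non-REMOVE events are considered.
    fresh = [e for e in events
             if e["arrival"] > watermark and e["event_name"].upper() != "REMOVE"]
    merged = {}
    # Process newest first, so the first non-null value seen for a column wins.
    for event in sorted(fresh, key=lambda e: e["arrival"], reverse=True):
        row = merged.setdefault(event[id_column], {c: None for c in columns})
        for column in columns:
            if row[column] is None and event.get(column) is not None:
                row[column] = event[column]
    return merged


events = [
    {"Transaction_ID": "T1", "Provider": "PV-123", "OutofPock_Amount": None,
     "event_name": "INSERT", "arrival": datetime(2022, 7, 8, 10, 0)},
    {"Transaction_ID": "T1", "Provider": None, "OutofPock_Amount": 1000,
     "event_name": "MODIFY", "arrival": datetime(2022, 7, 8, 11, 0)},
    # Older than the watermark, so it is treated as already processed.
    {"Transaction_ID": "T1", "Provider": "PV-OLD", "OutofPock_Amount": 1,
     "event_name": "INSERT", "arrival": datetime(2022, 7, 7, 9, 0)},
]
watermark = datetime(2022, 7, 8, 0, 0)
result = latest_non_null(events, "Transaction_ID",
                         ["Provider", "OutofPock_Amount"], watermark)
print(result)
```

In this example, the provider comes from the earlier event and the amount from the later one, because each column is resolved independently.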

Additional considerations for implementation

There are several additional capabilities that you could utilize to modify this solution to meet your needs. Many customers utilize multiple AWS accounts, and it’s common that the Kinesis data stream may be in a different AWS account than the Amazon Redshift data warehouse. If this is the case, you can utilize an Amazon Redshift IAM role that assumes a role in the Kinesis data stream AWS account in order to read from the data stream. For more information, refer to Cross-account streaming ingestion for Amazon Redshift.

Another common use case is that you need to schedule the refresh of your Amazon Redshift data warehouse jobs so that the data warehouse’s data is continuously updated. To do this, you can utilize Amazon EventBridge to schedule the jobs in your data warehouse to run on a regular basis. For more information, refer to Creating an Amazon EventBridge rule that runs on a schedule. Another option is to use Amazon Redshift Query Editor v2 to schedule the refresh. For details, refer to Scheduling a query with query editor v2.

If you have a requirement to load data from a DynamoDB table with existing data, refer to Loading data from DynamoDB into Amazon Redshift.

For more information on Amazon Redshift streaming ingestion capabilities, refer to Real-time analytics with Amazon Redshift streaming ingestion.

Clean up

To avoid unnecessary charges, clean up any resources that you built as part of this architecture that are no longer in use. This includes dropping the materialized view, stored procedure, external schema, and tables created as part of this post. Additionally, make sure you delete the DynamoDB table and delete the Kinesis data stream.

Conclusion

After following the solution in this post, you’re now able to build near-real-time analytics using Amazon Redshift streaming ingestion. We showed how you can ingest data from a DynamoDB source table using a Kinesis data stream in order to refresh your Amazon Redshift data warehouse. With the capabilities presented in this post, you should be able to increase the refresh rate of your Amazon Redshift data warehouse so that it provides the most up-to-date data for your use case.

About the authors

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Matt Nispel is an Enterprise Solutions Architect at AWS. He has more than 10 years of experience building cloud architectures for large enterprise companies. At AWS, Matt helps customers rearchitect their applications to take full advantage of the cloud. Matt lives in Minneapolis, Minnesota, and in his free time enjoys spending time with friends and family.

Dan Dressel is a Senior Analytics Specialist Solutions Architect at AWS. He is passionate about databases, analytics, machine learning, and architecting solutions. In his spare time, he enjoys spending time with family, nature walking, and playing foosball.

Improved scalability and resiliency for Amazon EMR on EC2 clusters

2023-07-27 Ravi Kumar Singh

Post Syndicated from Ravi Kumar Singh original https://aws.amazon.com/blogs/big-data/improved-scalability-and-resiliency-for-amazon-emr-on-ec2-clusters/

Amazon EMR is the cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto. Customers asked us for features that would further improve the resiliency and scalability of their Amazon EMR on EC2 clusters, including their large, long-running clusters. We have been hard at work to meet those needs. Over the past 12 months, we have worked backward from customer requirements and launched over 30 new features that improve the resiliency and scalability of your Amazon EMR on EC2 clusters. This post covers some of these key enhancements across three main areas:

Improved cluster utilization with optimized scaling experience
Minimized interruptions with enhanced resiliency and availability
Improved cluster resiliency with upgraded logging and debugging capabilities

Let’s dive into each of these areas.

Improved cluster utilization with optimized scaling experience

Customers use Amazon EMR to run diverse analytics workloads with varying SLAs, ranging from near-real-time streaming jobs to exploratory interactive workloads and everything in between. To cater to these dynamic workloads, you can resize your clusters either manually or by enabling automatic scaling. You can also use the Amazon EMR managed scaling feature to automatically resize your clusters for optimal performance at the lowest possible cost. To ensure swift cluster resizes, we implemented multiple improvements that are available in the latest Amazon EMR releases:

Enhanced resiliency of cluster scaling workflow to EC2 Spot Instance interruptions – Many Amazon EMR customers use EC2 Spot Instances for their Amazon EMR on EC2 clusters to reduce costs. Spot Instances are spare Amazon Elastic Compute Cloud (Amazon EC2) compute capacity offered at discounts of up to 90% compared to On-Demand pricing. However, Amazon EC2 can reclaim Spot capacity with a two-minute warning, which can lead to interruptions in workload. We identified an issue where the cluster’s scaling operation gets stuck when over a hundred core nodes launched on Spot Instances are reclaimed by Amazon EC2 throughout the life of the cluster. Starting with Amazon EMR version 6.8.0, we mitigated this issue by fixing a gap in the process HDFS uses to decommission nodes that caused the scaling operations to get stuck. We contributed this improvement back to the open-source community, enabling seamless recovery and efficient scaling in the event of Spot interruptions.
Improve cluster utilization by recommissioning recently decommissioned nodes for Spark workloads within seconds – Amazon EMR allows you to scale down your cluster without affecting your workload by gracefully decommissioning core and task nodes. Furthermore, to prevent task failures, Apache Spark ensures that decommissioning nodes are not assigned any new tasks. However, if a new job is submitted immediately before these nodes are fully decommissioned, Amazon EMR triggers a scale-up operation for the cluster. As a result, these decommissioning nodes are immediately recommissioned and added back into the cluster. Due to a gap in Apache Spark’s recommissioning logic, these recommissioned nodes would not accept new Spark tasks for up to 60 minutes. We enhanced the recommissioning logic so that recommissioned nodes start accepting new tasks within seconds, thereby improving cluster utilization. This improvement is available in Amazon EMR release 6.11 and higher.
Minimized cluster scaling interruptions due to disk over-utilization – The YARN ResourceManager exclude file is a key component of Apache Hadoop that Amazon EMR uses to centrally manage cluster resources for multiple data-processing frameworks. This exclude file contains a list of nodes to be removed from the cluster to facilitate a cluster scale-down operation. With Amazon EMR release 6.11.0, we improved the cluster scaling workflow to reduce scale-down failures. This improvement minimizes failures due to partial updates or corruption in the exclude file caused by low disk space. Additionally, we built a robust file recovery mechanism to restore the exclude file in case of corruption, ensuring uninterrupted cluster scaling operations.
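To make the idea of avoiding partial updates and recovering from a corrupted file concrete, here is a minimal sketch of the general technique in Python. It is not the Amazon EMR implementation: the function names, the `.bak` backup convention, and the "file must end with a newline" corruption check are all assumptions made for illustration.

```python
import os
import shutil
import tempfile
from pathlib import Path


def write_exclude_file(path: Path, hosts: list[str]) -> None:
    """Replace the file in one step so readers never see a half-written list."""
    content = "".join(f"{host}\n" for host in hosts)
    # Write the new content to a temporary file in the same directory.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        if path.exists():
            # Keep the last good copy (hypothetical .bak convention).
            shutil.copy2(path, path.with_name(path.name + ".bak"))
        # os.replace swaps the file in atomically on the same file system.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_exclude_file(path: Path) -> list[str]:
    """Read the host list, falling back to the backup if the file looks truncated."""
    backup = path.with_name(path.name + ".bak")
    for candidate in (path, backup):
        try:
            text = candidate.read_text()
        except FileNotFoundError:
            continue
        # Assumed corruption check: a complete file is empty or newline-terminated.
        if text == "" or text.endswith("\n"):
            return text.splitlines()
    return []
```

Writing to a temporary file and renaming it means a full disk fails the write before the live file is touched, and the backup copy gives the reader something to fall back on.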

Minimized interruptions with enhanced resiliency and availability

Amazon EMR offers high availability and fault tolerance for your big data workloads. Let’s look at a few key improvements we launched in this area:

Improved fault tolerance to hardware reconfiguration – Amazon EMR offers the flexibility to decouple storage and compute. We observed that customers often increase the size of or add incremental block-level storage to their EC2 instances as their data processing volume and concurrency grow. Starting with Amazon EMR release 6.11.0, we made the EMR cluster’s local storage file system more resilient to unpredictable instance reconfigurations such as instance restarts. By addressing scenarios where an instance restart could cause the block storage device name to change, we eliminated the risk of the cluster becoming inoperable or losing data.
Reduce cluster startup time for Kerberos-enabled EMR clusters with long-running bootstrap actions – Multiple customers use Kerberos for authentication and run long-running bootstrap actions on their EMR clusters. In Amazon EMR 6.9.0 and higher releases, we fixed a timing sequence mismatch issue that occurs between Apache BigTop and the Amazon EMR on EC2 cluster startup sequence. This timing sequence mismatch occurs when a system attempts to perform two or more operations at the same time instead of doing them in the proper sequence. This issue caused certain cluster configurations to experience instance startup timeouts. We contributed a fix to the open-source community and made additional improvements to the Amazon EMR startup sequence to prevent this condition, resulting in cluster start time improvements of up to 200% for such clusters.

Improved cluster resiliency with upgraded logging and debugging capabilities

Effective log management is essential to ensure log availability and maintain the health of EMR clusters. This becomes especially critical when you’re running multiple custom client tools and third-party applications on your Amazon EMR on EC2 clusters. Customers depend on EMR logs, in addition to EMR events, to monitor cluster and workload health, troubleshoot urgent issues, simplify security audits, and enhance compliance. Let’s look at a few key enhancements we made in this area:

Upgraded on-cluster log management daemon – Amazon EMR now automatically restarts the log management daemon if it’s interrupted. The Amazon EMR on-cluster log management daemon archives logs to Amazon Simple Storage Service (Amazon S3) and deletes them from instance storage. This minimizes cluster failures due to disk over-utilization, while allowing the log files to remain accessible even after the cluster or node stops. This upgrade is available in Amazon EMR release 6.10.0 and higher. For more information, see Configure cluster logging and debugging.
Enhanced cluster stability with improved log rotation and monitoring – Many of our customers have long-running clusters that have been operating for years. Some open-source application logs such as Hive and Kerberos logs that are never rotated can continue to grow on these long-running clusters. This could lead to disk over-utilization and eventually result in cluster failures. We enabled log rotation for such log files to minimize disk, memory, and CPU over-utilization scenarios. Furthermore, we expanded our log monitoring to include additional log folders. These changes, available starting with Amazon EMR version 6.10.0, minimize situations where EMR cluster resources are over-utilized, while ensuring log files are archived to Amazon S3 for a wider variety of use cases.

Conclusion

In this post, we highlighted the improvements that we made in Amazon EMR on EC2 with the goal of making your EMR clusters more resilient and stable. We focused on improving cluster utilization with an optimized scaling experience for EMR workloads, minimizing interruptions with enhanced resiliency and availability for Amazon EMR on EC2 clusters, and improving cluster resiliency with upgraded logging and debugging capabilities. We will continue to deliver further enhancements with new Amazon EMR releases. We invite you to try new features and capabilities in the latest Amazon EMR releases and get in touch with us through your AWS account team to share your valuable feedback and comments. To learn more and get started with Amazon EMR, check out the tutorial Getting started with Amazon EMR.

About the Authors

Ravi Kumar is a Senior Product Manager for Amazon EMR at Amazon Web Services.

Kevin Wikant is a Software Development Engineer for Amazon EMR at Amazon Web Services.

End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue

2023-07-26 Noritaka Sekiyama

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/end-to-end-development-lifecycle-for-data-engineers-to-build-a-data-integration-pipeline-using-aws-glue/

Data is a key enabler for your business. Many AWS customers have integrated their data across multiple data sources using AWS Glue, a serverless data integration service, in order to make data-driven business decisions. To grow the power of data at scale for the long term, it’s highly recommended to design an end-to-end development lifecycle for your data integration pipelines. The following are common asks from our customers:

Is it possible to develop and test AWS Glue data integration jobs on my local laptop?
Are there recommended approaches to provisioning components for data integration?
How can we build a continuous integration and continuous delivery (CI/CD) pipeline for our data integration pipeline?
What is the best practice to move from a pre-production environment to production?

To tackle these asks, this post defines the development lifecycle for data integration and demonstrates how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, using a sample baseline template.

End-to-end development lifecycle for a data integration pipeline

Today, it’s common to define not only data integration jobs but also all the data components in code. This means that you can rely on standard software best practices to build your data integration pipeline. The software development lifecycle on AWS defines the following six phases: Plan, Design, Implement, Test, Deploy, and Maintain.

In this section, we discuss each phase in the context of a data integration pipeline.

Plan

In the planning phase, developers collect requirements from stakeholders such as end-users to define a data requirement. This could be what the use cases are (for example, ad hoc queries, dashboard, or troubleshooting), how much data to process (for example, 1 TB per day), what kinds of data, how many different data sources to pull from, how much data latency to accept to make it queryable (for example, 15 minutes), and so on.

Design

In the design phase, you analyze requirements and identify the best solution to build the data integration pipeline. In AWS, you need to choose the right services to achieve the goal and come up with the architecture by integrating those services and defining dependencies between components. For example, you may choose AWS Glue jobs as a core component for loading data from different sources, including Amazon Simple Storage Service (Amazon S3), then integrating them and preprocessing and enriching data. Then you may want to chain multiple AWS Glue jobs and orchestrate them. Finally, you may want to use Amazon Athena and Amazon QuickSight to present the enriched data to end-users.

Implement

In the implementation phase, data engineers code the data integration pipeline. They analyze the requirements to identify coding tasks to achieve the final result. The code includes the following:

AWS resource definition
Data integration logic

When using AWS Glue, you can define the data integration logic in a job script, which can be written in Python or Scala. You can use your preferred IDE to implement AWS resource definition using the AWS Cloud Development Kit (AWS CDK) or AWS CloudFormation, and also the business logic of AWS Glue job scripts for data integration. To learn more about how to implement your AWS Glue job scripts locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.

Test

In the testing phase, you check the implementation for bugs. Quality analysis includes testing the code for errors and checking if it meets the requirements. Because many teams immediately test the code you write, the testing phase often runs parallel to the development phase. There are different types of testing:

Unit testing
Integration testing
Performance testing

For unit testing, even for data integration, you can rely on a standard testing framework such as pytest or ScalaTest. To learn more about how to achieve unit testing locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.
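As a rough illustration of what a pytest-style unit test for transformation logic can look like, the following sketch tests a small pure-Python function. The function `rename_org_fields` is hypothetical and works on plain dictionaries rather than AWS Glue DynamicFrames; it only mirrors the field renames and drops that the sample job script later in this post applies to organization records.

```python
def rename_org_fields(record: dict) -> dict:
    """Return a copy of an organization record with fields dropped and renamed."""
    dropped = {"other_names", "identifiers"}
    renamed = {"id": "org_id", "name": "org_name"}
    return {
        renamed.get(key, key): value
        for key, value in record.items()
        if key not in dropped
    }


def test_rename_org_fields():
    # pytest discovers functions whose names start with "test_".
    record = {"id": "o1", "name": "Senate", "identifiers": [], "type": "chamber"}
    assert rename_org_fields(record) == {
        "org_id": "o1",
        "org_name": "Senate",
        "type": "chamber",
    }
```

Keeping transformation logic in small functions like this makes it testable without starting a Spark session, while tests that need a GlueContext can run inside the Docker container.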

Deploy

When data engineers develop a data integration pipeline, they code and test on a different copy of the product than the one that end-users have access to. The environment that end-users use is called production, whereas other copies are said to be in the development or the pre-production environment.

Having separate build and production environments ensures that you can continue to use the data integration pipeline even while it’s being changed or upgraded. The deployment phase includes several tasks to move the latest build copy to the production environment, such as packaging, environment configuration, and installation.

The following components are deployed through the AWS CDK or AWS CloudFormation:

AWS resources
Data integration job scripts for AWS Glue

AWS CodePipeline helps you to build a mechanism to automate deployments among different environments, including development, pre-production, and production. When you commit your code to AWS CodeCommit, CodePipeline automatically provisions AWS resources based on the CloudFormation templates included in the commit and uploads script files included in the commit to Amazon S3.

Maintain

Even after you deploy your solution to a production environment, it’s not the end of your project. You need to monitor the data integration pipeline continuously and keep maintaining and improving it. More specifically, you also need to fix bugs, resolve customer issues, and manage software changes. In addition, you need to monitor the overall system performance, security, and user experience to identify new ways to improve the existing data integration pipeline.

Solution overview

Typically, you have multiple accounts to manage and provision resources for your data pipeline. In this post, we assume the following three accounts:

Pipeline account – This hosts the end-to-end pipeline
Dev account – This hosts the data integration pipeline in the development environment
Prod account – This hosts the data integration pipeline in the production environment

If you want, you can use the same account and the same Region for all three.

To start applying this end-to-end development lifecycle model to your data platform easily and quickly, we prepared the baseline template aws-glue-cdk-baseline using the AWS CDK. The template is built on top of AWS CDK v2 and CDK Pipelines. It provisions two kinds of stacks:

AWS Glue app stack – This provisions the data integration pipeline: one in the dev account and one in the prod account
Pipeline stack – This provisions the Git repository and CI/CD pipeline in the pipeline account

The AWS Glue app stack provisions the data integration pipeline, including the following resources:

AWS Glue jobs
AWS Glue job scripts

The following diagram illustrates this architecture.

At the time of publishing of this post, the AWS CDK has two versions of the AWS Glue module: @aws-cdk/aws-glue and @aws-cdk/aws-glue-alpha, containing L1 constructs and L2 constructs, respectively. The sample AWS Glue app stack is defined using aws-glue-alpha, the L2 construct for AWS Glue, because it’s straightforward to define and manage AWS Glue resources. If you want to use the L1 construct, refer to Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines.

The pipeline stack provisions the entire CI/CD pipeline, including the following resources:

AWS Identity and Access Management (IAM) roles
S3 bucket
CodeCommit
CodePipeline
AWS CodeBuild

The following diagram illustrates the pipeline workflow.

Every time the business requirement changes (such as adding data sources or changing data transformation logic), you make changes on the AWS Glue app stack and re-provision the stack to reflect your changes. This is done by committing your changes in the AWS CDK template to the CodeCommit repository; CodePipeline then applies the changes to AWS resources using CloudFormation change sets.

In the following sections, we present the steps to set up the required environment and demonstrate the end-to-end development lifecycle.

Prerequisites

You need the following resources:

Python 3.9 or later
AWS accounts for the pipeline account, dev account, and prod account
An AWS named profile for the pipeline account, dev account, and prod account
The AWS CDK Toolkit (cdk command) 2.87.0 or later
Docker
Visual Studio Code
Visual Studio Code Dev Containers

Initialize the project

To initialize the project, complete the following steps:

Clone the baseline template to your workplace:

$ git clone git@github.com:aws-samples/aws-glue-cdk-baseline.git
$ cd aws-glue-cdk-baseline

Create a Python virtual environment specific to the project on the client machine:
```
$ python3 -m venv .venv
```

We use a virtual environment in order to isolate the Python environment for this project and not install software globally.

Activate the virtual environment according to your OS:
- On MacOS and Linux, use the following command:
```
$ source .venv/bin/activate
```
- On a Windows platform, use the following command:
```
% .venv\Scripts\activate.bat
```

After this step, the subsequent steps run within the bounds of the virtual environment on the client machine and interact with the AWS account as needed.

Install the required dependencies described in requirements.txt to the virtual environment:
```
$ pip install -r requirements.txt
$ pip install -r requirements-dev.txt
```

Edit the configuration file default-config.yaml based on your environments (replace each account ID with your own):

pipelineAccount:
  awsAccountId: 123456789101
  awsRegion: us-east-1

devAccount:
  awsAccountId: 123456789102
  awsRegion: us-east-1

prodAccount:
  awsAccountId: 123456789103
  awsRegion: us-east-1
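Before deploying, it can help to sanity-check the edited configuration. The following is a minimal sketch that validates the three account sections once the YAML file has been loaded into a dictionary. The helper `validate_accounts` is hypothetical and not part of the baseline template; it only assumes the key names shown above.

```python
import re

REQUIRED_ACCOUNTS = ("pipelineAccount", "devAccount", "prodAccount")


def validate_accounts(config: dict) -> dict:
    """Return {section: (account ID, Region)} or raise ValueError."""
    resolved = {}
    for name in REQUIRED_ACCOUNTS:
        entry = config.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing section: {name}")
        # YAML loaders typically parse an unquoted account ID as an integer.
        account_id = str(entry.get("awsAccountId") or "")
        region = str(entry.get("awsRegion") or "")
        if not re.fullmatch(r"\d{12}", account_id):
            raise ValueError(
                f"{name}.awsAccountId must be 12 digits, got {account_id!r}"
            )
        if not region:
            raise ValueError(f"{name}.awsRegion is empty")
        resolved[name] = (account_id, region)
    return resolved
```

Calling `validate_accounts` on the loaded dictionary fails fast on a missing section or a malformed account ID instead of surfacing the problem later during `cdk deploy`.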

Run pytest with the following command to initialize the snapshot test files:
```
$ python3 -m pytest --snapshot-update
```

Bootstrap your AWS environments

Run the following commands to bootstrap your AWS environments:

In the pipeline account, replace PIPELINE-ACCOUNT-NUMBER, REGION, and PIPELINE-PROFILE with your own values:

$ cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE> \
--cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess

In the dev account, replace PIPELINE-ACCOUNT-NUMBER, DEV-ACCOUNT-NUMBER, REGION, and DEV-PROFILE with your own values:

$ cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> \
--cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
--trust <PIPELINE-ACCOUNT-NUMBER>

In the prod account, replace PIPELINE-ACCOUNT-NUMBER, PROD-ACCOUNT-NUMBER, REGION, and PROD-PROFILE with your own values:

$ cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> \
--cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
--trust <PIPELINE-ACCOUNT-NUMBER>

When you use only one account for all environments, you can just run the cdk bootstrap command one time.
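The three bootstrap commands differ only in the target account, the profile, and whether `--trust` is passed. As a sketch, the following hypothetical helper builds the argument list for each case using only the flags shown above. The function name and the placeholder values in the usage note are assumptions for illustration.

```python
from typing import Optional

ADMIN_POLICY = "arn:aws:iam::aws:policy/AdministratorAccess"


def bootstrap_command(
    account_id: str,
    region: str,
    profile: str,
    trusted_account: Optional[str] = None,
) -> list[str]:
    """Build the cdk bootstrap argument list for one account."""
    cmd = [
        "cdk", "bootstrap", f"aws://{account_id}/{region}",
        "--profile", profile,
        "--cloudformation-execution-policies", ADMIN_POLICY,
    ]
    # The dev and prod accounts must trust the pipeline account;
    # the pipeline account itself is bootstrapped without --trust.
    if trusted_account is not None:
        cmd += ["--trust", trusted_account]
    return cmd
```

For example, `bootstrap_command("123456789102", "us-east-1", "dev", trusted_account="123456789101")` returns the dev-account command as a list that you could pass to `subprocess.run` or print with `shlex.join`.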

Deploy your AWS resources

Run the command using the pipeline account to deploy the resources defined in the AWS CDK baseline template:

$ cdk deploy --profile <PIPELINE-PROFILE>

This creates the pipeline stack in the pipeline account and the AWS Glue app stack in the development account.

When the cdk deploy command is completed, let’s verify the pipeline using the pipeline account.

On the CodePipeline console, navigate to GluePipeline. Then verify that GluePipeline has the following stages: Source, Build, UpdatePipeline, Assets, DeployDev, and DeployProd. Also verify that the stages Source, Build, UpdatePipeline, Assets, and DeployDev have succeeded and that DeployProd is pending. This can take about 15 minutes.

Now that the pipeline has been created successfully, you can also verify the AWS Glue app stack resource on the AWS CloudFormation console in the dev account.

At this step, the AWS Glue app stack is deployed only in the dev account. You can try to run the AWS Glue job ProcessLegislators to see how it works.

Configure your Git repository with CodeCommit

In an earlier step, you cloned the Git repository from GitHub. Although it’s possible to configure the AWS CDK template to work with GitHub, GitHub Enterprise, or Bitbucket, for this post, we use CodeCommit. If you prefer those third-party Git providers, configure the connections and edit pipeline_stack.py to define the variable source to use the target Git provider using CodePipelineSource.

Because you already ran the cdk deploy command, the CodeCommit repository has already been created with all the required code and related files. The first step is to set up access to CodeCommit. The next step is to clone the CodeCommit repository to your local machine. Run the following commands:

$ mkdir aws-glue-cdk-baseline-codecommit
$ cd aws-glue-cdk-baseline-codecommit
$ git clone ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/aws-glue-cdk-baseline

In the next step, we make changes in this local copy of the CodeCommit repository.

End-to-end development lifecycle

Now that the environment has been successfully created, you’re ready to start developing a data integration pipeline using this baseline template. Let’s walk through the end-to-end development lifecycle.

When you want to define your own data integration pipeline, you need to add more AWS Glue jobs and implement job scripts. For this post, let’s assume that you want to add a new AWS Glue job with a new job script that reads from multiple S3 locations and joins the data.

Implement and test in your local environment

First, implement and test the AWS Glue job and its job script in your local environment using Visual Studio Code.

Set up your development environment by following the steps in Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container. The following steps are required in the context of this post:

Start Docker.
Pull the Docker image that has the local development environment using the AWS Glue ETL library:
```
$ docker pull public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01
```
Run the following command to define the AWS named profile name:
```
$ PROFILE_NAME="<DEV-PROFILE>"
```
Run the following command to make it available with the baseline template:
```
$ cd aws-glue-cdk-baseline/
$ WORKSPACE_LOCATION=$(pwd)
```

Run the Docker container:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true \
--rm -p 4040:4040 -p 18080:18080 \
--name glue_pyspark public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 pyspark

Start Visual Studio Code.
Choose Remote Explorer in the navigation pane, then choose the arrow icon of the workspace folder in the container public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01.

If the workspace folder is not shown, choose Open folder and select /home/glue_user/workspace.

Then you will see a view similar to the following screenshot.

Optionally, you can install AWS Toolkit for Visual Studio Code and start Amazon CodeWhisperer to enable code recommendations powered by a machine learning model. For example, in aws_glue_cdk_baseline/job_scripts/process_legislators.py, you can add a comment like “# Write a DataFrame in Parquet format to S3” and press the Enter key, and CodeWhisperer recommends a matching code snippet.

Now you install the required dependencies described in requirements.txt to the container environment.

Run the following commands in the terminal in Visual Studio Code:

$ pip install -r requirements.txt
$ pip install -r requirements-dev.txt

Implement the code.

Now let’s make the required changes for a new AWS Glue job here.

Edit the file aws_glue_cdk_baseline/glue_app_stack.py. Let’s add the following new code block after the existing job definition of ProcessLegislators in order to add the new AWS Glue job JoinLegislators:

        self.new_glue_job = glue.Job(self, "JoinLegislators",
            executable=glue.JobExecutable.python_etl(
                glue_version=glue.GlueVersion.V4_0,
                python_version=glue.PythonVersion.THREE,
                script=glue.Code.from_asset(
                    path.join(path.dirname(__file__), "job_scripts/join_legislators.py")
                )
            ),
            description="a new example PySpark job",
            default_arguments={
                "--input_path_orgs": config[stage]['jobs']['JoinLegislators']['inputLocationOrgs'],
                "--input_path_persons": config[stage]['jobs']['JoinLegislators']['inputLocationPersons'],
                "--input_path_memberships": config[stage]['jobs']['JoinLegislators']['inputLocationMemberships']
            },
            tags={
                "environment": self.environment,
                "artifact_id": self.artifact_id,
                "stack_id": self.stack_id,
                "stack_name": self.stack_name
            }
        )

Here, you added three job parameters for different S3 locations using the variable config, which is the dictionary generated from default-config.yaml. In this baseline template, we use this central config file to manage parameters for all the AWS Glue jobs in the structure <stage name>/jobs/<job name>/<parameter name>. In the following steps, you provide those locations through the AWS Glue job parameters.
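The lookup pattern can be sketched on its own, outside the AWS CDK stack. In the following example, the helper `job_arguments` and the constant `JOIN_LEGISLATORS_ARGS` are hypothetical names introduced for illustration; the config keys and argument names are the ones used in the job definition above.

```python
# Maps each AWS Glue job argument to its key in the central config file.
JOIN_LEGISLATORS_ARGS = {
    "--input_path_orgs": "inputLocationOrgs",
    "--input_path_persons": "inputLocationPersons",
    "--input_path_memberships": "inputLocationMemberships",
}


def job_arguments(config: dict, stage: str, job_name: str, mapping: dict) -> dict:
    """Resolve <stage name>/jobs/<job name>/<parameter name> for each argument."""
    params = config[stage]["jobs"][job_name]
    return {argument: params[key] for argument, key in mapping.items()}
```

A missing stage, job, or parameter raises a KeyError when the stack is synthesized, which surfaces configuration mistakes before anything is deployed.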

Create a new job script called aws_glue_cdk_baseline/job_scripts/join_legislators.py:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions


class JoinLegislators:
    def __init__(self):
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
            params.append('input_path_orgs')
            params.append('input_path_persons')
            params.append('input_path_memberships')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
            self.input_path_orgs = args['input_path_orgs']
            self.input_path_persons = args['input_path_persons']
            self.input_path_memberships = args['input_path_memberships']
        else:
            jobname = "test"
            self.input_path_orgs = "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
            self.input_path_persons = "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
            self.input_path_memberships = "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"
        self.job.init(jobname, args)
    
    def run(self):
        dyf = join_legislators(self.context, self.input_path_orgs, self.input_path_persons, self.input_path_memberships)
        df = dyf.toDF()
        df.printSchema()
        df.show()
        print(df.count())

def read_dynamic_frame_from_json(glue_context, path):
    return glue_context.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format='json'
    )

def join_legislators(glue_context, path_orgs, path_persons, path_memberships):
    orgs = read_dynamic_frame_from_json(glue_context, path_orgs)
    persons = read_dynamic_frame_from_json(glue_context, path_persons)
    memberships = read_dynamic_frame_from_json(glue_context, path_memberships)
    orgs = orgs.drop_fields(['other_names', 'identifiers']).rename_field('id', 'org_id').rename_field('name', 'org_name')
    dynamicframe_joined = Join.apply(orgs, Join.apply(persons, memberships, 'id', 'person_id'), 'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
    return dynamicframe_joined

if __name__ == '__main__':
    JoinLegislators().run()
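To see what the two nested joins compute, the following sketch reproduces the same matching logic on a few in-memory records using plain Python. It assumes that Join.apply keeps only records whose keys match in both inputs (an equality join). The `inner_join` helper and the toy records are made up for illustration and are not part of the AWS Glue library or the sample dataset.

```python
def inner_join(left, right, left_key, right_key):
    """Equality join of two lists of dicts, keeping only matching pairs."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[left_key], [])
    ]


def join_records(orgs, persons, memberships):
    """Mirror the job's logic: persons to memberships, then to organizations."""
    history = inner_join(persons, memberships, "id", "person_id")
    joined = inner_join(orgs, history, "org_id", "organization_id")
    # Drop the redundant join keys, as the job script does.
    return [
        {k: v for k, v in row.items() if k not in ("person_id", "org_id")}
        for row in joined
    ]


# Toy records; organizations already use the renamed org_id and org_name fields.
orgs = [{"org_id": "o1", "org_name": "Senate"}]
persons = [{"id": "p1", "name": "Ada"}, {"id": "p2", "name": "Bob"}]
memberships = [
    {"person_id": "p1", "organization_id": "o1", "role": "member"},
    {"person_id": "p3", "organization_id": "o1", "role": "member"},
]
result = join_records(orgs, persons, memberships)
```

Here `result` contains a single record for Ada, because Bob has no membership and the membership for `p3` has no matching person.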

Create a new unit test script for the new AWS Glue job called aws_glue_cdk_baseline/job_scripts/tests/test_join_legislators.py:

import pytest
import sys
import join_legislators
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

@pytest.fixture(scope="module", autouse=True)
def glue_context():
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
    job.init(args['JOB_NAME'], args)

    yield(context)

def test_counts(glue_context):
    dyf = join_legislators.join_legislators(glue_context, 
        "s3://awsglue-datasets/examples/us-legislators/all/organizations.json",
        "s3://awsglue-datasets/examples/us-legislators/all/persons.json", 
        "s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
    assert dyf.toDF().count() == 10439

In default-config.yaml, add the following under jobs in both the prod and dev sections:

    JoinLegislators:
      inputLocationOrgs: "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
      inputLocationPersons: "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
      inputLocationMemberships: "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"

Add the following under "jobs" in the variable config in tests/unit/test_glue_app_stack.py, tests/unit/test_pipeline_stack.py, and tests/snapshot/test_snapshot_glue_app_stack.py (no need to replace S3 locations):

,
            "JoinLegislators": {
                "inputLocationOrgs": "s3://path_to_data_orgs",
                "inputLocationPersons": "s3://path_to_data_persons",
                "inputLocationMemberships": "s3://path_to_data_memberships"
            }

Choose Run at the top right to run the individual job scripts.

If the Run button is not shown, install the Python extension into the container through Extensions in the navigation pane.

For local unit testing, run the following command in the terminal in Visual Studio Code:
```
$ cd aws_glue_cdk_baseline/job_scripts/
$ python3 -m pytest
```

Then you can verify that the newly added unit test passed successfully.

Run pytest with the following command to initialize the snapshot test files:
```
$ cd ../../
$ python3 -m pytest --snapshot-update
```

Deploy to the development environment

Complete the following steps to deploy the AWS Glue app stack to the development environment and run integration tests there:

Set up access to CodeCommit.

Commit and push your changes to the CodeCommit repo:

$ git add .
$ git commit -m "Add the second Glue job"
$ git push

You can see that the pipeline is successfully triggered.

Integration test

No additional setup is required to run the integration test for the newly added AWS Glue job. The integration test script integ_test_glue_app_stack.py runs all the jobs that have a specific tag, then verifies each job run’s state and duration. If you want to change the condition or the threshold, you can edit the assertions at the end of the integ_test_glue_job method.
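As a rough idea of what such assertions can look like, the following sketch checks the state and duration of a single job run. It is not the template's actual code: the function name and the 600-second default threshold are assumptions, and the dictionary is expected to have the shape of the JobRun structure returned by the AWS Glue GetJobRun API, which includes the JobRunState and ExecutionTime (in seconds) fields.

```python
def check_job_run(job_run: dict, max_seconds: int = 600) -> None:
    """Assert that a job run succeeded within the allowed duration."""
    state = job_run["JobRunState"]
    duration = job_run["ExecutionTime"]  # seconds
    assert state == "SUCCEEDED", f"unexpected state: {state}"
    assert duration < max_seconds, f"run took {duration}s, limit is {max_seconds}s"
```

Tightening or relaxing the check then comes down to changing the expected state or the `max_seconds` threshold.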

Deploy to the production environment

Complete the following steps to deploy the AWS Glue app stack to the production environment:

On the CodePipeline console, navigate to GluePipeline.
Choose Review under the DeployProd stage.
Choose Approve.

Wait for the DeployProd stage to complete. You can then verify the AWS Glue app stack resources in the prod account.

Clean up

To clean up your resources, complete the following steps:

Run the following command using the pipeline account:
```
$ cdk destroy --profile <PIPELINE-PROFILE>
```
Delete the AWS Glue app stack in the dev account and prod account.

Conclusion

In this post, you learned how to define the development lifecycle for data integration and how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, through a sample AWS CDK template. You can get started building your own end-to-end development lifecycle for your workload using AWS Glue.

About the author

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Build data integration jobs with AI companion on AWS Glue Studio notebook powered by Amazon CodeWhisperer

2023-07-26 Noritaka Sekiyama

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/build-data-integration-jobs-with-ai-companion-on-aws-glue-studio-notebook-powered-by-amazon-codewhisperer/

Data is essential for businesses to make informed decisions, improve operations, and innovate. Integrating data from different sources can be a complex and time-consuming process. AWS offers AWS Glue to help you integrate your data from multiple sources on serverless infrastructure for analysis, machine learning (ML), and application development. AWS Glue provides different authoring experiences for you to build data integration jobs. One of the most common options is the notebook. Data scientists tend to run queries interactively and retrieve results immediately to author data integration jobs. This interactive experience can accelerate building data integration pipelines.

Recently, AWS announced general availability of Amazon CodeWhisperer. Amazon CodeWhisperer is an AI coding companion that uses foundation models under the hood to improve developer productivity. It generates code suggestions in real time based on developers’ comments in natural language and prior code in their integrated development environment (IDE). AWS also announced the Amazon CodeWhisperer Jupyter extension to help Jupyter users by generating real-time, single-line, or full-function code suggestions for Python notebooks on JupyterLab and Amazon SageMaker Studio.

Today, we are excited to announce that AWS Glue Studio notebooks now support Amazon CodeWhisperer, helping AWS Glue users boost development productivity. Now, in your AWS Glue Studio notebook, you can write a comment in natural language (in English) that outlines a specific task, such as “Create a Spark DataFrame from a json file.” Based on this information, CodeWhisperer recommends one or more code snippets directly in the notebook that can accomplish the task. You can quickly accept the top suggestion, view more suggestions, or continue writing your own code.

This post demonstrates how the Amazon CodeWhisperer integration changes the user experience in AWS Glue Studio notebooks.

Prerequisites

Before going forward with this tutorial, you need to complete the following prerequisites:

Set up AWS Glue Studio.

Configure an AWS Identity and Access Management (IAM) role to interact with Amazon CodeWhisperer. Attach the following policy to your IAM role for the AWS Glue Studio notebook:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CodeWhispererPermissions",
            "Effect": "Allow",
            "Action": [
                "codewhisperer:GenerateRecommendations"
            ],
            "Resource": "*"
        }
    ]
}
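
If you prefer to attach this policy programmatically, the following sketch shows one way to add it as an inline policy through the IAM PutRolePolicy API. The function name attach_codewhisperer_policy, the default policy name, and the placeholder role name are assumptions for illustration. The IAM client is passed in as a parameter, so the function itself needs only the standard library.

```python
import json

# Same policy document as shown above.
CODEWHISPERER_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CodeWhispererPermissions",
            "Effect": "Allow",
            "Action": ["codewhisperer:GenerateRecommendations"],
            "Resource": "*",
        }
    ],
}


def attach_codewhisperer_policy(iam_client, role_name,
                                policy_name="CodeWhispererPermissions"):
    """Add the policy inline to an existing role (IAM PutRolePolicy)."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(CODEWHISPERER_POLICY),
    )


# Usage (requires boto3 and IAM permissions; the role name is a placeholder):
# import boto3
# attach_codewhisperer_policy(boto3.client("iam"), "<YOUR-GLUE-NOTEBOOK-ROLE>")
```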

Getting started

Let’s get started. Create a new AWS Glue Studio notebook job by completing the following steps:

On the AWS Glue console, choose Notebooks under ETL jobs in the navigation pane.
Select Jupyter Notebook and choose Create.
For Job name, enter codewhisperer-demo.
For IAM Role, select the IAM role that you configured as a prerequisite.
Choose Start notebook.

A new notebook is created with sample cells.

At the bottom, there is a menu named CodeWhisperer. By choosing this menu, you can see the shortcuts and several options, including disabling auto-suggestions.

Let’s try your first recommendation by Amazon CodeWhisperer. Note that this post contains examples of recommendations, but you may see different code snippets recommended by Amazon CodeWhisperer.

Add a new cell and enter your comment to describe what you want to achieve. After you press Enter, the recommended code is shown.

If you press Tab, the recommended code is accepted. If you press the arrow keys, you can cycle through other recommendations. You can learn more in User actions.

Now let’s read a JSON file from Amazon Simple Storage Service (Amazon S3). Enter the following code comment into a notebook cell and press Enter:

# Create a Spark DataFrame from a json file

CodeWhisperer will recommend a code snippet similar to the following:

def create_spark_df_from_json(spark, file_path):
    return spark.read.json(file_path)

Now call the suggested function to read the file:

df = create_spark_df_from_json(spark, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
df.show()

The preceding code returns the following output:

+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+
|birth_date|     contact_details|death_date|family_name|gender|given_name|                  id|         identifiers|               image|              images|               links|              name|         other_names|       sort_name|
+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+
|1944-10-15|                null|      null|    Collins|  male|   Michael|0005af3a-9471-4d1...|[{C000640, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|       Mac Collins|[{bar, Mac Collin...|Collins, Michael|
|1969-01-31|[{fax, 202-226-07...|      null|   Huizenga|  male|      Bill|00aa2dc0-bfb6-441...|[{Bill Huizenga, ...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Bill Huizenga|[{da, Bill Huizen...|  Huizenga, Bill|
|1959-09-28|[{phone, 202-225-...|      null|    Clawson|  male|    Curtis|00aca284-9323-495...|[{C001102, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...|      Curt Clawson|[{bar, Curt Claws...| Clawson, Curtis|
|1930-08-14|                null|2001-10-26|    Solomon|  male|    Gerald|00b73df5-4180-441...|[{S000675, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|    Gerald Solomon|[{null, Gerald B....| Solomon, Gerald|
|1960-05-28|[{fax, 202-225-42...|      null|     Rigell|  male|    Edward|00bee44f-db04-4a7...|[{R000589, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|   E. Scott Rigell|[{null, Scott Rig...|  Rigell, Edward|
|1951-05-20|[{twitter, MikeCr...|      null|      Crapo|  male|   Michael|00f8f12d-6e27-4a2...|[{Mike Crapo, bal...|https://theunited...|[{https://theunit...|[{Wikipedia (da),...|        Mike Crapo|[{da, Mike Crapo,...|  Crapo, Michael|
|1926-05-12|                null|      null|      Hutto|  male|      Earl|015d77c8-6edb-4ed...|[{H001018, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|        Earl Hutto|[{null, Earl Dewi...|     Hutto, Earl|
|1937-11-07|                null|2015-11-19|      Ertel|  male|     Allen|01679bc3-da21-482...|[{E000208, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|       Allen Ertel|[{null, Allen E. ...|    Ertel, Allen|
|1916-09-01|                null|2007-11-24|     Minish|  male|    Joseph|018247d0-2961-423...|[{M000796, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Joseph Minish|[{bar, Joseph Min...|  Minish, Joseph|
|1957-08-04|[{phone, 202-225-...|      null|    Andrews|  male|    Robert|01b100ac-192e-4b5...|[{A000210, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...| Robert E. Andrews|[{null, Rob Andre...| Andrews, Robert|
|1957-01-10|[{fax, 202-225-57...|      null|     Walden|  male|      Greg|01bc21bf-8939-487...|[{Greg Walden, ba...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...|       Greg Walden|[{bar, Greg Walde...|    Walden, Greg|
|1919-01-17|                null|1987-11-29|      Kazen|  male|   Abraham|02059c1e-0bdf-481...|[{K000025, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|Abraham Kazen, Jr.|[{null, Abraham K...|  Kazen, Abraham|
|1960-01-11|[{fax, 202-225-67...|      null|     Turner|  male|   Michael|020aa7dd-54ef-435...|[{Michael R. Turn...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...| Michael R. Turner|[{null, Mike Turn...| Turner, Michael|
|1942-06-28|                null|      null|      Kolbe|  male|     James|02141651-eca2-4aa...|[{K000306, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|         Jim Kolbe|[{ca, Jim Kolbe, ...|    Kolbe, James|
|1941-03-08|[{fax, 202-225-79...|      null|  Lowenthal|  male|      Alan|0231c6ef-6e92-49b...|[{Alan Lowenthal,...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...| Alan S. Lowenthal|[{null, Alan Lowe...| Lowenthal, Alan|
|1952-01-09|[{fax, 202-225-93...|      null|    Capuano|  male|   Michael|0239032f-be5c-4af...|[{Michael Capuano...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|Michael E. Capuano|[{null, Mike Capu...|Capuano, Michael|
|1951-10-19|[{fax, 202-225-56...|      null|   Schrader|  male|      Kurt|0263f619-eff8-4e1...|[{Kurt Schrader, ...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Kurt Schrader|[{bar, Kurt Schra...|  Schrader, Kurt|
|1947-06-13|[{fax, 202-225-69...|      null|     Nadler|  male|   Jerrold|029e793d-ec40-4a1...|[{N000002, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|    Jerrold Nadler|[{ca, Jerrold Nad...| Nadler, Jerrold|
|1970-02-03|[{fax, 202-225-82...|      null|     Graves|  male|       Tom|02b621fc-0523-449...|[{Tom Graves, bal...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|        Tom Graves|[{bar, Tom Graves...|     Graves, Tom|
|1932-05-09|                null|      null|   McMillan|  male|      John|03018f7c-f866-419...|[{M000566, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     John McMillan|[{null, Alex McMi...|  McMillan, John|
+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+
only showing top 20 rows

As the result shows, you can quickly put the code snippet recommended by Amazon CodeWhisperer to use.

Examples of code recommendations

In this section, we provide additional examples of code recommendations. Note that these are just our examples, and different code snippets may be suggested by Amazon CodeWhisperer.

Add a column with a calculation

In extract, transform, and load (ETL) use cases, it’s common to derive new columns from existing columns. When we need to add a column to our Spark DataFrame, we can describe to Amazon CodeWhisperer in detail which column we want and how it should be calculated:

# Add age column to a given person DataFrame
# age is calculated from current date and birth_date. When death_date is not null, then use death_date to calculate age

Amazon CodeWhisperer will recommend a code snippet similar to the following:

def add_age_column(df):
    # Use current date to calculate age
    current_date = current_timestamp()
    # Use birth_date column to calculate age
    df = df.withColumn("age", datediff(current_date, df.birth_date) / 365)
    # Use death_date column to calculate age
    df = df.withColumn("age", when(df.death_date.isNull(), df.age).otherwise(datediff(df.death_date, df.birth_date) / 365))
    return df

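Use it as follows. The suggested function calls current_timestamp, datediff, and when, so import them from pyspark.sql.functions first if your notebook has not already done so:

from pyspark.sql.functions import current_timestamp, datediff, when

df = add_age_column(df)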
df.show()

The preceding code returns the following output:

+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+------------------+--------------------+
|birth_date|     contact_details|death_date|family_name|gender|given_name|                  id|         identifiers|               image|              images|               links|              name|         other_names|       sort_name|               age|        current_date|
+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+------------------+--------------------+
|1944-10-15|                null|      null|    Collins|  male|   Michael|0005af3a-9471-4d1...|[{C000640, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|       Mac Collins|[{bar, Mac Collin...|Collins, Michael| 78.71506849315068|2023-06-14 06:12:...|
|1969-01-31|[{fax, 202-226-07...|      null|   Huizenga|  male|      Bill|00aa2dc0-bfb6-441...|[{Bill Huizenga, ...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Bill Huizenga|[{da, Bill Huizen...|  Huizenga, Bill|  54.4027397260274|2023-06-14 06:12:...|
|1959-09-28|[{phone, 202-225-...|      null|    Clawson|  male|    Curtis|00aca284-9323-495...|[{C001102, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...|      Curt Clawson|[{bar, Curt Claws...| Clawson, Curtis| 63.75342465753425|2023-06-14 06:12:...|
|1930-08-14|                null|2001-10-26|    Solomon|  male|    Gerald|00b73df5-4180-441...|[{S000675, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|    Gerald Solomon|[{null, Gerald B....| Solomon, Gerald| 71.24931506849315|2023-06-14 06:12:...|
|1960-05-28|[{fax, 202-225-42...|      null|     Rigell|  male|    Edward|00bee44f-db04-4a7...|[{R000589, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|   E. Scott Rigell|[{null, Scott Rig...|  Rigell, Edward|63.087671232876716|2023-06-14 06:12:...|
|1951-05-20|[{twitter, MikeCr...|      null|      Crapo|  male|   Michael|00f8f12d-6e27-4a2...|[{Mike Crapo, bal...|https://theunited...|[{https://theunit...|[{Wikipedia (da),...|        Mike Crapo|[{da, Mike Crapo,...|  Crapo, Michael| 72.11780821917809|2023-06-14 06:12:...|
|1926-05-12|                null|      null|      Hutto|  male|      Earl|015d77c8-6edb-4ed...|[{H001018, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|        Earl Hutto|[{null, Earl Dewi...|     Hutto, Earl| 97.15616438356165|2023-06-14 06:12:...|
|1937-11-07|                null|2015-11-19|      Ertel|  male|     Allen|01679bc3-da21-482...|[{E000208, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|       Allen Ertel|[{null, Allen E. ...|    Ertel, Allen| 78.08493150684932|2023-06-14 06:12:...|
|1916-09-01|                null|2007-11-24|     Minish|  male|    Joseph|018247d0-2961-423...|[{M000796, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Joseph Minish|[{bar, Joseph Min...|  Minish, Joseph|  91.2904109589041|2023-06-14 06:12:...|
|1957-08-04|[{phone, 202-225-...|      null|    Andrews|  male|    Robert|01b100ac-192e-4b5...|[{A000210, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...| Robert E. Andrews|[{null, Rob Andre...| Andrews, Robert|  65.9041095890411|2023-06-14 06:12:...|
|1957-01-10|[{fax, 202-225-57...|      null|     Walden|  male|      Greg|01bc21bf-8939-487...|[{Greg Walden, ba...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...|       Greg Walden|[{bar, Greg Walde...|    Walden, Greg| 66.46849315068494|2023-06-14 06:12:...|
|1919-01-17|                null|1987-11-29|      Kazen|  male|   Abraham|02059c1e-0bdf-481...|[{K000025, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|Abraham Kazen, Jr.|[{null, Abraham K...|  Kazen, Abraham| 68.91232876712328|2023-06-14 06:12:...|
|1960-01-11|[{fax, 202-225-67...|      null|     Turner|  male|   Michael|020aa7dd-54ef-435...|[{Michael R. Turn...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...| Michael R. Turner|[{null, Mike Turn...| Turner, Michael|63.465753424657535|2023-06-14 06:12:...|
|1942-06-28|                null|      null|      Kolbe|  male|     James|02141651-eca2-4aa...|[{K000306, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|         Jim Kolbe|[{ca, Jim Kolbe, ...|    Kolbe, James| 81.01643835616439|2023-06-14 06:12:...|
|1941-03-08|[{fax, 202-225-79...|      null|  Lowenthal|  male|      Alan|0231c6ef-6e92-49b...|[{Alan Lowenthal,...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...| Alan S. Lowenthal|[{null, Alan Lowe...| Lowenthal, Alan| 82.32328767123288|2023-06-14 06:12:...|
|1952-01-09|[{fax, 202-225-93...|      null|    Capuano|  male|   Michael|0239032f-be5c-4af...|[{Michael Capuano...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|Michael E. Capuano|[{null, Mike Capu...|Capuano, Michael| 71.47671232876712|2023-06-14 06:12:...|
|1951-10-19|[{fax, 202-225-56...|      null|   Schrader|  male|      Kurt|0263f619-eff8-4e1...|[{Kurt Schrader, ...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Kurt Schrader|[{bar, Kurt Schra...|  Schrader, Kurt|  71.7013698630137|2023-06-14 06:12:...|
|1947-06-13|[{fax, 202-225-69...|      null|     Nadler|  male|   Jerrold|029e793d-ec40-4a1...|[{N000002, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|    Jerrold Nadler|[{ca, Jerrold Nad...| Nadler, Jerrold| 76.05479452054794|2023-06-14 06:12:...|
|1970-02-03|[{fax, 202-225-82...|      null|     Graves|  male|       Tom|02b621fc-0523-449...|[{Tom Graves, bal...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|        Tom Graves|[{bar, Tom Graves...|     Graves, Tom|53.394520547945206|2023-06-14 06:12:...|
|1932-05-09|                null|      null|   McMillan|  male|      John|03018f7c-f866-419...|[{M000566, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     John McMillan|[{null, Alex McMi...|  McMillan, John| 91.15890410958905|2023-06-14 06:12:...|
+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+------------------+--------------------+
only showing top 20 rows
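
To make the age calculation concrete, the following plain-Python sketch reproduces the arithmetic of the suggested function (day difference divided by 365) for one row of the output, Gerald Solomon, whose death_date is set. The helper name age_in_years is an assumption for this illustration; it uses datetime instead of Spark, so it runs without a Spark session.

```python
from datetime import date


def age_in_years(birth_date, end_date):
    """Mirror datediff(end, birth) / 365 from the suggested snippet."""
    return (end_date - birth_date).days / 365


# Row from the output above: birth_date 1930-08-14, death_date 2001-10-26.
days = (date(2001, 10, 26) - date(1930, 8, 14)).days
age = age_in_years(date(1930, 8, 14), date(2001, 10, 26))
print(days)           # 26006
print(round(age, 4))  # 71.2493
```

Because the divisor is a fixed 365, leap days are not accounted for, so the computed age slightly overstates the true age. If you need exact ages, you could refine the comment you give to Amazon CodeWhisperer.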

Sort and extract records

You can use Amazon CodeWhisperer for sorting data and extracting records within a Spark DataFrame as well:

# Show top 5 oldest persons from DataFrame
# Use age column

Amazon CodeWhisperer will recommend a code snippet similar to the following:

def get_oldest_person(df):
    return df.orderBy(desc("age")).limit(5)

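Use it as follows. The function relies on desc from pyspark.sql.functions, so import it first if your notebook has not already done so:

from pyspark.sql.functions import desc

get_oldest_person(df).show()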

The preceding code returns the following output:

+----------+---------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+--------------------+---------------+------------------+--------------------+
|birth_date|contact_details|death_date|family_name|gender|given_name|                  id|         identifiers|               image|              images|               links|           name|         other_names|      sort_name|               age|        current_date|
+----------+---------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+--------------------+---------------+------------------+--------------------+
|1919-08-22|           null|      null|       Winn|  male|    Edward|942d20ed-d838-436...|[{W000636, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|Larry Winn, Jr.|[{null, Larry Win...|   Winn, Edward|103.88219178082191|2023-06-14 06:13:...|
|1920-03-23|           null|      null|      Smith|  male|      Neal|84a9cbe4-651b-46d...|[{S000596, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Neal Smith|[{null, Neal Edwa...|    Smith, Neal| 103.2958904109589|2023-06-14 06:13:...|
|1920-09-17|           null|      null|       Holt|female|  Marjorie|8bfb671a-3147-4bc...|[{H000747, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|  Marjorie Holt|[{bar, Marjorie H...| Holt, Marjorie| 102.8082191780822|2023-06-14 06:13:...|
|1921-03-05|           null|      null|     Bedell|  male|   Berkley|896f0ce3-afe4-4ea...|[{B000298, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...| Berkley Bedell|[{ca, Berkley Bed...|Bedell, Berkley|102.34520547945205|2023-06-14 06:13:...|
|1921-06-23|           null|      null|    Findley|  male|      Paul|2811f793-1108-4fb...|[{F000123, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (azb)...|   Paul Findley|[{azb, پاول فایند...|  Findley, Paul|102.04383561643836|2023-06-14 06:13:...|
+----------+---------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+--------------------+---------------+------------------+--------------------+

Generate sample datasets in a Spark DataFrame

Amazon CodeWhisperer can also generate sample Spark DataFrames from a description of the columns you want:

# Generate sample Spark DataFrame of country name and country code
# First column name is country_name, and second column name is country_code

Amazon CodeWhisperer will recommend a code snippet similar to the following:

def get_country_code_df(spark):
    return spark.createDataFrame(
        [("United States", "US"), ("United Kingdom", "UK"), ("Canada", "CA")],
        ["country_name", "country_code"]
    )

Use it as follows:

df = get_country_code_df(spark)
df.show()

The preceding code returns the following output:

+--------------+------------+
|  country_name|country_code|
+--------------+------------+
| United States|          US|
|United Kingdom|          UK|
|        Canada|          CA|
+--------------+------------+

Generate transformations in SQL

We can also use Amazon CodeWhisperer to generate a SQL transformation that creates a new table from the query results (CREATE TABLE AS SELECT, or CTAS):

# Generate CTAS query by selecting all the records in a table with grouping by a given column

Amazon CodeWhisperer will recommend a code snippet similar to the following:

def generate_ctas_query_with_group_by(table_name, group_by_col):
    ctas_query = "CREATE TABLE " + table_name + " AS SELECT * FROM " + table_name + " GROUP BY " + group_by_col
    return ctas_query
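
Recommendations like this one are a starting point and are worth reviewing before use. The query here reuses table_name as both the new table and the source, and it combines SELECT * with GROUP BY, which Spark SQL rejects when the table has columns outside the GROUP BY clause. The following is one possible adjustment, not CodeWhisperer output: the function name, the separate source and target parameters, the COUNT(*) aggregate, and the example table names are all assumptions chosen for illustration.

```python
def build_ctas_group_by_query(target_table, source_table, group_by_col):
    """Build a CTAS statement that counts rows per value of group_by_col.

    Identifiers are interpolated directly, so pass only trusted names.
    """
    return (
        f"CREATE TABLE {target_table} AS "
        f"SELECT {group_by_col}, COUNT(*) AS record_count "
        f"FROM {source_table} "
        f"GROUP BY {group_by_col}"
    )


# Hypothetical table names for illustration.
query = build_ctas_group_by_query("persons_by_gender", "persons", "gender")
print(query)
```

In a notebook with an active Spark session, you could then run the generated statement with spark.sql(query).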

Conclusion

In this post, we demonstrated how AWS Glue Studio notebook integration with Amazon CodeWhisperer helps you build data integration jobs faster. This integration is available today in US East (N. Virginia). You can start using the AWS Glue Studio notebook with Amazon CodeWhisperer to accelerate building your data integration jobs. To get started with AWS Glue, visit AWS Glue.

Learn more

To learn more about using AWS Glue notebooks and Amazon CodeWhisperer, check out the following video.

About the authors

Gal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering, and BI, and is based in California. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products. In her spare time, she enjoys playing card games.