Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

Post Syndicated from Tom Romano original https://aws.amazon.com/blogs/big-data/empower-your-jira-data-in-a-data-lake-with-amazon-appflow-and-aws-glue/

In the world of software engineering and development, organizations use project management tools like Atlassian Jira Cloud. Managing projects with Jira leads to rich datasets, which can provide historical and predictive insights about project and development efforts.

Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications. Companies often take a data lake approach to their analytics, bringing data from many different systems into one place to simplify how the analytics are done.

This post shows you how to use Amazon AppFlow and AWS Glue to create a fully automated data ingestion pipeline that will synchronize your Jira data into your data lake. Amazon AppFlow provides software as a service (SaaS) integration with Jira Cloud to load the data into your AWS account. AWS Glue is a serverless data discovery, load, and transformation service that will prepare data for consumption in BI and AI/ML activities. Additionally, this post strives for a low-code, serverless solution to keep operational overhead low, and the solution supports incremental loading to optimize cost.

Solution overview

This solution uses Amazon AppFlow to retrieve data from the Jira Cloud. The data is synchronized to an Amazon Simple Storage Service (Amazon S3) bucket using an initial full download and subsequent incremental downloads of changes. When new data arrives in the S3 bucket, an AWS Step Functions workflow is triggered that orchestrates extract, transform, and load (ETL) activities using AWS Glue crawlers and AWS Glue DataBrew. The data is then available in the AWS Glue Data Catalog and can be queried by services such as Amazon Athena, Amazon QuickSight, and Amazon Redshift Spectrum. The solution is completely automated and serverless, resulting in low operational overhead. When this setup is complete, your Jira data will be automatically ingested and kept up to date in your data lake!

The following diagram illustrates the solution architecture.

The Jira AppFlow architecture is shown. The Jira Cloud data is retrieved by Amazon AppFlow and is stored in Amazon S3. This triggers an Amazon EventBridge event that runs an AWS Step Functions workflow. The workflow uses AWS Glue to catalog and transform the data. The data is then queried with QuickSight.

The Step Functions workflow orchestrates the following ETL activities, resulting in two tables:

  • An AWS Glue crawler collects all downloads into a single AWS Glue table named jira_raw. This table comprises a mix of full and incremental downloads from Jira, with many versions of the same records representing changes over time.
  • A DataBrew job prepares the data for reporting by unpacking key-value pairs in the fields, as well as removing obsolete records as they are updated in subsequent change data captures. This reporting-ready data will be available in an AWS Glue table named jira_data.

The following figure shows the Step Functions workflow.

A diagram represents the AWS Step Functions workflow. It contains the steps to run an AWS Glue crawler, wait for its completion, and then run an AWS Glue DataBrew data transformation job.

Prerequisites

This solution requires the following:

  • Administrative access to your Jira Cloud instance, and an associated Jira Cloud developer account.
  • An AWS account and a login with access to the AWS Management Console. Your login will need AWS Identity and Access Management (IAM) permissions to create and access the resources in your AWS account.
  • Basic knowledge of AWS and working knowledge of Jira administration.

Configure the Jira Instance

After logging in to your Jira Cloud instance, you establish a Jira project with associated epics and issues to download into a data lake. If you’re starting with a new Jira instance, it helps to have at least one project with a sampling of epics and issues for the initial data download, because it allows you to create an initial dataset without errors or missing fields. Note that you may have multiple projects as well.

An image shows a Jira Cloud example, with several issues arranged on a Kanban board.

After you have established your Jira project and populated it with epics and issues, ensure you also have access to the Jira developer portal. In later steps, you use this developer portal to establish authentication and permissions for the Amazon AppFlow connection.

Provision resources with AWS CloudFormation

For the initial setup, you launch an AWS CloudFormation stack to create an S3 bucket to store data, IAM roles for data access, and the AWS Glue crawler and Data Catalog components. Complete the following steps:

  1. Sign in to your AWS account.
  2. Click Launch Stack:
  3. For Stack name, enter a name for the stack (the default is aws-blog-jira-datalake-with-AppFlow).
  4. For GlueDatabaseName, enter a unique name for the Data Catalog database to hold the Jira data table metadata (the default is jiralake).
  5. For InitialRunFlag, choose Setup. This mode will scan all data and disable the change data capture (CDC) features of the stack. (Because this is the initial load, the stack needs an initial data load before you configure CDC in later steps.)
  6. Under Capabilities and transforms, select the acknowledgement check boxes to allow IAM resources to be created within your AWS account.
  7. Review the parameters and choose Create stack to deploy the CloudFormation stack. This process will take around 5–10 minutes to complete.
    An image depicts the AWS CloudFormation configuration steps, including setting a stack name, setting parameters to "jiralake" and "Setup" mode, and checking all IAM capabilities requested.
  8. After the stack is deployed, review the Outputs tab for the stack and collect the following values to use when you set up Amazon AppFlow:
    • Amazon AppFlow destination bucket (o01AppFlowBucket)
    • Amazon AppFlow destination bucket path (o02AppFlowPath)
    • Role for Amazon AppFlow Jira connector (o03AppFlowRole)
      An image demonstrating the AWS CloudFormation "Outputs" tab, highlighting the values to add to the Amazon AppFlow configuration.

Configure Jira Cloud

Next, you configure your Jira Cloud instance for access by Amazon AppFlow. For full instructions, refer to Jira Cloud connector for Amazon AppFlow. The following steps summarize these instructions and discuss the specific configuration to enable OAuth in the Jira Cloud:

  1. Open the Jira developer portal.
  2. Create the OAuth 2 integration from the developer application console by choosing Create an OAuth 2.0 Integration. This will provide a login mechanism for AppFlow.
  3. Enable fine-grained permissions. See Recommended scopes for the permission settings to grant AppFlow appropriate access to your Jira instance.
  4. Add the following permission scopes to your OAuth app:
    1. manage:jira-configuration
    2. read:field-configuration:jira
  5. Under Authorization, set the callback URL to return to Amazon AppFlow with the URL https://us-east-1.console.aws.amazon.com/AppFlow/oauth.
  6. Under Settings, note the client ID and secret to use in later steps to set up authentication from Amazon AppFlow.

Create the Amazon AppFlow Jira Cloud connection

In this step, you configure Amazon AppFlow to run a one-time full data fetch of all your data, establishing the initial data lake:

  1. On the Amazon AppFlow console, choose Connectors in the navigation pane.
  2. Search for the Jira Cloud connector.
  3. Choose Create flow on the connector tile to create the connection to your Jira instance.
    An image of Amazon AppFlow, showing the search for the "Jira Cloud" connector.
  4. For Flow name, enter a name for the flow (for example, JiraLakeFlow).
  5. Leave the Data encryption setting as the default.
  6. Choose Next.
    The Amazon AppFlow Jira connector configuration, showing the Flow name set to "JiraLakeFlow" and clicking the "next" button.
  7. For Source name, keep the default of Jira Cloud.
  8. Choose Create new connection under Jira Cloud connection.
  9. In the Connect to Jira Cloud section, enter the values for Client ID, Client secret, and Jira Cloud Site that you collected earlier. This provides the authentication from AppFlow to Jira Cloud.
  10. For Connection Name, enter a connection name (for example, JiraLakeCloudConnection).
  11. Choose Connect. You will be prompted to allow your OAuth app to access your Atlassian account to verify authentication.
    An image of the Amazon AppFlow configuration, reflecting the completion of the prior steps.
  12. In the Authorize App window that pops up, choose Accept.
  13. With the connection created, return to the Configure flow section on the Amazon AppFlow console.
  14. For API version, choose V2 to use the latest Jira query API.
  15. For Jira Cloud object, choose Issue to query and download all issues and associated details.
    An image of the Amazon AppFlow configuration, reflecting the completion of the prior steps.
  16. For Destination Name in the Destination Details section, choose Amazon S3.
  17. For Bucket details, choose the S3 bucket name that matches the Amazon AppFlow destination bucket value that you collected from the outputs of the CloudFormation stack.
  18. Enter the Amazon AppFlow destination bucket path to complete the full S3 path. This will send the Jira data to the S3 bucket created by the CloudFormation script.
  19. Leave Catalog your data in the AWS Glue Data Catalog unselected. The CloudFormation script uses an AWS Glue crawler to update the Data Catalog in a different manner, grouping all the downloads into a common table, so we disable the update here.
  20. For File format settings, select Parquet format and select Preserve source data types in Parquet output. Parquet is a columnar format to optimize subsequent querying.
  21. Select Add a timestamp to the file name for Filename preference. This will allow you to easily find data files downloaded at a specific date and time.
    An image of the Amazon AppFlow configuration, reflecting the completion of the prior steps.
  22. For now, select Run on Demand for the Flow trigger to run the full load flow manually. You will schedule downloads in a later step when implementing CDC.
  23. Choose Next.
    An image of the Amazon AppFlow Flow Trigger configuration, reflecting the completion of the prior steps.
  24. On the Map data fields page, select Manually map fields.
  25. For Source to destination field mapping, choose the drop-down box under Source field name and select Map all fields directly. This will bring down all fields as they are received, because we will instead implement data preparation in later steps.
    An image of the Amazon AppFlow configuration, reflecting the completion of steps 24 & 25.
  26. Under Partition and aggregation settings, you can set up the partitions in a way that works for your use case. For this example, we use a daily partition, so select Date and time and choose Daily.
  27. For Aggregation settings, leave it as the default of Don’t aggregate.
  28. Choose Next.
    An image of the Amazon AppFlow configuration, reflecting the completion of steps 26-28.
  29. On the Add filters page, you can create filters to only download specific data. For this example, you download all the data, so choose Next.
  30. Review and choose Create flow.
  31. When the flow is created, choose Run flow to start the initial data seeding. After some time, you should receive a banner indicating the run finished successfully.
    An image of the Amazon AppFlow configuration, reflecting the completion of step 31.

Review seed data

At this stage in the process, you now have data in your S3 environment. When new data files are created in the S3 bucket, the Step Functions workflow automatically runs an AWS Glue crawler to catalog the new data. You can see if it’s complete by reviewing the Step Functions state machine for a Succeeded run status. There is a link to the state machine on the CloudFormation stack’s Resources tab, which will redirect you to the Step Functions state machine.

An image showing the CloudFormation resources tab of the stack, with a link to the AWS Step Functions workflow.

When the state machine is complete, it’s time to review the raw Jira data with Athena. The database is as you specified in the CloudFormation stack (jiralake by default), and the table name is jira_raw. If you kept the default AWS Glue database name of jiralake, the Athena SQL is as follows:

SELECT * FROM "jiralake"."jira_raw" limit 10;

If you explore the data, you’ll notice that most of the data you would want to work with is actually packed into a column called fields. This means the data is not available as columns in your Athena queries, making it harder to select, filter, and sort individual fields within an Athena SQL query. This will be addressed in the next steps.

An image demonstrating the Amazon Athena query SELECT * FROM "jiralake"."jira_raw" limit 10;
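Before moving on, you can also peek at individual Jira fields straight from the raw table by using Athena’s JSON functions. The following is only a minimal sketch: it assumes the fields column is stored as a JSON string (if Parquet preserved it as a struct, address its members with dot notation instead), that the table exposes top-level id and key columns from the Jira issue schema, and the JSON paths shown are illustrative:

SELECT "id",
       "key",
       json_extract_scalar("fields", '$.summary') AS summary,
       json_extract_scalar("fields", '$.status.name') AS status
FROM "jiralake"."jira_raw"
LIMIT 10;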

Set up CDC and unpack the fields columns

To add the ongoing CDC and reformat the data for analytics, we introduce a DataBrew job to transform the data and filter to the most recent version of each record as changes come in. You can do this by updating the CloudFormation stack with a flag that includes the CDC and data transformation steps.

  1. On the AWS CloudFormation console, return to the stack.
  2. Choose Update.
  3. Select Use current template and choose Next.
    An image showing AWS CloudFormation, with steps 1-3 complete.
  4. For SetupOrCDC, choose CDC, then choose Next. This will enable both the CDC steps and the data transformation steps for the Jira data.
    An image showing AWS CloudFormation, with step 4 complete.
  5. Continue choosing Next until you reach the Review section.
  6. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
    An image showing AWS CloudFormation, with steps 5-6 complete.
  7. Return to the Amazon AppFlow console and open your flow.
  8. On the Actions menu, choose Edit flow. We will now edit the flow trigger to run an incremental load on a periodic basis.
  9. Select Run flow on schedule.
  10. Configure the desired repeats, as well as the start time and date. For this example, choose Daily for Repeats and enter 1 as the number of days between flow runs. For Starting at, enter 01:00.
  11. Select Incremental transfer for Transfer mode.
  12. Choose Updated on the drop-down menu so that changes will be captured based on when the records were updated.
  13. Choose Save. With these settings in our example, the run will happen nightly at 1:00 AM.
    An image showing the Flow Trigger, with incremental transfer selected.

Review the analytics data

When the next incremental load occurs that results in new data, the Step Functions workflow will start the DataBrew job and populate a new staged analytical data table named jira_data in your Data Catalog database. If you don’t want to wait, you can trigger the Step Functions workflow manually.

The DataBrew job performs data transformation and filtering tasks. The job unpacks the key-value pairs from the raw Jira JSON data, resulting in a tabular data schema that facilitates use with BI and AI/ML tools. As Jira items are changed, the changed item’s data is resent, resulting in multiple versions of an item in the raw data feed. The DataBrew job filters the raw data feed so that the resulting data table only contains the most recent version of each item. You could enhance this DataBrew job to further customize the data for your needs, such as renaming the generic Jira custom field names to reflect their business meaning.
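To make the filtering idea concrete, the following Athena query sketches the equivalent "keep only the latest version of each item" logic. This is an illustration rather than the actual DataBrew recipe, and it assumes an issue identifier column named id and an update timestamp column named updated (in the raw table that timestamp is nested inside fields, so it would need to be extracted first):

SELECT *
FROM (
    SELECT *,
           row_number() OVER (PARTITION BY id ORDER BY updated DESC) AS version_rank
    FROM "jiralake"."jira_raw"
) versions
WHERE version_rank = 1;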

When the Step Functions workflow is complete, we can query the data in Athena again using the following query:

SELECT * FROM "jiralake"."jira_data" limit 10;

You can see that in our transformed jira_data table, the nested JSON fields are broken out into their own columns for each field. You will also notice that we’ve filtered out obsolete records that have been superseded by more recent record updates in later data loads so the data is fresh. If you want to rename custom fields, remove columns, or restructure what comes out of the nested JSON, you can modify the DataBrew recipe to accomplish this. At this point, the data is ready to be used by your analytics tools, such as Amazon QuickSight.

An image demonstrating the Amazon Athena query SELECT * FROM "jiralake"."jira_data" limit 10;
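As noted above, renaming or removing columns is best handled in the DataBrew recipe. If you prefer to leave the recipe untouched, a lightweight alternative is to layer an Athena view over jira_data; the column names below are hypothetical and only illustrate the pattern:

CREATE OR REPLACE VIEW "jiralake"."jira_data_reporting" AS
SELECT
    "key" AS issue_key,
    summary,
    customfield_10001 AS story_points -- hypothetical Jira custom field mapping
FROM "jiralake"."jira_data";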

Clean up

If you would like to discontinue this solution, you can remove it with the following steps:

  1. On the Amazon AppFlow console, deactivate the flow for Jira, and optionally delete it.
  2. On the Amazon S3 console, select the S3 bucket for the stack, and empty the bucket to delete the existing data.
  3. On the AWS CloudFormation console, delete the CloudFormation stack that you deployed.

Conclusion

In this post, we created a serverless incremental data load process for Jira that will synchronize data while handling custom fields using Amazon AppFlow, AWS Glue, and Step Functions. The approach uses Amazon AppFlow to incrementally load the data into Amazon S3. We then use AWS Glue and Step Functions to manage the extraction of the Jira custom fields and load them in a format to be queried by analytics services such as Athena, QuickSight, or Redshift Spectrum, or AI/ML services like Amazon SageMaker.

To learn more about AWS Glue and DataBrew, refer to Getting started with AWS Glue DataBrew. With DataBrew, you can take the sample data transformation in this project and customize the output to meet your specific needs. This could include renaming columns, creating additional fields, and more.

To learn more about Amazon AppFlow, refer to Getting started with Amazon AppFlow. Note that Amazon AppFlow supports integrations with many SaaS applications in addition to the Jira Cloud.

To learn more about orchestrating flows with Step Functions, see Create a Serverless Workflow with AWS Step Functions and AWS Lambda. The workflow could be enhanced to load the data into a data warehouse, such as Amazon Redshift, or trigger a refresh of a QuickSight dataset for analytics and reporting.

In future posts, we will cover how to unnest parent-child relationships within the Jira data using Athena and how to visualize the data using QuickSight.


About the Authors

Tom Romano is a Sr. Solutions Architect for AWS World Wide Public Sector from Tampa, FL, and assists GovTech and EdTech customers as they create new solutions that are cloud native, event driven, and serverless. He is an enthusiastic Python programmer for both application development and data analytics, and is an Analytics Specialist. In his free time, Tom flies remote control model airplanes and enjoys vacationing with his family around Florida and the Caribbean.

Shane Thompson is a Sr. Solutions Architect based out of San Luis Obispo, California, working with AWS Startups. He works with customers who use AI/ML in their business model and is passionate about democratizing AI/ML so that all customers can benefit from it. In his free time, Shane loves to spend time with his family and travel around the world.

New InsightCloudSec Compliance Pack for CIS AWS Benchmark 2.0.0

Post Syndicated from James Alaniz original https://blog.rapid7.com/2023/08/01/new-insightcloudsec-compliance-pack-for-cis-aws-benchmark-2-0-0/


The Center for Internet Security (CIS) recently released version two of their AWS Benchmark. CIS AWS Benchmark 2.0.0 brings two new recommendations and eliminates one from the previous version. The update also includes some minor formatting changes to certain recommendation descriptions.

In this post, we’ll talk a little bit about the “why” behind these changes. We’ll also look at using InsightCloudSec’s new, out-of-the-box compliance pack to implement and enforce the benchmark’s recommendations.

What’s new, what’s changed, and why

Version 2.0.0 of the CIS AWS Benchmark included two new recommendations:

  • Ensure access to AWSCloudShellFullAccess is restricted
    An important addition from CIS, this recommendation focuses on restricting access to the AWSCloudShellFullAccess policy, which presents a potential path for data exfiltration by malicious cloud admins that are given full permissions to the service. AWS documentation describes how to create a more restrictive IAM policy that denies file transfer permissions.
  • Ensure that EC2 Metadata Service only allows IMDSv2
    You should use IMDSv2 to avoid leaving your EC2 instances susceptible to Server-Side Request Forgery (SSRF) attacks, a critical weakness of IMDSv1.

The update also included the removal of the previous recommendation:

  • Ensure all S3 buckets employ encryption-at-rest
    This recommendation was removed because AWS now encrypts all new objects by default as of January 2023. It’s important to note that this only applies to newly written objects. So, if you’ve got some buckets that have been kicking around for a while, make sure they are employing encryption-at-rest and that it cannot be inadvertently turned off at some point down the line.

Along with these changes, CIS also made a few minor changes related to the wording in some of the benchmark titles and descriptions.

How does ICS help me implement this in my environment?

Available as a compliance pack within InsightCloudSec right out-of-the-box, Rapid7 makes it easy for teams to scan their AWS environments for compliance against the recommendations and controls outlined in the CIS AWS Benchmark. If you’re not yet using InsightCloudSec today, be sure to check out the docs pages here, which will guide you through getting started with the platform.

Once you’re up and running, scoping your compliance assessment to a specific pack is as easy as four clicks. First, from the Compliance Summary page, select your desired benchmark. In this case, of course, CIS AWS Benchmark 2.0.0.


From there, we can select the specific cloud or clouds we want to scan.


And because we’ve got our badging and tagging strategies in order (right…….RIGHT?!) we can hone in even further. For this example, let’s focus on the production environment.


You’ll get some trending insights that show how your organization as a whole, as well as specific teams and accounts, are doing, and whether or not you’re seeing improvement over time.


Finally, if you’ve got a number of cloud accounts and/or clusters running across your environment, you can even scope down to that level. In this example, we’ll select all.


Once you’ve got your filters set, you can apply and get real-time insight into how well your organization is adhering to the CIS AWS Benchmark. As with any pack, you can see your current compliance score overall along with a breakdown of the risk level associated with each instance of non-compliance.


So as you can see, it’s fairly simple to assess your cloud environment for compliance with the CIS AWS Benchmark with a cloud security tool like InsightCloudSec. If you’re just starting your cloud security journey or aren’t really sure where to start, utilizing an out-of-the-box compliance pack is a great way to set a foundation to build off of.

In fact, Rapid7 recently partnered with AWS to help organizations in that very situation. Using a combination of market-leading technology and hands-on expertise, our AWS Cloud Risk Assessment provides a point-in-time understanding of your entire AWS cloud footprint and its security posture.

During the assessment, our experts will inspect your cloud environment for more than 100 distinct risks and misconfigurations, including publicly exposed resources, lack of encryption, and user accounts not utilizing multi-factor authentication. At the end of this assessment, your team will receive an executive-level report aligned to the AWS Foundational Security Best Practices, participate in a read-out call, and discuss next steps for executing your cloud risk mitigation program alongside experts from Rapid7 and our services partners.

If you’re interested, be sure to reach out about the AWS Cloud Risk Assessment with this quick request form!

Hall: IBM, Red Hat and Free Software: An old maddog’s view

Post Syndicated from corbet original https://lwn.net/Articles/939922/

Here is a
long reminiscence
from Jon “maddog” Hall leading up to some thoughts on
Red Hat’s source-release policy changes.

Recently I have been seeing some cracks in the dike. As more and
more users of FOSS come on board, they put more and more demands on
developers whose numbers are not growing sufficiently fast enough
to keep all the software working.

I hear from FOSS developers that too few, and sometimes no,
developers are working on blocks of code. Of course this can also
happen to closed-source code, but this shortness hits mostly in
areas that are not considered “sexy”, such as quality assurance,
release engineering, documentation and translations.

Security updates for Tuesday

Post Syndicated from corbet original https://lwn.net/Articles/939917/

Security updates have been issued by Debian (tiff), Fedora (curl), Red Hat (bind, ghostscript, iperf3, java-1.8.0-ibm, nodejs, nodejs:18, openssh, postgresql:15, and samba), Scientific Linux (iperf3), Slackware (mozilla and seamonkey), SUSE (compat-openssl098, gnuplot, guava, openssl-1_0_0, pipewire, python-requests, qemu, samba, and xmltooling), and Ubuntu (librsvg, openjdk-8, openjdk-lts, openjdk-17, openssh, rabbitmq-server, and webkit2gtk).

How To Present SecOps Metrics (The Right Way)

Post Syndicated from Rapid7 original https://blog.rapid7.com/2023/08/01/how-to-present-secops-metrics-the-right-way/


SecOps metrics can be a gold mine of potential for informing better business decisions, but 78% of CEOs say they don’t have adequate data on risk exposure to make good decisions. Even when they do see the right data, 82% are inclined to “trust their gut” anyway.

Here lies the disconnect between data and decisions for C-level executives: a lack of effective presentation. Ultimately, the responsibility of communicating that SecOps metrics matter falls on today’s security teams. They must transform numbers into narratives that illustrate the challenges in today’s attack landscape to decision-makers — and, most importantly, make stakeholders care about those challenges.

But metrics presentations can get boring. So, how can security professionals present SecOps metrics in an engaging way?

Stories inspire empathy and action

While facts and figures play a role in communication, humans respond differently to stories. With narratives, we understand meaning more deeply, remember events longer, and factor what each story taught us into future decisions. Storytelling is also an effective way for security teams to inspire empathy — and therefore, action — in today’s decision-makers.

It’s critical for security professionals to identify the narrative thread in the metrics they’re analyzing. Here’s what we mean by that, step by step and metric by metric.

Establish how hungry the bad guys are

Hone in on the frequency of security incidents. This metric directly correlates to the power and reach threat actors have. Dive into the causes behind incidents, how much impact incidents had, and what can be done to stop them.

This information gives executives direct insight into the potential risks your organization faces and the negative outcomes associated with them. When executives can see the cold hard number of times their organizations have suffered from a breach, attack, or leak, it can highlight where security strategies are still lacking — and where they’re losing out to malicious actors.

Show how the villains keep winning

MTTD (mean time to detection) is a measure of how fast security teams can detect incidents. While it might not be a flashy metric in and of itself, it can pack a powerful punch when illustrating the damage bad actors can do before they’re suspected of even breaking in.

MTTD provides insight into the efficacy of an organization’s current cybersecurity tools and data coverage. It can also be a helpful indicator of how well current security processes are working — and how overworked or resource-strained a security team might be.

Tell the underdog’s story

Here’s where you leverage MTTR (mean time to respond). This metric shows how quickly the security team can spring into action. More often than not, security teams have a litany of other important tasks at hand that can make MTTR less than ideal. This demonstrates why resource-strained and overworked security professionals are set up to fail if they don’t have the right tools, strategies, and support.

With MTTR, security teams can add an extra layer of context to the data shown by MTTD. This metric highlights how quickly security teams respond to incidents — which can be another indicator of how well tools and processes match up to current threats.

Describe the loot you stand to lose

Finally, communicate the potential cost per incident. Money speaks volumes when you’re crafting a narrative out of SecOps metrics — so it’s best to close out your stories with this powerful data point. This metric provides insight into the efficiency of a cybersecurity program’s processes, tools, and resource allocation.

This is perhaps the most effective metric security professionals can use with executives because it speaks directly to one of their critical concerns: the bottom line.

Putting it all together

While many additional SecOps metrics matter, those four data points can come together most effectively to weave a story that speaks to C-level execs.

However, executives will have their own set of questions and concerns at SecOps briefs. So, it’s important to supplement even the strongest SecOps stories with additional answers, such as:

  • How efficiently your organization is addressing risks compared to other similar companies.
  • Where budget spend works and where it doesn’t in terms of ROI.
  • Where opportunities for increased efficiencies (namely, breaking down disparate silos and cutting costs with consolidation) can come into play.

It all comes down to communication

By focusing on crafting data narratives, security teams can turn SecOps metrics into actionable decisions for stakeholders. Telling the right story to the right people can help procure backing from the top — which means getting the resources, people, and budget security leaders need to stay ahead of threats.

Effectively communicating with C-levels helps build a rapport between stakeholders and boots-on-the-ground security professionals. By presenting metrics as parts of a larger story, organizations can unlock better collaboration, better relationships, and better business outcomes.

All while keeping threat actors in check.

Want to learn more about creating SecOps narratives that pack a punch? Download Presenting Upward: How to Showcase SecOps Metrics That Matter now.

Zabbix in: exploratory data analysis rehearsal – Part 3

Post Syndicated from Paulo R. Deolindo Jr. original https://blog.zabbix.com/zabbix-in-exploratory-data-analysis-rehearsal-part-2-2/26266/

Abstract

This will be the last blog post of the “Zabbix in… exploratory data analysis rehearsal” series. To complete our initial proposal, we’ll close with the third and fourth moments of a data distribution. This time, we’ll talk about Skewness and Kurtosis.

Refer to the first and second articles of the series for the background to what we discuss here.

The four moments for a data distribution

While the first moment helps us with the location estimate for a data distribution, the second moment works with its variance. The third moment, called asymmetry (skewness), allows us to understand the value trends and the degree of the asymmetry. The fourth moment, called kurtosis, is about the distribution’s propensity to produce peaks (outliers).

These four moments are not the final word on a data distribution. There is much more to learn and apply in data science when considering statistical concepts, but for now we must finish the initial proposal and bring forward some insights for decision makers.

Let’s get started!

Asymmetry

Based on our web application scenario, we can see a certain asymmetry in response time in most cases. This is normal and expected – so far, no problems. But it is also true that some symmetry is also possible in certain cases. Again, no problems here.

So, where is the problem? When does it happen?

Sometimes, the web application response time can be too different from the previous one, and we have no control over it. In these cases, the outliers must be found, and the correct interpretation must be applied. At that point, we must consider anomalies in the environment. Sometimes, the outliers are just a deviation. In all cases, we must pay attention and monitor the metrics that can make the difference.

Speaking of asymmetry – why is this topic so special? One of the possible answers is that we need to understand the degree of the asymmetry – whether it is high or moderated and whether the values were in most cases smaller than or bigger than the mean or median. In other words, what does the asymmetry say about the web application performance?

Let’s check some implementations.

The key skewness

From version 6.0, Zabbix provides the skewness function for use in calculated items.

For example, it can be used like this:

skewness(/host/key,1h) # the skewness for the last hour until now

Now, let’s see how this formula could be applied to our scenario:

skewness(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)

Using skewness and time shift “1h:now/h”, we are looking for some web application response time asymmetry at the previous hour.

The asymmetry can be negative (left skew), zero, positive (right skew) or undefined.
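For reference, a common moment-based definition of sample skewness is shown below, where x̄ is the sample mean and n the number of samples (Zabbix documents its own implementation; this is just the conventional formula):

\[ g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}} \]

Negative values indicate a left skew, positive values a right skew, and values near zero an approximately symmetric distribution.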

Definition: a left-skewed distribution is longer on the left side of its peak than on its right. In other words, a left-skewed distribution has a long tail on its left side.

Considering a left skew, it is possible to state that at the previous hour, the web application had more high values than low values. This means that our web application does not perform as well as it should.

Look at the graph above. You can see some bars on the left side of the mean and other bars on the right side of the mean, mirroring each other in size. We can consider this a normal distribution for web application response time, but it does not mean that the response times were good or bad – it only means that they had some balance, and it suggests more investigation.

Definition: a right-skewed distribution is longer on the right side of its peak than on its left. In other words, a right-skewed distribution has a long tail on its right side.

Considering a right skew, it is possible to state that at the previous hour, the web application had more low values than high values. This means our web application performs as expected.

Value Map for Skewness

You must create the following value map in your template:

If you wish, the value map can also be as below:

  • “is greater than or equals” 0.1 → More good response times, compared to the mean
  • “equals” 0 → Response times are symmetric, or evenly distributed between good and bad
  • “is less than or equals” 0 → More bad response times, compared to the mean

Pearson Skewness Coefficient

Pearson’s coefficient is a very interesting indicator. Given some skewness in a data distribution, it tells us whether the asymmetry is strong or only moderate.

We can create a calculated item for the Pearson’s Coefficient:

(3*(avg-median))/stddevpop
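In conventional notation, this is Pearson’s second skewness coefficient, where x̄ is the mean, x̃ the median, and s the (population) standard deviation:

\[ Sk_2 = \frac{3\,(\bar{x} - \tilde{x})}{s} \]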

In Zabbix, we need:

  • One item for the response time average considering the previous hour:
    • Key: resp.time.previous.hour
    • Formula: trendavg(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)
  • One item for calculating the median (for percentile, see the previous blog post):
    • Key: response.time.previous.hour
    • Formula: (last(//p51.previous.hour)+last(//p50.previous.hour))/2
  • One item for the standard deviation calculation:
    • Key: stddevpop.response.time.previous.hour (for example)
    • Formula: stddevpop(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)

Finally, a calculated item for a Pearson’s Coefficient:

  • Key: coefficient.requests.previous.hour
    • Formula:
      ((last(//trendavg.requests.per.minute)-
      last(//median.access.previous.hour))*3)
      /
      last(//stddevpop.requests.previous.hour)

To finish the exercise, create a value map:

Kurtosis

Kurtosis is the fourth moment of a data distribution and indicates whether the values are prone to peaks.
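For reference, kurtosis is conventionally computed from the fourth central moment and is often reported as excess kurtosis (the raw value minus 3) so that a normal distribution scores zero; the negative/zero/positive reading used in this post matches the excess form:

\[ \mathrm{Kurt}[X] = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{2}} - 3 \]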

In Zabbix, you can implement or calculate kurtosis with other calculated items:

kurtosis(/host/key,1h)

A negative kurtosis value tells us that the distribution is not prone to outliers, or produced few of them. A positive kurtosis value tells us that the distribution is prone to outliers, or produced many of them. Everything revolves around the mean of the data distribution. A neutral, or zero, value tells us that the distribution’s tail behavior is comparable to that of a normal distribution.

For our web scenario, we have:

kurtosis(//net.tcp.service.perf[http,"{HOST.CONN}","{$NGINX.STUB_STATUS.PORT}"],1h:now/h)

Explanatory Dashboard

Considering the image above, the data distribution for the web application response time at the previous hour has a left skew. This suggests that, relative to the mean response time at the previous hour, there were more high values than low values. That’s sad – our web application performed poorly. Why?

However, because the skewness was moderate (according to Pearson’s coefficient), the response time values were not so different from one another within that hour.

As for kurtosis, we can say that the data distribution is prone to peaks because we have positive kurtosis. In basic statistical terms, most values are near the mean, but there is a high probability of producing outliers.

This is a point to pay attention to if you are looking at a critical service.

Looking at other metrics

To check and validate our interpretation, let’s visualize the collected values on a simple Zabbix graph, using a graph widget.

Please consider using the following “time period” configuration:

The graph will show only the data collected at the previous hour – in this case, from 11:00 to 11:59.

The graph in Zabbix allows us to visualize the 50th percentile as well. We do not display the mean on the graph because it would not be well represented visually: it was collected only once, so it lacks an interesting trend line like the 50th percentile. However, notice that the mean and the 50th percentile values are very close, which gives us an idea of the data distribution around this measure.

Partial Conclusion

Skewness and Kurtosis, respectively, are the third and fourth moments of a data distribution. They help us understand the environment’s behavior and allow us to gain insight into a lot of things (in this case, we simply applied these concepts to IT infrastructure monitoring and focused on our web application to analyze its performance).

In most cases, the asymmetry will exist – it’s normal and expected. However, knowing some properties of the skewness can help us understand the response times, indicating good or poor performance. The skewness coefficient allows us to know if the asymmetry was strong or just moderated. Meanwhile, kurtosis helps us to understand if the data distribution produced some peaks considering an observation period or whether it is prone to produce peaks. We can then create some triggers for that and avoid some undesirable behaviors in the future, based on our data distribution observation. This is applied data science at its best.
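For example, a trigger could fire when the previous hour’s distribution shows positive kurtosis, signaling a period prone to outliers. This is only a sketch: the host name and the item key (assumed to hold the calculated kurtosis value) are hypothetical and should be replaced with your own.

last(/webapp-server/kurtosis.response.time.previous.hour)>0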

Conclusion

Data science can be easily applied to (and bring up insights about) Zabbix and its aggregate functions. It’s true that there are some special functions such as skewness, kurtosis, stddevpop, stddevsamp, mad, and so on, but there are older functions to help us too, such as percentile, forecast, timeleft, etc. All these functions can be used in calculated item formulas.

One of the interesting advantages of using Zabbix in data science and performing an exploratory data analysis is the fact that Zabbix can monitor everything. This means that the database already exists with the relevant data to analyze, in real time.

In the blog posts in this series, we took as a basis some data referring to the previous hour, the previous day, and so on, but we did nothing with the current hour. If we applied the concepts studied to real-time data, we would have other “live” results, including support for decision-making. This is because we would not only study historical data, but would also have the opportunity to change the course of events.

Zabbix is improving dashboards significantly. From Zabbix 6.4, we have many new out-of-the-box widgets and the possibility to create our own. However, there is a concern – Zabbix administrators sometimes show unnecessary data in dashboards, which can cloud the decision-making process. Zabbix administrators, in general, might want to learn storytelling techniques to rectify this situation. Maybe in a perfect world!

I hope you have enjoyed this blog post series.

Keep studying!

 

The post Zabbix in: exploratory data analysis rehearsal – Part 3 appeared first on Zabbix Blog.

Hacking AI Resume Screening with Text in a White Font

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/08/hacking-ai-resume-screening-with-text-in-a-white-font.html

The Washington Post is reporting on a hack to fool automatic resume sorting programs: putting text in a white font. The idea is that the programs rely primarily on simple pattern matching, and the trick is to copy a list of relevant keywords—or the published job description—into the resume in a white font. The computer will process the text, but humans won’t see it.

Clever. I’m not sure it’s actually useful in getting a job, though. Eventually the humans will figure out that the applicant doesn’t actually have the required skills. But…maybe.

Now Open – AWS Israel (Tel Aviv ) Region

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/now-open-aws-israel-tel-aviv-region/

In June 2021, Jeff Barr announced the upcoming AWS Israel (Tel Aviv) Region. Today we’re announcing the general availability of the AWS Israel (Tel Aviv) Region, with three Availability Zones and the il-central-1 API name.

The new Tel Aviv Region gives customers an additional option for running their applications and serving users from data centers located in Israel. Customers can securely store data in Israel while serving users in the vicinity with even lower latency.

AWS Services in the AWS Israel (Tel Aviv) Region
In the new Tel Aviv Region, you can use C5, C5d, C6g, C6gn, C6i, C6id, D3, G5, I3, I3en, I4i, M5, M5d, M6g, M6gd, M6i, M6id, P4de (public preview only), R5, R5d, R6g, R6i, R6id, T3, T3a, and T4g instances, and a long list of AWS services including: Amazon API Gateway, AWS AppConfig, AWS Application Auto Scaling, Amazon Aurora, Aurora PostgreSQL, AWS Budgets, AWS Certificate Manager, AWS CloudFormation, Amazon CloudFront, AWS Cloud Map, AWS CloudTrail, Amazon CloudWatch, Amazon CloudWatch Events, Amazon CloudWatch Logs, AWS CodeBuild, AWS CodeDeploy, AWS Config, AWS Cost Explorer, AWS Database Migration Service, AWS Direct Connect, AWS Directory Service, Amazon DynamoDB, Amazon Elastic Block Store (Amazon EBS), Amazon Elastic Compute Cloud (Amazon EC2), Amazon EC2 Auto Scaling, EC2 Image Builder, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service, Amazon ElastiCache, AWS Elastic Beanstalk, Elastic Load Balancing, Elastic Load Balancing – Network (NLB), Amazon EMR, Amazon EventBridge, AWS Fargate, Glacier, AWS Health Dashboard, AWS Identity and Access Management (IAM), Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, AWS Key Management Service (AWS KMS), AWS Lambda, AWS Marketplace, AWS Mobile SDK for iOS and Android, Amazon OpenSearch Service, AWS Organizations, Amazon Redshift, AWS Resource Access Manager, Amazon Relational Database Service (Amazon RDS), Resource Groups, Amazon Route 53, Amazon Virtual Private Cloud (Amazon VPC), AWS Secrets Manager, AWS Shield Standard, AWS Shield Advanced, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), Amazon Simple Storage Service (Amazon S3), Amazon Simple Workflow Service (Amazon SWF), AWS Step Functions, AWS Support API, AWS Systems Manager, AWS Trusted Advisor, VM Import/Export, AWS VPN, AWS WAF, and AWS X-Ray.

AWS in Israel
According to the Israel Ministry of Economy and Industry, Israel is in the front line of the cloud computing era and “is known to be the ‘start-up nation’ of the number of global start-ups being produced. Over the past decade, Israel has produced over 2,000 start-ups, the majority of these start-ups are driven by software as a service (SaaS). Israeli cloud technology remains a strong promise in the market as new start-ups are continuously penetrating the market.”

AWS began supporting startups in Israel in 2013 through its AWS Activate program. In Israel, AWS works with accelerator organizations such as 8200 EISP, F2 Venture Capital, thejunction, and TechStars as well as venture capital firms like Entrée Capital, Bessemer Venture Partners, Pitango, Vertex Ventures Israel, and Viola Group to support the rapid growth of their portfolio companies.

Back in 2014, we opened an AWS office and a research and development (R&D) center in Israel. Since then, Amazon has expanded its R&D presence in the country, which now includes Prime Air and Alexa Shopping.

In 2015, AWS acquired Annapurna Labs, an Israeli microelectronics company, which has developed advanced compute, networking, security, and storage technologies for AWS—such as AWS-designed Graviton processors, AWS Inferentia, AWS Trainium chips, and the AWS Nitro System.

In 2018, we expanded to new offices in Tel Aviv, including AWS Experience Tel Aviv on Floor28 to support the growth of Israeli startups, enterprises, and government customers through technology-focused events and educational activities. Now, AWS Experience Tel Aviv on Floor28 is an education hub where anyone interested in AWS can attend industry events, workshops, and meetups, and receive free, in-person technical and business guidance from AWS experts.

In 2019, we launched the first AWS infrastructure in Israel, opening an Amazon CloudFront edge location. In 2020, we brought AWS Outposts and AWS Direct Connect to Israel, providing Israeli organizations with the ability to run AWS technology in their own data centers and establish dedicated connections back to the AWS Cloud.

In April 2021, the government of Israel announced that it had selected AWS as its primary cloud provider as part of the Nimbus contract. The Nimbus framework will enable government departments—including the ministries, education, healthcare, and municipalities—to accelerate their digital transformation by using AWS technologies.

AWS continues to invest in upskilling local developers, students, and the next generation of IT leaders in Israel through programs such as AWS Educate, AWS Academy, AWS re/Start, and other Training and Certification programs.

AWS Educate and Academy programs are providing free resources to accelerate cloud-related learning and preparing today’s students in Israel for the jobs of the future. Israel colleges already participating in the AWS Academy program include the Bar Ilan University, Ben-Gurion University of the Negev, Holon Institute of Technology, Jerusalem College of Technology, and University of Haifa. We also launched AWS re/Start to focus on helping unemployed or underemployed individuals to launch a new cloud career. You can now apply to AWS re/Start programs through Appleseeds, Sigma Labs Jerusalem, and Analiza Cyber Intelligence in Israel.

AWS Customers in Israel
We have many amazing customers in Israel who are doing incredible things with AWS, for example:

AI21 Labs – AI21 Labs offers access to its state-of-the-art proprietary language models through AI21 Studio for businesses to build their own generative artificial intelligence applications, as well as its consumer product, Wordtune, the first AI-based writing assistant to understand context and meaning. AI21 Labs scaled to hundreds of GPUs efficiently and cost-effectively to build the Jurassic-2 family of language models. These models were trained with distributed and parallelized infrastructure based on Amazon EC2 P4d instances with 400 Gbps high-performance networking supported by Elastic Fabric Adapter (EFA).

Bank Leumi – Leumi is one of the leading banks in Israel and has over 200 branches across the country and dedicated teams using AWS to build an advanced banking services marketplace. In just 5 months, Leumi migrated 16 on-premises applications from its former Kubernetes solution to Amazon EKS Anywhere with no service interruptions. The bank’s new environment facilitates a consistent, scalable approach to deployments, saving time and money and increasing innovation velocity.

CyberArk – CyberArk is an AWS partner in the identity security industry. Centered on privileged access management, CyberArk provides the most comprehensive security SaaS offering on AWS for any identity—human or machine—across business applications, distributed workforces, hybrid cloud workloads, and throughout the DevOps lifecycle. CyberArk Identity Security Intelligence has integrated with AWS CloudTrail Lake to increase visibility and responsiveness associated with targeted threats. CyberArk Audit also delivers security event information to Amazon Security Lake.

Ichilov Hospital – The I-Medata Innovation Center of Ichilov Hospital uses AWS Control Tower to facilitate the fast, consistent, and secure creation of AWS accounts while protecting sensitive medical data. The center also relies on Amazon SageMaker to enable its scientists to build, train, and deploy advanced machine learning models for early detection of deterioration in COVID-19 patients. They had full protection of sensitive medical data on AWS while continuing to enable the productivity of researchers.

You can find more customer stories from Israel.

Available Now
The new Tel Aviv Region is ready to support your business. You can find a detailed list of the services available in this Region on the AWS Regional Services List.

With this launch, AWS now spans 102 Availability Zones in 32 geographic Regions around the world. We have also announced plans for 12 more Availability Zones and four more Regions in Canada, Malaysia, New Zealand, and Thailand.

To learn more, see the Global Infrastructure page, give it a try, and send feedback through your usual AWS support contacts in Israel.

— Channy

P.S. We’re focused on improving our content to provide a better customer experience, and we need your feedback to do so. Please take this quick survey to share insights on your experience with the AWS Blog. Note that this survey is hosted by an external company, so the link does not lead to our website. AWS handles your information as described in the AWS Privacy Notice.

AWS Week in Review – Agents for Amazon Bedrock, Amazon SageMaker Canvas New Capabilities, and More – July 31, 2023

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-week-in-review-agents-for-amazon-bedrock-amazon-sagemaker-canvas-new-capabilities-and-more-july-31-2023/

This July, AWS communities in ASEAN wrote a new history. First, the AWS User Group Malaysia recently held the first AWS Community Day in Malaysia.

Another significant milestone has been achieved by the AWS User Group Philippines. They just celebrated their tenth anniversary by running two days of AWS Community Day Philippines. Here are a few photos from the event, including Jeff Barr sharing his experiences of attending an AWS User Group meetup in Manila, Philippines, 10 years ago.

Big congratulations to AWS Community Heroes, AWS Community Builders, AWS User Group leaders, and all the volunteers who organized and delivered AWS Community Days! Also, thank you to everyone who attended and helped support our AWS communities.

Last Week’s Launches
We had interesting launches last week, including from AWS Summit, New York. Here are some of my personal highlights:

(Preview) Agents for Amazon Bedrock – You can now create managed agents for Amazon Bedrock to handle tasks using API calls to company systems, understand user requests, break down complex tasks into steps, hold conversations to gather more information, and take actions to fulfill requests.

(Coming Soon) New LLM Capabilities in Amazon QuickSight Q – We are expanding the innovation in QuickSight Q by introducing new LLM capabilities through Amazon Bedrock. These Generative BI capabilities will allow organizations to easily explore data, uncover insights, and facilitate sharing of insights.

AWS Glue Studio support for Amazon CodeWhisperer – You can now write specific tasks in natural language (English) as comments in the Glue Studio notebook, and Amazon CodeWhisperer provides code recommendations for you.

(Preview) Vector Engine for Amazon OpenSearch Serverless – This capability empowers you to create modern ML-augmented search experiences and generative AI applications without the need to handle the complexities of managing the underlying vector database infrastructure.

Last week, Amazon SageMaker Canvas also released a set of new capabilities:

AWS Open-Source Updates
As always, my colleague Ricardo has curated the latest updates for open-source news at AWS. Here are some of the highlights.

cdk-aws-observability-accelerator is a set of opinionated modules to help you set up observability for your AWS environments with AWS native services and AWS-managed observability services such as Amazon Managed Service for Prometheus, Amazon Managed Grafana, AWS Distro for OpenTelemetry (ADOT) and Amazon CloudWatch.

iac-devtools-cli-for-cdk is a command line interface tool that automates many of the tedious tasks of building, adding to, documenting, and extending AWS CDK applications.

Upcoming AWS Events
There are upcoming events that you can join to learn. Let’s start with AWS events:

And let’s learn from our fellow builders and join AWS Community Days:

Open for Registration for AWS re:Invent
We want to be sure you know that AWS re:Invent registration is now open!


This learning conference hosted by AWS for the global cloud computing community will be held from November 27 to December 1, 2023, in Las Vegas.

Pro-tip: You can use the information on the Justify Your Trip page to prove the value of your trip to AWS re:Invent.

Give Us Your Feedback
We’re focused on improving our content to provide a better customer experience, and we need your feedback to do so. Please take this quick survey to share insights on your experience with the AWS Blog. Note that this survey is hosted by an external company, so the link does not lead to our website. AWS handles your information as described in the AWS Privacy Notice.

That’s all for this week. Check back next Monday for another Week in Review.

Happy building!

Donnie

This post is part of our Week in Review series. Check back each week for a quick round-up of interesting news and announcements from AWS!



Perform continuous vulnerability scanning of AWS Lambda functions with Amazon Inspector

Post Syndicated from Manjunath Arakere original https://aws.amazon.com/blogs/security/perform-continuous-vulnerability-scanning-of-aws-lambda-functions-with-amazon-inspector/

This blog post demonstrates how you can activate Amazon Inspector within one or more AWS accounts and be notified when a vulnerability is detected in an AWS Lambda function.

Amazon Inspector is an automated vulnerability management service that continually scans workloads for software vulnerabilities and unintended network exposure. Amazon Inspector scans mixed workloads like Amazon Elastic Compute Cloud (Amazon EC2) instances and container images located in Amazon Elastic Container Registry (Amazon ECR). At re:Invent 2022, we announced Amazon Inspector support for Lambda functions and Lambda layers to provide a consolidated solution for compute types.

Only scanning your functions for vulnerabilities before deployment might not be enough since vulnerabilities can appear at any time, like the widespread Apache Log4j vulnerability. So it’s essential that workloads are continuously monitored and rescanned in near real time as new vulnerabilities are published or workloads are changed.

Amazon Inspector scans are intelligently initiated based on the updates to Lambda functions or when new Common Vulnerabilities and Exposures (CVEs) are published that are relevant to your function. No agents are needed for Amazon Inspector to work, which means you don’t need to install a library or agent in your Lambda functions or layers. When Amazon Inspector discovers a software vulnerability or network configuration issue, it creates a finding which describes the vulnerability, identifies the affected resource, rates the severity of the vulnerability, and provides remediation guidance.
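
You can also query findings programmatically. The following Python sketch lists active critical findings on Lambda functions; the filter keys shown are based on our reading of the Inspector2 ListFindings filter criteria, so verify them against the API reference for your SDK version.

    import boto3

    inspector = boto3.client("inspector2")

    # List active CRITICAL findings on Lambda functions.
    response = inspector.list_findings(
        filterCriteria={
            "severity": [{"comparison": "EQUALS", "value": "CRITICAL"}],
            "resourceType": [{"comparison": "EQUALS", "value": "AWS_LAMBDA_FUNCTION"}],
            "findingStatus": [{"comparison": "EQUALS", "value": "ACTIVE"}],
        },
        maxResults=50,
    )

    for finding in response.get("findings", []):
        print(finding.get("severity"), finding.get("inspectorScore"), finding.get("title"))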

In addition, Amazon Inspector integrates with several AWS services, such as Amazon EventBridge and AWS Security Hub. You can use EventBridge to build automation workflows like getting notified for a specific vulnerability finding or performing an automatic remediation with the help of Lambda or AWS Systems Manager.
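
As a minimal sketch of such an automation target, the following Lambda handler extracts a few fields from an Amazon Inspector finding event delivered by EventBridge. The detail field names are assumptions based on the Amazon Inspector finding format; check them against the EventBridge event schema referenced later in this post.

    import json

    def lambda_handler(event, context):
        # EventBridge delivers the Amazon Inspector finding in the "detail" field.
        detail = event.get("detail", {})

        summary = {
            "title": detail.get("title"),
            "severity": detail.get("severity"),
            "inspectorScore": detail.get("inspectorScore"),
            # Each finding lists the affected resources, such as a Lambda function ARN.
            "resources": [r.get("id") for r in detail.get("resources", [])],
        }

        print(json.dumps(summary))  # Replace with your notification or remediation logic.
        return summary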

In this blog post, you will learn how to do the following:

  1. Activate Amazon Inspector in a single AWS account and AWS Region.
  2. See how Amazon Inspector automated discovery and continuous vulnerability scanning works by deploying a new Lambda function with a vulnerable package dependency.
  3. Receive a near real-time notification when a vulnerability with a specific severity is detected in a Lambda function with the help of EventBridge and Amazon Simple Notification Service (Amazon SNS).
  4. Remediate the vulnerability by using the recommendation provided in the Amazon Inspector dashboard.
  5. Activate Amazon Inspector in multiple accounts or Regions through AWS Organizations.

Solution architecture

Figure 1 shows the AWS services used in the solution and how they are integrated.

Figure 1: Solution architecture overview

The workflow for the solution is as follows:

  1. Deploy a new Lambda function by using the AWS Serverless Application Model (AWS SAM).
  2. Amazon Inspector initiates a scan when a new vulnerability is published, when an existing Lambda function is updated, or when a new Lambda function is deployed. Vulnerabilities are identified in the deployed Lambda function.
  3. Amazon EventBridge receives the events from Amazon Inspector and checks against the rules for specific events or filter conditions.
  4. In this case, an EventBridge rule exists for the Amazon Inspector findings, and the target is defined as an SNS topic to send an email to the system operations team.
  5. The EventBridge rule invokes the target SNS topic with the event data, and an email is sent to the confirmed subscribers in the SNS topic.
  6. The system operations team receives an email with detailed information on the vulnerability, the fixed package versions, the Amazon Inspector score to prioritize, and the impacted Lambda functions. By using the remediation information from Amazon Inspector, the team can now prioritize actions and remediate.

Prerequisites

To follow along with this demo, we recommend that you have the following in place:

  • An AWS account.
  • A command line environment: AWS CloudShell or the AWS CLI. In this post, we recommend CloudShell because it already includes Python, the AWS CLI, and AWS SAM. However, you can also use your own terminal with the AWS CLI, AWS SAM, and Python installed.
  • An AWS Region where Amazon Inspector Lambda code scanning is available.
  • An IAM role in that account with administrator privileges.

The solution in this post includes the following AWS services: Amazon Inspector, AWS Lambda, Amazon EventBridge, AWS Identity and Access Management (IAM), Amazon SNS, AWS CloudShell, and AWS Organizations for activating Amazon Inspector at scale (across multiple accounts).

Step 1: Activate Amazon Inspector in a single account in the Region

The first step is to activate Amazon Inspector in your account in the Region you are using.

To activate Amazon Inspector

  1. Sign in to the AWS Management Console.
  2. Open AWS CloudShell. CloudShell inherits the credentials and permissions of the IAM principal who is signed in to the AWS Management Console. CloudShell comes with the CLIs and runtimes that are needed for this demo (AWS CLI, AWS SAM, and Python).
  3. Use the following command in CloudShell to get the status of the Amazon Inspector activation.
    aws inspector2 batch-get-account-status

  4. Use the following command to activate Amazon Inspector in the default Region for resource type LAMBDA. Other allowed values for resource types are EC2, ECR, and LAMBDA_CODE.
    aws inspector2 enable --resource-types '["LAMBDA"]'

  5. Use the following command to verify the status of the Amazon Inspector activation.
    aws inspector2 batch-get-account-status

You should see a response that shows that Amazon Inspector is enabled for Lambda resources, as shown in Figure 2.

Figure 2: Amazon Inspector status after you enable Lambda scanning

Step 2: Create an SNS topic and subscription for notification

Next, create the SNS topic and the subscription so that you will be notified of each new Amazon Inspector finding.

To create the SNS topic and subscription

  1. Use the following command in CloudShell to create the SNS topic and its subscription, replacing <REGION_NAME>, <AWS_ACCOUNTID>, and <[email protected]> with the relevant values.
    aws sns create-topic --name amazon-inspector-findings-notifier; 
    
    aws sns subscribe \
    --topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier \
    --protocol email --notification-endpoint <[email protected]>

  2. Check the email inbox you entered for <[email protected]>, and in the email from Amazon SNS, choose Confirm subscription.
  3. In the CloudShell console, use the following command to list the subscriptions and verify the topic and email subscription.
    aws sns list-subscriptions

    You should see a response that shows subscription details like the email address and ARN, as shown in Figure 3.

    Figure 3: Subscribed email address and SNS topic

  4. Use the following command, replacing <REGION_NAME> and <AWS_ACCOUNTID>, to send a test message to your subscribed email address and verify that you receive it.
    aws sns publish \
        --topic-arn "arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier" \
        --message "Hello from Amazon Inspector2"

Step 3: Set up Amazon EventBridge with a custom rule and the SNS topic as target

Create an EventBridge rule that will invoke your previously created SNS topic whenever Amazon Inspector finds a new vulnerability with a critical severity.

To set up the EventBridge custom rule

  1. In the CloudShell console, use the following command to create an EventBridge rule named amazon-inspector-findings with filters InspectorScore greater than 8 and severity state set to CRITICAL.
    aws events put-rule \
        --name "amazon-inspector-findings" \
        --event-pattern "{\"source\": [\"aws.inspector2\"],\"detail-type\": [\"Inspector2 Finding\"],\"detail\": {\"inspectorScore\": [ { \"numeric\": [ \">\", 8] } ],\"severity\": [\"CRITICAL\"]}}"

    Refer to Amazon EventBridge event schema for Amazon Inspector events to customize the event pattern for your application needs. A pretty-printed version of this event pattern appears after these steps.

  2. To verify the rule creation, go to the EventBridge console and in the left navigation bar, choose Rules.
  3. Choose the rule with the name amazon-inspector-findings. You should see the event pattern as shown in Figure 4.
    Figure 4: Event pattern for the EventBridge rule to filter on CRITICAL vulnerabilities.

  4. Add the SNS topic you previously created as the target to the EventBridge rule. Replace <REGION_NAME>, <AWS_ACCOUNTID>, and <RANDOM-UNIQUE-IDENTIFIER-VALUE> with the relevant values. For RANDOM-UNIQUE-IDENTIFIER-VALUE, create a memorable and unique string.
    aws events put-targets \
        --rule amazon-inspector-findings \
        --targets "Id"="<RANDOM-UNIQUE-IDENTIFIER-VALUE>","Arn"="arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier"

    Important: Save the target ID. You will need this in order to delete the target in the last step.

  5. Provide permission for Amazon EventBridge to publish to the SNS topic amazon-inspector-findings-notifier.
    aws sns set-topic-attributes --topic-arn "arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier" \
    --attribute-name Policy \
    --attribute-value "{\"Version\":\"2012-10-17\",\"Id\":\"__default_policy_ID\",\"Statement\":[{\"Sid\":\"PublishEventsToMyTopic\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"events.amazonaws.com\"},\"Action\":\"sns:Publish\",\"Resource\":\"arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier\"}]}"
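
For readability, here are the event pattern from step 1 and the topic access policy from step 5, pretty-printed. The content is identical to the escaped JSON in the commands above.

Event pattern (step 1):

    {
      "source": ["aws.inspector2"],
      "detail-type": ["Inspector2 Finding"],
      "detail": {
        "inspectorScore": [{ "numeric": [">", 8] }],
        "severity": ["CRITICAL"]
      }
    }

Topic access policy (step 5):

    {
      "Version": "2012-10-17",
      "Id": "__default_policy_ID",
      "Statement": [
        {
          "Sid": "PublishEventsToMyTopic",
          "Effect": "Allow",
          "Principal": { "Service": "events.amazonaws.com" },
          "Action": "sns:Publish",
          "Resource": "arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier"
        }
      ]
    }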

Step 4: Deploy the Lambda function to the AWS account by using AWS SAM

In this step, you will use an AWS Serverless Application Model (AWS SAM) quick start template to build and deploy a Lambda function with a vulnerable library, in order to generate findings. Learn more about AWS SAM.

To deploy the Lambda function with a vulnerable library

  1. In the CloudShell console, use a prebuilt “hello-world” AWS SAM template to deploy the Lambda function.
    sam init --runtime python3.7 --dependency-manager pip --app-template hello-world --name sam-app

  2. Use the following command to add the vulnerable package python-jwt==3.3.3 to the Lambda function.
    cd sam-app;
    echo -e 'requests\npython-jwt==3.3.3' > hello_world/requirements.txt

  3. Use the following command to build the application.
    sam build

  4. Use the following command to deploy the application with the guided option.
    sam deploy --guided

    This command packages and deploys the application to your AWS account. It provides a series of prompts. Respond to the prompts as follows:

    1. For the stack name, enter a name of your choice.
    2. Accept the default options, except for the following two prompts:
      1. At the HelloWorldFunction may not have authorization defined, Is this okay? [y/N]: prompt, enter y and press Enter.
      2. At the Deploy this changeset? [y/N]: prompt, enter y and press Enter.

Step 5: View Amazon Inspector findings

Amazon Inspector will automatically generate findings when scanning the Lambda function previously deployed. To view those findings, follow the steps below.

To view Amazon Inspector findings for the vulnerability

  1. Navigate to the Amazon Inspector console.
  2. In the left navigation menu, choose All findings to see all of the Active findings, as shown in Figure 5.

    Due to the custom event pattern rule in Amazon EventBridge, even though there are multiple findings for the vulnerable package python-jwt==3.3.3, you will be notified only for the finding that has InspectorScore greater than 8 and severity CRITICAL.

  3. Choose the title of each finding to see detailed information about the vulnerability.
    Figure 5: Example of findings from the Amazon Inspector console

Step 6: Remediate the vulnerability by applying the fixed package version

Now you can remediate the vulnerability by updating the package version as suggested by Amazon Inspector.

To remediate the vulnerability

  1. In the Amazon Inspector console, in the left navigation menu, choose All Findings.
  2. Choose the title of the vulnerability to see the finding details and the remediation recommendations.
    Figure 6: Amazon Inspector finding for python-jwt, with the associated remediation

  3. To remediate, use the following command to update the package version to the fixed version as suggested by Amazon Inspector.
    cd /home/cloudshell-user/sam-app;
    echo -e "requests\npython-jwt==3.3.4" > hello_world/requirements.txt

  4. Use the following command to build the application.
    sam build

  5. Use the following command to deploy the application with the guided option.
    sam deploy --guided

    This command packages and deploys the application to your AWS account. It provides a series of prompts. Respond to the prompts as follows:

    1. For the stack name, enter a name of your choice.
    2. Accept the default options, except for the following two prompts:
      1. At the HelloWorldFunction may not have authorization defined, Is this okay? [y/N]: prompt, enter y and press Enter.
      2. At the Deploy this changeset? [y/N]: prompt, enter y and press Enter.
  6. Amazon Inspector automatically rescans the function after its deployment and reevaluates the findings. At this point, you can navigate back to the Amazon Inspector console, and in the left navigation menu, choose All findings. In the Findings area, you can see that the vulnerabilities are moved from Active to Closed status.

    Due to the custom event pattern rule in Amazon EventBridge, you will be notified by email when the finding status changes to CLOSED.

    Figure 7: Inspector rescan results, showing no open findings after remediation

(Optional) Step 7: Activate Amazon Inspector in multiple accounts and Regions

To benefit from Amazon Inspector scanning capabilities across the accounts that you have in AWS Organizations and in your selected Regions, use the following steps:

To activate Amazon Inspector in multiple accounts and Regions

  1. In the CloudShell console, use the following command to clone the code from the aws-samples inspector2-enablement-with-cli GitHub repo.
    cd /home/cloudshell-user;
    git clone https://github.com/aws-samples/inspector2-enablement-with-cli.git;
    cd inspector2-enablement-with-cli

  2. Follow the instructions from the README.md file.
  3. Configure the file param_inspector2.json with the relevant values, as follows:
    • inspector2_da: The delegated administrator account ID for Amazon Inspector to manage member accounts.
    • scanning_type: The resource types (EC2, ECR, LAMBDA) to be enabled by Amazon Inspector.
    • auto_enable: The resource types to be enabled on every account that is newly attached to the delegated administrator.
    • regions: Because Amazon Inspector is a regional service, provide the list of AWS Regions to enable.
  4. Select the AWS account that would be used as the delegated administrator account (<DA_ACCOUNT_ID>).
  5. Delegate an account as the admin for Amazon Inspector by using the following command.
    ./inspector2_enablement_with_awscli.sh -a delegate_admin -da <DA_ACCOUNT_ID>

  6. Activate the delegated admin by using the following command:
    ./inspector2_enablement_with_awscli.sh -a activate -t <DA_ACCOUNT_ID> -s all

  7. Associate the member accounts by using the following command:
    ./inspector2_enablement_with_awscli.sh -a associate -t members

  8. Wait five minutes.
  9. Enable the resource types (EC2, ECR, LAMBDA) on your member accounts by using the following command:
    ./inspector2_enablement_with_awscli.sh -a activate -t members

  10. Enable Amazon Inspector on the new member accounts that are associated with the organization by using the following command:
    ./inspector2_enablement_with_awscli.sh -auto_enable

  11. Check the Amazon Inspector status in your accounts and in multiple selected Regions by using the following command:
    ./inspector2_enablement_with_awscli.sh -a get_status

There are other options you can use to enable Amazon Inspector in multiple accounts, like AWS Control Tower and Terraform. For the reference architecture for Control Tower, see the AWS Security Reference Architecture Examples on GitHub. For more information on the Terraform option, see the Terraform aws_inspector2_enabler resource page.

Step 8: Delete the resources created in the previous steps

AWS offers a 15-day free trial for Amazon Inspector so that you can evaluate the service and estimate its cost.

To avoid potential charges, delete the AWS resources that you created in the previous steps of this solution (Lambda function, EventBridge target, EventBridge rule, and SNS topic), and deactivate Amazon Inspector.

To delete resources

  1. In the CloudShell console, enter the sam-app folder.
    cd /home/cloudshell-user/sam-app

  2. Delete the Lambda function, entering y when prompted for confirmation.
    sam delete

  3. Remove the SNS target from the Amazon EventBridge rule.
    aws events remove-targets --rule "amazon-inspector-findings" --ids <RANDOM-UNIQUE-IDENTIFIER-VALUE>

    Note: If you don’t remember the target ID, navigate to the Amazon EventBridge console, and in the left navigation menu, choose Rules. Select the rule that you want to delete. Choose CloudFormation, and copy the ID.

  4. Delete the EventBridge rule.
    aws events delete-rule --name amazon-inspector-findings

  5. Delete the SNS topic.
    aws sns delete-topic --topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier

  6. Disable Amazon Inspector.
    aws inspector2 disable --resource-types '["LAMBDA"]'

    Follow the next few steps to roll back changes only if you have performed the activities listed in Step 7: Activate Amazon Inspector in multiple accounts and Regions.

  7. In the CloudShell console, enter the folder inspector2-enablement-with-cli.
    cd /home/cloudshell-user/inspector2-enablement-with-cli

  8. Deactivate the resource types (EC2, ECR, LAMBDA) on your member accounts.
    ./inspector2_enablement_with_awscli.sh -a deactivate -t members -s all

  9. Disassociate the member accounts.
    ./inspector2_enablement_with_awscli.sh -a disassociate -t members

  10. Deactivate the delegated admin account.
    ./inspector2_enablement_with_awscli.sh -a deactivate -t <DA_ACCOUNT_ID> -s all

  11. Remove the delegated account as the admin for Amazon Inspector.
    ./inspector2_enablement_with_awscli.sh -a remove_admin -da <DA_ACCOUNT_ID>

Conclusion

In this blog post, we discussed how you can use Amazon Inspector to continuously scan your Lambda functions, and how to configure an Amazon EventBridge rule and Amazon SNS to send notifications of Lambda function vulnerabilities in near real time. You can then perform remediation activities by using AWS Lambda or AWS Systems Manager. We also showed how to enable Amazon Inspector at scale, in both single and multiple accounts and in one or more Regions.

As of this writing, a new feature to perform code scans for Lambda functions is available. Amazon Inspector can now also scan the custom application code within a Lambda function for code security vulnerabilities such as injection flaws, data leaks, weak cryptography, or missing encryption, based on AWS security best practices. You can use this additional scanning functionality to further protect your workloads.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Amazon Inspector forum or contact AWS Support.

 
Want more AWS Security news? Follow us on Twitter.

Manjunath Arakere

Manjunath is a Senior Solutions Architect in the Worldwide Public Sector team at AWS. He works with Public Sector partners to design and scale well-architected solutions, and he supports their cloud migrations and application modernization initiatives. Manjunath specializes in migration, modernization and serverless technology.

Stéphanie Mbappe

Stéphanie is a Security Consultant with Amazon Web Services. She delights in assisting her customers at every step of their security journey. Stéphanie enjoys learning, designing new solutions, and sharing her knowledge with others.

Migrate your existing SQL-based ETL workload to an AWS serverless ETL infrastructure using AWS Glue

Post Syndicated from Mitesh Patel original https://aws.amazon.com/blogs/big-data/migrate-your-existing-sql-based-etl-workload-to-an-aws-serverless-etl-infrastructure-using-aws-glue/

Data has become an integral part of most companies, and the complexity of data processing is increasing rapidly with the exponential growth in the amount and variety of data. Data engineering teams are faced with the following challenges:

  • Manipulating data to make it consumable by business users
  • Building and improving extract, transform, and load (ETL) pipelines
  • Scaling their ETL infrastructure

Many customers migrating data to the cloud are looking for ways to modernize by using native AWS services to further scale and efficiently handle ETL tasks. In the early stages of their cloud journey, customers may need guidance on modernizing their ETL workload with minimal effort and time. Customers often use many SQL scripts to select and transform the data in relational databases hosted either in an on-premises environment or on AWS and use custom workflows to manage their ETL.

AWS Glue is a serverless data integration and ETL service with the ability to scale on demand. In this post, we show how you can migrate your existing SQL-based ETL workload to AWS Glue using Spark SQL, which minimizes the refactoring effort.

Solution overview

The following diagram describes the high-level architecture for our solution. This solution decouples the ETL and analytics workloads from our transactional data source, Amazon Aurora, and uses Amazon Redshift as the data warehouse solution to build a data mart. We employ AWS Database Migration Service (AWS DMS) for both the full load and continuous replication of changes from Aurora. With its change data capture (CDC) configuration, AWS DMS captures deltas, including deletes from the source database, without custom code and without missing any changes, which is critical for the integrity of the data. Refer to CDC support in DMS to extend the solution for ongoing CDC.

The workflow includes the following steps:

  1. AWS Database Migration Service (AWS DMS) connects to the Aurora data source.
  2. AWS DMS replicates data from Aurora and migrates to the target destination Amazon Simple Storage Service (Amazon S3) bucket.
  3. AWS Glue crawlers automatically infer schema information of the S3 data and integrate into the AWS Glue Data Catalog.
  4. AWS Glue jobs run ETL code to transform and load the data to Amazon Redshift.

For this post, we use the TPCH dataset as sample transactional data. The TPCH dataset consists of eight tables. The relationships between these tables are illustrated in the following diagram.

We use Amazon Redshift as the data warehouse to implement the data mart solution. The data mart fact and dimension tables are created in the Amazon Redshift database. The following diagram illustrates the relationships between the fact (ORDER) and dimension tables (DATE, PARTS, and REGION).

Set up the environment

To get started, we set up the environment using AWS CloudFormation. Complete the following steps:

  1. Sign in to the AWS Management Console with your AWS Identity and Access Management (IAM) user name and password.
  2. Choose Launch Stack and open the page on a new tab:
  3. Choose Next.
  4. For Stack name, enter a name.
  5. In the Parameters section, enter the required parameters.
  6. Choose Next.

  1. On the Configure stack options page, leave all values as default and choose Next.
  2. On the Review stack page, select the check boxes to acknowledge the creation of IAM resources.
  3. Choose Submit.

Wait for the stack creation to complete. You can examine various events from the stack creation process on the Events tab. When the stack creation is complete, you will see the status CREATE_COMPLETE. The stack takes approximately 25–30 minutes to complete.
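
If you prefer to watch the stack status from a script rather than the console, the following minimal Python sketch polls AWS CloudFormation until the stack leaves the in-progress state. The stack name is a placeholder; replace it with the name you entered.

    import time
    import boto3

    cfn = boto3.client("cloudformation")
    stack_name = "my-sql-etl-migration-stack"  # replace with your stack name

    # Poll the stack status until it is no longer in progress.
    while True:
        status = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
        print(status)
        if not status.endswith("_IN_PROGRESS"):
            break
        time.sleep(60)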

This template configures the following resources:

  • The Aurora MySQL instance sales-db.
  • The AWS DMS task dmsreplicationtask-* for full load of data and replicating changes from Aurora (source) to Amazon S3 (destination).
  • AWS Glue crawlers s3-crawler and redshift_crawler.
  • The AWS Glue database salesdb.
  • AWS Glue jobs insert_region_dim_tbl, insert_parts_dim_tbl, and insert_date_dim_tbl. We use these jobs for the use cases covered in this post. We create the insert_orders_fact_tbl AWS Glue job manually using AWS Glue Studio.
  • The Redshift cluster blog_cluster with database sales and fact and dimension tables.
  • An S3 bucket to store the output of the AWS Glue job runs.
  • IAM roles and policies with appropriate permissions.

Replicate data from Aurora to Amazon S3

Now let’s look at the steps to replicate data from Aurora to Amazon S3 using AWS DMS:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the task dmsreplicationtask-* and on the Action menu, choose Restart/Resume.

This will start the replication task to replicate the data from Aurora to the S3 bucket. Wait for the task status to change to Full Load Complete. The data from the Aurora tables is now copied to the S3 bucket under a new folder, sales.
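
If you want to check the task status from the command line instead of the console, a short Python sketch such as the following lists the replication tasks and their current status.

    import boto3

    dms = boto3.client("dms")

    # Print each replication task identifier and its current status.
    for task in dms.describe_replication_tasks()["ReplicationTasks"]:
        print(task["ReplicationTaskIdentifier"], task["Status"])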

Create AWS Glue Data Catalog tables

Now let’s create AWS Glue Data Catalog tables for the S3 data and Amazon Redshift tables:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Connections.
  2. Select RedshiftConnection and on the Actions menu, choose Edit.
  3. Choose Save changes.
  4. Select the connection again and on the Actions menu, choose Test connection.
  5. For IAM role, choose GlueBlogRole.
  6. Choose Confirm.

Testing the connection can take approximately 1 minute. You will see the message “Successfully connected to the data store with connection blog-redshift-connection.” If you have trouble connecting successfully, refer to Troubleshooting connection issues in AWS Glue.

  1. Under Data Catalog in the navigation pane, choose Crawlers.
  2. Select s3_crawler and choose Run.

This will generate eight tables in the AWS Glue Data Catalog. To view the tables created, in the navigation pane, choose Databases under Data Catalog, then choose salesdb.
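
You can also list the crawled tables programmatically. The following short sketch queries the salesdb database created by the stack.

    import boto3

    glue = boto3.client("glue")

    # List the tables that the crawler created in the salesdb database.
    for table in glue.get_tables(DatabaseName="salesdb")["TableList"]:
        print(table["Name"])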

  1. Repeat the steps to run redshift_crawler and generate four additional tables.

If the crawler fails, refer to Error: Running crawler failed.

Create SQL-based AWS Glue jobs

Now let’s look at how the SQL statements are used to create ETL jobs using AWS Glue. AWS Glue runs your ETL jobs in an Apache Spark serverless environment. AWS Glue runs these jobs on virtual resources that it provisions and manages in its own service account. AWS Glue Studio is a graphical interface that makes it simple to create, run, and monitor ETL jobs in AWS Glue. You can use AWS Glue Studio to create jobs that extract structured or semi-structured data from a data source, perform a transformation of that data, and save the result set in a data target.

Let’s go through the steps of creating an AWS Glue job for loading the orders fact table using AWS Glue Studio.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose Create job.
  3. Select Visual with a blank canvas, then choose Create.

  1. Navigate to the Job details tab.
  2. For Name, enter insert_orders_fact_tbl.
  3. For IAM Role, choose GlueBlogRole.
  4. For Job bookmark, choose Enable.
  5. Leave all other parameters as default and choose Save.

  1. Navigate to the Visual tab.
  2. Choose the plus sign.
  3. Under Add nodes, enter Glue in the search bar and choose AWS Glue Data Catalog (Source) to add the Data Catalog as the source.

  1. In the right pane, on the Data source properties – Data Catalog tab, choose salesdb for Database and customer for Table.

  1. On the Node properties tab, for Name, enter Customers.

  1. Repeat these steps for the Orders and LineItem tables.

This concludes creating data sources on the AWS Glue job canvas. Next, we add transformations by combining data from these different tables.

Transform the data

Complete the following steps to add data transformations:

  1. On the AWS Glue job canvas, choose the plus sign.
  2. Under Transforms, choose SQL Query.
  3. On the Transform tab, for Node parents, select all the three data sources.
  4. On the Transform tab, under SQL query, enter the following query:
SELECT orders.o_orderkey        AS ORDERKEY,
       orders.o_orderdate       AS ORDERDATE,
       lineitem.l_linenumber    AS LINENUMBER,
       lineitem.l_partkey       AS PARTKEY,
       lineitem.l_receiptdate   AS RECEIPTDATE,
       lineitem.l_quantity      AS QUANTITY,
       lineitem.l_extendedprice AS EXTENDEDPRICE,
       orders.o_custkey         AS CUSTKEY,
       customer.c_nationkey     AS NATIONKEY,
       CURRENT_TIMESTAMP        AS UPDATEDATE
FROM   orders orders,
       lineitem lineitem,
       customer customer
WHERE  orders.o_orderkey = lineitem.l_orderkey
       AND orders.o_custkey = customer.c_custkey
  1. Update the SQL alias values as shown in the following screenshot.

  1. On the Data preview tab, choose Start data preview session.
  2. When prompted, choose GlueBlogRole for IAM role and choose Confirm.

The data preview process will take a minute to complete.

  1. On the Output schema tab, choose Use data preview schema.

You will see the output schema similar to the following screenshot.

Now that we have previewed the data, we change a few data types.

  1. On the AWS Glue job canvas, choose the plus sign.
  2. Under Transforms, choose Change Schema.
  3. Select the node.
  4. On the Transform tab, update the Data type values as shown in the following screenshot.

Now let’s add the target node.

  1. Choose the Change Schema node and choose the plus sign.
  2. In the search bar, enter target.
  3. Choose Amazon Redshift as the target.

  1. Choose the Amazon Redshift node, and on the Data target properties – Amazon Redshift tab, for Redshift access type, select Direct data connection.
  2. Choose RedshiftConnection for Redshift Connection, public for Schema, and order_table for Table.
  3. Select Merge data into target table under Handling of data and target table.
  4. Choose orderkey for Matching keys.

  1. Choose Save.

AWS Glue Studio automatically generates the Spark code for you. You can view it on the Script tab. If you would like to apply transformations beyond what is available out of the box, you can modify the Spark code. The AWS Glue job uses Apache Spark SQL for the SQL query transformation. To find the available Spark SQL transformations, refer to the Spark SQL documentation.
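
For reference, the generated script is roughly equivalent to registering the Data Catalog sources as temporary views and running the same Spark SQL query. The following is a minimal hand-written sketch, not the exact generated code, and it assumes the salesdb catalog tables are named orders, lineitem, and customer as created by the crawler.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Register each Data Catalog source as a temporary view for Spark SQL.
    for table in ["orders", "lineitem", "customer"]:
        glue_context.create_dynamic_frame.from_catalog(
            database="salesdb", table_name=table
        ).toDF().createOrReplaceTempView(table)

    # An abbreviated version of the SQL Query transform used in the visual job.
    fact_df = spark.sql("""
        SELECT orders.o_orderkey     AS orderkey,
               orders.o_orderdate    AS orderdate,
               lineitem.l_linenumber AS linenumber,
               lineitem.l_quantity   AS quantity,
               customer.c_nationkey  AS nationkey
        FROM orders, lineitem, customer
        WHERE orders.o_orderkey = lineitem.l_orderkey
          AND orders.o_custkey = customer.c_custkey
    """)

    fact_df.show(5)  # The visual job writes this result to Amazon Redshift instead.
    job.commit()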

  1. Choose Run to run the job.

As part of the CloudFormation stack, three other jobs are created to load the dimension tables.

  1. Navigate back to the Jobs page on the AWS Glue console, select the job insert_parts_dim_tbl, and choose Run.

This job uses the following SQL to populate the parts dimension table:

SELECT part.p_partkey,
       part.p_type,
       part.p_brand
FROM   part part
  1. Select the job insert_region_dim_tbl and choose Run.

This job uses the following SQL to populate the region dimension table:

SELECT nation.n_nationkey,
       nation.n_name,
       region.r_name
FROM   nation,
       region
WHERE  nation.n_regionkey = region.r_regionkey
  1. Select the job insert_date_dim_tbl and choose Run.

This job uses the following SQL to populate the date dimension table:

SELECT DISTINCT( l_receiptdate ) AS DATEKEY,
       Dayofweek(l_receiptdate)  AS DAYOFWEEK,
       Month(l_receiptdate)      AS MONTH,
       Year(l_receiptdate)       AS YEAR,
       Day(l_receiptdate)        AS DATE
FROM   lineitem lineitem

You can view the status of the running jobs by navigating to the Job run monitoring section on the Jobs page. Wait for all the jobs to complete. These jobs will load the data into the facts and dimension tables in Amazon Redshift.
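
You can also check job run status programmatically. The following minimal sketch prints the most recent run state for each of the jobs created in this post.

    import boto3

    glue = boto3.client("glue")

    jobs = [
        "insert_orders_fact_tbl",
        "insert_parts_dim_tbl",
        "insert_region_dim_tbl",
        "insert_date_dim_tbl",
    ]

    # Print the most recent run state for each job (for example, RUNNING or SUCCEEDED).
    for job_name in jobs:
        runs = glue.get_job_runs(JobName=job_name, MaxResults=1)["JobRuns"]
        state = runs[0]["JobRunState"] if runs else "NO RUNS YET"
        print(job_name, state)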

To help optimize the resources and cost, you can use the AWS Glue Auto Scaling feature.

Verify the Amazon Redshift data load

To verify the data load, complete the following steps:

  1. On the Amazon Redshift console, select the cluster blog-cluster and on the Query Data menu, choose Query in query editor 2.
  2. For Authentication, select Temporary credentials.
  3. For Database, enter sales.
  4. For User name, enter admin.
  5. Choose Save.

  1. Run the following commands in the query editor to verify that the data is loaded into the Amazon Redshift tables:
SELECT *
FROM   sales.PUBLIC.order_table;

SELECT *
FROM   sales.PUBLIC.date_table;

SELECT *
FROM   sales.PUBLIC.parts_table;

SELECT *
FROM   sales.PUBLIC.region_table;

The following screenshot shows the results from one of the SELECT queries.

Now, for CDC, update the quantity of a line item for order number 1 in the Aurora database using the following query. (To connect to your Aurora cluster, use AWS Cloud9 or any SQL client tool, such as the MySQL command-line client.)

UPDATE lineitem SET l_quantity = 100 WHERE l_orderkey = 1 AND l_linenumber = 4;

AWS DMS replicates the changes to the S3 bucket, as shown in the following screenshot.

Rerunning the AWS Glue job insert_orders_fact_tbl applies the changes to the ORDER fact table, as shown in the following screenshot.

Clean up

To avoid incurring future charges, delete the resources created for the solution:

  1. On the Amazon S3 console, select the S3 bucket created as part of the CloudFormation stack, then choose Empty.
  2. On the AWS CloudFormation console, select the stack that you created initially and choose Delete to delete all the resources created by the stack.

Conclusion

In this post, we showed how you can migrate existing SQL-based ETL to an AWS serverless ETL infrastructure using AWS Glue jobs. We used AWS DMS to migrate data from Aurora to an S3 bucket, then SQL-based AWS Glue jobs to move the data to fact and dimension tables in Amazon Redshift.

This solution demonstrates a one-time data load from Aurora to Amazon Redshift using AWS Glue jobs. You can extend this solution for moving the data on a scheduled basis by orchestrating and scheduling jobs using AWS Glue workflows. To learn more about the capabilities of AWS Glue, refer to AWS Glue.


About the Authors

Mitesh Patel is a Principal Solutions Architect at AWS with specialization in data analytics and machine learning. He is passionate about helping customers build scalable, secure, and cost-effective cloud-native solutions on AWS to drive business growth. He lives in the DC Metro area with his wife and two kids.

Sumitha AP is a Sr. Solutions Architect at AWS. She works with customers and helps them attain their business objectives by designing secure, scalable, reliable, and cost-effective solutions in the AWS Cloud. She has a focus on data and analytics and provides guidance on building analytics solutions on AWS.

Deepti Venuturumilli is a Sr. Solutions Architect at AWS. She works with commercial segment customers and AWS partners to accelerate customers’ business outcomes by providing expertise in AWS services and helping them modernize their workloads. She focuses on data analytics workloads and setting up modern data strategy on AWS.

Deepthi Paruchuri is an AWS Solutions Architect based in NYC. She works closely with customers to build cloud adoption strategy and solve their business needs by designing secure, scalable, and cost-effective solutions in the AWS cloud.
