AWS Security Profiles: Merritt Baer, Principal in OCISO

Post Syndicated from Maddie Bacon original https://aws.amazon.com/blogs/security/aws-security-profiles-merritt-baer-principal-in-ociso/

In the week leading up to AWS re:Invent 2021, we’ll share conversations we’ve had with people at AWS who will be presenting, and get a sneak peek at their work.


How long have you been at Amazon Web Services (AWS), and what do you do in your current role?

I’m a Principal in the Office of the Chief Information Security Officer (OCISO), and I’ve been at AWS about four years. In the past, I’ve worked in all three branches of the U.S. Government, doing security on behalf of the American people.

My current role involves both internal- and external-facing security.

I love having C-level conversations around hard but simple questions about how to prioritize the team’s resources and attention. A lot of my conversations revolve around organizational change, and how to motivate the move to the cloud from a security perspective. Within that, there’s a technical “how”—we might talk about the move to an intelligent multi-account governance structure using AWS Organizations, or the use of appropriate security controls, including remediations like AWS Config Rules and Amazon EventBridge. We might also talk about the ability to do forensics, which in the cloud looks like logging and monitoring with AWS CloudTrail, Amazon CloudWatch, Amazon GuardDuty, and others aggregated in AWS Security Hub.

I also handle strategic initiatives for our security shop, from operational considerations like how we share threat intelligence internally, to the ways we can better streamline our policy and contract vehicles, to the ways that we can incorporate customer feedback into our products and services. The work I do for AWS’ security gives me the empathy and credibility to talk with our customers—after all, we’re a security organization, running on AWS.

What drew you to security?

(Sidebar: it’s a little bit of who I am— I mean, doesn’t everyone rely on polaroid photos? just kidding— kind of :))
 
Merritt Baer polaroid photo

I always wanted to matter.

I was in school post-9/11, and security was an imperative. Meanwhile, I was in Mark Zuckerberg’s undergrad class at Harvard. A lot of the technologies that feel so intimate and foundational—cloud, AI/ML, IoT, and the use of mobile apps, for example—were just gaining traction back then. I loved both emerging tech and security, and I was convinced that they needed to speak to and with one another. I wanted our approach to include considerations around how our systems impact vulnerable people and communities. I became an expert in child pornography law, which continues to be an important area of security definition.

I am someone who wonders what we’re all doing here, and I got into security because I wanted to help change the world. In the words of Poet Laureate Joy Harjo, “There is no world like the one surfacing.”

How do you explain your job to non-tech friends?

I often frame my work relative to what they do, or where we are when we’re chatting. Today, nearly everyone interacts with cloud infrastructure in our everyday lives. If I’m talking to a person who works in finance, I might point to AWS’ role providing IT infrastructure to the global financial system; if we’re walking through a pharmacy I might describe how research and development cycles have accelerated because of high-performance computing (HPC) on AWS.

What are you currently working on that you’re excited about?

Right now, I’m helping customer executives who’ve had a tumultuous (different, not necessarily all bad) couple of years. I help them adjust to a new reality in their employee behavior and access needs, like the move to fully remote work. I listen to their challenges in the ability to democratize security knowledge through their organizations, including embedding security in dev teams. And I help them restructure their consumption of AWS, which has been changing in light of the events of the last two years.

On a strategic level, I have a lot going on … here’s a good sampling: I’ve been championing new work based on customers asking our experts to be more proactive by “snapshotting” metadata about their resources and evaluating that metadata against our well-architected security framework. I work closely with our Trust and Safety team on new projects that both increase automation for high-volume issues and provide more “high touch,” prioritized responses to trusted reporters. I’m also building the business case for security service teams to make their capabilities even more broadly available through extended free tiers and timelines. I’m providing expertise to our private equity folks on a framework for evaluating the maturity of security capabilities of target acquisitions. Finally, I’ve helped lead our efforts to add tighter security controls when AWS teams provide prototyping and co-development work. I live in Miami, Florida, USA, and I also work on building out the local tech ecosystem here!

I’m also working on some of the ways we can address ransomware. During our interview process, Amazon asks folks to give an hour-long presentation on a topic of their choice. I did mine on ransomware in the cloud, and when I came on board I pointed to that as an area of need for security solutions. Now we have a ransomware working group that I help lead, with efforts underway to help customers through both education and architectural guidance, as well as curated solutions with industries and partners, including healthcare.

You’re presenting at AWS re:Invent this year—can you give readers a sneak peek at what you’re covering?

One talk is on cloud-native approaches to ransomware defense, encouraging folks to think innovatively as they mature their IT infrastructure. And a second talk highlights partner solutions that can help meet customers where they are, and improve their anti-ransomware posture using vendors—from MSSPs and systems integrators, to endpoint security, DNS filtering, and custom backup solutions.

What are you hoping the audience will take away from the sessions?

These days, security doesn’t just take the form of security services (like GuardDuty and AWS WAF), but also manifests in the ways you design a cloud-aware architecture. For example, Amazon Aurora, our managed database service, can be cloned; that clone might act as a canary when you see data drift (a canary is a security concept for testing your expectations). You can use this to get back to a known good state.

Security is a bottom line proposition. What I mean by that is:

  1. It’s business-critical to avoid a bad day.
  2. Embracing mature security will enable your entity’s development innovation.
  3. The security of your products is a meaningful part of what you deliver to your customers.

From your perspective, what’s the most important thing to know about ransomware?

Ransomware is a big headline-maker right now, but it’s not new. Most ransomware attacks are not based on zero days; they’re knowable but opportunistic. So, without victim-blaming, I mean to equip us with the confidence to confront the security issue. There’s no need to be ransomed.

I try not to get wrapped around particular issues, and instead emphasize building the foundation right. So sure, we can call it ransomware defense, but we can also point to these security maturity measures as best practices in general.

I think it’s fair to say that you’re passionate about women in tech and in security specifically. You recently presented at the Day of Shecurity conference and the Women in Business Summit, and did an Instagram takeover for Women in CyberSecurity (WiCyS). Why do you feel passionately about this?

I see security as an inherently creative field. As security professionals, we’re capable of freeing the business to get stuff done, and to get it done securely. That sounds simple, and it’s hard!

Any time you’re working in a creative field, you rely on human ingenuity and pragmatism to ensure you’re doing it imaginatively instead of simply accepting old realities. When we want to be creative, we need more of the stuff life is made of: human experience. We know that people who move through the world with different identities and experiences think differently. They approach problems differently. They code differently.

So, I think having women in security is important, both for the women who choose to work in security, and for the security field as a whole.

What advice would you give a woman just starting out in the security industry?

No one is born with a brain full of security knowledge. Technology is human-made and imperfect, and we all had to learn it at some point. Start somewhere. No one is going to tap you on the shoulder and invite you to your life 🙂

Operationally, I recommend:

  • Curate your “elevator pitch” about who you are and what you’re looking for, and be explicit when asking folks for a career conversation or a referral (you can find me on Twitter @MerrittBaer, feel free to send a note).
  • Don’t accept a first job offer—ask for more.
  • Beware of false choices. For example, sometimes there’s a job that’s not in the description—consider writing your own value proposition and pitching it to the organization. This is a field that’s developing all the time, and you may be seeing a need they hadn’t yet solidified.

What’s your favorite Leadership Principle at Amazon and why?

I think Bias for Action takes precedence for me— there’s a business decision here to move fast. We know that comes with some costs and risks, but we’ve made that calculated decision to pursue high velocity.

I have a law degree, and I see the Leadership Principles sort of like the Bill of Rights: they are frequently in tension and sometimes even at odds with one another (for example, Bias for Action and Are Right, A Lot might demand different modes). That is what makes them timeless—yet even more contingent on our interpretation—as we derive value from them. As a security person, I want us to pursue the good, and also to transcend the particular fears of the day.

If you had to pick any other industry, what would you want to do?

Probably public health. I think if I wasn’t doing security, I would want to do something else landscape-level.

Even before I had a daughter, but certainly now that I have a one-year-old, I would calculate the ROI of my life’s existence and my investment in my working life.

That being said, there are days I just need to come home to some unconditional love from my rescue pug, Peanut Butter.
 
Peanut Butter the dog

 

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.


Merritt Baer

Merritt is a Principal in the Office of the CISO. She can be found on Twitter at @merrittbaer and looks forward to meeting you at re:Invent, or in your next executive conversation.

Author

Maddie Bacon

Maddie (she/her) is a technical writer for AWS Security with a passion for creating meaningful content. She previously worked as a security reporter and editor at TechTarget and has a BA in Mathematics. In her spare time, she enjoys reading, traveling, and all things Harry Potter.

Security considerations for Amazon Redshift cross-account data sharing

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/security-considerations-for-amazon-redshift-cross-account-data-sharing/

Data-driven organizations recognize the intrinsic value of data and realize that monetizing data is not just about selling data to subscribers. They understand the indirect economic impact of data and the value that good data brings to the organization. They must democratize data and make it available for business decision-makers to realize its benefits. Today, this often means replicating data across multiple disparate databases, which requires moving the data across various platforms.

Amazon Redshift data sharing lets you securely and easily share live data across Amazon Redshift clusters or AWS accounts for read purposes. Data sharing can improve the agility of your organization by giving you instant, granular, and high-performance access to data across Amazon Redshift clusters without manually copying or moving it. Data sharing provides you with live access to data so that your users can see the most up-to-date and consistent information as it’s updated in Amazon Redshift clusters.

Cross-account data sharing lets you share data across multiple accounts. The accounts can be within the same organization or across different organizations. Because sharing data across accounts could also mean sharing data across different organizations, we have built in additional authorization steps for security control. Review the AWS documentation on cross-account data sharing and a blog from our colleague for detailed steps. We also have a YouTube video on setting up cross-account data sharing for a business use case, which you can refer to as well.

Cross-account data sharing scenario

For this post, we will use the following use case to demonstrate how you can set up cross-account data sharing with the option to control data sharing to specific consumer accounts from the producer account. The producer organization has one AWS account and one Redshift cluster. The consumer organization has two AWS accounts and three Redshift clusters in each of the accounts. The producer organization wants to share data from the producer cluster to one of the consumer accounts, “ConsumerAWSAccount1”, and the consumer organization wants to restrict access to the data share to a specific Redshift cluster, “ConsumerCluster1”. Sharing to the second consumer account, “ConsumerAWSAccount2”, should be disallowed. Similarly, access to the data share should be restricted to the first consumer cluster, “ConsumerCluster1”.

Walkthrough

You can set up this behavior using the following steps:

Setup on the producer account:

  • Create a data share in the Producer cluster and add schema and tables.
  • Set up an IAM policy to control which consumer accounts can be authorized for the data share.
  • Grant data share usage to a consumer AWS account.

Setup on the consumer account:

  • Set up an IAM policy to control which of the consumer Redshift clusters can be associated with the producer data share.
  • Associate a consumer cluster with the data share created on the producer cluster.
  • Create a database referencing the associated data share.

Prerequisites

To set up cross-account data sharing, you should have the following prerequisites:

  • Three AWS accounts: one for the producer, <ProducerAWSAccount1>, and two consumer accounts, <ConsumerAWSAccount1> and <ConsumerAWSAccount2>.
  • AWS permissions to provision Amazon Redshift and create an IAM role and policy.

We assume you have provisioned the required Redshift clusters: one for the producer in the producer AWS account, two Redshift clusters in <ConsumerAWSAccount1>, and optionally one Redshift cluster in <ConsumerAWSAccount2>.

  • Two users in the producer account and two users in consumer account 1:
    • ProducerClusterAdmin
    • ProducerCloudAdmin
    • Consumer1ClusterAdmin
    • Consumer1CloudAdmin

Security controls from producer and consumer

Approved list of consumer accounts from the producer account

When you share data across accounts, the producer admin can grant usage of the data share to a specific account. For additional security to allow the separation of duty between the database admin and the cloud security administrator, organizations might want to have an approved list of AWS accounts that can be granted access. You can achieve this by creating an IAM policy listing all of the approved accounts, and then add this policy to the role attached to the producer cluster.

Creating the IAM Policy for the approved list of consumer accounts

  1. On the AWS IAM Console, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the following policy:
    This is the producer-side policy. Note: you should replace the following text with the specific details for your cluster and account.
    • “Resource”: “*” – Replace “*” with the ARN of the specific data share.
    • <AWSAccountID> – Add one or more consumer account numbers based on the requirement.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allow",
            "Effect": "Allow",
            "Action": [
                "redshift:AuthorizeDataShare",
                "redshift:DeauthorizeDataShare"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "redshift:ConsumerIdentifier": [
                        "<AWSAccountID>"
                    ]
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeDataSharesForConsumer",
                "redshift:DescribeDataSharesForProducer",
                "redshift:DescribeClusters",
                "redshift:DescribeDataShares"
            ],
            "Resource": "*"
        }
    ]
}
  4. From the Amazon Redshift console in the producer AWS account, choose Query Editor V2 and connect to the producer cluster using temporary credentials.
  5. After connecting to the producer cluster, create the data share and add the schema and tables to the data share. Then, grant usage to the consumer accounts <ConsumerAWSAccount1> and <ConsumerAWSAccount2>:
CREATE DATASHARE ds;

ALTER DATASHARE ds ADD SCHEMA PUBLIC;
ALTER DATASHARE ds ADD TABLE table1;
ALTER DATASHARE ds ADD ALL TABLES IN SCHEMA sf_schema;

GRANT USAGE ON DATASHARE ds TO ACCOUNT '<ConsumerAWSAccount1>';
GRANT USAGE ON DATASHARE ds TO ACCOUNT '<ConsumerAWSAccount2>';

Note: the GRANT will succeed even if an account is not listed in the IAM policy. However, the authorize step validates against the list of approved accounts in the IAM policy and fails if the account is not in the approved list.

  6. Now the producer admin can authorize the data share by using the AWS Command Line Interface (AWS CLI) or the console. When you authorize the data share to <ConsumerAWSAccount1>, the authorization succeeds.
aws redshift authorize-data-share --data-share-arn <DATASHARE ARN> --consumer-identifier <ConsumerAWSAccount1>

  7. When you authorize the data share to <ConsumerAWSAccount2>, the authorization fails, because the IAM policy we set up in the earlier step does not allow data sharing to <ConsumerAWSAccount2>.
aws redshift authorize-data-share --data-share-arn <DATASHARE ARN> --consumer-identifier <ConsumerAWSAccount2>

We have demonstrated how you can restrict access to the data share created on the producer cluster to specific consumer accounts by using a conditional construct with an approved account list in the IAM policy.
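If you prefer to script the producer-side setup instead of using the console, here is a minimal boto3 sketch of the same flow; the policy file, role name, data share ARN, and account IDs are hypothetical placeholders. It creates the approved-accounts policy shown above, attaches it to the role associated with the producer cluster, and then authorizes the data share for an approved consumer account.

import boto3

iam = boto3.client("iam")
redshift = boto3.client("redshift")

# Hypothetical placeholders -- replace with your own values.
POLICY_FILE = "approved-consumer-accounts.json"   # the IAM policy JSON shown above
PRODUCER_CLUSTER_ROLE = "ProducerClusterRole"     # role attached to the producer cluster
DATASHARE_ARN = "arn:aws:redshift:us-east-1:111122223333:datashare:<namespace>/ds"
CONSUMER_ACCOUNT_1 = "444455556666"

# Create the approved-accounts policy and attach it to the producer cluster's role.
with open(POLICY_FILE) as f:
    policy = iam.create_policy(
        PolicyName="RedshiftApprovedConsumerAccounts",
        PolicyDocument=f.read(),
    )
iam.attach_role_policy(
    RoleName=PRODUCER_CLUSTER_ROLE,
    PolicyArn=policy["Policy"]["Arn"],
)

# Authorize the data share for an approved consumer account. Authorizing an account
# that is not in the policy's approved list fails, as described above.
redshift.authorize_data_share(
    DataShareArn=DATASHARE_ARN,
    ConsumerIdentifier=CONSUMER_ACCOUNT_1,
)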

Approved list of Redshift clusters on consumer account

When you grant access to a data share to a consumer account, the consumer admin can determine which Redshift clusters can read the data share by associating it with the appropriate cluster. If the organization wants to control which of the Redshift clusters the admin can associate with the data share, then you can specify the approved list of Redshift clusters by using the cluster ARN in an IAM policy.

  1. On the AWS IAM Console, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the following policy:
    This is the consumer-side policy. Note: you should replace the following text with the specific details for your cluster and account.
    • Replace “<ProducerDataShareARN>” with the ARN of the data share created on the producer cluster.
    • Replace “<ConsumerRedshiftCluster1ARN>” with the ARN of the first Redshift cluster in AWS consumer account 1.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "redshift:AssociateDataShareConsumer",
                "redshift:DisassociateDataShareConsumer"
            ],
            "Resource": "<ProducerDataShareARN>",
            "Condition": {
                "StringEquals": {
                    "redshift:ConsumerArn": [
                        "<ConsumerRedshiftCluster1ARN>"
                    ]
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeDataSharesForConsumer",
                "redshift:DescribeDataSharesForProducer",
                "redshift:DescribeClusters",
                "redshift:DescribeDataShares"
            ],
            "Resource": "*"
        }
    ]
}
  4. Now the consumer admin can associate the data share by using the AWS CLI or the console. When you associate Redshift cluster 1, <ConsumerRedshiftCluster1ARN>, the association succeeds.
aws redshift associate-data-share-consumer --no-associate-entire-account --data-share-arn <ProducerDataShareARN> --consumer-arn <ConsumerRedshiftCluster1ARN>

  5. When you associate Redshift cluster 2, <ConsumerRedshiftCluster2ARN>, the association fails, because the IAM policy only allows association with <ConsumerRedshiftCluster1ARN>.
aws redshift associate-data-share-consumer --no-associate-entire-account --data-share-arn <ProducerDataShareARN> --consumer-arn <ConsumerRedshiftCluster2ARN>

  6. After associating consumer Redshift cluster 1 with the producer data share, from the Amazon Redshift console in the consumer AWS account, choose Query Editor V2 and connect to the consumer cluster using temporary credentials.
  7. After connecting to the consumer cluster, you can create a database referencing the data share on the producer cluster, and then start querying the data:
CREATE DATABASE ds_db FROM DATASHARE ds OF ACCOUNT '<PRODUCER ACCOUNT>' NAMESPACE '<PRODUCER CLUSTER NAMESPACE>';

Optional:
CREATE EXTERNAL SCHEMA Schema_from_datashare FROM REDSHIFT DATABASE 'ds_db' SCHEMA 'public';

GRANT USAGE ON DATABASE ds_db TO <user or group>;

GRANT USAGE ON SCHEMA Schema_from_datashare TO GROUP Analyst_group;

SELECT * FROM ds_db.public.producer_t1;

You can use the query editor or the new Amazon Redshift Query Editor V2 to run the statements above to read the shared data from the producer by creating an external database reference from the consumer cluster.

Conclusion

We have demonstrated how you can restrict access to the data share created on the producer cluster to specific consumer accounts by listing approved accounts in the IAM policy.

On the consumer side, we have also demonstrated how you can restrict access to a particular Redshift cluster on the consumer account for the data share created on the producer cluster by listing approved Redshift cluster(s) in the IAM policy. Enterprises and businesses can use this approach to control the boundaries of Redshift data sharing at account and cluster granularity.

We encourage you to try cross-account data sharing with these additional security controls to securely share data across Amazon Redshift clusters both within your organizations and with your customers or partners.


About the Authors

Rajesh Francis is a Senior Analytics Customer Experience Specialist at AWS. He specializes in Amazon Redshift and focuses on helping to drive AWS market and technical strategy for data warehousing and analytics. Rajesh works closely with large strategic customers to help them adopt our new services and features, develop long-term partnerships, and feed customer requirements back to our product development teams to guide the direction of our product offerings.

Kiran Sharma is a Senior Big Data Consultant for AWS Professional Services. She works with our customers to architect and implement big data solutions on a variety of projects on AWS.

Eric Hotinger is a Software Engineer at AWS. He enjoys solving seemingly impossible problems in the areas of analytics, streaming, containers, and serverless.

Visualizing AWS Step Functions workflows from the Amazon Athena console

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/visualizing-aws-step-functions-workflows-from-the-amazon-athena-console/

This post is written by Dhiraj Mahapatro, Senior Specialist SA, Serverless.

In October 2021, AWS announced visualizing AWS Step Functions from the AWS Batch console. Now you can also visualize Step Functions from the Amazon Athena console.

Amazon Athena is an interactive query service that makes it easier to analyze Amazon S3 data using standard SQL. Athena is a serverless service and can interact directly with data stored in S3. Athena can process unstructured, semistructured, and structured datasets.

AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Step Functions workflows manage failures, retries, parallelization, service integrations, and observability so builders can focus on business logic. Athena is one of the service integrations that are available for Step Functions.

This blog walks through the Step Functions integration in the Amazon Athena console. It shows how you can visualize and operate Athena queries at scale using Step Functions.

Introducing workflow orchestration in Amazon Athena console

AWS customers store large amounts of historical data on S3 and query the data using Athena to get results quickly. They also use Athena to process unstructured data or analyze structured data as part of a data processing pipeline.

Data processing involves discrete steps for ingesting, processing, storing the transformed data, and post-processing, such as visualizing or analyzing the transformed data. Each step involves multiple AWS services. With Step Functions workflow integration, you can orchestrate these steps. This helps to create repeatable and scalable data processing pipelines as part of a larger business application and visualize the workflows in the Athena console.

With Step Functions, you can run queries on a schedule or based on an event by using Amazon EventBridge. You can poll long-running Athena queries before moving to the next step in the process, and handle errors without writing custom code. Combining these two services provides developers with a single method that is scalable and repeatable.
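To make that concrete, here is a minimal boto3 sketch of the polling loop that the Step Functions integration can handle for you; the query, database, workgroup, and output location are hypothetical placeholders.

import time
import boto3

athena = boto3.client("athena")

# Hypothetical placeholders -- replace with your own query, database, workgroup, and bucket.
start = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",
    QueryExecutionContext={"Database": "my_database"},
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = start["QueryExecutionId"]

# Poll until the query finishes -- this is the waiting and error handling that a
# Step Functions workflow can do for you without custom code.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    print(results["ResultSet"]["Rows"][:5])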

Step Functions workflows in the Amazon Athena console allow orchestration of Athena queries with Step Functions state machines:

Athena console

Using Athena query patterns from Step Functions

Execute multiple queries

In Athena, you run SQL queries in the Athena console against Athena workgroups. With Step Functions, you can run Athena queries in a sequence or run independent queries simultaneously in parallel using a parallel state. Step Functions also natively handles errors and retries related to Athena query tasks.

Workflow orchestration in the Athena console provides these capabilities to run and visualize multiple queries in Step Functions. For example:

UI flow

  1. Choose Get Started from Execute multiple queries.
  2. From the pop-up, choose Create your own workflow and select Continue.

A new browser tab opens with the Step Functions Workflow Studio. The designer shows a pre-created workflow pattern template. The workflow loads data from a data source, running two Athena queries in parallel. The results are then published to an Amazon SNS topic.

Alternatively, choosing Deploy a sample project from the Get Started pop-up deploys a sample Step Functions workflow.

Get started flow

This option creates a state machine. You then review the workflow definition, deploy an AWS CloudFormation stack, and run the workflow in the Step Functions console.

Deploy and run

Once deployed, the state machine is visible in the Step Functions console as:

State machines

Select the AthenaMultipleQueriesStateMachine to land on the details page:

Details page

The CloudFormation template provisions the required AWS Glue database, S3 bucket, an Athena workgroup, the required AWS Lambda functions, and the SNS topic for query results notification.

To see the Step Functions workflow in action, choose Start execution. Keep the optional name and input and choose Start execution:

Start execution

The state machine completes the tasks successfully by Executing multiple queries in parallel using Amazon Athena and Sending query results using the SNS topic:

Successful execution

The state machine used the Amazon Athena StartQueryExecution and GetQueryResults tasks. The Workflow orchestration section in the Athena console now highlights this newly created Step Functions state machine:

Athena console

Any state machine that uses this task in Step Functions in this account is listed here as a state machine that orchestrates Athena queries.

Query large datasets

You can also ingest an extensive dataset in Amazon S3, partition it using AWS Glue crawlers, then run Amazon Athena queries against that partition.

Select Get Started from the Query large datasets pop-up, then choose Create your own workflow and Continue. This action opens the Step Functions Workflow Studio with the following pattern. The AWS Glue crawler starts and partitions large datasets for Athena to query in the subsequent query execution task:

Query large datasets

Step Functions allows you to combine Glue crawler tasks and Athena queries to partition where necessary before querying and publishing the results.

Keeping data up to date

You can also use Athena to query a target table to fetch data, then update it with new data from other sources using Step Functions’ choice state. The choice state in Step Functions provides branching logic for a state machine.

Keep data up to date

You are not limited to the three patterns shown in workflow orchestration in the Athena console. You can start from scratch and build a Step Functions state machine by navigating to the bottom right and using Create state machine:

Create state machine

Create State Machine in the Athena console opens a new tab showing the Step Functions console’s Create state machine page.

Choosing authoring method

Refer to Building a state machine in AWS Step Functions Workflow Studio for additional details.

Step Functions integrates with all of Amazon Athena’s API actions

In September 2021, Step Functions announced integration support for 200 AWS services to enable easier workflow automation. With this announcement, Step Functions can integrate with all Amazon Athena API actions today.

Step Functions can automate the lifecycle of an Athena query: create/read/update/delete/list workgroups, create/read/update/delete/list data catalogs, and more.

Other AWS service integrations

Step Functions’ integration with the AWS SDK provides support for 200 AWS services and over 9,000 API actions. Athena tasks in Step Functions can evolve by integrating available AWS services in the workflow for pre- and post-processing needs.

For example, you can read Athena query results that are written to an S3 bucket by using a GetObject S3 task AWS SDK integration in Step Functions. You can combine different AWS services into a single business process so that they can ingest data through Amazon Kinesis, process it with AWS Lambda or Amazon EMR jobs, and send notifications to interested parties via Amazon SNS, Amazon SQS, or Amazon EventBridge to trigger other parts of their business application.

There are multiple ways to build around an Amazon Athena job task. Refer to AWS SDK service integrations and optimized integrations for Step Functions for additional details.

Important considerations

Workflow orchestrations in the Athena console only show Step Functions state machines that use Athena’s optimized API integrations. This includes StartQueryExecution, StopQueryExecution, GetQueryExecution, and GetQueryResults.

Step Functions state machines do not show in the Athena console when:

  1. A state machine uses any other AWS SDK Athena API integration task.
  2. The APIs are invoked inside a Lambda function task using an AWS SDK client (like Boto3 or Node.js or Java).

Cleanup

First, empty DataBucket and AthenaWorkGroup so the stack can be deleted successfully. To delete the sample application stack, use the latest version of the AWS CLI and run:

aws cloudformation delete-stack --stack-name <stack-name>
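Before running the delete-stack command above, you can empty the data bucket with a short boto3 sketch; the bucket name below is a hypothetical placeholder for the bucket created by the sample stack.

import boto3

# Hypothetical placeholder -- use the bucket name created by the sample stack.
bucket = boto3.resource("s3").Bucket("athena-sample-data-bucket")

# Delete all objects (and versions, if versioning is enabled) so the stack can remove the bucket.
bucket.objects.all().delete()
bucket.object_versions.all().delete()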

Alternatively, delete the sample application stack in the CloudFormation console by selecting the stack and choosing Delete:

Stack deletion

Conclusion

The Amazon Athena console now provides an integration with AWS Step Functions workflows. You can use the provided patterns to create and visualize Step Functions workflows directly from the Amazon Athena console. Step Functions workflows that use Athena’s optimized API integration appear in the Athena console. To learn more about Amazon Athena, read the user guide.

To get started, open the Workflows page in the Athena console. Select Create Athena jobs with Step Functions Workflows to deploy a sample project, if you are new to Step Functions.

For more serverless learning resources, visit Serverless Land.

How to Connect Your QNAP NAS to Backblaze B2 Cloud Storage

Post Syndicated from Troy Liljedahl original https://www.backblaze.com/blog/guide-qnap-backup-b2-cloud-storage/

Network attached storage (NAS) devices are a popular solution for data storage, sharing files for remote collaboration, syncing files that are part of a workflow, and more. QNAP, one of the leading NAS manufacturers, makes it incredibly easy to back up and/or sync your business or personal data for these purposes with its Hybrid Backup Sync (HBS) application. HBS consolidates backup, restoration, and synchronization functions into a single application.

Protecting your data with a NAS is a great first step, but you shouldn’t stop there. NAS devices are still vulnerable to any kind of on-premises disaster like fires, floods, and tornados. They’re also not safe from ransomware attacks that might hit your network. To truly protect your data, it’s important to back up or sync to an off-site cloud storage destination like Backblaze B2 Cloud Storage. Backblaze B2 offers a geographically distanced location for your data for $5/TB per month, and you can also embed it into your NAS-based workflows to streamline access across multiple locations.

Read on for more information on whether you should use backup or sync for your purposes and how to connect your QNAP NAS to Backblaze B2 step-by-step. We’ve even provided videos that show you just how easy it is—it typically takes less than 15 minutes!

➔ Download Our Complete NAS Guide

Should I Back Up or Sync?

It’s easy to confuse backup and sync. They’re essentially both making a copy of your data, but they have different use cases. It’s important to understand the difference so you’re getting the right protection and accessibility for your data.

Check out the table below. You’ll see that backup is best for being able to recover from a data disaster, including the ability to access previous versions of data. However, if you’re just looking for a mirror copy of your data, sync functionality is all you need. Sync is also useful as part of remote workflows: you can sync your data between your QNAP and Backblaze B2, and then remote workers can pull down the most up-to-date files from the B2 cloud.

A table comparing Backup vs. Sync.

Because Hybrid Backup Sync provides both functions in one application, you should first identify which feature you truly need. The setup process is similar, but you will need to take different steps to configure backup vs. sync in HBS.

How to Set Up Your Backblaze B2 Account

Now that you’ve determined whether you want to back up or sync your data, it’s time to create your Backblaze B2 Cloud Storage account to securely protect your on-premises data.

If you already have a B2 Cloud Storage account, feel free to skip ahead. Otherwise, you can sign up for an account and get started with 10GB of free storage to test it out.

Ready to get started? You can follow along with the directions in this blog or take a look at our video guides. Greg Hamer, Senior Technical Evangelist, demonstrates how to get your data into B2 Cloud Storage in under 15 minutes using HBS for either backup or sync.

Video: Back Up QNAP to Backblaze B2 Cloud Storage with QNAP Hybrid Backup Sync

Video: Sync QNAP to Backblaze B2 Cloud Storage with QNAP Hybrid Backup Sync

How to Set Up a Bucket, Application Key ID, and Application Key

Once you’ve signed up for a Backblaze B2 account, you’ll need to create a bucket, an Application Key ID, and an Application Key. This may sound like a lot, but all you need are a few clicks, a couple of names, and less than a minute!

  1. On the Buckets page of your account, click the Create a Bucket button.
     A screenshot of the B2 Cloud Storage Buckets page.
  2. Give your bucket a name and enable encryption for added security.
     An image showing the Create a Bucket page with security features to be enabled.
  3. Click the Create a Bucket button and you should see your new bucket on the Buckets page.
     An image showing a successfully created bucket.
  4. Navigate to the App Keys page of your account and click Add a New Application Key.
  5. Name your Application Key and click the Create New Key button. Make sure that your key has both read and write permissions (the default option).
  6. Your Application Key ID and Application Key will appear on your App Keys page. Important: Make sure to copy these somewhere secure as the Application Key will not appear again!

How to Set Up QNAP’s Hybrid Backup Sync to Work With B2 Cloud Storage

To set up your QNAP with Backblaze B2 sync support, you’ll need access to your B2 Cloud Storage account. You’ll also need your B2 Cloud Storage account ID, Application Key, and bucket name—all of which are available after you log in to your Backblaze account. Finally, you’ll need the Hybrid Backup Sync application installed in QTS. You’ll need QTS 4.3.3 or later and Hybrid Backup Sync v2.1.170615 or later.

To configure a backup or sync job, simply follow the rest of the steps in this integration guide or reference the videos posted above. Once you follow the rest of the configuration steps, you’ll have a set-it-and-forget-it solution in place.

What Can You Do With Backblaze B2 and QNAP Hybrid Backup Sync?

With QNAP’s Hybrid Backup Sync software, you can easily back up and sync data to the cloud. Here’s some more information on what you can do to make the most of your setup.

Hybrid Backup Sync 3.0

QNAP and Backblaze B2 users can take advantage of Hybrid Backup Sync, as explained above. Hybrid Backup Sync is a powerful tool that provides true backup capability with features like version control, client-side encryption, and block-level deduplication. QNAP’s operating system, QTS, continues to deliver innovation and add thrilling new features. The ability to preview backed up files using the QuDedup Extract Tool, a feature first released in QTS 4.4.1, allowed QNAP users to save on bandwidth costs.

You can download the latest QTS update here and Hybrid Backup Sync is available in the App Center on your QNAP device.

Hybrid Mount and VJBOD Cloud

The Hybrid Mount and VJBOD Cloud apps allow QNAP users to designate a drive in their system to function as a cache while accessing B2 Cloud Storage. This allows users to interact with Backblaze B2 just like you would a folder on your QNAP device while using Backblaze B2 as an active storage location.

Hybrid Mount and VJBOD Cloud are both included in QTS versions 4.4.1 and higher, and function as a storage gateway on a file-based or block-based level, respectively. Hybrid Mount enables Backblaze B2 to be used as a file server and is ideal for online collaboration and file-level data analysis. VJBOD Cloud is ideal for a large number of small files or singular massively large files (think databases!) since it’s able to update and change files on a block-level basis. Both apps offer the ability to connect to B2 Cloud Storage via popular protocols to fit any environment, including Server Message Block (SMB), Apple Filing Protocol (AFP), Network File System (NFS), File Transfer Protocol (FTP), and WebDAV.

QuDedup

QuDedup introduces client-side deduplication to the QNAP ecosystem. This helps users at all levels save on space on their NAS by avoiding redundant copies in storage. Backblaze B2 users have something to look forward to as well since these savings carry over to cloud storage via the HBS 3.0 update.

Why Backblaze B2?

QNAP continues to innovate and unlock the potential of B2 Cloud Storage in the NAS ecosystem. If you haven’t given B2 Cloud Storage a try yet, now is the time. You can get started with Backblaze B2 and your QNAP NAS right now, and make sure your NAS is synced securely and automatically to the cloud.

The post How to Connect Your QNAP NAS to Backblaze B2 Cloud Storage appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Choose the right storage tier for your needs in Amazon OpenSearch Service

Post Syndicated from Changbin Gong original https://aws.amazon.com/blogs/big-data/choose-the-right-storage-tier-for-your-needs-in-amazon-opensearch-service/

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) enables organizations to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open-source, distributed search and analytics suite derived from Elasticsearch. Amazon OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (versions 1.5 to 7.10), and visualization capabilities powered by OpenSearch Dashboards and Kibana (versions 1.5 to 7.10).

In this post, we present three storage tiers of Amazon OpenSearch Service—hot, UltraWarm, and cold storage—and discuss how to effectively choose the right storage tier for your needs. This post can help you understand how these storage tiers work together and what the trade-offs are for each. To choose a storage tier of Amazon OpenSearch Service for your use case, you need to consider the performance, latency, and cost of each tier in order to make the right decision.

Amazon OpenSearch Service storage tiers overview

There are three different storage tiers for Amazon OpenSearch Service: hot, UltraWarm, and cold. The following diagram illustrates these three storage tiers.

Hot storage

Hot storage for Amazon OpenSearch Service is used for indexing and updating, while providing fast access to data. Standard data nodes use hot storage, which takes the form of instance store or Amazon Elastic Block Store (Amazon EBS) volumes attached to each node. Hot storage provides the fastest possible performance for indexing and searching new data.

You get the lowest latency for reading data in the hot tier, so you should use the hot tier to store frequently accessed data driving real-time analysis and dashboards. As your data ages, you access it less frequently and can tolerate higher latency, so keeping data in the hot tier is no longer cost-efficient.

If you want to have low latency and fast access to the data, hot storage is a good choice for you.

UltraWarm storage

UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) with related caching solutions to improve performance. UltraWarm offers significantly lower costs per GiB for read-only data that you query less frequently and don’t need the same performance as hot storage. Although you can’t modify the data while in UltraWarm, you can move the data to the hot storage tier for edits before moving it back.

When calculating UltraWarm storage requirements, you consider only the size of the primary shards. When you query for the list of shards in UltraWarm, you still see the primary and replicas listed. Both shards are stubs for the same, single copy of the data, which is in Amazon S3. The durability of data in Amazon S3 removes the need for replicas, and Amazon S3 abstracts away any operating system or service considerations. In the hot tier, accounting for one replica, 20 GB of index uses 40 GB of storage. In the UltraWarm tier, it’s billed at 20 GB.

The UltraWarm tier acts like a caching layer on top of the data in Amazon S3. UltraWarm moves data from Amazon S3 onto the UltraWarm nodes on demand, which speeds up access for subsequent queries on that data. For that reason, UltraWarm works best for use cases that access the same, small slice of data multiple times. You can add or remove UltraWarm nodes to increase or decrease the amount of cache against your data in Amazon S3 to optimize your cost per GB. To dial in your cost, be sure to test using a representative dataset. To monitor performance, use the WarmCPUUtilization and WarmJVMMemoryPressure metrics. See UltraWarm metrics for a complete list of metrics.
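As an illustration of monitoring those metrics, here is a minimal boto3 sketch that pulls the WarmCPUUtilization metric mentioned above from Amazon CloudWatch; the domain name and account ID are hypothetical placeholders.

from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical placeholders -- replace with your domain name and account ID.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",  # Amazon OpenSearch Service metrics are published under the AWS/ES namespace
    MetricName="WarmCPUUtilization",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-opensearch-domain"},
        {"Name": "ClientId", "Value": "111122223333"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

# Print the last hour of datapoints in time order.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))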

The combined CPU cores and RAM allocated to UltraWarm nodes affects performance for simultaneous searches across shards. We recommend deploying enough UltraWarm instances so that you store no more than 400 shards per ultrawarm1.medium.search node and 1,000 shards per ultrawarm1.large.search node (including both primaries and replicas). We recommend a maximum shard size of 50 GB for both hot and warm tiers. When you query UltraWarm, each shard uses a CPU and moves data from Amazon S3 to local storage. Running single or concurrent queries that access many indexes can overwhelm the CPU and local disk resources. This can cause longer latencies through inefficient use of local storage, and even cause cluster failures.

UltraWarm storage requires OpenSearch 1.0 or later, or Elasticsearch version 6.8 or later.

If you have large amounts of read-only data and want to balance the cost and performance, use UltraWarm for your infrequently accessed, older data.

Cold storage

Cold storage is optimized to store infrequently accessed or historical data at $0.024 per GB per month. When you use cold storage, you detach your indexes from the UltraWarm tier, making them inaccessible. You can reattach these indexes in a few seconds when you need to query that data. Cold storage is a great fit for scenarios in which a low ROI necessitates an archive or delete action on historical data, or if you need to conduct research or perform forensic analysis on older data with Amazon OpenSearch Service.

Cold storage doesn’t have specific instance types because it doesn’t have any compute capacity attached to it. You can store any amount of data in cold storage.

Cold storage requires OpenSearch 1.0 or later, or Elasticsearch version 7.9 or later and UltraWarm.

Manage storage tiers in OpenSearch Dashboards

OpenSearch Dashboards installed on your Amazon OpenSearch Service domain provides a useful UI for managing indexes in different storage tiers on your domain. From the OpenSearch Dashboards main menu, you can view all indexes in hot, UltraWarm, and cold storage. You can also see the indexes managed by Index State Management (ISM) policies. OpenSearch Dashboards enables you to migrate indexes between UltraWarm and cold storage, and monitor index migration status, without using the AWS Command Line Interface (AWS CLI) or configuration API. For more information on OpenSearch Dashboards, see Using OpenSearch Dashboards with Amazon OpenSearch Service.

Cost considerations

The hot tier requires you to pay for what is provisioned, which includes the hourly rate for the instance type. Storage is either Amazon EBS or a local SSD instance store. For Amazon EBS-only instance types, additional EBS volume pricing applies. You pay for the amount of storage you deploy.

UltraWarm nodes charge per hour just like other node types, but you only pay for the storage actually stored in Amazon S3. For example, although the instance type ultrawarm1.large.elasticsearch provides up to 20 TiB addressable storage on Amazon S3, if you only store 2 TiB of data, you’re only billed for 2 TiB. Like the standard data node types, you also pay an hourly rate for each UltraWarm node. For more information, see Pricing for Amazon OpenSearch Service.

Cold storage doesn’t incur compute costs, and like UltraWarm, you’re only billed for the amount of data stored in Amazon S3. There are no additional transfer charges when moving data between cold and UltraWarm storage.

Example use case

Let’s look at an example with 1 TB of source data per day, 7 days hot, 83 days warm, 365 days cold. For more information on sizing the cluster, see Sizing Amazon OpenSearch Service domains.

For hot storage, you can estimate a baseline with the calculation: storage needed = (daily source data in bytes * 1.25) * (number of replicas + 1) * number of days of retention. Following the best practice of two replicas, the minimum storage requirement to retain 7 TB of data on the hot tier is (7 TB * 1.25) * (2 + 1) = 26.25 TB. For this amount of storage, we need 6x R6g.4xlarge.search instances, given the Amazon EBS size limit.

We also need to verify the CPU requirements. With 1 TB of daily data and a 50 GB maximum shard size, we need 25 primary shards for each day’s index (1 TB * 1.25 / 50 GB = 25). With two replicas, that gives 75 active shards, so the total vCPU needed is 75 * 1.5 = 112.5 vCPUs. This means 8x R6g.4xlarge.search instances. The domain also requires three dedicated c6g.xlarge.search leader nodes.

When calculating UltraWarm storage requirements, you consider only the size of the primary shards, because that’s the amount of data stored in Amazon S3. For this example, the total primary shard size for warm storage is 83*1.25=103.75 TB. Each ultrawarm1.large.search instance has 16 CPU cores and can address up to 20 TiB of storage on Amazon S3. A minimum of six ultrawarm1.large.search nodes is recommended. You’re charged for the actual storage, which is 103.75 TB.

For cold storage, you only pay for the cost of storing 365*1.25=456.25 TB on Amazon S3. The following table contains a breakdown of the monthly costs (USD) you’re likely to incur. This assumes a 1-year reserved instance for the cluster instances with no upfront payment in the US East (N. Virgina) Region.

Cost Type | Pricing | Usage | Cost per month
Instance usage | R6g.4xlarge.search = $0.924 per hour | 8 instances * 730 hours in a month = 5,840 hours | 5,840 hours * $0.924 = $5,396.16
Instance usage | c6g.xlarge.search = $0.156 per hour | 3 instances (leader nodes) * 730 hours in a month = 2,190 hours | 2,190 hours * $0.156 = $341.64
Instance usage | ultrawarm1.large.search = $2.68 per hour | 6 instances * 730 hours = 4,380 hours | 4,380 hours * $2.68 = $11,738.40
Storage cost | Hot storage (Amazon EBS), general purpose SSD (gp3) = $0.08 per GB per month | 7 days hot = 26.25 TB (26,880 GB) | 26,880 GB * $0.08 = $2,150.40
Storage cost | UltraWarm managed storage = $0.024 per GB per month | 83 days warm = 103.75 TB (106,240 GB) per month | 106,240 GB * $0.024 = $2,549.76
Storage cost | Cold storage on Amazon S3 = $0.022 per GB per month | 365 days cold = 456.25 TB (467,200 GB) per month | 467,200 GB * $0.022 = $10,278.40

The total monthly cost is $32,454.76. The hot tier costs $7,888.20, UltraWarm costs $14,288.16, and cold storage is $10,278.40. UltraWarm allows 83 days of additional retention for slightly more cost than the hot tier, which only provides 7 days. For nearly the same cost as the hot tier, the cold tier stores the primary shards for up to 1 year.
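To adapt the sizing arithmetic above to your own ingest and retention numbers, here is a minimal Python sketch of the same calculations; the 1.25 index overhead, 50 GB shard size, and 1.5 vCPU per active shard follow the formulas used in this example.

# Baseline sizing math from the example above; adjust the inputs for your workload.
daily_source_tb = 1.0
index_overhead = 1.25          # index plus overhead multiplier
replicas = 2
hot_days, warm_days, cold_days = 7, 83, 365
max_shard_gb = 50
vcpu_per_active_shard = 1.5

hot_storage_tb = daily_source_tb * index_overhead * (replicas + 1) * hot_days
primary_shards_per_day = (daily_source_tb * index_overhead * 1000) / max_shard_gb
active_shards = primary_shards_per_day * (replicas + 1)
vcpus_needed = active_shards * vcpu_per_active_shard

warm_storage_tb = daily_source_tb * index_overhead * warm_days   # primary shards only
cold_storage_tb = daily_source_tb * index_overhead * cold_days   # primary shards only

print(f"Hot tier storage:  {hot_storage_tb:.2f} TB")                              # 26.25 TB
print(f"Active shards:     {active_shards:.0f}, vCPUs needed: {vcpus_needed:.1f}") # 75 shards, 112.5 vCPUs
print(f"UltraWarm storage: {warm_storage_tb:.2f} TB")                              # 103.75 TB
print(f"Cold storage:      {cold_storage_tb:.2f} TB")                              # 456.25 TB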

Conclusion

Amazon OpenSearch Service supports three integrated storage tiers: hot, UltraWarm, and cold storage. Based on your data retention, query latency, and budgeting requirements, you can choose the best strategy to balance cost and performance. You can also migrate data between the storage tiers. To start using these storage tiers, sign in to the AWS Management Console, or use the AWS SDK or AWS CLI, and enable the corresponding storage tier.


About the Author

Changbin Gong is a Senior Solutions Architect at Amazon Web Services (AWS). He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Changbin enjoys reading, running, and traveling.

Rich Giuli is a Principal Solutions Architect at Amazon Web Services (AWS). He works within a specialized group helping ISVs accelerate adoption of cloud services. Outside of work, Rich enjoys running and playing guitar.

AWS Cloud Adoption Framework (CAF) 3.0 is Now Available

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-cloud-adoption-framework-caf-3-0-is-now-available/

The AWS Cloud Adoption Framework (AWS CAF) is designed to help you to build and then execute a comprehensive plan for your digital transformation. Taking advantage of AWS best practices and lessons learned from thousands of customer engagements, the AWS CAF will help you to identify and prioritize transformation opportunities, evaluate and improve your cloud readiness, and iteratively evolve the roadmaps that you follow to guide your transformation.

Version 3.0 Now Available
I am happy to announce that version 3.0 of the AWS CAF is now available. This version represents what we have learned since we released version 2.0, with a focus on digital transformation and an emphasis on the use of data & analytics.

The framework starts by identifying six foundational perspectives (Business, People, Governance, Platform, Security, and Operations), which together comprise 47 discrete capabilities, up from 31 in the previous version.

From there it identifies four transformation domains (Technology, Process, Organization, and Product) that must participate in a successful digital transformation.

With the capabilities and the transformation domains as a base, the AWS Cloud Adoption Framework then recommends a set of four iterative and incremental cloud transformation phases:

Envision – Demonstrate how the cloud will accelerate business outcomes. This phase is delivered as a facilitator-led interactive workshop that will help you to identify transformation opportunities and create a foundation for your digital transformation.

Align – Identify capability gaps across the foundational capabilities. This phase also takes the form of a facilitator-led workshop and results in an action plan.

Launch – Build and deliver pilot initiatives in production, while demonstrating incremental business value.

Scale – Expand pilot initiatives to the desired scale while realizing the anticipated & desired business benefits.

All in all, the AWS Cloud Adoption Framework is underpinned by hundreds of AWS offerings and programs that help you achieve specific business and technical outcomes.

Getting Started with the AWS Cloud Adoption Framework
You can use the following resources to learn more and to get started:

Web Page – Visit the AWS Cloud Adoption Framework web page.

White Paper – Download and read the AWS CAF Overview.

AWS Account Team – Your AWS account team stands ready to assist you with any and all of the phases of the AWS Cloud Adoption Framework.

Jeff;

OWASP Top 10 Deep Dive: Defending Against Server-Side Request Forgery

Post Syndicated from Neville O'Neill original https://blog.rapid7.com/2021/11/23/owasp-top-10-deep-dive-defending-against-server-side-request-forgery/

OWASP Top 10 Deep Dive: Defending Against Server-Side Request Forgery

Web applications are no longer just assets to a company — they’re an organization’s identity, playing a major role in how customers, clients, and users see a brand. Due to this importance, web apps have also become a primary target for attack.

Over the years, these applications have grown more complex and bigger in size. Meanwhile, attackers have gotten more skillful. This has created greater opportunities for malicious actors to exploit potential vulnerabilities in web applications.

With the recent release of the 2021 Open Web Application Security Project (OWASP) Top 10, we’re taking a deep dive into some of the new items added to the list. So far, we’ve covered injection and vulnerable and outdated components. In this post, we’ll focus on server-side request forgery (SSRF), which comes in at number 10 on the updated list.

SSRF attacks present a range of risks, from potentially stealing sensitive information from the application to bringing the entire web application down. These attacks target systems that sit behind firewalls and restrict access from non-trusted networks. Protecting your application from such attacks is vitally important. Here, we’ll talk about the different types of SSRF attacks and go over some mitigation techniques.

What is server-side request forgery (SSRF)?

SSRF allows an attacker to force the server-side application into making arbitrary web requests to an unintended domain. This can result in the server making connections to internal-only services or arbitrary external systems.

A successful SSRF attack can result in unauthorized actions or access to data within the organization, either in the vulnerable application itself or on other back-end systems that the application can communicate with. In some situations, the SSRF vulnerability may even allow an attacker to perform arbitrary command execution. This can result in:

  • Information exposure
  • Internal reconnaissance
  • Denial-of-Service (DoS) attack
  • Remote code execution (RCE)

In general, SSRF attacks are made possible by a lack of user input validation in the web application. Without strict validation, an attacker can alter parameters that control what gets executed server-side, for example by injecting potentially malicious commands or establishing HTTP connections to arbitrary systems. Vulnerabilities arise when the web application is unable to identify and validate requests from trusted applications, or when it can send requests to any external IP address or domain.

A closer look

Consider a scenario where the target web application provides functionality for importing, publishing, or reading data using a URL query parameter. The user can control the source of the data accessed by changing the value of the query parameter, which modifies the web request made by the server.

Once the manipulated request is received by the server, it will attempt to read the data by making a request to the user-supplied URL. If the web application lacks sufficient validation of user supplied data (in this case the source URL), then an attacker could potentially supply a URL that returns information from services that aren’t directly exposed publicly.

In some cases, the application server is able to interact with other back-end systems that are not directly reachable by users. Such systems often have private IP addresses and are designed not to be accessed publicly. Internal back-end systems may contain sensitive functionality that can be accessed without authentication by anyone who is able to interact with the systems.

A common example of this is cloud server metadata. Cloud services like Azure and AWS provide a representational state transfer (REST) service for querying metadata about the service itself. For example, AWS provides Instance Metadata Service (IMDS), which is used for querying information about an Amazon EC2 instance. This service is not publicly accessible and can only be accessed from the EC2 instance itself, by making a local HTTP request on http://169.254.169.254/.
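
As an aside, here is a minimal sketch (assuming the requests library and code running on an EC2 instance) of how the instance metadata service is queried locally. It uses the token-based IMDSv2 flow, which is worth enforcing because it makes classic one-shot SSRF requests against the metadata endpoint much harder:

# Hedged sketch: querying EC2 instance metadata from inside an instance (IMDSv2).
import requests

METADATA_HOST = "http://169.254.169.254"

# Step 1: request a short-lived session token.
token = requests.put(
    f"{METADATA_HOST}/latest/api/token",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    timeout=2,
).text

# Step 2: use the token to read a metadata path.
instance_id = requests.get(
    f"{METADATA_HOST}/latest/meta-data/instance-id",
    headers={"X-aws-ec2-metadata-token": token},
    timeout=2,
).text
print(instance_id)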

These metadata paths can hold important configuration details and authentication keys, which makes endpoints that expose them prime targets for attackers looking to exploit SSRF vulnerabilities in applications with weak input validation.

Other examples of target data include database HTTP interfaces (such as those exposed by NoSQL databases like MongoDB), internal REST interfaces, and standard file structures.

Types of SSRF

Based on how a server responds to the request, SSRF can be divided into two types.

Basic SSRF: This is when data from the malicious, forced back-end request is reflected in the application front-end. An attacker uses basic SSRF when they want to exfiltrate data from the server directly or access unauthorized functionality.

Blind SSRF: As the name describes, with this type of SSRF attack, the application is forced to make a back-end HTTP request to a malicious domain, but the attacker doesn’t get data back from the server directly. The response from the back-end request triggers an action on the target without being reflected in the application front-end. Attackers use this type of SSRF when they want to trigger actions or changes on the victim server without needing to see the response.

Typical attack approach

Next, we’ll look at a sample attack configuration that an attacker might use to test for SSRF vulnerabilities by attempting to acquire metadata from an Azure instance.

<AttackConfig>
  <Id>SSRF_13</Id>
  <Description><![CDATA[Azure Metadata]]></Description>
  <CustomParameterList>
    <CustomParameter>
      <Name>AttackString</Name>
      <Value>
        <![CDATA[http://169.254.169.254/metadata/instance/compute?api-version=2019-08-15]]>
      </Value>
    </CustomParameter>
    <CustomParameter>
      <Name>vulnregex</Name>
      <Value>"(azEnvironment|osType|resourceId|vmSize)":\"</Value>
    </CustomParameter>
  </CustomParameterList>
</AttackConfig>

Upon identifying an injection point, for example a POST parameter sent in the body of the request, the attacker may attempt to inject the value wrapped in CDATA (in our example, http://169.254.169.254/metadata/instance/compute?api-version=2019-08-15).

That might look something like this:

POST /users/search HTTP/1.0
Content-Type: application/x-www-form-urlencoded

searchApi=http://169.254.169.254/metadata/instance/compute?api-version=2019-08-15

The injected value points to the endpoint that returns metadata about the Azure instance the web application runs on. If the web application is vulnerable to SSRF, no input validation will reject this malicious URL, and the application will make an arbitrary HTTP request that results in Azure metadata being reflected in the web response.

If the application returns a valid response, the attacker can then search the web response for the pattern identified by “vulnregex” above. Matches indicate information about the Azure instance, such as its environment name, operating system, available resources, or storage size, and are a strong indication that the forged request succeeded and the application is vulnerable to SSRF attacks.

Validation

You can validate the above attack by attempting to view the information yourself. Navigate to the location in the application where a query parameter is passed in the URL and inject the value:

http://169.254.169.254/metadata/instance/compute?api-version=2019-08-15

When the forged request is submitted, you should look to see if any unexpected, sensitive information is returned in the response. In this case, since we are injecting an instance metadata domain, relevant information like operating system and storage size should be returned. If it is, this provides confirmation that the application is vulnerable to SSRF. An attacker could leverage this further to access and possibly even alter information in the metadata directory for that instance.
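
If you would rather script this check than use a browser, a rough sketch along these lines sends the same forged request and searches the response for metadata fields. The target URL and parameter name are placeholders for your own application:

# Hedged sketch of the manual validation step above.
import re
import requests

TARGET = "https://example-app.test/users/search"   # hypothetical injection point
PAYLOAD = "http://169.254.169.254/metadata/instance/compute?api-version=2019-08-15"
VULN_REGEX = r'"(azEnvironment|osType|resourceId|vmSize)":"'

resp = requests.post(TARGET, data={"searchApi": PAYLOAD}, timeout=10)

if re.search(VULN_REGEX, resp.text):
    print("Response contains Azure instance metadata: likely vulnerable to SSRF")
else:
    print("No metadata markers found in the response")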

Sample vulnerable code

public String documentPreview(HttpServletRequest httpRequest, Model model) {
  // Extract the raw "url=" query parameter; note there is no validation of its value.
  String queryStringParams = httpRequest.getQueryString();
  String queryString = StringUtils.substringAfter(queryStringParams, "url=");

  if (StringUtils.isBlank(queryString)) {
    log.error("Missed 'url' query param");
    return "preview";
  }

  try {
    // The user-controlled URL is passed straight to the storage service.
    DownloadFileResponse downloadFileResponse = storageService.load(queryString);

    model.addAttribute("image", new String(Base64.getEncoder().encode(
        IOUtils.toByteArray(downloadFileResponse.getContent()))));
  } catch (Exception e) {
    // Exception handling here.
  }

  return "preview";
}

// Inside storageService.load(), the URL is fetched without any validation:
HttpGet httpGet = new HttpGet(url);

This example code has been created to upload images to an application and render them. However, it is vulnerable to SSRF attacks that allow the attacker to make arbitrary requests to internal systems, such as the instance metadata service.

The documentPreview() method is used for rendering an uploaded image file. It works by extracting a pre-signed image location URL passed via the “url=” parameter and assigning it to the variable named queryString. This variable is then passed into a storageService method, which loads the image from where it is stored.

The load() method invokes HttpGet() to retrieve the image. However, without proper input validation on the “url=” parameter, HttpGet() will perform a GET request against whatever URL is supplied via that parameter, including malicious ones.

Sample fixed code and remediation

The standard approach for preventing SSRF attacks can include denylist- and allowlist-based input validation for the URL.

Denylisting can include blocking addresses such as 127.0.0.1 or 169.254.169.254 so that an attacker cannot reach internal information by injecting them as parameters. This is useful when the application is required to send requests to external IP addresses or domains.

When the application only needs to send requests to trusted, known applications, allowlist validation is available: the web application accepts only certain values as valid parameters. Keep in mind that URLs can embed credentials before the hostname using the @ character, so validation should confirm the actual hostname being requested rather than relying on simple string matching.

To remediate our above example, the approach would be to implement some allowlist validation, as we only need to load images from a trusted single file storage service.

You could use regex to see if the parameter matches the trusted file storage hostname:

//Regex validation for a data having a simple format
if(Pattern.matches("http:\/\/trustedimages.com.*", queryString)){
    //Continue the processing because the input data is valid
	HttpGet httpGet = new HttpGet(url);
}else{
    //Stop the processing and reject the request
}

After this code is implemented, only parameters beginning with http://trustedimages.com/ can be passed to the HttpGet() method, which prevents attackers from reaching hostnames outside of that domain.
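
As a complementary illustration, here is a rough sketch of the same allowlist idea in Python (the trusted hostname is the same assumption as above). Parsing the URL and comparing the scheme and hostname exactly avoids bypasses that loose regex or substring checks can allow, such as http://trustedimages.com.evil.example/ or credential tricks with the @ character:

# Hedged sketch: allowlist validation of a user-supplied URL before fetching it.
from urllib.parse import urlparse

import requests

ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_HOSTS = {"trustedimages.com"}   # assumed trusted file storage host

def is_allowed(url: str) -> bool:
    parsed = urlparse(url)
    # Exact scheme and hostname comparison; no substring matching.
    return parsed.scheme in ALLOWED_SCHEMES and parsed.hostname in ALLOWED_HOSTS

def fetch_image(url: str) -> bytes:
    if not is_allowed(url):
        raise ValueError("URL rejected by allowlist")
    return requests.get(url, timeout=10).content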

Fighting a new contender

Don’t let the No. 10 spot fool you — SSRF is a serious threat that more than deserves its recognition in this year’s OWASP Top 10 list. In fact, 2021 is SSRF’s first year on the OWASP list, and security pros should expect to encounter this threat more and more in the coming years. But if you’re effectively testing your applications and remediating issues quickly and correctly, you’ll be prepared to spot and resolve SSRF vulnerabilities before an attacker exploits them.

Announcing Argo for Spectrum

Post Syndicated from Achiel van der Mandele original https://blog.cloudflare.com/argo-spectrum/

Announcing Argo for Spectrum

Today we’re excited to announce the general availability of Argo for Spectrum, a way to turbo-charge any TCP-based application. With Argo for Spectrum, you can reduce latency and packet loss and improve connectivity for any TCP application, including common protocols like Minecraft, Remote Desktop Protocol, and SFTP.

The Internet — more than just a browser

When people think of the Internet, many of us think about using a browser to view websites. Of course, it’s so much more! We often use other ways to connect to each other and to the resources we need for work. For example, you may interact with servers for work using SSH File Transfer Protocol (SFTP), git or Remote Desktop software. At home, you might play a video game on the Internet with friends.

To help protect these services against DDoS attacks, Spectrum launched in 2018, extending Cloudflare’s DDoS protection to any TCP- or UDP-based protocol. Customers use it for a wide variety of use cases, including protecting video streaming (RTMP), gaming, and internal IT systems. Spectrum also supports common VoIP protocols such as SIP and RTP, which have recently seen an increase in DDoS ransomware attacks. A lot of these applications are also highly sensitive to performance issues. No one likes waiting for a file to upload or dealing with a lagging video game.

Latency and throughput are the two metrics people generally discuss when talking about network performance. Latency refers to the amount of time a piece of data (a packet) takes to traverse between two systems. Throughput refers to the number of bits you can actually send per second. This blog will discuss how these two interact and how we improve them with Argo for Spectrum.

Argo to the rescue

There are a number of factors that cause poor performance between two points on the Internet, including network congestion, the distance between the two points, and packet loss. This is a problem many of our customers have, even on web applications. To help, we launched Argo Smart Routing in 2017, a way to reduce latency (or time to first byte, to be precise) for any HTTP request that goes to an origin.

That’s great for folks who run websites, but what if you’re working on an application that doesn’t speak HTTP? Up until now people had limited options for improving performance for these applications. That changes today with the general availability of Argo for Spectrum. Argo for Spectrum offers the same benefits as Argo Smart Routing for any TCP-based protocol.

Argo for Spectrum takes the same smarts from our network traffic and applies them to Spectrum. At the time of writing, Cloudflare sits in front of approximately 20% of the Alexa top 10 million websites. That means we see, in near real time, which networks are congested, which are slow, and which are dropping packets. We use that data to provision faster routes, which send packets through the Internet faster than normal routing. Argo for Spectrum works exactly the same way, using the same intelligence and routing plane but extending it to any TCP-based application.

Performance

But what does this mean for real application performance? To find out, we ran a set of benchmarks on Catchpoint. Catchpoint is a service that allows you to set up performance monitoring from all over the world. Tests are repeated at intervals and aggregate results are reported. We wanted to use a third party such as Catchpoint to get objective results (as opposed to running the tests ourselves).

For our test case, we used a file server in the Netherlands as our origin. We provisioned various tests on Catchpoint to measure file transfer performance from various places in the world: Rabat, Tokyo, Los Angeles and Lima.

Announcing Argo for Spectrum
Throughput of a 10MB file. Higher is better.

Depending on location, transfers saw increases of up to 108% (for locations such as Tokyo) and 85% on average. Why is it so much faster? The answer is bandwidth delay product. In layman’s terms, bandwidth delay product means that the higher the latency, the lower the throughput. This is because with transmission protocols such as TCP, we need to wait for the other party to acknowledge that they received data before we can send more.

As an analogy, let’s assume we’re operating a water cleaning facility. We send unprocessed water through a pipe to a cleaning facility, but we’re not sure how much capacity the facility has! To test, we send an amount of water through the pipe. Once the water has arrived, the facility will call us up and say, “we can easily handle this amount of water at a time, please send more.” If the pipe is short, the feedback loop is quick: the water will arrive, and we’ll immediately be able to send more without having to wait. If we have a very, very long pipe, we have to stop sending water for a while before we get confirmation that the water has arrived and there’s enough capacity.

The same happens with TCP: we send an amount of data onto the wire and wait for confirmation that it arrived. If the latency is high, it reduces the throughput because we’re constantly waiting for confirmation. If latency is low, we can sustain a high throughput. With Spectrum and Argo, we help in two ways: first, Spectrum terminates the TCP connection close to the user, meaning that latency for that link is low. Second, Argo reduces the latency between our edge and the origin. In concert, they create a set of low-latency connections, resulting in a low overall bandwidth delay product between users and origin. The result is a much higher throughput than you would otherwise get.
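
To put rough numbers on this (purely illustrative, not measurements from the tests above), you can estimate the throughput ceiling of a single TCP connection as window size divided by round-trip time. Splitting one long path into two shorter hops raises the ceiling of each hop:

# Illustrative back-of-the-envelope calculation, not measured data.
def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Approximate TCP throughput ceiling: window size / round-trip time."""
    return (window_bytes * 8) / (rtt_ms / 1000) / 1_000_000

window = 256 * 1024  # assume a 256 KiB effective window

# One long path (~250 ms RTT) vs. two shorter hops (~30 ms and ~60 ms)
print(max_throughput_mbps(window, 250))        # ~8.4 Mbps end to end
print(min(max_throughput_mbps(window, 30),     # ~69.9 Mbps user <-> edge
          max_throughput_mbps(window, 60)))    # ~35.0 Mbps edge <-> origin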

Argo for Spectrum supports any TCP-based protocol. This includes commonly used protocols like SFTP, git (over SSH), RDP, and SMTP, but also media streaming and gaming protocols such as RTMP and Minecraft. Setting up Argo for Spectrum is easy: when creating a Spectrum application, just hit the “Argo Smart Routing” toggle. Any traffic will automatically be smart routed.

Argo for Spectrum covers much more than just these applications: we support any TCP-based protocol. If you’re interested, reach out to your account team today to see what we can do for you.

A preview of Amazon’s AL2022 distribution

Post Syndicated from original https://lwn.net/Articles/876696/rss

Amazon has announced
a preview release of its upcoming AL2022 distribution. The company plans
to support AL2022 for five years after its release.

AL2022 uses the Fedora project as its upstream to provide customers
with a wide variety of the latest software, such as updated
language runtimes, as part of quarterly releases. In addition,
AL2022 has SELinux enabled and enforced by default.

Young people can name a piece of space history with Astro Pi Mission Zero

Post Syndicated from Claire Given original https://www.raspberrypi.org/blog/free-beginner-coding-activity-astro-pi-mission-zero-name-space-history/

Your young people don’t need to wait to become astronauts to be part of a space mission! In Mission Zero, the free beginners’ coding activity of the European Astro Pi Challenge, young people can create a simple computer program to send to the International Space Station (ISS) today.

The International Space Station.
The International Space Station, where your young people’s Mission Zero code could run soon! © ESA–L. Parmitano, CC BY-SA 3.0 IGO

This year, young people taking part in Astro Pi Mission Zero have the historic chance to help name the special Raspberry Pi computers we are sending up to the ISS for the Astro Pi Challenge. Their voices will decide the names of these unique pieces of space exploration hardware.

Astronaut Samantha Cristoforetti in the ISS's cupola.
Samantha Cristoforetti is one of the ESA astronauts who will be on the ISS when young people’s Mission Zero code runs. © ESA

Your young people can become part of a space mission today!

The European Astro Pi Challenge is a collaboration by us and ESA Education. Astro Pi Mission Zero is free, open to all young people up to age 19 from eligible countries*, and it’s designed for beginner coders.

Logo of Mission Zero, part of the European Astro Pi Challenge.

You can support participants easily, whether at home, in the classroom, or in a youth club. Simply sign up as a mentor and let your young people follow the step-by-step instructions we provide (in 19 European languages!) for writing their Mission Zero code online. Young people can complete Mission Zero in around an hour, and they don’t need any previous coding experience.

A mother and daughter do a coding activity together at a laptop at home.

Mission Zero is the perfect coding activity for parents and their children at home, for STEM or Scouts club leaders and attendees, and for teachers and students who are new to computer programming. You don’t need any special tech for Mission Zero participants. Any computer with a web browser and internet connection works for Mission Zero, because everything is done online.

We need young people to help name the Raspberry Pis we’re sending to space

Mission Zero participants follow our step-by-step instructions to create a simple program that takes a humidity reading on board the ISS and displays it for the astronauts — together with the participants’ own unique messages. And as part of their messages, they can vote for the name of the new hardware for the Astro Pi Challenge, hardware with Raspberry Pi computers at its heart.
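
For a sense of scale, a Mission Zero style program really is only a few lines. The sketch below is an illustrative example using the Sense HAT library, not the official project instructions, which walk participants through the exact code to write:

# Illustrative sketch of a Mission Zero style program (not the official starter code).
from sense_hat import SenseHat

sense = SenseHat()

# Read the humidity on board and show it, followed by a short personal message.
humidity = sense.get_humidity()
sense.show_message("Humidity: {:.1f}%".format(humidity))
sense.show_message("Hello from Earth!")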

Astro Pi MK II hardware.
The shiny new Raspberry Pi-powered hardware for the Astro Pi Challenge, which will replace the Raspberry Pi-powered Astro Pi units that have run Astro Pi participants’ code on board the ISS every year since 2015.

The new Astro Pi hardware, which will travel up in a rocket to the ISS on 21 December, is so new that these special augmented computers don’t even have names yet. Participants in Astro Pi Mission Zero get to vote for a name inspired by our list of ten renowned European scientists. Their vote will be part of the message they send to space.

SpaceX’s Falcon 9 rocket carrying the Crew Dragon spits fire as it lifts off from Kennedy Space Center in Florida.
A SpaceX rocket will deliver the special Raspberry Pi computers to the ISS. © SpaceX

What do your young people want to say in space?

Your young people’s messages to the ISS astronauts can say anything they like (apart from swear words, of course). Maybe they want to send some encouraging words to the astronauts or tell them a joke. They can even design a cool pixel art image to show on the Astro Pi hardware’s display:

Pixel art from Astro Pi Mission Zero participants.
Some of the pixel art from last year’s Astro Pi Mission Zero participants.

Whatever else they code for their Mission Zero entry, they’re supporting the astronauts with their important work on board the ISS. Since Mission Zero participants tell the Astro Pi hardware to read and display the humidity level inside the ISS, they provide helpful information for the astronauts as they go about their tasks.

Their own place in space history

After a participant’s Mission Zero code has run and their message has been shown in the ISS, we’ll send you a special certificate for them so you can commemorate their space mission.

The certificate will feature their name, the exact date and time their code ran, and a world map to mark the place on Earth above which the ISS was while their message was visible up there in space.

10 key things about Astro Pi Mission Zero

  1. It’s young people’s unique chance to be part of a real space mission
  2. Participation is free
  3. Participants send the ISS astronauts their own unique message
  4. This year only, participants can help name the two special Raspberry Pi computers that are travelling up to the ISS
  5. Mission Zero is open to young people up to age 19 who live in eligible countries (more about eligibility here)
  6. It’s a beginners’ coding activity with step-by-step instructions, available in 19 languages
  7. Completing the activity takes about one hour — at home, in the classroom, or in a Scouts or coding club session
  8. The activity can be done online in a web browser on any computer
  9. Participants will receive a special certificate to help celebrate their space mission
  10. Mission Zero is open until 18 March 2022

If you don’t want to let any young people in your life miss out on this amazing opportunity, sign up as their Mission Zero mentor today.


* The European Astro Pi Challenge is run as a collaboration by us at the Raspberry Pi Foundation and ESA Education. That’s why participants need to be from an ESA Member State, or from Slovenia, Canada, Latvia, Lithuania, or Malta, which have agreements with ESA.

If you live elsewhere, it’s possible to partner with Mission Zero mentors and young people in an eligible country. You can work together to support the young people to form international Mission Zero teams that write programs together.

If you live elsewhere and cannot partner with people in an eligible country, Mission Zero is still an awesome and inspiring project for your young people to try out coding. While these young people’s code unfortunately won’t run on the ISS, they will receive a certificate to mark their efforts.

The post Young people can name a piece of space history with Astro Pi Mission Zero appeared first on Raspberry Pi.

Offset lag metric for Amazon MSK as an event source for Lambda

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/offset-lag-metric-for-amazon-msk-as-an-event-source-for-lambda/

This post written by Adam Wagner, Principal Serverless Solutions Architect.

Last year, AWS announced support for Amazon Managed Streaming for Apache Kafka (MSK) and self-managed Apache Kafka clusters as event sources for AWS Lambda. Today, AWS adds a new OffsetLag metric to Lambda functions with MSK or self-managed Apache Kafka event sources.

Offset in Apache Kafka is an integer that marks the current position of a consumer. OffsetLag is the difference in offset between the last record written to the Kafka topic and the last record processed by Lambda. Kafka expresses this in the number of records, not a measure of time. This metric provides visibility into whether your Lambda function is keeping up with the records added to the topic it is processing.

This blog walks through using the OffsetLag metric along with other Lambda and MSK metrics to understand your streaming application and optimize your Lambda function.

Overview

In this example application, a producer writes messages to a topic on the MSK cluster that is an event source for a Lambda function. Each message contains a number and the Lambda function finds the factors of that number. It outputs the input number and results to an Amazon DynamoDB table.

Finding all the factors of a number is fast if the number is small but takes longer for larger numbers. This difference means the size of the number written to the MSK topic influences the Lambda function duration.

Example application architecture

  1. A Kafka client writes messages to a topic in the MSK cluster.
  2. The Lambda event source polls the MSK topic on your behalf for new messages and triggers your Lambda function with batches of messages.
  3. The Lambda function factors the number in each message and then writes the results to DynamoDB.

In this application, several factors can contribute to offset lag. The first is the volume and size of messages. If more messages are coming in, the Lambda may take longer to process them. Other factors are the number of partitions in the topic, and the number of concurrent Lambda functions processing messages. A full explanation of how Lambda concurrency scales with the MSK event source is in the documentation.

If the average duration of your Lambda function increases, this also tends to increase the offset lag. This could be due to latency in a downstream service or to the complexity of the incoming messages. Lastly, if your Lambda function errors, the MSK event source retries the identical set of records until they succeed. This retry behavior also increases offset lag.

Measuring OffsetLag

To understand how the new OffsetLag metric works, you first need a working MSK topic as an event source for a Lambda function. Follow this blog post to set up an MSK instance.

To find the OffsetLag metric, go to the CloudWatch console, select All Metrics from the left-hand menu. Then select Lambda, followed by By Function Name to see a list of metrics by Lambda function. Scroll or use the search bar to find the metrics for this function and select OffsetLag.
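
If you prefer to pull the numbers programmatically instead of (or in addition to) the console, a sketch like the following retrieves recent OffsetLag datapoints; the function name is a placeholder for your own function:

# Hedged sketch: read recent OffsetLag values for a Lambda function via boto3.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="OffsetLag",
    Dimensions=[{"Name": "FunctionName", "Value": "my-msk-consumer"}],  # placeholder name
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average", "Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])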

OffsetLag metric example

To make it easier to look at multiple metrics at once, create a CloudWatch dashboard starting with the OffsetLag metric. Select Actions -> Add to Dashboard. Select the Create new button and give the dashboard a name. Choose Create, keeping the rest of the options at the defaults.

Adding OffsetLag to dashboard

After choosing Add to dashboard, the new dashboard appears. Choose the Add widget button to add the Lambda duration metric from the same function. Add another widget that combines both Lambda errors and invocations for the function. Then add a widget for the BytesInPerSec metric for the MSK topic. Find this metric under AWS/Kafka -> Broker ID, Cluster Name, Topic. Finally, click Save dashboard.

After a few minutes, you see a steady stream of invocations, as you would expect when consuming from a busy topic.

Data incoming to dashboard

This example is a CloudWatch dashboard showing the Lambda OffsetLag, Duration, Errors, and Invocations, along with the BytesInPerSec for the MSK topic.

In this example, the OffsetLag metric is averaging about eight, indicating that the Lambda function is eight records behind the latest record in the topic. While this is acceptable, there is room for improvement.

The first thing to look for is Lambda function errors, which can drive up offset lag. The metrics show that there are no errors so the next step is to evaluate and optimize the code.

The Lambda handler function loops through the records and calls the process_msg function on each record:

def lambda_handler(event, context):
    # Each key in event['records'] is a topic-partition batch of Kafka records.
    for batch in event['records'].keys():
        for record in event['records'][batch]:
            try:
                process_msg(record)
            except Exception as e:
                # Log and continue so one bad record does not fail the whole batch.
                print("error processing record:", record, e)
    return

The process_msg function handles base64 decoding, calls a factor function to factor the number, and writes the record to a DynamoDB table:

def process_msg(record):
    #messages are base64 encoded, so we decode it here
    msg_value = base64.b64decode(record['value']).decode()
    msg_dict = json.loads(msg_value)
    #using the number as the hash key in the dynamodb table
    msg_id = f"{msg_dict['number']}"
    if msg_dict['number'] <= MAX_NUMBER:
        factors = factor_number(msg_dict['number'])
        print(f"number: {msg_dict['number']} has factors: {factors}")
        item = {'msg_id': msg_id, 'msg':msg_value, 'factors':factors}
        resp = ddb_table.put_item(Item=item)
    else:
        print(f"ERROR: {msg_dict['number']} is >= limit of {MAX_NUMBER}")

The heavy computation takes place in the factor_number function:

def factor_number(number):
    factors = [1,number]
    for x in range(2, (int(1 + number / 2))):
        if (number % x) == 0:
            factors.append(x)
    return factors

The code loops through all numbers up to half of the input number. It can be optimized by looping only up to the square root of the number:

def factor_number(number):
    factors = [1,number]
    for x in range(2, 1 + int(number**0.5)):
        if (number % x) == 0:
            factors.append(x)
            if x != number // x:
                # Avoid adding the square root twice for perfect squares.
                factors.append(number // x)
    return factors

There are further optimizations and libraries for factoring numbers but this provides a noticeable performance improvement in this example.
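
As a quick point of comparison (not part of the deployed example), if the sympy library happens to be packaged with the function or provided through a Lambda layer, it ships with a ready-made divisors helper you could benchmark against the hand-rolled code above:

# Hedged sketch, assuming sympy is available in the deployment package or a layer.
from sympy import divisors

n = 5040
d = divisors(n)                 # all divisors of n, sorted ascending
print(len(d), d[:8], d[-2:])    # 60 [1, 2, 3, 4, 5, 6, 7, 8] [2520, 5040]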

Data after optimization

After deploying the code, refresh the metrics after a while to see the improvements:

The average Lambda duration has dropped to single-digit milliseconds and the OffsetLag is now averaging two.

If you see a noticeable change in the OffsetLag metric, there are several things to investigate, starting with the input side of the system: an increase in messages per second or a significant increase in message size are common causes.

Conclusion

This post walks through implementing the OffsetLag metric to understand latency between the latest messages in the MSK topic and the records a Lambda function is processing. It also reviews other metrics that help understand the underlying cause of increases to the offset lag. For more information on this topic, refer to the documentation and other MSK Lambda metrics.

For more serverless learning resources, visit Serverless Land.

New – Amazon EC2 R6i Memory-Optimized Instances Powered by the Latest Generation Intel Xeon Scalable Processors

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-ec2-r6i-memory-optimized-instances-powered-by-the-latest-generation-intel-xeon-scalable-processors/

In August, we introduced the general-purpose Amazon EC2 M6i instances powered by the latest generation Intel Xeon Scalable processors (code-named Ice Lake) with an all-core turbo frequency of 3.5 GHz. Compute-optimized EC2 C6i instances were also made available last month.

Today, I am happy to share that we are expanding our sixth-generation x86-based offerings to include memory-optimized Amazon EC2 R6i instances.

Here’s a quick recap of the advantages of the new R6i instances compared to R5 instances:

  • A larger instance size (r6i.32xlarge) with 128 vCPUs and 1,024 GiB of memory that makes it easier and more cost-efficient to consolidate workloads and scale up applications
  • Up to 15 percent improvement in compute price/performance
  • Up to 20 percent higher memory bandwidth
  • Up to 40 Gbps of bandwidth to Amazon Elastic Block Store (EBS) and 50 Gbps of network bandwidth, which is 2x more than R5 instances
  • Always-on memory encryption.

R6i instances are SAP Certified and are an ideal fit for memory-intensive workloads such as SQL and NoSQL databases, distributed web scale in-memory caches like Memcached and Redis, in-memory databases, and real-time big data analytics like Apache Hadoop and Apache Spark clusters.

Compared to M6i and C6i instances, the only difference is in the amount of memory that is included per vCPU. R6i instances are available in ten sizes:

Name           vCPUs   Memory (GiB)   Network Bandwidth (Gbps)   EBS Throughput (Gbps)
r6i.large        2         16         Up to 12.5                 Up to 10
r6i.xlarge       4         32         Up to 12.5                 Up to 10
r6i.2xlarge      8         64         Up to 12.5                 Up to 10
r6i.4xlarge     16        128         Up to 12.5                 Up to 10
r6i.8xlarge     32        256         12.5                       10
r6i.12xlarge    48        384         18.75                      15
r6i.16xlarge    64        512         25                         20
r6i.24xlarge    96        768         37.5                       30
r6i.32xlarge   128      1,024         50                         40
r6i.metal      128      1,024         50                         40
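
If you would rather launch an R6i instance from code than from the console, a minimal boto3 sketch looks like the following; the AMI ID, key pair, and subnet are placeholders to replace with your own values:

# Hedged sketch: launch a single r6i.large instance with boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # placeholder: use a current AMI in your Region
    InstanceType="r6i.large",
    KeyName="my-key-pair",                 # placeholder key pair
    SubnetId="subnet-0123456789abcdef0",   # placeholder subnet
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])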

Like M6i and C6i instances, these new R6i instances are built on the AWS Nitro System, which is a collection of building blocks that offloads many of the traditional virtualization functions to dedicated hardware, delivering high performance, high availability, and highly secure cloud instances.

As with all sixth generation EC2 instances, you may need to upgrade your Elastic Network Adapter (ENA) for optimal networking performance. For more information, see this article about migrating an EC2 instance to a sixth-generation instance in the AWS Knowledge Center.

R6i instances support Elastic Fabric Adapter (EFA) on r6i.32xlarge and r6i.metal instances for workloads that benefit from lower network latency, such as HPC and video processing.

Availability and Pricing
EC2 R6i instances are available today in four AWS Regions: US East (N. Virginia), US West (Oregon), US East (Ohio), and Europe (Ireland). As usual with EC2, you pay for what you use. For more information, see the EC2 pricing page.

Danilo

Expanding cross-Region event routing with Amazon EventBridge

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/expanding-cross-region-event-routing-with-amazon-eventbridge/

This post is written by Stephen Liedig, Sr Serverless Specialist SA.

In April 2021, AWS announced a new feature for Amazon EventBridge that allows you to route events from any commercial AWS Region to US East (N. Virginia), US West (Oregon), and Europe (Ireland). From today, you can now route events between any AWS Regions, except AWS GovCloud (US) and China.

EventBridge enables developers to create event-driven applications by routing events between AWS services, integrated software as a service (SaaS) applications, and your own applications. This helps you produce loosely coupled, distributed, and maintainable architectures. With these new capabilities, you can now route events across Regions and accounts using the same model used to route events to existing targets.

Cross-Region event routing with Amazon EventBridge makes it easier for customers to develop multi-Region workloads to:

  • Centralize your AWS events into one Region for auditing and monitoring purposes, such as aggregating security events for compliance reasons in a single account.
  • Replicate events from source to destination Regions to help synchronize data in cross-Region data stores.
  • Invoke asynchronous workflows in a different Region from a source event. For example, you can load balance from a target Region by routing events to another Region.

A previous post shows how cross-Region routing works. This blog post expands on these concepts and discusses a common use case for cross-Region event delivery – event auditing. This example explores how you can manage resources using AWS CloudFormation and EventBridge resource policies.

Multi-Region event auditing example walkthrough

Compliance is an important part of building event-driven applications and reacting to any potential policy or security violations. Customers use EventBridge to route security events from applications and globally distributed infrastructure into a single account for analysis. In many cases, they share specific AWS CloudTrail events with security teams. Customers also audit events from their custom-built applications to monitor sensitive data usage.

In this scenario, a company has their base of operations located in Asia Pacific (Singapore) with applications distributed across US East (N. Virginia) and Europe (Frankfurt). The applications in US East (N. Virginia) and Europe (Frankfurt) are using EventBridge for their respective applications and services. The security team in Asia Pacific (Singapore) wants to analyze events from the applications and CloudTrail events for specific API calls to monitor infrastructure security.

Reference architecture

To create the rules to receive these events:

  1. Create a new set of rules directly on all the event buses across the global infrastructure. Alternatively, delegate the responsibility of managing security rules to distributed teams that manage the event bus resources.
  2. Provide the security team with the ability to manage rules centrally, and control the lifecycle of rules on the global infrastructure.

Allowing the security team to manage the resources centrally provides more scalability. It is more consistent with the design principle that event consumers own and manage the rules that define how they process events.

Deploying the example application

The following code snippets are shortened for brevity. The full source code of the solution is in the GitHub repository. The solution uses AWS Serverless Application Model (AWS SAM) for deployment. Clone the repo and navigate to the solution directory:

git clone https://github.com/aws-samples/amazon-eventbridge-resource-policy-samples
cd ./patterns/cross-region-cross-account-pattern/

To allow the security team to start receiving events from any of the cross-Region accounts:

1. Create a security event bus in the Asia Pacific (Singapore) Region with a rule that processes events from the respective event sources.

ap-southeast-1 architecture

For simplicity, this example uses an Amazon CloudWatch Logs target to visualize the events arriving from cross-Region accounts:

SecurityEventBus:
  Type: AWS::Events::EventBus
  Properties:
    Name: !Ref SecurityEventBusName

# This rule processes events coming in from cross-Region accounts
SecurityAnalysisRule:
  Type: AWS::Events::Rule
  Properties:
    Name: SecurityAnalysisRule
    Description: Analyze events from cross-Region event buses
    EventBusName: !GetAtt SecurityEventBus.Arn
    EventPattern:
      source:
        - anything-but: com.company.security
    State: ENABLED
    RoleArn: !GetAtt WriteToCwlRole.Arn
    Targets:
      - Id: SendEventToSecurityAnalysisRule
        Arn: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:${SecurityAnalysisRuleTarget}"

In this example, you set the event pattern to process any event from a source that is not from the security team’s own domain. This allows you to process events from any account in any Region. You can filter this further as needed.
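
For example, if the security team later wants to narrow the rule to specific CloudTrail API calls, they could update the pattern programmatically. The sketch below is illustrative only; the rule name, event bus name, and API names are assumptions:

# Hedged sketch: tighten the security rule to match specific CloudTrail API calls.
import json

import boto3

events = boto3.client("events", region_name="ap-southeast-1")

pattern = {
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventName": ["CreateUser", "DeleteUser", "PutBucketPolicy"],
    },
}

events.put_rule(
    Name="SecurityCloudTrailApiCalls",     # placeholder rule name
    EventBusName="SecurityEventBus",       # placeholder bus name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)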

2. Set an event bus policy on each default and custom event bus that the security team must receive events from.

Event bus policy

This policy allows the security team to create rules to route events to its own security event bus in the Asia Pacific (Singapore) Region. The following policy defines a custom event bus in Account 2 in US East (N. Virginia) and an AWS::Events::EventBusPolicy that sets the Principal as the security team account.

This allows the security team to manage rules on the CustomEventBus:

CustomEventBus:
  Type: AWS::Events::EventBus
  Properties:
    Name: !Ref EventBusName

SecurityServiceRuleCreationStatement:
  Type: AWS::Events::EventBusPolicy
  Properties:
    EventBusName: !Ref CustomEventBus # If you omit this, the default event bus is used.
    StatementId: "AllowCrossRegionRulesForSecurityTeam"
    Statement:
      Effect: "Allow"
      Principal:
        AWS: !Sub "arn:aws:iam::${SecurityAccountNo}:root"
      Action:
        - "events:PutRule"
        - "events:DeleteRule"
        - "events:DescribeRule"
        - "events:DisableRule"
        - "events:EnableRule"
        - "events:PutTargets"
        - "events:RemoveTargets"
      Resource:
        - !Sub 'arn:aws:events:${AWS::Region}:${AWS::AccountId}:rule/${CustomEventBus.Name}/*'
      Condition:
        StringEqualsIfExists:
          "events:creatorAccount": "${aws:PrincipalAccount}"

3. With the policies set on the cross-Region accounts, now create the rules. Because you cannot create CloudFormation resources across Regions, you must define the rules in separate templates. This also gives the ability to expand to other Regions.

Once the template is deployed to the cross-Region accounts, use EventBridge resource policies to propagate rule definitions across accounts in the same Region. The security account must have permission to create CloudFormation resources in the cross-Region accounts to deploy the rule templates.

Resource policies

There are two parts to the rule templates. The first specifies a role that allows EventBridge to assume a role to send events to the target event bus in the security account:

# This IAM role allows EventBridge to assume the permissions necessary to send events
# from the source event buses to the destination event bus.
SourceToDestinationEventBusRole:
  Type: "AWS::IAM::Role"
  Properties:
    AssumeRolePolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Principal:
            Service:
              - events.amazonaws.com
          Action:
            - "sts:AssumeRole"
    Path: /
    Policies:
      - PolicyName: PutEventsOnDestinationEventBus
        PolicyDocument:
          Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action: "events:PutEvents"
              Resource:
                - !Ref SecurityEventBusArn

The second is the definition of the rule resource. This requires the Amazon Resource Name (ARN) of the event bus where you want to put the rule, the ARN of the target event bus in the security account, and a reference to the SourceToDestinationEventBusRole role:

SecurityAuditRule2:
  Type: AWS::Events::Rule
  Properties:
    Name: SecurityAuditRuleAccount2
    Description: Audit rule for the Security team in Singapore
    EventBusName: !Ref EventBusArnAccount2 # ARN of the custom event bus in Account 2
    EventPattern:
      source:
        - com.company.marketing
    State: ENABLED
    Targets:
      - Id: SendEventToSecurityEventBusArn
        Arn: !Ref SecurityEventBusArn
        RoleArn: !GetAtt SourceToDestinationEventBusRole.Arn

You can use the AWS SAM CLI to deploy this:

sam deploy -t us-east-1-rules.yaml \
  --stack-name us-east-1-rules \
  --region us-east-1 \
  --profile default \
  --capabilities=CAPABILITY_IAM \
  --parameter-overrides SecurityEventBusArn="arn:aws:events:ap-southeast-1:111111111111:event-bus/SecurityEventBus" EventBusArnAccount1="arn:aws:events:us-east-1:111111111111:event-bus/default" EventBusArnAccount2="arn:aws:events:us-east-1:222222222222:event-bus/custom-eventbus-account-2"

Testing the example application

With the rules deployed across the Regions, you can test by sending events to the event bus in Account 2:

  1. Navigate to the applications/account_2 directory. Here you find an events.json file, which you use as input for the put-events API call.
  2. Run the following command using the AWS CLI. This sends messages to the event bus in us-east-1 which are routed to the security event bus in ap-southeast-1:
    aws events put-events \
     --region us-east-1 \
     --profile [NAMED PROFILE FOR ACCOUNT 2] \
     --entries file://events.json
    

    If you have run this successfully, you see:
    Entries:
    - EventId: a423b35e-3df0-e5dc-b854-db9c42144fa2
    - EventId: 5f22aea8-51ea-371f-7a5f-8300f1c93973
    - EventId: 7279fa46-11a6-7495-d7bb-436e391cfcab
    - EventId: b1e1ecc1-03f7-e3ef-9aa4-5ac3c8625cc7
    - EventId: b68cea94-28e2-bfb9-7b1f-9b2c5089f430
    - EventId: fc48a303-a1b2-bda8-8488-32daa5f809d8
    FailedEntryCount: 0

  3. Navigate to the Amazon CloudWatch console to see a collection of log entries with the events you published. The log group is /aws/events/SecurityAnalysisRule.

Congratulations, you have successfully sent your first events across accounts and Regions!

Conclusion

With cross-Region event routing in EventBridge, you can now route events to and from any AWS Region. This post explains how to manage and configure cross-Region event routing using CloudFormation and EventBridge resource policies to simplify rule propagation across your global event bus infrastructure. Finally, I walk through an example you can deploy to your AWS account.

For more serverless learning resources, visit Serverless Land.

AWS Security Profiles: J.D. Bean, Sr. Security Solutions Architect

Post Syndicated from Maddie Bacon original https://aws.amazon.com/blogs/security/aws-security-profiles-j-d-bean-sr-security-solutions-architect/

JD Bean AWS Security Profile
In the week leading up to AWS re:Invent 2021, we’ll share conversations we’ve had with people at AWS who will be presenting, and get a sneak peek at their work.


How long have you been at AWS, and what do you do in your current role?

I’m coming up on my three-year anniversary at AWS. Which, as I say it out loud, is hard to believe. It feels as if the time has passed in the blink of an eye. I’m a Solutions Architect with a specialty in security. I work primarily with AWS Strategic Accounts, a set of companies at the forefront of innovation. I partner with my customers to help them design, build, and deploy secure and compliant cloud workloads.

How did you get started in security?

Security began as a hobby for me, and I found it came quite naturally. Perhaps it’s just the way my brain is wired, but I often found security was a topic that consistently drew me in. I leaned into security professionally, and I really enjoy it. AWS makes security its top priority, which is really exciting as a security professional. I’m the kind of person who loves to understand how all the pieces of a system fit together, and AWS Security has been an incredible opportunity, letting me carry my depth of expertise to all sorts of interesting new technical areas such as IoT, HPC, and AI/ML.

How do you explain your job to non-tech friends?

I often say that I work as an AWS Solutions Architect, which means I work with AWS customers to help design their cloud environments and projects, and that I specifically focus on security. If they’re interested in hearing more, I tell them AWS offers a wide array of services customers can configure and combine in all sorts of different ways to fit their needs. If they’re anything like me, I use the analogy of my own experience at hardware stores. In a way, part of what I do is to act like that helpful person at the hardware store who understands what all the tools and equipment do, how to use them correctly, and how they interact with one another. I partner with AWS customers to learn about their project requirements and help them work backwards from those requirements to determine the best approach for achieving their goals.

What are you currently working on that you’re excited about?

I’m working with my customers on a bunch of exciting projects for establishing security, governance, and compliance at scale. I’ve also been returning to my roots and spending more time focusing on open-source software, which is a big passion area for me both personally and professionally.

You’re presenting at AWS re:Invent this year—can you give readers a sneak peek at what you’re covering?

I’m presenting two sessions this year. The first session is a builder session called Grant least privilege temporary access securely at scale (WPS304). We’ll use AWS Secrets Manager, AWS Identity and Access Management (IAM), and the isolated compute functionality provided by AWS Nitro Enclaves to allow system administrators to request and retrieve narrowly scoped and limited-time access.

My second session is the Using AWS Nitro Enclaves to process highly sensitive data workshop (SEC304). AWS Nitro Enclaves allow customers to create an isolated, hardened, and highly constrained environment to host security-critical applications. A lot of work has gone in to building this workshop over the past few months, and I’m excited to share it at re:Invent.

The workshop gives attendees an opportunity to get hands-on, practical experience with AWS Nitro Enclaves. Attendees will get experience launching enclave applications, using the Nitro Enclaves secure local channel for communication. Attendees will also work with Nitro Enclaves’ included cryptographic attestation features and integration with AWS Key Management Services. After putting all these elements together, attendees will be able to see how you can be sure that only your authorized code in your Nitro Enclave is able to access sensitive material.

For those who won’t be able to join the re:Invent workshop session in person, the AWS Nitro Enclaves Workshop is available online and can be completed in your own account at any time.

What are you hoping the audience will take away from the session(s)?

I hope attendees will come away from the session with a sense of how approachable and flexible AWS Nitro Enclaves are, and start to formulate ideas for how they can use Nitro Enclaves in their own workloads.

From your perspective, what’s the biggest thing happening in confidential computing right now?

Over the last year I’ve seen a big increase in interest from customers around confidential computing. This is how we’ve been approaching the design of the AWS Nitro System for many years now. The Nitro System, the underlying platform for all modern Amazon EC2 instances, already provides confidential computing protections by default.

More recently, AWS Nitro Enclaves has offered a new capability for customers to divide their own workloads into more-trusted and less-trusted components. The isolation of workload components in AWS Nitro Enclaves is powered by the specialized hardware and associated firmware of the Nitro System.

What’s your favorite Leadership Principle at Amazon and why?

My favorite Amazon Leadership principle is Learn and Be Curious. I think I’m at my best when I’m learning, growing, and pushing outward at the edges. AWS is such an incredible place to work for people who love to learn. AWS is constantly innovating and inventing for our customers, and learning is central to the culture here.

What’s the best career advice you’ve ever received?

One piece of advice I’ve held close from an early age is just how important it is to be comfortable saying “I don’t know”—ideally followed by “but I’d like to find out.” This has served me well in life, both professionally and personally.

Another is “lead with trust.” Being willing to be vulnerable and assume the best of others goes a long way. At Amazon, one of our leadership principles is Earn Trust. I’ve found how important it is to set an example of offering trust to others. Most people tend to rise to a challenge. If you enter new interactions with a default expectation of trusting others, more often than not, your trust ends up being well-placed.

If you had to pick any other job, what would you want to do?

It’s funny you ask that. I still think of my current role as the “other job” I daydream about. I began my professional life in the legal field. Admittedly, my work was primarily focused around open-source software, so it wasn’t entirely unrelated to what I do now, but I really do feel like being a Solutions Architect is a second phase in my career. I’m enjoying this new chapter too much to give doing anything else much thought.

If you were to really press me, I’d say that my wife, who’s a psychologist, tells me I missed my calling as a therapist. I take that as a real compliment.

Author

J. D. Bean

J.D. is a senior security specialist Solutions Architect for AWS Strategic Accounts based out of New York City. His interests include security, privacy, and compliance. He is passionate about his work enabling AWS customers’ successful cloud journeys. J.D. holds a Bachelor of Arts from The George Washington University and a Juris Doctor from New York University School of Law.

Author

Maddie Bacon

Maddie (she/her) is a technical writer for AWS Security with a passion for creating meaningful content. She previously worked as a security reporter and editor at TechTarget and has a BA in Mathematics. In her spare time, she enjoys reading, traveling, and all things Harry Potter.

Provide data reliability in Amazon Redshift at scale using Great Expectations library

Post Syndicated from Faizan Ahmed original https://aws.amazon.com/blogs/big-data/provide-data-reliability-in-amazon-redshift-at-scale-using-great-expectations-library/

Ensuring data reliability is one of the key objectives of maintaining data integrity and is crucial for building data trust across an organization. Data reliability means that the data is complete and accurate. It’s the catalyst for delivering trusted data analytics and insights. Incomplete or inaccurate data leads business leaders and data analysts to make poor decisions, which can lead to negative downstream impacts and subsequently may result in teams spending valuable time and money correcting the data later on. Therefore, it’s always a best practice to run data reliability checks before loading the data into any targets like Amazon Redshift, Amazon DynamoDB, or Amazon Timestream databases.

This post discusses a solution for running data reliability checks before loading the data into a target table in Amazon Redshift using the open-source library Great Expectations. You can automate the process for data checks via the extensive built-in Great Expectations glossary of rules using PySpark, and it’s flexible for adding or creating new customized rules for your use case.

Amazon Redshift is a cloud data warehouse solution and delivers up to three times better price-performance than other cloud data warehouses. With Amazon Redshift, you can query and combine exabytes of structured and semi-structured data across your data warehouse, operational database, and data lake using standard SQL. Amazon Redshift lets you save the results of your queries back to your Amazon Simple Storage Service (Amazon S3) data lake using open formats like Apache Parquet, so that you can perform additional analytics from other analytics services like Amazon EMR, Amazon Athena, and Amazon SageMaker.

Great Expectations (GE) is an open-source library and is available on GitHub for public use. It helps data teams eliminate pipeline debt through data testing, documentation, and profiling. Great Expectations helps build trust, confidence, and integrity of data across data engineering and data science teams in your organization. GE offers a variety of expectations developers can configure. The tool defines expectations as statements describing verifiable properties of a dataset. Not only does it offer a glossary of more than 50 built-in expectations, it also allows data engineers and scientists to write custom expectation functions.
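
To give a flavor of what an expectation looks like in code, here is a generic sketch (not the notebook from this solution) that wraps a small PySpark DataFrame with the v0.13-style dataset API and runs two built-in expectations; the column names and allowed values are illustrative:

# Hedged sketch of a Great Expectations check on a PySpark DataFrame
# (v0.13-style dataset API, assuming PySpark and Great Expectations are installed).
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.appName("ge-demo").getOrCreate()
df = spark.createDataFrame(
    [("a1", "Full Time"), ("a2", None), ("a3", "Part Time")],
    ["uniq_id", "job_type"],
)

ge_df = SparkDFDataset(df)

# Two simple expectations: uniq_id must never be null,
# and job_type must come from a known set of values.
result_not_null = ge_df.expect_column_values_to_not_be_null("uniq_id")
result_in_set = ge_df.expect_column_values_to_be_in_set(
    "job_type", ["Full Time", "Part Time", "Contract"]
)

print(result_not_null.success, result_in_set.success)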

Use case overview

Before performing analytics or building machine learning (ML) models, cleaning data can take up a lot of time in the project cycle. Without automated and systematic data quality checks, we may spend most of our time cleaning data and hand-coding one-off quality checks. As most data engineers and scientists know, this process can be both tedious and error-prone.

Having an automated quality check system is critical to project efficiency and data integrity. Such systems help us understand data quality expectations and the business rules behind them, know what to expect in our data analysis, and make communicating the data’s intricacies much easier. For example, in a raw dataset of customer profiles of a business, if there’s a column for date of birth in format YYYY-mm-dd, values like 1000-09-01 would be correctly parsed as a date type. However, logically this value would be incorrect in 2021, because the age of the person would be 1021 years, which is impossible.

Another use case could be to use GE for streaming analytics, where you can use AWS Database Migration Service (AWS DMS) to migrate a relational database management system. AWS DMS can export change data capture (CDC) files in Parquet format to Amazon S3, where these files can then be cleansed by an AWS Glue job using GE and either written to a destination bucket for Athena consumption or streamed in Avro format to Amazon Kinesis or Kafka.

Additionally, automated data quality checks can be versioned, and they improve data monitoring while reducing human intervention. Data lineage in an automated data quality system can also indicate at which stage in the data pipeline the errors were introduced, which can help inform improvements in upstream systems.

Solution architecture

This post comes with a ready-to-use blueprint that automatically provisions the necessary infrastructure and spins up a SageMaker notebook that walks you step by step through the solution. Additionally, it enforces the best practices in data DevOps and infrastructure as code. The following diagram illustrates the solution architecture.

The architecture contains the following components:

  1. Data lake – When we run the AWS CloudFormation stack, an open-source sample dataset in CSV format is copied to an S3 bucket in your account. As an output of the solution, the data destination is an S3 bucket. This destination consists of two separate prefixes, each of which contains files in Parquet format, to distinguish between accepted and rejected data.
  2. DynamoDB – The CloudFormation stack persists data quality expectations in a DynamoDB table. Four predefined column expectations are populated by the stack in a table called redshift-ge-dq-dynamo-blog-rules. Apart from the pre-populated rules, you can add any rule from the Great Expectations glossary according to the data model showcased later in the post.
  3. Data quality processing – The solution utilizes a SageMaker notebook instance powered by Amazon EMR to process the sample dataset using PySpark (v3.1.1) and Great Expectations (v0.13.4). The notebook is automatically populated with the S3 bucket location and Amazon Redshift cluster identifier via the SageMaker lifecycle config provisioned by AWS CloudFormation.
  4. Amazon Redshift – We create internal and external tables in Amazon Redshift for the accepted and rejected datasets produced from processing the sample dataset. The external dq_rejected.monster_com_rejected table, for rejected data, uses Amazon Redshift Spectrum and creates an external database in the AWS Glue Data Catalog to reference the table. The dq_accepted.monster_com table is created as a regular Amazon Redshift table by using the COPY command.

Sample dataset

As part of this post, we have performed tests on the Monster.com job applicants sample dataset to demonstrate the data reliability checks using the Great Expectations library and loading data into an Amazon Redshift table.

The dataset contains nearly 22,000 different sample records with the following columns:

  • country
  • country_code
  • date_added
  • has_expired
  • job_board
  • job_description
  • job_title
  • job_type
  • location
  • organization
  • page_url
  • salary
  • sector
  • uniq_id

For this post, we have selected four columns with inconsistent or dirty data, namely organization, job_type, uniq_id, and location, whose inconsistencies are flagged according to the rules we define from the GE glossary as described later in the post.

Prerequisites

For this solution, you should have the following prerequisites:

  • An AWS account if you don’t have one already. For instructions, see Sign Up for AWS.
  • For this post, you can launch the CloudFormation stack in the following Regions:
    • us-east-1
    • us-east-2
    • us-west-1
    • us-west-2
  • An AWS Identity and Access Management (IAM) user. For instructions, see Create an IAM User.
  • The user should have create, write, and read access for the AWS services used in this solution.
  • Familiarity with Great Expectations and PySpark.

Set up the environment

Choose Launch Stack to start creating the required AWS resources for the notebook walkthrough:

For more information about Amazon Redshift cluster node types, see Overview of Amazon Redshift clusters. For the type of workflow described in this post, we recommend using the RA3 Instance Type family.

Run the notebooks

When the CloudFormation stack is complete, complete the following steps to run the notebooks:

  1. On the SageMaker console, choose Notebook instances in the navigation pane.

This opens the notebook instances in your Region. You should see a notebook titled redshift-ge-dq-EMR-blog-notebook.

  2. Choose Open Jupyter next to this notebook to open the Jupyter notebook interface.

You should see the Jupyter notebook file titled ge-redshift.ipynb.

  3. Choose the file to open the notebook and follow the steps to run the solution.

Run configurations to create a PySpark context

When the notebook is open, make sure the kernel is set to Sparkmagic (PySpark). Run the following block to set up Spark configs for a Spark context.
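For example, a Sparkmagic configuration cell might look like the following sketch (the specific values shown here are illustrative; the notebook's actual configuration may differ):

%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3",
        "spark.dynamicAllocation.enabled": "true"
    }
}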

Create a Great Expectations context

In Great Expectations, your data context manages your project configuration. We create a data context for our solution by passing our S3 bucket location. The S3 bucket’s name, created by the stack, should already be populated within the cell block. Run the following block to create a context:

from great_expectations.data_context.types.base import DataContextConfig, DatasourceConfig, S3StoreBackendDefaults
from great_expectations.data_context import BaseDataContext

bucket_prefix = "ge-redshift-data-quality-blog"
bucket_name = "ge-redshift-data-quality-blog-region-account_id"
region_name = '-'.join(bucket_name.replace(bucket_prefix, '').split('-')[1:4])
dataset_path = f"s3://{bucket_name}/monster_com-job_sample.csv"

# `spark` is the SparkSession provided by the Sparkmagic (PySpark) kernel
project_config = DataContextConfig(
    config_version=2,
    plugins_directory=None,
    config_variables_file_path=None,
    datasources={
        "my_spark_datasource": {
            "data_asset_type": {
                "class_name": "SparkDFDataset",  # Set the dataset type to Spark
                "module_name": "great_expectations.dataset",
            },
            # Pass the active Spark session configs to the datasource
            "spark_config": dict(spark.sparkContext.getConf().getAll()),
            "class_name": "SparkDFDatasource",
            "module_name": "great_expectations.datasource",
        }
    },
    # Store expectation suites, validations, and data docs in the stack's S3 bucket
    store_backend_defaults=S3StoreBackendDefaults(default_bucket_name=bucket_name),
)
context = BaseDataContext(project_config=project_config)

For more details on creating a GE context, see Getting started with Great Expectations.

Get GE validation rules from DynamoDB

Our CloudFormation stack created a DynamoDB table with prepopulated rows of expectations. The data model in DynamoDB describes the properties of each dataset and its columns, and the expectations you want to configure for each column. The following code describes an example of the data model for the column organization:

{
  "id": "job_reqs-organization",
  "dataset_name": "job_reqs",
  "rules": [ // list of expectations to apply to this column
    {
      "kwargs": {
        "result_format": "SUMMARY|COMPLETE|BASIC|BOOLEAN_ONLY" // the level of detail of the result
      },
      "name": "expect_column_values_to_not_be_null", // name of the GE expectation
      "reject_msg": "REJECT:null_values_found_in_organization"
    }
  ],
  "column_name": "organization"
}

The code contains the following parameters:

  • id – Unique ID of the document
  • dataset_name – Name of the dataset, for example monster_com
  • rules – List of GE expectations to apply:
    • kwargs – Parameters to pass to an individual expectation
    • name – Name of the expectation from the GE glossary
    • reject_msg – String to flag for any row that doesn’t pass this expectation
  • column_name – Name of dataset column to run the expectations on

Each column can have one or more associated expectations that it needs to pass. You can also add expectations for more columns or to existing columns by following the data model shown earlier. With this technique, you can automate the verification of any number of data quality rules for your datasets without performing any code changes. Apart from its flexibility, what makes GE powerful is the ability to create custom expectations if the GE glossary doesn't cover your use case. For more details on creating custom expectations, see How to create custom Expectations.

Now run the cell block to fetch the GE rules from the DynamoDB table.
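A minimal sketch of that fetch follows (the table name matches the redshift-ge-dq-dynamo-blog-rules table created by the stack; the exact notebook code may differ):

import boto3

dynamodb = boto3.resource("dynamodb", region_name=region_name)
rules_table = dynamodb.Table("redshift-ge-dq-dynamo-blog-rules")

# Scan all expectation documents and group them by column name
response = rules_table.scan()
expectations_by_column = {item["column_name"]: item["rules"] for item in response["Items"]}
print(list(expectations_by_column.keys()))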

Read the monster.com sample dataset and pass it through the validation rules

After we have fetched the expectations from DynamoDB, we can read the raw CSV dataset. This dataset should already be copied to your S3 bucket location by the CloudFormation stack. After reading the CSV as a Spark DataFrame, you can inspect the schema and a sample of rows to confirm the load.
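As a rough sketch of that read (the exact notebook cell may differ; the multiLine and escape options are assumptions to handle the free-text job_description column):

raw_df = spark.read.csv(
    dataset_path,          # s3://<bucket>/monster_com-job_sample.csv, defined earlier
    header=True,
    inferSchema=True,
    multiLine=True,        # job_description can contain embedded newlines
    escape='"',
)
raw_df.printSchema()
raw_df.show(5, truncate=True)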

To evaluate whether a row passes each column’s expectations, we need to pass the necessary columns to a Spark user-defined function. This UDF evaluates each row in the DataFrame and appends the results of each expectation to a comments column.

Rows that pass all column expectations have a null value in the comments column.

A row that fails at least one column expectation is flagged with the string format REJECT:reject_msg_from_dynamo. For example, if a row has a null value in the organization column, then according to the rules defined in DynamoDB, the comments column is populated by the UDF as REJECT:null_values_found_in_organization.

The UDF recognizes a potentially erroneous column by evaluating the result dictionary generated by the Great Expectations library. The structure of this dictionary depends on the result_format keyword argument. In short, if the count of unexpected values for any column is greater than zero, we flag that row as rejected.
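The notebook applies this check inside a Spark UDF over the full DataFrame. As a simplified, self-contained illustration of how the result dictionary can be inspected (using a single-row PandasDataset here rather than the notebook's SparkDFDataset-based UDF):

import pandas as pd
from great_expectations.dataset import PandasDataset

# A single hypothetical row standing in for one record of the job dataset
ge_row = PandasDataset(pd.DataFrame([{"organization": None, "job_type": "Full Time"}]))

comments = []
result = ge_row.expect_column_values_to_not_be_null("organization", result_format="SUMMARY")
# The SUMMARY result dictionary reports how many values failed the expectation
if result.result.get("unexpected_count", 0) > 0:
    comments.append("REJECT:null_values_found_in_organization")

print(comments)  # ['REJECT:null_values_found_in_organization']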

Split the resulting dataset into accepted and rejected DataFrames

Now that we have all the rejected rows flagged in the source DataFrame within the comments column, we can use this property to split the original dataset into accepted and rejected DataFrames. In the previous step, we mentioned that we append a reject message to the comments column for each failed expectation in a row. With this fact, we can select the rejected rows as those whose comments value starts with the string REJECT (alternatively, you can filter by null values in the comments column to get the accepted rows). When we have the set of rejected rows, we can get the accepted rows as a separate DataFrame by using the PySpark except function, as sketched below.
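A minimal sketch of that split, assuming df_with_comments is the DataFrame produced by the UDF step:

from pyspark.sql.functions import col

# Rows flagged by at least one failed expectation start with the string REJECT
rejected_df = df_with_comments.filter(col("comments").startswith("REJECT"))

# Everything that wasn't rejected is accepted; exceptAll preserves duplicate rows
accepted_df = df_with_comments.exceptAll(rejected_df)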

Write the DataFrames to Amazon S3.

Now that we have divided the original DataFrame, we can write both DataFrames to Amazon S3 in Parquet format. We need to write the accepted DataFrame without the comments column because it's only added to flag rejected rows. Run the cell blocks to write the Parquet files under the appropriate prefixes.
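The write step might look like the following sketch (the accepted/ and rejected/ prefix names are assumptions, not necessarily the stack's exact prefixes):

# Drop the helper column before persisting the accepted data
accepted_df.drop("comments").write.mode("overwrite").parquet(f"s3://{bucket_name}/accepted/")

# Keep the comments column on rejected rows so the reject reasons remain queryable
rejected_df.write.mode("overwrite").parquet(f"s3://{bucket_name}/rejected/")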

Copy the accepted dataset to an Amazon Redshift table

Now that we have written the accepted dataset, we can use the Amazon Redshift COPY command to load this dataset into an Amazon Redshift table. The notebook outlines the steps required to create a table for the accepted dataset in Amazon Redshift using the Amazon Redshift Data API. After the table is created successfully, we can run the COPY command.
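As a hedged sketch of that step using the Amazon Redshift Data API (the cluster identifier, database, user, IAM role, and S3 prefix are placeholders; the notebook is populated with the real cluster identifier by the stack):

import boto3

redshift_data = boto3.client("redshift-data", region_name=region_name)

copy_sql = f"""
    COPY dq_accepted.monster_com
    FROM 's3://{bucket_name}/accepted/'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
    FORMAT AS PARQUET;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="<your-cluster-identifier>",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
print(response["Id"])  # statement ID; poll it with describe_statement to check completion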

Another noteworthy advantage of the data quality approach described in this post is that the Amazon Redshift COPY command doesn't fail because of schema or data type errors for columns that have clear expectations defined to match the schema. Similarly, you can define expectations for every column in the table so that a row satisfies the schema constraints and can be considered a dq_accepted.monster_com row.

Create an external table in Amazon Redshift for rejected data

We need to have the rejected rows available to us in Amazon Redshift for comparative analysis. These comparative analyses can help inform upstream systems about the quality of the data being collected and how it can be corrected to improve the overall quality of data. However, it isn't wise to store the rejected data on the Amazon Redshift cluster, particularly for large tables, because it occupies extra disk space and increases cost. Instead, we use Redshift Spectrum to register an external table in an external schema in Amazon Redshift. The external schema lives in an external database in the AWS Glue Data Catalog and is referenced by Amazon Redshift. The notebook outlines the steps to create the external schema and table.
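As a rough sketch of what those steps could look like with the Data API (the external database name, IAM role, and abbreviated column list are assumptions, not the notebook's exact DDL):

import boto3

redshift_data = boto3.client("redshift-data", region_name=region_name)

create_external_schema = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS dq_rejected
    FROM DATA CATALOG
    DATABASE 'dq_rejected_db'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-spectrum-role>'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

create_external_table = f"""
    CREATE EXTERNAL TABLE dq_rejected.monster_com_rejected (
        organization VARCHAR(512),
        job_type     VARCHAR(256),
        uniq_id      VARCHAR(256),
        location     VARCHAR(512),
        comments     VARCHAR(1024)
    )
    STORED AS PARQUET
    LOCATION 's3://{bucket_name}/rejected/';
"""

for sql in (create_external_schema, create_external_table):
    redshift_data.execute_statement(
        ClusterIdentifier="<your-cluster-identifier>",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )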

Verify and compare the datasets in Amazon Redshift.

Of the 22,000 records in the input dataset, 12,160 were processed successfully and loaded into the monster_com table under the dq_accepted schema. These records passed all the validation rules configured in DynamoDB.

A total of 9,840 records were rejected because they broke one or more of the rules configured in DynamoDB, and were loaded into the monster_com_rejected table in the dq_rejected schema. In this section, we describe the behavior of each expectation on the dataset.

  • Expect column values to not be null in organization – This rule is configured to reject a row if the organization is null. The following query returns a sample of rows from the dq_rejected.monster_com_rejected table that are null in the organization column, with their reject message (a sketch of such a query appears after this list).
  • Expect column values to match the regex list in job_type – This rule expects the column entries to be strings that can be matched to either any of or all of a list of regular expressions. In our use case, we have only allowed values that match a pattern within [".*Full.*Time", ".*Part.*Time", ".*Contract.*"]. The following query shows rows that are rejected due to an invalid job type.

Most of the records were rejected with multiple reasons, and all those mismatches are captured under the comments column.

  • Expect column values to not match regex for uniq_id – Similar to the previous rule, this rule aims to reject any row whose value matches a certain pattern. In our case, that pattern is having an empty space (\s++) in the primary column uniq_id. This means we consider a value to be invalid if it has empty spaces in the string. The following query returns rows with an invalid format for uniq_id.
  • Expect column entries to be strings with a length between a minimum value and a maximum value (inclusive) – A length check rule is defined in the DynamoDB table for the location column. This rule rejects values or rows if the length of the value violates the specified constraints. The following query returns the records that are rejected due to a rule violation in the location column.
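As an example, a query like the following sketch (run through the Data API or any SQL client; the identifiers are placeholders) pulls a sample of rows rejected by the first rule:

import boto3

redshift_data = boto3.client("redshift-data", region_name=region_name)

sample_query = """
    SELECT uniq_id, job_type, location, comments
    FROM dq_rejected.monster_com_rejected
    WHERE comments LIKE '%null_values_found_in_organization%'
    LIMIT 10;
"""

redshift_data.execute_statement(
    ClusterIdentifier="<your-cluster-identifier>",
    Database="dev",
    DbUser="awsuser",
    Sql=sample_query,
)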

You can continue to analyze the other columns’ predefined rules from DynamoDB or pick any rule from the GE glossary and add it to an existing column. Rerun the notebook to see the result of your data quality rules in Amazon Redshift. As mentioned earlier, you can also try creating custom expectations for other columns.

Benefits and limitations

The efficiency and efficacy of this approach come from the degree of automation and configurability that GE enables compared with other approaches. A brute-force alternative would be to write stored procedures in Amazon Redshift that perform data quality checks on staging tables before data is loaded into the main tables. However, this approach doesn't scale well: stored procedures can't persist repeatable rules for different columns (or call DynamoDB APIs) the way this solution does with DynamoDB, so you would have to write and store a rule for each column of every table. Furthermore, accepting or rejecting a row based on a single rule requires complex SQL statements that may result in longer data quality checks or more compute power, which can also incur extra costs. With GE, a data quality rule is generic, repeatable, and scalable across different datasets.

Another benefit of this approach, related to using GE, is that it supports multiple Python-based backends, including Spark, Pandas, and Dask. This provides flexibility across an organization where teams might have skills in different frameworks. If a data scientist prefers using Pandas to write their ML pipeline feature quality test, then a data engineer using PySpark can use the same code base to extend those tests due to the consistency of GE across backends.

Furthermore, GE is written natively in Python, which means it’s a good option for engineers and scientists who are more used to running their extract, transform, and load (ETL) workloads in PySpark in comparison to frameworks like Deequ, which is natively written in Scala over Apache Spark and fits better for Scala use cases (the Python interface, PyDeequ, is also available). Another benefit of using GE is the ability to run multi-column unit tests on data, whereas Deequ doesn’t support that (as of this writing).

However, the approach described in this post might not be the most performant in some cases for full table load batch reads for very large tables. This is due to the serde (serialization/deserialization) cost of using UDFs. Because the GE functions are embedded in PySpark UDFs, the performance of these functions is slower than native Spark functions. Therefore, this approach gives the best performance when integrated with incremental data processing workflows, for example using AWS DMS to write CDC files from a source database to Amazon S3.

Clean up

Some of the resources deployed in this post, including those deployed using the provided CloudFormation template, incur costs as long as they’re in use. Be sure to remove the resources and clean up your work when you’re finished in order to avoid unnecessary cost.

On the AWS CloudFormation console, select the stack and choose Delete to remove all the resources.
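If you prefer to clean up programmatically, a call like the following sketch (the Region and stack name are placeholders) achieves the same result:

import boto3

cloudformation = boto3.client("cloudformation", region_name="<your-region>")
cloudformation.delete_stack(StackName="<your-stack-name>")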

The resources in the CloudFormation template are not production ready. If you would like to use this solution in production, enable logging for all S3 buckets and ensure the solution adheres to your organization’s encryption policies through EMR Security Best Practices.

Conclusion

In this post, we demonstrated how you can automate data reliability checks using the Great Expectations library before loading data into an Amazon Redshift table. We also showed how you can use Redshift Spectrum to create external tables. If dirty data were to make its way into the accepted table, all downstream consumers, such as business intelligence reporting, advanced analytics, and ML pipelines, could be affected and produce inaccurate reports and results, which in turn can lead business leaders to the wrong conclusions when making business decisions. Furthermore, flagging dirty data as rejected before loading it into Amazon Redshift also helps reduce the time and effort a data engineer might otherwise spend investigating and correcting the data.

We are interested to hear how you would like to apply this solution for your use case. Please share your thoughts and questions in the comments section.


About the Authors

Faizan Ahmed is a Data Architect at AWS Professional Services. He loves to build data lakes and self-service analytics platforms for his customers. He also enjoys learning new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS. In his free time, Faizan enjoys traveling, sports, and reading.

Bharath Kumar Boggarapu is a Data Architect at AWS Professional Services with expertise in big data technologies. He is passionate about helping customers build performant and robust data-driven solutions and realize their data and analytics potential. His areas of interests are open-source frameworks, automation, and data architecting. In his free time, he loves to spend time with family, play tennis, and travel.

Security updates for Monday

Post Syndicated from original https://lwn.net/Articles/876655/rss

Security updates have been issued by Debian (firebird3.0, libmodbus, and salt), Fedora (js-jquery-ui and wordpress), Mageia (arpwatch, chromium-browser-stable, php, rust, and wireshark), openSUSE (barrier, firefox, hylafax+, opera, postgresql12, postgresql13, postgresql14, and tomcat), SUSE (ardana-ansible, ardana-monasca, crowbar-openstack, influxdb, kibana, openstack-cinder, openstack-ec2-api, openstack-heat-gbp, openstack-heat-templates, openstack-horizon-plugin-gbp-ui, openstack-keystone, openstack-neutron-gbp, openstack-nova, python-eventlet, rubygem-redcarpet, rubygem-puma, ardana-ansible, ardana-monasca, documentation-suse-openstack-cloud, openstack-ec2-api, openstack-heat-templates, python-Django, python-monasca-common, rubygem-redcarpet, rubygem-puma, firefox, kernel, postgresql, postgresql13, postgresql14, postgresql10, postgresql12, postgresql13, postgresql14, postgresql96, and samba), and Ubuntu (libreoffice).
