Tag Archives: announcements

Prepare for consolidated controls view and consolidated control findings in AWS Security Hub

Post Syndicated from Priyanka Prakash original https://aws.amazon.com/blogs/security/prepare-for-consolidated-controls-view-and-consolidated-control-findings-in-aws-security-hub/

Currently, AWS Security Hub identifies controls and generates control findings in the context of security standards. Security Hub is aiming to release two new features in the first quarter of 2023 that will decouple controls from standards and streamline how you view and receive control findings.

The new features to be released are consolidated controls view and consolidated control findings. Consolidated controls view will provide you with a comprehensive view within the Security Hub console of your controls across security standards. This feature will also introduce a single unique identifier for each control across security standards.

Consolidated control findings will streamline your control findings. When this feature is turned on, Security Hub will produce a single finding for a security check even when a check is shared across multiple standards. This will reduce finding noise and help you focus on misconfigured resources in your AWS environment.

In this blog post, I’ll summarize the upcoming features, the benefit they bring to your organization, and how you can take advantage of them upon release.

Feature 1: Consolidated controls view

Currently, controls are identified, viewed, and managed in the context of individual security standards. In the Security Hub console, you first have to navigate to a specific standard to see a list of controls for that standard. Within the AWS Foundational Security Best Practices (FSBP) standard, Security Hub identifies controls by the impacted AWS service and a unique number (for example, IAM.1). For other standards, Security Hub includes the standard as part of the control identifier (for example, CIS 1.1 or PCI.AutoScaling.1).

After the release of consolidated controls view, you will be able to see a consolidated list of your controls from a new Controls page in the Security Hub console. Security Hub will also assign controls a consistent security control ID across standards. Following the current naming convention of the AWS FSBP standard, control IDs will include the relevant service and a unique number.

For example, the control AWS Config should be enabled is currently identified as Config.1 in the AWS FSBP standard, CIS 2.5 in the Center for Internet Security (CIS) AWS Foundations Benchmark v1.2.0, CIS 3.5 in the CIS AWS Foundations Benchmark v1.4.0, and PCI.Config.1 in the Payment Card Industry Data Security Standard (PCI DSS). After this release, this control will have a single identifier called Config.1 across standards. The single Controls page and consistent identifier will help you rapidly discover misconfigurations with minimal context-switching.

You’ll be able to enable a control for one or more enabled standards that include the control. You’ll also be able to disable a control for one or more enabled standards. As before, you can enable the standards that apply to your business case.

Changes to control finding fields and values after the release of consolidated controls view

After the release of consolidated controls view, note the following changes to control finding fields and values in the AWS Security Finding Format (ASFF).

ASFF field What changes after consolidated controls view release Example value before consolidated controls view release Example value after consolidated controls view release
Compliance.SecurityControlId A single control ID will apply across standards. ProductFields.ControlId will still provide the standards-based control ID. Not applicable (new field) EC2.2
Compliance.AssociatedStandards Will show the standards that a control is enabled for. Not applicable (new field) [{“StandardsId”: “aws-foundational-security-best-practices/v/1.0.0”}]
ProductFields.RecommendationUrl This field will no longer reference a standard. https://docs.aws.amazon.com/console/securityhub/PCI.EC2.2/remediation https://docs.aws.amazon.com/console/securityhub/EC2.2/remediation
Remediation.Recommendation.Text This field will no longer reference a standard. “For directions on how to fix this issue, please consult the AWS Security Hub PCI DSS documentation.” “For instructions on how to fix this issue, see the AWS Security Hub documentation for EC2.2.”
Remediation.Recommendation.Url This field will no longer reference a standard. https://docs.aws.amazon.com/console/securityhub/PCI.EC2.2/remediation https://docs.aws.amazon.com/console/securityhub/EC2.2/remediation

Feature 2: Consolidated control findings

Currently, multiple standards contain separate controls for the same security check. Security Hub generates a separate finding per standard for each related control that is evaluated by the same security check.

After release of the consolidated control findings feature, you’ll be able to unify control findings across standards and reduce finding noise. This, in turn, will help you more quickly investigate and remediate failed findings. When you turn on consolidated control findings, Security Hub will generate a single finding or finding update for each security check of a control, even if the check is shared across multiple standards.

For example, after you turn on the feature, you will receive a single finding for a security check of Config.1 even if you’ve enabled this control for the AWS FSBP standard, CIS AWS Foundations Benchmark v1.2.0, CIS AWS Foundations Benchmark v1.4.0, and PCI DSS. If you don’t turn on consolidated control findings, you will receive four separate findings for a security check of Config.1 if you’ve enabled this control for the AWS FSBP standard, CIS AWS Foundations Benchmark v1.2.0, CIS AWS Foundations Benchmark v1.4.0, and PCI DSS.

Changes to control finding fields and values after turning on consolidated control findings

If you turn on consolidated control findings, note the following changes to control finding fields and values in the ASFF. These changes are in addition to the changes previously described for consolidated controls view.

ASFF field What changes after consolidated controls view release Example value before consolidated controls view release Example value after consolidated controls view release
GeneratorId This field will no longer reference a standard. aws-foundational-security-best-practices/v/1.0.0/Config.1 security-control/Config.1
Title This field will no longer reference a standard. PCI.Config.1 AWS Config should be enabled {
Id This field will no longer reference a standard. arn:aws:securityhub:eu-central-1:123456789012:subscription/pci-dss/v/3.2.1/PCI.IAM.5/finding/ab6d6a26-a156-48f0-9403-115983e5a956 arn:aws:securityhub:eu-central-1:123456789012:security-control/iam.9/finding/ab6d6a26-a156-48f0-9403-115983e5a956
ProductFields.ControlId This field will be removed in favor of a single, standard-agnostic control ID. PCI.EC2.2 Removed. See Compliance.SecurityControlId instead.
ProductFields.RuleId This field will be removed in favor of a single, standard-agnostic control ID. 1.3 Removed. See Compliance.SecurityControlId instead.
Description This field will no longer reference a standard. This PCI DSS control checks whether AWS Config is enabled in the current account and region. This AWS control checks whether AWS Config is enabled in the current account and region.
Severity Security Hub will no longer use the Product field to describe the severity of a finding. “Severity”: {
“Product”: 90,
“Label”: “CRITICAL”,
“Normalized”: 90,
“Original”: “CRITICAL”
},
“Severity”: {
“Label”: “CRITICAL”,
“Normalized”: 90,
“Original”: “CRITICAL”
},
Types This field will no longer reference a standard. [“Software and Configuration Checks/Industry and Regulatory Standards/PCI-DSS”] [“Software and Configuration Checks/Industry and Regulatory Standards”]
Compliance.RelatedRequirements This field will show related requirements across associated standards. [ “PCI DSS 10.5.2”,
“PCI DSS 11.5”]
[ “PCI DSS v3.2.1/10.5.2”,
“PCI DSS v3.2.1/11.5”,
“CIS AWS Foundations Benchmark v1.2.0/2.5”]
CreatedAt Format will remain the same, but value will reset when you turn on consolidated control findings. 2022-05-05T08:18:13.138Z 2022-09-25T08:18:13.138Z
FirstObservedAt Format will remain the same, but value will reset when you turn on consolidated control findings. 2022-05-07T08:18:13.138Z 2022-09-28T08:18:13.138Z
ProductFields.RecommendationUrl This field will be replaced by Remediation.Recommendation.Url. https://docs.aws.amazon.com/console/securityhub/EC2.2/remediation Removed. See Remediation.Recommendation.Url instead.
ProductFields.StandardsArn This field will be replaced by Compliance.AssociatedStandards. arn:aws:securityhub:::standards/aws-foundational-security-best-practices/v/1.0.0 Removed. See Compliance.AssociatedStandards instead.
ProductFields.StandardsControlArn This field will be removed because Security Hub will generate one finding for a security check across standards. arn:aws:securityhub:us-east-1:123456789012:control/aws-foundational-security-best-practices/v/1.0.0/Config.1 Removed.
ProductFields.StandardsGuideArn This field will be replaced by Compliance.AssociatedStandards. arn:aws:securityhub:::ruleset/cis-aws-foundations-benchmark/v/1.2.0 Removed. See Compliance.AssociatedStandards instead.
ProductFields.StandardsGuideSubscriptionArn This field will be removed because Security Hub will generate one finding for a security check across standards. arn:aws:securityhub:us-east-2:123456789012:subscription/cis-aws-foundations-benchmark/v/1.2.0 Removed.
ProductFields.StandardsSubscriptionArn This field will be removed because Security Hub will generate one finding for a security check across standards. arn:aws:securityhub:us-east-1:123456789012:subscription/aws-foundational-security-best-practices/v/1.0.0 Removed.
ProductFields.aws/securityhub/FindingId This field will no longer reference a standard. arn:aws:securityhub:us-east-1::product/aws/securityhub/arn:aws:securityhub:us-east-1:123456789012:subscription/aws-foundational-security-best-practices/v/1.0.0/Config.1/finding/751c2173-7372-4e12-8656-a5210dfb1d67 arn:aws:securityhub:us-east-1::product/aws/securityhub/arn:aws:securityhub:us-east-1:123456789012:security-control/Config.1/finding/751c2173-7372-4e12-8656-a5210dfb1d67

New values for customer-provided finding fields after turning on consolidated control findings

When you turn on consolidated control findings, Security Hub will archive the existing findings and generate new findings. To view archived findings, you can visit the Findings page of the Security Hub console with the Record state filter set to ARCHIVED, or use the GetFindings API action. Updates you’ve made to the original finding fields in the Security Hub console or by using the BatchUpdateFindings API action will not be preserved in the new findings (if needed, you can recover this data by referring to the archived findings).

Note the following changes to customer-provided control finding fields when you turn on consolidated control findings.

Customer-provided ASFF field Description of change after turning on consolidated control findings
Confidence Will reset to empty state.
Criticality Will reset to empty state.
Note Will reset to empty state.
RelatedFindings Will reset to empty state.
Severity The default severity of the finding (matches the severity of the control).
Types Will reset to standard-agnostic value.
UserDefinedFields Will reset to empty state.
VerificationState Will reset to empty state.
Workflow New failed findings will have a default value of NEW. New passed findings will have a default value of RESOLVED.

How to turn consolidated control findings on and off

Follow these instructions to turn consolidated control findings on and off.

New accounts

If you enable Security Hub for an AWS account for the first time on or after the time when consolidated control findings is released, by default consolidated control findings will be turned on for your account. You can turn it off at any time. However, we recommend keeping it turned on to minimize finding noise.

If you use the Security Hub integration with AWS Organizations, consolidated control findings will be turned on for new member accounts if the administrator account has turned on the feature. If the administrator account has turned it off, it will be turned off for new subordinate AWS accounts (member accounts) as well.

Existing accounts

If your Security Hub account already existed before consolidated control findings is released, your account will have consolidated control findings turned off by default. You can turn it on at any time. We recommend turning it on to minimize finding noise. If you use AWS Organizations, consolidated control findings will be turned on or off for existing member accounts based on the settings of the administrator account.

To turn consolidated control findings on and off (Security Hub console)

  1. In the navigation pane, choose Settings.
  2. Choose the General tab.
  3. For Controls, turn on Consolidated control findings. Turn it off to receive multiple findings for each standard.
  4. Choose Save.

To turn consolidated control findings on and off (Security Hub API)

  • Run the UpdateSecurityHubConfiguration API action. Use the new ControlFindingGenerator attribute to change whether an account uses consolidated control findings:
    • To turn on consolidated control findings, set ControlFindingGenerator equal to SECURITY_CONTROL.
    • To turn it off, set ControlFindingGenerator equal to STANDARD_CONTROL.

To turn consolidated control findings on and off (AWS CLI)

  • In the AWS CLI, run the update-security-hub-configuration command. Use the new control-finding-generator attribute to change whether an account uses consolidated control findings:
    • To turn on consolidated control findings, set control-finding-generator equal to SECURITY_CONTROL.
    • To turn it off, set control-finding-generator equal to STANDARD_CONTROL.

API permissions for consolidated control findings

You’ll need AWS Identity and Access Management (IAM) permissions for the following new API operations in order for consolidated control findings to work as expected:

  • BatchGetSecurityControls – Returns account and Region-specific data about a batch of controls.
  • ListSecurityControlDefinitions – Returns information about controls that apply to a specified standard.
  • ListStandardsControlAssociations – Identifies whether a control is currently associated with or dissociated from each enabled standard.
  • BatchGetStandardsControlAssociations – For a batch of controls, identifies whether each control is currently associated with or dissociated from a specified standard.
  • BatchUpdateStandardsControlAssociations – Used to associate a control with enabled standards that include the control, or to dissociate a control from enabled standards. This is a batch substitute for the UpdateStandardsControl API action if an administrator doesn’t want to allow member accounts to associate or dissociate controls.
  • BatchGetControlEvaluations (private API) – Retrieves the enablement and compliance status of a control, the findings count for a control, and the overall security score for controls.

How to prepare for control finding field and value changes

If your workflows don’t rely on the specific format of any control finding fields, no action is required to prepare for the feature releases. We recommend that you immediately turn on consolidated control findings.

Consider waiting to turn on consolidated control findings if you currently rely on the Automated Security Response on AWS solution for predefined response and remediation actions. That solution does not yet support consolidated control findings. If you turn consolidated control findings on now, actions you deployed using the Automated Security Response solution will no longer work.

If you rely on the specific format of any control finding fields (for example, for custom automation), carefully review the upcoming finding field and value changes to ensure that your workflows will continue to function as intended. Note that the changes noted in the first table in this post might impact you if you rely on the specified control finding fields and values.

The changes noted in the second table and third table in this post will only impact you if you turn on consolidated control findings. For example, if you rely on ProductFields.ControlId, GeneratorId, or Title, you’ll be impacted if you turn on consolidated control findings. As another example, if you’ve created an Amazon CloudWatch Events rule that initiates an action for a specific control ID (such as invoking an AWS Lambda function if the control ID equals CIS 2.7), you’ll need to update the rule to use CloudTrail.2, the new Compliance.SecurityControlId field for that control.

If you’ve created custom insights by using the control finding fields or values that will change (see previous tables), we recommend updating those insights to use the new fields or values.

Conclusion

This post covered the control finding fields and values that will change in Security Hub after release of the consolidated controls view and consolidated control findings features. We recommend that you carefully review the changes and update your workflows to start using the new fields and values as soon as the features become available.

For more information about the upcoming changes, see the Security Hub user guide, which includes value changes for GeneratorId , control title changes, and sample control findings before and after the upcoming feature releases.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Priyanka Prakash

Priyanka is a technical writer for AWS Security Hub. She enjoys helping customers understand how to effectively monitor their environment and address security issues. Prior to joining AWS, Priyanka worked for a cloud monitoring startup. In her personal time, Priyanka enjoys cooking and hiking.

New – Bring ML Models Built Anywhere into Amazon SageMaker Canvas and Generate Predictions

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/new-bring-ml-models-built-anywhere-into-amazon-sagemaker-canvas-and-generate-predictions/

Amazon SageMaker Canvas provides business analysts with a visual interface to solve business problems using machine learning (ML) without writing a single line of code. Since we introduced SageMaker Canvas in 2021, many users have asked us for an enhanced, seamless collaboration experience that enables data scientists to share trained models with their business analysts with a few simple clicks.

Today, I’m excited to announce that you can now bring ML models built anywhere into SageMaker Canvas and generate predictions.

New – Bring Your Own Model into SageMaker Canvas
As a data scientist or ML practitioner, you can now seamlessly share models built anywhere, within or outside Amazon SageMaker, with your business teams. This removes the heavy lifting for your engineering teams to build a separate tool or user interface to share ML models and collaborate between the different parts of your organization. As a business analyst, you can now leverage ML models shared by your data scientists within minutes to generate predictions.

Let me show you how this works in practice!

In this example, I share an ML model that has been trained to identify customers that are potentially at risk of churning with my marketing analyst. First, I register the model in the SageMaker model registry. SageMaker model registry lets you catalog models and manage model versions. I create a model group called 2022-customer-churn-model-group and then select Create model version to register my model.

Amazon SageMaker Model Registry

To register your model, provide the location of the inference image in Amazon ECR, as well as the location of your model.tar.gz file in Amazon S3. You can also add model endpoint recommendations and additional model information. Once you’ve registered your model, select the model version and select Share.

Amazon SageMaker Studio - Share models from model registry with SageMaker Canvas users

You can now choose the SageMaker Canvas user profile(s) within the same SageMaker domain you want to share your model with. Then, provide additional model details, such as information about training and validation datasets, the ML problem type, and model output information. You can also add a note for the SageMaker Canvas users you share the model with.

Amazon SageMaker Studio - Share a model from Model Registry with SageMaker Canvas users

Similarly, you can now also share models trained in SageMaker Autopilot and models available in SageMaker JumpStart with SageMaker Canvas users.

The business analysts will receive an in-app notification in SageMaker Canvas that a model has been shared with them, along with any notes you added.

Amazon SageMaker Canvas - Received model from SageMaker Studio

My marketing analyst can now open, analyze, and start using the model to generate ML predictions in SageMaker Canvas.

Amazon SageMaker Canvas - Imported model from SageMaker Studio

Select Batch prediction to generate ML predictions for an entire dataset or Single prediction to create predictions for a single input. You can download the results in a .csv file.

Amazon SageMaker Canvas - Generate Predictions

New – Improved Model Sharing and Collaboration from SageMaker Canvas with SageMaker Studio Users
We also improved the sharing and collaboration capabilities from SageMaker Canvas with data science and ML teams. As a business analyst, you can now select which SageMaker Studio user profile(s) you want to share your standard-build models with.

Your data scientists or ML practitioners will receive a similar in-app notification in SageMaker Studio once a model has been shared with them, along with any notes from you. In addition to just reviewing the model, SageMaker Studio users can now also, if needed, update the data transformations in SageMaker Data Wrangler, retrain the model in SageMaker Autopilot, and share back the updated model. SageMaker Studio users can also recommend an alternate model from the list of models in SageMaker Autopilot.

Once SageMaker Studio users share back a model, you receive another notification in SageMaker Canvas that an updated model has been shared back with you. This collaboration between business analysts and data scientists will help democratize ML across organizations by bringing transparency to automated decisions, building trust, and accelerating ML deployments.

Now Available
The enhanced, seamless collaboration capabilities for Amazon SageMaker Canvas, including the ability to bring your ML models built anywhere, are available today in all AWS Regions where SageMaker Canvas is available with no changes to the existing SageMaker Canvas pricing.

Start collaborating and bring your ML model to Amazon SageMaker Canvas today!

— Antje

Heads-Up: Amazon S3 Security Changes Are Coming in April of 2023

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/

Starting in April of 2023 we will be making two changes to Amazon Simple Storage Service (Amazon S3) to put our latest best practices for bucket security into effect automatically. The changes will begin to go into effect in April and will be rolled out to all AWS Regions within weeks.

Once the changes are in effect for a target Region, all newly created buckets in the Region will by default have S3 Block Public Access enabled and access control lists (ACLs) disabled. Both of these options are already console defaults and have long been recommended as best practices. The options will become the default for buckets that are created using the S3 API, S3 CLI, the AWS SDKs, or AWS CloudFormation templates.

As a bit of history, S3 buckets and objects have always been private by default. We added Block Public Access in 2018 and the ability to disable ACLs in 2021 in order to give you more control, and have long been recommending the use of AWS Identity and Access Management (IAM) policies as a modern and more flexible alternative.

In light of this change, we recommend a deliberate and thoughtful approach to the creation of new buckets that rely on public buckets or ACLs, and believe that most applications do not need either one. If your application turns out be one that does, then you will need to make the changes that I outline below (be sure to review your code, scripts, AWS CloudFormation templates, and any other automation).

What’s Changing
Let’s take a closer look at the changes that we are making:

S3 Block Public Access – All four of the bucket-level settings described in this post will be enabled for newly created buckets:

A subsequent attempt to set a bucket policy or an access point policy that grants public access will be rejected with a 403 Access Denied error. If you need public access for a new bucket you can create it as usual and then delete the public access block by calling DeletePublicAccessBlock (you will need s3:PutBucketPublicAccessBlock permission in order to call this function; read Block Public Access to learn more about the functions and the permissions).

ACLs Disabled – The Bucket owner enforced setting will be enabled for newly created buckets, making bucket ACLs and object ACLs ineffective, and ensuring that the bucket owner is the object owner no matter who uploads the object. If you want to enable ACLs for a bucket, you can set the ObjectOwnership parameter to ObjectWriter in your CreateBucket request or you can call DeleteBucketOwnershipControls after you create the bucket. You will need s3:PutBucketOwnershipControls permission in order to use the parameter or to call the function; read Controlling Ownership of Objects and Creating a Bucket to learn more.

Stay Tuned
We will publish an initial What’s New post when we start to deploy this change and another one when the deployment has reached all AWS Regions. You can also run your own tests to detect the change in behavior.

Jeff;

AWS Week in Review – December 12, 2022

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/aws-week-in-review-december-12-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

The world is asynchronous, is what Werner Vogels, Amazon CTO, reminded us during his keynote last week at AWS re:Invent. At the beginning of the keynote, he showed us how weird a synchronous world would be and how everything in nature is asynchronous. One example of an event-driven application he showcased during his keynote is Serverlesspresso, a project my team has been working on for the last year. And last week, we announced Serverlesspresso extensions, a new program that lets you contribute to Serverlesspresso and learn how event-driven applications can be extended.

Last Week’s Launches
Here are some launches that got my attention during the previous week.

Amazon SageMaker Studio now supports fine-grained data access control with AWS LakeFormation when accessing data through Amazon EMR. Now, when you connect to EMR clusters to SageMaker Studio notebooks, you can choose what runtime IAM role you want to connect with, and the notebooks will only access data and resources permitted by the attached runtime role.

Amazon Lex has now added support for Arabic, Cantonese, Norwegian, Swedish, Polish, and Finnish. This opens new possibilities to create chat bots and conversational experiences in more languages.

Amazon RDS Proxy now supports creating proxies in Amazon Aurora Global Database primary and secondary Regions. Now, building multi-Region applications with Amazon Aurora is simpler. RDS proxy sits between your application and the database pool and shares established database connections.

Amazon FSx for NetApp ONTAP launched many new features. First, it added the support for Nitro-based encryption of data in transit. It also extended NVMe read cache support to Single-AZ file systems. And it added four new features to ease the use of the service: easily assign a snapshot policy to your volumes, easily create data protection volumes, configure volumes so their tags are automatically copied to the backups, and finally, add or remove VPC route tables for your existing Multi-AZ file systems.

I would also like to mention two launches that happened before re:Invent but were not covered on the News Blog:

Amazon EventBridge Scheduler is a new capability from Amazon EventBridge that allows you to create, run, and manage scheduled tasks at scale. Using this new capability, you can schedule one-time or recurrent tasks across 270 AWS services.

AWS IoT RoboRunner is now generally available. Last year at re:Invent Channy wrote a blog post introducing the preview for this service. IoT RoboRunner is a robotic service that makes it easier to build and deploy applications for fleets of robots working seamlessly together.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

I would like to recommend this really interesting Amazon Science article about federated learning. This is a framework that allows edge devices to work together to train a global model while keeping customers’ data on-device.

Podcast Charlas Técnicas de AWS – If you understand Spanish, this podcast is for you. Podcast Charlas Técnicas is one of the official AWS podcasts in Spanish, and every other week there is a new episode. Today the final episode for season three launched, and in it, we discussed many of the re:Invent launches. You can listen to all the episodes directly from your favorite podcast app or at AWS Podcasts en español.

AWS open-source news and updates–This is a newsletter curated by my colleague Ricardo to bring you the latest open-source projects, posts, events, and more.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS Resiliency Hub Activation Day is a half-day technical virtual session to deep dive into the features and functionality of Resiliency Hub. You can register for free here.

AWS re:Invent recaps in your area. During the re:Invent week, we had lots of new announcements, and in the next weeks you can find in your area a recap of all these launches. All the events will be posted on this site, so check it regularly to find an event nearby.

AWS re:Invent keynotes, leadership sessions, and breakout sessions are available on demand. I recommend that you check the playlists and find the talks about your favorite topics in one collection.

That’s all for this week. Check back next Monday for another Week in Review!

— Marcia

Authority to operate (ATO) on AWS Program now available for customers in Spain

Post Syndicated from Greg Herrmann original https://aws.amazon.com/blogs/security/authority-to-operate-on-aws-program-now-available-for-customers-in-spain/

Meeting stringent security and compliance requirements in regulated or public sector environments can be challenging and time consuming, even for organizations with strong technical competencies. To help customers navigate the different requirements and processes, we launched the ATO on AWS Program in June 2019 for US customers. The program involves a community of expert AWS partners to help support and accelerate customers’ ability to meet their security and compliance obligations.

We’re excited to announce that we have now expanded the ATO on AWS Program to Spain. As part of the launch in Spain, we recruited and vetted five partners with a demonstrated competency in helping customers meet Spanish and European Union (EU) regulatory compliance and security requirements, such as the General Data Protection Regulation (GDPR), Esquema Nacional de Seguridad (ENS), and European Banking Authority guidelines.

How Does the ATO on AWS Program support customers?

The primary offering of the ATO on AWS Program is access to a community of vetted, expert partners that specialize in customers’ authorization needs, whether it be architecting, configuring, deploying, or integrating tools and controls. The team also provides direct engagement activities to introduce you to publicly available and no-cost resources, tools, and offerings so you can work to meet your security obligations on AWS. These activities include one-on-one meetings, answering questions, technical workshops (in specific cases), and more.

Who are the partners?

Partners in the ATO on AWS Program go through a rigorous evaluation conducted by a team of AWS Security and Compliance experts. Before acceptance into the program, the partners complete a checklist of criteria and provide detailed evidence that they meet those criteria.

Our initial launch in Spain includes the following five partners that have successfully met the criteria to join the program. Each partner has also achieved the Esquema Nacional de Seguridad certification.

  • ATOS – a global leader in digital transformation, cybersecurity, and cloud and high performance computing. ATOS was ranked #1 in Managed Security Services (MSS) revenue by Gartner in 2021.
  • Indra Sistemas – a global technology and consulting company that provides proprietary solutions for the transport and defense markets. It also offers digital transformation consultancy and information technologies in Spain and Latin America through its affiliate Minsait.
  • NTT Data EMEAL ­– an operational company created from an alliance between everis and NTT DATA EMEAL to support clients in Europe and Latin America. NTT Data EMEAL supports customers through strategic consulting and advisory services, new technologies, applications, infrastructure, IT modernization, and business process outsourcing (BPO).
  • Telefónica Tech – a leading company in digital transformation. Telefónica Tech combines cybersecurity and cloud technologies to help simplify technological environments and build appropriate solutions for customers.
  • T-Systems – a leading service provider for the public sector in Spain. As an AWS Premier Tier Services Partner and Managed Service Provider, T-Systems maintains the Security and Migration Competencies, supporting customers with migration and secure operation of applications.

For a complete list of ATO on AWS Program partners, see the ATO on AWS Partners page.

Engage the ATO on AWS Program

Customers seeking support can engage the ATO on AWS Program and our partners in multiple ways. The best way to reach us is to complete a short, online ATO on AWS Questionnaire so we can learn more about your timeline and goals. If you prefer to engage AWS partners directly, see the complete list of our partners and their contact information at ATO on AWS Partners.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Greg Herrmann

Greg Herrmann

Greg has worked in the security and compliance field for over 18 years, supporting both classified and unclassified workloads for U.S. federal and DoD customers. He has been with AWS for more than 6 years as a Senior Security Partner Strategist for the Security and Compliance Partner Team, working with AWS partners and customers to accelerate security and compliance processes.

Borja Larrumbide

Borja Larrumbide

Borja is a Security Assurance Manager for AWS in Spain and Portugal. Previously, he worked at companies such as Microsoft and BBVA in different roles and sectors. Borja is a seasoned security assurance practitioner with years of experience engaging key stakeholders at national and international levels. His areas of interest include security, privacy, risk management, and compliance.

AWS achieves GNS Portugal certification for classified information

Post Syndicated from Rodrigo Fiuza original https://aws.amazon.com/blogs/security/aws-achieves-gns-portugal-certification-for-classified-information/

GNS Logo

We continue to expand the scope of our assurance programs at Amazon Web Services (AWS), and we are pleased to announce that our Regions and AWS Edge locations in Europe are now certified by the Portuguese GNS/NSO (National Security Office) at the National Restricted level. This certification demonstrates our ongoing commitment to adhere to the heightened expectations for cloud service providers to process, transmit, and store classified data.

The GNS certification is based on NIST SP800-53 R4 and CSA CCM v4 frameworks, with the goal of protecting the processing and transmission of classified information.

AWS was evaluated by Adyta Lda, an independent third-party auditor, and by GNS Portugal. The Certificate of Compliance illustrating the compliance status of AWS is available on the GNS Certifications page and through AWS Artifact. AWS Artifact is a self-service portal for on-demand access to AWS compliance reports. Sign in to AWS Artifact in the AWS Management Console, or learn more at Getting Started with AWS Artifact.

As of this writing, 26 services offered in Europe are in scope of this certification. For up-to-date information, including when additional services are added, see the AWS Services in Scope by Compliance Program and select GNS.

AWS strives to continuously bring services into the scope of its compliance programs to help you meet your architectural and regulatory needs. If you have questions or feedback about GNS Portugal compliance, reach out to your AWS account team.

To learn more about our compliance and security programs, see AWS Compliance Programs. As always, we value your feedback and questions; reach out to the AWS Compliance team through the Contact Us page.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Rodrigo Fiuza

Rodrigo is a Security Audit Manager at AWS, based in São Paulo. He leads audits, attestations, certifications, and assessments across Latin America, Caribbean and Europe. Rodrigo has previously worked in risk management, security assurance, and technology audits for the past 12 years.

Simplify data ingestion from Amazon S3 to Amazon Redshift using auto-copy

Post Syndicated from Jason Pedreza original https://aws.amazon.com/blogs/big-data/simplify-data-ingestion-from-amazon-s3-to-amazon-redshift-using-auto-copy/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze your data using standard SQL and your existing business intelligence (BI) tools. Tens of thousands of customers today rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it the most widely used cloud data warehouse.

Data ingestion is the process of getting data to Amazon Redshift. You can leverage one of the many zero-ETL integration methods to make data available in Amazon Redshift directly. However, if your data is in your Amazon S3 bucket, then you can simply load data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift using the COPY command. A COPY command is the most efficient way to load a table from S3 because it uses the Amazon Redshift’s massively parallel processing (MPP) architecture to read and load data in parallel.

Amazon Redshift launched auto-copy support to simplify data loading from Amazon S3 into Amazon Redshift. You can now setup continuous file ingestion rules to track your Amazon S3 paths and automatically load new files without the need for additional tools or custom solutions. This also enables end users to have the latest data available in Amazon Redshift shortly after the source data is available.

This post shows you how to build automatic file ingestion pipelines in Amazon Redshift when source files are located on Amazon S3 by using a simple SQL command. In addition, we show you how to enable auto-copy using auto-copy jobs, how to monitor jobs, considerations, and best practices.

Overview of the auto-copy feature in Amazon Redshift

The auto-copy feature in Amazon Redshift leverages the S3 event integration to automatically load data into Amazon Redshift and simplifies automatic data loading from Amazon S3 with a simple SQL command. You can enable Amazon Redshift auto-copy by creating auto-copy jobs. A auto-copy job is a database object that stores, automates, and reuses the COPY statement for newly created files that land in the S3 folder.

The following diagram illustrates this process.

S3 event integration and auto-copy jobs have the following benefits:

  • Users can now load data from Amazon S3 automatically without having to build a pipeline or using an external framework
  • auto-copy jobs offer automatic and incremental data ingestion from an Amazon S3 location without the need to implement a custom solution
  • This functionality comes at no additional cost
  • Existing COPY statements can be converted into auto-copy jobs by appending the JOB CREATE <job_name> parameter
  • It keeps track of loaded files and minimizes data duplication.
  • It can be quickly set up using a simple SQL statement using your choice of JDBC/ODBC clients.
  • It has automatic error handling of bad quality data files.
  • It has a mechanism to load-once for each file. This means that there is no need to generate explicit manifest files.

Prerequisites

To get started with auto-copy, you need the following prerequisites:

  • An AWS account
  • An encrypted Amazon Redshift provisioned cluster or Amazon redshift serverless workgroup
  • An Amazon S3 bucket
  • Add following to the Amazon S3 bucket policy
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Auto-Copy-Policy-01",
                "Effect": "Allow",
                "Principal": {
                    "Service":"redshift.amazonaws.com"
                        
                    
                },
                "Action": [
                    "s3:GetBucketNotification",
                    "s3:PutBucketNotification",
                    "s3:GetBucketLocation"
                ],
                "Resource": "arn:aws:s3:::<<your-s3-bucket-name>>",
                "Condition": {
                    "StringEquals": {
                        "aws:SourceArn": "arn:aws:redshift:<region-name>:<aws-account-id>:integration:*",
                        "aws:SourceAccount": "<aws-account-id>"
                    }
                }
            }
        ]
    }

Set up Amazon S3 event Integration

An Amazon S3 event integration facilitates seamless and automated data ingestion from S3 buckets into an Amazon Redshift data warehouse, streamlining the process of transferring and storing data for analytical purposes

  1. Sign in to the AWS Management Console and Navigate to Amazon Redshift home page. Under Integrations section choose S3 event integrations

  2. Choose Create S3 event integration

  3. Enter Integration name and Description, choose Next

  4. Choose Browse S3 buckets, a dialog box pops up, select the Amazon S3 bucket and choose Continue

  5. Amazon S3 bucket is selected. Choose Next
  6. Choose Browse Redshift data warehouse

  7. Choose the Amazon Redshift data warehouse and choose Continue

  8. Then Amazon Redshift resource policy needs access to S3 event integration. In case of Resource policy error, check Fix it for me and choose Next

  9. Add Tags as required and choose Next

  10. Review changes and choose Create S3 event integration
  11. An S3 event integration is created. Wait until the status of S3 event integration is Active

Set up auto-copy jobs

In this section, we demonstrate how to automate data loading of files from Amazon S3 into Amazon Redshift. With the existing COPY syntax, we add the JOB CREATE parameter to perform a one-time setup for automatic file ingestion. See the following code:

COPY <table-name>
FROM 's3://<s3-object-path>'
[COPY PARAMETERS...]
JOB CREATE <job-name> [AUTO ON | OFF];

Auto ingestion is enabled by default on auto-copy jobs. Files already present at the S3 location will not be visible to the auto-copy job. Only files added after JOB creation are tracked by Amazon Redshift.

Automate ingestion from a single data source

With a auto-copy job, you can automate ingestion from a single data source by creating one job and specifying the path to the S3 objects that contain the data. The S3 object path can reference a set of folders that have the same key prefix.

In this example, we have multiple files that are being loaded on a daily basis containing the sales transactions across all the stores in the US. For this we can create a store_sales folder in the bucket.

The following code creates the store_sales table:

DROP TABLE IF EXISTS public.store_sales;
CREATE TABLE IF NOT EXISTS public.store_sales
(
  ss_sold_date_sk int4,            
  ss_sold_time_sk int4,     
  ss_item_sk int4 not null,      
  ss_customer_sk int4,           
  ss_cdemo_sk int4,              
  ss_hdemo_sk int4,         
  ss_addr_sk int4,               
  ss_store_sk int4,           
  ss_promo_sk int4,           
  ss_ticket_number int8 not null,        
  ss_quantity int4,           
  ss_wholesale_cost numeric(7,2),          
  ss_list_price numeric(7,2),              
  ss_sales_price numeric(7,2),
  ss_ext_discount_amt numeric(7,2),             
  ss_ext_sales_price numeric(7,2),              
  ss_ext_wholesale_cost numeric(7,2),           
  ss_ext_list_price numeric(7,2),               
  ss_ext_tax numeric(7,2),                 
  ss_coupon_amt numeric(7,2), 
  ss_net_paid numeric(7,2),   
  ss_net_paid_inc_tax numeric(7,2),             
  ss_net_profit numeric(7,2),
  primary key (ss_item_sk, ss_ticket_number)
 ) 
DISTKEY (ss_item_sk) 
SORTKEY (ss_sold_date_sk);

Next, we create the auto-copy job to automatically load the gzip-compressed files into the store_sales table:

COPY store_sales
FROM 's3://aws-redshift-s3-auto-copy-demo/store_sales'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
gzip delimiter '|' EMPTYASNULL
region 'us-east-1'
JOB CREATE job_store_sales AUTO ON;

Each day’s sales transactions are loaded to their own folder in Amazon S3.

Now upload the files for transaction sold on 2002-12-31. Each folder contains multiple gzip-compressed files.

Since the auto-copy job is already created, it automatically loads the gzip-compressed files located in the S3 object path specified in the COPY command to the store_sales table.

Let’s run a query to get the daily total of sales transactions across all the stores in the US:

SELECT ss_sold_date_sk, count(1)
  FROM store_sales
GROUP BY ss_sold_date_sk;

The output shown comes from the transactions sold on 2002-12-31.

The following day, incremental sales transactions data are loaded to a new folder in the same S3 object path.

As new files arrive to the same S3 object path, the auto-copy job automatically loads the unprocessed files to the store_sales table in an incremental fashion.

All new sales transactions for 2003-01-01 are automatically ingested, which can be verified by running the following query:

SELECT ss_sold_date_sk, count(1)
  FROM store_sales
GROUP BY ss_sold_date_sk;

Automate ingestion from multiple data sources

We can also load an Amazon Redshift table from multiple data sources. When using a pub/sub pattern where multiple S3 buckets populate data to an Amazon Redshift table, you have to maintain multiple data pipelines for each source/target combination. With new parameters in the COPY command, this can be automated to handle data loads efficiently.

In the following example, the Customer_1 folder has Green Cab Company sales data, and the Customer_2 folder has Red Cab Company sales data. We can use the COPY command with the JOB parameter to automate this ingestion process.

The following screenshot shows sample data stored in files. Each folder has similar data but for different customers.

The target for these files in this example is the Amazon Redshift table cab_sales_data.

Define the target table cab_sales_data:

DROP TABLE IF EXISTS cab_sales_data;
CREATE TABLE IF NOT EXISTS cab_sales_data
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
DISTSTYLE EVEN
SORTKEY (passenger_count,pickup_datetime);

You can define two auto-copy jobs as shown in the following code to handle and monitor the ingestion of sales data belonging to different customers, in our case Customer_1 and Customer_2. These jobs monitor the Customer_1 and Customer_2 folders and load new files that are added here.

COPY public.cab_sales_data
FROM 's3://aws-redshift-s3-auto-copy-demo/Customer_1'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
JOB CREATE job_green_cab AUTO ON;

COPY public.cab_sales_data
FROM 's3:// aws-redshift-s3-auto-copy-demo/Customer_2'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
JOB CREATE job_red_cab AUTO ON;

After setting up the two jobs, we can upload the relevant files into their respective folders. This will make sure that the data is loaded efficiently as soon as the files arrive. Each customer is assigned its own vendorid, as shown in the following output:

SELECT vendorid,
       sum(passenger_count) as total_passengers 
  FROM cab_sales_data
GROUP BY vendorid;

result 1

Manually run a auto-copy job

There might be scenarios wherein the auto-copy job needs to be paused, meaning it needs to stop looking for new files, for example, to fix a corrupted data pipeline at the data source.

In that case, either use the auto-copy job ALTER command to set AUTO to OFF or create a new auto-copy job with AUTO OFF. Once this is set, auto copy will no longer look for new files.

If necessary, users can manually invoke auto-copy job which will do the work and ingest if new files are found.

auto-copy job RUN <auto-copy job Name>

You can disable “AUTO ON” in the existing auto-copy job using the following command:

auto-copy job ALTER <auto-copy job Name> AUTO OFF

The following table compares the syntax and data duplication between a regular copy statement and the new auto-copy job

. Copy Auto-copy job
Syntax COPY <table-name>
FROM 's3://<s3-object-path>'
[COPY PARAMETERS...]
COPY <table-name>
FROM 's3://<s3-object-path>'
[COPY PARAMETERS...]
JOB CREATE <job-name>;
Data Duplication If it is run multiple times against the same S3 folder, it will load the data again, resulting in data duplication. It will not load the same file twice, preventing data duplication.

Error handling and monitoring for auto-copy jobs

auto-copy jobs continuously monitor the S3 folder specified during job creation and perform ingestion whenever new files are created. New files created under the S3 folder are loaded exactly once to avoid data duplication.

By default, if there are data or format issues with the specific files, the auto-copy job will fail to ingest the files with a load error and log details to the system tables. The auto-copy job will remain AUTO ON with new data files and will continue to ignore previously failed files.

Amazon Redshift provides the following system tables for users to monitor or troubleshoot auto-copy jobs as needed:

  • List auto-copy jobs – Use SYS_COPY_JOB to list the auto-copy jobs stored in the database:
SELECT * 
  FROM sys_copy_job;

  • Get a summary of a auto-copy job – Use the SYS_LOAD_HISTORY view to get the aggregate metrics of a auto-copy job operation by specifying the copy_job_id. It shows the aggregate metrics of the files that have been processed by a auto-copy job.
SELECT *
  FROM sys_load_history
 WHERE copy_job_id = 274978;

  • Get details of a auto-copy job – Use STL_LOAD_COMMITS to get the status and details of each file that was processed by a auto-copy job:
SELECT *
  FROM stl_load_commits
 WHERE copy_job_id = 274978
ORDER BY curtime ASC;

  • Get exception details of a auto-copy job – Use STL_LOAD_ERRORS to get the details of files that failed to ingest from a auto-copy job:
SELECT   query,
    slice,
    starttime , 
    filename,
    line_number,
    colname,
    type,
    err_code,
    err_reason,
    copy_job_id,
    raw_line,
    raw_field_value
  FROM stl_load_errors
 WHERE copy_job_id = 274978;

Auto-copy job best practices

In an auto-copy job, when a new file is detected and ingested (automatically or manually), Amazon Redshift stores the file name and doesn’t run this specific job when a new file is created with the same file name.

The following are the recommended best practices when working with files using the auto-copy job:

  • Use unique file names for each file in a auto-copy job (for example, 2022-10-15-batch-1.csv). However, you can use the same file name as long as it’s from different auto-copy jobs:
    • job_customerA_saless3://redshift-blogs/sales/customerA/2022-10-15-sales.csv
    • job_customerB_saless3://redshift-blogs/sales/customerB/2022-10-15-sales.csv
  • Do not update file contents. Do not overwrite existing files. Changes in existing files will not be reflected to the target table. The auto-copy job doesn’t pick up updated or overwritten files, so make sure they’re renamed as new file names for the auto-copy job to pick up.
  • Run regular COPY statements (not a job) if you need to ingest a file that was already processed by your auto-copy job. (COPY statement without a JOB CREATE syntax doesn’t track loaded files.) For example, this is helpful in scenarios where you don’t have control of the file name and the initial file received failed. The following figure shows a typical workflow in this case.

  • Delete and recreate your auto-copy job if you want to reset file tracking history and start over. You can drop auto-copy job using following command.
    auto-copy job DROP <auto-copy job Name>

auto-copy job considerations

Here are the main things to consider when using auto-copy:

For additional details on other considerations for auto-copy, refer to the AWS documentation.

Customer feedback

GE Aerospace is a global provider of jet engines, components, and systems for commercial and military aircraft. The company has been designing, developing, and manufacturing jet engines since World War I.

“GE Aerospace uses AWS analytics and Amazon Redshift to enable critical business insights that drive important business decisions. With the support for auto-copy from Amazon S3, we can build simpler data pipelines to move data from Amazon S3 to Amazon Redshift. This accelerates our data product teams’ ability to access data and deliver insights to end users. We spend more time adding value through data and less time on integrations.”

Alcuin Weidus Sr Principal Data Architect at GE Aerospace

Conclusion

This post demonstrated how to automate data ingestion from Amazon S3 to Amazon Redshift using the auto-copy feature. This new functionality helps make Amazon Redshift data ingestion easier than ever, and will allow SQL users to get access to the most recent data using a simple SQL command.

Users can begin ingesting data to Redshift from Amazon S3 with simple SQL commands and gain access to the most up-to-date data without the need for third-party tools or custom implementation.


About the authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 15+ years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Omama Khurshid is an Acceleration Lab Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, watching movies, listening to music, and learning new technologies.

Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Eren Baydemir, a Technical Product Manager at AWS, has 15 years of experience in building customer-facing products and is currently focusing on data lake and file ingestion topics in the Amazon Redshift team. He was the CEO and co-founder of DataRow, which was acquired by Amazon in 2020.

Eesha Kumar is an Analytics Solutions Architect with AWS. He works with customers to realize the business value of data by helping them build solutions using the AWS platform and tools.

Satish Sathiya is a Senior Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

 Hangjian Yuan is a Software Development Engineer at Amazon Redshift. He’s passionate about analytical databases and focuses on delivering cutting-edge streaming experiences for customers.

Renewal of AWS CyberGRX assessment to enhance customers’ third-party due diligence process

Post Syndicated from Naranjan Goklani original https://aws.amazon.com/blogs/security/renewal-of-aws-cybergrx-assessment-to-enhance-customers-third-party-due-diligence-process/

CyberGRX

Amazon Web Services (AWS) is pleased to announce renewal of the AWS CyberGRX cyber risk assessment report. This third-party validated report helps customers perform effective cloud supplier due diligence on AWS and enhances their third-party risk management process.

With the increase in adoption of cloud products and services across multiple sectors and industries, AWS has become a critical component of customers’ third-party environments. Regulated customers are held to high standards by regulators and auditors when it comes to exercising effective due diligence on third parties.

Many customers use third-party cyber risk management (TPCRM) services such as CyberGRX to better manage risks from their evolving third-party environments and to drive operational efficiencies. To help with such efforts, AWS has completed the CyberGRX assessment of its security posture. CyberGRX security analysts perform the assessment and validate the results annually.

The CyberGRX assessment applies a dynamic approach to third-party risk assessment. This approach integrates advanced analytics, threat intelligence, and sophisticated risk models with vendors’ responses to provide an in-depth view of how a vendor’s security controls help protect against potential threats.

Vendor profiles are continuously updated as the risk level of cloud service providers changes, or as AWS updates its security posture and controls. This approach eliminates outdated static spreadsheets for third-party risk assessments, in which the risk matrices are not updated in near real time.

In addition, AWS customers can use the CyberGRX Framework Mapper to map AWS assessment controls and responses to well-known industry standards and frameworks, such as National Institute of Standards and Technology (NIST) 800-53, NIST Cybersecurity Framework, International Organization for Standardization (ISO) 27001, Payment Card Industry Data Security Standard (PCI DSS), and U.S. Health Insurance Portability and Assessment Act (HIPAA). This mapping can reduce customers’ third-party supplier due-diligence burden.

Customers can access the AWS CyberGRX report at no additional cost. Customers can request access to the report by completing an access request form, available on the AWS CyberGRX page.

As always, we value your feedback and questions. Reach out to the AWS Compliance team through the Contact Us page. If you have feedback about this post, submit comments in the Comments section below. To learn more about our other compliance and security programs, see AWS Compliance Programs.

Want more AWS Security news? Follow us on Twitter.

Naranjan Goklani

Naranjan Goklani

Naranjan is a Security Audit Manager at AWS, based in Toronto (Canada). He leads audits, attestations, certifications, and assessments across North America and Europe. Naranjan has more than 13 years of experience in risk management, security assurance, and performing technology audits. Naranjan previously worked in one of the Big 4 accounting firms and supported clients from the financial services, technology, retail, ecommerce, and utilities industries.

Visualize and create your serverless workloads with AWS Application Composer

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/visualize-and-create-your-serverless-workloads-with-aws-application-composer/

This post is written by Luca Mezzalira, Principal Specialist Solutions Architect.

Today, AWS is launching a preview of AWS Application Composer, a visual designer that you can use to build your serverless applications from multiple AWS services.

In distributed systems, empowering teams is a cultural shift needed for enabling developers to help translate business capabilities into code.

This doesn’t mean every team works in isolation. Different teams or even new-joiners must understand what they are building to contribute to a project. The best way to understand architecture quickly is by using diagrams. Unfortunately, architectural diagrams are often outdated. Often, when releasing a workload in production, there are already discrepancies from the initial design and infrastructure.

Developers new to building serverless applications can face a learning curve when composing applications from multiple AWS services. They must understand how to configure each service, and then learn and write infrastructure as code (IaC) to deploy their application.

Example scenario

Emma is a cloud architect working for a video on-demand platform where every user can access the content after subscribing to the service. In the next few months, the marketing team wants to start a campaign to increase the user base using discount codes for new users only.

She collaborates with a team of developers who are new to building serverless applications. They must design a discount code service that can scale to thousands of transactions per second. There are many requirements to implement this service:

  • Gathering the gift code from a user.
  • Verifying the discount code is available.
  • Applying the discount code to the invoice at the end of the month.

Based on these requirements and default SLAs available for all the platform services, Emma designs a high-level architecture with the key elements needed for building this microservice.

Discount code service high-level architecture

Discount code service high-level architecture

Her idea is to receive a request from clients with a discount code in the payload, and validate the availability of the discount code in a database. The service then asynchronously processes different discount codes in batches to reduce traffic to downstream dependencies and reduce the cost of the overall infrastructure.

This approach ensures that the service can scale in the future beyond the initial traffic volume. It simplifies the management and implementation of the discount code service and other parts of the system with a loosely coupled architecture.

After discussing the architecture with her developers, she opens Application Composer in the AWS Management Console and starts building the implementation using serverless services.

Application Composer initial screen

Application Composer initial screen

To start, she selects New blank project and selects a local file system folder to save the project files.

Application Composer create blank project

Application Composer create blank project

Granting Application Composer access to your local project files allows near real time bidirectional syncing of changes between the console interface and locally stored project files. When you update a property with the Application Composer interface, it’s reflected in the files stored locally. When you change a local file in your IDE, it automatically reflects in the Application Composer canvas.

After creating the project, Emma drags the AWS resources she needs from the left sidebar for expressing the initial design agreed with the team.

Using Application Composer, you can drag serverless resources on the canvas and connect them together. In the background, Application Composer generates the infrastructure as code AWS CloudFormation template for you.

Application Composer canvas

Application Composer canvas

For example, this is the default configuration generated when you drag a Lambda function onto the canvas. The following code is present in the template view:

  Function:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: !Sub ${AWS::StackName}-Function
      Description: !Sub
        - Stack ${AWS::StackName} Function ${ResourceName}
        - ResourceName: Function
      CodeUri: src/Function
      Handler: index.handler
      Runtime: nodejs14.x
      MemorySize: 3008
      Timeout: 30
      Tracing: Active

Application Composer incorporates some helpful default property values, which are sometimes overlooked by developers new to serverless workloads. These include activating tracing using AWS X-Ray or increasing a function timeout, for instance.

You can change these parameters either in the CloudFormation template inside Application Composer or by visually selecting a resource. In the previous example, you can update the Lambda function parameters by opening the resource properties panel.

Application Composer resource panel

Application Composer resource panel

When you synchronize an Application Composer project with the local system, you can change the CloudFormation template from a code editor. This reflects the change in the Application Composer interface automatically.

When you connect two elements in the canvas, Application Composer sets default IAM policies, environment variables for Lambda functions, and event subscriptions where applicable.

For instance, if you have a Lambda function that interacts with an Amazon DynamoDB table and Amazon SQS queue, Application Composer generates the following configuration for the Lambda function.

Function:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: !Sub ${AWS::StackName}-Function
      Description: !Sub
        - Stack ${AWS::StackName} Function ${ResourceName}
        - ResourceName: Function
      CodeUri: src/Function
      Handler: index.handler
      […]
      Environment:
        Variables:
          QUEUE_NAME: !GetAtt Queue.QueueName
          QUEUE_ARN: !GetAtt Queue.Arn
          QUEUE_URL: !Ref Queue
          TABLE_NAME: !Ref Table
          TABLE_ARN: !GetAtt Table.Arn
      Policies:
        - SQSSendMessagePolicy:
            QueueName: !GetAtt Queue.QueueName
        - DynamoDBCrudPolicy:
            TableName: !Ref Table

This helps new builders when designing their first serverless applications and provides an initial configuration, which more advanced builders can amend. This allows you to include good operational practices when designing a serverless application.

Emma’s team continues to add together the different services needed to express the discount code architecture. This is the final result in Application Composer:

Discount code architecture in Application Composer

Discount code architecture in Application Composer

  1. The application includes an Amazon API Gateway endpoint that exposes the API needed for submitting a discount code to the system.
  2. The POST API triggers a Lambda function that first validates that the discount code is still available.
  3. This is stored using a DynamoDB table
  4. After successfully validating the discount code, the function adds a message to an SQS queue and returns a successful response to the client.
  5. Another Lambda function retrieves the message from the SQS queue and sends an invoice.

Using this approach optimizes the Lambda function invocation for speed as the remaining operations are handled asynchronously. This also simplifies the complexity and cost of the architecture because you can aggregate multiple discount codes per user SQS batching, rather than scaling the service when requests arrive from the users.

The team agrees to use this as the initial design of their service. In the future, they plan to integrate with their authentication mechanism. They add Lambda Powertools for observability, and additional libraries developed internally to make the project compliant with company standards.

Application Composer has created all the files needed to start the project in Emma’s local file system including the CloudFormation template .yaml file and the Lambda functions’ handlers.

Application Composer generated files

Application Composer generated files

Emma can now upload the outline of this service to a version control system and share the artifacts with other developers who can start coding the business logic.

Additional features

Application Composer includes a resource list tab within the left-side panel that allows you to quickly browse available resources.

Application Composer browse available resources

Application Composer browse available resources

You can also group resources semantically for simplifying the visualization inside the canvas. This helps when you have a large application in the canvas and you want to select an element quickly without dragging the canvas around to find the resource. This feature doesn’t impact the infrastructure generated.

Application Composer grouping

Application Composer grouping

Application Composer adds some metadata to the CloudFormation template to allow the canvas to group resources together when the project is loaded again.

Metadata:
  AWS::Composer::Groups:
    Group:
      Label: Group
      Members:
        - CodesQueue
        - CodesTable

You can use Application Composer beyond building new serverless workloads. You can load existing CloudFormation templates by selecting Load existing project in the Create project dialog.

Application Composer load existing project

Application Composer load existing project

You can use this to define your blueprints with organizational best practices and then visualize them within Application Composer. This helps teams collaborate when starting new serverless services. You can add resources from an existing base template to build serverless microservices or event-driven architectures.

Integration with AWS SAM

AWS Serverless Application Model (AWS SAM) recently announced the general availability of AWS SAM Accelerate to accelerate the feedback loop and testing of your code and cloud infrastructure by synchronizing only project changes. You can use Application Composer together with AWS SAM Accelerate to more simply visually build and then test your serverless applications in the cloud.

To learn more about AWS SAM Accelerate, watch this live demo.

Where Application Composer fits into the development process

Emma used Application Composer to help her team for this project but has ideas on further ways to use it.

  • Rapid prototyping.
  • Reviewing and collaboratively evolving existing serverless projects.
  • Generating diagrams for documentation or Wikis.
  • On-boarding new team members to a project
  • Reducing the first steps to deploy something in an AWS Cloud account.

Application Composer availability

Application Composer is currently available as a public preview in the following Regions: Frankfurt (eu-central-1), Ireland (eu-west-1), Ohio (us-east-2), Oregon (us-west-2), North Virginia (us-east-1) and Tokyo (ap-northeast-1).

Application Composer is available at no additional cost and can be accessed via the AWS Management Console.

Conclusion

Application Composer is a visual designer to help developers and architects express and build their application architecture. They can iterate on their ideas with colleagues and create documentation for others working on the application for the first time. You can use Application Composer during multiple stages of your software development lifecycle, reducing the friction in getting your project started and into production.

Currently, Application Composer supports a limited number of services that we plan to add to in the future. Let us know which services you would like to see included.

As a public preview, we are looking for suggestions and ideas to evolve the tool. We are looking for ways to help you and your teams to speed up the adoption of serverless workloads inside your organization. Add a comment to this post or tweet with the tag #AWSAppComposerWishlist.

For more serverless learning resources, visit Serverless Land.

New – Process PDFs, Word Documents, and Images with Amazon Comprehend for IDP

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/now-process-pdfs-word-documents-and-images-with-amazon-comprehend-for-idp/

Today we are announcing a new Amazon Comprehend feature for intelligent document processing (IDP). This feature allows you to classify and extract entities from PDF documents, Microsoft Word files, and images directly from Amazon Comprehend without you needing to extract the text first.

Many customers need to process documents that have a semi-structured format, like images of receipts that were scanned or tax statements in PDF format. Until today, those customers first needed to preprocess those documents using optical character recognition (OCR) tools to extract the text. Then they could use Amazon Comprehend to classify and extract entities from those preprocessed files.

Now with Amazon Comprehend for IDP, customers can process their semi-structured documents, such as PDFs, docx, PNG, JPG, or TIFF images, as well as plain-text documents, with a single API call. This new feature combines OCR and Amazon Comprehend’s existing natural language processing (NLP) capabilities to classify and extract entities from the documents. The custom document classification API allows you to organize documents into categories or classes, and the custom-named entity recognition API allows you to extract entities from documents like product codes or business-specific entities. For example, an insurance company can now process scanned customers’ claims with fewer API calls. Using the Amazon Comprehend entity recognition API, they can extract the customer number from the claims and use the custom classifier API to sort the claim into the different insurance categories—home, car, or personal.

Starting today, Amazon Comprehend for IDP APIs are available for real-time inferencing of files, as well as for asynchronous batch processing on large document sets. This feature simplifies the document processing pipeline and reduces development effort.

Getting Started
You can use Amazon Comprehend for IDP from the AWS Management Console, AWS SDKs, or AWS Command Line Interface (CLI).

In this demo, you will see how to asynchronously process a semi-structured file with a custom classifier. For extracting entities, the steps are different, and you can learn how to do it by checking the documentation.

In order to process a file with a classifier, you will first need to train a custom classifier. You can follow the steps in the Amazon Comprehend Developer Guide. You need to train this classifier with plain text data.

After you train your custom classifier, you can classify documents using either asynchronous or synchronous operations. For using the synchronous operation to analyze a single document, you need to create an endpoint to run real-time analysis using a custom model. You can find more information about real-time analysis in the documentation. For this demo, you are going to use the asynchronous operation, placing the documents to classify in an Amazon Simple Storage Service (Amazon S3) bucket and running an analysis batch job.

To get started classifying documents in batch from the console, on the Amazon Comprehend page, go to Analysis jobs and then Create job.

Create new job

Then you can configure the new analysis job. First, input a name and pick Custom classification and the custom classifier you created earlier.

Then you can configure the input data. First, select the S3 location for that data. In that location, you can place your PDFs, images, and Word Documents. Because you are processing semi-structured documents, you need to choose One document per file. If you want to override Amazon Comprehend settings for extracting and parsing the document, you can configure the Advanced document input options.

Input data for analysis job

After configuring the input data, you can select where the output of this analysis should be stored. Also, you need to give access permissions for this analysis job to read and write on the specified Amazon S3 locations, and then you are ready to create the job.

Configuring the classification job

The job takes a few minutes to run, depending on the size of the input. When the job is ready, you can check the output results. You can find the results in the Amazon S3 location you specified when you created the job.

In the results folder, you will find a .out file for each of the semi-structured files Amazon Comprehend classified. The .out file is a JSON, in which each line represents a page of the document. In the amazon-textract-output directory, you will find a folder for each classified file, and inside that folder, there is one file per page from the original file. Those page files contain the classification results. To learn more about the outputs of the classifications, check the documentation page.

Job output

Available Now
You can get started classifying and extracting entities from semi-structured files like PDFs, images, and Word Documents asynchronously and synchronously today from Amazon Comprehend in all the Regions where Amazon Comprehend is available. Learn more about this new launch in the Amazon Comprehend Developer Guide.

Marcia

New — Create Point-to-Point Integrations Between Event Producers and Consumers with Amazon EventBridge Pipes

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/new-create-point-to-point-integrations-between-event-producers-and-consumers-with-amazon-eventbridge-pipes/

It is increasingly common to use multiple cloud services as building blocks to assemble a modern event-driven application. Using purpose-built services to accomplish a particular task ensures developers get the best capabilities for their use case. However, communication between services can be difficult if they use different technologies to communicate, meaning that you need to learn the nuances of each service and how to integrate them with each other. We usually need to create integration code (or “glue” code) to connect and bridge communication between services. Writing glue code slows our velocity, increases the risk of bugs, and means we spend our time writing undifferentiated code rather than building better experiences for our customers.

Introducing Amazon EventBridge Pipes
Today, I’m excited to announce Amazon EventBridge Pipes, a new feature of Amazon EventBridge that makes it easier for you to build event-driven applications by providing a simple, consistent, and cost-effective way to create point-to-point integrations between event producers and consumers, removing the need to write undifferentiated glue code.

The simplest pipe consists of a source and a target. An optional filtering step allows only specific source events to flow into the Pipe and an optional enrichment step using AWS Lambda, AWS Step Functions, Amazon EventBridge API Destinations, or Amazon API Gateway enriches or transforms events before they reach the target. With Amazon EventBridge Pipes, you can integrate supported AWS and self-managed services as event producers and event consumers into your application in a simple, reliable, consistent and cost-effective way.

Amazon EventBridge Pipes bring the most popular features of Amazon EventBridge Event Bus, such as event filtering, integration with more than 14 AWS services, and automatic delivery retries.

How Amazon EventBridge Pipes Works
Amazon EventBridge Pipes provides you a seamless means of integrating supported AWS and self-managed services, favouring configuration over code. To start integrating services with EventBridge Pipes, you need to take the following steps:

  1. Choose a source that is producing your events. Supported sources include: Amazon DynamoDB, Amazon Kinesis Data Streams, Amazon SQS, Amazon Managed Streaming for Apache Kafka, and Amazon MQ (both ActiveMQ and RabbitMQ).
  2. (Optional) Specify an event filter to only process events that match your filter (you’re not charged for events that are filtered out).
  3. (Optional) Transform and enrich your events using built-in free transformations, or AWS Lambda, AWS Step Functions, Amazon API Gateway, or EventBridge API Destinations to perform more advanced transformations and enrichments.
  4. Choose a target destination from more than 14 AWS services, including Amazon Step Functions, Kinesis Data Streams, AWS Lambda, and third-party APIs using EventBridge API destinations.

Amazon EventBridge Pipes provides simplicity to accelerate development velocity by reducing the time needed to learn the services and write integration code, to get reliable and consistent integration.

EventBridge Pipes also comes with additional features that can help in building event-driven applications. For example, with event filtering, Pipes helps event-driven applications become more cost-effective by only processing the events of interest.

Get Started with Amazon EventBridge Pipes
Let’s see how to get started with Amazon EventBridge Pipes. In this post, I will show how to integrate an Amazon SQS queue with AWS Step Functions using Amazon EventBridge Pipes.

The following screenshot is my existing Amazon SQS queue and AWS Step Functions state machine. In my case, I need to run the state machine for every event in the queue. To do so, I need to connect my SQS queue and Step Functions state machine with EventBridge Pipes.

Existing Amazon SQS queue and AWS Step Functions state machine

First, I open the Amazon EventBridge console. In the navigation section, I select Pipes. Then I select Create pipe.

On this page, I can start configuring a pipe and set the AWS Identity and Access Management (IAM) permission, and I can navigate to the Pipe settings tab.

Navigate to Pipe Settings

In the Permissions section, I can define a new IAM role for this pipe or use an existing role. To improve developer experience, the EventBridge Pipes console will figure out the IAM role for me, so I don’t need to manually configure required permissions and let EventBridge Pipes configures least-privilege permissions for IAM role. Since this is my first time creating a pipe, I select Create a new role for this specific resource.

Setting IAM Permission for pipe

Then, I go back to the Build pipe section. On this page, I can see the available event sources supported by EventBridge Pipes.

List of available services as the event source

I select SQS and select my existing SQS queue. If I need to do batch processing, I can select Additional settings to start defining Batch size and Batch window. Then, I select Next.

Select SQS Queue as event source

On the next page, things get even more interesting because I can define Event filtering from the event source that I just selected. This step is optional, but the event filtering feature makes it easy for me to process events that only need to be processed by my event-driven application. In addition, this event filtering feature also helps me to be more cost-effective, as this pipe won’t process unnecessary events. For example, if I use Step Functions as the target, the event filtering will only execute events that match the filter.

Event filtering in Amazon EventBridge Pipes

I can use sample events from AWS events or define custom events. For example, I want to process events for returned purchased items with a value of 100 or more. The following is the sample event in JSON format:

{
   "event-type":"RETURN_PURCHASE",
   "value":100
}

Then, in the event pattern section, I can define the pattern by referring to the Content filtering in Amazon EventBridge event patterns documentation. I define the event pattern as follows:

{
   "event-type": ["RETURN_PURCHASE"],
   "value": [{
      "numeric": [">=", 100]
   }]
}

I can also test by selecting test pattern to make sure this event pattern will match the custom event I’m going to use. Once I’m confident that this is the event pattern that I want, I select Next.

Defining and testing an event pattern for filtering

In the next optional step, I can use an Enrichment that will augment, transform, or expand the event before sending the event to the target destination. This enrichment is useful when I need to enrich the event using an existing AWS Lambda function, or external SaaS API using the Destination API. Additionally, I can shape the event using the Enrichment Input Transformer.

The final step is to define a target for processing the events delivered by this pipe.

Defining target destination service

Here, I can select various AWS services supported by EventBridge Pipes.

I select my existing AWS Step Functions state machine, named pipes-statemachine.

In addition, I can also use Target Input Transformer by referring to the Transforming Amazon EventBridge target input documentation. For my case, I need to define a high priority for events going into this target. To do that, I define a sample custom event in Sample events/Event Payload and add the priority: HIGH in the Transformer section. Then in the Output section, I can see the final event to be passed to the target destination service. Then, I select Create pipe.

In less than a minute, my pipe was successfully created.

Pipe successfully created

To test this pipe, I need to put an event into the Amaon SQS queue.

Sending a message into Amazon SQS Queue

To check if my event is successfully processed by Step Functions, I can look into my state machine in Step Functions. On this page, I see my event is successfully processed.

I can also go to Amazon CloudWatch Logs to get more detailed logs.

Things to Know
Event Sources
– At launch, Amazon EventBridge Pipes supports the following services as event sources: Amazon DynamoDB, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (Amazon MSK) alongside self-managed Apache Kafka, Amazon SQS (standard and FIFO), and Amazon MQ (both for ActiveMQ and RabbitMQ).

Event Targets – Amazon EventBridge Pipes supports 15 Amazon EventBridge targets, including AWS Lambda, Amazon API Gateway, Amazon SNS, Amazon SQS, and AWS Step Functions. To deliver events to any HTTPS endpoint, developers can use API destinations as the target.

Event Ordering – EventBridge Pipes maintains the ordering of events received from an event sources that support ordering when sending those events to a destination service.

Programmatic Access – You can also interact with Amazon EventBridge Pipes and create a pipe using AWS Command Line Interface (CLI), AWS CloudFormation, and AWS Cloud Development Kit (AWS CDK).

Independent Usage – EventBridge Pipes can be used separately from Amazon EventBridge bus and Amazon EventBridge Scheduler. This flexibility helps developers to define source events from supported AWS and self-managed services as event sources without Amazon EventBridge Event Bus.

Availability – Amazon EventBridge Pipes is now generally available in all AWS commercial Regions, with the exception of Asia Pacific (Hyderabad) and Europe (Zurich).

Visit the Amazon EventBridge Pipes page to learn more about this feature and understand the pricing. You can also visit the documentation page to learn more about how to get started.

Happy building!

— Donnie

Step Functions Distributed Map – A Serverless Solution for Large-Scale Parallel Data Processing

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/step-functions-distributed-map-a-serverless-solution-for-large-scale-parallel-data-processing/

I am excited to announce the availability of a distributed map for AWS Step Functions. This flow extends support for orchestrating large-scale parallel workloads such as the on-demand processing of semi-structured data.

Step Function’s map state executes the same processing steps for multiple entries in a dataset. The existing map state is limited to 40 parallel iterations at a time. This limit makes it challenging to scale data processing workloads to process thousands of items (or even more) in parallel. In order to achieve higher parallel processing prior to today, you had to implement complex workarounds to the existing map state component.

The new distributed map state allows you to write Step Functions to coordinate large-scale parallel workloads within your serverless applications. You can now iterate over millions of objects such as logs, images, or .csv files stored in Amazon Simple Storage Service (Amazon S3). The new distributed map state can launch up to ten thousand parallel workflows to process data.

You can process data by composing any service API supported by Step Functions, but typically, you will invoke Lambda functions to process the data with code written in your favorite programming language.

Step Functions distributed map supports a maximum concurrency of up to 10,000 executions in parallel, which is well above the concurrency supported by many other AWS services. You can use the maximum concurrency feature of the distributed map to ensure that you do not exceed the concurrency of a downstream service. There are two factors to consider when working with other services. First, the maximum concurrency supported by the service for your account. Second, the burst and ramping rates, which determine how quickly you can achieve the maximum concurrency.

Let’s use Lambda as an example. Your functions’ concurrency is the number of instances that serve requests at a given time. The default maximum concurrency quota for Lambda is 1,000 per AWS Region. You can ask for an increase at any time. For an initial burst of traffic, your functions’ cumulative concurrency in a Region can reach an initial level of between 500 and 3000, which varies per Region. The burst concurrency quota applies to all your functions in the Region.

When using a distributed map, be sure to verify the quota on downstream services. Limit the distributed map maximum concurrency during your development, and plan for service quota increases accordingly.

To compare the new distributed map with the original map state flow, I created this table.

Original map state flow New distributed map flow
Sub workflows
  • Runs a sub-workflow for each item in an array. The array must be passed from the previous state.
  • Each iteration of the sub-workflow is called a map iteration, and its events are added to the state machine’s execution history.
  • Runs a sub-workflow for each item in an array or Amazon S3 dataset.
  • Each sub-workflow is run as a totally separate child execution, with its own event history.
Parallel branches Map iterations run in parallel, with an effective maximum concurrency of around 40 at a time. Can pass millions of items to multiple child executions, with concurrency of up to 10,000 executions at a time.
Input source Accepts only a JSON array as input. Accepts input as Amazon S3 object list, JSON arrays or files, csv files, or Amazon S3 inventory.
Payload 256 KB Each iteration receives a reference to a file (Amazon S3) or a single record from a file (state input). Actual file processing capability is limited by Lambda storage and memory.
Execution history 25,000 events Each iteration of the map state is a child execution, with up to 25,000 events each (express mode has no limit on execution history).

Sub-workflows within a distributed map work with both Standard workflows and the low-latency, short-duration Express Workflows.

This new capability is optimized to work with S3. I can configure the bucket and prefix where my data are stored directly from the distributed map configuration. The distributed map stops reading after 100 million items and supports JSON or csv files of up to 10GB.

When processing large files, think about downstream service capabilities. Let’s take Lambda again as an example. Each input—a file on S3, for example—must fit within the Lambda function execution environment in terms of temporary storage and memory. To make it easier to handle large files, Lambda Powertools for Python introduced a new streaming feature to fetch, transform, and process S3 objects with minimal memory footprint. This allows your Lambda functions to handle files larger than the size of their execution environment. To learn more about this new capability, check the Lambda Powertools documentation.

Let’s See It in Action
For this demo, I will create a workflow that processes one thousand dog images stored on S3. The images are already stored on S3.

➜  ~ aws s3 ls awsnewsblog-distributed-map/images/
2022-11-08 15:03:36      27034 n02085620_10074.jpg
2022-11-08 15:03:36      34458 n02085620_10131.jpg
2022-11-08 15:03:36      12883 n02085620_10621.jpg
2022-11-08 15:03:36      34910 n02085620_1073.jpg
...

➜  ~ aws s3 ls awsnewsblog-distributed-map/images/ | wc -l
    1000

The workflow and the S3 bucket must be in the same Region.

To get started, I navigate to the Step Functions page of the AWS Management Console and select Create state machine. On the next page, I choose to design my workflow using the visual editor. The distributed map works with Standard workflows, and I keep the default selection as-is. I select Next to enter the visual editor.

Distributed Map - create a workflowIn the visual editor, I search and select the Map component on the left-side pane, and I drag it to the workflow area. On the right side, I configure the component. I choose Distributed as Processing mode and Amazon S3 as Item Source.

Distributed maps are natively integrated with S3. I enter the name of the bucket (awsnewsblog-distributed-map) and the prefix (images) where my images are stored.

On the Runtime Settings section, I choose Express for Child workflow type. I also may decide to restrict the Concurrency limit. It helps to ensure we operate within the concurrency quotas of the downstream services (Lambda in this demo) for a particular account or Region.

By default, the output of my sub-workflows will be aggregated as state output, up to 256KB. To process larger outputs, I may choose to Export map state results to Amazon S3.

Distributed Map - add a Lambda invocation

Finally, I define what to do for each file. In this demo, I want to invoke a Lambda function for each file in the S3 bucket. The function exists already. I search for and select the Lambda invocation action on the left-side pane. I drag it to the distributed map component. Then, I use the right-side configuration panel to select the actual Lambda function to invoke: AWSNewsBlogDistributedMap in this example.

Distributed Map - add a Lambda invocation

When I am done, I select Next. I select Next again on the Review generated code page (not shown here).

On the Specify state machine settings page, I enter a Name for my state machine and the IAM Permissions to run. Then, I select Create state machine.

Create State Machine - Final ScreenNow I am ready to start the execution. On the State machine page, I select the new workflow and select Start execution. I can optionally enter a JSON document to pass to the workflow. In this demo, the workflow does not handle the input data. I leave it as-is, and I select Start execution.

Start workflow execution Start workflow execution - pass input data

During the execution of the workflow, I can monitor the progress. I observe the number of iterations, and the number of items successfully processed or in error.

I can drill down on one specific execution to see the details.

Distributed Map - monitor execution details

With just a few clicks, I created a large-scale and heavily parallel workflow able to handle a very large quantity of data.

Which AWS Service Should I Use
As often happens on AWS, you might observe an overlap between this new capability and existing services such as AWS Glue, Amazon EMR, or Amazon S3 Batch Operations. Let’s try to differentiate the use cases.

In my mental model, data scientists and data engineers use AWS Glue and EMR to process large amounts of data. On the other hand, application developers will use Step Functions to add serverless data processing into their applications. Step Functions is able to scale from zero quickly, which makes it a good fit for interactive workloads where customers may be waiting for the results. Finally, system administrators and IT operation teams are likely to use Amazon S3 Batch Operations for single-step IT automation operations such as copying, tagging, or changing permissions on billions of S3 objects.

Pricing and Availability
AWS Step Functions’ distributed map is generally available in the following ten AWS Regions: US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Singapore, Sydney, Tokyo), Canada (Central), and Europe (Frankfurt, Ireland, Stockholm).

The pricing model for the existing inline map state does not change. For the new distributed map state, we charge one state transition per iteration. Pricing varies between Regions, and it starts at $0.025 per 1,000 state transitions. When you process your data using express workflows, you are also charged based on the number of requests for your workflow and its duration. Again, prices vary between Regions, but they start at $1.00 per 1 million requests and $0.06 per GB-hour (prorated to 100ms).

For the same amount of iterations, you will observe a cost reduction when using the combination of the distributed map and standard workflows compared to the existing inline map. When you use express workflows, expect the costs to stay the same for more value with the distributed map.

I am really excited to discover what you will build using this new capability and how it will unlock innovation. Go start to build highly parallel serverless data processing workflows today!

— seb

AWS Marketplace Vendor Insights – Simplify Third-Party Software Risk Assessments

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-marketplace-vendor-insights-simplify-third-party-software-risk-assessments/

AWS Marketplace Vendor Insights is a new capability of AWS Marketplace. It simplifies third-party software risk assessments when procuring solutions from the AWS Marketplace.

It helps you to ensure that the third-party software continuously meets your industry standards by compiling security and compliance information, such as data privacy and residency, application security, and access control, in one consolidated dashboard.

As a security engineer, you may now complete third-party software risk assessment in a few days instead of months. You can now:

  • Quickly discover products in AWS Marketplace that meet your security and certification standards by searching for and accessing Vendor Insights profiles.
  • Access and download current and validated information, with evidence gathered from the vendors’ security tools and audit reports. Reports are available for download on AWS Artifact third-party reports (now available in preview).
  • Monitor your software’s security posture post-procurement and receive notifications for security and compliance events.

As a software vendor, you can now reduce the operational burden of responding to buyer requests for risk assessment information. It gives your customers a self-service access experience. You can now:

  • Build your product’s security profile by uploading your ISO 27001 or SOC2 Type 2 report and completing a software risk assessment with AWS Audit Manager.
  • Store and share your compliance reports such as ISO 27001 and SOC2 Type 2, using AWS Artifact third-party reports (preview).
  • View and approve your buyer requests for viewing security controls and compliance artifacts stored in Vendor Insights.

Let’s See It in Action
I want to procure a solution on the AWS Marketplace. But before purchasing the product, as a security engineer, I want to review its compliance. I navigate to the AWS Marketplace page of the AWS Management Console. I use the faceted search on the left side to select vendors that are ISO 27001 compliant.

AWS MArketplace vendor insights - faceted searchI select a product. On the Product Overview page, I select View assessment data on the top right side (not shown on the screenshot). Then, I can see the overview page, which shows the Security certification received and the Expiration date.

AWS MArketplace vendor insights - certification receivedI select the Security and compliance tab and see that I need to request access to see the detailed security and compliance information. I select the Request access button on the top right side to ask the vendor for access to their compliance documents.

AWS MArketplace vendor insights - request access part 1

On the next page, I fill in the Your information form with my details, and I select Request access.

AWS MArketplace vendor insights - request access part 2The Next Steps section details what will happen next. The seller will contact me to sign a nondisclosure agreement (NDA). The seller will notify AWS Marketplace when the NDA is signed. Then, I will be granted access to Vendor Insights data.

The process can take a few days. For this demo, I switch to a fictional product—Everest—for which I have access to the compliance data. Here is the Security and compliance tab when my request for access is accepted.

The Summary section shows how many controls are available. It reports how many have been validated with evidence and how many have been self-reported by the seller. It also shows how many noncompliant controls are reported.

I can scroll down the page to see the details for multiple categories: Audit, compliance and security policy, Data security, Access management, Application security, Risk management and incident response, Business resiliency and continuity, End user device security, Infrastructure security, Human resources, and Security and configuration policy. The screenshot does not show all of them.

AWS Marketplace vendor insights - security and complianceI select the detail for Access control and see the list under Control name. For each of them, I can see the compliance for SOC2 Type 2, ISO 27001, and the Vendor self-assessment.

AWS Marketplace vendor insights - access controlI select the noncompliant one to get the details and the explanation the vendor provided.

AWS Marketplace vendor insights - non compliant details

If needed, I might also use AWS Artifact third-party reports (preview) to download the compliance reports.

For Software Vendors
As a software vendor, you can create a security profile for your SaaS products on AWS Marketplace and share this profile with your prospective and existing buyers. It helps you to reduce the manual work for engineering and security teams to respond to your customer questionnaires.

To create a security profile, you will need to complete a self-assessment using AWS Audit Manager on your marketplace management AWS account, share the current SOC2 Type II and ISO27001 compliance artifacts, if available, and turn on automated assessment using Audit Manager and AWS Config on your production AWS accounts.

Our team has created an AWS CloudFormation template to automate the onboarding steps. You can find the technical resources, such as the setup guide and the onboarding templates, on our GitHub repository. Once the profile is created, Vendor Insights will keep your security profile up to date by using automated evidence from Audit Manager and AWS Config. The updates to your profile are sent as notifications. Your security and compliance team can review the updates before they are shared with buyers.

With Vendor Insights, you manage access to your product’s security profile by approving the buyer’s subscription requests. When a buyer requests access, Vendor Insights shares their contact information over email to your compliance or deal-desk operations team. They can complete the NDA with the buyer and notify AWS Marketplace to grant the buyer access to your security profile. You can also request AWS Marketplace to revoke the buyer’s subscription on a later day if you don’t want to share your product’s security and compliance posture information with the buyer anymore.

The entire process is documented in the AWS Marketplace Vendor Insights seller guide.

Pricing and Availability
Vendor Insights is now available in all AWS Regions where AWS Marketplace is available.

The pricing model is very simple; there is no charge involved for using AWS Marketplace Vendor Insights.

For buyers, you can access and download assets during your procurement phase. You lose access to the Vendor Insights profile if you have not purchased the product after 60 days. When you purchase the product, you keep access to the product’s security profile for continuous monitoring of its compliance status.

For sellers, AWS Marketplace doesn’t charge to activate and use Vendor Insights. You will incur fees for using Audit Manager and AWS Config.

Go and start your risk assessments on the AWS Marketplace today.

— seb

Introducing the Cloud Shuffle Storage Plugin for Apache Spark

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. In AWS Glue, you can use Apache Spark, an open-source, distributed processing system for your data integration tasks and big data workloads.

Apache Spark utilizes in-memory caching and optimized query execution for fast analytic queries against your datasets, which are split into multiple Spark partitions on different nodes so that you can process a large amount of data in parallel. In Apache Spark, shuffling happens when data needs to be redistributed across the cluster. During a shuffle, data is written to local disk and transferred across the network. The shuffle operation is often constrained by the available local disk capacity, or data skew, which can cause straggling executors. Spark often throws a No space left on device or MetadataFetchFailedException error when there isn’t enough disk space left on the executor and there is no recovery. Such Spark jobs can’t typically succeed without adding additional compute and attached storage, wherein compute is often idle, and results in additional cost.

In 2021, we launched Amazon S3 shuffle for AWS Glue 2.0 with Spark 2.4. This feature disaggregated Spark compute and shuffle storage by utilizing Amazon Simple Storage Service (Amazon S3) to store Spark shuffle files. Using Amazon S3 for Spark shuffle storage enabled you to run data-intensive workloads more reliably. After the launch, we continued investing in this area, and collected customer feedback.

Today, we’re pleased to release Cloud Shuffle Storage Plugin for Apache Spark. It supports the latest Apache Spark 3.x distribution so you can take advantage of the plugin in AWS Glue or any other Spark environments. It’s now also natively available to use in AWS Glue Spark jobs on AWS Glue 3.0 and the latest AWS Glue version 4.0 without requiring any extra setup or bringing external libraries. Like the Amazon S3 shuffle for AWS Glue 2.0, the Cloud Shuffle Storage Plugin helps you solve constrained disk space errors during shuffle in serverless Spark environments.

We’re also excited to announce the release of software binaries for the Cloud Shuffle Storage Plugin for Apache Spark under the Apache 2.0 license. You can download the binaries and run them on any Spark environment. The new plugin is open-cloud, comes with out-of-the box support for Amazon S3, and can be easily configured to use other forms of cloud storage such as Google Cloud Storage and Microsoft Azure Blob Storage.

Understanding a shuffle operation in Apache Spark

In Apache Spark, there are two types of transformations:

  • Narrow transformation – This includes map, filter, union, and mapPartition, where each input partition contributes to only one output partition.
  • Wide transformation – This includes join, groupBykey, reduceByKey, and repartition, where each input partition contributes to many output partitions. Spark SQL queries including JOIN, ORDER BY, GROUP BY require wide transformations.

A wide transformation triggers a shuffle, which occurs whenever data is reorganized into new partitions with each key assigned to one of them. During a shuffle phase, all Spark map tasks write shuffle data to a local disk that is then transferred across the network and fetched by Spark reduce tasks. The volume of data shuffled is visible in the Spark UI. When shuffle writes take up more space than the local available disk capacity, it causes a No space left on device error.

To illustrate one of the typical scenarios, let’s use the query q80.sql from the standard TPC-DS 3 TB dataset as an example. This query attempts to calculate the total sales, returns, and eventual profit realized during a specific time frame. It involves multiple wide transformations (shuffles) caused by left outer join and group by.

Let’s run the following query on AWS Glue 3.0 job with 10 G1.X workers where a total of 640GB of local disk space is available:

with ssr as
 (select  s_store_id as store_id,
          sum(ss_ext_sales_price) as sales,
          sum(coalesce(sr_return_amt, 0)) as returns,
          sum(ss_net_profit - coalesce(sr_net_loss, 0)) as profit
  from store_sales left outer join store_returns on
         (ss_item_sk = sr_item_sk and ss_ticket_number = sr_ticket_number),
     date_dim, store, item, promotion
 where ss_sold_date_sk = d_date_sk
       and d_date between cast('2000-08-23' as date)
                  and (cast('2000-08-23' as date) + interval '30' day)
       and ss_store_sk = s_store_sk
       and ss_item_sk = i_item_sk
       and i_current_price > 50
       and ss_promo_sk = p_promo_sk
       and p_channel_tv = 'N'
 group by s_store_id),
 csr as
 (select  cp_catalog_page_id as catalog_page_id,
          sum(cs_ext_sales_price) as sales,
          sum(coalesce(cr_return_amount, 0)) as returns,
          sum(cs_net_profit - coalesce(cr_net_loss, 0)) as profit
  from catalog_sales left outer join catalog_returns on
         (cs_item_sk = cr_item_sk and cs_order_number = cr_order_number),
     date_dim, catalog_page, item, promotion
 where cs_sold_date_sk = d_date_sk
       and d_date between cast('2000-08-23' as date)
                  and (cast('2000-08-23' as date) + interval '30' day)
        and cs_catalog_page_sk = cp_catalog_page_sk
       and cs_item_sk = i_item_sk
       and i_current_price > 50
       and cs_promo_sk = p_promo_sk
       and p_channel_tv = 'N'
 group by cp_catalog_page_id),
 wsr as
 (select  web_site_id,
          sum(ws_ext_sales_price) as sales,
          sum(coalesce(wr_return_amt, 0)) as returns,
          sum(ws_net_profit - coalesce(wr_net_loss, 0)) as profit
  from web_sales left outer join web_returns on
         (ws_item_sk = wr_item_sk and ws_order_number = wr_order_number),
     date_dim, web_site, item, promotion
 where ws_sold_date_sk = d_date_sk
       and d_date between cast('2000-08-23' as date)
                  and (cast('2000-08-23' as date) + interval '30' day)
        and ws_web_site_sk = web_site_sk
       and ws_item_sk = i_item_sk
       and i_current_price > 50
       and ws_promo_sk = p_promo_sk
       and p_channel_tv = 'N'
 group by web_site_id)
 select channel, id, sum(sales) as sales, sum(returns) as returns, sum(profit) as profit
 from (select
        'store channel' as channel, concat('store', store_id) as id, sales, returns, profit
      from ssr
      union all
      select
        'catalog channel' as channel, concat('catalog_page', catalog_page_id) as id,
        sales, returns, profit
      from csr
      union all
      select
        'web channel' as channel, concat('web_site', web_site_id) as id, sales, returns, profit
      from  wsr) x
 group by rollup (channel, id)
 order by channel, id

The following screenshot shows the Executor tab in the Spark UI.
Spark UI Executor Tab

The following screenshot shows the status of Spark jobs included in the AWS Glue job run.
Spark UI Jobs
In the failed Spark job (job ID=7), we can see the failed Spark stage in the Spark UI.
Spark UI Failed stage
There was 167.8GiB shuffle write during the stage, and 14 tasks failed due to the error java.io.IOException: No space left on device because the host 172.34.97.212 ran out of local disk.
Spark UI Tasks

Cloud Shuffle Storage for Apache Spark

Cloud Shuffle Storage for Apache Spark allows you to store Spark shuffle files on Amazon S3 or other cloud storage services. This gives complete elasticity to Spark jobs, thereby allowing you to run your most data intensive workloads reliably. The following figure illustrates how Spark map tasks write the shuffle files to the Cloud Shuffle Storage. Reducer tasks consider the shuffle blocks as remote blocks and read them from the same shuffle storage.

This architecture enables your serverless Spark jobs to use Amazon S3 without the overhead of running, operating, and maintaining additional storage or compute nodes.
Chopper diagram
The following Glue job parameters enable and tune Spark to use S3 buckets for storing shuffle data. You can also enable at-rest encryption when writing shuffle data to Amazon S3 by using security configuration settings.

Key Value Explanation
--write-shuffle-files-to-s3 TRUE This is the main flag, which tells Spark to use S3 buckets for writing and reading shuffle data.
--conf spark.shuffle.storage.path=s3://<shuffle-bucket> This is optional, and specifies the S3 bucket where the plugin writes the shuffle files. By default, we use –TempDir/shuffle-data.

The shuffle files are written to the location and create files such as following:

s3://<shuffle-storage-path>/<Spark application ID>/[0-9]/<Shuffle ID>/shuffle_<Shuffle ID>_<Mapper ID>_0.data

With the Cloud Shuffle Storage plugin enabled and using the same AWS Glue job setup, the TPC-DS query now succeeded without any job or stage failures.
Spark UI Jobs with Chopper plugin

Software binaries for the Cloud Shuffle Storage Plugin

You can now also download and use the plugin in your own Spark environments and with other cloud storage services. The plugin binaries are available for use under the Apache 2.0 license.

Bundle the plugin with your Spark applications

You can bundle the plugin with your Spark applications by adding it as a dependency in your Maven pom.xml as you develop your Spark applications, as shown in the follwoing code. For more details on the plugin and Spark versions, refer to Plugin versions.

<repositories>
   ...
    <repository>
        <id>aws-glue-etl-artifacts</id>
        <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
    </repository>
</repositories>
...
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>chopper-plugin</artifactId>
    <version>3.1-amzn-LATEST</version>
</dependency>

You can alternatively download the binaries from AWS Glue Maven artifacts directly and include them in your Spark application as follows:

#!/bin/bash
sudo wget -v https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/chopper-plugin/3.1-amzn-LATEST/chopper-plugin-3.1-amzn-LATEST.jar -P /usr/lib/spark/jars

Submit the Spark application by including the JAR files on the classpath and specifying the two Spark configs for the plugin:

spark-submit --deploy-mode cluster \
--conf spark.shuffle.sort.io.plugin.class=com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin \
--conf spark.shuffle.storage.path=s3://<s3 bucket>/<shuffle-dir> \
 --class <your class> <your application jar> 

The following Spark parameters enable and configure Spark to use an external storage URI such as Amazon S3 for storing shuffle files; the URI protocol determines which storage system to use.

Key Value Explanation
spark.shuffle.storage.path s3://<shuffle-storage-path> It specifies an URI where the shuffle files are stored, which much be a valid Hadoop FileSystem and be configured as needed
spark.shuffle.sort.io.plugin.class com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin The entry class in the plugin

Other cloud storage integration

This plugin comes with out-of-the box support for Amazon S3 and can also be configured to use other forms of cloud storage such as Google Cloud Storage and Microsoft Azure Blob Storage. To enable other Hadoop FileSystem compatible cloud storage services, you can simply add a storage URI for the corresponding service scheme, such as gs:// for Google Cloud Storage instead of s3:// for Amazon S3, add the FileSystem JAR files for the service, and set the appropriate authentication configurations.

For more information about how to integrate the plugin with Google Cloud Storage and Microsoft Azure Blob Storage, refer to Using AWS Glue Cloud Shuffle Plugin for Apache Spark with Other Cloud Storage Services.

Best practices and considerations

Note the following considerations:

  • This feature replaces local shuffle storage with Amazon S3. You can use it to address common failures with price/performance benefits for your serverless analytics jobs and pipelines. We recommend enabling this feature when you want to ensure reliable runs of your data-intensive workloads that create a large amount of shuffle data or when you’re getting No space left on device error. You can also use this plugin if your job encounters fetch failures org.apache.spark.shuffle.MetadataFetchFailedException or if your data is skewed.
  • We recommend setting S3 bucket lifecycle policies on the shuffle bucket (spark.shuffle.storage.s3.path) in order to clean up old shuffle data automatically.
  • The shuffle data on Amazon S3 is encrypted by default. You can also encrypt the data with your own AWS Key Management Service (AWS KMS) keys.

Conclusion

This post introduced the new Cloud Shuffle Storage Plugin for Apache Spark and described its benefits to independently scale storage in your Spark jobs without adding additional workers. With this plugin, you can expect jobs processing terabytes of data to run much more reliably.

The plugin is available in AWS Glue 3.0 and 4.0 Spark jobs in all AWS Glue supported Regions. We’re also releasing the plugin’s software binaries under the Apache 2.0 license. You can use the plugin in AWS Glue or other Spark environments. We look forward to hearing your feedback.


About the Authors

Noritaka Sekiyama s a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts that help customers build data lakes on the cloud.

Rajendra Gujja is a Senior Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and everything and anything about the data.

Chuhan Liu is a Software Development Engineer on the AWS Glue team.

Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems to enable customers with data integration and connectivity to a variety of sources, efficiently manage data lakes on Amazon S3, and optimizes Apache Spark for fault-tolerance with ETL workloads.

New for Amazon SageMaker – Perform Shadow Tests to Compare Inference Performance Between ML Model Variants

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/new-for-amazon-sagemaker-perform-shadow-tests-to-compare-inference-performance-between-ml-model-variants/

As you move your machine learning (ML) workloads into production, you need to continuously monitor your deployed models and iterate when you observe a deviation in your model performance. When you build a new model, you typically start validating the model offline using historical inference request data. But this data sometimes fails to account for current, real-world conditions. For example, new products might become trending that your product recommendation model hasn’t seen yet. Or, you experience a sudden spike in the volume of inference requests in production that you never exposed your model to before.

Today, I’m excited to announce Amazon SageMaker support for shadow testing!

Deploying a model in shadow mode lets you conduct a more holistic test by routing a copy of the live inference requests for a production model to the new (shadow) model. Yet, only the responses from the production model are returned to the calling application. Shadow testing helps you build further confidence in your model and catch potential configuration errors and performance issues before they impact end users. Once you complete a shadow test, you can use the deployment guardrails for SageMaker inference endpoints to safely update your model in production.

Get Started with Amazon SageMaker Shadow Testing
You can create shadow tests using the new SageMaker Inference Console and APIs. Shadow testing gives you a fully managed experience for setup, monitoring, viewing, and acting on the results of shadow tests. If you have existing workflows built around SageMaker endpoints, you can also deploy a model in shadow mode using the existing SageMaker Inference APIs.

On the SageMaker console, select Inference and Shadow tests to create, monitor, and deploy shadow tests.

Amazon SageMaker Shadow Tests

To create a shadow test, select an existing (or create a new) SageMaker endpoint and production variant you want to test against.

Amazon SageMaker - Create Shadow Test

Next, configure the proportion of traffic to send to the shadow variant, the comparison metrics you want to evaluate, and the duration of the test. You can also enable data capture for your production and shadow variant.

Amazon SagMaker - Create Shadow Test

That’s it. SageMaker now automatically deploys the new variant in shadow mode and routes a copy of the inference requests to it in real time, all within the same endpoint. The following diagram illustrates this workflow.

Amazon SageMaker - Shadow Testing

Note that only the responses of the production variant are returned to the calling application. You can choose to either discard or log the responses of the shadow variant for offline comparison.

You can also use shadow testing to validate changes you made to any component in your production variant, including the serving container or ML instance. This can be useful when you’re upgrading to a new framework version of your serving container, applying patches, or if you want to make sure that there is no impact to latency or error rate due to this change. Similarly, if you consider moving to another ML instance type, for example, Amazon EC2 C7g instances based on AWS Graviton processors, or EC2 G5 instances powered by NVIDIA A10G Tensor Core GPUs, you can use shadow testing to evaluate the performance on production traffic prior to rollout.

You can monitor the progress of the shadow test and performance metrics such as latency and error rate through a live dashboard. On the SageMaker console, select Inference and Shadow tests, then select the shadow test you want to monitor.

Amazon SageMaker - Monitor Shadow Test

Amazon SageMaker - Monitor Shadow Test

If you decide to promote the shadow model to production, select Deploy shadow variant and define the infrastructure configuration to deploy the shadow variant.

Amazon SageMaker - Deploy Shadow Variant

Amazon SageMaker - Deploy Shadow Variant

You can also use the SageMaker deployment guardrails if you want to add linear or canary traffic shifting modes and auto rollbacks to your update.

Availability and Pricing
SageMaker support for shadow testing is available today in all AWS Regions where SageMaker hosting is available except for the AWS GovCloud (US) Regions and AWS China Regions.

There is no additional charge for SageMaker shadow testing other than usage charges for the ML instances and ML storage provisioned to host the shadow variant. The pricing for ML instances and ML storage dimensions is the same as the real-time inference option. There is no additional charge for data processed in and out of shadow deployments. The SageMaker pricing page has all the details.

To learn more, visit Amazon SageMaker shadow testing.

Start validating your new ML models with SageMaker shadow tests today!

— Antje

Next Generation SageMaker Notebooks – Now with Built-in Data Preparation, Real-Time Collaboration, and Notebook Automation

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/next-generation-sagemaker-notebooks-now-with-built-in-data-preparation-real-time-collaboration-and-notebook-automation/

In 2019, we introduced Amazon SageMaker Studio, the first fully integrated development environment (IDE) for data science and machine learning (ML). SageMaker Studio gives you access to fully managed Jupyter Notebooks that integrate with purpose-built tools to perform all ML steps, from preparing data to training and debugging models, tracking experiments, deploying and monitoring models, and managing pipelines.

Today, I’m excited to announce the next generation of Amazon SageMaker Notebooks to increase efficiency across the ML development workflow. You can now improve data quality in minutes with the built-in data preparation capability, edit the same notebooks with your teams in real time, and automatically convert notebook code to production-ready jobs.

Let me show you what’s new!

New Notebook Capability for Simplified Data Preparation
The new built-in data preparation capability is powered by Amazon SageMaker Data Wrangler and is available in SageMaker Studio notebooks.  SageMaker Studio notebooks automatically generate key visualizations on top of Pandas data frames to help you understand data distribution and identify data quality issues, like missing values, invalid data, and outliers. You can also select the target column for ML models and generate ML-specific insights such as imbalanced class or high correlation columns. You then receive recommendations for data transformations to resolve the issues. You can apply the data transformations right in the UI, and SageMaker Studio notebooks automatically generate the corresponding transformation code in the notebook cells that you can use to replay your data preparation pipeline.

Using the Built-in Data Preparation Capability
To get started, pip install and import sagemaker_datawrangler along with the pandas Python package. Then, download the dataset you want to analyze to the notebook working directory, and read the dataset with pandas.

import pandas as pd 
import sagemaker_datawrangler 

!aws s3 cp s3://<YOUR_S3_BUCKET>/data.csv . 

df = pd.read_csv("data.csv")

Now, when you display the data frame, it automatically shows key data visualizations at the top of each column, surfaces data insights, detects data quality issues, and suggests solutions to improve data quality. When you select a column as the target column for ML predictions, you get target-specific insights and warnings, such as mixed data types in target (for regression use cases) or too few instances per class (for classification use cases).

In this example, I’m using the Women’s E-Commerce Clothing Reviews dataset that contains customer reviews and ratings for women’s clothing. This dataset was obtained from Kaggle and has been modified by Amazon to add synthetic data quality issues.

Amazon SageMaker Studio notebooks with built-in data preparation

You can review the suggested data transformations to improve the data quality and apply them right in the UI. For a list of all supported data transformations, have a look at the documentation. Once you apply a data transformation, SageMaker Studio notebooks automatically generate the code to reproduce those data preparation steps in another notebook cell.

For my example, I select Rating as my target column. Target column insights tells me in a high-priority warning that this column has too few instances per class and with a medium-priority warning that classes are too imbalanced. Let’s follow the suggestions and drop rare target values and drop missing values. I will also follow the suggestions for some of the feature columns and drop missing values in the Review Text column and drop the Division Name column.

Once I apply the transformations, the notebook generates this code for me:

# Pandas code generated by sagemaker_datawrangler
output_df = df.copy(deep=True)


# Code to Drop rare target values for column: Rating to resolve warning: Too few instances per class 
rare_target_labels_to_drop = ['-100', '100']
output_df = output_df[~output_df['Rating'].isin(rare_target_labels_to_drop)]


# Code to Drop missing for column: Rating to resolve warning: Missing values 
output_df = output_df[output_df['Rating'].notnull()]


# Code to Drop missing for column: Review Text to resolve warning: Missing values 
output_df = output_df[output_df['Review Text'].notnull()]


# Code to Drop column for column: Division Name to resolve warning: Missing values 
output_df=output_df.drop(columns=['Division Name'])

I can now review and modify the code if needed or start integrating the data transformations as part of my ML development workflow.

Introducing Shared Spaces for Team-Based Sharing and Real-Time Collaboration
SageMaker Studio now offers shared spaces that give data science and ML teams a workspace where they can read, edit, and run notebooks together in real time to streamline collaboration and communication during the development process. Shared spaces provide a shared Amazon EFS directory that you can utilize to share files within a shared space. All taggable SageMaker resources that you create in a shared space are automatically tagged to help you organize and have a filtered view of your ML resources, such as training jobs, experiments, and models, that are relevant to the business problem you work on in the space. This also helps you monitor costs and plan budgets using tools such as AWS Budgets and AWS Cost Explorer.

And that’s not all. You can now also create multiple SageMaker domains within the same AWS account to scope access and isolate resources to different teams or business units in your organization. Now, let me show you how to create a shared space for users within a SageMaker domain.

Using Shared Spaces
You can use the SageMaker console or the AWS CLI to create shared spaces for a SageMaker domain. To get started in the SageMaker console, go to Domains, select or create a new domain, and select Space management on the Domain details page. Then, select Create and give the shared space a name.

Amazon SageMaker Spaces - Create Space

Users in this SageMaker domain can now launch and join the shared space through their SageMaker domain user profiles.

Amazon SageMaker Spaces - Launch Spaces

In a shared space, select the new Collaborators icon in the left navigation menu. You can now see who else is currently active in this space. The following screenshot shows user tom on the left, editing a notebook file. On the right, user antje sees the edits in real time, together with an annotation of the user name that currently edits that notebook cell.

Amazon SageMaker Spaces

New Notebook Capability to Automatically Convert Notebook Code to Production-Ready Jobs
You can now select a notebook and automate it as a job that can run in a production environment without the need to manage the underlying infrastructure. When you create a SageMaker Notebook Job, SageMaker Studio takes a snapshot of the entire notebook, packages its dependencies in a container, builds the infrastructure, runs the notebook as an automated job on a schedule you define, and deprovisions the infrastructure upon job completion. This notebook capability is now also available in SageMaker Studio Lab, our free ML development environment that provides the compute, storage, and security to learn and experiment with ML.

Using the Notebook Capability to Automate Notebooks
To get started, open a notebook file in SageMaker Studio. Then, right-click your notebook file and select Create Notebook Job or select the Create Notebook Job icon, as highlighted in the following screenshot.

Amazon SageMaker Studio - Automate your notebooks

Define a name for the Notebook Job, review the input file location, specify the compute type to use, and whether to run the job immediately or on a schedule. Then, select Create.

Amazon SageMaker Studio - Create Notebook Job

The Notebook Job has been created, and you can review all Notebook Job Definitions in the UI.

Amazon SageMaker Studio - Notebook Job Definitions

Now Available
The new Amazon SageMaker Studio notebook capabilities are now available in all AWS Regions where Amazon SageMaker Studio is available except for the AWS China Regions.

At launch, the built-in data preparation capability powered by SageMaker Data Wrangler is supported for SageMaker Studio notebooks and the following notebook kernel images:

  • Python 3 (Data Science) with Python 3.7
  • Python 3 (Data Science 2.0) with Python 3.8
  • Python 3 (Data Science 3.0) with Python 3.10
  • Spark Analytics 1.0 and 2.0

For more information, visit Amazon SageMaker Notebooks.

Start building your ML projects with the next generation of Amazon SageMaker Notebooks today!

— Antje

New – Share ML Models and Notebooks More Easily Within Your Organization with Amazon SageMaker JumpStart

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/new-share-ml-models-and-notebooks-more-easily-within-your-organization-with-amazon-sagemaker-jumpstart/

Amazon SageMaker JumpStart is a machine learning (ML) hub that can help you accelerate your ML journey. SageMaker JumpStart gives you access to built-in algorithms with pre-trained models from popular model hubs, pre-trained foundation models to help you perform tasks such as article summarization and image generation, and end-to-end solutions to solve common use cases.

Today, I’m happy to announce that you can now share ML artifacts, such as models and notebooks, more easily with other users that share your AWS account using SageMaker JumpStart.

Using SageMaker JumpStart to Share ML Artifacts
Machine learning is a team sport. You might want to share your models and notebooks with other data scientists in your team to collaborate and increase productivity. Or, you might want to share your models with operations teams to put your models into production. Let me show you how to share ML artifacts using SageMaker JumpStart.

In SageMaker Studio, select Models in the left navigation menu. Then, select Shared models and Shared by my organization. You can now discover and search ML artifacts that other users shared within your AWS account. Note that you can add and share ML artifacts developed with SageMaker as well as those developed outside of SageMaker.

To share a model or notebook, select Add. For models, provide basic information, such as title, description, data type, ML task, framework, and any additional metadata. This information helps other users to find the right models for their use cases. You can also enable training and deployment for your model. This allows users to fine-tune your shared model and deploy the model in just a few clicks through SageMaker JumpStart.

Amazon SageMaker Jumpstart - Add model to private ML hub

To enable model training, you can select an existing SageMaker training job that will autopopulate all relevant information. This information includes the container framework, training script location, model artifact location, instance type, default training and validation datasets, and target column. You can also provide custom model training information by selecting a prebuilt SageMaker Deep Learning Container or selecting a custom Docker container in Amazon ECR. You can also specify default hyperparameters and metrics for model training.

To enable model deployment, you also need to define the container image to use, the inference script and model artifact location, and the default instance type. Have a look at the SageMaker Developer Guide to learn more about model training and model deployment options.

Sharing a notebook works similarly. You need to provide basic information about your notebook and the Amazon S3 location of the notebook file.

Amazon SageMaker JumpStart - Add a notebook to private ML hub

Users that share your AWS account can now browse and select shared models to fine-tune, deploy endpoints, or run notebooks directly in SageMaker JumpStart.

In SageMaker Studio, select Quick start solutions in the left navigation menu, then select Solutions, models, example notebooks to access all shared ML artifacts, together with pre-trained models from popular model hubs and end-to-end solutions.

Amazon SageMaker JumpStart

Now Available
The new ML artifact-sharing capability within Amazon SageMaker JumpStart is available today in all AWS Regions where Amazon SageMaker JumpStart is available. To learn more, visit Amazon SageMaker JumpStart and the SageMaker JumpStart documentation.

Start sharing your models and notebooks with Amazon SageMaker JumpStart today!

— Antje

AWS Machine Learning University New Educator Enablement Program to Build Diverse Talent for ML/AI Jobs

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/aws-machine-learning-university-new-educator-enablement-program-to-build-diverse-talent-for-ml-ai-jobs/

AWS Machine Learning University is now providing a free educator enablement program. This program provides faculty at community colleges, minority-serving institutions (MSIs), and historically Black colleges and universities (HBCUs) with the skills and resources to teach data analytics, artificial intelligence (AI), and machine learning (ML) concepts to build a diverse pipeline for in-demand jobs of today and tomorrow.

According to the National Science Foundation, Black and Hispanic or Latino students earn bachelor’s degrees in Computer Science—the dominant pathway to AI/ML—at a much lower rate than their white peers, earning less than 11 percent of computer science degrees awarded. However, research shows that having diverse perspectives among skilled practitioners and across the AI/ML lifecycle contributes to the development of AI/ML systems that are safe, trustworthy, and have less bias. 

In 2018, we announced the Machine Learning University (MLU) to share with all developers the same courses that we used to train engineers at Amazon and AWS. This platform offers self-service, self-paced, AI/ML digital courses.

Machine Learning University home page

And today, we add this new program to our AI/ML training offering. Although anyone could access the MLU self-paced learning, it places the burden on the learner to source prerequisite work and solutions. This educator enablement program takes the concepts and lessons developed by MLU and makes them more accessible to educators. It offers a year-round educator enablement program with lesson planning, course playbooks, and access to free compute resources.

Program Details
Educators are onboarded in small-group cohorts into bootcamps where they will learn the material and deep dive into how to teach it via instructor-led lectures and hands-on projects. Educators who complete the bootcamp can take part in different year-round development opportunities, such as a dedicated Slack channel to share teaching best practices, education topic series and virtual study sessions moderated by MLU instructors, and regional events for continued professional development. Also, they will receive continuing education credits and AWS-provided stipends.

Faculty and students get access to instructional material through Amazon SageMaker Studio Lab. SageMaker Studio Lab was announced last year and is AWS’s free (no credit card required) ML development environment. It provides computing and storage for anybody that wants to learn and experiment with ML. Institutions can unlock additional resources to support their ML programs by registering for AWS Academy. AWS Academy unlocks all the AWS services for a complete AI/ML program.

Community colleges and universities can integrate this educator enablement program into their computer science, information technology, and business curricula to create an AI/ML course, certificate, or degree. We have worked with educators and education boards such as Houston Community College to create content that is vetted for credit-worthy and degree-earning curricula.

In August 2022, we launched our first educator bootcamp in partnership with The Coding School. The bootcamp was delivered over two weeks, offering lectures, case studies, and hands-on projects. 25 educators completed the Educator Machine Learning Bootcamp, representing 22 US community colleges and universities.

Learn More and Join The Program
During 2023, AWS Machine Learning University will run six educator-enablement cohorts starting in January. The program will give priority consideration to educators at community colleges, MSIs, and HBCUs, in alignment with this program mission to increase access to AI/ML technology to historically underserved and underrepresented students.

If you are a computer science educator or part of a board of educators interested in fostering more depth in your computer science coursework, you should sign up for the educator enablement program.

Marcia

New for Amazon Redshift – Simplify Data Ingestion and Make Your Data Warehouse More Secure and Reliable

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-redshift-simplify-data-ingestion-and-make-your-data-warehouse-more-secure-and-reliable/

When we talk with customers, we hear that they want to be able to harness insights from data in order to make timely, impactful, and actionable business decisions. A common pattern with data-driven organizations is that they have many different data sources they need to ingest into their analytics systems. This requires them to build manual data pipelines spanning across their operational databases, data lakes, streaming data, and data within their warehouse. As a consequence of this complex setup, it can take data engineers weeks or even months to build data ingestion pipelines. These data pipelines are costly, and the delays can lead to missed business opportunities. Additionally, data warehouses are increasingly becoming mission critical systems that require high availability, reliability, and security.

Amazon Redshift is a fully managed petabyte-scale data warehouse used by tens of thousands of customers to easily, quickly, securely, and cost-effectively analyze all their data at any scale. This year at re:Invent, Amazon Redshift has announced a number of features to help you simplify data ingestion and get to insights easily and quickly, within a secure, reliable environment.

In this blog, I introduce some of these new features that fit into two main categories:

  • Simplify data ingestion
    • Amazon Redshift now supports auto-copy from Amazon S3 (available in preview). With this new capability, Amazon Redshift automatically loads the files that arrive in an Amazon Simple Storage Service (Amazon S3) location that you specify into your data warehouse. The files can use any of the formats supported by the Amazon Redshift copy command, such as CSV, JSON, Parquet, and Avro. In this way, you don’t need to manually or repeatedly run copy procedures. Amazon Redshift automates file ingestion and takes care of data-loading steps under the hood.
    • With Amazon Aurora zero-ETL integration with Amazon Redshift, you can use Amazon Redshift for near real-time analytics and machine learning on petabytes of transactional data stored on Amazon Aurora MySQL databases (available in limited preview). With this capability, you can choose the Amazon Aurora databases containing the data you want to analyze with Amazon Redshift. Data is then replicated into your data warehouse within seconds after transactional data is written into Amazon Aurora, eliminating the need to build and maintain complex data pipelines. You can replicate data from multiple Amazon Aurora databases into the same Amazon Redshift instance to run analytics across multiple applications. With near real-time access to transactional data, you can leverage Amazon Redshift’s analytics and capabilities, such as built-in machine learning (ML), materialized views, data sharing, and federated access to multiple data stores and data lakes, to derive insights from transactional and other data.
    • With the general availability of Amazon Redshift Streaming Ingestion, you can now natively ingest hundreds of megabytes of data per second from Amazon Kinesis Data Streams and Amazon MSK into an Amazon Redshift materialized view and query it in seconds. Learn more in this post.
  • Make your data warehouse more secure and reliable
    • You can now improve the availability of your data warehouse by choosing multiple Availability Zone (AZ) deployments. Multi-AZ deployments for your Amazon Redshift clusters are available in preview and reduce recovery times to seconds through automatic recovery. In this way, you can build solutions that are more compliant with the recommendations of the Reliability Pillar of the AWS Well-Architected Framework.
    • With dynamic data masking (available in preview), you can protect sensitive information stored in your data warehouse and ensure that only the relevant data is accessible by users based on their roles. You can limit how much identifiable data is visible to users using multiple levels of policies so different users and groups can have different levels of data access without having to create multiple copies of data. Dynamic data masking complements other granular access control capabilities in Amazon Redshift including row-level and column-level security and role-based access controls. In this way, Dynamic Data Masking helps you meet requirements for GDPR, CCPA, and other privacy regulations.
    • Amazon Redshift now supports central access controls for data sharing with AWS Lake Formation (available in public preview). You can now use Lake Formation to simplify governance of data shared from Amazon Redshift and centrally manage granular access across all data-sharing consumers.

There have been other interesting news for Amazon Redshift at re:Invent you might have already heard about:

  • The general availability of Amazon Redshift integration for Apache Spark makes it easy to build and run Spark applications on Amazon Redshift and Redshift Serverless, opening up the data warehouse for a broader set of AWS analytics and machine learning solutions.
  • AWS Backup now supports Amazon Redshift. AWS Backup allows you to define a central backup policy to manage data protection of your applications and can also protect your Amazon Redshift clusters. In this way, you have a consistent experience when managing data protection across all supported services.

Availability and Pricing
Multi-AZ deployments, central access control for data sharing with AWS Lake Formation, auto-copy from Amazon S3, and dynamic data masking are available in preview in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm).

There is no additional cost for using auto-copy from Amazon S3 and near real-time analytics on transactional data. There is no extra charge for dynamic data masking and central access control for data sharing. For more information, see Amazon Redshift pricing.

These new capabilities take you one step further in analyzing all your data across data sources with simple data ingestion capabilities, while improving the security and reliability of your data warehouse.

Danilo

New — Introducing Support for Real-Time and Batch Inference in Amazon SageMaker Data Wrangler

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/new-introducing-support-for-real-time-and-batch-inference-in-amazon-sagemaker-data-wrangler/

To build machine learning models, machine learning engineers need to develop a data transformation pipeline to prepare the data. The process of designing this pipeline is time-consuming and requires a cross-team collaboration between machine learning engineers, data engineers, and data scientists to implement the data preparation pipeline into a production environment.

The main objective of Amazon SageMaker Data Wrangler is to make it easy to do data preparation and data processing workloads. With SageMaker Data Wrangler, customers can simplify the process of data preparation and all of the necessary steps of data preparation workflow on a single visual interface. SageMaker Data Wrangler reduces the time to rapidly prototype and deploy data processing workloads to production, so customers can easily integrate with MLOps production environments.

However, the transformations applied to the customer data for model training need to be applied to new data during real-time inference. Without support for SageMaker Data Wrangler in a real-time inference endpoint, customers need to write code to replicate the transformations from their flow in a preprocessing script.

Introducing Support for Real-Time and Batch Inference in Amazon SageMaker Data Wrangler
I’m pleased to share that you can now deploy data preparation flows from SageMaker Data Wrangler for real-time and batch inference. This feature allows you to reuse the data transformation flow which you created in SageMaker Data Wrangler as a step in Amazon SageMaker inference pipelines.

SageMaker Data Wrangler support for real-time and batch inference speeds up your production deployment because there is no need to repeat the implementation of the data transformation flow. You can now integrate SageMaker Data Wrangler with SageMaker inference. The same data transformation flows created with the easy-to-use, point-and-click interface of SageMaker Data Wrangler, containing operations such as Principal Component Analysis and one-hot encoding, will be used to process your data during inference. This means that you don’t have to rebuild the data pipeline for a real-time and batch inference application, and you can get to production faster.

Get Started with Real-Time and Batch Inference
Let’s see how to use the deployment supports of SageMaker Data Wrangler. In this scenario, I have a flow inside SageMaker Data Wrangler. What I need to do is to integrate this flow into real-time and batch inference using the SageMaker inference pipeline.

First, I will apply some transformations to the dataset to prepare it for training.

I add one-hot encoding on the categorical columns to create new features.

Then, I drop any remaining string columns that cannot be used during training.

My resulting flow now has these two transform steps in it.

After I’m satisfied with the steps I have added, I can expand the Export to menu, and I have the option to export to SageMaker Inference Pipeline (via Jupyter Notebook).

I select Export to SageMaker Inference Pipeline, and SageMaker Data Wrangler will prepare a fully customized Jupyter notebook to integrate the SageMaker Data Wrangler flow with inference. This generated Jupyter notebook performs a few important actions. First, define data processing and model training steps in a SageMaker pipeline. The next step is to run the pipeline to process my data with Data Wrangler and use the processed data to train a model that will be used to generate real-time predictions. Then, deploy my Data Wrangler flow and trained model to a real-time endpoint as an inference pipeline. Last, invoke my endpoint to make a prediction.

This feature uses Amazon SageMaker Autopilot, which makes it easy for me to build ML models. I just need to provide the transformed dataset which is the output of the SageMaker Data Wrangler step and select the target column to predict. The rest will be handled by Amazon SageMaker Autopilot to explore various solutions to find the best model.

Using AutoML as a training step from SageMaker Autopilot is enabled by default in the notebook with the use_automl_step variable. When using the AutoML step, I need to define the value of target_attribute_name, which is the column of my data I want to predict during inference. Alternatively, I can set use_automl_step to False if I want to use the XGBoost algorithm to train a model instead.

On the other hand, if I would like to instead use a model I trained outside of this notebook, then I can skip directly to the Create SageMaker Inference Pipeline section of the notebook. Here, I would need to set the value of the byo_model variable to True. I also need to provide the value of algo_model_uri, which is the Amazon Simple Storage Service (Amazon S3) URI where my model is located. When training a model with the notebook, these values will be auto-populated.

In addition, this feature also saves a tarball inside the data_wrangler_inference_flows folder on my SageMaker Studio instance. This file is a modified version of the SageMaker Data Wrangler flow, containing the data transformation steps to be applied at the time of inference. It will be uploaded to S3 from the notebook so that it can be used to create a SageMaker Data Wrangler preprocessing step in the inference pipeline.

The next step is that this notebook will create two SageMaker model objects. The first object model is the SageMaker Data Wrangler model object with the variable data_wrangler_model, and the second is the model object for the algorithm, with the variable algo_model. Object data_wrangler_model will be used to provide input in the form of data that has been processed into algo_model for prediction.

The final step inside this notebook is to create a SageMaker inference pipeline model, and deploy it to an endpoint.

Once the deployment is complete, I will get an inference endpoint that I can use for prediction. With this feature, the inference pipeline uses the SageMaker Data Wrangler flow to transform the data from your inference request into a format that the trained model can use.

In the next section, I can run individual notebook cells in Make a Sample Inference Request. This is helpful if I need to do a quick check to see if the endpoint is working by invoking the endpoint with a single data point from my unprocessed data. Data Wrangler automatically places this data point into the notebook, so I don’t have to provide one manually.

Things to Know
Enhanced Apache Spark configuration — In this release of SageMaker Data Wrangler, you can now easily configure how Apache Spark partitions the output of your SageMaker Data Wrangler jobs when saving data to Amazon S3. When adding a destination node, you can set the number of partitions, corresponding to the number of files that will be written to Amazon S3, and you can specify column names to partition by, to write records with different values of those columns to different subdirectories in Amazon S3. Moreover, you can also define the configuration in the provided notebook.

You can also define memory configurations for SageMaker Data Wrangler processing jobs as part of the Create job workflow. You will find similar configuration as part of your notebook.

Availability — SageMaker Data Wrangler supports for real-time and batch inference as well as enhanced Apache Spark configuration for data processing workloads are generally available in all AWS Regions that Data Wrangler currently supports.

To get started with Amazon SageMaker Data Wrangler supports for real-time and batch inference deployment, visit AWS documentation.

Happy building
— Donnie