Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

2024-12-19 Somdeb Bhattacharjee

Post Syndicated from Somdeb Bhattacharjee original https://aws.amazon.com/blogs/big-data/implement-a-custom-subscription-workflow-for-unmanaged-amazon-s3-assets-published-with-amazon-datazone/

Organizational data is often fragmented across multiple lines of business, leading to inconsistent and sometimes duplicate datasets. This fragmentation can delay decision-making and erode trust in available data. Amazon DataZone, a data management service, helps you catalog, discover, share, and govern data stored across AWS, on-premises systems, and third-party sources. Although Amazon DataZone automates subscription fulfillment for structured data assets—such as data stored in Amazon Simple Storage Service (Amazon S3), cataloged with the AWS Glue Data Catalog, or stored in Amazon Redshift—many organizations also rely heavily on unstructured data. For these customers, extending the streamlined data discovery and subscription workflows in Amazon DataZone to unstructured data, such as files stored in Amazon S3, is critical.

For example, Genentech, a leading biotechnology company, has vast sets of unstructured gene sequencing data organized across multiple S3 buckets and prefixes. They need to enable direct access to these data assets for downstream applications efficiently, while maintaining governance and access controls.

In this post, we demonstrate how to implement a custom subscription workflow using Amazon DataZone, Amazon EventBridge, and AWS Lambda to automate the fulfillment process for unmanaged data assets, such as unstructured data stored in Amazon S3. This solution enhances governance and simplifies access to unstructured data assets across the organization.

Solution overview

For our use case, the data producer has unstructured data stored in S3 buckets, organized with S3 prefixes. We want to publish this data to Amazon DataZone as discoverable S3 data. On the consumer side, users need to search for these assets, request subscriptions, and access the data within an Amazon SageMaker notebook, using their own custom AWS Identity and Access Management (IAM) roles.

The proposed solution involves creating a custom subscription workflow that uses the event-driven architecture of Amazon DataZone. Amazon DataZone keeps you informed of key activities (events) within your data portal, such as subscription requests, updates, comments, and system events. These events are delivered through the EventBridge default event bus.

An EventBridge rule captures subscription events and invokes a custom Lambda function. This Lambda function contains the logic to manage access policies for the subscribed unmanaged asset, automating the subscription process for unstructured S3 assets. This approach streamlines data access while ensuring proper governance.

To learn more about working with events using EventBridge, refer to Events via Amazon EventBridge default bus.

The solution architecture is shown in the following screenshot.

Custom subscription workflow architecture diagram

To implement the solution, we complete the following steps:

As a data producer, publish an unstructured S3 based data asset as S3ObjectCollectionType to Amazon DataZone.
For the consumer, create a custom AWS service environment in the consumer Amazon DataZone project and add a subscription target for the IAM role attached to a SageMaker notebook instance. Now, as a consumer, request access to the unstructured asset published in the previous step.
When the request is approved, capture the subscription created event using an EventBridge rule.
Invoke a Lambda function as the target for the EventBridge rule and pass the event payload to it:
The Lambda function does 2 things:
1. Fetches the asset details, including the Amazon Resource Name (ARN) of the S3 published asset and the IAM role ARN from the subscription target.
2. Uses the information to update the S3 bucket policy granting List/Get access to the IAM role.

Prerequisites

To follow along with the post, you should have an AWS account. If you don’t have one, you can sign up for one.

For this post, we assume you know how to create an Amazon DataZone domain and Amazon DataZone projects. For more information, see Create domains and Working with projects and environments in Amazon DataZone.

Also, for simplicity, we use the same IAM role for the Amazon DataZone admin (creating domains) as well the producer and consumer personas.

Publish unstructured S3 data to Amazon DataZone

We have uploaded some sample unstructured data into an S3 bucket. This is the data that will be published to Amazon DataZone. You can use any unstructured data, such as an image or text file.

On the Properties tab of the S3 folder, note the ARN of the S3 bucket prefix.

Complete the following steps to publish the data:

Create an Amazon DataZone domain in the account and navigate to the domain portal using the link for Data portal URL.

DataZone domain creation

Create a new Amazon DataZone project (for this post, we name it unstructured-data-producer-project) for publishing the unstructured S3 data asset.
On the Data tab of the project, choose Create data asset.

Data asset creation

Enter a name for the asset.
For Asset type, choose S3 object collection.
For S3 location ARN, enter the ARN of the S3 prefix.

After you create the asset, you can add glossaries or metadata forms, but it’s not necessary for this post. You can publish the data asset so it’s now discoverable within the Amazon DataZone portal.

Set up the SageMaker notebook and SageMaker instance IAM role

Create an IAM role which will be attached to the SageMaker notebook instance. For the trust policy, allow SageMaker to assume this role and leave the Permissions tab blank. We refer to this role as the instance-role throughout the post.

SageMaker instance role

Next, create a SageMaker notebook instance from the SageMaker console. Attach the instance-role to the notebook instance.

SageMaker instance

Set up the consumer Amazon DataZone project, custom AWS service environment, and subscription target

Complete the following steps:

Log in to the Amazon DataZone portal and create a consumer project (for this post, we call it custom-blueprint-consumer-project), which will used by the consumer persona to subscribe to the unstructured data asset.

Custom blueprint project name

We use the recently launched custom blueprints for AWS services for creating the environment in this consumer project. The custom blueprint allows you to bring your own environment IAM role to integrate your existing AWS resources with Amazon DataZone. For this post, we create a custom environment to directly integrate SageMaker notebook access from the Amazon DataZone portal.

Before you create the custom environment, create the environment IAM role that will be used in the custom blueprint. The role should have a trust policy as shown in the following screenshot. For the permissions, attach the AWS managed policy AmazonSageMakerFullAccess. We refer to this role as the environment-role throughout the post.

Custom Environment role

To create the custom environment, first enable the Custom AWS Service blueprint on the Amazon DataZone console.

Enable custom blueprint

Open the blueprint to create a new environment as shown in the following screenshot.
For Owning project, use the consumer project that you created earlier and for Permissions, use the environment-role.

Custom environment project and role

After you create the environment, open it to create a customized URL for the SageMaker notebook access.

SageMaker custom URL

Create a new custom AWS link and enter the URL from the SageMaker notebook.

You can find it by navigating to the SageMaker console and choosing Notebooks in the navigation pane.

Choose Customize to add the custom link.

Add the custom link

Next, create a subscription target in the custom environment to pass the instance role that needs access to the unstructured data.

A subscription target is an Amazon DataZone engineering concept that allows Amazon DataZone to fulfill subscription requests for managed assets by granting access based on the information defined in the target like domain-id, environment-id, or authorized-principals.

Currently, creation of subscription targets is only allowed using the AWS Command Line Interface (AWS CLI). You can use the command create-subscription-target to create the subscription target.

The following is an example JSON payload for the subscription target creation. Create it as a JSON file on your workstation (for this post, we call it blog-sub-target.json). Replace the domain ID and the environment ID with the corresponding values for your domain and environment.

{
"domainIdentifier": "<<your-domain-id>>",
"environmentIdentifier": "<<your-environment-id>>",
"name": "custom-s3-target-consumerenv",
"type": "GlueSubscriptionTargetType",
"manageAccessRole": "<<provide the environment-role here>>",
"applicableAssetTypes": ["S3ObjectCollectionAssetType"],
"provider": "Custom Provider",
"authorizedPrincipals": [ "<<provide the instance-role here>>"],
"subscriptionTargetConfig": [{
"formName": "GlueSubscriptionTargetConfigForm",
"content": "{\"databaseName\":\"customdb1\"}"
}]
}

You can get the domain ID from the user name button in the upper right Amazon DataZone data portal; it’s in the format dzd_<<some-random-characters>>.

For the environment ID, you can find it on the Settings tab of the environment within your consumer project.

Open an AWS CloudShell environment and upload the JSON payload file using the Actions option in the CloudShell terminal.
You can now create a new subscription target using the following AWS CLI command:

aws datazone create-subscription-target --cli-input-json file://blog-sub-target.json

Create subscription target

To verify the subscription target was created successfully, run the list-subscription-target command from the AWS CloudShell environment:

aws datazone list-subscription-targets —domain-identifier <<domain-id>> —environment-identifier <<environment-id>>

Create a function to respond to subscription events

Now that you have the consumer environment and subscription target set up, the next step is to implement a custom workflow for handling subscription requests.

The simplest mechanism to handle subscription events is a Lambda function. The exact implementation may vary based on environment; for this post, we walk through the steps to create a simple function to handle subscription creation and cancellation.

On the Lambda console, choose Functions in the navigation pane.
Choose Create function.
Select Author from scratch.
For Function name, enter a name (for example, create-s3policy-for-subscription-target).
For Runtime¸ choose Python 3.12.
Choose Create function.

Author Lambda function

This should open the Code tab for the function and allow editing of the Python code for the function. Let’s look at some of the key components of a function to handle the subscription for unmanaged S3 assets.

Handle only relevant events

When the function gets invoked, we check to make sure it’s one of the events that’s relevant for managing access. Otherwise, the function can simply return a message without taking further action.

def lambda_handler(event, context):
    # Get the basic info about the event
    event_detail = event['detail']

    # Make sure it's one of the events we're interested in
    event_source = event['source']
    event_type = event['detail-type']

    if event_source != 'aws.datazone':
        return '{"Response" : "Not a DataZone event"}'
    elif event_type not in ['Subscription Created', 'Subscription Cancelled', 
                               'Subscription Revoked']:
        return '{"Response" : "Not a subscription created, cancelled, or revoked event"}'

These subscription events should include both the domain ID and a request ID (among other attributes). You can use these to look up the details of the subscription request in Amazon DataZone:

sub_request = dz.get_subscription_request_details(
domainIdentifier = domain_id,
identifier= sub_request_id
)
asset_listing = sub_request['subscribedListings'][0]['item']['assetListing']
form_data = json.loads(asset_listing['forms'])
asset_id = asset_listing['entityId']
asset_version = asset_listing['entityRevision']
asset_type = asset_listing['entityType']

Part of the subscription request should include the ARN for the S3 bucket in question, so you can retrieve that:

# We only want to take action if this is a S3 asset
    if asset_type == 'S3ObjectCollectionAssetType':
        # Get the bucket ARN from the form info for the asset
        bucket_arn = form_data['S3ObjectCollectionForm']['bucketArn']
        
        #Get the principal from the subscription target
        principal = get_principal(domain_id,project_id)

        try:
            # Get the bucket name from the ARN                    
            bucket_name_with_prefix = bucket_arn.split(':')[5]
            bucket_name = bucket_name_with_prefix.split('/')[0]
           
        except IndexError:
            response = '{"Response" : "Could not find bucket name in ARN"}'
            return response

You can also use the Amazon DataZone API calls to get the environment associated with the project making the subscription request for this S3 asset. After retrieving the environment ID, you can check which IAM principals have been authorized to access unmanaged S3 assets using the subscription target:

        list_sub_target = dz.list_subscription_targets(
            domainIdentifier=domain_id,
            environmentIdentifier=environment_id,
            maxResults=50,
            sortBy='CREATED_AT',
            sortOrder='DESCENDING'
            )
        
        print('asset type:', list_sub_target['items'][0]['applicableAssetTypes'])
        
        if list_sub_target['items'][0]['applicableAssetTypes'] == ['S3ObjectCollectionAssetType']:
            role_arn = list_sub_target['items'][0]['authorizedPrincipals']
            print('role arn',role_arn)

If this is a new subscription, add the relevant IAM principal to the S3 bucket policy by appending a statement that allows the desired S3 actions on this bucket for the new principal:

        if event_type == 'Subscription Created':
            if bucket_arn[-1] == '/':
                statement_block.append({
                    'Sid' : sid_string,
                    'Action': S3_ACTION_STRING,
                    'Resource': [
                        bucket_arn,
                        bucket_arn + '*'
                    ],
                    'Effect': 'Allow',
                    'Principal': {'AWS': principal}
                })

Conversely, if this is a subscription being revoked or cancelled, remove the previously added statement from the bucket policy to make sure the IAM principal no longer has access:

        elif event_type == 'Subscription Cancelled' or event_type == 'Subscription Revoked':
            # Remove the statement from the policy if it's there
            # Made sure to handle case where there's no Sid for a statement
            pruned_statement_block = []
            for statement in statement_block:
                if 'Sid' not in statement or statement['Sid'] != sid_string:
                    pruned_statement_block.append(statement)
            statement_block = pruned_statement_block

The completed function should be able to handle adding or removing principals like IAM roles or users to a bucket policy. Be sure to handle cases where there is no existing bucket policy or where a cancellation means removing the only statement in the policy, meaning the entire bucket policy is no longer needed.

The following is an example of a completed function:

import json
import boto3
import os


dz = boto3.client('datazone')
s3 = boto3.client('s3')

# The list of actions to be permitted on the bucket in the newly granted policy
S3_ACTION_STRING = 's3:*'

def build_policy_statements(event_type, statement_block, principal, sub_request_id, bucket_arn):
        # Generate a Sid that should be unique
        sid_string = ''.join(c for c in f'DZ{principal}{sub_request_id}' if c.isalnum())
        # Add a new policy statement that gives the prinicpal access to whole bucket.
        # If it turns out something other than bucket ARN is allowed in asset, we can
        # get more granular than that
        # Sid that should be unique in case we need to handle unsubscribe
        print('statement block :',statement_block)
        if event_type == 'Subscription Created':
            if bucket_arn[-1] == '/':
                statement_block.append({
                    'Sid' : sid_string,
                    'Action': S3_ACTION_STRING,
                    'Resource': [
                        bucket_arn,
                        bucket_arn + '*'
                    ],
                    'Effect': 'Allow',
                    'Principal': {'AWS': principal}
                })
            else:
                statement_block.append({
                    'Sid' : sid_string,
                    'Action': S3_ACTION_STRING,
                    'Resource': [
                        bucket_arn,
                        bucket_arn + '/*'
                    ],
                    'Effect': 'Allow',
                    'Principal': {'AWS': principal}
                })
        elif event_type == 'Subscription Cancelled' or event_type == 'Subscription Revoked':
            # Remove the statement from the policy if it's there
            # Made sure to handle case where there's no Sid for a statement
            pruned_statement_block = []
            for statement in statement_block:
                if 'Sid' not in statement or statement['Sid'] != sid_string:
                    pruned_statement_block.append(statement)
            statement_block = pruned_statement_block
           

        return statement_block

def lambda_handler(event, context):
    """Lambda function reacting to DataZone subscribe events

    Parameters
    ----------
    event: dict, required
        Event Bridge Events Format

    context: object, required
        Lambda Context runtime methods and attributes

    Returns
    ------
        Simple reponse indicating success or failure reason
    """
    # Get the basic info about the event
    event_detail = event['detail']

    # Make sure it's one of the events we're interested in
    event_source = event['source']
    event_type = event['detail-type']

    if event_source != 'aws.datazone':
        return '{"Response" : "Not a DataZone event"}'
    elif event_type not in ['Subscription Created', 'Subscription Cancelled', 
                               'Subscription Revoked']:
        return '{"Response" : "Not a subscription created, cancelled, or revoked event"}'

    
    # get the domain_id and other information
    domain_id = event_detail['metadata']['domain']
    project_id = event_detail['metadata']['owningProjectId']
    sub_request_id = event_detail['data']['subscriptionRequestId']
    listing_id = event_detail['data']['subscribedListing']['id']
    listing_version = event_detail['data']['subscribedListing']['version']
    
    print('domain-id',domain_id)
    print('project-id:',project_id)
    
    sub_request = dz.get_subscription_request_details(
        domainIdentifier = domain_id,
        identifier= sub_request_id
    )
   
    # Retrieve info about the asset from the request
    asset_listing = sub_request['subscribedListings'][0]['item']['assetListing']
    form_data = json.loads(asset_listing['forms'])
    asset_id = asset_listing['entityId']
    asset_version = asset_listing['entityRevision']
    asset_type = asset_listing['entityType']

    # We only want to take action if this is a S3 asset
    if asset_type == 'S3ObjectCollectionAssetType':
        # Get the bucket ARN from the form info for the asset
        bucket_arn = form_data['S3ObjectCollectionForm']['bucketArn']
        
        #Get the principal from the subscription target
        principal = get_principal(domain_id,project_id)

        try:
            # Get the bucket name from the ARN                    
            bucket_name_with_prefix = bucket_arn.split(':')[5]
            bucket_name = bucket_name_with_prefix.split('/')[0]
           
        except IndexError:
            response = '{"Response" : "Could not find bucket name in ARN"}'
            return response

        # Get the current bucket policy, or else make a blank one if there currently
        # is no policy
        try:
            bucket_policy = json.loads(s3.get_bucket_policy(Bucket=bucket_name)['Policy'])
        except s3.exceptions.from_code('NoSuchBucketPolicy'):
            bucket_policy = {'Statement': []}
        except:
            response = '{"Response" : "Could not get bucket policy"}'
            return response
        
        # Gets new policy with the subscribing principal either added or removed based on
        # event type
        new_policy_statements = build_policy_statements(event_type, bucket_policy['Statement'], principal, 
                                               sub_request_id, bucket_arn)

            
        # Write back the new policy. This can fail if the new policy is too big
        # or if for some reason the function role doesn't have rights to do this
        # If we removed the only policy statement, then just delete the policy
        try: 
            if not new_policy_statements:
                s3.delete_bucket_policy(Bucket = bucket_name)
            else:
                bucket_policy['Statement'] = new_policy_statements
                policy_string = json.dumps(bucket_policy)
                print('policy string :',policy_string)
                s3.put_bucket_policy(
                    Bucket=bucket_name,
                    Policy = policy_string
                )
        except Exception as e: 
            response = f'{{"Response" : "Error updating bucket policy: {e.args}"}}'
            return response
        
        # If we got here everything went as planned
        response = f'{{"Response" : "Updated policy for " + {bucket_name}}}'
    else:
        response = '{"Response" : "Not an S3 asset"}'


    return response

def get_principal(domain_id,project_id):
    # Call list environments to get the environment id
    listenv_request = dz.list_environments(
        domainIdentifier = domain_id,
        projectIdentifier= project_id
    )
    
   # In our example environment, there is only one of these
    environment_id = listenv_request['items'][0]['id']

   # Get the role we want to give access to from the subscription target info
    list_sub_target = dz.list_subscription_targets(
        domainIdentifier=domain_id,
        environmentIdentifier=environment_id,
        maxResults=50,
        sortBy='CREATED_AT',
        sortOrder='DESCENDING'
        )

    if list_sub_target['items'][0]['applicableAssetTypes'] == ['S3ObjectCollectionAssetType']:
       role_arn = list_sub_target['items'][0]['authorizedPrincipals']
   else:
        role_arn = []

    return role_arn

Because this Lambda function is intended to manage bucket policies, the role assigned to it will need a policy that allows the following actions on any buckets it is intended to manage:

s3:GetBucketPolicy
s3:PutBucketPolicy
s3:DeleteBucketPolicy

Now you have a function that is capable of editing bucket policies to add or remove the principals configured for your subscription targets, but you need something to invoke this function any time a subscription is created, cancelled, or revoked. In the next section, we cover how to use EventBridge to integrate this new function with Amazon DataZone.

Respond to subscription events in EventBridge

For events that take place within Amazon DataZone, it publishes information about each event in EventBridge. You can watch for any of these events, and invoke actions based on matching predefined rules. In this case, we’re interested in asset subscriptions being created, cancelled, or revoked, because those will determine when we grant or revoke access to the data in Amazon S3.

On the EventBridge console, choose Rules in the navigation pane.

The default event bus should automatically be present; we use it for creating the Amazon DataZone subscription rule.

Choose Create rule.
In the Rule detail section, enter the following:
1. For Name, enter a name (for example, DataZoneSubscriptions).
2. For Description, enter a description that explains the purpose of the rule.
3. For Event bus, choose default.
4. Turn on Enable the rule on the selected event bus.
5. For Rule type, select Rule with an event pattern.
Choose Next.

EventBridge rule

In the Event source section, select AWS Events or EventBridge partner events as the source of the events.

Define Event source

In the Creation method section, select Custom Pattern (JSON editor) to enable exact specification of the events needed for this solution.

Choose custom pattern

In the Event pattern section, enter the following code:

{
"detail-type": ["Subscription Created", "Subscription Cancelled", "Subscription Revoked"],
"source": ["aws.datazone"]
}

Define custom pattern JSON

Choose Next.

Now that we’ve defined the events to watch for, we can make sure those Amazon DataZone events get sent to the Lambda function we defined in the previous section.

On the Select target(s) page, enter the following for Target 1:
1. For Target types, select AWS service.
2. For Select a target, choose Lambda function
3. For Function, choose create-s3policy-for-subscription-target.
Choose Skip to Review and create.

Define event target

On the Review and create page, choose Create rule.

Subscribe to the unstructured data asset

Now that you have the custom subscription workflow in place, you can test the workflow by subscribing to the unstructured data asset.

In the Amazon DataZone portal, search for the unstructured data asset you published by browsing the catalog.

Search unstructured asset

Subscribe to the unstructured data asset using the consumer project, which starts the Amazon DataZone approval workflow.

Subscribe to unstructured asset

You should get a notification for the subscription request; follow the link and approve it.

When the subscription is approved, it will invoke the custom EventBridge Lambda workflow, which will create the S3 bucket policies for the instance role to access the S3 object. You can verify that by navigating to the S3 bucket and reviewing the permissions.

Access the subscribed asset from the Amazon DataZone portal

Now that the consumer project has been given access to the unstructured asset, you can access it from the Amazon DataZone portal.

In the Amazon DataZone portal, open the consumer project and navigate to the Environments
Choose the SageMaker-Notebook

Choose SageMaker notebook on the consumer project

In the confirmation pop-up, choose Open custom.

Choose Custom

This will redirect you to the SageMaker notebook assuming the environment role. You can see the SageMaker notebook instance.

Choose Open JupyterLab.

Open JupyterLab Notebook

Choose conda_python3 to launch a new notebook.

Launch Notebook

Add code to run get_object on the unstructured S3 data that you subscribed earlier and run the cells.

Now, because the S3 bucket policy has been updated to allow the instance role access to the S3 objects, you should see the get_object call return a HTTPStatusCode of 200.

Multi-account implementation

In the instructions so far, we’ve deployed everything in a single AWS account, but in larger organizations, resources can be distributed throughout AWS accounts, often managed by AWS Organizations. The same pattern can be applied in a multi-account environment, with some minor additions. Instead of directly acting on a bucket, the Lambda function in the domain account can assume a role in other accounts that contain S3 buckets to be managed. In each account with an S3 bucket containing assets, create a role that allows editing the bucket policy and has a trust policy referencing the Lambda role in the domain account as a principal.

Clean up

If you’ve finished experimenting and don’t want to incur any further cost for the resources deployed, you can clean up the components as follows:

Delete the Amazon DataZone domain.
Delete the Lambda function.
Delete the SageMaker instance.
Delete the S3 bucket that hosted the unstructured asset.
Delete the IAM roles.

Conclusion

By implementing this custom workflow, organizations can extend the simplified subscription and access workflows provided by Amazon DataZone to their unstructured data stored in Amazon S3. This approach provides greater control over unstructured data assets, facilitating discovery and access across the enterprise.

We encourage you to try out the solution for your own use case, and share your feedback in the comments.

About the Authors

Somdeb Bhattacharjee is a Senior Solutions Architect specializing on data and analytics. He is part of the global Healthcare and Life sciences industry at AWS, helping his customers modernize their data platform solutions to achieve their business outcomes.

Sam Yates is a Senior Solutions Architect in the Healthcare and Life Sciences business unit at AWS. He has spent most of the past two decades helping life sciences companies apply technology in pursuit of their missions to help patients. Sam holds BS and MS degrees in Computer Science.

Fish shell announces 4.0 release

2024-12-19 daroc

Post Syndicated from daroc original https://lwn.net/Articles/1002820/

fish is a shell with a custom language and several affordances not available out of the box in other shells, such as directory-sensitive command completion. Although the project does not normally make beta releases, the

newly announced 4.0 release
will have one in order to ensure that no problems were introduced

after a major effort to switch the code base from C++ to Rust.

fish is a smart and user-friendly command line shell with clever features that just work, without needing an advanced degree in bash scriptology. Today we are announcing an open beta, inviting all users to try out the upcoming 4.0 release.

fish 4.0 is a big upgrade. It’s got lots of new features to make using the command line easier and more enjoyable, such as more natural key binding and expanded history search. And under the hood, we’ve rebuilt the foundation in Rust to embrace modern computing.

The role of email security in reducing user risk amid rising threats

2024-12-19 Ayush Kumar

Post Syndicated from Ayush Kumar original https://blog.cloudflare.com/the-role-of-email-security-in-reducing-user-risk-amid-rising-threats/

Phishing remains one of the most dangerous and persistent cyber threats for individuals and organizations. Modern attacks use a growing arsenal of deceptive techniques that bypass traditional secure email gateways (SEGs) and email authentication measures, targeting organizations, employees, and vendors. From business email compromise (BEC) to QR phishing and account takeovers, these threats are designed to exploit weaknesses across multiple communication channels, including email, Slack, Teams, SMS, and cloud drives.

Phishing remains the most popular attack vector for bad actors looking to gain unauthorized access or extract fraudulent payment, and it is estimated that 90% of all attacks start with a phishing email. However, as companies have shifted to using a multitude of apps to support communication and collaboration, attackers too have evolved their approach. Attackers now engage employees across a combination of channels in an attempt to build trust and pivot targeted users to less-secure apps and devices. Cloudflare is uniquely positioned to address this trend thanks to our integrated Zero Trust services, extensive visibility from protecting approximately 20% of all websites, and signals derived from processing billions of email messages a year.

Cloudflare recognizes that combating phishing requires an integrated approach and a more complete view of user-based risk. That’s why we’ve designed our email security solution to protect organizations before, during, and after message delivery, while also extending protection beyond email into the broader security ecosystem. Phishing is no longer just an email problem — it’s a multi-channel, cross-application threat.

Assessing holistic user risk

When it comes to protecting against user-based threats, Cloudflare employs a platform approach to security. Instead of forcing customers to rely on an array of fragmented tools that create unnecessary complexity and blind spots, we treat email security as part of an overall strategy for assessing and responding to user-related risk. Our email security solution works in tandem with our network solutions so that SOC teams can quickly assert what actions their users are performing outside of email. Given our extensive network visibility, our platform is not limited by API integrations, and can provide SOC teams with the best visibility and protection. This helps SOC teams not only combat phishing, but begin to identify and take action against a wider range of insider threats.

Within a single, unified dashboard, SOC teams can quickly review detailed information regarding the following questions, which we discuss in more detail below:

Who in the organization is being targeted?
Who are the attackers impersonating?
What risky behaviors are my users performing?

Who in the organization is being targeted?

Within the Cloudflare dashboard, SOC teams can view which users are the most targeted. This can help them determine which accounts should be hardened (e.g. MFA enforced), and identify risky users that should be monitored more closely for significant deviations in behavior. One way organizations can use this information is to require high-risk users to connect from a managed device. For instance, if they use Crowdstrike, we can require that these users be on a managed device and force a posture check before letting them access sensitive applications.

SOC teams can also dive into what types of attacks are hitting their users and at what frequency.

Customers can use these insights to adjust various platform policies, effectively blocking malicious content and securing sensitive resources. Above, we can see that attackers are frequently leveraging links to try to compromise users. Based on the link analysis we are seeing in email, SOC teams can use our gateway to block similar attacks, so that when attackers try to use other communication methods (LinkedIn, Teams, Slack, etc.) users will not be able to interact with those links.

To learn more about stopping these types of multichannel phishing attacks, please see our blog post, A wild week in phishing, and what it means for you.

Who are the attackers impersonating?

SOC teams can also get visibility into impersonation attempts within their email environment. Customers can see which users are being impersonated the most, and can use this information to build policies within our email security solution and broader set of Zero Trust services.

A list of frequently impersonated users can be added to the impersonation registry, which changes the sensitivity of our models to apply more scrutiny on messages coming from those users.

Given our unique position as a domain name registrar, customers can also report lookalike domains to Cloudflare for action to be taken against them. This helps prevent attackers from being able to impersonate our customers and negatively impact their reputation.

Finally, customers can also use our free DMARC management to track who is sending emails on their behalf. This information can be used to update SPF records and get customers to p=quarantine or p=reject so that their brand is more resistant to being spoofed.

What risky behaviors are my users performing?

Cloudflare provides visibility into user actions in several ways.

Within the email security solution, we can track internal messages and alert if we see any malicious or suspicious behaviors. This can be enhanced with our managed service offering, Phishguard, which can alert admins when they see any type of behavior that indicates fraud (like Business Email Compromise), account takeover, or insider threats.

SOC teams can also take advantage of our CASB solution to view the different actions that users have performed. Actions are labeled with different risk levels to let teams know which findings are critical and require remediation.

Customers are also able to view data loss prevention (DLP) violations that users have incurred to see if there is any unauthorized egress of data. We provide the ability to automatically block this egress based on different policies within our platform, making sure there is no exfiltration of sensitive data.

We also enable organizations to put internal applications behind our Access solution. This prevents any users with improper permissions or a high risk level from accessing critical applications. Our dashboard then provides metrics on these logins to see how many failures we observed, so that SOC teams can investigate the user further.

These signals feed into our Unified Risk Score, which can be exported if needed to take automated actions within other platforms.

Increasing SOC productivity

With all of our functionality unified within a single interface and fed by one data lake, we see an increase in SOC productivity because teams no longer have to spend time building rules or flipping between disparate interfaces and workflows.

AI-driven email security

Unlike legacy secure email gateways, our email security solution is driven by predictive AI models which eliminate the need for creating and updating rules. These models are also more effective than reactive measures because they are fed by a massive volume of diverse data from across Cloudflare’s network. This means models are trained on emerging threats earlier and can identify new tactics with a higher accuracy than legacy systems.

Automated isolation

To further reduce the risk posed by users visiting potentially malicious websites, customers can isolate browser sessions using our natively integrated, clientless remote browser that runs on our global network. Within an isolated browsing session, SOC teams can prohibit various behaviors such as copy/paste, upload/download, keyboard inputs, and more. This decreases the risk of users accessing a website and performing an action which could compromise the organization.

Our browser isolation solution also decreases the time SOC teams need to maintain policies. Rather than adding domains and applications one by one, teams can choose to isolate based on content categories. These categories are based on our threat intelligence, and are constantly updated. This means that as new websites emerge, SOC teams do not have to spend the time to chase down and update the proper policy — rather, it is done automatically.

Automated blocking

While some websites might require running in an isolated browser to mitigate the risk of users encountering malicious content, others may need to be fully blocked altogether. Customers can use the same process listed above to block any website that could be risky for users based on tags. However, we allow admins to also provide feedback to users to increase awareness. This can be done via a custom block page that allows SOC teams to communicate with users about their risky behaviors, so that they take actions to curb this behavior in the future and alert their SOC teams to attacks that might be occurring.

What’s on the horizon for 2025

In 2024, our email security team focused on refining the user interface and improving the incident investigation experience. Looking ahead to 2025, we plan to introduce additional capabilities that deepen the integration of our email security solution with our SASE platform, delivering enhanced insight and protection against user-based threats.

Configurable browser isolation for email

Our Email Link Isolation feature currently applies to links we consider suspicious. However, we intend to allow customers to add customized configurations to meet their internal policies. This enhancement will provide more granular control over which websites users can access from an email message without using an isolated browser.

Outbound DLP for email

We will be releasing an add-in for Microsoft Outlook that will allow customers to use our DLP engine for inspecting outbound email messages. This client-side application enables customers to configure downstream policies that trigger action when a DLP policy is violated, all while minimizing disruption to existing email infrastructure.

Expanded user risk scoring

Cloudflare will be increasing the signals that feed into our user risk scores. This will enable SOC teams to create more policies within Cloudflare or to take automated actions externally based on the level of risk observed.

These are just a few examples of significant releases that will be coming in 2025. Please stay tuned to the Cloudflare blog where we will be announcing these releases as they happen.

Try Cloudflare Email Security today

We provide all organizations (whether a Cloudflare customer or not) with free access to our Retro Scan tool, allowing them to use our predictive AI models to scan existing inbox messages. Retro Scan will detect and highlight any threats found, enabling organizations to remediate them directly in their email accounts. With these insights, organizations can implement further controls, either using Cloudflare Email Security or their preferred solution, to prevent similar threats from reaching their inboxes in the future.

The Books We Read in High School (Part 1)

2024-12-19 The Atlantic

Post Syndicated from The Atlantic original https://www.youtube.com/watch?v=vW_TP33zX6E

„Отговорност на обществото“: Ева Ямбор за децата със синдром на Даун в Австрия

2024-12-19 Надежда Цекулова

Post Syndicated from Надежда Цекулова original https://www.toest.bg/otgovornost-na-obshtestvoto-eva-yambor-za-detsata-sus-sindrom-na-daun-v-avstria/

„Отговорност на обществото“: Ева Ямбор за децата със синдром на Даун в Австрия

Ева е омъжена за българин и животът на семейството се развива в България допреди 18 години, когато се ражда синът им Йоан. След като научават, че детето е със синдром на Даун, родителите решават да се преместят в Австрия, за да се възползват от възможностите, които австрийската система гарантира при грижите, развитието и образованието на деца с особености в развитието. Днес Йоан работи в кафене – социално предприятие, възникнало като частна инициатива, но подкрепено от държавата, след като доказва ефективността си.

Разговор за това, което е най-важно за децата и хората с особености – да са сред нас, в поредицата „Разговори за образованието на специалните деца“ с подкрепата на „Лидл България“.

Представяме Ви в родителската Ви роля и искам да помоля да започнем с родителската Ви история с Вашето дете със синдром на Даун.

Родих в България преди почти 18 години. Бях в привилегирована ситуация в сравнение с повечето раждащи жени, защото държах на раждането ми да присъстват моят личен гинеколог и моята акушерка. Бях изгубила две деца до този момент, бях в голяма криза и имах нужда от подкрепа.

Когато Йоан се роди, го посрещна една акушерка от болницата. Тя беше първата, която го взе в ръцете си, и почти изкрещя: „Не, не сте ли направили амниоцентеза?“ С тези думи посрещна Йоан на този свят. Веднага видя, че той е със синдром на Даун. Но моят гинеколог и акушерката отхвърлиха това съмнение, сякаш неадекватната реакция на тази акушерка ги стресна и ги накара да не виждат очевидното. Така в началото заживях с мисълта, че синът ми е здрав.

Аз също бях толкова влюбена в моя син, че не виждах нищо необичайно.

Не Ви ли глождеше въпросът на акушерката „Не сте ли направили амниоцентеза?“?

Не, за мен да не правя това изследване след две загуби беше осъзнат избор. От една страна, не исках да поемам никакъв риск. От друга, вярвам, че всяка жена има правото да избере дали да има деца. Но личната ми етика не позволява, след като съзнателно съм създала дете, да избирам дали искам точно това дете.

И така повече от шест месеца отглеждах Йоан като най-големия подарък, имах невероятния късмет да не чуя от никого „горката“, да ми е спестена тревогата, че съм родила дете със синдром на Даун. И той се развиваше изключително добре първите месеци. Имахме късмет и че нямаше никакви здравословни проблеми, които са често срещани при деца със синдрома.

Кога разбрахте за състоянието на Йоан?

Когато беше на 10 или 11 месеца. Вече бях успяла да си изградя представа за него като за най-прекрасното дете, може би затова новината не ме смаза така, както се случва с други родители. Дадох си сметка каква разлика е това, малко по-късно. Когато дойдохме в Австрия, ми предложиха да се включа в едно проучване сред родители на деца със синдрома. В анкетата видях въпроса „Хората поздравиха ли Ви за раждането на Вашето дете?“. Едва тогава разбрах, че често семействата, на които се ражда дете със синдром на Даун, не получават поздравления от близките си. Никога няма да забравя какво изпитах, докато четях този въпросник, и дълбокото усещане колко съм благодарна, че това ми е било спестено от съдбата.

Спомних си как, когато родих, изпратих съобщение на всичките си контакти: „Щастието ми си има име – Йоан.“ Зададох си въпроса щях ли да го напиша, ако знаех, че той е със синдром на Даун. Днес отговорът ми е „да“, но какъв щеше да бъде в онзи ден? В първия момент ти не знаеш какво те очаква, защото тези хора все още не са достатъчно видими в обществото и ние нямаме знание за тях. Има отделни случаи, които ни показват само колко много работа ни предстои още – в САЩ има една много популярна манекенка, в Испания – известен актьор, в София две момичета работят в голям хотел… Това е страхотно, но не трябва да е изключение. Трябва да е нормално да ги виждаме навсякъде, за да няма страх в обществото и хората да не се боят да поздравят едно семейство, в което се е родило дете със синдром на Даун.

Бих искала да Ви върна към решението да напуснете България, след като сте разбрала, че Йоан е дете със синдром на Даун. Какво Ви привлече в австрийската система?

Говорим за времето преди близо 18 години. Тогава в България тепърва се зараждаше някаква практика. Имаше единици специалисти, които работеха на парче. Когато научих и започнах да търся в интернет, веднага намерих информация къде трябва да отида във Виена, каква финансова подкрепа ще получа в Австрия, какви изследвания трябва да се направят, какви медицински прегледи се препоръчват и колко често. Освен това аз съм австрийка, синът ми имаше право да получи тези грижи и за нас изборът да се преместим беше естествен.

Кой организира тази подкрепа – някаква неправителствена организация, частна инициатива или държавата?

Първото място, на което семейство като нашето бива изпратено, беше един Институт за синдром на Даун, част от държавна болница. Получавах всички услуги за Йоан със здравната си карта [документ, който удостоверява здравноосигурителния статус – б.а.]. Оттам ме насочиха и към едно сдружение на родители, но към онзи момент все още не бях готова да участвам в тази общност – те се занимаваха предимно с проблеми в сферата на образованието и реализацията на по-големите деца. Предлагаха се и услуги – например заниманията с логопед, – които си плащах сама, но след това част от разходите ми се възстановяваха, а друга част се приспадаха от данъците ми.

Имаше, разбира се, и частни инициативи, чиято роля е да надграждат съществуващото. Например една дама, с която към момента вече добре се познаваме, тогава тъкмо беше създала център за обучение на родители и учители. Тя е специален педагог по професия и когато ѝ се ражда дете с Даун, се амбицира и започва да събира и адаптира наличните знания за работа с такива деца в практически програми – от това родителите да знаят как да развиват потенциала на детето чрез всекидневни игри, избор на играчки, занимания, до това как на тези деца да се преподава математика например. Обучението в този център вече се плаща от джоба, но то е съвсем друго ниво на грижа.

Добре е, че споменахте математиката, за да направим прехода към образованието. В България на теория децата със синдром на Даун трябва да учат в масови училища, където да получават допълнителна специализирана подкрепа. На практика невинаги резултатът е добър. Какъв е Вашият опит с австрийската система?

И при нас в образованието нещата са сходни. В детската градина е добре, детето може да ходи в масова група, където да е само с деца в норма. Може да ходи и в специализирана, която е по-специално организирана, но пак на включващ принцип – тоест има деца с различни увреждания, но има и здрави деца. След това обаче става по-сложно.

Първо, според мен би трябвало родителят да има възможност да изпрати детето си на училище по-късно, защото развитието е по-бавно. Но в Австрия всички тръгват на училище на шест години, а отлагане е възможно само с една година. И друго – родителите имат право да пратят детето в масово училище, но ако специализираните учители нямат достатъчно часове, няма да има кой да се занимава с него и то няма да може да се развива със собственото си темпо. Защото това, което видяхме от опит, е, че интензивният контакт и индивидуалните занимания са ключови за развитието. Децата с Даун са различни, както и ние сме различни – някои учат лесно, други много, много трудно. И за да имат истинско образование, то трябва да е съобразено с индивидуалното им темпо и нужди. Затова решихме да пратим нашия син в частно училище.

Има един концептуален въпрос, свързан с дилемата къде е по-добре за децата с особености в развитието – в масовото учебно заведение, където са сред други деца, но пък има по-малко възможности с тях да се работи индивидуално, или в някакви специализирани центрове, където имат повече индивидуални занимания, но нямат досег с естествената среда на своите връстници. Това обсъжда ли се в Австрия?

Много, много хубав въпрос и много труден отговор. Да, в Австрия също съществува тази дилема. По принцип би трябвало вече да ги няма тези специализирани центрове, в които децата с някакви особености са изолирани и ние не ги виждаме – трябва всички да могат да учат заедно. Обаче реалността е, че ако си специално дете, ходиш в масово училище и там нямаш специален учител до теб, това пречи на развитието ти.

Според мен са нужни повече усилия и инвестиции двете неща да вървят ръка за ръка – децата да са в масовото училище, да контактуват с деца без увреждания, но там, в масовата среда, да има достатъчно квалифициран персонал, който да ги подкрепя и да има условия. Например децата с особености в развитието често са по-чувствителни и не могат да издържат през целия ден на толкова много хора, шум, емоции. В масовите училища трябва да има стая, в която тези деца да провеждат част от индивидуалните си специализирани занимания. Но да ви кажа, това не сме го реализирали в Австрия, тоест няма един отговор.

Мисля, че като общество винаги трябва да показваме, че сме отворени и има място за всички нас. Откровено казано, смятам, че най-голяма полза от приобщаващото образование имат хората без увреждания. Присъствието на различни деца в класовете учи останалите на толерантност, на емпатия, създава им социално-емоционални умения, които са толкова ценни в съвременния свят. Австрийската държава, подобно на българската, има още много работа, за да направи това приобщаващо обучение наистина ефективно за всички.

Работа на държавата ли е това?

Сто процента е задача на държавата. Друг е въпросът, че когато тя не я изпълнява, хората запретват ръкави и сами създават инициативи. Например в Австрия най-големият проблем с образованието на нашите деца възниква, като навършат 15 години. До тази възраст образованието е задължително и достъпът на ученици с особености в развитието е гарантиран, но след това не е и оставането на едно дете със специални потребности в клас зависи от добрата воля на директора на училището. Миналата година организирахме протести заради това, стана голям скандал, защото именно нашите деца имат най-голяма нужда да са по-дълго време в училище. Аз съм убедена, че това ще се промени, работа на политиците е да го променят. Но междувременно, за щастие, в Австрия има много инициативи, част от една такава е и моят син.

Образователни инициативи ли?

И образователни, да. С част от неговите преподаватели създадохме едно кафене. Там Йоан и други деца като него едновременно усвояват професия и ходят на училище. Опитахме да създадем модел, където могат да получат професионална квалификация и същевременно да продължат да учат. Създадохме го с много спонсорски пари, с много дарения. След това обаче минахме през специална оценка и сега вече има държавно финансиране за този модел – един държавен фонд плаща на преподавателите на Йоан, че те работят с него. Това е типичен пример за практиката в Австрия – една частна инициатива започва някаква дейност и когато докаже ефективността си, може да получи държавна подкрепа.

Тоест това е честа практика в Австрия?

Да, да. Хубаво е, че държавата разбира, че да се подкрепят такива ангажирани хора е голям плюс. И аз съм голяма оптимистка, че точно такива инициативи ще дадат стимул и на държавата да промени политиките си. В момента в Австрия имаме много работилници. Това са специализирани места, където работят хора с увреждания, но там те са скрити, не са между нас. А в това кафене синът ми всеки ден е сред хора и някои от посетителите специално ходят там. Има една дама, която пътува половин час с трамвай три пъти седмично. Като я попитахме защо го прави, тя отговори: „Толкова е уютно при вас, колко обичам тези деца, как мило ме посрещат те!“

Това е страхотен баланс и обществото е готово за нещо подобно. Австрия все още не го е направила като държава, не го е превърнала в политика, но това е посоката, в която трябва да се работи – да виждаме хората със синдром на Даун, хората с увреждания сред нас. Разбирам, че е трудно и няма идеални модели, но отговорността за непрекъснатото развитие на средата не бива да се прехвърля на друг – тя е на обществото. Това е много важно.

Интервюто е част от поредица разговори за достъпа до образование на децата от уязвими групи. Проектът се осъществява благодарение на най-голямата социално отговорна инициатива на „Лидл България“ – „Ти и Lidl за нашето утре“, в партньорство с Фондация „Работилница за граждански инициативи“, Българския дарителски форум и Асоциацията на европейските журналисти.

„Отговорност на обществото“: Ева Ямбор за децата със синдром на Даун в Австрия

[$] LWN.net Weekly Edition for December 19, 2024

2024-12-19 corbet

Post Syndicated from corbet original https://lwn.net/Articles/1001869/

The LWN.net Weekly Edition for December 19, 2024 is available.

A Hyena in the White House

2024-12-19 The History Guy: History Deserves to Be Remembered

Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=4H_kEr5tMM8

Comic for 2024.12.19 – Plans

2024-12-19 Explosm.net

Post Syndicated from Explosm.net original https://explosm.net/comics/plans

New Cyanide and Happiness Comic

Inventor of the Rubik’s cube on the cube and how it relates to problem solving

2024-12-18 Talks at Google

Post Syndicated from Talks at Google original https://www.youtube.com/watch?v=f87Vff-Ik9A

AWS named Leader in the 2024 ISG Provider Lens report for Sovereign Cloud Infrastructure Services (EU)

2024-12-18 Marta Taggart

Post Syndicated from Marta Taggart original https://aws.amazon.com/blogs/security/aws-named-leader-in-the-2024-isg-provider-lens-report-for-sovereign-cloud-infrastructure-services-eu/

For the second year in a row, Amazon Web Services (AWS) is named as a Leader in the Information Services Group (ISG) Provider Lens Quadrant report for Sovereign Cloud Infrastructure Services (EU), published on December 18, 2024. ISG is a leading global technology research, analyst, and advisory firm that serves as a trusted business partner to more than 900 clients. This ISG report evaluates 19 providers of sovereign cloud infrastructure services in the multi public cloud environment and examines how they address the key challenges that enterprise clients face in the European Union (EU). ISG defines Leaders as providers who represent innovative strength and competitive stability.

ISG rated AWS ahead of other leading cloud providers on both the competitive strength and portfolio attractiveness axes, with the highest score on portfolio attractiveness. Competitive strength was assessed on multiple factors, including degree of awareness, core competencies, and go-to-market strategy. Portfolio attractiveness was assessed on multiple factors, including scope of portfolio, portfolio quality, strategy and vision, and local characteristics.

According to the ISG Provider Lens report, “AWS develops various innovative solutions to meet different sovereignty needs, guided by inputs from regulators, cybersecurity experts, partners and customers. These solutions address factors such as location, workload sensitivity and industry standards.”

Read the report to:

Gather insight on the factors that ISG believes will influence the sovereign cloud landscape in the EU.
Discover why AWS was named as a Leader with the highest score on portfolio attractiveness by ISG.
Learn what makes the AWS Cloud sovereign-by-design and how we continue to offer more control and more choice without compromising on the full power of AWS.

The recognition of AWS as a Leader in this report for the second year in a row is a testament to our efforts to help European customers and partners meet their digital sovereignty and resilience requirements. AWS continues to deliver on the AWS Digital Sovereignty Pledge, our commitment to offering AWS customers the most advanced set of sovereignty controls and features available in the cloud. Earlier this year, we announced plans to invest €7.8 billion in the AWS European Sovereign Cloud by 2040, building on our long-term commitment to Europe and ongoing support of the region’s sovereignty needs. The AWS European Sovereign Cloud, which will be a new, independent cloud for Europe, is set to launch by the end of 2025.

Download the full 2024 ISG Provider Lens Quadrant report for Sovereign Cloud Infrastructure Services (EU).

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

The Hasivo S600WP-5XGT-1SX-SE Might Be The Best 6-port 10GbE Switch with PoE

2024-12-18 Rohit Kumar

Post Syndicated from Rohit Kumar original https://www.servethehome.com/hasivo-s600wp-5xgt-1sx-se-review-6-port-10gbe-switch-with-poe-realtek/

The Hasivo S600WP-5XGT-1SX-SE is perhaps the cheapest 6-port 10Gbase-T and SFP+ web managed switch with PoE capabilities

The post The Hasivo S600WP-5XGT-1SX-SE Might Be The Best 6-port 10GbE Switch with PoE appeared first on ServeTheHome.

Christmas Came Early at xAI Colossus NVIDIA GB200 Shown

2024-12-18 Patrick Kennedy

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/christmas-came-early-at-xai-colossus-nvidia-gb200-shown-dell-arm/

The xAI team looks like they are in the process of deploying Dell NVIDIA GB200 NVL systems as was shown in a recent photo

The post Christmas Came Early at xAI Colossus NVIDIA GB200 Shown appeared first on ServeTheHome.

Comic for 2024.12.18 – Reckoning 2

2024-12-18 Explosm.net

Post Syndicated from Explosm.net original https://explosm.net/comics/reckoning-2

New Cyanide and Happiness Comic

New Advances in the Understanding of Prime Numbers

2024-12-18 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/12/new-advances-in-the-understanding-of-prime-numbers.html

Really interesting research into the structure of prime numbers. Not immediately related to the cryptanalysis of prime-number-based public-key algorithms, but every little bit matters.

Enhancing Search Relevancy with Cohere Rerank 3.5 and Amazon OpenSearch Service

2024-12-18 Breanne Warner

Post Syndicated from Breanne Warner original https://aws.amazon.com/blogs/big-data/enhancing-search-relevancy-with-cohere-rerank-3-5-and-amazon-opensearch-service/

This post is co-written with Elliott Choi from Cohere.

The ability to quickly access relevant information is a key differentiator in today’s competitive landscape. As user expectations for search accuracy continue to rise, traditional keyword-based search methods often fall short in delivering truly relevant results. In the rapidly evolving landscape of AI-powered search, organizations are looking to integrate large language models (LLMs) and embedding models with Amazon OpenSearch Service. In this blog post, we’ll dive into the various scenarios for how Cohere Rerank 3.5 improves search results for best matching 25 (BM25), a keyword-based algorithm that performs lexical search, in addition to semantic search. We will also cover how businesses can significantly improve user experience, increase engagement, and ultimately drive better search outcomes by implementing a reranking pipeline.

Amazon OpenSearch Service

Amazon OpenSearch Service is a fully managed service that simplifies the deployment, operation, and scaling of OpenSearch in the AWS Cloud to provide powerful search and analytics capabilities. OpenSearch Service offers robust search capabilities, including URI searches for simple queries and request body searches using a domain-specific language for complex queries. It supports advanced features such as result highlighting, flexible pagination, and k-nearest neighbor (k-NN) search for vector and semantic search use cases. The service also provides multiple query languages, including SQL and Piped Processing Language (PPL), along with customizable relevance tuning and machine learning (ML) integration for improved result ranking. These features make OpenSearch Service a versatile solution for implementing sophisticated search functionality, including the search mechanisms used to power generative AI applications.

Overview of traditional lexical search and semantic search using bi-encoders and cross-encoders

Two important techniques for using end-user search queries are lexical search and semantic search. OpenSearch Service natively supports BM25. This method, while effective for keyword searches, lacks the ability to recognize the intent or context behind a query. Lexical search relies on exact keyword matching between the query and documents. For a natural language query searching for “super hero toys,” it retrieves documents containing those exact terms. While this method is fast and works well for queries targeted at specific terms, it fails to capture context and synonyms, potentially missing relevant results that use different words such as “action figures of superheroes.” Bi-encoders are a specific type of embedding model designed to independently encode two pieces of text. Documents are first turned into an embedding or encoded offline and queries are encoded online at search time. In this approach, the query and document encodings are generated with the same embedding algorithm. The query’s encoding is then compared to pre-computed document embeddings. The similarity between query and documents is measured by their relative distances, despite being encoded separately. This allows the system to recognize synonyms and related concepts, such as “action figures” is related to “toys” and “comic book characters” to “super heroes.”

By contrast, processing the same query—”super hero toys”—with cross-encoders involves first retrieving a set of candidate documents using methods such as lexical search or bi-encoders. Each query-document pair is then jointly evaluated by the cross-encoder, which inputs the combined text to deeply model interactions between the query and document. This approach allows the cross-encoder to understand context, disambiguate meanings, and capture nuances by analyzing every word in relation to each other. It also assigns precise relevance scores to each pair, re-ranking the documents so that those most closely matching the user’s intent—specifically about toys depicting superheroes—are prioritized. Therefore, this significantly enhances search relevancy compared to methods that encode queries and documents independently.

It’s important to note that the effectiveness of semantic search, such as two-stage retrieval search pipelines, depend heavily on the quality of the initial retrieval stage. The primary goal of a robust first-stage retrieval is to efficiently recall a subset of potentially relevant documents from a large collection, setting the foundation for more sophisticated ranking in later stages. The quality of the first-stage results directly impacts the performance of subsequent ranking stages. The goal is to maximize recall and capture as many relevant documents as possible because the later ranking stage has no way to recover excluded documents. A poor initial retrieval can limit the effectiveness of even the most sophisticated re-ranking algorithms.

Overview of Cohere Rerank 3.5

Cohere is an AWS third-party model provider partner that provides advanced language AI models, including embeddings, language models, and reranking models. See Cohere Rerank 3.5 now generally available on Amazon Bedrock to learn more about accessing Cohere’s state-of- the-art models using Amazon Bedrock. The Cohere Rerank 3.5 model focuses on enhancing search relevance by reordering initial search results based on deeper semantic understanding of the user query. Rerank 3.5 uses a cross-encoder architecture where the input of the model always consists of a data pair (for example, a query and a document) that is processed jointly by the encoder. The model outputs an ordered list of results, each with an assigned relevance score, as shown in the following GIF.

Cohere Rerank 3.5 with OpenSearch Service search

Many organizations rely on OpenSearch Service for their lexical search needs, benefiting from its robust and scalable infrastructure. When organizations want to enhance their search capabilities to match the sophistication of semantic search, they are challenged with overhauling their existing systems. Often it is a difficult engineering task for teams or may not be feasible. Now through a single Rerank API call in Amazon Bedrock, you can integrate Rerank into existing systems at scale. For financial services firms, this means more accurate matching of complex queries with relevant financial products and information. For e-commerce businesses, they can improve product discovery and recommendations, potentially boosting conversion rates. The ease of integration through a single API call with Amazon OpenSearch enables quick implementation, offering a competitive edge in user experience without significant disruption or resource allocation.

In benchmarks conducted by Cohere, the normalized Discounted Cumulative Gain (nDCG), Cohere Rerank 3.5 improved accuracy when compared to Cohere’s previous Rerank 3 model as well as BM25 and hybrid search across a financial, e-commerce and project management data sets. The nDCG is a metric that’s used to evaluate the quality of a ranking system by assessing how well ranked items align with their actual relevance and prioritizes relevant results at the top. In this study, @10 indicates that the metric was calculated considering only the top 10 items in the ranked list. The nDCG metric is helpful because metrics such as precision, recall, and the F-score measure predictive performance without taking into account the position of ranked results. Whereas the nDCG normalizes scores and discounts relevant results that are returned lower on the list of results. The following figures below shows these performance improvements of Cohere Rerank 3.5 for financial domain as well as e-commerce evaluation consisting of external datasets.

Also, Cohere Rerank 3.5, when integrated with OpenSearch, can significantly enhance existing project management workflows by improving the relevance and accuracy of search results across engineering tickets, issue tracking systems, and open-source repository issues. This enables teams to quickly surface the most pertinent information from their extensive knowledge bases and boosting productivity. The following figure demonstrates the performance improvements of Cohere Rerank 3.5 for project management evaluation.

Combining reranking with BM25 for enterprise search is supported by studies from other organizations. For instance Anthropic, an artificial intelligence startup founded in 2021 that focuses on developing safe and reliable AI systems, conducted a study that found using reranked contextual embedding and contextual BM25 reduced the top-20-chunk retrieval failure rate by 67%, from 5.7% to 1.9%. The combination of BM25’s strength in exact matching with the semantic understanding of reranking models addresses the limitations of each approach when used alone and delivers a more effective search experience for users.

As organizations strive to improve their search capabilities, many find that traditional keyword-based methods such BM25 have limitations in understanding context and user intent. This leads customers to explore hybrid search approaches that combine the strengths of keyword-based algorithms with the semantic understanding of modern AI models. OpenSearch Service 2.11 and later supports the creation of hybrid search pipelines using normalization processors directly within the OpenSearch Service domain. By transitioning to a hybrid search system, organizations can use the precision of BM25 while benefiting from the contextual awareness and relevance ranking capabilities of semantic search.

Cohere Rerank 3.5 acts as a final refinement layer, analyzing the semantic and contextual aspects of both the query and the initial search results. These models excel at understanding nuanced relationships between queries and potential results, considering factors like customer reviews, product images, or detailed descriptions to further refine the top results. This progression from keyword search to semantic understanding, and then applying advanced reranking, allows for a dramatic improvement in search relevance.

How to integrate Cohere Rerank 3.5 with OpenSearch Service

There are several options available to integrate and use Cohere Rerank 3.5 with OpenSearch Service. Teams can use OpenSearch Service ML connectors which facilitate access to models hosted on third-party ML platforms. Every connector is specified by a connector blueprint. The blueprint defines all the parameters that you need to provide when creating a connector.

In addition to the Bedrock Rerank API, teams can use the Amazon SageMaker connector blueprint for Cohere Rerank hosted on Amazon Sagemaker for flexible deployment and fine-tuning of Cohere models. This connector option works with other AWS services for comprehensive ML workflows and allows teams to use the tools built into Amazon SageMaker for model performance monitoring and management. There is also a Cohere native connector option available that provides direct integration with Cohere’s API, offering immediate access to the latest models and is suitable for users with fine-tuned models on Cohere.

See this general reranking pipeline guide for OpenSearch Service 2.12 and later or this tutorial to configure a search pipeline that uses Cohere Rerank 3.5 to improve a first-stage retrieval system that can run on the native OpenSearch Service vector engine.

Conclusion

Integrating Cohere Rerank 3.5 with OpenSearch Service is a powerful way to enhance your search functionality and deliver a more meaningful and relevant search experience for your users. We covered the added benefits a rerank model could bring to various businesses and how a reranker can enhance search. By tapping into the semantic understanding of Cohere’s models, you can surface the most pertinent results, improve user satisfaction, and drive better business outcomes.

About the Authors

Breanne Warner is an Enterprise Solutions Architect at Amazon Web Services supporting healthcare and life science (HCLS) customers. She is passionate about supporting customers to use generative AI on AWS and evangelizing model adoption for 1P and 3P models. Breanne is also on the Women@Amazon board as co-director of Allyship with the goal of fostering inclusive and diverse culture at Amazon. Breanne holds a Bachelor of Science in Computer Engineering from University of Illinois at Urbana Champaign (UIUC).

Karan Singh is a generative AI Specialist for 3P models at AWS where he works with top-tier 3P foundational model providers to define and execute join GTM motions that help customers train, deploy, and scale models to enable transformative business applications and use cases across industry verticals. Karan holds a Bachelor of Science in Electrical and Instrumentation Engineering from Manipal University, a Masters in Science in Electrical Engineering from Northwestern University, and is currently an MBA Candidate at the Haas School of Business at University of California, Berkeley.

Hugo Tse is a Solutions Architect at Amazon Web Services supporting independent software vendors. He strives to help customers use technology to solve challenges and create business opportunities, especially in the domains of generative AI and storage. Hugo holds a Bachelor of Arts in Economics from the University of Chicago and a Master of Science in Information Technology from Arizona State University.

Elliott Choi is a Staff Product Manager at Cohere working on the Search and Retrieval Team. Elliott holds a Bachelor of Engineering and a Bachelor of Arts from the University of Western Ontario.

Introducing the new Amazon Kinesis source connector for Apache Flink

2024-12-18 Lorenzo Nicora

Post Syndicated from Lorenzo Nicora original https://aws.amazon.com/blogs/big-data/introducing-the-new-amazon-kinesis-source-connector-for-apache-flink/

On November 11, 2024, the Apache Flink community released a new version of AWS services connectors, an AWS open source contribution. This new release, version 5.0.0, introduces a new source connector to read data from Amazon Kinesis Data Streams. In this post, we explain how the new features of this connector can improve performance and reliability of your Apache Flink application.

Apache Flink has both a source and sink connector, to read from and write to Kinesis Data Streams. In this post, we focus on the new source connector, because version 5.0.0 does not introduce new functionality for the sink.

Apache Flink is a framework and distributed stream processing engine designed to perform computation at in-memory speed and at any scale. Amazon Managed Service for Apache Flink offers a fully managed, serverless experience to run your Flink applications, implemented in Java, Python or SQL, and using all the APIs available in Flink: SQL, Table, DataStream, and ProcessFunction API.

Apache Flink connectors

Flink supports reading and writing data to external systems, through connectors, which are components that allow your application to interact with stream-storage message brokers, databases, or object stores. Kinesis Data Streams is a popular source and destination for streaming applications. Flink provides both source and sink connectors for Kinesis Data Streams.

The following diagram illustrates a sample architecture.

Role of connectors in a Flink applications

Before proceeding further, it’s important to clarify three terms often used interchangeably in data streaming and in the Apache Flink documentation:

Kinesis Data Streams refers to the Amazon service
Kinesis source and Kinesis consumer refer to the Apache Flink components, in particular the source connectors, that allows reading data from Kinesis Data Streams
In this post, we use the term stream referring to a single Kinesis data stream

Introducing the new Flink Kinesis source connector

The launch of the version 5.0.0 of AWS connectors introduces a new connector for reading events from Kinesis Data Streams. The new connector is called Kinesis Streams Source and supersedes the Kinesis Consumer as the source connector for Kinesis Data Streams.

The new connector introduces several new features and adheres to the new Flink Source interface, and is compatible with Flink 2.x, the first major version release by the Flink community. Flink 2.x introduces a number of breaking changes, including removing the SourceFunction interface used by legacy connectors. The legacy Kinesis Consumer will no longer work with Flink 2.x.

Setting up the connector is slightly different than with the legacy Kinesis connector. Let’s start with the DataStream API.

How to use the new connector with the DataStream API

To add the new connector to your application, you need to update the connector dependency. For the DataStream API, the dependency has changed its name to flink-connector-aws-kinesis-streams.

At the time of writing, the latest connector version is 5.0.0 and it supports the most recent stable Flink versions, 1.19 and 1.20. The connector is also compatible with Flink 2.0, but no connector has been officially released for Flink 2.x yet. Assuming you are using Flink 1.20, the new dependency is the following:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-aws-kinesis streams</artifactId>
    <version>5.0.0-1.20</version>
</dependency>

The connector uses the new Flink Source interface. This interface implements the new FLIP-27 standard, and replaces the legacy SourceFunction interface that has been deprecated. SourceFunction will be completely removed in Flink 2.x.

In your application, you can now use a fluent and expressive builder interface to instantiate and configure the source. The minimal setup only requires the stream Amazon Resource Name (ARN) and the deserialization schema:

KinesisStreamsSource<String> kdsSource = KinesisStreamsSource.<String>builder()
    .setStreamArn("arn:aws:kinesis:us-east-1:123456789012:stream/test-stream")
    .setDeserializationSchema(new SimpleStringSchema())
    .build();

The new source class is called KinesisStreamSource. Not to be confused with the legacy source, FlinkKinesisConsumer.

You can then add the source to the execution environment using the new fromSource() method. This method requires explicitly specifying the watermark strategy, along with a name for the source:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// ...
DataStream<String> kinesisRecordsWithEventTimeWatermarks = env.fromSource(
    kdsSource,
    WatermarkStrategy.<String>forMonotonousTimestamps()
        .withIdleness(Duration.ofSeconds(1)),
    "Kinesis source");

These few lines of code introduce some of the main changes in the interface of the connector, which we discuss in the following sections.

Stream ARN

You can now define the Kinesis data stream ARN, as opposed to the stream name. This makes it simpler to consume from streams cross-Region and cross-account.

When running in Amazon Managed Service for Apache Flink, you only need to add to the application AWS Identity and Access Management (IAM) role permissions to access the stream. The ARN allows pointing to a stream located in a different AWS Region or account, without assuming roles or passing any external credentials.

Explicit watermark

One of the most important characteristics of the new Source interface is that you have to explicitly define a watermark strategy when you attach the source to the execution environment. If your application only implements processing-time semantics, you can specify WatermarkStrategy.noWatermarks().

This is an improvement in terms of code readability. Looking at the source, you know immediately which type of watermark you have, or if you don’t have any. Previously, many connectors were providing some type of default watermarks that the user could override. However, the default watermark of each connector was slightly different and confusing for the user.

With the new connector, you can achieve the same behavior as the legacy FlinkKinesisConsumer default watermarks, using WatermarkStrategy.forMonotonousTimestamps(), as shown in the previous example. This strategy generates watermarks based on the approximateArrivalTimestamp returned by Kinesis Data Streams. This timestamp corresponds to the time when the record was published to Kinesis Data Streams.

Idleness and watermark alignment

With the watermark strategy, you can additionally define an idleness, which allows the watermark to progress even when some shards of the stream are idle and receiving no records. Refer to Dealing With Idle Sources for more details about idleness and watermark generators.

A feature introduced by the new Source interface, and fully supported by the new Kinesis source, is watermark alignment. Watermark alignment works in the opposite direction of idleness. It slows down consuming from a shard that is progressing faster than others. This is particularly useful when replaying data from a stream, to reduce the volume of data buffered in the application state. Refer to Watermark alignment for more details.

Set up the connector with the Table API and SQL

Assuming you are using Flink 1.20, the dependency containing both Kinesis source and sink for the Table API and SQL is the following (both Flink 1.19 and 1.20 are supported, adjust the version accordingly):

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kinesis</artifactId>
    <version>5.0.0-1.20</version>
</dependency>

This dependency contains both the new source and the legacy source. Refer to Versioning in case you are planning to use both in the same application.

When defining the source in SQL or the Table API, you use the connector name kinesis, as it was with the legacy source. However, many parameters have changed with the new source:

CREATE TABLE KinesisTable (
    `user_id` BIGINT,
    `item_id` BIGINT,
    `category_id` BIGINT,
    `behavior` STRING,
    `ts` TIMESTAMP(3)
)
PARTITIONED BY (user_id, item_id)
WITH (
    'connector' = 'kinesis',
    'stream.arn' = 'arn:aws:kinesis:us-east-1:012345678901:stream/my-stream-name',
    'aws.region' = 'us-east-1',
    'source.init.position' = 'LATEST',
    'format' = 'csv'
);

A couple of notable connector options changed from the legacy source are:

stream.arn specifies the stream ARN, as opposed to the stream name used in the legacy source.
init.initpos defines the starting position. This option works similarly to the legacy source, but the option name is different. It was previously scan.stream.initpos.

For the full list of connector options refer to Connector Options.

New features and improvements

In this section, we discuss the most important features introduced by the new connector. These features are available in the DataStream API, and also the Table API and SQL.

Ordering guarantees

The most important improvement introduced by the new connector is about ordering guarantees.

With Kinesis Data Streams, the order of the message is retained per partitionId. This is achieved by putting all records with the same partitionId in the same shard. However, when the stream scales, splitting or merging shards, records with the same partitionId end up in a new shard. Kinesis keeps track of the parent-child lineage when resharding happens.

Stream resharding

One known limitation of the legacy Kinesis source is that it was unable to follow the parent-child shard lineage. As a consequence, ordering could not be guaranteed when resharding happens. The problem was particularly relevant when the application replayed old messages from a stream that had been resharded because ordering would be lost. This also made watermark generation and event-time processing non-deterministic.

With the new connector, ordering is retained also when resharding happens. This is achieved following the parent-child shard lineage, and consuming all records from a parent shard before proceeding with the child shard.

How the connector follows shard lineage

A better default shard assigner

Each Kinesis data stream is comprised of many shards. Also, the Flink source operator runs in multiple parallel subtasks. The shard assigner is the component that decides how to assign the shards of the stream across the source subtasks. The shard assigner’s job is non-trivial, because shard split or merge operations (resharding) might happen when the stream scales up or down.

The new connector comes with a new default assigner, UniformShardAssigner. This assigner maintains uniform distribution of the stream partitionId across parallel subtasks, also when resharding happens. This is achieved by looking at the range of partition keys (HashKeyRange) of each shard.

This shard assigner was already available in the previous connector version, but for backward compatibility, it was not the default and you had to set it up explicitly. This is no longer the case with the new source. The old default shard assigner, the legacy FlinkKinesisConsumer, was evenly distributing shards (not partitionId) across subtasks. In this case, the actual data distribution might become uneven in the case of resharding, because of the combination of open and closed shards in the stream. Refer to Shard Assignment Strategy for more details.

Reduced JAR size

The size of the JAR file has been reduced by 99%, from about 60 MB down to 200 KB. This substantially reduces the size of the fat-JAR of your application using the connector. A smaller JAR can speed up many operations that require redeploying the application.

AWS SDK for Java 2.x

The new connector is based on the newer AWS SDK for Java 2.x, which adds several features and improves support for non-blocking I/O. This makes the connector future-proof because the AWS SDK v1 will reach end-of-support by end of 2025.

AWS SDK built-in retry strategy

The new connector relies on the AWS SDK built-in retry strategy, as opposed to a custom strategy implemented by the legacy connector. Relying on the AWS SDK improves the classification of some errors as retriable or non-retriable.

Removed dependency on the Kinesis Client Library and Kinesis Producer Library

The new connector package no longer includes the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), contributing to the substantial reduction of the JAR size that we have mentioned.

An implication of this change is that the new connector no longer supports de-aggregation out of the box. Unless you are publishing records to the stream using the KPL and you enabled aggregation, this will not make any difference for you. If your producers use KPL aggregation, you might consider implementing a custom DeserializationSchema to de-aggregate the records in the source.

Migrating from the legacy connector

Flink sources typically save the position in the checkpoint and savepoints, called snapshots in Amazon Managed Service for Apache Flink. When you stop and restart the application, or when you update the application to deploy a change, the default behavior is saving the source position in the snapshot just before stopping the application, and restoring the position when the application restarts. This allows Flink to provide exactly-once guarantees on the source.

However, due to the major changes introduced by the new KinesisSource, the saved state is no longer compatible with the legacy FlinkKinesisConsumer. This means that when you upgrade the source of an existing application, you can’t directly restore the source position from the snapshot.

For this reason, migrating your application to the new source requires some attention. The exact migration process depends on your use case. There are two general scenarios:

Your application uses the DataStream API and you are following Flink best practices defining a UID on each operator
Your application uses the Table API or SQL, or your application used the DataStream API and you are not defining a UID on each operator

Let’s cover each of these scenarios.

Your application uses the DataStream API and you are defining a UID on each operator

In this case, you might consider selectively resetting the state of the source operator, retaining any other application state. The general approach is as follows:

Update your application dependencies and code, replacing the FlinkKinesisConsumer with the new KinesisSource.
Change the UID of the source operator (use a different string). Leave all other operators’ UIDs This will selectively reset the state of the source while retaining the state of all other operators.
Configure the source starting position using AT_TIMESTAMP and set the timestamp to just before the moment you will deploy the change. See Configuring Starting Position to learn how to set the starting position. We recommend passing the timestamp as a runtime property to make this more flexible. The configured source starting position is used only when the application can’t restore the state from a savepoint (or snapshot). In this case, we are deliberately forcing this, changing the UID of the source operator.
Update the Amazon Managed Service for Apache Flink application, selecting the new JAR containing the modified application. Restart from the latest snapshot (default behavior) and select allowNonRestoredState = true. Without this flag, Flink would prevent restarting the application, not being able to restore the state of the old source that was saved in the snapshot. See Savepointing for more details about allowNonRestoredState.

This approach will cause the reprocessing of some records from the source, and internal state exactly-once consistency can be broken. Carefully evaluate the impact of reprocessing on your application, and the impact of duplicates on the downstream systems.

Your application uses the Table API or SQL, or your application used the DataStream API and you are not defining a UID on each operator

In this case, you can’t selectively reset the state of the source operator.

Why does this happen? When using the Table API or SQL, or the DataStream API without defining the operator’s UID explicitly, Flink automatically generates identifiers for all operators based on the structure of the job graph of your application. These identifiers are used to identify the state of each operator when saved in the snapshots, and to restore it to the correct operator when you restart the application.

Changes to the application might cause changes in the underlying data flow. This changes the auto-generated identifier. If you are using the DataStream API and you are specifying the UID, Flink uses your identifiers instead of the auto-generated identifies, and is able to map back the state to the operator, even when you make changes to the application. This is an intrinsic limitation of Flink, explained in Set UUIDs For All Operators. Enabling allowNonRestoredState does not solve this problem, because Flink is not able to map the state saved in the snapshot with the actual operators, after the changes.

In our migration scenario, the only option is resetting the state of your application. You can achieve this in Amazon Managed Service for Apache Flink by selecting Skip restore from snapshot (SKIP_RESTORE_FROM_SNAPSHOT) when you deploy the change that replaces the source connector.

After the application using the new source is up and running, you can switch back to the default behavior of when restarting the application, using the latest snapshots (RESTORE_FROM_LATEST_SNAPSHOT). This way, no data loss happens when the application is restarted.

Choosing the right connector package and version

The dependency version you need to pick is normally <connector-version>-<flink-version>. For example, the latest Kinesis connector version is 5.0.0. If you are using a Flink runtime version 1.20.x, your dependency for the DataStream API is 5.0.0-1.20.

For the most up-to-date connector versions, see Use Apache Flink connectors with Managed Service for Apache Flink.

Connector artifact

In previous versions of the connector (4.x and before), there were separate packages for the source and sink. This additional level of complexity has been removed with version 5.x.

For your Java application, or Python applications where you package JAR dependencies using Maven, as shown in the Amazon Managed Service for Apache Flink examples GitHub repository, the following dependency contains the new version of both source and sink connectors:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-aws-kinesis-streams</artifactId>
    <version>5.0.0-1.20</version>
</dependency>

Make sure you’re using the latest available version. At the time of writing, this is 5.0.0. You can verify the available artifact versions in Maven Central. Also, use the correct version depending on your Flink runtime version. The previous example is for Flink 1.20.0.

Connector artifacts for Python application

If you use Python, we recommend packaging JAR dependencies using Maven, as shown in the Amazon Managed Service for Apache Flink examples GitHub repository. However, if you’re passing directly a single JAR to your Amazon Managed Service for Apache Flink application, you need to use the artifact that includes all transitive dependencies. In the case of the new Kinesis source and sink, this is called flink-sql-connector-aws-kinesis-streams. This artifact includes only the new source. Refer to Amazon Kinesis Data Streams SQL Connector for the right package, in case you want to use both the new and the legacy source.

Conclusion

The new Flink Kinesis source connector introduces many new features that improve stability and performance, and prepares your application for Flink 2.x. Support for watermark idleness and alignment is a particularly important feature if your application uses event-time semantics. The ability to retain record ordering improves data consistency, in particular when stream resharding happens, and when you replay old data from a stream that has been reshared.

You should carefully plan the change if you’re migrating your application from the legacy Kinesis source connector, and make sure you follow Flink’s best practices like specifying a UID on all DataStream operators.

You can find a working example of Java DataStream API application using the new connector, in the Amazon Managed Service for Apache Flink samples GitHub repository.

To learn more about the new Flink Kinesis source connector, refer to Amazon Kinesis Data Streams Connector and Amazon Kinesis Data Streams SQL Connector.

About the Author

Lorenzo Nicora works as a Senior Streaming Solutions Architect at AWS, helping customers across EMEA. He has been building cloud-centered, data-intensive systems for over 25 years, working across industries both through consultancies and product companies. He has used open source technologies extensively and contributed to several projects, including Apache Flink.

[$] Emacs code completion can cause compromise

2024-12-18 daroc

Post Syndicated from daroc original https://lwn.net/Articles/1002046/

Emacs has had a
few bugs related to accidentally
permitting the execution of untrusted code. Unfortunately, it seems as though
another bug of that sort has appeared — and may be harder to patch,
because the problem comes from the way Emacs handles expansion of Lisp macros in
code being analyzed. The
vulnerability is only practically exploitable in a non-default configuration, so
not every Emacs user has something to worry about. The Emacs
developers are reportedly working on a fix, but have not yet shared details
about it. In the meantime, every Emacs version since at least
26.1 (released in May 2018) through the current development version is vulnerable.

Security updates for Wednesday

2024-12-18 jzb

Post Syndicated from jzb original https://lwn.net/Articles/1002703/

Security updates have been issued by AlmaLinux (libsndfile, php:7.4, python3.11, python3.12, and python36:3.6), Debian (dpdk), Mageia (curl and socat), Oracle (firefox and tuned), Red Hat (bluez, containernetworking-plugins, edk2, edk2:20220126gitbb1bba3d77, edk2:20240524, expat, gstreamer1-plugins-base, gstreamer1-plugins-base and gstreamer1-plugins-good, gstreamer1-plugins-good, kernel, libsndfile, libsndfile:1.0.31, mpg123, mpg123:1.32.9, pam, python3.11-urllib3, skopeo, tuned, unbound, and unbound:1.16.2), SUSE (cloudflared, curl, docker, firefox, gstreamer-plugins-good, kernel, libmozjs-115-0, libmozjs-128-0, libmozjs-78-0, libsoup, ovmf, python-urllib3_1, subversion, thunderbird, and traefik), and Ubuntu (editorconfig-core, libspring-java, linux, linux-aws, linux-aws-6.8, linux-gcp, linux-gcp-6.8, linux-gke,
linux-gkeop, linux-ibm, linux-nvidia, linux-nvidia-6.8,
linux-nvidia-lowlatency, linux-oem-6.8, linux-oracle, linux-oracle-6.8,
linux-raspi, linux, linux-gcp, linux-gcp-5.15, linux-gke, linux-gkeop, linux-ibm,
linux-ibm-5.15, linux-kvm, linux-lowlatency, linux-lowlatency-hwe-5.15,
linux-nvidia, linux-oracle, linux-oracle-5.15, linux-raspi, linux, linux-gcp, linux-gcp-5.4, linux-hwe-5.4, linux-kvm, linux-raspi, linux, linux-lowlatency, linux-oracle, linux-aws, linux-aws-5.15, linux-aws, linux-aws-5.4, linux-bluefield, linux-oracle, linux-oracle-5.4, and linux-oem-6.11).

What’s New in Rapid7 Products & Services: Q4 2024 in Review

2024-12-18 Margaret Wei

Post Syndicated from Margaret Wei original https://blog.rapid7.com/2024/12/18/whats-new-in-rapid7-products-services-q4-2024-in-review/

What’s New in Rapid7 Products & Services: Q4 2024 in Review

This quarter at Rapid7 we continued to make investments across our Command Platform to provide security professionals with a holistic, actionable view of their entire attack surface – from Exposure Management to Detection and Response. Below, we’ve highlighted key releases and updates from the quarter across our products and services, including the new Platform Home Navigation experience, extensibility enhancements to Exposure Command and Surface Command, expanded MXDR support, and 2024 threat landscape trends from Rapid7 Labs.

Accelerate security efficiency and results with Rapid7’s Command Platform

In October, we released our revamped, modernized Command Platform home navigation experience for all users, providing a more cohesive, efficient flow for our users and increased visibility between Rapid7 products and capabilities. Now, viewing security program metrics across your suite of Rapid7 products is easier than ever before—so you can spend less time navigating between products and more time making decisions with easily accessible data.

We’ll be building on this new experience in the coming year to bring iterative updates to the look, feel, and function of the Command Platform—stay tuned for more!

What’s New in Rapid7 Products & Services: Q4 2024 in Review — *New Command Platform Home Navigation*

Along with the navigation updates, we’ve made improvements to our user management experience. Now, teams are empowered to better safeguard data and systems with more tailored, role- and responsibility-based user access controls. This enables easier collaboration across your organization while ensuring the appropriate access level for each person.

Achieve complete attack surface visibility and proactively eliminate exposures from endpoint to cloud

Rapid7 co-launches Resource Control Policies with AWS, Adding Support in Exposure Command and InsightCloudSec

Leading up to Re:Invent, AWS announced a powerful new feature to help organizations enforce least privilege access at scale: Resource Control Policies (RCPs). RCPs are an org-level access control policy that can be used to centrally implement and enforce preventative controls across all AWS resources in your environment.

To support this launch, we expanded our existing cloud identity and entitlement management capabilities to include dedicated, out-of-the-box checks for consistent and secure application of RCPs. Today, both Exposure Command and InsightCloudSec include these checks, enabling organizations to apply RCPs consistently and securely. Learn more here.

Shifting Left to Stay Secure with Exposure Command

Developers are at the forefront of modern cloud environments, making “shift-left” strategies essential for effective security. By addressing risks during development rather than after deployment, teams can eliminate vulnerabilities before they become costly issues.

To support our customers in executing stronger shift-left strategies, Exposure Command now offers more robust Infrastructure-as-Code (IaC) scanning and deeper CI/CD integration with Terraform and CloudFormation support across hundreds of resource types. For development teams, integrations like GitLab, GitHub Actions, AWS CloudFormation, and Azure DevOps bring security checks directly into their workflows, helping to secure code without disrupting productivity.

Streamline Vulnerability Management Across Your Entire Application Inventory with Vulnerability Groupings

Triaging scan results can be one of the most arduous and time-consuming parts of vulnerability management, but it’s also one of the most critical. Teams need to quickly synthesize results to validate exposures, prioritize response, and determine next steps for safeguarding their attack surface.

With the recent addition of Application Vulnerability Grouping, InsightAppSec customers can now visualize attacks and assess single applications or their entire application inventory at once, allowing teams to:

Visualize exposures with pre-triaged vulnerabilities by app and attack type
Identify and focus on threats in key functional areas to simplify vulnerability remediation
Manage application-layer risks at scale by updating the status or severity and adding comments to entire groups of vulnerabilities at once

Explore Exposure Management Use Cases via Guided Product Tours

We’re excited to introduce a new way for you to engage hands-on with core use cases across the Command Platform with our new guided product tours. These tours provide a first-hand, in-depth look at new products and features.

Today, you’ll see tours showcasing how Surface Command can help you map your entire attack surface and identify coverage gaps across your security ecosystem. You’ll also learn how you can prioritize remediation efforts and mobilize teams across your organization with Remediation Hub. Check out the available tours here, and we’ll continue to add more covering use cases across the Command Platform in the future.

Gain Insights from Products Across Your Environment Faster with Self-Service Surface Command Connector

Surface Command customers can now install connectors at their own convenience via the Rapid7 Extensions Library, making it faster and easier to gain visibility into cyber asset insights across your security and IT management tools. Customers can choose from over 100 out-of-the-box connectors to ingest and enrich asset data within Surface Command, consolidating insights from across your entire security ecosystem into one place.

Pinpoint critical signals and act confidently against threats with cloud-ready detection and response

A Growing Ecosystem of Cloud Event Sources in InsightIDR and MDR

At Rapid7, we understand that organizations are tasked with collecting and correlating vast amounts of data across their unique ecosystems. To tackle this, teams need faster, more dynamic mechanisms to ingest cloud data directly into their SIEM tool. We addressed this earlier this year with cloud event sources, providing a native cloud collection framework that can receive log data from cloud platforms directly – without requiring installation of collector software in their cloud and on-premise environments.

This quarter, we further expanded our list of cloud event sources by adding support for Microsoft products, including: Defender for Endpoint, Defender for Cloud, Defender for Identity, Defender for Cloud Apps, Defender O365, Defender for Vulnerability Management, and Entra ID.

MXDR: Expanded Support for Microsoft & AWS

In our Q3 “What’s New” blog, we announced the launch of Rapid7 MDR for the Extended Ecosystem (MXDR), which expands our MDR service to triage, investigate, and respond to alerts from third-party tools within customer organizations. Now, we’re excited to announce that we have updated our MXDR to support an expanded subset of detections across AWS GuardDuty and Microsoft security tools, bringing more protection to customer environments across a broader group of security tools.

Furthering our commitment to keep organizations safe and ahead of adversaries in today’s complex threat landscape, this update includes:

Deepened existing support for Microsoft security tools like Defender for Endpoint, Defender for Cloud, and AWS GuardDuty
Expanded support (via aforementioned cloud event sources) to critical alerts across Defender for Identity, Microsoft O365, Defender for Vulnerability Management, and Microsoft Entra

Expanded Coverage for Next-Gen Antivirus: MacOS and Linux

We’ve extended operating system coverage for Next-Gen AV (NGAV) support beyond Windows OS to now include protection capabilities for MacOS and Linux. Now, customers utilizing NGAV don’t have to utilize multiple point systems across the operating systems within their detection surface to stop breaches as early as possible in the kill chain.

The latest research and intelligence from Rapid7 Labs

2024 Threat Landscape Statistics

This year, Rapid7’s global Managed Services team and Rapid7 Labs researchers responded to hundreds of major incidents, significant vulnerabilities, and ransomware threats—delivering emergent threat guidance, research reports, and other vulnerability and threat content for customers. See the roundup of key statistics and trends from our Rapid7 Labs team in our recent blog post, here.

Emergent Threat Response: Real-time Guidance for Critical Threats

Rapid7’s Emergent Threat Response (ETR) program from Rapid7 Labs delivers fast, expert analysis and first-rate security content for the highest-priority security threats to help both Rapid7 customers and the greater security community understand their exposure and act quickly to defend their networks against rising threats.

In Q4, Rapid7’s ETR team provided expert analysis, InsightIDR and InsightVM content, and mitigation guidance for multiple critical, actively exploited vulnerabilities and widespread attacks, including:

Follow along here to receive the latest emergent threat guidance from our team.

Stay tuned for more!

As always, we’re continuing to work on exciting product enhancements and releases throughout the year. Keep an eye on our blog and release notes as we continue to highlight the latest in product and service investments at Rapid7.

Murder and Saxophones: Adolphe Sax

2024-12-18 The History Guy: History Deserves to Be Remembered

Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=BmJpwT_jcrU

Solution overview

Prerequisites

Publish unstructured S3 data to Amazon DataZone

Set up the SageMaker notebook and SageMaker instance IAM role

Set up the consumer Amazon DataZone project, custom AWS service environment, and subscription target

Create a function to respond to subscription events

Handle only relevant events

Respond to subscription events in EventBridge

Subscribe to the unstructured data asset

Access the subscribed asset from the Amazon DataZone portal

Multi-account implementation

Clean up

Conclusion

About the Authors

Assessing holistic user risk

Who in the organization is being targeted?

Who are the attackers impersonating?

What risky behaviors are my users performing?

Increasing SOC productivity

AI-driven email security

Automated isolation

Automated blocking

What’s on the horizon for 2025

Configurable browser isolation for email

Outbound DLP for email

Expanded user risk scoring

Try Cloudflare Email Security today

Amazon OpenSearch Service

Overview of traditional lexical search and semantic search using bi-encoders and cross-encoders

Overview of Cohere Rerank 3.5

Cohere Rerank 3.5 with OpenSearch Service search

How to integrate Cohere Rerank 3.5 with OpenSearch Service

Conclusion

About the Authors

Apache Flink connectors

Introducing the new Flink Kinesis source connector

How to use the new connector with the DataStream API

Stream ARN

Explicit watermark

Idleness and watermark alignment

Set up the connector with the Table API and SQL

New features and improvements

Ordering guarantees

A better default shard assigner

Reduced JAR size

AWS SDK for Java 2.x

AWS SDK built-in retry strategy

Removed dependency on the Kinesis Client Library and Kinesis Producer Library

Migrating from the legacy connector

Your application uses the DataStream API and you are defining a UID on each operator

Your application uses the Table API or SQL, or your application used the DataStream API and you are not defining a UID on each operator

Choosing the right connector package and version

Connector artifact

Connector artifacts for Python application

Conclusion

About the Author

Accelerate security efficiency and results with Rapid7’s Command Platform

Achieve complete attack surface visibility and proactively eliminate exposures from endpoint to cloud

Rapid7 co-launches Resource Control Policies with AWS, Adding Support in Exposure Command and InsightCloudSec

Shifting Left to Stay Secure with Exposure Command

Streamline Vulnerability Management Across Your Entire Application Inventory with Vulnerability Groupings

Explore Exposure Management Use Cases via Guided Product Tours

Gain Insights from Products Across Your Environment Faster with Self-Service Surface Command Connector

Pinpoint critical signals and act confidently against threats with cloud-ready detection and response

A Growing Ecosystem of Cloud Event Sources in InsightIDR and MDR

MXDR: Expanded Support for Microsoft & AWS

Expanded Coverage for Next-Gen Antivirus: MacOS and Linux

The latest research and intelligence from Rapid7 Labs

2024 Threat Landscape Statistics

Emergent Threat Response: Real-time Guidance for Critical Threats

Stay tuned for more!

The collective thoughts of the interwebz