Tag Archives: Technical How-to

AWS End User Messaging SMS & Voice v2 API: A Migration Guide from v1

Post Syndicated from Brett Ezell original https://aws.amazon.com/blogs/messaging-and-targeting/aws-end-user-messaging-sms-and-voice-v2-api-a-migration-guide-from-v1/

This blog covers the steps on how to upgrade to the latest APIs offered by AWS for SMS messaging.

IMPORTANT: Understanding this Migration in Context of Amazon Pinpoint’s End of Support

Amazon Pinpoint announced its end of support (EOS) on October 2026. However, if you are an existing customer using Amazon Pinpoint for SMS, your current SMS operations are not impacted by the EOS, and you are not required to migrate to the v2 API at this time.

This blog post details an optional upgrade path from the Amazon Pinpoint v1 SendMessages API to the AWS End User Messaging SMS and Voice v2 SendTextMessage API. Transitioning allows you to leverage enhanced capabilities and new features available exclusively in the v2 API.

Regarding your existing resources: Your SMS phone numbers and sender ID resources are already stored in AWS End User Messaging. There is no need to register new numbers (originators) or configure new Sender IDs if you choose to migrate your API usage as described here.

AWS End User Messaging provides developers with a scalable and cost-effective messaging infrastructure without compromising the safety, security, or results of their communications. Developers can integrate messaging over channels such as SMS, MMS, voice text-to-speech, WhatsApp and mobile push to support use cases such as one-time passwords (OTP) at sign-ups, account updates, appointment reminders, delivery notifications, promotions and more. AWS End User Messaging was renamed in 2024 as the new name for Amazon Pinpoint’s communication capabilities. For the SMS protocol, the service includes two sets of APIs, including SMS and Voice, version 2 API (v2 API), which provides enhanced capabilities and flexibility for customer communications and the original Amazon Pinpoint v1 API.

The End User Messaging SMS service provides core building blocks and is the core service in charge of SMS delivery for other AWS services, including Amazon Simple Notification Service (SNS), Amazon Cognito, and Amazon Connect. Transitioning to the v2 API allows customers using the older Amazon Pinpoint v1 API to access enhanced capabilities available now—like improved control via Configuration Sets, Phone Pools, Protect Configurations, and Registrations—as well as ensuring access to future features developed exclusively in the V2 APIs.

In this post, we will focus on why transitioning is beneficial and how developers can transition from Amazon Pinpoint’s v1 SendMessages API to the AWS End User Messaging SMS and Voice, version 2 (SendTextMessage) API. This move offers more functionality for creating custom messaging solutions and integrating seamlessly with third-party applications. Consolidating management and messaging into AWS End User Messaging provides customers with greater control and access to the features and capabilities described below.

What Are the Changes and Benefits of Migrating?

Migrating from the Pinpoint v1 SendMessages API to the AWS End User Messaging v2 SendTextMessage API offers several key advantages in plain language:

  • Simpler Code: The new SendTextMessage API has a flatter, more straightforward structure compared to the nested configuration required by the old SendMessages API. This makes your code easier to write, read, and maintain.
  • More Control Over Delivery: AWS End User Messaging introduces new tools that give you fine-grained control:
    • Configuration Sets: Apply specific rules for logging via Event Destinations, routing (using specific number pools), and opt-out management per message.
    • Phone Pools: Group your sending phone numbers or sender IDs to manage sender reputation, improve reliability with failover, and handle different use cases more effectively.
    • Protect Configurations: Set account-level safety rules, like blocking messages to certain countries, or filtering messages suspected to be Artificially Inflated Traffic (AIT), enhancing security and compliance.
    • Message Feedback: Enable detailed feedback mechanisms and track message conversion data like one time passcode conversions.
  • Better Monitoring & Troubleshooting: With Configuration Sets and Event Destinations, you retain detailed delivery logs, Delivery Receipts (DLRs), and event tracking, similar to Pinpoint’s event streaming. This allows comprehensive monitoring via Amazon EventBridge, Amazon CloudWatch, Amazon Data Firehose, or Amazon SNS.
  • Expanded Capabilities: The AWS End User Messaging v2 API exclusively supports Media Messaging Service (MMS) and includes integrated tools for managing control plane activities, such as sender registrations required in many countries.
  • Future-Proofing: AWS is actively developing new features exclusively for the AWS End User Messaging v2 API. Migrating ensures you can leverage the latest advancements in messaging technology.

Understanding Key AWS End User Messaging Components

Migration involves understanding and utilizing several core AWS End User Messaging components:

Phone Pools

  • What they are: A Phone Pool is a collection of your origination identities (phone numbers, Sender IDs) that share common settings. When you send a message using a pool, AWS End User Messaging can automatically select an appropriate identity from that pool.
  • Benefits:
    • Improved Reliability: Pools provide automatic failover; if one number in the pool fails, another can be used.
    • Reputation Management: You can create separate pools for different use-cases, or types of messages (e.g., transactional vs. promotional) to isolate sender reputations.
    • Simplified Configuration: Apply settings like opt-out lists or two-way SMS configurations to the entire pool.
  • When to Use: Pools are highly recommended for managing multiple origination identities, ensuring high availability, or separating traffic. However, they are optional. If you only use a single phone number or Sender ID for a specific use case, you can continue to specify that single identity directly in your SendTextMessage API calls.

Protect Configurations

  • What they are: Protect Configurations allow you to define account-wide or configuration-set-level rules that safeguard against unwanted traffic, including Artificially Inflated Traffic (AIT) and messages to specific destinations. They operate using rules applied per country, which can be set to different modes: Block (prevent sending to specified countries), Monitor (analyze traffic for AIT risk without blocking, providing visibility), or Filter (automatically block messages suspected of being AIT based on risk analysis, in addition to blocking specified countries).
  • Benefits:
    • Enhanced Security: Prevent accidental or malicious sending to unintended destinations by explicitly blocking specific countries.
    • Proactive AIT Defense (Filter): Automatically block suspected fraudulent SMS pumping traffic based on risk analysis before messages are sent, reducing exposure to financial costs and potential reputation damage associated with AIT.
    • Risk Visibility (Monitor): Gain insights into potential AIT patterns targeting your account and understand the likely impact of Filter rules without disrupting legitimate message delivery. Recommendations are provided via event destinations.
    • Compliance: Helps enforce geographic sending restrictions required by regulations or internal policies.
    • Cost Control: Avoid unexpected charges from sending to blocked/filtered destinations.
    • Granular Application: Apply different rules (Block/Monitor/Filter) per country and use multiple Protect Configurations tailored to different workflows (e.g., stricter filtering for public-facing forms vs. internal notifications).
  • Note: Protect Configurations, including the Monitor and Filter modes for AIT, are features unique to the AWS End User Messaging v2 API and are not available in the Pinpoint v1 API. Check out Defending Against SMS Pumping: New AWS Features to Help Combat Artificially Inflated Traffic to learn more.

Configuration Sets

  • What they are: Configuration Sets are collections of rules applied to messages when you send them. You specify which Configuration Set to use in your SendTextMessage API call.
  • Benefits:
    • Essential for Monitoring: Configuration Sets are required to receive message event data (including Delivery Receipts – DLRs). They replace the automatic event streaming provided by Pinpoint v1. You must associate an Event Destination (EventBridge, CloudWatch Logs, Amazon Data Firehose, or SNS) with your Configuration Set to capture delivery status, failures, and other events.
    • Granular Control: Apply specific settings per message, such as routing messages through specific Phone Pools, using different default Sender IDs, or associating different opt-out lists.
    • Message Feedback: Enable detailed feedback mechanisms.
  • Key Takeaway: Migrating users must create and use Configuration Sets with associated Event Destinations if they need to monitor message delivery status, replicating the functionality previously provided by Pinpoint event streams.

Registrations

  • What they are: In many countries and regions, regulations require businesses to register their Sender IDs or phone numbers (like US 10DLC numbers) before sending A2P (Application-to-Person) messages. Some regions (like India or China) may also require pre-registering message templates.
  • Benefits:
    • Compliance: Ensures adherence to local telecommunication laws and carrier requirements.
    • Improved Deliverability: Registered traffic is often treated with higher priority and is less likely to be filtered as spam by carriers.
  • When Needed: Registration requirements — or lack thereof — vary significantly by country (e.g., US 10DLC, UK Sender IDs, India Sender IDs) and the type of number/Sender ID used. While some origination identities in certain regions may not require pre-registration, the landscape is complex and evolving. Always check the AWS documentation for the specific, current requirements of the countries you send messages to. AWS End User Messaging provides tools within the console and API to help manage any necessary registrations.

Additional Key Differences & Benefits

  • MMS Support: The AWS End User Messaging v2 API provides clear, first-class support for sending Multimedia Messaging Service (MMS) messages via the SendMediaMessage API call, offering richer media possibilities than typically available via the older Pinpoint API.
  • API Simplicity: The SendTextMessage API call is significantly less nested and complex than the SendMessages API structure, making integration and maintenance easier.
    • Example Comparison:
    • # Pinpoint v1 SendMessages (Nested Structure)
      response = pinpoint_client.send_messages(
          ApplicationId='YOUR_PINPOINT_PROJECT_ID',
          MessageRequest={
              'Addresses': {
                  '+12065550100': { # Destination Address
                      'ChannelType': 'SMS'
                  }
              },
              'MessageConfiguration': {
                  'SMSMessage': {
                      'Body': 'Your message content',
                      'MessageType': 'TRANSACTIONAL',
                      'OriginationNumber': '+12065550199' # Origination Number
                  }
              }
          }
      )
      
      # AWS End User Messaging v2 SendTextMessage (Flatter Structure)
      response = end_user_messaging_client.send_text_message(
          DestinationPhoneNumber='+12065550100',
          OriginationIdentity='+12065550199', # Can be Number, SenderID, Pool ARN/ID
          MessageBody='Your message content',
          MessageType='TRANSACTIONAL'
          # Optional: ConfigurationSetName='MyConfigSet'
      )

      This side-by-side comparison clearly shows the reduced nesting and more direct parameter usage in the AWS End User Messaging SendTextMessage API call.

  • Broader Regional Availability: The AWS End User Messaging v2 API offers significantly broader regional availability compared to the older Pinpoint v1 API (currently supported in over 30 regions versus 13). AWS also plans to expand this support to an additional 4 regions soon (reaching 34 total), further improving global reach and potentially reducing latency.

How to Use AWS End User Messaging Features Functionally (API Examples)

Here’s how you use the SendTextMessage API (using AWS SDK for Python Boto3 as an example) to leverage these features:

import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Initialize the AWS End User Messaging v2 client
# Note the client name change from 'pinpoint'
end_user_messaging_client = boto3.client('pinpoint-sms-voice-v2')

# --- Basic SMS Send ---
try:
    response = end_user_messaging_client.send_text_message(
        DestinationPhoneNumber='+12065550100',
        OriginationIdentity='+12065550199', # Your AWS End User Messaging registered Phone Number, Sender ID, or Pool ARN/ID
        MessageBody='Hello from AWS End User Messaging!',
        MessageType='TRANSACTIONAL' # Or 'PROMOTIONAL'
    )
    logger.info(f"Basic Send - Message ID: {response.get('MessageId')}")
except Exception as e:
    logger.error(f"Basic Send Failed: {e}")

# --- Send with a Configuration Set (for DLRs/Events) ---
try:
    response = end_user_messaging_client.send_text_message(
        DestinationPhoneNumber='+12065550101',
        OriginationIdentity='+12065550199',
        MessageBody='This message uses a configuration set.',
        MessageType='TRANSACTIONAL',
        ConfigurationSetName='MyConfigSet' # Specify your Configuration Set
    )
    logger.info(f"Config Set Send - Message ID: {response.get('MessageId')}")
except Exception as e:
    logger.error(f"Config Set Send Failed: {e}")

# --- Send using a Phone Pool ---
try:
    response = end_user_messaging_client.send_text_message(
        DestinationPhoneNumber='+12065550102',
        OriginationIdentity='arn:aws:sms-voice:us-east-1:111122223333:pool/MyPhonePool', # Use Pool ARN or ID
        MessageBody='This message is sent via a phone pool.',
        MessageType='TRANSACTIONAL',
        ConfigurationSetName='MyConfigSet' # Still need a Config Set for events
    )
    logger.info(f"Pool Send - Message ID: {response.get('MessageId')}")
except Exception as e:
    logger.error(f"Pool Send Failed: {e}")

# --- Send with a Protect Configuration (via Config Set or directly) ---
# Option 1: Protect Config applied via Configuration Set
# (Configure association in the AWS End User Messaging console or via AssociateProtectConfiguration API)
# No change needed in SendTextMessage call beyond specifying the ConfigurationSetName

# Option 2: Protect Config applied directly per message
try:
    response = end_user_messaging_client.send_text_message(
        DestinationPhoneNumber='+12065550103',
        OriginationIdentity='+12065550199',
        MessageBody='This message has direct protect config applied.',
        MessageType='TRANSACTIONAL',
        ProtectConfigurationId='MyProtectConfig' # Specify Protect Configuration ID or ARN
        # ConfigurationSetName='MyConfigSet' # Can also be used if needed for other rules like events
    )
    logger.info(f"Protect Config Send - Message ID: {response.get('MessageId')}")
except Exception as e:
    logger.error(f"Protect Config Send Failed: {e}")

# --- Standard Exception Handling ---
try:
    response = end_user_messaging_client.send_text_message(
        DestinationPhoneNumber='+12065550104',
        OriginationIdentity='+12065550199',
        MessageBody='Testing exception handling.',
        MessageType='TRANSACTIONAL',
        ConfigurationSetName='MyConfigSet'
    )
    logger.info(f"Exception Test Send - Message ID: {response.get('MessageId')}")
except end_user_messaging_client.exceptions.ThrottlingException:
    logger.error("Rate limit exceeded. Implementing backoff.")
except end_user_messaging_client.exceptions.ValidationException as e:
    logger.error(f"Validation error: {str(e)}")
except Exception as e:
     logger.error(f"Generic error sending message: {e}")

How to Migrate SDKs

Migrating your application code involves updating your AWS SDK initialization:

  1. Identify SDK Usage: Locate where your code uses the AWS SDK to interact with the Pinpoint v1 API, specifically for sending SMS (likely using client.send_messages(...)).
  2. Update Client Initialization: Change the service client you initialize.
    • Python (Boto3): Change boto3.client('pinpoint') to boto3.client('pinpoint-sms-voice-v2').
    • Other SDKs: Consult the specific AWS SDK documentation for the equivalent change. The service identifier will typically change from pinpoint or similar to pinpoint-sms-voice-v2 or equivalent.
  3. Update API Calls: Replace calls to the Pinpoint v1 SendMessages operation with calls to the AWS End User Messaging v2 SendTextMessage operation, adjusting the parameter structure as shown in the examples above.
  4. Dependencies: Ensure your application deployment includes the necessary updated SDK libraries.

Data Migration Considerations

Endpoint Contextual Data: Pinpoint allows associating rich attributes and user data with endpoints, often used in campaigns. If your application logic for sending direct SMS via the Pinpoint SendMessages API previously relied on accessing these Pinpoint endpoint attributes at the time of sending (e.g., for personalization), note that the AWS End User Messaging SendTextMessage API operates independently of the Pinpoint endpoint data model. Your application will now need to fetch any required contextual data (user attributes, preferences, etc.) itself before calling SendTextMessage if that information is needed for message construction or logic.

Opt-Out Lists (Optional):

IMPORTANT: End User Messaging has automatically been adding opt-outs to a “default optout list” for all SMS you have sent out unless you have been self managing opt-outs. Check to see if you have numbers in the “default” list in the console or by using DescribeOptedOutNumbers and using “default” as the OptOutListName.

If you are managing opt-outs at the endpoint level, managing them within your application, or only have opt-outs in the “default” list, it’s recommended that you export those numbers into a new End User Messaging Opt-Out List.

  1. Extract from Pinpoint: Identify and extract the phone numbers opted out of SMS within your Pinpoint project. This typically involves exporting endpoint data (using the Pinpoint API, like GetEndPoints, or GetSegmentExportJobs) and filtering for endpoints associated with the SMS channel that have an OptOut status set (e.g., ALL). You may need custom processing to isolate the correct phone numbers.
  2. Create AWS End User Messaging List: Create a new opt-out list in AWS End User Messaging using the CreateOptOutList API operation (e.g., name it MigratedPinpointOptOuts).
  3. Import Numbers: Add the phone numbers extracted from Pinpoint into the newly created AWS End User Messaging opt-out list using the PutOptedOutNumber API operation for each number.
  4. Associate List: Associate your new AWS End User Messaging Opt-Out List with the relevant Phone Pool(s) or individual origination phone number(s)/Sender ID(s) using the appropriate AWS End User Messaging API calls (e.g., SetDefaultPoolOptOutList, SetPhoneNumberOptedOut). This ensures AWS End User Messaging automatically blocks sends to these numbers when using those specific origination identities. Note: Opt-out lists are associated with origination identities, not directly with Configuration Sets.

Here’s an accurate AWS SDK for Python (Boto3) example that demonstrates this process:

import boto3

# Initialize clients
pinpoint = boto3.client('pinpoint')
eum = boto3.client('pinpoint-sms-voice-v2')

# Step 1: Extract opted-out numbers from Pinpoint
def get_opted_out_numbers(project_id):
    opted_out_numbers = set()
    paginator = pinpoint.get_paginator('get_endpoints')
    
    for page in paginator.paginate(ApplicationId=project_id):
        for item in page['ItemResponse'].values():
            if item['ChannelType'] == 'SMS' and item.get('OptOut') == 'ALL':
                opted_out_numbers.add(item['Address'])
    
    return opted_out_numbers

# Example usage
project_id = 'YOUR_PINPOINT_PROJECT_ID'
opted_out_numbers = get_opted_out_numbers(project_id)

# Step 2: Create AWS End User Messaging Opt-Out List
opt_out_list_name = 'MigratedPinpointOptOuts'
eum.create_opt_out_list(OptOutListName=opt_out_list_name)

# Step 3: Import Numbers to the new Opt-Out List
for number in opted_out_numbers:
    eum.put_opted_out_number(
        OptOutListName=opt_out_list_name,
        OptedOutNumber=number
    )

# Step 4: Associate the Opt-Out List with a Phone Pool
phone_pool_id = 'YOUR_PHONE_POOL_ID'
eum.set_default_pool_opt_out_list(
    PhonePoolId=phone_pool_id,
    OptOutListName=opt_out_list_name
)

# If you're using individual numbers not in a pool:
# phone_number = 'YOUR_PHONE_NUMBER'
# eum.set_phone_number_opted_out(
#     PhoneNumber=phone_number,
#     OptOutListName=opt_out_list_name
# )

print(f"Migration complete. {len(opted_out_numbers)} numbers added to the opt-out list.")

Conclusion

Migrating from the legacy Amazon Pinpoint v1 SendMessages API to the modern AWS End User Messaging v2 SendTextMessage API aligns strategically with End User Messaging, the focus of AWS’s ongoing messaging development and advancements. As we’ve explored, this transition is driven by tangible benefits and enhanced capabilities designed for contemporary communication needs.

This guide detailed the key advantages, including a significantly simplified API structure that streamlines development and maintenance. We also covered the core components that provide enhanced control and functionality: Configuration Sets, which are crucial for enabling detailed monitoring and delivery reporting (DLRs); Phone Pools for flexible and reliable sender identity management; Protect Configurations for improved security and compliance; and Registrations for adhering to regional requirements and maximizing deliverability.

This post also outlined the practical steps involved in this migration, from updating your SDK client initialization and adapting your code to use the SendTextMessage API with its new features, to the critical process of migrating existing opt-out lists to ensure continuity and compliance.

By understanding these components, leveraging the new features, and executing the necessary technical and data migration steps, you can successfully transition your SMS operations. This move not only modernizes your infrastructure but also positions you to take full advantage of future advancements on the AWS End User Messaging platform, enabling you to build more robust, scalable, and sophisticated messaging solutions.

How Smartsheet boosts developer productivity with Amazon Bedrock and Roo Code

Post Syndicated from Rony Blum original https://aws.amazon.com/blogs/architecture/how-smartsheet-boosts-developer-productivity-with-amazon-bedrock-and-roo-code/

This post was co-written with JB Brown from Smartsheet.

Organizations are often seeking ways to enhance developer productivity while maintaining cost-efficiency. AI-powered coding assistants are effective solutions to accelerate development cycles, but implementing these solutions at scale while managing costs presents unique challenges. Roo Code, an AI-powered autonomous coding agent located directly in developers’ editors, represents the latest evolution in AI-assisted development. It helps developers code faster and smarter by offering capabilities ranging from code generation and refactoring to documentation writing and code base analysis through natural language interaction. This post explores how Smartsheet successfully deployed Roo Code with Amazon Bedrock and Anthropic’s Claude, achieving significant improvements in developer efficiency while optimizing costs through innovative caching strategies.

The Amazon Bedrock prompt caching capability, which became generally available in April 2025, represents a significant advancement in optimizing AI model interactions. With this feature, organizations can cache frequently used prompts across multiple API calls, reducing costs by up to 90% and lowing latency by up to 85%. The technology is particularly valuable for applications that repeatedly use similar context, such as coding assistants that need to maintain context about code files. Prompt caching supports multiple foundation models (FMs), including Anthropic’s Claude 3.5 Haiku and Anthropic’s Claude 3.7 Sonnet, as well as the Amazon Nova family of models. The cached context remains available for up to 5 minutes after each access, with each cache hit resetting this countdown. For organizations implementing AI-assisted development workflows, they can use this capability to balance performance with cost-efficiency.

Benefits of Amazon Bedrock prompt caching

Smartsheet is a leading cloud-based enterprise work management service, enabling millions of users worldwide to plan, manage, track, automate, and report on work at scale. Their engineering team identified an opportunity to enhance developer productivity by implementing AI-assisted coding capabilities across their organization. The solution needed to scale efficiently while maintaining reasonable costs, leading them to choose Roo Code with Amazon Bedrock with Anthropic’s Claude as their FM provider.

“Our engineering team has achieved a 60% reduction in operational costs and 20% improvement in response latency when using Roo Code with Amazon Bedrock prompt caching. These improvements directly translate to enhanced developer productivity across our organization,” says JB Brown, VP of Engineering at Smartsheet.

This fact is underscored by the team’s ability to rapidly generate comprehensive code documentation and architecture diagrams in a mere four hours, a task that previously consumed 2 weeks of manual effort. This newfound efficiency extends to critical areas like AWS cost management, where a vital analysis tool was built in just 30 minutes, and is projected to save Smartsheet $450,000 annually. Furthermore, the speed and accessibility of information have been significantly amplified, as evidenced by the 6–8 hours of senior engineer time saved by quickly explaining complex Terraform/Terragrunt infrastructure. These examples highlight how Amazon Bedrock prompt caching is not just about incremental gains, but about unlocking substantial time and resource savings, empowering our developers to focus on innovation rather than time-consuming manual processes.

Solution overview

When implementing AI-assisted coding at scale, Smartsheet faced several key challenges around managing costs associated with large language model (LLM) API calls, reducing latency for developer interactions and handling repetitive elements in prompts efficiently. The development team recognized that a significant portion of the prompts sent to Amazon Bedrock were repeated elements: the system prompt and the continuously building message history, presenting an opportunity for optimization.

To address these challenges, Smartsheet implemented a comprehensive solution centered around Roo Code integration with Amazon Bedrock and prompt caching. Smartsheet deployed Roo Code, powered by Amazon Bedrock and Anthropic’s Claude 3.7, across the organization and integrated seamlessly into existing development workflows. This provided developers with immediate access to AI assistance for code writing, reviewing, and debugging tasks.

The team then contributed a sophisticated Amazon Bedrock prompt caching system to Roo Code that identifies and stores common prompt elements, significantly reducing redundant data transmission to Bedrock. This caching layer proved crucial in optimizing both cost and performance metrics. The caching decision flow incorporates two key checks: first, it verifies if the selected model supports prompt caching, then it makes sure the system prompt meets a minimum token threshold to make caching worthwhile.

The solution integrates Roo Code directly into developers’ integrated development environments (IDE), using the Amazon Bedrock prompt caching capabilities. The architecture separates content into static elements (like system instructions and code context) and dynamic elements (like user queries). When a developer makes a request, static content is automatically marked with cache checkpoints and stored for 5 minutes, with each access refreshing this window.

For subsequent queries, the system checks for matching cached content before processing. When matches are found, it bypasses recomputation of these sections, significantly reducing both response time and costs.

The following diagram shows the integration points between Roo Code and Amazon Bedrock.

Technical workflow diagram showing data flow between user query, cache decision logic, and Amazon Bedrock with response handling

The following screenshot highlights the model provider settings and connection parameters required for implementation.

Amazon Bedrock configuration interface showing AWS authentication, region settings, and Claude model with caching enabled

Optimizing performance with prompt caching

When analyzing code files, much of the context—such as file contents and environment details—remains consistent across multiple queries. By caching this content, subsequent requests can reuse it from the cache, significantly reducing cost and latency.

In our example, we first asked Roo Code to analyze a Python file. We then sent multiple follow-up queries about different aspects of the code. The first query explained the overall structure, followed by questions about specific classes, unused modules, and potential improvements.

The usage section of the responses showed significant improvements through caching. During the first query, the model processed the input and wrote the cached content. For subsequent queries, 90% of the input was served from the cache. Because cache reads are 90% cheaper than processing new input tokens, this resulted in an 83% cost reduction for follow-up queries.

Here’s how the caching worked across queries:

  • First query – The model processed the entire file content and environment details, writing them to cache
  • Subsequent queries – The model loaded the cached content, only processing the new question text

To illustrate this, let’s look at the usage statistics from a series of queries:

Query 1: "Explain my @/src/models/product.py code"
{
    "inputTokens": 1051,
    "outputTokens": 902,
    "totalTokens": 19374,
    "cacheReadInputTokenCount": 0,
    "cacheWriteInputTokenCount": 12421
}

Query 2: "Describe what is ProductUpdate class used for"
{
    "inputTokens": 1078,
    "outputTokens": 694,
    "totalTokens": 21664,
    "cacheReadInputTokenCount": 19892,
    "cacheWriteInputTokenCount": 0
}

Query 3: "Which module is not in use?"
{
    "inputTokens": 4,
    "outputTokens": 774,
    "totalTokens": 22753,
    "cacheReadInputTokenCount": 19892,
    "cacheWriteInputTokenCount": 2083
}

Query 4: "Any improvement to implement to my code making it more efficient?"
{
    "inputTokens": 1097,
    "outputTokens": 1270,
    "totalTokens": 24342,
    "cacheReadInputTokenCount": 21975,
    "cacheWriteInputTokenCount": 0
} 

We can see from the results that the first query didn’t return cacheReadInputTokenCount but wrote to the cache the whole input. For subsequent queries, most of the input was read from the cache, except for the specific question.

The results clearly demonstrate the benefits of prompt caching:

  • 70% of total input was served from cache across queries
  • 90% of input for follow-up queries came from cache
  • Overall input costs reduced by 60%
  • Follow-up query costs reduced by 83%

This approach is particularly effective for development workflows where developers frequently query the same code base with different questions. The cached content provides necessary context for each query while significantly reducing processing overhead.

The following screenshot displays Roo Code’s interface, showing detailed metrics including token usage, cache hit rates, and associated API costs. The interface presents the cost savings achieved through caching and provides comprehensive usage analytics.

IDE showing Roo Code analysis with metrics, API costs, and detailed Python code explanation

Results and future innovation

The following line graph demonstrates the cost reduction trend over multiple queries runs, having the first query uncached, and the subsequent input using significant query responses from the cache. The y-axis shows costs in dollars, and the x-axis shows the cost per query. The graph shows a clear downward trend, with costs decreasing by 60% from the baseline.

Dual-axis graph displaying per-query and cumulative costs, demonstrating cost savings through caching over 4 queries

The implementation delivered remarkable improvements across key metrics. The prompt caching system achieved a 60% reduction in operational costs while simultaneously improving response latency by 20%.

The choice of Amazon Bedrock prompt caching proved particularly effective for Smartsheet’s coding assistance workflows. The 5-minute cache Time to Live (TTL) aligns perfectly with the natural rhythm of developer interactions with Roo Code, where multiple related queries typically occur within short timeframes. During active coding sessions, developers engage in rapid back-and-forth exchanges with the AI assistant—reviewing code, asking follow-up questions, and requesting modifications—while maintaining context through the cache. This agentic workflow, where each interaction builds upon previous context, takes full advantage of the prompt caching mechanism—each query refreshes the cache timer while preserving valuable context about the code being discussed.

Best practices and lessons learned

Through their implementation journey, Smartsheet developed several key best practices for organizations looking to implement similar solutions. Early implementation of prompt caching proved crucial, allowing the team to analyze prompt patterns and design efficient caching strategies from the start. The team found that focusing on developer experience through response time optimization and seamless tool integration was essential for successful adoption.

Continuous monitoring and analysis of usage patterns became a cornerstone of their optimization strategy. By tracking key metrics like response times and costs, the team could identify opportunities for further optimization and regularly adjust their caching strategies to maintain optimal performance.

Looking ahead

Smartsheet continues to innovate with Amazon Bedrock and Roo Code, exploring new ways to enhance developer productivity. Their engineering team is investigating advanced caching strategies and evaluating new FMs as they become available on Amazon Bedrock. The success of this implementation has established a strong foundation for future AI-assisted development initiatives.

Conclusion

Smartsheet’s implementation of Roo Code with Amazon Bedrock and prompt caching demonstrates how organizations can successfully deploy AI-assisted coding solutions while maintaining cost-efficiency. Their approach provides a blueprint for others looking to enhance developer productivity through AI while optimizing operational costs. The combination of strategic implementation, innovative caching solutions, and continuous optimization has enabled Smartsheet to achieve significant improvements in performance and cost metrics.

To learn more about implementing AI solutions at scale, refer to the Amazon Bedrock documentation and explore the AWS Machine Learning Blog. The frameworks and strategies outlined in this post can help organizations of varying size implement efficient, cost-effective AI-assisted development workflows.


About the Authors

Mastering Amazon Q Developer Part 1: Crafting Effective Prompts

Post Syndicated from Will Matos original https://aws.amazon.com/blogs/devops/mastering-amazon-q-developer-part-1-crafting-effective-prompts/

As organizations increasingly adopt AI-powered tools to enhance developer productivity, your ability to effectively communicate with these assistants becomes a valuable skill. This guide explores how you can craft prompts that deliver accurate, useful results when working with Amazon Q Developer.

Your success with Amazon Q Developer depends directly on how well you communicate with it. Through my work as a Principal Specialist Solutions Architect on the Next Generation Developer Experience team at AWS, I’ve observed that developers experience varying degrees of success based primarily on their approach to prompt construction. The difference between a vague request and a well-structured prompt can be the difference between wasted time and a productivity breakthrough.

Recent McKinsey research reveals that developers can complete tasks up to twice as fast with generative AI when using proper prompting techniques [1]. Even more impressive, developers tackling complex tasks are 25-30% more likely to complete them within given time-frames when using these tools effectively. These productivity gains aren’t automatic—they depend on mastering the art and science of prompt engineering.

Based on patterns observed across numerous customer interactions, this guide provides practical techniques to help you maximize the value of your AI-assisted development experience. You’ll learn how to transform your interactions to consistently produce helpful, relevant assistance that can dramatically improve your development workflow.

Key Takeaways

  • Structure your prompts with clear context, specific requirements, and desired output format
  • Include relevant technical details about your environment and constraints
  • Avoid vague requests and provide specific examples when possible
  • Use the provided prompt template to ensure consistent results

Getting Started with Amazon Q Developer

Already using Amazon Q Developer? Great! This guide will help you get more value from your interactions. If you haven’t set up Amazon Q Developer yet, check out the getting started guide.

Understanding the Impact of Good Prompts

The rapid adoption of AI technologies makes prompt engineering skills essential for today’s developers. McKinsey’s latest global survey reveals that 65% of organizations regularly use generative AI, nearly double from their previous survey. When developers master prompt engineering, they’re 25-30% more likely to complete complex tasks within given timeframes.

What Makes an Effective Prompt?

  • Specific Request: State exactly what you need
  • Clear Background: Describe your project, requirements, and constraints
  • Additional Context: Provide code, configuration, or other additional context
  • Expected Output: Specify how you want the information presented

Here’s how this works in practice:

Poor prompt:

How do I deploy a container on AWS?

Effective prompt:


I need to deploy a containerized Node.js e-commerce application that handles 
50,000 daily users with peak loads during promotional events.
Requirements:
- High availability across multiple regions
- MongoDB for persistence
- Auto-scaling capabilities

Please provide:
1. AWS architecture diagram
2. List of required services with configurations
3. Security best practices
4. Operational monitoring recommendations

Common Patterns to Avoid

Short or Vague Requests:

  • Add Docs
  • Make this better
  • Check this
'Add docs' simple prompt with generic response.

Not much to go on here. Amazon Q Developer will likely provide generic documentation.

'Check this' simple prompt with generic response.

Another vague prompt with a generic response.

Overly Broad Questions:

  • How do I use AWS?
  • What's the best practice?
  • Help with Lambda
Image showing the Amazon Q Developer IDE Chat panel where the user entered the vague prompt: 'Help with Lambda'. Amazon Q Developer responds by asking clarifying questions.

The prompt is so vague that Amazon Q Developer responds by asking clarifying questions.

Image showing the Amazon Q Developer chat pane where the user entered the prompt: "Create a Lambda function that processes S3 events."

The more specific prompt allows Amazon Q Developer to provide a more precise response.

Remember: The quality of information you receive directly correlates with the quality of the information you provide.

Proven Techniques for Better Results

To help you apply these principles consistently, I’ve developed a template structure that incorporates all the key elements of an effective prompt. This framework can be adapted for various scenarios and serves as a starting point for your interactions with Amazon Q Developer. While Amazon Q Developer will fill in some parts of this context (see the next post in this series), you just need to make sure this information is available.

These are the principles demonstrated in the template:

  • Technical Context Requirements
    1. Specify your technology stack and versions
    2. Include environment details
    3. Mention compliance requirements
    4. Define scale expectations
  • Example Specifications
    1. Include relevant code snippets
    2. Paste error messages
    3. Reference configuration files
    4. Show current architecture
  • Output Format Guidelines
    1. Request specific documentation formats
    2. Ask for diagrams when needed
    3. Specify code language preferences
    4. Indicate level of detail needed
Image showing the Amazon Q Developer chat panel with the user submitted prompt: "Document the requirements for an application that will process images. Format as a technical requirements document using markdown markup. Output as a single markdown code-block." The response is much more detailed, and aligns with the user's request.

The specification of the output format ensure the response is what you expect.

Quick Reference Prompt Template

Use this template to structure your prompts:


[Business Context] 
- Project description: 
- Performance requirements: 
- Compliance needs: 
- Scale expectations: 

[Technical Details] 
- Current technology stack: 
- Versions/dependencies: 
- Technical constraints: 
- Environment details: 

[Specific Request] 
- Task description: 
- Expected outcome: 
- Special considerations: 

[Output Format] 
- Desired format: 
- Level of detail: 
-  Examples needed: 
- Additional requirements:

Best Practices for Daily Use

Successfully working with Amazon Q Developer requires consistent application of proven practices. These guidelines, developed through extensive customer interactions, will help you maximize the value of your AI-assisted development experience.

  • Start with clear business objectives
  • Include relevant technical constraints
  • Specify performance requirements
  • Request specific output formats
  • Provide examples when possible

Through extensive customer interactions, we’ve found that following these practices consistently produces better results and reduces the need for follow-up clarification.

Take Action Now

Additional Resources

What’s Next?

In the next part of this series, we’ll explore advanced context management in Amazon Q Developer and dive into the new prompt catalog features. You’ll learn how to:

  • Build and maintain context across multiple interactions
  • Use the prompt catalog effectively
  • Handle complex, multi-step development tasks
  • Optimize responses for your specific use cases

Stay tuned, and start applying these techniques today to transform how you build on AWS!

About the author:

Will Matos

Will Matos is a Principal Specialist Solutions Architect at AWS, revolutionizing developer productivity through Generative AI, AI-powered chat interfaces, and code generation. With 25 years of tech experience, he collaborates with product teams to create intelligent solutions that streamline workflows and accelerate software development cycles. A thought leader engaging early adopters, Will bridges innovation and real-world needs.

Navigate Bulk Sender Requirements with Amazon SES

Post Syndicated from Vinay Ujjini original https://aws.amazon.com/blogs/messaging-and-targeting/navigate-bulk-sender-requirements-with-amazon-ses/

Introduction

Email communication remains a critical component of business operations and customer engagement. As the digital landscape evolves, major mailbox providers continually update their policies to enhance security and user experience. This blog will explore the changes implemented by Microsoft for bulk senders trying to reach Outlook.com (supporting Hotmail.com, live.com consumer domain addresses). This follows the Google & Yahoo! bulk sender requirements changes in February of 2024. Microsoft is implementing the enforcement of sender requirements for bulk email senders, particularly those sending over 5,000 messages daily, starting May 5, 2025. These requirements focus on improving email authentication and trust. This will ensure Outlook and Hotmail recipients are receiving messages that are authenticated and from who they claim to be from. These measures will help reduce spoofing, phishing, and spam, and safeguarding individuals and businesses relying on email.

This blog will discuss what these changes mean for you, and how Amazon Simple Email Service (Amazon SES) can help you maintain compliance and optimize your email sending practices.

Background

In February 2024, Google and Yahoo implemented new requirements for bulk email senders, building upon industry efforts to combat spam and improve email deliverability. These changes aligned with Google’s 2024 bulk sender requirements initiative, signaling a unified approach among major mailbox providers to enhance the privacy and compliance in email.

What does this mean for customers and email senders?

What’s Changing?

Microsoft’s New Requirements

  1. DMARC enforcement with at least a p=none policy
  2. Sender domain authentication (SPF, DKIM)
  3. Functional unsubscribe links required in the email
  4. Requirement for From and Reply-to addresses to be deliverable

Why These Changes Matter?

These new requirements serve several crucial purposes:

  1. Enhances trust in your sending domain: Validates that the sender is who they are claiming to be. Enhances trust by delivering messages that are authenticated and aligned with the bulk sender requirements.
  2. Improved Deliverability: Ensuring legitimate emails reach the recipients who have subscribed to sender’s messages.
  3. User-Centric: Providing recipients with control over their inboxes.
  4. Industry Standardization: Aligning sender requirements across major email providers

Best Practices for Compliance

To adhere to these new requirements and optimize your email sending practices, consider the following best practices:

1. Implement Strong Authentication

  • Configure SPF: SPF (Sender Policy Framework) is an email authentication standard that’s designed to prevent email spoofing. Domain owners use SPF to tell email providers which servers are allowed to send email from their domains. Follow setup instructions to authenticate your email with SPF. Must pass SPF for sending domain.
    • Configure “custom MAIL FROM“, which is how senders can ensure that the SPF-authenticated domain is aligned with the From header domain’s DMARC policy.
  • Enable DKIM signing: DomainKeys Identified Mail (DKIM) is an email security standard. It is designed to ensure that an email that claims to have come from a specific domain, was indeed authorized by the owner of that domain. It uses public-key cryptography to sign an email with a private key. Recipient servers use a public key, published to a domain’s DNS to verify that parts of the email have not been modified during the transit. Follow these set up instructions to authenticate email with DKIM in SES. Must pass to validate email integrity and authenticity.
    • Verify your domain with Easy DKIM. If currently using email identities, you have to move to domain
    • If you utilize email identities only, you will default all authentication to amazonses.com. That will not align with your friendly from address which will not satisfy the bulk sender requirements. This means that when you send email to mailbox providers, your messages will be rejected because you do not have proper authentication on your emails. To satisfy the bulk sender requirements, you must use domain verified identities which ensure that you have ownership of or permission to use the sending domain. That will allow SES to sign the outgoing emails with a DKIM signature that aligns with the friendly from domain.
  • Set up DMARC with an appropriate policy: Domain-based Message Authentication, Reporting and Conformance (DMARC) is an email authentication protocol that uses SPF and DKIM to detect email spoofing and phishing. To comply with DMARC, messages must be authenticated through either SPF or DKIM. Ideally, when both are used with DMARC, you’ll be ensuring the highest level of protection possible for your email sending.
Name Type Value
_dmarc.example.com TXT “v=DMARC1;p=none;rua=mailto:[email protected]

In the preceding records:

    • example.com is your domain
    • Value of the TXT record contains the DMARC policy that applies to your domain.
    • In this example, the policy tells email providers to do the following:
      • At least p=none should be implemented.

2. Optimize Email Content

  • Clearly identify yourself as the sender: Use a recognizable “From” name and email address that accurately represents your brand or organization. For example, use “[email protected]” instead of a generic or misleading address.
  • Implement user friendly unsubscribe mechanisms: Include a visible, easy-to-use unsubscribe link in every email, typically in the header. Ensure the unsubscribe process is simple and honors requests promptly, ideally within 24-48 hours. Visit this guide on how Amazon SES helps you do that.
  • Subject line aligns with content: Avoid deceptive subject lines that don’t match the email content.
  • Clearly identify commercial content: If your email is promotional, make it obvious. Use clear language in the subject line and body that indicates the nature of the email, such as “Special Offer” or “Newsletter.”
  • Include a valid physical address: Add your company’s physical mailing address in the email footer. This is not only a legal requirement in many jurisdictions but builds trust with recipients.
  • Verify URLS in the emails: Verify that links in the emails you send work and are not misleading to the reader/subscriber. Be transparent with URLs/links in the email content.

3. Monitor and Maintain

  • Monitor bounces: A bounce typically indicates why a message was not delivered. The SMTP response in the bounce message will have details on why the message was bounced. For example: if it is missing authentication records (fix: include authentication records for the domain – quick fix) versus an IP or domain reputation bounce reason (this maybe a longer term fix).
    • Track both hard bounces (permanent delivery failures) and soft bounces (temporary issues). High bounce rates can indicate list quality problems or delivery issues. Visit this blog to set up notifications for bounces & complaints. Virtual Deliverability Manager (VDM) is an Amazon SES feature that helps you enhance email deliverability. It helps increasing inbox deliverability and email conversions, by providing insights into your sending and delivery data. VDM advices on how to fix the issues that are negatively affecting your delivery success rate and reputation.
  • Track complaint rates: Regularly monitor the number of spam complaints your emails receive with a goal of keeping the complaint rate under 0.2%. Not all mailbox providers have complaint feedback loop data, so use aggregate data from the mailbox providers that do, such as Hotmail and Yahoo. Email providers that don’t provide complaint feedback loops, such as Gmail may have alternative dashboards or tools available like Google Postmaster tools.
  • Perform regular authentication checks: Periodically verify that your SPF, DKIM, and DMARC records are correctly set up and functioning. Alternative to manual DNS checks, Amazon SES has a feature in Virtual Deliverability Manager that performs authentication checks for your sending identities.
  • Maintain list hygiene: Regularly clean your email list by removing inactive subscribers, correcting typos in email addresses, and honoring unsubscribe requests. This helps improve deliverability and engagement rates.

How Amazon SES Helps

Amazon SES provides a robust set of features to help you meet these new requirements and optimize your email sending practices:

Authentication Support

  • Easy DKIM configuration
  • SPF record management
  • DMARC implementation guidance

Comprehensive Monitoring

  • Virtual Deliverability Manager
  • Complaint tracking
  • Bounce rate monitoring
  • Event publishing to Amazon CloudWatch, SNS , Kinesis Firehose and Event Bridge
  • Detailed sending statistics

Compliance Tools

  • List management capabilities (included with SES)
  • Suppression list handling (included with SES)
  • Feedback loop processing (included with SES)
  • Authentication status tracking: This is done through Amazon SES feature Virtual Deliverability Manager (VDM).

Implementation Strategy

To successfully implement these changes, consider the following strategy:

  1. Assessment: Audit your current email practices, review authentication status, and evaluate compliance gaps.
  2. Technical Implementation: Configure authentication protocols, update DNS records, and implement required unsubscribe mechanisms.
  3. Monitoring and Optimization: Track deliverability metrics, monitor complaint rates, and adjust sending practices as needed.

Measuring Success

To ensure ongoing compliance and optimize your email practices, track these key metrics:

  1. Delivery rates
  2. Complaint rates
  3. Authentication pass rates
  4. Engagement metrics (open rates, click-through rates)

Conclusion

The new bulk sender requirements from Microsoft and Yahoo represent an important step towards a more secure and reliable email ecosystem. By leveraging Amazon SES’s powerful features and following industry best practices, you can maintain compliance, improve deliverability, and enhance the overall effectiveness of your email communications.

Amazon SES is committed to helping you navigate these changes and optimize your email sending practices. For the most up-to-date guidance and support, please consult SES’s documentation or contact Amazon SES support.

Additional Resources

The email landscape is constantly evolving. Stay informed and adaptable to ensure your email practices remain effective and compliant.

About the authors:

Access Amazon Redshift Managed Storage tables through Apache Spark on AWS Glue and Amazon EMR using Amazon SageMaker Lakehouse

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/access-amazon-redshift-managed-storage-tables-through-apache-spark-on-aws-glue-and-amazon-emr-using-amazon-sagemaker-lakehouse/

Data environments in data-driven organizations are changing to meet the growing demands for analytics, including business intelligence (BI) dashboarding, one-time querying, data science, machine learning (ML), and generative AI. These organizations have a huge demand for lakehouse solutions that combine the best of data warehouses and data lakes to simplify data management with easy access to all data from their preferred engines.

Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data  in place with all Apache Iceberg compatible tools and engines. It secures your data in the lakehouse by defining fine-grained permissions, which are consistently applied across all analytics and ML tools and engines. You can bring data from operational databases and applications into your lakehouse in near real time through zero-ETL integrations. It accesses and queries data in-place with federated query capabilities across third-party data sources through Amazon Athena.

With SageMaker Lakehouse, you can access tables stored in Amazon Redshift managed storage (RMS) through Iceberg APIs, using the Iceberg REST catalog backed by AWS Glue Data Catalog. This expands your data integration workload across data lakes and data warehouses, enabling seamless access to diverse data sources.

Amazon SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0 natively support SageMaker Lakehouse. This post describes how to integrate data on RMS tables through Apache Spark using SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0.

How to access RMS tables through Apache Spark on AWS Glue and Amazon EMR

With SageMaker Lakehouse, RMS tables are accessible through the Apache Iceberg REST catalog. Open source engines such as Apache Spark are compatible with Apache Iceberg, and they can interact with RMS tables by configuring this Iceberg REST catalog. You can learn more in Connecting to the Data Catalog using AWS Glue Iceberg REST extension endpoint.

Note that the Iceberg REST extensions endpoint is used when you access RMS tables. This endpoint is accessible through the Apache Iceberg AWS Glue Data Catalog extensions, which comes preinstalled on AWS Glue 5.0 and Amazon EMR 7.5.0 or higher. The extension library enables access to RMS tables using the Amazon Redshift connector for Apache Spark.

To access RMS backed catalog databases from Spark, each RMS database requires its own Spark session catalog configuration. Here are the required Spark configurations:

Spark config key Value
spark.sql.catalog.{catalog_name} org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.{catalog_name}.type glue
spark.sql.catalog.{catalog_name}.glue.id {account_id}:{rms_catalog_name}/{database_name}
spark.sql.catalog.{catalog_name}.client.region {aws_region}
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Configuration parameters:

  • {catalog_name}: Your chosen name for referencing the RMS catalog database in your application code
  • {rms_catalog_name}: The RMS catalog name as shown in the AWS Lake Formation catalogs section
  • {database_name}: The RMS database name
  • {aws_region}: The AWS Region where the RMS catalog is located

For a deeper understanding of how the Amazon Redshift hierarchy (databases, schemas, and tables) is mapped to the AWS Glue multilevel catalogs, you can refer to the Bringing Amazon Redshift data into the AWS Glue Data Catalog documentation.

In the following section, we demonstrate how to access RMS tables through Apache Spark using SageMaker Unified Studio JupyterLab notebooks with the AWS Glue 5.0 runtime and Amazon EMR Serverless.

Although we can bring existing Amazon Redshift tables into the AWS Glue Data catalog by creating a Lakehouse Redshift catalog from an existing Redshift namespace and provide access to a SageMaker Unified Studio project, in the following example, you’ll create a managed Amazon Redshift Lakehouse catalog directly from SageMaker Unified Studio and work with that.

Prerequisites

To follow these instructions, you must have the following prerequisites:

Create a SageMaker Unified Studio project

Complete the following steps to create a SageMaker Unified Studio project:

  1. Sign in to SageMaker Unified Studio.
  2. Choose Select a project on the top menu and choose Create project.
  3. For Project name, enter demo.
  4. For Project profile, choose All capabilities.
  5. Choose Continue.

  1. Leave the default values and choose Continue.
  2. Review the configurations and choose Create project.

You need to wait for the project to be created. Project creation can take about 5 minutes. When the project status changes to Active, select the project name to access the project’s home page.

  1. Make note of the Project role ARN because you’ll need it for next steps.

You’ve successfully created the project and noted the project role ARN. The next step is to configure a Lakehouse catalog for your RMS.

Configure a Lakehouse catalog for your RMS

Complete the following steps to configure a Lakehouse catalog for your RMS:

  1. In the navigation pane, choose Data.
  2. Choose the + (plus) sign.
  3. Select Create Lakehouse catalog to create a new catalog and choose Next.

  1. For Lakehouse catalog name, enter rms-catalog-demo.
  2. Choose Add catalog.

  1. Wait for the catalog to be created.

  1. In SageMaker Unified Studio, choose Data in the left navigation pane, then select the three vertical dots next to Redshift (Lakehouse) and choose Refresh to make sure the Amazon Redshift compute is active.

Create a new table in the RMS Lakehouse catalog:

  1. In SageMaker Unified Studio, on the top menu, under Build, choose Query Editor.
  2. On the top right, choose Select data source.
  3. For CONNECTIONS, choose Redshift (Lakehouse).
  4. For DATABASES, choose dev@rms-catalog-demo.
  5. For SCHEMAS, choose public.
  6. Choose Choose.

  1. In the query cell, enter and execute the following query to create a new schema:
create schema "dev@rms-catalog-demo".salesdb

  1. In a new cell, enter and execute the following query to create a new table:
create table salesdb.store_sales (ss_sold_timestamp timestamp, ss_item text, ss_sales_price float);

  1. In a new cell, enter and execute the following query to populate the table with sample data:
insert into salesdb.store_sales values ('2024-12-01T09:00:00Z', 'Product 1', 100.0),
('2024-12-01T11:00:00Z', 'Product 2', 500.0),
('2024-12-01T15:00:00Z', 'Product 3', 20.0),
('2024-12-01T17:00:00Z', 'Product 4', 1000.0),
('2024-12-01T18:00:00Z', 'Product 5', 30.0),
('2024-12-02T10:00:00Z', 'Product 6', 5000.0),
('2024-12-02T16:00:00Z', 'Product 7', 5.0);

  1. In a new cell, enter and run the following query to verify the table contents:
select * from salesdb.store_sales;

(Optional) Create an Amazon EMR Serverless application

IMPORTANT: This section is only required if you plan to test also using Amazon EMR Serverless. If you intend to use AWS Glue exclusively, you can skip this section entirely.

  1. Navigate to the project page. In the left navigation pane, select Compute, then select the Data processing Choose Add compute.

  1. Choose Create new compute resources, then choose Next.

  1. Select EMR Serverless.

  1. Specify emr_serverless_application as Compute name, select Compatibility as Permission mode, and choose Add compute.

  1. Monitor the deployment progress. Wait for the Amazon EMR Serverless application to complete its deployment. This process can take a minute.

Access Amazon Redshift Managed Storage tables through Apache Spark

In this section, we demonstrate how to query tables stored in RMS using a SageMaker Unified Studio notebook.

  1. In the navigation pane, choose Data
  2. Under Lakehouse, select the down arrow next to rms-catalog-demo
  3. Under dev, select the down arrow next salesdb, choose store_sales, and choose the three dots

SageMaker Lakehouse offers multiple analysis options: Query with Athena, Query with Redshift, and Open in Jupyter Lab notebook.

  1. Choose Open in Jupyter Lab notebook
  2. On the Launcher tab, choose Python 3 (ipykernel)

In SageMaker Unified Studio JupyterLab, you can specify different compute types for each notebook cell. Although this example demonstrates using AWS Glue compute (project.spark.compatibility), the same code can be executed using Amazon EMR Serverless by selecting the appropriate compute in the cell settings. The following table shows the connection type and compute values to specify when running PySpark code or Spark SQL code with different engines:

Compute option Pyspark code Spark SQL
Connection type Compute Connection type Compute
AWS Glue Pyspark project.spark.compatibility SQL project.spark.compatibility
Amazon EMR Serverless Pyspark emr-s.emr_serverless_application SQL emr-s.emr_serverless_application
  1. In the notebook cell’s top left corner, set Connection Type to PySpark and select spark.compatibility (AWS Glue 5.0) as Compute
  2. Execute the following code to initialize the SparkSession and configure rmscatalog as the session catalog for accessing the dev database under the rms-catalog-demo RMS catalog:
from pyspark.sql import SparkSession

catalog_name = "rmscatalog"
#Change <your_account_id> with your AWS account ID
rms_catalog_id = "<your_account_id>:rms-catalog-demo/dev"

#Change with your AWS region
aws_region="us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
    .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
    .config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_id) \
    .config(f'spark.sql.catalog.{catalog_name}.client.region', aws_region) \
    .config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .getOrCreate()

  1. Create a new cell and switch the connection type from PySpark to SQL to execute Spark SQL commands directly
  2. Enter the following SQL statement to view all tables under salesdb (RMS schema) within rmscatalog:
SHOW TABLES IN rmscatalog.salesdb

  1. In a new SQL cell, enter the following DESCRIBE EXTENDED statement to view detailed information about the store_sales table in the salesdb schema:
DESCRIBE EXTENDED rmscatalog.salesdb.store_sales

In the output, you’ll observe that the Provider is set to iceberg. This indicates that the table is recognized as an Iceberg table, despite being stored in Amazon Redshift managed storage.

  1. In a new SQL cell, enter the following SELECT statement to view the content of the table
SELECT * FROM rmscatalog.salesdb.store_sales

Throughout this example, we demonstrated how to create a table in Amazon Redshift Serverless and seamlessly query it as an Iceberg table using Apache Spark within a SageMaker Unified Studio notebook.

Clean up

To avoid incurring future charges, clean up all created resources:

  1. Delete the created SageMaker Unified Studio project. This step will automatically delete Amazon EMR compute (for example, the Amazon EMR Serverless application) that was provisioned from the project:
    1. Inside SageMaker Studio, navigate to the demo project’s Project overview section.
    2. Choose Actions, then select Delete project.
    3. Type confirm and choose Delete project.
  1. Delete the created Lakehouse catalog:
    1. Navigate to the AWS Lake Formation page in the Catalogs section.
    2. Select the rms-catalog-demo catalog, choose Actions, then select Delete.
    3. In the confirmation window type rms-catalog-demo and then choose Drop.

Conclusion

In this post, we demonstrated how to use Apache Spark to interact with Amazon Redshift Managed Storage tables through Amazon SageMaker Lakehouse using the Iceberg REST catalog. This integration provides a unified view of your data across Amazon S3 data lakes and Amazon Redshift data warehouses, so you can build powerful analytics and AI/ML applications while maintaining a single copy of your data.

For additional workloads and implementations, visit Simplify data access for your enterprise using Amazon SageMaker Lakehouse.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect with Amazon Web Services (AWS) Analytics services. He’s responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Stefano Sandonà is a Senior Big Data Specialist Solution Architect at Amazon Web Services (AWS). Passionate about data, distributed systems, and security, he helps customers worldwide architect high-performance, efficient, and secure data solutions.

Derek Liu is a Senior Solutions Architect based out of Vancouver, BC. He enjoys helping customers solve big data challenges through Amazon Web Services (AWS) analytic services.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services (AWS). He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Angel Conde Manjon is a Sr. EMEA Data & AI PSA, based in Madrid. He has previously worked on research related to data analytics and AI in diverse European research projects. In his current role, Angel helps partners develop businesses centered on data and AI.


Appendix: Sample script for Lake Formation FGAC enabled Spark cluster

If you want to access RMS tables from Lake Formation FGAC enabled Spark cluster on AWS Glue or Amazon EMR, refer to the following code example:

from pyspark.sql import SparkSession

catalog_name = "rmscatalog"
rms_catalog_name = "123456789012:rms-catalog-demo/dev"
account_id = "123456789012"
region = "us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
.config('spark.sql.defaultCatalog', catalog_name) \
.config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
.config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
.config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_name) \
.config(f'spark.sql.catalog.{catalog_name}.client.region', region) \
.config(f'spark.sql.catalog.{catalog_name}.glue.account-id', account_id) \
.config(f'spark.sql.catalog.{catalog_name}.glue.catalog-arn',f'arn:aws:glue:{region}:{rms_catalog_name}') \
.config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
.getOrCreate()

Revolutionizing agricultural knowledge management using a multi-modal LLM: A reference architecture

Post Syndicated from Nitin Eusebius original https://aws.amazon.com/blogs/architecture/revolutionizing-agricultural-knowledge-management-using-a-multi-modal-llm-a-reference-architecture/

Handwritten documents are still an important form of data capture in agribusiness. Paper-based handwritten documents can be the result of business culture, lack of internet connectivity, lack of mobile devices or computers, or environmental conditions in the field or in an industrial setting. Because of the physical nature of the document, there might be a delay in transcription or even no transcription into a digital system for enterprise reporting, causing critical information to be unavailable. Using generative AI, handwritten notes can be scanned to record and analyze the document and establish automated workflows for product procurement, the supply chain, and entry into customer relationship management (CRM), enterprise resource planning (ERP), and farm management information systems (FMIS).

Multi-modal large language models (LLMs) are transforming the agriculture industry by integrating diverse data types such as text, images, video, and audio. This approach enhances AI’s understanding and decision-making in farming contexts. For example, a multi-modal LLM can analyze images to identify crop issues, then generate targeted recommendations for irrigation or pest control. Combining handwritten documents and satellite imagery with the power of LLMs can lead to better crop analytics and better yields.

In this blog post, we introduce a reference architecture that offers an intelligent document digitization solution that converts handwritten notes, scanned documents, and images into editable, searchable, and accessible formats. Powered by Anthropic’s Claude 3 on Amazon Bedrock, the solution uses the sophisticated vision capabilities of LLMs to process a wide range of visual formats, preserving the original formatting while extracting text, tables, and images. This enables businesses to digitize their knowledge bases, facilitate seamless collaboration, and integrate the digitized content into their existing digital workflows, enhancing productivity and unlocking the full potential of their information assets.

A comprehensive solution and reference architecture

This reference architecture helps agricultural companies to automatically capture, analyze, and process handwritten notes and images with data and reports that are generated by individuals working in farm fields. This is an example of how to create an end-to-end solution to ingest these documents in image format with Amazon Bedrock. The processed information can be consumed by downstream systems such as CRM, ERP, and FMIS to make better data driven decisions.

The solution uses Anthropic’s Claude 3 multi modal model hosted in Amazon Bedrock. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Claude is Anthropic’s state-of-the-art LLM that offers important features for enterprises such as advanced reasoning, generating text from images, code generation, and multilingual processing. Claude 3 models have sophisticated vision capabilities and can process a wide range of visual formats, including photos, charts, graphs and technical diagrams. You can also use other models such as Llama 3.2 11B and 90B, which also support vision tasks.

The following diagram illustrates the reference solution.

Revolutionizing agricultural knowledge management using a multi-modal LLM: A reference architecture

The process includes the following steps:

  1. A field worker uploads handwritten notes in an image format using a static website on their mobile device. The static website is accessed through Amazon CloudFront and hosted in Amazon Simple Storage Service (Amazon S3).
  2. The worker is securely authenticated using Amazon Cognito.
  3. After the worker is authenticated, the uploaded handwritten notes are sent to Amazon Bedrock for processing using Amazon API Gateway.
  4. An AWS Lambda function stores and reads the image from Amazon S3. It sends the uploaded image and associated prompt information to Anthropic’s Claude 3 hosted in Amazon Bedrock.
  5. Anthropic’s Claude 3 processes the image. It recognizes the handwritten text and analyzes the converted text based on the given prompt.
  6. The converted digital text and analyzed information provided by Anthropic’s Claude 3 are stored in Amazon DynamoDB for further downstream processing.
  7. The field worker uses an app to access the converted digital text and newly processed information stored in Amazon DynamoDB through API Gateway.
  8. The processed information is published to Amazon Simple Notification Service (Amazon SNS) and is consumed by downstream systems.
  9. The field worker’s location details and processed image information are consumed by two different Amazon Simple Queue Service (Amazon SQS) queues to be stored in downstream systems.
  10. The downstream systems can include CRM, FMS, and FMIS.

Additionally, using this solution, geospatial information such as GPS and GIS information can be sent to the FMIS. This can help farmers in many ways including crop monitoring, soil health and nutrient management, pest control, water management, farm mapping, and much more.

Best practices and implementation guidelines

To implement a production-ready system, it’s important to consider the following best practices.

Responsible AI: Deployment of customer facing generative AI solutions raises concerns about responsible AI practices. To mitigate risks such as biased outputs, exposure of sensitive information, or misuse for malicious purposes, it’s crucial to implement robust safeguards and validation mechanisms. Amazon Bedrock Guardrails is a set of tools and services provided by AWS that you can use to implement safeguards and responsible AI practices when building applications with generative AI models.

Security: Follow secure coding practices throughout the development lifecycle to minimize vulnerabilities. Protect your web applications from common exploits by integrating with AWS WAF. The OWASP Top 10 for Large Language Model Applications is a set of guidelines that address the unique security risks associated with generative AI solutions. It covers vulnerabilities such as model inversion, membership inference, and adversarial attacks—all of which can compromise the confidentiality, integrity, and availability of LLMs.

Observability: Monitor all layers of a generative AI solution, including the application, prompt, LLM, knowledgebase, and response provided by the LLM. You can monitor health and performance using Amazon CloudWatch.

LLMOps: Implementing LLM operations (LLMOps) will help to scale your GenAI solutions. See FMOps/LLMOps: Operationalize generative AI and differences with MLOps for additional information.

Conclusion

In this post, we introduced a reference architecture for an intelligent document digitization solution in agriculture. This system uses Amazon Bedrock and the multi-modal capabilities of LLMs such as Anthropic’s Claude 3 to transform handwritten notes and multi-modal data into searchable, digital formats. We explored how this architecture bridges the gap between traditional field documentation and modern digital systems, enhancing data accessibility and decision-making in agribusiness.

The possibilities for customization and expansion are vast. For specific use cases, you can fine-tune the multi-modal model on your unique agricultural business data. You can also implement a combination of multi-modal processing and a specialized knowledge base using Amazon Bedrock Knowledge Bases, further enhancing the system’s accuracy and relevance.


About the Authors

Protect against advanced DNS threats with Amazon Route 53 Resolver DNS Firewall

Post Syndicated from Lawton Pittenger original https://aws.amazon.com/blogs/security/protect-against-advanced-dns-threats-with-amazon-route-53-resolver-dns-firewall/

Every day, millions of applications seamlessly connect users to the digital services they need through DNS queries. These queries act as an interface to the internet’s address book, translating familiar domain names like amazon.com into the IP addresses that computers use to appropriately route traffic. The DNS landscape presents unique security challenges and opportunities in Amazon Virtual Private Cloud (Amazon VPC) environments. First, DNS resolution acts as an early checkpoint that you can use to control network traffic before it even begins. Second, DNS queries in your VPC follow a distinct path through the Amazon Route 53 Resolver that operates independently from your standard internet gateway, bypassing other network security controls.

To address this, Amazon Route 53 Resolver DNS Firewall provides protection for DNS traffic, starting with traditional domain lists where you can explicitly allow or deny DNS resolution of specific domains. Also, included are AWS Managed Domain Lists, which automatically block known malicious domains identified through Amazon Threat Intelligence and our trusted security partners. While this approach works effectively to help prevent known threats, sophisticated bad actors are increasingly using techniques that traditional blocklists can’t catch.

Instead of relying solely on static lists, Amazon Route 53 Resolver DNS Firewall Advanced provides intelligent protection alongside these traditional controls. These advanced rules work like a skilled security analyst, watching for suspicious patterns in DNS queries in real time. By examining characteristics such as query length, entropy, and frequency, the service can spot potentially malicious activity even when encountering previously unknown domains. This approach enables detecting and blocking advanced threats like DNS tunneling and domain generation algorithms (DGAs)—techniques that bad actors use to establish hidden communication channels or connect malware to their control servers.

In this post, we take you on a practical journey exploring these DNS-based threats and tools to help prevent them. You’ll learn how to set up effective Route 53 Resolver DNS Firewall Advanced rules, and we provide a ready-to-deploy CloudFormation template with our recommended configurations. Finally, we demonstrate an example of real-world threat detection and show you how the service integrates with AWS Security Hub to improve visibility of alerts. By the time you finish reading this post, you’ll have a clear understanding of how to deploy Route 53 Resolver DNS Firewall rules to add an intelligent, proactive layer of security to your AWS environment.

Understanding the risks of DNS tunneling and DGAs

As mentioned earlier, the Route 53 Resolver provides a service-managed path to the internet that operates independently from your VPC’s internet gateway. While this architecture enables efficient DNS resolution, it can be exploited through techniques such as DNS tunneling. Let’s explore how these techniques work and why they present unique challenges.

DNS tunneling takes advantage of the DNS protocol’s basic function—asking questions about domain names and receiving answers from the authoritative nameserver for the domain. But instead of using DNS for its intended purpose of domain name resolution, tunneling encodes other types of data within DNS queries and responses. For example, rather than asking simply what is the IP address for example.com?, a tunneling exploit might embed data within a query like secretdata123.attacker.com, where secretdata123 contains encoded information. This can lead to DNS being used as a two-way communications command and control channel. Detecting and blocking DNS tunneling is a vital control for stopping data exfiltration and command and control (C2) communications.

DGAs represent a different challenge for DNS security. Rather than using a fixed, predictable domain name that can be quickly blocked, DGAs automatically create many possible domain names using mathematical formulas, which are then used as a destination for C2 traffic. For instance, a DGA might generate domains like xkt7py.com today and mn9qrs.com tomorrow. This makes it difficult to maintain effective blocklists, because the domains change frequently and appear random. Traditional threat intelligence feeds, which rely on identifying and blocking known malicious domains, struggle to keep pace with DGA-generated domains.

How DNS Firewall Advanced works

When examining a domain name, Route 53 Resolver DNS Firewall Advanced looks at multiple characteristics that help distinguish between legitimate and suspicious domains. For example, legitimate domain names typically use real words and follow predictable patterns that are designed to facilitate a human’s ability to recall and enter them accurately. In contrast, domains used for tunneling or generated by DGAs often contain random-looking strings of characters or unusual patterns.

Route 53 Resolver DNS Firewall Advanced builds its intelligence on extensive analysis of real-world domain usage patterns. It learns what legitimate domain names look like by studying the most resolved domains on the internet, combined with actual domain resolution patterns from across AWS. This real-world training data helps establish a baseline for normal domain name characteristics. DNS Firewall Advanced then contrasts these patterns against known techniques used in DNS tunneling and domain generation to identify suspicious activity.

The service analyzes various aspects of each domain name, including:

  • How the domain name is structured and broken into parts
  • The patterns of letters and numbers used
  • How closely the domain resembles natural language
  • The presence of common words versus random character combinations

The service analyzes queries in real time, processing each one in less than a millisecond, which maintains strong security controls without affecting your applications’ performance.

Route 53 Resolver DNS Firewall Advanced has customized protection levels that you can use to choose how aggressively you want to detect and respond to suspicious domains through confidence thresholds:

  • High confidence: This setting focuses on the most obvious threats, minimizing false positives. It’s ideal for production environments where blocking legitimate traffic could be disruptive.
  • Medium confidence: Provides balanced protection, suitable for most environments.
  • Low confidence: Offers the most detection but might require more tuning to avoid false positives. This setting is useful for high-security environments or for initial monitoring to understand traffic patterns.

You can combine these confidence levels with different actions (block or alert) to create a defense strategy that matches your security needs.

Manually create a DNS Firewall Advanced rule:

To start, we show you how to manually create a Route 53 Resolver DNS Firewall Advanced rule in the AWS Management Console. This rule will block DNS queries that it has detected to be DNS tunneling with high confidence.

To manually create a rule:

  1. In the Route 53 console, choose Rules in the navigation pane, and then choose Add rule.
    Figure 1: Rules in the Route 53 console

    Figure 1: Rules in the Route 53 console

  2. Enter a name for the rule and select DNS Firewall Advanced protections.
    Figure 2: Add a rule

    Figure 2: Add a rule

  3. Under DNS Firewall Advanced protection:
    1. Select DNS tunneling detection.
    2. For Confidence threshold, select High.
    3. Leave the Query type empty so that the rule applies to all query types.
    Figure 3: Select DNS protection options

    Figure 3: Select DNS protection options

  4. Under Action:
    1. Select Block.
    2. For the response, select OVERRIDE.
    3. For the Record value, enter dns-firewall-advanced-block.
    4. For the Record type, select CNAME.
    5. Choose Add rule.
    Figure 4: Configure actions for the rule

    Figure 4: Configure actions for the rule

We’ve created an AWS CloudFormation stack that deploys the following recommended Route 53 Resolver DNS Firewall rules in a DNS Firewall rule group. We recommend this configuration because it provides a balanced security approach—blocking high-confidence threats immediately while generating alerts for lower-confidence detections.

The inclusion of the AWS Managed Aggregate Threat List is particularly valuable because it combines domains from multiple threat categories (malware, ransomware, botnet, spyware, and DNS tunneling) into a blocklist. This consolidated list includes the domains from other AWS Managed Domain Lists, including those identified by GuardDuty threat intelligence systems, giving you broad protection against known malicious domains while the Route 53 DNS Firewall Advanced rules catch previously unseen threats.

For enterprise environments, you can scale this protection across your entire organization by using AWS Firewall Manager to automatically deploy and manage this rule group configuration consistently across the VPCs in your organization.

  • BLOCK – Aggregate Threat List (domains associated with multiple DNS threat categories including malware, ransomware, botnet, spyware, and DNS tunneling to help block multiple types of threats)
  • BLOCK – DNS Tunneling | Confidence: HIGH
  • BLOCK – DGAs | Confidence: HIGH
  • ALERT – DNS Tunneling | Confidence: LOW
  • ALERT – DGAs | Confidence: LOW

To deploy this rule group using a CloudFormation stack:

  1. Navigate to the CloudFormation console, choose Stacks from the navigation pane. Choose Create Stack in the upper right and select With new resources (standard).
    Figure 5: Create a stack

    Figure 5: Create a stack

  2. Download the CloudFormation template. Select Choose an existing template and then select Upload a template file and upload the CloudFormation stack. Choose Next.
    Figure 6: Use the CloudFormation template

    Figure 6: Use the CloudFormation template

  3. Enter a stack name and choose Next.
    Figure 7: Enter a stack name

    Figure 7: Enter a stack name

  4. Leave the default values for all options, select Next, and then choose Submit.
  5. Navigate to the Route 53 Resolver DNS Firewall by visiting the Amazon VPC console, scroll down to the DNS firewall section, and select the Rule groups tab.
  6. Select the newly created rule group.
  7. Select the Associated VPCs tab, choose Associate VPC, and then associate a VPC you want to protect and choose Associate.
    Figure 8: Associate a VPC

    Figure 8: Associate a VPC

Observability

Route 53 Resolver query logging provides detailed visibility into DNS queries made from resources associated with your VPCs, enabling you to monitor and analyze your DNS traffic for security and compliance purposes. By configuring query logging, you can capture essential information about each DNS request, including the domain name being queried, the record type, the response code, and the originating VPC and instance. Query logging is particularly valuable when used in conjunction with Route 53 Resolver DNS Firewall, because it helps you track blocked queries and fine-tune your security rules based on actual DNS traffic patterns in your environment. The following are examples of log entries generated when DNS Firewall detects and responds to suspicious activities, showing the detailed information available for security analysis and incident response.

Example log entry: DNS tunneling block

The following is an example of a DNS tunneling block.

{
    "version": "1.100000",
    "account_id": "11111111111",
    "region": "us-west-2",
    "vpc_id": "vpc-0fcc85bd45b791d5a",
    "query_timestamp": "2025-02-05T03:54:12Z",
    "query_name": "1WTE4CyL4Vf1LQDDAToimuqFBEtMXyYMsYP8zPgVyTagzSh5PvinuQcL6N8at4A.REZv3VqKU4x43DPcCKAzQk4UKoZjB3nDMukHAuKTtDckTqZ8SDDZ1iXRey6a5sD.mEDMdrzPocS9exqoBQ1xfSuKfvW.1.dnstunnel.com.",
    "query_type": "A",
    "query_class": "IN",
    "rcode": "NXDOMAIN",
    "answers": [
        {
            "Rdata": "dns-firewall-advanced-block.",
            "Type": "CNAME",
            "Class": "IN"
        }
    ],
    "srcaddr": "10.1.0.122",
    "srcport": "41859",
    "transport": "UDP",
    "srcids": {
        "instance": "i-0c738190f19db9a2c"
    },
    "firewall_rule_action": "BLOCK",
    "firewall_rule_group_id": "rslvr-frg-63efa138b43f428b",
    "firewall_protection": "DNS_TUNNELING"
}

Example log entry: DNS tunneling alert

The following is an example of a DNS tunneling alert.

{
    "version": "1.100000",
    "account_id": "11111111111",
    "region": "us-west-2",
    "vpc_id": "vpc-0fcc85bd45b791d5a",
    "query_timestamp": "2025-02-05T04:00:02Z",
    "query_name": "1WTEc8GwFH3qHY8XKjbhXuj43yGShMrhacqwJYSZkSqRQ95sagz64NUpnuj4R8R.S79aru2KRB8d9nCHEPdXWJxGT4aUjVMqtCRSq9EZXRCo8NH5cmLvmcho3hh1mbK.NqGY1X6M4qpMGX6dnTSHuCsZFbf.1.dnstunnel.com.",
    "query_type": "A",
    "query_class": "IN",
    "rcode": "NOERROR",
    "answers": [
        {
        "Rdata": "202.92.34.217",
        "Type": "A",
        "Class": "IN"
        }
    ],
    "srcaddr": "10.1.0.122",
    "srcport": "35116",
    "transport": "UDP",
    "srcids": {
        "instance": "i-0c738190f19db9a2c",
        "resolver_endpoint": "rslvr-out-e20639d3666748f58"
    },
    "firewall_rule_action": "ALERT",
    "firewall_rule_group_id": "rslvr-frg-63efa138b43f428b",
    "firewall_protection": "DNS_TUNNELING"
}

Integration with Security Hub

Security Hub provides you with a view of your security state in AWS and helps you to check your environment against security industry standards and best practices. Security Hub collects security data from across AWS accounts, AWS services, and supported third-party partner products, and helps you to analyze security trends and identify the highest priority security issues. It enables findings from both the Amazon: Route 53 Resolver DNS Firewall – AWS List and Amazon: Route 53 Resolver DNS Firewall Advanced list by default, so you’ll automatically receive these alerts without additional configuration. You only need to manually enable Amazon: Route 53 Resolver DNS Firewall – Custom List findings if you’re using custom domain lists in your rule groups. See Sending findings from Route 53 Resolver DNS Firewall to Security Hub for more information.

The following figure is an example of how Route 53 Resolver DNS Firewall Advanced findings appear in the Security Hub console, providing you with actionable security intelligence directly in your centralized dashboard.

Figure 9: DNS Firewall Advanced findings in Security Hub

Figure 9: DNS Firewall Advanced findings in Security Hub

Select a finding to view details such as Finding ID, Types, Workflow status, and so on.

Figure 10: Findings details

Figure 10: Findings details

Conclusion

Amazon Route 53 Resolver DNS Firewall Advanced represents a significant step forward in protecting organizations against sophisticated DNS-based threats. As mentioned, DNS queries sent to the Route 53 Resolver follow a unique path that bypasses traditional AWS security controls like security groups, NACLs, and even AWS Network Firewall—creating a security gap in many environments. Throughout this post, we’ve explored how DNS tunneling and DGA-based exploits take advantage of this blind spot, and how you can use Route 53 Resolver DNS Firewall Advanced to protect from these threats through real-time pattern analysis and anomaly detection. You learned how to configure the service in the AWS console and use the provided CloudFormation template with recommended rules that balance blocking high-confidence threats while alerting on potential issues. And you saw how query logging provides valuable visibility into your DNS traffic and how Security Hub integration centralizes your security findings. Implementing these capabilities helps you protect your infrastructure from sophisticated DNS-based exploits that traditional domain blocklists cannot catch, strengthening your cloud security posture while maintaining operational efficiency.

If you have feedback about this post, submit comments in the Comments section below.

Lawton Pittenger

Lawton Pittenger

Lawton is a Security Solutions Architect at AWS, based in New York City, focused on helping customers implement native AWS security services. Professionally, Lawton has worked in IT security roles, securing cloud environments. Outside of cloud security, his interests include skateboarding, snowboarding, and rock climbing.

Michael Leighty

Michael Leighty

Michael is a Senior Security Solutions Architect at AWS, based in Atlanta. He specializes in helping customers design and implement effective network security controls, drawing from extensive experience at leading network security vendors. At AWS, he works closely with service teams to drive continuous improvement in security services based on customer needs and feedback.

Petabyte-scale data migration made simple: AppsFlyer’s best practice journey with Amazon EMR Serverless

Post Syndicated from Roy Ninio original https://aws.amazon.com/blogs/big-data/petabyte-scale-data-migration-made-simple-appsflyers-best-practice-journey-with-amazon-emr-serverless/

This post is co-written with Roy Ninio from Appsflyer.

Organizations worldwide aim to harness the power of data to drive smarter, more informed decision-making by embedding data at the core of their processes. Using data-driven insights enables you to respond more effectively to unexpected challenges, foster innovation, and deliver enhanced experiences to your customers. In fact, data has transformed how organizations drive decision-making, but historically, managing the infrastructure to support it posed significant challenges and required specific skill sets and dedicated personnel. The complexity of setting up, scaling, and maintaining large-scale data systems impacted agility and pace of innovation. This reliance on experts and intricate setups often diverted resources from innovation, slowed time-to-market, and hindered the ability to respond to changes in industry demands.

AppsFlyer is a leading analytics and attribution company designed to help businesses measure and optimize their marketing efforts across mobile, web, and connected devices. With a focus on privacy-first innovation, AppsFlyer empowers organizations to make data-driven decisions while respecting user privacy and compliance regulations. AppsFlyer provides tools for tracking user acquisition, engagement, and retention, delivering actionable insights to enhance ROI and streamline marketing strategies.

In this post, we share how AppsFlyer successfully migrated their massive data infrastructure from self-managed Hadoop clusters to Amazon EMR Serverless, detailing their best practices, challenges to overcome, and lessons learned that can help guide other organizations in similar transformations.

Why AppsFlyer embraced a serverless approach for big data

AppsFlyer manages one of the largest-scale data infrastructures in the industry, processing 100 PB of data daily, handling millions of events per second, and running thousands of jobs across nearly 100 self-managed Hadoop clusters. The AppsFlyer architecture is comprised of many data engineering open source technologies, including but not limited to Apache Spark, Apache Kafka, Apache Iceberg, and Apache Airflow. Although this setup has powered operations for years, the growing complexity of scaling resources to meet fluctuating demands, coupled with the operational overhead of maintaining clusters, prompted AppsFlyer to rethink their big data processing strategy.

EMR Serverless is a modern, scalable solution that alleviates the need for manual cluster management while dynamically adjusting resources to match real-time workload requirements. With EMR Serverless, scaling up or down happens within seconds, minimizing idle time and interruptions like spot terminations.

This shift has freed engineering teams to focus on innovation, improved resilience and high availability, and future-proofed the architecture to support their ever-increasing demands. By only paying for compute and memory resources used during runtime, AppsFlyer also optimized costs and minimized charges for idle resources, marking a significant step forward in efficiency and scalability.

Solution overview

AppsFlyer’s previous architecture was built around self-managed Hadoop clusters running on Amazon Elastic Compute Cloud (Amazon EC2) and handled the scale and complexity of the data workflows. Although this setup supported operational needs, it required substantial manual effort to maintain, scale, and optimize.

AppsFlyer orchestrated over 100,000 daily workflows with Airflow, managing both streaming and batch operations. Streaming pipelines used Spark Streaming to ingest real-time data from Kafka, writing raw datasets to an Amazon Simple Storage Service (Amazon S3) data lake while simultaneously loading them into BigQuery and Google Cloud Storage to build logical data layers. Batch jobs then processed this raw data, transforming it into actionable datasets for internal teams, dashboards, and analytics workflows. Additionally, some processed outputs were ingested into external data sources, enabling seamless delivery of AppsFlyer insights to customers across the web.

For analytics and fast queries, real-time data streams were ingested into ClickHouse and Druid to power dashboards. Additionally, Iceberg tables were created from Delta Lake raw data and made accessible through Amazon Athena for further data exploration and analytics.

With the migration to EMR Serverless, AppsFlyer replaced its self-managed Hadoop clusters, bringing significant improvements to scalability, cost-efficiency, and operational simplicity.

Spark-based workflows, including streaming and batch jobs, were migrated to run on EMR Serverless and take advantage of the elasticity of EMR Serverless, dynamically scaling to meet workload demands.

This transition has significantly reduced operational overhead, alleviating the need for manual cluster management, so teams can focus more on data processing and less on infrastructure.

The following diagram illustrates the solution architecture.

This post reviews the main challenges and lessons learned by the team at AppsFlyer from this migration.

Challenges and lessons learned

Migrating a large-scale organization like AppsFlyer, with dozens of teams, from Hadoop to EMR Serverless was a significant challenge—especially because many R&D teams had limited or no prior experience managing infrastructure. To provide a smooth transition, AppsFlyer’s Data Infrastructure (DataInfra) team developed a comprehensive migration strategy that empowered the R&D teams to seamlessly migrate their pipelines.

In this section, we discuss how AppsFlyer approached the challenge and achieved success for the entire organization.

Centralized preparation by the DataInfra team

To provide a seamless transition to EMR Serverless, the DataInfra team took the lead in centralizing preparation efforts:

  • Clear ownership – Taking full responsibility for the migration, the team planned, guided, and supported R&D teams throughout the process.
  • Structured migration guide – A detailed, step-by-step guide was created to streamline the transition from Hadoop, breaking down the complexities and making it accessible to teams with limited infrastructure experience.

Building a strong support network

To make sure the R&D teams had the resources they needed, AppsFlyer established a robust support environment:

  • Data community – The primary resource for answering technical questions. It encouraged knowledge sharing across teams and was spearheaded by the DataInfra team.
  • Slack support channel – A dedicated channel where the DataInfra team actively responded to questions and guided teams through the migration process. This real-time support significantly reduced bottlenecks and helped teams resolve issues quickly.

Infrastructure templates with best practices

Recognizing the complexity of the team’s migration, the DataInfra team had standardized templates to help teams start quickly and efficiently:

  • Infrastructure as code (IaC) templates – They developed Terraform templates with best practices for building applications on EMR Serverless. These templates included code examples and real production workflows already migrated to EMR Serverless. Teams could quickly bootstrap their projects by using these ready-made templates.
  • Cross-account access solutions – Operating across multiple AWS accounts required managing secure access between EMR Serverless accounts (where jobs run) and data storage accounts (where datasets reside). To streamline this, a step-by-step module was developed for setting up cross-account access using Assume Role permissions. Additionally, a dedicated repository was created, so teams can define and automate role and policy creation, providing seamless and scalable access management.

Airflow integration

As AppsFlyer’s primary workflow scheduler, Airflow plays a critical role, making it essential to provide a seamless transition for its users.

AppsFlyer developed a dedicated Airflow operator for executing Spark jobs on EMR Serverless, carefully designed to replicate the functionality of the existing Hadoop-based Spark operator. In addition, a Python package was made available across all Airflow clusters with the relevant operators. This approach minimized code changes, allowing teams to transition seamlessly with minimal modifications.

Solving common permission challenges

To streamline permissions management, AppsFlyer developed targeted solutions for frequent use cases:

  • Comprehensive documentation – Provided detailed instructions for handling permissions for services like Athena, BigQuery, Vault, GIT, Kafka, and many more.
  • Standardized Spark defaults configuration for teams to apply to their applications – Included built-in solutions for collecting lineage from Spark jobs running on EMR Serverless, providing accountability and traceability.

Continuous engagement with R&D teams

To promote progress and maintain alignment across teams, AppsFlyer introduced the following measures:

  • Weekly meetings – Weekly status meetings to review the status of each team’s migration efforts. Teams shared updates, challenges, and commitments, fostering transparency and collaboration.
  • Assistance – Proactive assistance was provided for issues raised during meetings to minimize delays. This made sure that the teams were on track and had the support they needed to meet their commitments.

By implementing these strategies, AppsFlyer transformed the migration process from a daunting challenge into a structured and well-supported journey. Key outcomes included:

  • Empowered teams – R&D teams with minimal infrastructure experience were able to confidently migrate their pipelines.
  • Standardized practices – Infrastructure templates and predefined solutions provided consistency and best practices across the organization.
  • Reduced downtime – The custom Airflow operator and detailed documentation minimized disruptions to existing workflows.
  • Cross-account compatibility – With seamless cross-account access, teams could run jobs and access data efficiently.
  • Improved collaboration – The data community and Slack support channel fostered a sense of collaboration and shared responsibility across teams.

Migrating an entire organization’s data workflows to EMR Serverless is a complex task, but by investing in preparation, templates, and support, AppsFlyer successfully streamlined the process for all R&D teams in the company.

This approach can serve as a model for organizations undertaking similar migrations.

Spark application code management and deployment

For AppsFlyer data engineers, developing and deploying Spark applications is a core daily responsibility. The Data Platform team focuses on identifying and implementing the right set of tools and safeguards that would not only simplify the migration to EMR Serverless, but also streamline ongoing operations.

There are two different approaches available for running Spark code on EMR Serverless: custom container images and JARs or Python files. At the beginning of the exploration, custom images looked promising because it allows greater customization than JARs, which should allow the DataInfra team smoother migration for existing workloads. After deeper research, it was realized that custom images have great power, but come with a cost that in large scale would need to be evaluated. Custom images presented the following challenges:

  • Custom images are supported as of version 6.9.0, but some of AppsFlyer’s workloads used earlier versions.
  • EMR Serverless resources run from the moment EMR Serverless begins downloading the image until workers are stopped. This means a payment is done for aggregate vCPU, memory, and storage resources during the image download phase.
  • They required a different continuous integration and delivery (CI/CD) approach than compiling a JAR or Python file, leading to operational work that should be minimized as much as possible.

AppsFlyer decided to go all in with JARs and allow only in unique cases, where the customization required the use of custom images. Eventually, it was realized that using non-custom images was suitable for AppsFlyer use cases.

CI/CD perspective

From a CI/CD perspective, AppsFlyer’s DataInfra team decided to align with AppsFlyer’s GitOps vision, making sure that both infrastructure and application code are version-controlled, built, and deployed using Git operations.

The following diagram illustrates the GitOps approach AppsFlyer adopted.

JARs continuous integration

For CI, the process in charge of building the application artifacts, several options have been explored. The following key considerations drove the exploration process:

  • Use Amazon S3 as the native JAR source for EMR Serverless
  • Support different versions for the same job
  • Support staging and production environments
  • Allow hotfixes, patches, and rollbacks

Using AppsFlyer’s current external package repository led to challenges, because it required them to build a custom delivery into Amazon S3 or a complex runtime ability to fetch the code externally.

Using Amazon S3 directly also had several alternative approaches:

  • Buckets – Use single vs. separated buckets for staging and production
  • Versions – Use Amazon S3 native object versioning vs. uploading a new file
  • Hotfix – Override the same job’s JAR file vs. uploading a new one

Finally, the decision was to go with immutable builds for consistent deployment across the environments.

Each Spark job git repository pushes to the main branch, triggers a CI process to validate the semantic versioning (semver) assignment, compiles the JAR artifact, and uploads it to Amazon S3. Each artifact is uploaded to three different paths according to the version of the JAR, and also include a version tag for the S3 object:

  • <BucketName>/<SparkJobName>/<major>"."<minor>"."<patch>/app.jar
  • <BucketName>/<SparkJobName>/<major>"."<minor>"/app.jar
  • <BucketName>/<SparkJobName>/<major>/app.jar

AppsFlyer can now have deep granularity and assign each EMR Serverless job to a pinpointed version. Some jobs can run with the latest major version, and other stability and SLA sensitive jobs require a lock to a specific patch version.

EMR Serverless continuous deployment

Uploading the files to Amazon S3 was the final step in the CI process, which then leads to a different CD process.

CD is done by changing the infrastructure code, which is Terraform based, to point to the new JAR that was uploaded to Amazon S3. Then the staging or production application can start using the newly uploaded code and the process can be considered deployed.

Spark application rollbacks

If they need an application rollback, AppsFlyer points the EMR Serverless job IaC configuration from the current impaired JAR version to the previous stable JAR version in the relevant Amazon S3 path.

AppsFlyer believes that every automation impacting production, like CD, requires a breaking glass mechanism for an emergency situation. In such cases, AppsFlyer can manually override the needed S3 object (JAR file) while still using Amazon S3 versions in order to have better visibility and manual version control.

Single-job vs. multi-job applications

When using EMR Serverless, one important architectural decision is whether to create a separate application for each Spark job or use an automatic scaling application shared across multiple Spark jobs. The following table summarizes these considerations.

Aspect Single-Job Application Multi-Job Application
Logical Nature Dedicated application for each job. Shared application for multiple jobs.
Shared Configurations Limited shared configurations; each application is independently configured. Allows shared configurations through spark-defaults, including executors, memory settings, and JARs.
Isolation Maximum isolation; each job runs independently. Maintains job-level isolation through distinct IAM roles despite sharing the application.
Flexibility Flexible for unique configurations or resource requirements. Reduces overhead by reusing configurations and using automatic scaling.
Overhead Higher setup and management overhead due to multiple applications. Lower administrative overhead but requires careful resource contention management.
Use Cases Suitable for jobs with unique requirements or strict isolation needs. Ideal for related workloads that benefit from shared settings and dynamic scaling.

By balancing these considerations, AppsFlyer tailored its EMR Serverless usage to efficiently meet the demands of diverse Spark workloads across their teams.

Airflow operator: Simplifying the transition to EMR Serverless

Before the migration to EMR Serverless, AppsFlyer’s teams relied on a custom Airflow Spark operator created by the DataInfra team.

This operator, packaged as a Python library, was integrated into the Airflow environment and became a key component of the data workflows.

It provided essential capabilities, including:

  • Retries and alerts – Built-in retry logic and PagerDuty alert integration
  • AWS role-based access – Automatic fetching of AWS permissions based on role names
  • Custom defaults – Setting Spark configurations and package defaults tailored for each job
  • State management – Job state tracking

This operator streamlined running Spark jobs on Hadoop and was highly tailored to AppsFlyer’s requirements.

When moving to EMR Serverless, the team chose to build a custom Airflow operator to align with their existing Spark-based workflows. They already had dozens of Directed Acyclic Graphs (DAGs) in production, so with this approach, they could maintain their familiar interface, including custom handling for retries, alerting, and configurations—all without requiring broad changes across the board.

This abstraction provided a smoother migration by preserving the same development patterns and minimizing the migration efforts of adapting to the native operator semantics.

The DataInfra team developed a dedicated, custom, EMR Serverless operator to support the following goals:

  • Seamless migration – The operator was designed to closely mimic the interface of the existing Spark operator on Hadoop. This made sure that teams could migrate with minimal code changes.
  • Feature parity – They added the features missing from the native operator:
    • Built-in retry logic.
    • PagerDuty integration for alerts.
    • Automatic role-based permission fetching.
    • Default Spark configurations and package support for each job.
  • Simplified integration – It’s packaged as a Python library available in Airflow clusters. Teams could use the operator just like they did with the previous Spark operator.

The custom operator abstracts some of the underlying configurations required to submit jobs to EMR Serverless, aligning with AppsFlyer’s internal best practices and adding essential features.

The following is from an example DAG using the operator:

return SparkBatchJobEmrServerlessOperator(
    task_id=task_id,  # Unique task identifier in the DAG

    jar_file=jar_file,  # Path to the Spark job JAR file on S3
    main_class="<main class path>",

    spark_conf=spark_conf,

    app_id=default_args["<emr_serverless_application_id>"],  # EMR Serverless app ID
    execution_role=default_args["<job_execution_role_arn>"],  # IAM role for job execution

    polling_interval_sec=120,  # How often to poll for job status
    execution_timeout=timedelta(hours=1),  # Max allowed runtime

    retries=5,  # Retry attempts for failed jobs
    app_args=[],  # Arguments to pass to the Spark job

    depends_on_past=True,  # Ensure sequential task execution

    tags={'owner': '<team_tag>'},  # Metadata for ownership
    aws_assume_role="<my_aws_role>",  # Role for cross-account access

    alerting_policy=ALERT_POLICY_CRITICAL.with_slack_channel(sc),  # Alerting integration
    owner="<team_owner>",

    dag=dag  # DAG this task belongs to
)

Cross-account permissions on AWS: Simplifying EMRs workflows

AppsFlyer operates across multiple AWS accounts, creating a need for secure and efficient cross-account access. EMR Serverless jobs are executed in the production account, and the data they process resides in a separate data account. To enable seamless operation, Assume Role permissions are used to verify that EMR Serverless jobs running in the production account can access the data and services in the data account. The following diagram illustrates this architecture.

Below is a diagram demonstrating the cross-account permissions AppsFlyer adopted:

Role management strategy

To manage cross-account access efficiently, three distinct roles were created and maintained:

  • EMR role – Used for executing and managing EMR Serverless applications in the production account. Integrated directly into Airflow workers to make it available for the DAGs on the dedicated team Airflow cluster.
  • Execution role – Assigned to the Spark job running on EMR Serverless. Passed by the EMR role in the DAG code to provide seamless integration.
  • Data role – Resides in the data account and is assumed by the execution role to access data stored in Amazon S3 and other AWS services.

To enforce access boundaries, each role and policy is tagged with team-specific identifiers.
This makes sure that teams can only access their own data and roles, minimizing unauthorized access to other teams’ resources.

Simplifying Airflow migration

A streamlined process to make cross-account permissions transparent for teams migrating their workloads to EMR Serverless was developed:

  1. The EMR role is embedded into Airflow workers, making it available for DAGs in the dedicated Airflow cluster for each team:
{
   "Version":"2012-10-17",
   "Statement":[
      "..."{
         "Effect":"Allow",
         "Action":"iam:PassRole",
         "Resource":"arn:aws:iam::account-id:role/execution-role",
         "Condition":{
            "StringEquals":{
               "iam:ResourceTag/Team":"team-tag"
            }
         }
      }
   ]
}
  1. The EMR role automatically passes the execution role to the job within the DAG code:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::data-account-id:role/data-role",
      "Condition": {
        "StringEquals": {
          "iam:ResourceTag/Team": "team-tag"
        }
      }
    }
  ]
}
  1. The execution role assumes the data role dynamically during job execution to access the required data and services in the data account:

Allows the Execution Role in the Production account to assume the Data Role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::production-account-id:role/execution-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
  1. Policies, trust relationships, and role definitions are managed in a dedicated GitLab repository. GitLab CI/CD pipelines automate the creation and integration of roles and policies, providing consistency and reducing manual overhead.

Benefits of AppsFlyer’s approach

This approach offered the following benefits:

  • Seamless access – Teams no longer need to handle cross-account permissions manually because these are automated through preconfigured roles and policies, providing seamless and secure access to resources across accounts.
  • Scalable and secure – Role-based and tag-based permissions provide security and scalability across multiple teams and accounts. By using roles and tags, it alleviates the need to create separate hardcoded policies for each team or account. Instead, they can define generalized policies that scale automatically as new resources, accounts, or teams are added.
  • Automated management – GitLab CI/CD streamlines the deployment and integration of policies and roles, reducing manual effort while enhancing consistency. It also minimizes human errors, improves change transparency, and simplifies version management.
  • Flexibility for teams – Teams have the flexibility to use their own or native EMR Serverless operators while maintaining secure access to data.

By implementing a robust, automated cross-account permissions system, AppsFlyer has enabled secure and efficient access to data and services across multiple AWS accounts. This makes sure that teams can focus on their workloads without worrying about infrastructure complexities, accelerating their migration to EMR Serverless.

Integrating lineage into EMR Serverless

AppsFlyer developed a robust solution for column-level lineage collection to provide comprehensive visibility into data transformations across pipelines. Lineage data is stored in Amazon S3 and subsequently ingested into DataHub, AppsFlyer’s lineage and metadata management environment.

Currently, AppsFlyer collects column-level lineage from a variety of sources, including Amazon Athena, BigQuery, Spark, and more.

This section focuses on how AppsFlyer collects Spark column-level lineage specifically within the EMR Serverless infrastructure.

Collecting Spark lineage with Spline

To capture lineage from Spark jobs, AppsFlyer uses Spline, an open source tool designed for automated tracking of data lineage and pipeline structures.

AppsFlyer modified Spline’s default behavior to output a customized Spline object that aligns with AppsFlyer’s specific requirements. AppsFlyer adapted the Spline integration into both legacy and modern environments. In the pre-migration phase, they injected the Spline agent into Spark jobs through their customized Airflow Spark operator. In the post-migration phase, they integrated Spline directly into EMR Serverless applications.

The lineage workflow consists of the following steps:

  1. As Spark jobs execute, Spline captures detailed metadata about the queries and transformations performed.
  2. The captured metadata is exported as Spline object files to a dedicated S3 bucket.
  3. These Spline objects are processed into column-level lineage objects customized to fit AppsFlyer’s data architecture and requirements.
  4. The processed lineage data is ingested into DataHub, providing a centralized and interactive view of data dependencies.

The following figure is an example of a lineage diagram from DataHub.

Challenges and how AppsFlyer addressed them

AppsFlyer encountered the following challenges:

  • Supporting different EMR Serverless applications – Each EMR Serverless application has its own Spark and Scala version requirements.
  • Diverse operator usage – Teams often use custom or native EMR Serverless operators, making uniform Spline integration challenging.
  • Confirming universal adoption – They need to make sure Spark jobs across multiple accounts use the Spline agent for lineage tracking.

AppsFlyer addressed these challenges with the following solutions:

  • Version-specific Spline agents – AppsFlyer created a dedicated Spline agent for each EMR Serverless application version to match its Spark and Scala versions. For example, EMR Serverless application version 7.0.1 and Spline.7.0.1.
  • Spark defaults integration – They integrated the Spline agent into EMR Serverless application Spark defaults to verify lineage collection for jobs executed on the application—no job-specific modifications needed.
  • Automation for compliance – This process consists of the following steps:
    • Detect a newly created EMR Serverless application across accounts.
    • Verify that Spline is properly defined in the application’s Spark defaults.
    • Send a PagerDuty alert to the dedicated team if misconfigurations are detected.

Example integration with Terraform

To automate Spline integration, AppsFlyer used Terraform and local-exec to define Spark defaults for EMR Serverless applications. With Amazon EMR, you can set unified Spark configuration properties through spark-defaults, which are then applied to Spark jobs.

This configuration makes sure the Spline agent is automatically applied to every Spark job without requiring modifications to the Airflow operator or the job itself.

This robust lineage integration provides the following benefits:

  • Full visibility – Automatic lineage tracking provides detailed insights into data transformations
  • Seamless scalability – Version-specific Spline agents provide compatibility with EMR Serverless applications
  • Proactive monitoring – Automated compliance checks verify that lineage tracking is consistently enabled across accounts
  • Enhanced governance – Ingesting lineage data into DataHub provides traceability, supports audits, and fosters a deeper understanding of data dependencies

By integrating Spline with EMR Serverless applications, AppsFlyer has provided comprehensive and automated lineage tracking, so teams can understand their data pipelines better while meeting compliance requirements. This scalable approach aligns with AppsFlyer’s commitment to maintaining transparency and reliability throughout their data landscape.

Monitoring and observability

When embarking on a large migration, and as a day-to-day best-practice process, monitoring and observability are key parts of being able to run workloads successfully for stability, debugging, and cost.

AppsFlyer’s DataInfra team set several KPIs for monitoring and observability in EMR Serverless:

  • Monitor infrastructure-level metrics and logs:
    • EMR Serverless resource usage, including cost
    • EMR Serverless API usage
  • Monitor Spark application-level metrics and logs:
    • stdout and stderr logs
    • Spark engine metrics
  • Centralized observability over the existing environments, Datadog

Metrics

Using EMR Serverless native metrics, AppsFlyer’s DataInfra team set up several dashboards to support tracking both the migration and the day-to-day usage of EMR Serverless across the company. The following are the main metrics that were monitored:

  • Service quota usage metrics:
    • vCPU usage tracking (ResourceCount with vCPU dimension)
    • API usage tracking (API actual usage vs. API limits)
  • Application status metrics:
    • RunningJobs, SuccessJobs, FailedJobs, PendingJobs, CancelledJobs
  • Resource limits tracking:
    • MaxCPUAllowed vs. CPUAllocated
    • MaxMemoryAllowed vs. MemoryAllocated
    • MaxStorageAllowed vs. StorageAllocated
  • Worker-level metrics:
    • WorkerCpuAllocated vs. WorkerCpuUsed
    • WorkerMemoryAllocated vs. WorkerMemoryUsed
    • WorkerEphemeralStorageAllocated vs. WorkerEphemeralStorageUsed
  • Capacity allocation tracking:
    • Metrics filtered by CapacityAllocationType (PreInitCapacity vs. OnDemandCapacity)
    • ResourceCount
  • Worker type distribution:
    • Metrics filtered by WorkerType (SPARK_DRIVER vs. SPARK_EXECUTORS)
  • Job success rates over time:
    • SuccessJobs vs. FailedJobs ratio
    • SubmitedJobs vs. PendingJobs

The following screenshot shows an example of the tracked metrics.

Logs

For logs management, AppsFlyer’s DataInfra team explored several options:

Streamlining EMR Serverless log shipping to Datadog

Because AppsFlyer decided to keep their logs in an external logging environment, the DataInfra team aimed to reduce the number of components involved in the shipping process and minimize maintenance overhead. Instead of managing a Lambda based log shipper, they developed a custom Spark plugin that seamlessly exports logs from EMR Serverless to Datadog.

Companies already storing logs in Amazon S3 or CloudWatch Logs can take advantage of EMR Serverless native support for those environments. However, for teams needing a direct, real-time integration with Datadog, this approach alleviates the need for extra infrastructure, providing a more efficient and maintainable logging solution.

The custom Spark plugin offers the following capabilities:

  • Automated log export – Streams logs from EMR Serverless to Datadog
  • Fewer extra components – Alleviates the need for Lambda based log shippers
  • Secure API key management – Uses Vault instead of hardcoding credentials
  • Customizable logging – Supports custom Log4j settings and log levels
  • Full integration with Spark – Works on both driver and executor nodes

How the plugin works

In this section, we walk through the components of how the plugin works and provide a pseudocode overview:

  • Driver pluginLoggerDriverPlugin runs on the Spark driver to configure logging. The plugin fetches EMR job metadata, calls Vault to retrieve the Datadog API key, and configures logging settings.
initialize() {
  if (user provided log4j.xml) {
     Use custom log configuration
  } else {
     Fetch EMR job metadata (application name, job ID, tags)
     Retrieve Datadog API key from Vault
     Apply default logging settings
  }
}
  • Executor plugin – LoggerExecutorPlugin provides consistent logging across executor nodes. It inherits the driver’s log configuration and makes sure the executors use consistent logging
initialize() {
   fetch logging config from Driver
   apply log settings (log4j, log levels)
}
  • Main plugin – LoggerSparkPlugin registers the driver and executor plugins in Spark. It serves as the entry point for Spark and applies custom logging settings dynamically.
function registerPlugin() {
  return (driverPlugin, executorPlugin);
}
loginToVault(role, vaultAddress) {
    create AWS signed request
    authenticate with Vault
    return vault token
}

getDatadogApiKey(vaultToken, secretPath) {
    fetch API key from Vault
    return key
}

Set up the plugin

To set up the plugin, complete the following steps:

  1. Add the following dependencies to your project:
<dependency>
  <groupId>com.AppsFlyer.datacom</groupId>
  <artifactId>emr-serverless-logger-plugin</artifactId>
  <version><!-- insert version here --></version>
</dependency>
  1. Configure the Spark plugin. The following code enables the custom Spark plugin and assigns the Vault role to access the Datadog API key:

--conf "spark.plugins=com.AppsFlyer.datacom.emr.plugin.LoggerSparkPlugin"

--conf "spark.datacom.emr.plugin.vaultAuthRole=your_vault_role"

  1. Use a custom or default Log4j configuration:

--conf "spark.datacom.emr.plugin.location=classpath:my_custom_log4j.xml"

  1. Set the environment variables for different log levels. This adjusts the logging for specific packages.

--conf "spark.emr-serverless.driverEnv.ROOT_LOG_LEVEL=WARN"

--conf "spark.executorEnv.ROOT_LOG_LEVEL=WARN"

--conf "spark.emr-serverless.driverEnv.LOG_LEVEL=DEBUG"

--conf "spark.executorEnv.LOG_LEVEL=DEBUG"

  1. Configure the Vault and Datadog API key and verify secure Datadog API key retrieval.

By adopting this plugin, AppsFlyer was able to significantly simplify log shipping, reducing the number of moving parts while maintaining real-time log visibility in Datadog. This approach provides reliability, security, and ease of maintenance, making it an ideal solution for teams using EMR Serverless with Datadog.

Summary

Through their migration to EMR Serverless, AppsFlyer achieved a significant transformation in team autonomy and operational efficiency. Individual teams now have greater freedom to choose and build their own resources without depending on a central infrastructure team, and can work more independently and innovatively. The minimization of spot interruptions, which were common in their previous self-managed Hadoop clusters, has substantially improved stability and agility in their operations. Thanks to this autonomy and reliability, combined with the automatic scaling capabilities of EMR Serverless, the AppsFlyer teams can focus more on data processing and innovation rather than infrastructure management. The result is a more efficient, flexible, and self-sufficient development environment where teams can better respond to their specific needs while maintaining high performance standards.

Ruli Weisbach, AppsFlyer EVP of R&D, says,

“EMR-Serverless is a game changer for AppsFlyer; we are able to save significantly our cost with remarkably lower management overhead and maximal elasticity.”

If the AppsFlyer approach sparked your interest and you are thinking about implementing a similar solution in your organization, refer to the following resources:

Migrating to EMR Serverless can transform your organization’s data processing capabilities, offering a fully managed, cloud-based experience that automatically scales resources and eases the operational complexity of traditional cluster management, while enabling advanced analytics and machine learning workloads with greater cost-efficiency.


About the authors

Roy Ninio is an AI Platform Lead with deep expertise in scalable data platform and cloud-native architectures. At AppsFlyer, Roy led the design of a high-performance Data Lake handling PB of daily events, driven the adoption of EMR Serverless for dynamic big data processing, and architected lineage and governance systems across platforms.

Avichay Marciano is a Sr. Analytics Solutions Architect at Amazon Web Services. He has over a decade of experience in building large-scale data platforms using Apache Spark, modern data lake architectures, and OpenSearch. He is passionate about data-intensive systems, analytics at scale, and it’s intersection with machine learning.

Eitav Arditti is AWS Senior Solutions Architect with 15 years in AdTech industry, specializing in Serverless, Containers, Platform engineering, and Edge technologies. Designs cost-efficient, large-scale AWS architectures that leverage the cloud-native and edge computing to deliver scalable, reliable solutions for business growth.

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. Yonatan is an Apache Iceberg evangelist, helping customers design scalable, open data lakehouse architectures and adopt modern analytics solutions across industries.

Configure cross-account access of Amazon SageMaker Lakehouse multi-catalog tables using AWS Glue 5.0 Spark

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/configure-cross-account-access-of-amazon-sagemaker-lakehouse-multi-catalog-tables-using-aws-glue-5-0-spark/

Many organizations build and operate enterprise-wide data mesh architectures using the AWS Glue Data Catalog and AWS Lake Formation for their Amazon Simple Storage Service (Amazon S3) based data lakes. Now, with Amazon SageMaker Lakehouse, these organizations can unify their data analytics and AI/ML workflows while maintaining secure cross-account access without data replication. By centralizing access to a single copy of data and using the secure fine-grained permissions of Lake Formation, enterprises can accelerate their analytics initiatives while reducing operational complexity across business units.

SageMaker Lakehouse organizes data using logical containers called catalogs, enabling teams to seamlessly query and analyze data across their entire ecosystem—from S3 data lakes to Amazon Redshift warehouses—using familiar Apache Iceberg compatible tools. Organizations can either mount their existing data warehouse to the lakehouse or create new catalogs using Amazon Redshift managed storage. Built-in zero-ETL connectors reduce data silos by integrating various data sources, enabling unified analytics across teams. This seamless integration particularly benefits existing AWS customers who already use the Data Catalog and Lake Formation, because they can immediately take advantage of SageMaker Lakehouse capabilities.

AWS Glue is a serverless service that makes data integration simpler, faster, and cheaper. We launched AWS Glue 5.0 with upgraded Apache Spark 3.5.4 and Python 3.11. AWS Glue 5.0 adds support for SageMaker Lakehouse to unify your data across S3 data lakes and Redshift data warehouses.

In our previous blog post, we demonstrated the process of creating tables in both the Amazon Redshift managed catalog and Amazon Redshift federated catalog within a single AWS account. In this post, we show you how to share a Redshift table and Amazon S3 based Iceberg table from the account that owns the data to another account that consumes the data. In the recipient account, we run a join query on the shared data lake and data warehouse tables using Spark in AWS Glue 5.0. We walk you through the complete cross-account setup and provide the Spark configuration in a Python notebook.

Solution overview

To demonstrate the functionality of SageMaker Lakehouse multi-catalog tables using AWS Glue 5.0 Spark, let’s assume the retail company Example Retail Corp launches a campaign to understand their market and drive growth by country of operation. Their infrastructure consists of a Redshift data warehouse for structured data and an S3 data lake for structured and semi-structured data. The marketing team realizes that customer data is spread across those two systems and wants to use the support of their data engineering and analysts to analyze and provide insights. As a company, they prefer unified governance for managing data access while enabling a secure sharing mechanism for business and engineering teams.

Let’s see how they can achieve the goal using SageMaker Lakehouse. The solution is represented in the following diagram.

001-BDB 5089

The setup could be extended to enterprise data meshes where a data producer account will own the Redshift clusters, catalog the tables in a central governance account, and share with any number of consumer accounts from the central account. Multiple consumer accounts could analyze the shared Redshift tables using the SageMaker Lakehouse integrated analytics engines.

The solution also works for cross-Region table access. You would create a resource link for the catalog tables in an AWS Region where you want to run your analyses and create dashboards. For cross-Region resource link setup, refer to Setting up cross-Region table access.

Prerequisites

To implement this solution, you need the following prerequisites:

  • Two AWS accounts with Lake Formation cross-account sharing version 4 and Lake Formation administrator configured. Refer to the Lake Formation data administrator permissions and initial setup of Lake Formation.
  • Permissions from Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog granted to the Lake Formation administrator role on both accounts.
  • An S3 bucket in the producer account to host the sample Iceberg table data.
  • An AWS Identity and Access Management (IAM) role, LakeFormationS3Registration_custom, in the producer account to register your Iceberg table’s Amazon S3 location with Lake Formation. For details, refer to Registering an Amazon S3 location and Requirements for roles used to register locations.
  • An Amazon Redshift Serverless namespace in the producer account. Follow the instructions in Creating a data warehouse with Amazon Redshift Serverless to launch a serverless namespace with default settings.
  • Two sample datasets, orders and returns, in CSV format. This is Example Retail Corp’s data on their customer purchase and return trends. Their marketing team has collected these data in a Redshift table and Amazon S3 from various systems. The instructions to create these tables are provided in the appendix at the end of this post. After completing the steps in the appendix, you should have customerdb.returnstbl_iceberg in your default catalog and ordersdb.orderstbl in your Redshift Serverless application default namespace.
  • An IAM role, Glue-execution-role, in the consumer account, with the following policies:
    1. AWS managed policies AWSGlueServiceRole and AmazonRedshiftDataFullAccess.
    2. Create a new in-line policy with the following permissions and attach it:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "LFandRSserverlessAccess",
                  "Effect": "Allow",
                  "Action": [
                      "lakeformation:GetDataAccess",
                      "redshift-serverless:GetCredentials"
                  ],
                  "Resource": "*"
              },
              {
                  "Effect": "Allow",
                  "Action": "iam:PassRole",
                  "Resource": "*",
                  "Condition": {
                      "StringEquals": {
                          "iam:PassedToService": "glue.amazonaws.com"
                      }
                  }
              }
          ]
      }

    3. Add the following trust policy to Glue-execution-role, allowing AWS Glue to assume this role:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": [
                          "glue.amazonaws.com"
                      ]
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }

    Steps for producer account setup

    For the producer account setup, you can either use your IAM administrator role added as Lake Formation administrator or use a Lake Formation administrator role with permissions added as discussed in the prerequisites. For illustration purposes, we use the IAM admin role Admin added as Lake Formation administrator.

    002-BDB 5089

    Configure your catalog

    Complete the following steps to set up your catalog:

    1. Log in to AWS Management Console as Admin.
    2. On the Amazon Redshift console, follow the instructions in Registering Amazon Redshift clusters and namespaces to the AWS Glue Data Catalog.
    3. After the registration is initiated, you will see the invite from Amazon Redshift on the Lake Formation console.
    4. Select the pending catalog invitation and choose Approve and create catalog.

    003-BDB 5089

    1. On the Set catalog details page, configure your catalog:
      1. For Name, enter a name (for this post, redshiftserverless1-uswest2).
      2. Select Access this catalog from Apache Iceberg compatible engines.
      3. Choose the IAM role you created for the data transfer.
      4. Choose Next.

      004-BDB 5089

    2. On the Grant permissions – optional page, choose Add permissions.
      1. Grant the Admin user Super user permissions for Catalog permissions and Grantable permissions.
      2. Choose Add.

      005-BDB 5089

    3. Verify the granted permission on the next page and choose Next.
      006-BDB 5089
    4. Review the details on the Review and create page and choose Create catalog.
      007-BDB 5089

    Wait a few seconds for the catalog to show up.

    1. Choose Catalogs in the navigation pane and verify that the redshiftserverless1-uswest2 catalog is created.
      008-BDB 5089
    2. Explore the catalog detail page to verify the ordersdb.public database.
      009-BDB 5089
    3. On the database View dropdown menu, view the table and verify that the orderstbl table shows up.
      010-BDB 5089

    As the Admin role, you can also query the orderstbl in Amazon Athena and confirm the data is available.

    011-BDB 5089

    Grant permissions on the tables from the producer account to the consumer account

    In this step, we share the Amazon Redshift federated catalog database redshiftserverless1-uswest2:ordersdb.public and table orderstbl as well as the Amazon S3 based Iceberg table returnstbl_iceberg and its database customerdb from the default catalog to the consumer account. We can’t share the entire catalog to external accounts as a catalog-level permission; we just share the database and table.

    1. On the Lake Formation console, choose Data permissions in the navigation pane.
    2. Choose Grant.
      012-BDB 5089
    3. Under Principals, select External accounts.
    4. Provide the consumer account ID.
    5. Under LF-Tags or catalog resources, select Named Data Catalog resources.
    6. For Catalogs, choose the account ID that represents the default catalog.
    7. For Databases, choose customerdb.
      013-BDB 5089
    8. Under Database permissions, select Describe under Database permissions and Grantable permissions.
    9. Choose Grant.
      014-BDB 5089
    10. Repeat these steps and grant table-level Select and Describe permissions on returnstbl_iceberg.
    11. Repeat these steps again to grant database- and table-level permissions for the ordertbl table of the federated catalog database redshiftserverless1-uswest2/ordersdb.

    The following screenshots show the configuration for database-level permissions.

    015-BDB 5089

    016-BDB 5089

    The following screenshots show the configuration for table-level permissions.

    017-BDB 5089

    018-BDB 5089

    1. Choose Data permissions in the navigation pane and verify that the consumer account has been granted database- and table-level permissions for both orderstbl from the federated catalog and returnstbl_iceberg from the default catalog.
      019-BDB 5089

    Register the Amazon S3 location of the returnstbl_iceberg with Lake Formation.

    In this step, we register the Amazon S3 based Iceberg table returnstbl_iceberg data location with Lake Formation to be managed by Lake Formation permissions. Complete the following steps:

    1. On the Lake Formation console, choose Data lake locations in the navigation pane.
    2. Choose Register location.
      020-BDB 5089
    3. For Amazon S3 path, enter the path for your S3 bucket that you provided while creating the Iceberg table returnstbl_iceberg.
    4. For IAM role, provide the user-defined role LakeFormationS3Registration_custom that you created as a prerequisite.
    5. For Permission mode, select Lake Formation.
    6. Choose Register location.
      021-BDB 5089
    7. Choose Data lake locations in the navigation pane to verify the Amazon S3 registration.
      022-BDB 5089

    With this step, the producer account setup is complete.

    Steps for consumer account setup

    For the consumer account setup, we use the IAM admin role Admin, added as a Lake Formation administrator.

    The steps in the consumer account are quite involved. In the consumer account, a Lake Formation administrator will accept the AWS Resource Access Manager (AWS RAM) shares and create the required resource links that point to the shared catalog, database, and tables. The Lake Formation admin verifies that the shared resources are accessible by running test queries in Athena. The admin further grants permissions to the role Glue-execution-role on the resource links, database, and tables. The admin then runs a join query in AWS Glue 5.0 Spark using Glue-execution-role.

    Accept and verify the shared resources

    Lake Formation uses AWS RAM shares to enable cross-account sharing with Data Catalog resource policies in the AWS RAM policies. To view and verify the shared resources from producer account, complete the following steps:

    1. Log in to the consumer AWS console and set the AWS Region to match the producer’s shared resource Region. For this post, we use us-west-2.
    2. Open the Lake Formation console. You will see a message indicating there is a pending invite and asking you accept it on the AWS RAM console.
      023-BDB 5089
    3. Follow the instructions in Accepting a resource share invitation from AWS RAM to review and accept the pending invites.
    4. When the invite status changes to Accepted, choose Shared resources under Shared with me in the navigation pane.
    5. Verify that the Redshift Serverless federated catalog redshiftserverless1-uswest2, the default catalog database customerdb, the table returnstbl_iceberg, and the producer account ID under Owner ID column display correctly.
      024-BDB 5089
    6. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
    7. Search by the producer account ID.
      You should see the customerdb and public databases. You can further select each database and choose View tables on the Actions dropdown menu and verify the table names

    025-BDB 5089

    You will not see an AWS RAM share invite for the catalog level on the Lake Formation console, because catalog-level sharing isn’t possible. You can review the shared federated catalog and Amazon Redshift managed catalog names on the AWS RAM console, or using the AWS Command Line Interface (AWS CLI) or SDK.

    Create a catalog link container and resource links

    A catalog link container is a Data Catalog object that references a local or cross-account federated database-level catalog from other AWS accounts. For more details, refer to Accessing a shared federated catalog. Catalog link containers are essentially Lake Formation resource links at the catalog level that reference or point to a Redshift cluster federated catalog or Amazon Redshift managed catalog object from other accounts.

    In the following steps, we create a catalog link container that points to the producer shared federated catalog redshiftserverless1-uswest2. Inside the catalog link container, we create a database. Inside the database, we create a resource link for the table that points to the shared federated catalog table <<producer account id>>:redshiftserverless1-uswest2/ordersdb.public.orderstbl.

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Catalogs.
    2. Choose Create catalog.

    026-BDB 5089

    1. Provide the following details for the catalog:
      1. For Name, enter a name for the catalog (for this post, rl_link_container_ordersdb).
      2. For Type, choose Catalog Link container.
      3. For Source, choose Redshift.
      4. For Target Redshift Catalog, enter the Amazon Resource Name (ARN) of the producer federated catalog (arn:aws:glue:us-west-2:<<producer account id>>:catalog/redshiftserverless1-uswest2/ordersdb).
      5. Under Access from engines, select Access this catalog from Apache Iceberg compatible engines.
      6. For IAM role, provide the Redshift-S3 data transfer role that you had created in the prerequisites.
      7. Choose Next.

    027-BDB 5089

    1. On the Grant permissions – optional page, choose Add permissions.
      1. Grant the Admin user Super user permissions for Catalog permissions and Grantable permissions.
      2. Choose Add and then choose Next.

    028-BDB 5089

    1. Review the details on the Review and create page and choose Create catalog.

    Wait a few seconds for the catalog to show up.

    029-BDB 5089

    1. In the navigation pane, choose Catalogs.
    2. Verify that rl_link_container_ordersdb is created.

    030-BDB 5089

    Create a database under rl_link_container_ordersdb

    Complete the following steps:

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
    2. On the Choose catalog dropdown menu, choose rl_link_container_ordersdb.
    3. Choose Create database.

    Alternatively, you can choose the Create dropdown menu and then choose Database.

    1. Provide details for the database:
      1. For Name, enter a name (for this post, public_db).
      2. For Catalog, choose rl_link_container_ordersdb.
      3. Leave Location – optional as blank.
      4. Under Default permissions for newly created tables, deselect Use only IAM access control for new tables in this database.
      5. Choose Create database.

    031-BDB 5089

    1. Choose Catalogs in the navigation pane to verify that public_db is created under rl_link_container_ordersdb.

    032-BDB 5089

    Create a table resource link for the shared federated catalog table

    A resource link to a shared federated catalog table can reside only inside the database of a catalog link container. A resource link for such tables will not work if created inside the default catalog. For more details on resource links, refer to Creating a resource link to a shared Data Catalog table.

    Complete the following steps to create a table resource link:

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Tables.
    2. On the Create dropdown menu, choose Resource link.

    033-BDB 5089

    1. Provide details for the table resource link:
      1. For Resource link name, enter a name (for this post, rl_orderstbl).
      2. For Destination catalog, choose rl_link_container_ordersdb.
      3. For Database, choose public_db.
      4. For Shared table’s region, choose US West (Oregon).
      5. For Shared table, choose orderstbl.
      6. After the Shared table is selected, Shared table’s database and Shared table’s catalog ID should get automatically populated.
      7. Choose Create.

    034-BDB 5089

    1. In the navigation pane, choose Databases to verify that rl_orderstbl is created under public_db, inside rl_link_container_ordersdb.

    035-BDB 5089

    036-BDB 5089

    Create a database resource link for the shared default catalog database.

    Now we create a database resource link in the default catalog to query the Amazon S3 based Iceberg table shared from the producer. For details on database resource links, refer Creating a resource link to a shared Data Catalog database.

    Though we are able to see the shared database in the default catalog of the consumer, a resource link is required to query from analytics engines, such as Athena, Amazon EMR, and AWS Glue. When using AWS Glue with Lake Formation tables, the resource link needs to be named identically to the source account’s resource. For additional details on using AWS Glue with Lake Formation, refer to Considerations and limitations.

    Complete the following steps to create a database resource link:

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
    2. On the Choose catalog dropdown menu, choose the account ID to choose the default catalog.
    3. Search for customerdb.

    You should see the shared database name customerdb with the Owner account ID as that of your producer account ID.

    1. Select customerdb, and on the Create dropdown menu, choose Resource link.
    2. Provide details for the resource link:
      1. For Resource link name, enter a name (for this post, customerdb).
      2. The rest of the fields should be already populated.
      3. Choose Create.
    3. In the navigation pane, choose Databases and verify that customerdb is created under the default catalog. Resource link names will show in italicized font.

    037-BDB 5089

    Verify access as Admin using Athena

    Now you can verify your access using Athena. Complete the following steps:

    1. Open the Athena console.
    2. Make sure an S3 bucket is provided to store the Athena query results. For details, refer to Specify a query result location using the Athena console.
    3. In the navigation pane, verify both the default catalog and federated catalog tables by previewing them.
    4. You can also run a join query as follows. Pay attention to the three-point notation for referring to the tables from two different catalogs:
    SELECT
    returns_tb.market as Market,
    sum(orders_tb.quantity) as Total_Quantity
    FROM rl_link_container_ordersdb.public_db.rl_orderstbl as orders_tb
    JOIN awsdatacatalog.customerdb.returnstbl_iceberg as returns_tb
    ON orders_tb.order_id = returns_tb.order_id
    GROUP BY returns_tb.market;

    038-BDB 5089

    This verifies the new capability of SageMaker Lakehouse, which enables accessing Redshift cluster tables and Amazon S3 based Iceberg tables in the same query, across AWS accounts, through the Data Catalog, using Lake Formation permissions.

    Grant permissions to Glue-execution-role

    Now we will share the resources from the producer account with additional IAM principals in the consumer account. Usually, the data lake admin grants permissions to data analysts, data scientists, and data engineers in the consumer account to do their job functions, such as processing and analyzing the data.

    We set up Lake Formation permissions on the catalog link container, databases, tables, and resource links to the AWS Glue job execution role Glue-execution-role that we created in the prerequisites.

    Resource links allow only Describe and Drop permissions. You need to use the Grant on target configuration to provide database Describe and table Select permissions.

    Complete the following steps:

    1. On the Lake Formation console, choose Data permissions in the navigation pane.
    2. Choose Grant.
    3. Under Principals, select IAM users and roles.
    4. For IAM users and roles, enter Glue-execution-role.
    5. Under LF-Tags or catalog resources, select Named Data Catalog resources.
    6. For Catalogs, choose rl_link_container_ordersdb and the consumer account ID, which indicates the default catalog.
    7. Under Catalog permissions, select Describe for Catalog permissions.
    8. Choose Grant.

    039-BDB 5089

    040-BDB 5089

    1. Repeat these steps for the catalog rl_link_container_ordersdb:
      1. On the Databases dropdown menu, choose public_db.
      2. Under Database permissions, select Describe.
      3. Choose Grant.
    2. Repeat these steps again, but after choosing rl_link_container_ordersdb and public_db, on the Tables dropdown menu, choose rl_orderstbl.
      1. Under Resource link permissions, select Describe.
      2. Choose Grant.
    3. Repeat these steps to grant additional permissions to Glue-execution-role.
      1. For this iteration, grant Describe permissions on the default catalog databases public and customerdb.
      2. Grant Describe permission on the resource link customerdb.
      3. Grant Select permission on the tables returnstbl_iceberg and orderstbl.

    The following screenshots show the configuration for database public and customerdb permissions.

    041-BDB 5089

    042-BDB 5089

    The following screenshots show the configuration for resource link customerdb permissions.

    043-BDB 5089

    044-BDB 5089

    The following screenshots show the configuration for table returnstbl_iceberg permissions.

    045-BDB 5089

    046-BDB 5089

    The following screenshots show the configuration for table orderstbl permissions.

    047-BDB 5089

    048-BDB 5089

    1. In the navigation pane, choose Data permissions and verify permissions on Glue-execution-role.

    049-BDB 5089

    Run a PySpark job in AWS Glue 5.0

    Download the PySpark script LakeHouseGlueSparkJob.py. This AWS Glue PySpark script runs Spark SQL by joining the producer shared federated orderstbl table and Amazon S3 based returns table in the consumer account to analyze the data and identify the total orders placed per market.

    Replace <<consumer_account_id>> in the script with your consumer account ID. Complete the following steps to create and run an AWS Glue job:

    1. On the AWS Glue console, in the navigation pane, choose ETL jobs.
    2. Choose Create job, then choose Script editor.

    050-BDB 5089

    1. For Engine, choose Spark.
    2. For Options, choose Start fresh.
    3. Choose Upload script.
    4. Browse to the location where you downloaded and edited the script, select the script, and choose Open.
    5. On the Job details tab, provide the following information:
      1. For Name, enter a name (for this post, LakeHouseGlueSparkJob).
      2. Under Basic properties, for IAM role, choose Glue-execution-role.
      3. For Glue version, select Glue 5.0.
      4. Under Advanced properties, for Job parameters, choose Add new parameter.
      5. Add the parameters --datalake-formats = iceberg and --enable-lakeformation-fine-grained-access = true.
    6. Save the job.
    7. Choose Run to execute the AWS Glue job, and wait for the job to complete.
    8. Review the job run details from the Output logs

    051-BDB 5089

    052-BDB 5089

    Clean up

    To avoid incurring costs on your AWS accounts, clean up the resources you created:

    1. Delete the Lake Formation permissions, catalog link container, database, and tables in the consumer account.
    2. Delete the AWS Glue job in the consumer account.
    3. Delete the federated catalog, database, and table resources in the producer account.
    4. Delete the Redshift Serverless namespace in the producer account.
    5. Delete the S3 buckets you created as part of data transfer in both accounts and the Athena query results bucket in the consumer account.
    6. Clean up the IAM roles you created for the SageMaker Lakehouse setup as part of the prerequisites.

    Conclusion

    In this post, we illustrated how to bring your existing Redshift tables to SageMaker Lakehouse and share them securely with external AWS accounts. We also showed how to query the shared data warehouse and data lakehouse tables in the same Spark session, from a recipient account, using Spark in AWS Glue 5.0.

    We hope you find this useful to integrate your Redshift tables with an existing data mesh and access the tables using AWS Glue Spark. Test this solution in your accounts and share feedback in the comments section. Stay tuned for more updates and feel free to explore the features of SageMaker Lakehouse and AWS Glue versions.

    Appendix: Table creation

    Complete the following steps to create a returns table in the Amazon S3 based default catalog and an orders table in Amazon Redshift:

    1. Download the CSV format datasets orders and returns.
    2. Upload them to your S3 bucket under the corresponding table prefix path.
    3. Use the following SQL statements in Athena. First-time users of Athena should refer to Specify a query result location.
    CREATE DATABASE customerdb;
    CREATE EXTERNAL TABLE customerdb.returnstbl_csv(
      `returned` string, 
      `order_id` string, 
      `market` string)
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY '\;' 
    LOCATION
      's3://<your-S3-bucket>/<prefix-for-returns-table-data>/'
    TBLPROPERTIES (
      'skip.header.line.count'='1'
    );
    
    select * from customerdb.returnstbl_csv limit 10; 
    

    053-BDB 5089

    1. Create an Iceberg format table in the default catalog and insert data from the CSV format table:
    CREATE TABLE customerdb.returnstbl_iceberg(
      `returned` string, 
      `order_id` string, 
      `market` string)
    LOCATION 's3://<your-producer-account-bucket>/returnstbl_iceberg/' 
    TBLPROPERTIES (
      'table_type'='ICEBERG'
    );
    
    INSERT INTO customerdb.returnstbl_iceberg
    SELECT *
    FROM returnstbl_csv;  
    
    SELECT * FROM customerdb.returnstbl_iceberg LIMIT 10; 
    

    054-BDB 5089

    1. To create the orders table in the Redshift Serverless namespace, open the Query Editor v2 on the Amazon Redshift console.
    2. Connect to the default namespace using your database admin user credentials.
    3. Run the following commands in the SQL editor to create the database ordersdb and table orderstbl in it. Copy the data from your S3 location of the orders data to the orderstbl:
    create database ordersdb;
    use ordersdb;
    
    create table orderstbl(
      row_id int, 
      order_id VARCHAR, 
      order_date VARCHAR, 
      ship_date VARCHAR, 
      ship_mode VARCHAR, 
      customer_id VARCHAR, 
      customer_name VARCHAR, 
      segment VARCHAR, 
      city VARCHAR, 
      state VARCHAR, 
      country VARCHAR, 
      postal_code int, 
      market VARCHAR, 
      region VARCHAR, 
      product_id VARCHAR, 
      category VARCHAR, 
      sub_category VARCHAR, 
      product_name VARCHAR, 
      sales VARCHAR, 
      quantity bigint, 
      discount VARCHAR, 
      profit VARCHAR, 
      shipping_cost VARCHAR, 
      order_priority VARCHAR
      );
    
    copy orderstbl
    from 's3://<your-s3-bucket>/ordersdatacsv/orders.csv' 
    iam_role 'arn:aws:iam::<producer-account-id>:role/service-role/<your-Redshift-Role>'
    CSV 
    DELIMITER ';'
    IGNOREHEADER 1
    ;
    
    select * from ordersdb.orderstbl limit 5;
    


    About the Authors

    055-BDB 5089Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. She collaborates with the service team to enhance product features, works with AWS customers and partners to architect lakehouse solutions, and establishes best practices for data governance.

    056-BDB 5089Subhasis Sarkar is a Senior Data Engineer with Amazon. Subhasis thrives on solving complex technological challenges with innovative solutions. He specializes in AWS data architectures, particularly data mesh implementations using AWS CDK components.

Monitoring network traffic in AWS Lambda functions

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/monitoring-network-traffic-in-aws-lambda-functions/

Network monitoring provides essential visibility into cloud application traffic patterns across large organizations. It enables security and compliance teams to detect anomalies and maintain compliance, while allowing development teams to troubleshoot issues, optimize performance, and track costs in multi-tenant software as a service (SaaS) environments. Implementing robust network monitoring allows organizations to effectively manage their security, compliance, and operational requirements while continuously enhancing their applications.

In this post, you will learn methods for network monitoring in AWS Lambda functions and how to apply them to your scenarios.

Overview

Lambda is a secure and highly scalable serverless compute service where each function operates in an isolated execution environment with strict security boundaries. This architecture delivers key advantages, such as enhanced security, automatic compute capacity scaling, and minimal operational overhead. Minimizing infrastructure management allows Lambda to enable organizations to redirect their focus from managing servers to other critical aspects, such as performance optimization and network traffic analysis. In turn, these enable organizations to build more secure and efficient applications.

Lambda network monitoring addresses diverse organizational needs, such as compliance requirements for audit logs and anomaly detection, business needs for traffic metering and customer billing, and development needs for troubleshooting network issues. Traditional agent-based or host-based monitoring methods often aren’t compatible with the strongly isolated, ephemeral execution environment of Lambda, which necessitates alternative approaches to meet these critical requirements.

You can use AWS-native, integrated network monitoring solutions, such as Amazon Virtual Private Cloud (Amazon VPC) Flow Logs, or build your own custom solution, as detailed in this post. Each solution offers distinct capabilities with varying levels of granularity and real-time visibility. When choosing an approach, you must evaluate key factors such as the desired data granularity, operational complexity, latency tolerance, and cost implications.

Using VPC Flow Logs

VPC Flow Logs is the AWS-recommended tool for network activity monitoring. If your scenario necessitates monitoring of the network activity of Lambda functions, then you can attach these functions to a VPC and enable Flow Logs. This captures detailed network traffic data, such as source and destination IPs, ports, protocols, and traffic volume for all traffic flowing through the network interfaces used by your functions.

When you attach your functions to a VPC, the Lambda service automatically creates an Elastic Network Interface (ENI) for functions to communicate with VPC-based resources. By default, VPC-attached functions can only access private resources within the VPC. If you need your functions to communicate with other AWS services, then you should use VPC Endpoints. If your function needs to communicate with the public internet, then you should use an NAT Gateway for egress traffic. The following diagram shows how you can use VPC Flow Logs for Lambda functions.

Flow Logs provide detailed insights into the IP traffic flowing to and from the network interfaces within a VPC, offering valuable data for network audits and activity monitoring. This approach promotes a clear separation of concerns between application and networking layers, with VPC constructs typically managed by a dedicated operations or infrastructure team.

VPC Flow Logs provides a robust network monitoring solution. However, when using it with Lambda functions, you should evaluate the following considerations:

  • It captures ENI-level information. ENIs can be reused across multiple functions, thus it won’t provide per-function or per-invoke granularity.
  • It only logs IP addresses, not DNS names (if capturing DNS names is a requirement for you).
  • It introduces infrastructure management into your serverless applications. You must learn VPC constructs or involve your infrastructure team.
  • Potential added data transfer costs. Go to the pricing for NAT Gateway, VPC Endpoints, and Flow Logs for more details.

The following sections explore Lambda network monitoring methods that can either be augmented with VPC Flow Logs for better granularity or used without attaching your functions to a VPC.

Proxying network traffic

You can configure the Lambda runtime to route egress network traffic through a side-car proxy that runs as a Lambda layer within the Lambda execution environment and logs network activity. The proxy layer should be agnostic to the language runtime. AWS recommends that you use compiled languages such as Rust or Golang for maximum reusability and minimal added latency. The following diagram shows emitting logs from a network proxy layer.

Applying proxy configuration differs across language runtimes. In Python you set proxy_http and proxy_https environment variables. Java uses JVM flags. Node.js doesn’t provide a native way to configure proxy using command line flags or environment variables. Therefore, you need to make code changes, such as configuring a proxy for your AWS SDK or using a third-party open source libraries such as global-agent or Interceptors.

The proxy approach is most suitable if you’re okay with making function code or configuration changes that might vary across runtimes. Furthermore, adding a network proxy server process inside the execution environment consumes resources shared with the function code, which can add network latency.

Refer to the post Enhancing runtime security and governance with the AWS Lambda Runtime API proxy extension for ways to intercept inbound requests/responses between the Lambda Runtime API and Runtime process.

Runtime-agnostic techniques

Following techniques use the fact that the Lambda execution environment is a Linux-based micro-VM. Lambda runtimes operate within a restricted user space that prevents the use of traditional OS-level monitoring tools needing elevated privileges, such as tcpdump, iptables, ptrace, or eBPF. The following techniques are specifically designed to work under these user space constraints, allowing their use without needing elevated privileges.

Reading OS networking layer information from procfs

Use this method when you need to obtain the OS-level information, such as metering transferred bytes, or see all open connections. You can use it to implement tenant chargebacks or detect network traffic pattern changes. The method is based on the proc pseudo-filesystem (also known as procfs) available in Linux OS, which provides an interface to kernel data and allows you to read OS-level information. For example, /proc/cpuinfo and /proc/meminfo pseudo-files provide information about current CPU and memory usage, while /proc/net/* provides you with network layer information. Reading /proc/net/tcp and /proc/net/udp gives you a list of active TCP/UDP connections, such as remote IP addresses and ports. Reading /proc/net/dev provides the list of network devices with detailed usage statistics, such as bytes transferred and received.

“The procfs method provides a simple but powerful way to collect critical network telemetry data from Lambda functions, such as up-to-date network statistics and file descriptor counts, which enables us to monitor outbound connections from Lambda functions. Better yet, it enables us to support multiple Lambda runtimes with a single implementation in our Rust-based, next-generation Lambda Extension”—AJ Stuyvenberg, Staff Engineer at Datadog.

The sample project provides the LambdaNetworkMonitor-Procfs stack to show this technique. For every invocation, the function reads /proc/net/dev, and sends network statistics to log and Amazon CloudWatch Metrics, as shown in the following figure.

Reading the /proc/net/dev pseudo-file is a simple and effective way to monitor Lambda functions networking without adding latency. However, it doesn’t capture DNS names and the IP addresses to which they resolve.

Intercepting network-related libc calls

Low-level network operations in Linux, such as DNS lookup and connection creation, are managed by the C Standard Library (libc). You can intercept libc function calls made by Lambda runtimes to monitor network traffic at the OS level. This is a significantly more advanced and complex mechanism, enabling visibility into OS-level activities, as shown in the following figure.

Intercepting libc function calls, such as getaddrinfo (DNS lookup) and connect, allows you to log details such as DNS name, IP addresses, ports, and protocols at a granular level, as shown in the following figure. This method allows you to capture comprehensive information about DNS queries and initiated network connections. It can provide precise per-function and per-invoke network monitoring, such as hostnames and IP addresses.

The following is a simplified flow:

  1. A function sends a request to example.com.
  2. The runtime calls libc getaddrinfo to resolve the DNS name.
  3. You intercept this call, log the DNS name, and forward the call to the original libc getaddrinfo.
  4. The original libc getaddrinfo returns resolved IP addresses. You log them and return to the runtime.
  5. The runtime calls libc connect method to create a new connection.
  6. You intercept this call, log the IP address, forward the call to the original libc connect, and so on.

To implement this technique, you need to use a language that compiles to a shared object (.so) file. To implement libc method signatures you should use a language such as C, C++, or Rust. The following sample code uses Rust for its strong safety guarantees and implements overriding the getaddrinfo libc function (DNS lookup).

pub extern "C" fn getaddrinfo(
    node: *const c_char,
    service: *const c_char,
    hints: *const addrinfo,
    res: *mut *mut addrinfo,
) -> c_int {
    let printable_node = format!("{}", PrintableCString::from(node));
    let printable_service = format!("{}", PrintableCString::from(service));

    log::debug!("> getaddrinfo node={printable_node} service={printable_service}");

    LIBC_GETADDRINFO(node, service, hints, res)
}

The following should be considered:

  • The method signature fully matches the libc function signature of the same name.
  • The node and service arguments would commonly be DNS name and port.
  • At the end, the function invokes the real libc getaddrinfo and returns the result.

When compiled to an .so file, you must package it as a Lambda layer, attach the layer to your functions, and configure the execution environment to use it through the Linux dynamic linker capability called preloading. Set the LD_PRELOAD environment variable to point to your .so file to instruct the OS to preload it before it loads any other library, such as libc. You can configure this either as a function environment variable or through a wrapper script, as shown in the following figure.

#!/bin/sh
echo "running wrapper..."
args=("$@")
export LD_PRELOAD=/opt/liblambda_network_monitor.so
exec "${args[@]}"

This technique allows you to get detailed connection-level monitoring such as DNS lookups, resolved IP addresses, ports, protocols, and count bytes transferred. Depending on your requirements, it can be adapted to track further network-related information as needed.

The sample project provides the LambdaNetworkMonitor-LdPreload stack to show this technique, as shown in the following figure. For every invocation, the function prints intercepted libc functions, DNS names, and connection IP addresses.

Considerations

  • OS-level techniques necessitate strong understanding of Linux fundamentals and careful implementation. AWS recommends that you closely evaluate which methods to use and keep your solution minimally invasive.
  • LD_PRELOAD is an advanced low-level technique that allows you to override libc methods and OS behavior. Incorrectly implemented hooks could lead to undefined behavior and crashes. Make sure your code is robust to recursion and thread-safe. Test it thoroughly in a controlled environment before using it in production.
  • The LD_PRELOAD technique relies on dynamic linking. It works with dynamically linked runtimes such as Node.js, Python, and Java. It doesn’t work with runtimes that use static linking, such as Golang.
  • When using runtime-dependent functionality, consider using Lambda runtime update controls to make sure that runtimes are only updated when the next function update occurs.
  • Always install layers from trusted sources only. Use infrastructure as code (IaC) tools to attach and audit layer configurations, such as AWS Identity and Access Management (IAM) permissions.

Conclusion

Monitoring network traffic in Lambda functions is a common requirement for many organizations. In case you need to audit IP-level network logs, AWS recommends that you to attach your functions to a VPC and use Flow Logs. If you need per-function or per-invoke granularity, then you can augment it with techniques described in this post.

These techniques provide valuable insights for debugging, auditing, and monitoring, but they also necessitate a solid understanding of Linux fundamentals and careful implementation. They offer a practical solution for organizations that need Lambda network monitoring, allowing them to troubleshoot issues and maintain compliance.

To learn more about Serverless architectures and asynchronous Lambda invocation patterns, go to Serverless Land.

How to use AWS Transfer Family and GuardDuty for malware protection

Post Syndicated from James Abbott original https://aws.amazon.com/blogs/security/how-to-use-aws-transfer-family-and-guardduty-for-malware-protection/

Organizations often need to securely share files with external parties over the internet. Allowing public access to a file transfer server exposes the organization to potential threats, such as malware-infected files uploaded by threat actors or inadvertently by genuine users. To mitigate this risk, companies can take steps to help make sure that files received through public channels are scanned for malware before processing.

This post demonstrates how to use AWS Transfer Family and Amazon GuardDuty to scan files uploaded through a secure FTP (SFTP) server for malware as part of an overall transfer workflow. For readers who might have read an earlier blog post on this topic, the key difference is that this solution is fully managed and doesn’t require the deployment of compute resources. GuardDuty automatically updates malware signatures every 15 minutes instead of using a container image for scanning, avoiding the need for manual patching to keep the signatures up to date.

Prerequisites

To deploy the solution in this post, you will need:

  • An AWS account: You need access to AWS to deploy this solution. If you don’t have an account that you can use, see Start building on AWS today.
  • AWS CLI: Install and configure the AWS Command Line Interface (AWS CLI) to be authenticated to your AWS account. Set up the environment variables for your AWS account using the access token and secret access key for your environment.
  • Git: You will use Git to pull down the example code from GitHub.
  • Terraform: You’ll use Terraform to run the automation. Follow the Terraform installation instructions to download and set up Terraform.

Solution overview

This solution uses Transfer Family and GuardDuty. Transfer Family provides a secure file transfer service that you can use to set up an SFTP server, and GuardDuty is an intelligent threat detection service. GuardDuty monitors for malicious activity and anomalous behavior to protect AWS accounts, workloads, and data. At a high level, the solution uses the following steps:

  • A user uploads a file through a Transfer Family SFTP server.
  • A Transfer Family managed workflow invokes AWS Lambda to execute an AWS Step Functions workflow.
    • The workflow begins only after a successful file upload.
    • Partial uploads to the SFTP server will invoke an error handling Lambda function to report a partial upload error.
  • A step function state machine invokes a Lambda function to move uploaded files to an Amazon Simple Storage Service (Amazon S3) bucket for processing and then starts scanning using GuardDuty.
  • The GuardDuty scan result is sent as a callback to the step function.
  • Infected files are moved or cleaned.
  • The workflow sends the user the results through an Amazon Simple Notification Service (Amazon SNS) topic. This can be a notification of an error or malicious upload during the scan or notification of a successful upload and a clean scan for further processing.

Solution architecture and walkthrough

The solution uses GuardDuty Malware Protection for S3 to scan newly uploaded objects to the S3 bucket. You can use this feature of GuardDuty to set up a malware protection plan for an S3 bucket at the bucket level or to watch for specific object prefixes.

Figure 1: Solution architecture

Figure 1: Solution architecture

The following steps (shown in Figure 1) describe the workflow for this solution starting from the point the file is uploaded until it’s scanned and marked as safe or as infected, leading to subsequent steps that can be customized based on your use case.

  1. A file is uploaded using the SFTP protocol through Transfer Family.
  2. If the file is successfully uploaded, Transfer Family uploads the file to the S3 bucket called Unscanned and the Managed Workflow Complete workflow is triggered. This is the workflow used to handle successful uploads and invokes the Step Function Invoker Lambda function.
  3. The Step Function Invoker starts the state machine and kicks off the first step in the process by invoking the GuardDuty – Scan Lambda function.
  4. The GuardDuty – Scan function moves the file to the Processing bucket. This is the bucket from which the files will be scanned.
  5. When an object upload activity is detected, GuardDuty automatically scans the object. In this implementation, a malware protection plan is created for the Processing bucket.
  6. When a scan completes, GuardDuty publishes the scan result to Amazon EventBridge.
  7. An EventBridge rule has been created to invoke a Lambda Callback function whenever a scan event has completed. EventBridge will invoke the function with an event that contains the scan results. See Monitoring S3 object scans with Amazon EventBridge for an example.
  8. The Lambda Callback function notifies the GuardDuty – Scan task using the callback task integration pattern. The results of the GuardDuty scan are returned to the GuardDuty – Scan function and these results are passed to the Move File task.
  9. If the result is a clean scan with no threats detected, the Move File task will place the file in the Clean S3 bucket, indicating that the file is successfully scanned and safe for further processing.
  10. At this point, the Move File function publishes a notification to the Success SNS topic to notify the subscribers.
  11. If the result indicates that the file is malicious, the Move File function will instead move the file to the Quarantine S3 bucket for further investigation. The function will also delete the file from the Processing bucket and publish a notification in the Error topic in SNS to notify the user of a potential malicious file being uploaded.
  12. If the file upload is unsuccessful and the file isn’t fully uploaded, then Transfer Family will trigger the Managed Workflow Partial workflow.
  13. Managed Workflow Partial is an error handling workflow and invokes the Error Publisher function, which is used for reporting errors that occur anywhere in the workflow.
  14. The Error Publisher function identifies the type of error—whether it’s because of the partial upload or an issue elsewhere in the workflow—and sets the error status accordingly. It will then publish an error message to Error Topic in SNS.
  15. The GuardDuty – Scan task has a timeout to make sure that an event is published to Error Topic to prompt a manual intervention to investigate further if the file isn’t successfully scanned. If the GuardDuty – Scan task fails, the Error clean up Lambda function is invoked.

Finally, there’s an S3 Lifecycle policy attached to the Processing bucket. This is to make sure that no file is left in the Processing bucket for more than one day.

Code repository

The GitHub AWS-samples repository has a sample implementation developed using Terraform and Python-based Lambda functions to implement this solution. The same solution can also be implemented using AWS CloudFormation. The code has the components needed to deploy the entire workflow to demonstrate the abilities of Transfer Family and the GuardDuty malware protection plan.

Install the solution

Use the following steps to deploy this solution to your test environment.

  1. Clone the repository to your working directory using Git.
  2. Navigate to the root directory of your cloned project directory.
  3. Update the terraform locals.tf file with the values of your choice for the S3 bucket names, SFTP server names, and other variables.
  4. Run terraform plan.
  5. If everything looks good, run a terraform apply and enter yes to create the resources.

Clean up

After testing and exploring the solution, it’s important to clean up the resources you created to avoid incurring unnecessary costs. To delete the resources created by this solution, navigate to the root directory of your cloned project and run the following command:

terraform destroy

This command will remove the resources created by Terraform, including the SFTP server, S3 buckets, Lambda functions, and other components. Confirm the deletion by entering yes when prompted.

Conclusion

By using the approach outlined in the post, you can make sure that the files received over SFTP and uploaded to your S3 bucket are scanned for threats and are safe for further processing. The solution reduces the exposure surface by making sure that public uploads are scanned in a safe environment before they’re sent to other components of your system.

If you have feedback about this post, submit comments in the Comments section below.

James Abbott

James Abbott

James is a Principal Solutions Architect at AWS, working in Global Financial Services. When not in the office he enjoys mountain biking in North Carolina.

Santhosh Srinivasan

Santhosh Srinivasan

Santhosh is a Sr. Cloud Application Architect with the Professional Services team at AWS. He specializes in building and modernizing large scale enterprise applications in the cloud with a focus on the financial services industry.

Suhas Pasricha

Suhas Pasricha

Suhas is a Cloud Infrastructure Architect in the AWS Professional Services team. He has a background in web development and infrastructure automation. At Amazon, he has been helping customers set up and operate an enterprise-wide landing zone and cloud environment. In his spare time, he likes to read and play video games.

Optimizing cold start performance of AWS Lambda using advanced priming strategies with SnapStart

Post Syndicated from Shan Kandaswamy original https://aws.amazon.com/blogs/compute/optimizing-cold-start-performance-of-aws-lambda-using-advanced-priming-strategies-with-snapstart/

Introduced at re:Invent 2022, SnapStart is a performance optimization that makes it easier to build highly responsive and scalable applications using AWS Lambda. The largest contributor to startup latency (often referred to as cold-start time) is the time spent initializing a function. This includes loading the function’s code and initializing dependencies. For latency-sensitive workloads such as APIs and real-time data processing applications, high startup latency can result in a suboptimal end user experience. Lambda SnapStart can reduce startup duration from several seconds to as low as sub-second, with minimal or no code changes. This post discusses ‘Priming’, a technique to further optimize startup times for AWS Lambda functions built using Java and Spring Boot.

Spring Boot applications typically experience high cold start latency during JVM and framework initialization, where significant time is spent loading classes and performing Just-In-Time (JIT) compilation of Java bytecode. This blog post uses a Spring Boot application as an example that retrieves 10 records from a ‘UnicornEmployee’ table in an Amazon RDS for PostgreSQL database, where each employee record includes employee name, location, and hire date.

The sample application uses Amazon API Gateway which triggers an AWS Lambda function that connects to the database through RDS Proxy to return the employee data. While this sample application uses dummy employee data for demonstration, the patterns and optimization techniques discussed in this post are applicable to real-world scenarios with similar data access patterns. Sample code for this implementation can be found in our GitHub repository at lambda-priming-crac-java-cdk.

Background: How SnapStart works

The post assumes familiarity with SnapStart and provides a short background. For additional details, refer to the SnapStart documentation.

To quickly recap, the INIT phase for a Lambda function involves downloading the function’s code, starting the runtime and any external dependencies, and running the function’s initialization code. For functions that don’t use SnapStart, this phase occurs each time your application scales up to create a new execution environment. When SnapStart is activated, the INIT phase happens when you publish a function version.

The following image shows a comparison of a Lambda request lifecycle with and without SnapStart.

Figure 1 – comparison of a Lambda request lifecycle with and without SnapStart

At the end of the INIT phase, Lambda executes your before-checkpoint runtime hooks. Lambda then snapshots the memory and disk state of the initialized execution environment, persists the encrypted snapshot, and caches it for low-latency access. When the function is subsequently invoked, new execution environments are resumed from the cached snapshot (during the RESTORE phase), speeding up function startup.

Figure 2 – new execution environments are resumed from the cached snapshot.

You can validate this speedup by comparing the RESTORE duration with the INIT duration recorded before SnapStart in your Lambda function’s Amazon CloudWatch Logs. As demonstrated in the following table, enabling SnapStart reduces the startup latency of our sample Spring Boot application by 4.3x from 6.1s to 1.4s. The 6.1s cold start latency for ON_DEMAND is primarily due to the combination of (1) initializing the JVM and Spring Boot framework, (2) JIT compilation of lazy loaded application code during initial invocation and (3) the time needed to establish a database connection with RDS through Amazon RDS Proxy. By enabling SnapStart, Lambda initializes the JVM and Spring Boot prior to the function invocation – resulting in the significantly reduced latency of 1.4s.

Method Cold Start Invocations p50 P90 P99 p99.9
PrimingLogGroup-1_ON_DEMAND 128 5047.94 ms 5386.78 ms 6158.80 ms 6195.84 ms
PrimingLogGroup-2_SnapStart_NO_PRIMING 111 1177.87 ms 1288.73 ms 1419.94 ms 1425.63 ms

You can reduce cold starts even further for your latency-sensitive Spring Boot applications by using priming techniques on Lambda functions. Let’s explore how to implement priming techniques.

Priming explained

Priming is the process of preloading dependencies and initializing resources during the INIT phase, rather than during the INVOKE phase to further optimize startup performance with SnapStart. This is required because Java frameworks that use dependency injection load classes into memory when these classes are explicitly invoked, which typically happens during Lambda’s INVOKE phase. You can proactively load classes using Java runtime hooks, that are part of the open source CRaC (Coordinated Restore at Checkpoint) project. This post demonstrates how to use this hook, called beforeCheckpoint(), to prime SnapStart-enabled Java functions, in two ways:

  1. Invoke Priming: This approach involves directly invoking application endpoints or methods in your pre-snapshotting hook so that they are JIT compiled during the INIT phase and included in the snapshot. This can include operations such as invoking API Gateway endpoints or fetching data from an S3 bucket or RDS database to proactively execute the code paths, ensuring that the underlying classes are included in the snapshot.
  2. Class Priming: This approach involves proactive initialization of classes during the INIT phase, ensuring that they are included in the function’s snapshot without risking unwanted changes to application state or data. This can be achieved by leveraging Java’s forName() method, which loads, links, and initializes the specified class. Initialization refers to the JVM process of loading the class definition into memory, verifying the bytecode, preparing static fields with default values, and executing static initializers. This is different from instantiation, which creates objects of the class using constructors. To generate a list of the classes required for pre-loading, you can use the following VM option, writing the list to a file called classes-loaded.txt:
    -Xlog:class+load=info:classes-loaded.txt

While invoke priming can offer better performance, it requires additional effort to ensure that the actions performed are idempotent and do not have unintended side effects, for instance processing financial transactions in a banking application. For this reason, invoke priming should only be used when code executed during priming is either idempotent or does not modify state. For scenarios where this is not possible, class priming provides a safer alternative by only initializing classes without executing their methods. Note that this assumes your application does not execute state-modifying code during class initialization.

With this context, let’s look at how to implement Invoke and Class priming for a Spring Boot sample application.

Example priming Implementation using CRaC runtime hooks before taking a Lambda snapshot

This post demonstrates both Invoke priming and Class priming using the sample Spring Boot application. The choice between the two approaches depends on the specific requirements and complexities of your application.

Step 1: Set up your Spring Boot Application using the aws-serverless-springboot3-archetype as explained in our Quick Start Spring Boot3 guide, adding database connectivity code, or simply clone the sample project from GitHub repository.

  1. Create a Spring Boot Application.
    // src/main/java/software/amazon/awscdk/examples/unicorn/UnicornApplication.java
    package software.amazon.awscdk.examples.unicorn;
    …
    @Import({ UnicornConfig.class })
    @SpringBootApplication
    public class UnicornApplication {
    
        private static final Logger log = LoggerFactory.getLogger(UnicornApplication.class);
    
        public static void main(String... arguments) {
            SpringApplication.run(UnicornApplication.class, arguments);
        }
    
    }

  2. Add all the necessary Maven dependencies for Spring Boot, AWS Lambda, and Database Connection in your pom.xml file. The following, highlighted, dependency contains the classes required to use the CRaC runtime hooks.
    ...
            <dependency>
                <groupId>org.crac</groupId>
                <artifactId>crac</artifactId>
            </dependency>
    ...

  3. Configure Database Connection – Set up the database connection details in application.properties.
    spring.datasource.password=${SPRING_DATASOURCE_PASSWORD} 
    spring.datasource.url=${SPRING_DATASOURCE_URL} 
    spring.datasource.username=postgres 
    spring.datasource.hikari.maximumPoolSize=1 

Step 2: Implement Lambda Function Handler with CRaC runtime hooks and Invoke Priming Approach:

Create Lambda Function Handler and integrate CRaC runtime hooks to execute beforeCheckpoint() and afterRestore() methods in your application for before taking and after restoring the snapshot.

  1. Implement the RequestHandler<UnicornRequest, UnicornResponse> interface in the Lambda function handler class.
  2. Implement the CRaC resource interface with two methods: beforeCheckpoint() and afterRestore(), which defines actions performed before Lambda creates the snapshot and after the snapshot is restored.
  3. Add invoke priming by creating a UnicornRequest object with a GET request to a specific endpoint (such as, /unicorn) and call the handleRequest(unicornRequest, null) method.

This ensures that the code paths associated with the specified endpoint are JIT compiled and optimized for faster execution during the first invocation after the snapshot is restored.

/src/main/java/software/amazon/awscdk/examples/unicorn/handler/InvokePriming.java
package software.amazon.awscdk.examples.unicorn.handler;

import org.crac.Core;
import org.crac.Resource;
...
public class InvokePriming implements RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse>, Resource {
	...

@Override
public APIGatewayV2HTTPResponse handleRequest(APIGatewayV2HTTPEvent event, Context context) {
    var awsLambdaInitializationType = System.getenv("AWS_LAMBDA_INITIALIZATION_TYPE");
    var unicorns = getUnicorns();
    var body = gson.toJson(unicorns);
    return APIGatewayV2HTTPResponse.builder().withStatusCode(200).withBody(body).build();
}

@Override
public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
        throws Exception {
    var event = APIGatewayV2HTTPEvent.builder().build();
    handleRequest(event, null);
}
...
}

Step 3: Implement Class priming Approach:

The class priming approach focuses on pre-loading required classes to achieve optimal performance. To implement class priming, generate the list of classes that are loaded during the application startup and function execution by running the application locally using the following JVM argument: -Xlog:class+load=info:classes-loaded.txt

  1. Ensure that your application classes included in the generated classes-loaded.txt file are not mutating state during static initialization.
    Note: the generated classes-loaded.txt contains class entries in the following format:

    [0.068s][info][class,load] software.amazon.awscdk.examples.unicorn.handler.ClassPriming source: file:/var/task/

  2. Extract only the fully qualified class names from each line and remove the additional logging information. For Example:
    software.amazon.awscdk.examples.unicorn.handler.ClassPriming

  3. Use the ClassLoaderUtil.loadClassesFromFile() utility method to extract the generated class entries.
    	     //src/main/java/software/amazon/awscdk/examples/unicorn/service/ClassLoaderUtil.java
    package software.amazon.awscdk.examples.unicorn;
    	...
    public class ClassLoaderUtil {
    	...
        public static void loadClassesFromFile() {
            log.info("loadClassesFromFile->started");
            Path path = Paths.get("classes-loaded.txt");
    
            try (BufferedReader bufferedReader = Files.newBufferedReader(path)) {
                Stream<String> lines = bufferedReader.lines();
                lines.forEach(line -> {
                    var index1 = line.indexOf("[class,load] ");
                    var index2 = line.indexOf(" source: ");
    
                    if (index1 < 0 || index2 < 0) {
                        return;
                    }
    
                    var className = line.substring(index1 + 13, index2);
                    try {
                        Class.forName(className, true,
                                ClassPriming.class.getClassLoader());
                    } catch (Throwable ignored) {
                    }
                });
    
                log.info("loadClassesFromFile->finished");
            } catch (IOException exception) {
                log.error("Error on newBufferedReader", exception);
            }
        }
    ...
    }

  4. Read a file (such as, /classes-loaded.txt) that contains a list of classes that have been loaded during the application’s execution in the beforeCheckpoint() method.
  5. Use the Class.forName() method to load and initialize the class, ensuring that it is ready during the snapshot.
    Note: by systematically pre-loading these classes, the Class priming approach simplifies the optimization process and reduces the complexities associated with Invoke priming.

    //src/main/java/software/amazon/awscdk/examples/unicorn/handler/ClassPriming.java
    package software.amazon.awscdk.examples.unicorn.handler;
    
    ...
    import org.crac.Core;
    import org.crac.Resource;
    
    public class ClassPriming implements RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse>, Resource {
    
    ...
            ConfigurableApplicationContext configurableApplicationContext =
    				SpringApplication.run(UnicornApplication.class);
    
            this.unicornService = configurableApplicationContext.getBean(UnicornService.class);
            this.gson = configurableApplicationContext.getBean(Gson.class);
    
            Core.getGlobalContext().register(this);
        }
    
        @Override
        public APIGatewayV2HTTPResponse handleRequest(APIGatewayV2HTTPEvent event, Context context) {
            var unicorns = getUnicorns();
            var body = gson.toJson(unicorns);
    
            return APIGatewayV2HTTPResponse.builder().withStatusCode(200).withBody(body).build();
        }
    
        @Override
        public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
                throws Exception {
    
            ClassLoaderUtil.loadClassesFromFile();
    
        }
    ...
    }

Step 4: AWS CDK Infrastructure Setup

Before proceeding, review the prerequisites in the project README file.

The CDK stack deploys a serverless application and required infrastructure for testing different Lambda optimization strategies. It creates a VPC with private subnets, an RDS for PostgreSQL instance with a database proxy, and five Lambda functions implementing different optimization approaches (ON_DEMAND without SnapStart, SnapStart without priming, SnapStart with invoke priming, and SnapStart with class priming). Each Lambda function is integrated with API Gateway for HTTP access, configured with Java 21 runtime on ARM64 architecture, and includes CloudWatch log groups for monitoring.

Follow these steps to deploy the infrastructure:

  1. Clone the sample repository:
    git clone https://github.com/aws-samples/lambda-priming-crac-java-cdk.git

  2. Deploy the CDK stack:
    cd lambda-priming-crac-java-cdk/infrastructure
    cdk synth
    cdk deploy --require-approval never --all 2>&1 | tee cdk_output.txt

  3. Save the API Gateway URLs:
    The deployment will output five URLs in this format:

    # ON_DEMAND endpoint (without SnapStart)
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi1ONDEMANDEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart without priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi2SnapStartNOPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart with invoke priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi3SnapStartINVOKEPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart with class priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi4SnapStartCLASSPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # Database setup endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi5DBLOADEREndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/

  4.  Extract the URLs into variables for testing:
    ONDEMAND_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 1) \
    
    NOPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 2 | tail -n 1) \
    
    INVOKEPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 3 | tail -n 1) \
    
    CLASSPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 4 | tail -n 1) \
    
    SETUP_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 5 | tail -n 1)

Step 5: Load database and run performance testing using artillery:

  1. Initialize the database with sample data.
    curl -X GET "$SETUP_URL"
    
    #Expected output: {"message":"Database schema initialized and data loaded"}

  2. Run performance tests for all endpoints
    artillery run -t "$ONDEMAND_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$NOPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$INVOKEPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$CLASSPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml

Step 6: Compare the load test results for On-demand (non-SnapStart), SnapStart, Invoke priming, and Class priming

The performance test results in the table below are sorted from slowest to fastest startup latency. The function without SnapStart performs the slowest due to JVM initialization, class loading and JIT compilation that occurs when the function is invoked. Notice a 4.3x improvement with SnapStart, which resumes invocations from a pre-initialized snapshot thereby avoiding JVM initialization and initial JIT compilation. SnapStart with class priming achieves a 1.4x speed-up over SnapStart, by proactively loading/initializing classes during INIT so that they are included in your function’s snapshot. Finally, SnapStart with invoke priming achieves the fastest performance – with a 781.68ms p99.9 cold-start latency that is 1.8x faster than SnapStart. This is because in addition to initializing classes, it also executes methods on the instances of those classes, resulting in even more components being included in the function’s snapshot.

Note that with invoke priming, any application code you execute must either be idempotent or modify stub data only. For instance, consider application code that triggers a financial transaction. If this code is executed during invoke priming with real user data, it may drive unintended effects with potentially serious consequences. Class priming avoids this, since application classes are initialized rather than being instantiated and their methods executed. This assumes that application code does not execute state modifying logic during class initialization. We recommend that you keep these considerations in mind when using invoke and/or class priming, and choose the appropriate approach for your use case.

Method Cold Start Invocations p50 P90 P99 p99.9
PrimingLogGroup-1_ON_DEMAND 128 5047.94 ms 5386.78 ms 6158.80 ms 6195.84 ms
PrimingLogGroup-2_SnapStart_NO_PRIMING 111 1177.87 ms 1288.73 ms 1419.94 ms 1425.63 ms
PrimingLogGroup-4_SnapStart_CLASS_PRIMING 82 857.81 ms 997.49 ms 1085.94 ms 1085.94 ms
PrimingLogGroup-3_SnapStart_INVOKE_PRIMING 66 608.42 ms 688.88 ms 781.68 ms 781.68 ms

 Conclusion

This post showed how AWS Lambda SnapStart, enhanced by CRaC runtime hooks, unlocks granular control over cold-start optimization for Java applications through two distinct priming strategies:

  • Invoke Priming: improves performance by executing critical endpoints during snapshot creation, ideal for idempotent workflows.
  • Class Priming: preloads classes without triggering business logic, mitigating side-effect risks.

To implement these optimization techniques in your applications evaluate your use case and opt for the optimal priming approach. Track latency reductions and resource utilization of your application via Amazon CloudWatch metrics to quantify performance improvements. By integrating these strategies, developers can achieve sub-second cold starts while maintaining the scalability and cost-efficiency of serverless architecture using Java.

To dive deeper, check out the GitHub repository with the full example code, including setup instructions and reusable patterns you can adapt to your own projects. For more examples of Java applications running on AWS Lambda, visit serverlessland.com and explore a wide range of resources, tutorials, and real-world use cases.

Access your existing data and resources through Amazon SageMaker Unified Studio, Part 1: AWS Glue Data Catalog and Amazon Redshift

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/access-your-existing-data-and-resources-through-amazon-sagemaker-unified-studio-part-1-aws-glue-data-catalog-and-amazon-redshift/

Amazon SageMaker Unified Studio provides a unified environment for data, analytics, machine learning (ML), and AI workloads. Part of the next generation of Amazon SageMaker, SageMaker Unified Studio allows you to discover your data and put it to work using familiar AWS tools to complete end-to-end development workflows, including data analysis, data processing, model training, generative AI app development, and more, in a single governed environment. You can create or join projects to collaborate with your teams, share AI and analytics artifacts securely, and discover and use your data stored in different data sources through Amazon SageMaker Lakehouse.

This series of posts demonstrates how you can onboard and access existing AWS data sources using SageMaker Unified Studio. This post focuses on onboarding existing AWS Glue Data Catalog tables and database tables available in Amazon Redshift. Part 2 discusses using Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon EMR.

Access your existing data and resources through Amazon SageMaker Unified Studio

This series primarily focuses on the UI experience. If you prefer script-based automation, refer to Bringing existing resources into Amazon SageMaker Unified Studio.

Access management with SageMaker Unified Studio

The SageMaker Unified Studio authorization model is a hierarchical access control list (ACL) based on the resource type such as a domain or a project. For example, at the domain level, a user might have a domain owner designation and at the project level, the user can be an owner or contributor. You can configure these profiles at AWS Identity and Access Management (IAM) user, single sign-on (SSO) user, and SSO group level.

Each project has a project role. When the user interacts with resources within SageMaker Unified Studio, it generates IAM session credentials based on the user’s effective profile in the specific project context, and then users can use tools such as Amazon Athena or Amazon Redshift to query the relevant data. The project owner can add or remove project members for their project, create publishing agreements with a domain, and publish assets to a domain.

SageMaker Unified Studio can be accessed by IAM users or SSO authenticated users, and IAM roles can interact with the SageMaker Unified Studio through its APIs.

Solution overview

AWS Lake Formation enables you to define fine-grained access control on the Data Catalog, where you can configure access at database, table, row, or column level or define permissions with tags. When setting up Lake Formation, you can configure it with hybrid access mode, where you get flexibility to selectively enable Lake Formation permissions for specific databases and tables, and continue using IAM permissions for others. SageMaker Unified Studio supports Lake Formation hybrid mode.

When you create a project in SageMaker Unified Studio, an AWS Glue database is added by default as part of the project. Assets published into that database don’t need any additional permissions, but if you want to publish or subscribe assets from an existing AWS Glue database, then you need to provide explicit permissions to SageMaker Unified Studio to be able to access the database and tables. For more details, see Configure Lake Formation permissions for Amazon SageMaker Unified Studio.

Let’s understand how we can access existing datasets through SageMaker Unified Studio.

Prerequisites

To run the instruction, you must complete the following prerequisites:

  • An AWS account
  • A SageMaker Unified Studio domain
  • A SageMaker Unified Studio project with All capabilities project profile

In the SageMaker Unified Studio, select the project and navigate to the Project overview page. Copy the Project role ARN as highlighted in the screenshot. This project role will be used further in the post to provide permissions on existing datasets and resources.

Use existing AWS Glue tables

This section has following prerequisites:

One extra prerequisite step is to revoke IAMAllowedPrincipals group permission on both database and table to enforce Lake Formation permission for access. For detailed instruction see Revoking permission using the Lake Formation console.

To access existing Data Catalog tables in SageMaker Unified Studio, complete the following steps:

  1. On the Lake Formation console using the data lake administrator, choose Data lake locations in the navigation pane and choose Register location.
  2. Enter the S3 prefix for Amazon S3 path.
  3. For IAM role, choose your Lake Formation data access IAM role, which is not a service linked role.
  4. Select Lake Formation for Permission mode and choose Register location.

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. Select the existing Data Catalog database.
  3. From the Actions menu, choose Grant to grant permissions to the project role.

  1. For IAM users and roles, choose the project role.
  2. Select Named Data Catalog resources, and for Catalogs, choose the default catalog.
  3. For Databases, choose your existing Data Catalog database.

  1. For Database permissions, select Describe and choose Grant.

The next step is to grant the permission on the tables to the project role.

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. Select the existing Data Catalog database.
  3. From the Actions menu, choose Grant to grant permissions to the project role.
  4. For IAM users and roles, choose the project role.
  5. Select Named Data Catalog resources, and for Catalogs, choose the default catalog.
  6. For Databases, choose your Data Catalog database.
  7. For Tables, select the tables that you need to provide permission to the project role.

  1. For Table permissions, select Select and Describe.
  2. For Grantable permissions, select Select and Describe.
  3. Choose Grant.

You should revoke any existing permissions of IAMAllowedPrincipals on the databases and tables within Lake Formation.

Now let’s verify that we can access the existing AWS Glue table from the SageMaker Unified Studio Query Editor.

  1. In SageMaker Unified Studio, navigate to your project.
  2. On the project page, under Lakehouse, choose Data.
  3. Next to the Data Catalog table, choose the options menu (three dots), and choose Query with Athena.

SageMaker Unified Studio provides a unified JupyterLab experience across different languages, including SQL, PySpark, and Scala Spark. It also supports unified access across different compute runtimes such as Amazon Redshift and Athena for SQL, Amazon EMR Serverless, Amazon EMR on EC2, and AWS Glue for Spark. To access the data through the unified JupyterLab experience, complete the following steps:

  1. On the SageMaker Unified Studio project page, on the top menu, choose Build, and under IDE & APPLICATIONS, choose
  2. Wait for the space to be ready.
  3. Choose the plus sign and for Notebook, choose Python 3.
  4. In the notebook, switch the connection type to PySpark, choose spark.fineGrained, and query the existing Data Catalog table:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_sql = spark.sql("""
select * from retaildb.salesorders
""" )

df_sql.show()

Use existing Redshift clusters

This section has following prerequisites:

  • A Redshift cluster

To bring in existing Redshift clusters, follow these steps:

  1. To use your provisioned Redshift cluster or a Redshift Serverless workgroup, add either of the following tags (key/value) to the resource:
    1. Add AmazonDataZoneProject: <projectID> if you want to allow only a specific SageMaker Unified Studio project to access the Amazon Redshift resource. Replace <projectID> with the ID of the project created in SageMaker Unified Studio.
    2. Add for-use-with-all-datazone-projects: true if you want to allow all SageMaker Unified Studio projects to access the Amazon Redshift resource.

  1. To add the compute connection in SageMaker Unified Studio, you can authenticate the cluster using either the user name and password of the database, IAM credentials, or AWS Secrets Manager. To provide the authentication using Secrets Manager, add either of the following tags. This will enable the existing secret to appear on the dropdown menu, while defining the connection in SageMaker Unified Studio.
    1. AmazonDataZoneProject: <projectID>
    2. for-use-with-all-datazone-projects: true

In the following screenshot, you can see the tag configuration section within Secrets Manager settings for Redshift Serverless compute. To understand how to create a secret for a database in a Redshift cluster using Secrets Manager, refer to Managing Amazon Redshift admin passwords using AWS Secrets Manager.

  1. After the tags are applied, log in to SageMaker Unified Studio and choose the project.
  2. Go to the Compute section of your project, and on the Data warehouse tab, choose Add compute.
  3. Select Connect to existing compute resources.
  4. Choose the compute type: Amazon Redshift Provisioned cluster or Amazon Redshift Serverless.
  5. Configure the parameters by selecting the existing compute and authentication and choose Add compute.

The detailed walkthrough process is illustrated in the following screenshot.

Use Redshift tables with existing compute

This section has following prerequisites:

  • A Redshift table

In this section, we illustrate steps to create a federated connection for an existing Amazon Redshift data source. You can register an existing Redshift provisioned cluster as well as Redshift Serverless with the Data Catalog using SageMaker Unified Studio. This creates a federated multi-level catalog and provides the ability to centrally manage permissions to the data with fine-grained access control using Lake Formation. By mounting Amazon Redshift data in the Data Catalog, you can query it using your preferred tools such as Athena or AWS Glue extract, transform, and load (ETL) without having to copy or move the data.

Create an Amazon Redshift managed VPC endpoint for Amazon Redshift

Amazon Redshift managed virtual private cloud (VPC) endpoints use AWS PrivateLink to allow one VPC to privately access resources in another VPC as if they were local to the same VPC. With an Amazon Redshift managed VPC endpoint, you can connect to your private Redshift cluster with the RA3 instance type or Redshift Serverless within your VPC.

In this section, we explain how to create an Amazon Redshift managed VPC endpoint for both Redshift Serverless and an Amazon Redshift provisioned cluster in a single account. The managed VPC endpoint needs to be created only if your Redshift provisioned or Redshift Serverless cluster is in a different VPC than the SageMaker Unified Studio domain VPC.

If the SageMaker Unified Studio domain account is in a different account, allow the additional AWS accounts to create cluster endpoints. For steps to authorize your Amazon Redshift provisioned or Redshift Serverless cluster to deploy endpoints in additional accounts and grant access to the cross-account VPC, refer to Granting access to a VPC.

Redshift Serverless

For Redshift Serverless, follow these instructions.

The common practice is to allow port 5439 (Amazon Redshift connectivity port) to the security group or CIDR range in which your consumption workloads run.

  1. In the security group associated with the Redshift cluster, add an inbound rule with Type as Redshift, Protocol as TCP, Port range as 5439 (Amazon Redshift connectivity port), and Source as the CIDR range in which your consumption workloads run.

  1. On the Amazon Redshift console of the workgroup, go to Redshift-managed VPC endpoints.
  2. Choose Create endpoint.
  3. In the Endpoint settings section, choose the VPC, associated private subnet, and security group created for the SageMaker Unified Studio domain account to deploy the endpoint against.

The following screenshot shows the Amazon Redshift managed VPC endpoint created for Redshift Serverless.

Redshift provisioned

For Amazon Redshift provisioned, follow these instructions:

  1. To implement an Amazon Redshift managed VPC endpoint for a provisioned cluster, you need to enable cluster relocation and create subnet groups. In the cluster subnet group, choose the VPC and subnets of the SageMaker Unified Studio domain account.
  2. On the Amazon Redshift console, choose Configurations in the navigation pane.
  3. Provide the endpoint details, then choose Create endpoint.

Create a federated connection for Amazon Redshift

Complete the following steps to create a federated catalog in the Data Catalog to query the data using various preferred analytics tools such as Athena, visual ETL in SageMaker Unified Studio, Amazon EMR, and more:

  1. On the SageMaker Unified Studio console, choose your project.
  2. Choose Data in the navigation pane.
  3. In the data explorer, choose the plus sign to add a data source.
  4. Under Add a data source, choose Add connection, then choose Amazon Redshift.
  5. Enter the following parameters in the connection details, and choose Add data.
    1. Name: Enter the connection name.
    2. Host: Enter the Amazon Redshift managed VPC endpoint.
    3. Port: Enter the port number (Amazon Redshift uses 5439 as the default port).
    4. Database: Enter the database name.
    5. Authentication: Choose either the database user name and password credentials or Secrets Manager.

After the connection is established, you will see that the federated catalog is created, as shown in the following screenshot. This catalog uses the AWS Glue connection to connect to Amazon Redshift. The databases, tables, and views are automatically cataloged in the Data Catalog and registered with Lake Formation.

With Athena, data analysts can run federated SQL queries to scan data from multiple data sources in-place without creating complex data pipelines or data replication.

Use existing Data Catalog tables and Amazon Redshift assets in the SageMaker Unified Studio business data catalog

You can use the SageMaker Unified Studio business data catalog to catalog the data across your organization with business context. To use Amazon SageMaker Catalog, you must bring your existing data assets into the inventory of your project. Follow the instructions in this section to bring your existing Data Catalogs and Amazon Redshift assets into the project inventory.

Add an existing Data Catalog to the project inventory

To enrich the asset with business context and share your assets outside your own project, you must first bring the metadata to SageMaker Catalog. To import the metadata of the assets into the project’s inventory, you need to create a data source in the project catalog.

  1. In SageMaker Unified Studio, navigate to the Project catalog page within the project.
  2. Choose Data sources.
  3. Choose CREATE DATA SOURCE.
  4. For Name, provide the name of the data source.
  5. Choose AWS Glue (Lakehouse) for Data source type.
  6. For Data selection, choose the Database name and choose Next.
  7. Keep the rest as default and choose CREATE.
  8. Choose RUN to import the metadata.

After the data source successfully completes its run, metadata of all the data assets gets added to the project’s inventory.

Add existing Redshift tables and views to the project inventory

Create a data source to bring in the existing Redshift tables and views to add to the project’s inventory:

  1. In SageMaker Unified Studio, navigate to the Project catalog within the project.
  2. Choose Data sources.
  3. Choose CREATE DATA SOURCE.
  4. For Name, provide the name of the data source.
  5. Choose Amazon Redshift for Data source type.
  6. For Connection, choose the name of the Redshift connection.
  7. For Database name, choose dev and for Schema, enter public.
  8. Keep the rest as default and choose CREATE.
  9. Choose RUN to import the metadata.

After the data source successfully completes its run, metadata of all the data assets gets added to the project’s inventory.

Conclusion

This post explained how you can access existing data and resources available in the Data Catalog and Amazon Redshift using SageMaker Unified Studio. SageMaker Unified Studio provides an integrated environment for analytics and AI. Being able to access existing datasets available in your AWS account helps reduce operational overhead because users of your organization can access a common interface, collaborate, and share datasets. It also brings in efficiency for administrators because they can manage permissions for domains and projects in a common place.

In the next post, we will demonstrate how you can onboard and access other existing data sources such as Amazon S3, Amazon RDS, DynamoDB, and Amazon EMR.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached via LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Sakti Mishra is a Principal Data and AI Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to end-data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family. He can be reached via LinkedIn.

Daiyan Alamgir is a Principal Frontend Engineer on the Amazon SageMaker Unified Studio team based in New York.

Vipin Mohan is a Principal Product Manager at AWS, leading the launch of generative AI capabilities in Amazon SageMaker Unified Studio. He is committed to shaping impactful products by working backward from customer insights, championing user-focused solutions, and delivering scalable results.

Chanu Damarla is a Principal Product Manager on the Amazon SageMaker Unified Studio team. He works with customers around the globe to translate business and technical requirements into products that delight customers and enable them to be more productive with their data, analytics, and AI.

Access your existing data and resources through Amazon SageMaker Unified Studio, Part 2: Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/access-your-existing-data-and-resources-through-amazon-sagemaker-unified-studio-part-2-amazon-s3-amazon-rds-amazon-dynamodb-and-amazon-emr/

Organizations often face the challenge of managing and analyzing data spread across multiple storage systems and databases while providing secure, efficient access for their data science teams. Amazon SageMaker Unified Studio addresses this challenge by providing a unified analytics and AI development environment where data scientists can access, analyze, and use data from various sources within a single, governed workspace, allowing teams to use their existing data infrastructure while taking advantage of advanced analytics and AI capabilities. SageMaker Unified Studio is part of the next generation of Amazon SageMaker, the center for all your data, analytics, and AI.

In Part 1 of this series, we explored how to access AWS Glue Data Catalog tables and Amazon Redshift resources through SageMaker Unified Studio. Continuing our journey, this post discusses integrating additional vital data sources such as Amazon Simple Storage Service (Amazon S3) buckets, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon EMR clusters. We demonstrate how to configure the necessary permissions, establish connections, and effectively use these resources within SageMaker Unified Studio. Whether you’re working with object storage, relational databases, NoSQL databases, or big data processing, this post can help you seamlessly incorporate your existing data infrastructure into your SageMaker Unified Studio workflows.

Access your existing data and resources through Amazon SageMaker Unified Studio

Solution overview

SageMaker Unified Studio seamlessly works with your existing data and resources through relevant permissions and network settings.

Let’s understand how we can access existing datasets across S3, RDS, DynamoDB, and EMR through SageMaker Unified Studio.

Prerequisites

To run the instruction, you must complete the following prerequisites:

  • An AWS account
  • A SageMaker Unified Studio domain
  • A SageMaker Unified Studio project with All capabilities project profile

In SageMaker Unified Studio, select the project and navigate to the Project overview page. Copy the Project role ARN as highlighted in the screenshot. This project role will be used further in the post to provide permissions on existing datasets and resources.

Use existing S3 buckets

This section has following prerequisites:

  • An S3 bucket

To use an existing S3 bucket in SageMaker Unified Studio, configure an S3 bucket policy that allows the appropriate actions for the project AWS Identity and Access Management (IAM) role.

The following is an example bucket policy. Replace <aws_accountid> with the AWS account ID where the domain resides, <s3_bucket> with the name of the S3 bucket that you intend to query in SageMaker Unified Studio, and <datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy> with the project role in SageMaker Unified Studio:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::<s3_bucket>",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        },
        {
            "Sid": "Statement2",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::<s3_bucket>/*",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        }
    ]
}

After you configure the policy, log in to SageMaker Unified Studio and open the project.

Query the data using the JupyterLab IDE to perform analysis, as shown in the following screenshot.

Although the project role has been given appropriate permissions to access the S3 bucket in SageMaker Unified Studio, you will not able to list the contents of the bucket and show the S3 path in the data explorer section within SageMaker Unified Studio.

Use existing RDS DB instances

This section has following prerequisites:

  • A VPC and a private subnet
  • A RDS DB instance on the private subnet in the VPC

SageMaker Unified Studio uses the virtual private cloud (VPC) and subnets that are specified in the domain creation. If you have the data source like an RDS DB instance in a separate VPC, you can configure network reachability between the domain VPC and the data source VPC using VPC peering, AWS Transit Gateway, or a resource VPC endpoint, or alternatively you can create a new domain using the data source VPC.

Add a PostgreSQL connection

Complete the following steps to configure that reachability using VPC peering with Amazon Virtual Private Cloud (Amazon VPC):

  1. On the Amazon VPC console, choose Your VPCs, and make a note of the VPC ID of your VPC named SageMakerUnifiedStudioVPC.
  2. Choose Peering connections, and choose Create peering connection.
  3. Under Select another VPC to peer with, for VPC ID (Requester), choose the VPC ID noted earlier.
  4. Under Select another VPC to peer with, for VPC ID (Accepter), choose the VPC where the target RDS DB instance is located.
  5. Review your settings and choose Create peering connection.
  6. On the Peering connections page, select your peering connection.
  7. Under Actions, choose Accept request.
  8. Review the settings and choose Accept request.

Now you have configured the VPC peering connection. The next step is to configure the network route from the SageMaker Unified Studio VPC to the Amazon RDS VPC.

  1. On the Amazon VPC console, choose Route tables in the navigation pane.
  2. Choose the route table that is used in the private subnets of SageMakerUnifiedStudioVPC.
  3. Choose Edit routes.
  4. Choose Add route.
  5. For Destination, choose the VPC CIDR of the VPC where the RDS DB instance is located.
  6. For Target, choose Peering Connection, and choose the peering connection you created earlier.
  7. Choose Save changes.

Now you have configured the route table from the SageMaker Unified Studio VPC to the Amazon RDS VPC. The next step is to configure the opposite route.

  1. On the Amazon VPC console, choose Route tables in the navigation pane.
  2. Choose the route table that is used in the private subnets of the RDS DB instance.
  3. Choose Edit routes.
  4. Choose Add route.
  5. For Destination, choose the VPC CIDR of SageMakerUnifiedStudioVPC.
  6. For Target, choose Peering Connection, and choose the peering connection you created earlier.
  7. Choose Save changes.

Now you configure your RDS security group to accept traffic coming from SageMaker Unified Studio.

  1. On the Amazon RDS console, navigate to your RDS DB instance, and choose VPC security groups.
  2. Select your security group, and choose Inbound rules.
  3. Choose Edit inbound rules.
  4. Choose Add rule.
  5. For Type, choose Custom TPC.
  6. For Port range, enter your RDS port number.
  7. For Source, enter the VPC CIDR of SageMakerUnifiedStudioVPC.

Now you have network reachability required to use the existing RDS DB instance. The next step is to create a connection pointing to that RDS DB instance in SageMaker Unified Studio.

  1. Sign in to SageMaker Unified Studio and open your project.
  2. In your project, in the navigation pane, choose Data.
  3. Choose the plus sign, and for Add data source, choose Add connection.
  4. Select PostgreSQL.
  5. For Data source name, enter postgresql_source.
  6. For Host, enter the host name of your Aurora PostgreSQL database cluster.
  7. For Port, enter the port number of your Aurora PostgreSQL database cluster (by default, it’s 5432).
  8. For Database, enter your database name.
  9. For Authentication, select Username and password, and enter your user name and password.
  10. Choose Add data source.

You will need to wait for several minutes to complete this step.

Use a visual ETL flow to ingest data to Amazon RDS

In a visual extract, transform, and load (ETL) flow, you can use PostgreSQL as source and target. You can create a PostgreSQL target, and for Name, choose postgresql_source to ingest data into Amazon RDS.

  1. Choose the plus sign, and under Data sources, choose Amazon S3.
  2. Choose Amazon S3 for the source node, and enter following values:
    1. S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv
    2. Format: CSV
    3. Sep: ,
    4. Multiline: Enabled
    5. Header: Disabled
    6. Leave the rest as default.
  3. Wait for the data preview to be available.
  4. Choose the plus sign to the right of Amazon S3 Under Transforms, choose Rename Columns.
  5. Choose the Rename Columns node, and choose Add new rename pair.
  6. For Current name and New name, enter following pairs:
    1. _c0: venueid
    2. _c1: venuename
    3. _c2: venuecity
    4. _c3: venuestate
    5. _c4: venueseats
  7. Choose the plus sign to the right of Rename Columns
  8. Under Targets, choose PostgreSQL, and enter following values:
    1. Name: postgresql_source
    2. Schema: public
    3. Table: venue

  1. Choose Save to project. You can optionally change the name and add a description.
  2. Choose Run. Optionally, you can change the compute parameters.

Wait for completion. Then the data has been successfully ingested.

Run an Athena query to explore the table on Amazon RDS

After you create a table on Amazon RDS, you can explore the table through a data explorer in SageMaker Unified Studio:

  1. On SageMaker Unified Studio, choose Data.
  2. Under Lakehouse, choose postgresql_source, public, and venue.
  3. On the options menu (three dots), choose Query with Athena.

You get records from the RDS table venue.

Use existing DynamoDB tables

This section has following prerequisites:

  • A DynamoDB table

To access existing DynamoDB tables, configure a resource-based policy that allows the appropriate actions for the project role:

  1. On the DynamoDB console, choose Tables in the navigation pane.
  2. Select your table.
  3. Choose the Permissions tab and choose Create table policy.

The following example policy allows connecting to DynamoDB tables as a federated source. Replace <aws_region> with your AWS Region, <aws_account_id> with the AWS account ID where DynamoDB is deployed, <dynamodb_table> with the DynamoDB table that you intend to query from SageMaker Unified Studio, and <datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy> with the project role in SageMaker Unified Studio:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "dynamodb:Query",
                "dynamodb:Scan",
                "dynamodb:DescribeTable",
                "dynamodb:PartiQLSelect",
                "dynamodb:BatchWriteItem"
            ],
            "Resource": "arn:aws:dynamodb:<aws_region>:<aws_accountid>:table/<dynamodb_table>",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        }
    ]
}

After the policies are incorporated on the DynamoDB table, create an Amazon SageMaker Lakehouse connection within SageMaker Unified Studio:

  1. Choose Data in the navigation pane.
  2. In the data explorer, choose the plus sign to add a data source.
  3. Select Add connection and choose Next.
  4. Select Amazon DynamoDB and choose Next.
  5. For Name, enter a name, then choose Add data.

The following screenshot shows the detailed steps to create a federated DynamoDB connection in SageMaker Unified Studio. After the connection is established, you can query the data from the DynamoDB table with using the Athena query editor.

You can also use existing DynamoDB tables as part of the ETL process. In the following screenshot, we demonstrate this using a visual ETL flow.

Use existing EMR clusters

This section has following prerequisites:

  • An EMR on EC2 cluster

SageMaker Unified Studio enables you to create new compute or add existing compute resources to a project for submitting jobs. You can add existing Amazon EMR on EC2 clusters or add existing Amazon EMR Serverless applications to submit data analytics jobs. To add a new EMR Serverless application, an administrator must enable a blueprint for the project.

To add an existing EMR on EC2 cluster, complete the following steps:

  1. In SageMaker Unified Studio, navigate to the project for which you plan to add compute, then choose Compute in the navigation pane.
  2. Choose the Data processing
  3. To add an existing EMR on EC2 cluster, choose Add compute.
  4. Choose Connect to existing compute resources and choose Next.

  1. To specify the compute resources to choose from, choose EMR on EC2 cluster.

  1. The Add Compute dialog box requires you to have the correct permissions to access the EMR on EC2 cluster. You can choose Copy project information to copy the data; the admin will need to grant the data worker access. Send the information to your admin.
  2. After the account administrator has granted the data worker access, you can specify the Amazon Resource Names (ARNs) associated with the cluster. You must fill in the Access role ARN, EMR on EC2 cluster ARN, Instance profile role ARN, and Name
  3. After you configure these settings, choose Add compute.

Your EMR on EC2 instance will be added to your project.

After you have added a cluster to a project, you will be able to see the cluster on the Data processing tab of the Compute page. You can then view the cluster details by choosing the specific cluster.

In addition to adding existing compute resources, you have the option to create new compute resources, which allows you to create both EMR on EC2 cluster and EMR Serverless applications.

Conclusion

SageMaker Unified Studio enables you to integrate with multiple data sources, providing data scientists and analysts with a powerful, unified environment for their AI and analytics workflows. As demonstrated throughout this two-part series, you can seamlessly connect to and use data from the Data Catalog, Amazon Redshift, Amazon S3, Amazon RDS, DynamoDB, and Amazon EMR—while maintaining proper security controls and permissions. This flexibility alleviates the need for complex data movement operations and allows teams to focus on extracting insights from their data rather than managing infrastructure. By following the approaches outlined in these posts, organizations can maximize their existing data investments while taking advantage of the advanced capabilities of SageMaker Unified Studio for their data science and analytics needs.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached via LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Sakti Mishra is a Principal Data and AI Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to end-data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family. He can be reached via LinkedIn.

Daiyan Alamgir is a Principal Frontend Engineer on the Amazon SageMaker Unified Studio team based in New York.

Vipin Mohan is a Principal Product Manager at AWS, leading the launch of generative AI capabilities in Amazon SageMaker Unified Studio. He is committed to shaping impactful products by working backward from customer insights, championing user-focused solutions, and delivering scalable results.

Chanu Damarla is a Principal Product Manager on the Amazon SageMaker Unified Studio team. He works with customers around the globe to translate business and technical requirements into products that delight customers and enable them to be more productive with their data, analytics, and AI.

Melting the ice — How Natural Intelligence simplified a data lake migration to Apache Iceberg

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/melting-the-ice-how-natural-intelligence-simplified-a-data-lake-migration-to-apache-iceberg/

This post is co-written with Haya Axelrod Stern, Zion Rubin and Michal Urbanowicz from Natural Intelligence.

Many organizations turn to data lakes for the flexibility and scale needed to manage large volumes of structured and unstructured data. However, migrating an existing data lake to a new table format such as Apache Iceberg can bring significant technical and organizational challenges

Natural Intelligence (NI) is a world leader in multi-category marketplaces. NI’s leading brands, Top10.com and BestMoney.com, help millions of people worldwide to make informed decisions every day. Recently, NI embarked on a journey to transition their legacy data lake from Apache Hive to Apache Iceberg.

In this blog post, NI shares their journey, the innovative solutions developed, and the key takeaways that can guide other organizations considering a similar path.

This article details NI’s practical approach to this complex migration, focusing less on Apache Iceberg’s technical specifications, but rather on the real-world challenges and solutions encountered during the transition to Apache Iceberg, a challenge that many organizations are grappling with.

Why Apache Iceberg?

The architecture at NI followed the commonly used medallion architecture, comprised of a bronze-silver-gold layered framework, shown in the figure that follows:

  • Bronze layer: Unprocessed data from various sources, stored in its raw format in Amazon Simple Storage Service (Amazon S3), ingested through Apache Kafka brokers.
  • Silver layer: Contains cleaned and enriched data, processed using Apache Flink.
  • Gold layer: Holds analytics-ready datasets designed for business intelligence (BI) and reporting, produced using Apache Spark pipelines, and consumed by services such as Snowflake, Amazon Athena, Tableau, and Apache Druid. The data is stored in Apache Parquet format with AWS Glue Catalog providing metadata management.

BDB4681-Arch1

While this architecture supported NI analytical needs, it lacked the flexibility required for a truly open and adaptable data platform. The gold layer was coupled only with query engines that supported Hive and AWS Glue Data Catalog. It was possible to use Amazon Athena however Snowflake required maintaining another catalog in order to query those external tables. This issue made it difficult to evaluate or adopt alternative tools and engines without costly data duplication, query rewrite data catalog synchronization. As business scaled, NI needed a data platform that could seamlessly support multiple query engines simultaneously with a single data catalog and avoiding any vendor lock-in.

The power of Apache Iceberg

Apache Iceberg emerged as the perfect solution—a flexible, open table format that aligns with NI’s approach of Data Lake First. Iceberg offers several critical advantages such as ACID transactions, schema evolution, time travel, performance improvements and more. But the key strategic benefits lay in the ability to support multiple query engines simultaneously. It also has the following advantages:

  • Decoupling of storage and compute: The open table format enables you to separate the storage layer from the query engine, allowing an easy swap and support for multiple engines concurrently without data duplication.
  • Vendor independence: As an open table format, Apache Iceberg prevents vendor lock-in, giving you the flexibility to adapt to changing analytics needs.
  • Vendor adoption: Apache Iceberg is widely supported by major platforms and tools, providing seamless integration and long-term ecosystem compatibility.

By transitioning to Iceberg, NI was able to embrace a truly open data platform, providing long-term flexibility, scalability, and interoperability while maintaining a unified source of truth for all analytics and reporting needs.

Challenges faced

Migrating a live production data lake to Iceberg was challenging because of operational complexities and legacy constraints. The data service at NI runs hundreds of Spark and machine learning pipelines, manages thousands of tables, and supports over 400 dashboards—all operating 24/7. Any migration would need to be done without production interruptions; and coordinating such a migration while operations continue seamlessly was daunting.

NI needed to accommodate diverse users with varying requirements and timelines from data engineers to data analysts all the way to data scientists and BI teams.

Adding to the challenge were legacy constraints. Some of the existing tools didn’t fully support Iceberg, so there was a need to maintain Hive-backed tables for compatibility. As NI realized that not all consumers could adopt Iceberg immediately. A plan was required to allow for incremental transitions without downtime or disruption to ongoing operations.

Key pillars for migration

To help ensure a smooth and successful transition, six critical pillars were defined:

  • Support ongoing operations: Maintain uninterrupted compatibility with existing systems and workflows during the migration process.
  • User transparency: Minimize disruption for users by preserving existing table names and access patterns.
  • Gradual consumer migration: Allow consumers to adopt Iceberg at their own pace, avoiding a forced, simultaneous switchover.
  • ETL flexibility: Migrate ETL pipelines to Iceberg without imposing constraints on development or deployment.
  • Cost effectiveness: Minimize storage and compute duplication and overhead during the migration period.
  • Minimize maintenance: Reduce the operational burden of managing dual table formats (Hive and Iceberg) during the transition.

Evaluating traditional migration approaches

Apache Iceberg supports two main approaches for migration: In-place and rewrite-based migration.

In-place migration

How it works: Converts an existing dataset into an Iceberg table without duplicating data by creating Iceberg metadata on top of the existing files while preserving their layout and format.

Advantages:

  • Cost-effective in terms of storage (no data duplication)
  • Simplified implementation
  • Maintains existing table names and locations
  • No data movement and minimal compute requirements, translating into lower cost

Disadvantages:

  • Downtime required: All write operations must be paused during conversion, which was unacceptable in NI cases because data and analytics are considered mission critical and run 24/7
  • No gradual adoption: All consumers must switch to Iceberg simultaneously, increasing the risk of disruption
  • Limited validation: No opportunity to validate data before cutover; rollback requires restoring from backups
  • Technical constraints: Schema evolution during migration can be challenging; data type incompatibilities can halt the entire process

Rewrite-based migration

How it works: Rewrite-based migration in Apache Iceberg involves creating a new Iceberg table by rewriting and reorganizing existing dataset files into Iceberg’s optimized format and structure for improved performance and data management.

Advantages:

  • Zero downtime during migration
  • Supports gradual consumer migration
  • Enables thorough validation
  • Simple rollback mechanism

Disadvantages:

  • Resource overhead: Double storage and compute costs during migration
  • Maintenance complexity: Managing two parallel data pipelines increases operational burden
  • Consistency challenges: Maintaining perfect consistency between the two systems is challenging
  • Performance impact: Increased latency because of dual writes; potential pipeline slowdowns

Why neither option alone was good enough

NI decided that neither option could meet all critical requirements:

  • In-place migration fell short because of unacceptable downtime and lack of support for gradual migration.
  • Rewrite-based migration fell short because of prohibitive cost overhead and complex operational management.

This analysis led NI to develop a hybrid approach that combines the advantages of both methods while mitigating and minimizing limitations.

The hybrid solution

The hybrid migration strategy was designed around five foundational elements, using AWS analytical services for orchestration, processing, and state management.

  1. Hive-to-Iceberg CDC: Automatically synchronize Hive tables with Iceberg using a custom change data capture (CDC) process to support existing consumers. Unlike traditional CDC focusing on row-level changes, the process was done at the partition-level to preserve Hive’s behavior of updating tables by overwriting partitions. This helps ensure that data consistency is maintained between Hive and Iceberg without logic changes at the migration phase, making sure that the same data exists on both tables.
  2. Continuous schema synchronization: Schema evolution during the migration introduced maintenance challenges. Automated schema sync processes compared Hive and Iceberg schemas, reconciling differences while maintaining type compatibility.
  3. Iceberg-to-Hive reverse CDC: To enable the data team to transition extract, transform, and load (ETL) jobs to write directly to Iceberg while maintaining compatibility with existing Hive-based processes not yet migrated, a reverse CDC from Iceberg to Hive was implemented. This allowed ETLs to write to Iceberg while maintaining Hive tables for downstream processes that had not yet migrated and still relied on them during the migration period.
  4. Alias management in Snowflake: Snowflake aliases made sure that Iceberg tables retained their original names, making the transition transparent to users. This approach minimized reconfiguration efforts across dependent teams and workflows.
  5. Table replacement: Swap production tables while retaining original names, completing the migration.

Technical deep dive

The migration to from Hive to Iceberg was constructed of several steps:

1. Hive-to-Iceberg CDC pipeline

Objective: Keep Hive and Iceberg tables synchronized without duplicating effort.

The preceding figure demonstrates how every partition written to the Hive table is automatically and transparently copied to the Iceberg table using a CDC process. This process makes sure that both tables are synchronized, enabling a seamless and incremental migration without disrupting downstream systems. NI chose partition-level synchronization because the legacy Hive ETL jobs already wrote updates by overwriting entire partitions and updating the partition location. Adopting that same approach in the CDC pipeline helped ensure that it remained consistent with how data was originally managed, making the migration smoother and avoiding the need to rework row-level logic.

Implementation:

  • To keep Hive and Iceberg tables synchronized without duplicating effort, a streamlined pipeline was implemented. Whenever partitions in Hive tables are updated, the AWS Glue Catalog emits events such as UpdatePartition. Amazon EventBridge captured these events, filtered them for the relevant databases and tables according to the event bridge rule, and triggered an AWS Lambda This function parsed the event metadata and sent the partition updates to an Apache Kafka topic.
  • A Spark job running on Amazon EMR consumed the messages from Kafka, which contained the updated partition details from the Data Catalog events. Using that event metadata, the Spark job queried the relevant Hive table, and wrote it to Iceberg table in Amazon S3 using the Spark Iceberg overwritePartitions API, as shown in the following example:
{
   "id":"10397e54-c049-fc7b-76c8-59e148c7cbfc",
   "detail-type":"Glue Data Catalog Table State Change",
   "source":"aws.glue",
   "time":"2024-10-27T17:16:21Z",
   "region":"us-east-1",
   "detail":{
      "databaseName":"dlk_visitor_funnel_dwh_production",
      "changedPartitions":[
         "2024-10-27"
      ],
      "typeOfChange":"UpdatePartition",
      "tableName":"fact_events"
   }
}
  • By targeting only modified partitions, the pipeline (shown in the following figure) significantly reduced the need for costly full-table rewrites. Iceberg’s robust metadata layers, including snapshots and manifest files, were seamlessly updated to capture these changes, providing efficient and accurate synchronization between Hive and Iceberg tables.

2. Iceberg-to-Hive reverse CDC pipeline

Objective: Support Hive consumers while allowing ETL pipelines to transition to Iceberg.

BDB4681-arch4

The preceding figure shows the reverse process, where every partition written to the Iceberg table is automatically and transparently copied to the Hive table using a CDC mechanism. This process helps ensure synchronization between the two systems, enabling seamless data updates for legacy systems that still rely on Hive while transitioning to Iceberg.

Implementation:

Synchronizing data from Iceberg tables back to Hive tables presented a different challenge. Unlike Hive tables, Data Catalog doesn’t track partition updates for Iceberg tables because partitions in Iceberg are managed internally and not within the catalog. This meant NI couldn’t rely on Glue Catalog events to detect partition changes.

To address this, NI implemented a solution similar to the previous flow but adapted to Iceberg’s architecture. Apache Spark was used to query Iceberg’s metadata tables—specifically the snapshots and entries tables—to identify the partitions modified since the last synchronization. The query used was:

SELECT e.data_file.partition, MAX(s.committed_at) AS last_modified_time 
FROM $target_table.snapshots JOIN $target_table.entries e ON s.snapshot_id = e.snapshot_id 
WHERE s.committed_at &amp;gt; '$last_sync_time' 
GROUP BY e.data_file.partition;

This query returned only the partitions that had been updated since the last synchronization, enabling it to focus exclusively on the changed data. Using this information, similar to the earlier process, a Spark job retrieved the updated partitions from Iceberg and wrote them back to the corresponding Hive table, providing seamless synchronization between both tables.

3. Continuous schema synchronization

Objective: Automate schema updates to maintain consistency across Hive and Iceberg.

BDB4681-arch5

The preceding figure shows how the automatic schema sync process helps ensure consistency between Hive and Iceberg tables schemas by automatically synchronizing schema changes. In this example adding the Channel column, minimizing manual work and double maintenance during the extended migration period.

 Implementation:

To handle schema changes between Hive and Iceberg, a process was implemented to detect and reconcile differences automatically. When a schema change happens in a Hive table, Data Catalog emits an UpdateTable event. This event triggers a Lambda function (routed through EventBridge), which retrieves the updated schema from Data Catalog for the Hive table and compares it to the Iceberg schema. It’s important to call out that in NI’s setup, schema changes originate from Hive because the Iceberg table is hidden behind aliases across the system. Because Iceberg is primarily used for Snowflake, a one-way sync from Hive to Iceberg is sufficient. As a result, there is no mechanism to detect or handle schema changes made directly in Iceberg, because they aren’t needed in the current workflow.

During the schema reconciliation (shown in the following figure), data types are normalized to help ensure compatibility—for example, converting Hive’s VARCHAR to Iceberg’s STRING. Any new fields or type changes are validated and applied to the Iceberg schema using a Spark job running on Amazon EMR. Amazon DynamoDB stores schema synchronization checkpoints which allow tracking changes over time and maintain consistency between the Hive and Iceberg schemas.

BDB4681-arch6

By automating this schema synchronization, maintenance overhead was significantly reduced and freed developers from manually keeping schemas in sync, making the long migration period significantly more manageable.

The preceding figure depicts an automated workflow to maintain schema consistency between Hive and Iceberg tables. AWS Glue captures table state change events from Hive, which trigger an EventBridge event. The event invokes a Lambda function that fetches metadata from DynamoDB and compares schemas fetched from AWS Glue for both Hive and Iceberg tables. If a mismatch is detected, the schema in Iceberg is updated to help ensure alignment, minimizing manual intervention and supporting smooth operation during the migration.

4. Alias management in Snowflake

Objective: Enable Snowflake consumers to adopt Iceberg without changing query references.

The preceding figure shows how Snowflake aliases enable seamless migration by mapping queries like SELECT platform, COUNT(clickouts) FROM funnel.clickouts to Iceberg tables in the Glue Catalog. Even with suffixes added during the Iceberg migration, existing queries and workflows remain unchanged, minimizing disruption for BI tools and analysts.

Implementation:

To help ensure a seamless experience for BI tools and analysts during the migration, Snowflake aliases were used to map external tables to the Iceberg metadata stored in Data Catalog. By assigning aliases that matched the original Hive table names, existing queries and reports were preserved without interruption. For example, an external table was created in Snowflake and aliased it to the original table name, as shown in the following query:

CREATE OR REPLACE ICEBERG TABLE dlk_visitor_funnel_dwh_production.aggregated_cost 
EXTERNAL_VOLUME = 's3_dlk_visitor_funnel_dwh_production_iceberg_migration' 
CATALOG = 'glue_dlk_visitor_funnel_dwh_production_iceberg_migration' 
CATALOG_TABLE_NAME = 'aggregated_cost'; 
ALTER ICEBERG TABLE dlk_visitor_funnel_dwh_production.aggregated_cost REFRESH;

When migration was completed, a simple change back to the alias was done to point to the new location or schema, making the transition seamless and minimizing any disruption to user workflows.

5. Table replacement

Objective: When all ETLs and related data workflows were successfully transitioned to use Apache Iceberg’s capabilities, and everything was functioning correctly with the synchronization flow, it was time to move on to the final phase of the migration. The primary objective was to maintain the original table names, avoiding the use of any prefixes like those employed in the earlier, intermediate migration steps. This helped ensure that the configuration remained tidy and free from unnecessary naming complications.

The preceding figure shows the table replacement to complete the migration, where Hive on Amazon EMR was used to register Parquet files as Iceberg tables while preserving original table names and avoiding data duplication, helping to ensure a seamless and tidy migration.

Implementation:

One of the challenges was that renaming tables isn’t possible within AWS Glue, which prevents the use of a straightforward renaming approach for the existing synchronization flow tables. In addition, AWS Glue doesn’t support the Migrate procedure, which creates Iceberg metadata on top of the existing data file while preserving the original table name. The strategy to overcome this limitation was to use a Hive metastore on an Amazon EMR cluster. By using Hive on Amazon EMR, NI was able to create the final tables with their original names because it operates in a separate metastore environment, giving the flexibility to define any required schema and table names without interference.

The add_files procedure was used to methodically register all the existing Parquet files, thus constructing all necessary metadata within Hive. This was a crucial step, because it helped ensure that all data files were appropriately cataloged and linked within the metastore.

The preceding figure shows the transition of a production table to Iceberg by using the add_files procedure to register existing Parquet files and create Iceberg metadata. This helped ensure a smooth migration while preserving the original data and avoiding duplication.

This setup allowed the use of existing Parquet files without duplicating data, thus saving resources. Although the sync flow used separate buckets for the final architecture, NI chose to maintain the original buckets and cleaned the intermediate files. This resulted in a different folder structure on Amazon S3. The historical data had subfolders for each partition under the root table directory, while the new Iceberg data organizes subfolders within a data folder. This difference was acceptable to avoid data duplication and preserve the original Amazon S3 buckets.

Technical recap

The AWS Glue Data Catalog served as the primary source of truth for schema and table updates, with Amazon EventBridge capturing Data Catalog events to trigger synchronization workflows. AWS Lambda parsed event metadata and managed schema synchronization, while Apache Kafka buffered events for real-time processing. Apache Spark on Amazon EMR handled data transformations and incremental updates, and Amazon DynamoDB maintained state, including synchronization checkpoints and table mappings. Finally, Snowflake seamlessly consumed Iceberg tables via aliases without disrupting existing workflows.

Migration outcome

The migration was completed with zero downtime; continuous operations were maintained throughout the migration, supporting hundreds of pipelines and dashboards without interruption. The migration was done with a cost optimized mindset with incremental updates and partition-level synchronization that minimized the usage of compute and storage resources. Lastly, NI Established a modern, vendor-neutral platform that enables scaling their evolving analytics and machine learning needs. It enables seamless integration with multiple compute and query engines, supporting flexibility and further innovation.

Conclusion

Natural intelligence migration to Apache Iceberg was a pivotal step in modernizing the company’s data infrastructure. By adopting a hybrid strategy and using the power of event-driven architectures, NI helped ensure a seamless transition that balanced innovation with operational stability. The journey underscored the importance of careful planning, understanding the data ecosystem, and focusing on an organization-first approach.

Above all, business was kept in focus and continuity prioritized the user experience. By doing so, NI unlocked the flexibility and scalability of their data lake while minimizing disruption, allowing teams to use cutting-edge analytics capabilities, positioning the company at the forefront of modern data management and readiness for the future.

If you’re considering an Apache Iceberg migration or facing similar data infrastructure challenges, we encourage you to explore the possibilities. Embrace open formats, use automation, and design with your organization’s unique needs in mind. The journey might be complex, but the rewards in scalability, flexibility, and innovation are well worth the effort. You can use the AWS prescriptive guide to help learn more about how to best use Apache Iceberg for your organization


About the Authors

Yonatan DolanYonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. Yonatan is an Apache Iceberg evangelist.

Haya Stern is a Senior Director of Data at Natural Intelligence. She leads the development of NI’s large-scale data platform, with a focus on enabling analytics, streamlining data workflows, and improving dev efficiency. In the past year, she led the successful migration from the previous data architecture to a modern lake house based on Apache Iceberg and Snowflake.

Zion Rubin is a Data Architect at Natural Intelligence with ten years of experience architecting large‑scale big‑data platforms, now focused on developing intelligent agent systems that turn complex data into real‑time business insight.

Michał Urbanowicz is a Cloud Data Engineer at Natural Intelligence with expertise in migrating data warehouses and implementing robust retention, cleanup, and monitoring processes to ensure scalability and reliability. He also develops automations that streamline and support campaign management operations in cloud-based environments.

Amazon SageMaker Lakehouse now supports attribute-based access control

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/amazon-sagemaker-lakehouse-now-supports-attribute-based-access-control/

Amazon SageMaker Lakehouse now supports attribute-based access control (ABAC) with AWS Lake Formation, using AWS Identity and Access Management (IAM) principals and session tags to simplify data access, grant creation, and maintenance. With ABAC, you can manage business attributes associated with user identities and enable organizations to create dynamic access control policies that adapt to the specific context.

SageMaker Lakehouse is a unified, open, and secure data lakehouse that now supports ABAC to provide unified access to general purpose Amazon S3 buckets, Amazon S3 Tables, Amazon Redshift data warehouses, and data sources such as Amazon DynamoDB or PostgreSQL. You can then query, analyze, and join the data using Redshift, Amazon AthenaAmazon EMR, and AWS Glue. You can secure and centrally manage your data in the lakehouse by defining fine-grained permissions with Lake Formation that are consistently applied across all analytics and machine learning(ML) tools and engines. In addition to its support for role-based and tag-based access control, Lake Formation extends support to attribute-based access to simplify data access management for SageMaker Lakehouse, with the following benefits:

  • Flexibility – ABAC policies are flexible and can be updated to meet changing business needs. Instead of creating new rigid roles, ABAC systems allow access rules to be modified by simply changing user or resource attributes.
  • Efficiency – Managing a smaller number of roles and policies is more straightforward than managing a large number of roles, reducing administrative overhead.
  • Scalability – ABAC systems are more scalable for larger enterprises because they can handle a large number of users and resources without requiring a large number of roles.

Attribute-based access control overview

Previously, within SageMaker Lakehouse, Lake Formation granted access to resources based on the identity of a requesting user. Our customers were requesting the capability to express the full complexity required for access control rules in organizations. ABAC allows for more flexible and nuanced access policies that can better reflect real-world needs. Organizations can now grant permissions on a resource based on user attribute and is context-driven. This allows administrators to grant permissions on a resource with conditions that specify user attribute keys and values. IAM principals with matching IAM or session tag key-value pairs will gain access to the resource.

Instead of creating a separate role for each team member’s access to a specific project, you can set up ABAC policies to grant access based on attributes like membership and user role, reducing the number of roles required. For instance, without ABAC, a company with an account manager role that covers five different geographical territories needs to create five different IAM roles and grant data access for only the specific territory for which the IAM role is meant. With ABAC, they can simply add those territory attributes as keys/values to the principal tag and provide data access grants based on those attributes. If the value of the attribute for a user changes, access to the dataset will automatically be invalidated.

With ABAC, you can use attributes such as department or country and use IAM or sessions tags to determine access to data, making it more straightforward to create and maintain data access grants. Administrators can define fine-grained access permissions with ABAC to limit access to databases, tables, rows, columns, or table cells.

In this post, we demonstrate how to get started with ABAC in SageMaker Lakehouse and use with various analytics services.

Solution overview

To illustrate the solution, we are going to consider a fictional company called Example Retail Corp. Example Retail’s leadership is interested in analyzing sales data in Amazon S3 to determine in-demand products, understand customer behavior, and identify trends, for better decision-making and increased profitability. The sales department sets up a team for sales analysis with the following data access requirements:

  • All data analysts in the Sales department in the US get access to only sales-specific data in only US regions
  • All BI analysts in the Sales department have full access to data in only US regions
  • All scientists in the Sales department get access to only sales-specific data across all regions
  • Anyone outside of Sales department have no access to sales data

For this post, we consider the database salesdb, which contains the store_sales table that has store sales details. The table store_sales has the following schema.

To demonstrate the product sales analysis use case, we will consider the following personas from the Example Retail Corp:

  • Ava is a data administrator in Example Retail Corp who is responsible for supporting team members with specific data permission policies
  • Alice is a data analyst who should be able to access sales specific US store data to perform product sales analysis
  • Bob is a BI analyst who should be able to access all data from US store sales to generate reports
  • Charlie is a data scientist who should be able to access sales specific across all regions to explore and find patterns for trend analysis

Ava decides to use SageMaker Lakehouse to unify data across various data sources while setting up fine-grained access control using ABAC. Alice is excited about this decision as she can now build daily reports using her expertise with Athena. Bob now knows that he can quickly build Amazon QuickSight dashboards with queries that are optimized using Redshift’s cost-based optimizer. Charlie, being an open source Apache Spark contributor, is excited that he can build Spark based processing with Amazon EMR to build ML forecasting models.

Ava defines the user attributes as static IAM tags that could also include attributes stored in the identity provider (IdP) or as session tags dynamically to represent the user metadata. These tags are assigned to IAM users or roles and can be used to define or restrict access to specific resources or data. For more details, refer to Tags for AWS Identity and Access Management resources and Pass session tags in AWS STS.

For this post, Ava assigns users with static IAM tags to represent the user attributes, including their department membership, Region assignment, and current role relationship. The following table summarizes the tags that represent user attributes and user assignment.

User Persona Attributes Access
Alice Data Analyst Department=sales
Region=US
Role=Analyst
Sales specific data in US and no access to customer data
Bob BI Analyst Department=sales
Region=US
Role=BIAnalyst
All data in US
Charlie Data Scientist Department=sales
Region=ALL
Role=Scientist
Sales specific data in All regions and no access to customer data

Ava then defines access control policies in Lake Formation that grant or restrict access to certain resources based on predefined criteria (user attributes defined using IAM tags) being satisfied. This allows for flexible and context-aware security policies where access privileges can be adjusted dynamically by modifying the user attribute assignment without changing the policy rules. The following table summarizes the policies in the Sales department.

Access User Attributes Policy
All analysts (including Alice) in US get access to sales specific data in US regions Department=sales
Region=US
Role=Analyst
Table: store_sales (store_id, transaction_date, product_name, country, sales_price, quantity columns)
Row filter: country='US'
All BI analysts (including Bob) in US get access to all data in US regions Department=sales
Region=US
Role=BIAnalyst
Table: store_sales (all columns)
Row filter: country='US'
All scientists (including Charlie) get access to sales-specific data from all regions Department=sales
Region=ALL
Role=Scientist
Table: store_sales (all rows)
Column filter: store_id, transaction_date, product_name, country, sales_price,quantity

The following diagram illustrates the solution architecture.

Implementing this solution consists of the following high-level steps. For Example Retail, Ava as a data Administrator performs these steps:

  1. Define the user attributes and assign them to the principal.
  2. Grant permission on the resources (database and table) to the principal based on user attributes.
  3. Verify the permissions by querying the data using various analytics services.

Prerequisites

To follow the steps in this post, you must complete the following prerequisites:

  1. AWS account with access to the following AWS services:
    • Amazon S3
    • AWS Lake Formation and AWS Glue Data Catalog
    • Amazon Redshift
    • Amazon Athena
    • Amazon EMR
    • AWS Identity and Access Management (IAM)
  1. Set up an admin user for Ava. For instructions, see Create a user with administrative access.
  2. Setup S3 bucket for uploading script.
  3. Set up a data lake admin. For instructions, see Create a data lake administrator.
  4. Create IAM user named Alice and attach permissions for Athena access. For instructions, refer to Data analyst permissions.
  5. Create IAM user Bob and attach permissions for Redshift access.
  6. Create IAM user Charlie and attach permissions for EMR Serverless access.
  7. Create job runtime role: scientist_role and that will be used by Charlie. For instruction refer to: Job runtime roles for Amazon EMR Serverless
  8. Setup EMR Serverless application with Lake Formation enabled. For instruction refer to: Using EMR Serverless with AWS Lake Formation for fine-grained access control
  9. Have an existing AWS Glue database or table and Amazon Simple Storage Service (Amazon) S3 bucket that holds the table data. For this post, we use salesdb as our database, store_sales as our table, and data is stored in an S3 bucket.

Define attributes for the IAM principals Alice, Bob, Charlie

Ava completes the following steps to define the attributes for the IAM principal:

  1. Log in as an admin user and navigate to the IAM console.
  2. Choose Users under Access management in the navigation pane and search for the user Alice.
  3. Choose the user and choose the Tags tab.
  4. Choose Add new tag and provide the following key pairs:
    • Key: Department and value: sales
    • Key: Region and value: US
    • Key: Role and value: Analyst
  5. Choose Save changes.
  6. Repeat the process for the user Bob and provide the following key pairs:
    • Key: Department and value: sales
    • Key: Region and value: US
    • Key: Role and value: BIAnalyst
  7. Repeat the process for the user Charlie and IAM role scientist_role and provide the following key pairs:
    • Key: Department and value: sales
    • Key: Region and value: ALL
    • Key: Role and value: Scientist

Grant permissions to Alice, Bob, Charlie using ABAC

Ava now grants database and table permissions to users with ABAC.

Grant database permissions

Complete the following steps:

  1. Ava logs in as data lake admin and navigate to the Lake Formation console.
  2. In the navigation pane, under Permissions, choose Data lake permissions.
  3. Choose Grant.
  4. On the Grant permissions page, choose Principals by attribute.
  5. Specify the following attributes:
    • Key: Department  and value: sales
    • Key: Role and value: Analyst,Scientist
  6. Review the resulting policy expression.
  7. For Permission scope, select This account.
  8. Next, choose the catalog resources to grant access:
    • For Catalogs, enter the account ID.
    • For Databases, enter salesdb.
  9. For Database permissions, select Describe.
  10. Choose Grant.

Ava now verifies the database permission by navigating to the Databases tab under the Data Catalog and searching for salesdb. Select salesdb and choose View under Actions.

Grant table permissions to Alice

Complete the following steps to create a data filter to view sales specific columns in store_sales records whose country=US:

  1. On the Lake Formation console, choose Data filters under Data Catalog in the navigation pane.
  2. Choose Create new filter.
  3. Provide the data filter name as us_sales_salesonlydata.
  4. For Target catalog, enter the account ID.
  5. For Target database, choose salesdb.
  6. For Target table, choose store_sales.
  7. For column-level access, choose Include columns: store_id, item_code, transaction_date, product_name, country, sales_price, and quantity.
  8. For Row-level access, choose Filter rows and enter the row filter country='US'.
  9. Choose Create data filter.
  1. On the Grant permissions page, choose Principals by attribute.
  2. Specify the attributes:
    • Key: Department and value: sales
    • Key: Role as value: Analyst
    • Key: Region and value: US
  3. Review the resulting policy expression.
  4. For Permission scope, select This account.
  5. Choose the catalog resources to grant access:
    • Catalogs: Account ID
    • Databases: salesdb
    • Table: store_sales
    • Data filters: us_sales
  6. For Data filter permissions, select Select.
  7. Choose Grant.

Grant table permissions to Bob

Complete the following steps to create a data filter to view only store_sales records whose country=US:

  1. On the Lake Formation console, choose Data filters under Data Catalog in the navigation pane.
  2. Choose Create new filter.
  3. Provide the data filter name as us_sales.
  4. For Target catalog, enter the account ID.
  5. For Target database, choose salesdb.
  6. For Target table, choose store_sales.
  7. Leave Column-level access as Access to all columns.
  8. For Row-level access, enter the row filter country='US'.
  9. Choose Create data filter.

Complete the following steps to grant table permissions to Bob:

  1. On the Grant permissions page, choose Principals by attribute.
  2. Specify the attributes:
    • Key: Department and value: sales
    • Key: Role as value: BIAnalyst
    • Key: Region and value: US
  3. Review the resulting policy expression.
  4. For Permission scope, select This account.
  5. Choose the catalog resources to grant access:
    • Catalogs: Account ID
    • Databases: salesdb
    • Table: store_sales
  6. For Data filter permissions, select Select.
  7. Choose Grant.

Grant table permissions to Charlie

Complete the following steps to grant table permissions to Charlie:

  1. On the Grant permissions page, choose Principals by attribute.
  2. Specify the attributes:
    1. Key: Department and value: sales
    2. Key: Role as value: Scientist
    3. Key: Region and value: ALL
  3. Review the resulting policy expression.
  4. For Permission scope, select This account
  5. Choose the catalog resources to grant access:
    1. Catalogs: Account ID
    2. Databases: salesdb
    3. Table: store_sales
  6. For Table permissions, select Select.
  7. For Data permissions, specify the following columns: store_id, transaction_date, product_name, country, sales_price, and quantity.
  8. Choose Grant.

Alice now verifies the table permission by navigating to the Tables tab under the Data Catalog and searching for store_sales. Select store_sales and choose View under Actions. The following screenshots show the details for both sets of permissions.

Data Analyst uses Athena for building daily sales reports

Alice, the data analyst logs in to the Athena console and run the following query:

select * from "salesdb"."store_sales" limit 5

Alice has the user attributes as Department=sales, Role=Analyst, Region=US, and this attribute combination allows her access to US sales data to specific sales only column, without access to customer data as shown in the following screenshot.

BI Analyst uses Redshift for building sales dashboards

Bob, the BI Analyst, logs in to the Redshift console and run the following query:

select * from "salesdb"."store_sales" limit 10

Bob has the user attributes Department=sales, Role=BIAnalyst, Region=US, and this attribute combination allows him access to all columns including customer data for US sales data.

Data Scientist uses Amazon EMR to process sales data

Finally, Charlie logs in to the EMR console and submit the EMR job with runtime role as scientist_role. Charlie uses  the script sales_analysis.py that is uploaded to s3 bucket created for the script. He chooses the EMR Serverless application created with Lake Formation enabled.

Charlie submits batch job runs by choosing the following values:

  • Name: sales_analysis_Charlie
  • Runtime_role: scientist_role
  • Script location: <s3_script_path>/sales_analysis.py
  • For spark properties, provide key as spark.emr-serverless.lakeformation.enabled and value as true.
  • Additional configurations: Under Metastore configuration select Use AWS Glue Data Catalog as metastore. Charlie keeps rest of the configuration as default.

Once the job run is completed, Charlie can view the output by selecting stdout under Driver log files.

Charlie uses scientist_role as job runtime role with the attributes Department=sales, Role=Scientist, Region=ALL, and this attribute combination allows him access to select columns of all sales data.

Clean up

Complete the following steps to delete the resources you created to avoid unexpected costs:

  1. Delete the IAM users created.
  2. Delete the AWS Glue database and table resources created for the post, if any.
  3. Delete the Athena, Redshift and EMR resources created for the post.

Conclusion

In this post, we showcased how you can use SageMaker Lakehouse attribute-based access control, using IAM principals and session tags to simplify data access, grant creation, and maintenance. With attribute-based access control, you can manage permissions using dynamic business attributes associated with user identities and secure your data in the lakehouse by defining fine-grained permissions in the Lake Formation that are enforced across analytics and ML tools and engines.

For more information, refer to documentation. We encourage you to try out the SageMaker Lakehouse with ABAC and share your feedback with us.


About the authors

Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Streamlining trace sampling behavior for AWS Lambda functions with AWS X-Ray

Post Syndicated from Joshua Smith original https://aws.amazon.com/blogs/compute/streamlining-trace-sampling-behavior-for-aws-lambda-functions-with-aws-x-ray/

Effective tracing enables developers and operators to quickly identify performance bottlenecks, troubleshoot issues across service boundaries, and make sure of optimal end-user experiences. This makes it crucial for maintaining and optimizing distributed serverless applications. This post explores the importance of distributed tracing for operating serverless applications and announces an important update to tracing behavior for AWS Lambda, which streamlines how trace context is handled in PassThrough mode. This blog post will demonstrate how this change gives you better control over how your Lambda functions handle tracing with AWS X-Ray through practical examples. Whether you’re building new applications or operating existing ones, this update helps you achieve more predictable and efficient tracing across your serverless applications built using Lambda.

Overview

Distributed serverless applications spanning numerous AWS services require robust monitoring as they scale. Traditional troubleshooting approaches fall short due to Lambda’s ephemeral nature, making it difficult for development teams to track requests across components, understand performance bottlenecks, and optimize costs by eliminating unnecessary function invocations. Without end-to-end visibility, production issues become increasingly time-consuming to resolve.

X-Ray addresses these observability challenges by providing powerful distributed tracing capabilities that help developers understand how their Lambda functions interact with other AWS services and identify performance issues. As serverless architectures grow in complexity, having fine-grained control over tracing behavior becomes crucial for maintaining efficient and cost-effective observability strategies that enable teams to effectively operate production workloads.

Lambda and X-Ray have steadily enhanced tracing capabilities in recent years to improve observability for serverless applications. In November 2022, X-Ray introduced trace linking between Amazon Simple Queue Service (Amazon SQS) and Lambda, enabling end-to-end tracing for event-driven applications. In February 2023, X-Ray added active tracing support for Amazon Simple Notification Service (Amazon SNS), allowing you to trace messages that flow through SNS topics to Lambda functions. In May 2023, X-Ray added tracing support to SnapStart-enabled Lambda functions, helping you troubleshoot and optimize the performance of latency-sensitive Java applications built using SnapStart-enabled functions. In November 2023, Lambda launched a unified experience in the Lambda console that brings together metrics, logs, and traces in a single view, allowing you to more directly troubleshoot and optimize your functions.

Building upon these enhancements, Lambda has now rolled out streamlined trace sampling behavior, which gives you better control over how your functions handle tracing with X-Ray. This launch makes an important change to tracing behavior in Lambda when the tracing configuration is set to PassThrough mode. With this launch, Lambda propagates the tracing context as is without any modifications in PassThrough mode. This means that Lambda won’t create any trace segments or subsegments for functions set to PassThrough mode, even if the incoming invocation contains a decision to sample the request. However, Lambda service does propagate the tracing context as received by the function.

This change to the X-Ray PassThrough mode for Lambda gives you more control and predictability over your tracing configuration. This enables you to optimize your tracing strategy and better understand the performance and behavior of your serverless applications. This post shows three different scenarios to demonstrate the new tracing behavior.

Understanding the Lambda/X-Ray tracing behavior: before and after

Tracing in Lambda with X-Ray is a powerful tool for gaining insights into the performance and behavior of serverless applications. Enabling tracing allows you to identify bottlenecks, troubleshoot issues, and optimize your Lambda functions. Lambda supports two tracing modes for X-Ray: Active and PassThrough. With Active tracing, Lambda automatically creates trace segments for function invocations and sends them to X-Ray. On the other hand, PassThrough mode propagates the tracing context to downstream services.

Previously, if you enabled tracing in an upstream service that invokes your function, Lambda would follow this sampling decision and send traces to X-Ray automatically, even in the case where the Lambda function was configured to use PassThrough mode. The following figure shows this process. This behavior could result in unexpected trace segments, which could become an overhead, particularly in high throughput scenarios.

Figure 1. Previous behavior: Lambda sends traces to X-Ray even when function tracing configuration is set to PassThrough

The updated X-Ray PassThrough mode for Lambda provides a more intuitive and consistent tracing experience. You can now expect Lambda to respect the incoming tracing context (if it exists) and propagate it without any modifications. In turn, downstream services can make their own tracing decisions based on their configuration. The following figure shows this updated behavior.

Figure 2. New behavior: When function tracing configuration is set to PassThrough, Lambda doesn’t send traces to X-Ray or modify sampling decision

PassThrough tracing configuration with upstream sampling

To configure your Lambda function to use PassThrough tracing mode in the console, complete the following steps:

  1. In the Lambda console, navigate to your function.
  2. On the Configuration tab, choose Monitoring and operations tools in the left pane.
  3. Confirm that X-Ray active tracing shows as Not enabled. If it’s enabled, then choose Edit.
  4. Under X-Ray, turn off Active tracing, then choose Save, as shown in the following figure.

    Figure 3. Lambda console showing function with active tracing disabled

You can also make use of the AWS Command Line Interface (AWS CLI) to achieve the aforementioned setting:

aws lambda update-function-configuration --function-name YOUR_FUNCTION_NAME --tracing-config Mode=PassThrough

This configuration allows your Lambda function to propagate the tracing context received from the upstream service without any changes. If you were previously using this configuration, then you no longer see trace segments created by the Lambda function on the X-Ray console. This configuration is useful when you want to propagate the tracing context without generating trace segments, in scenarios that need optimizing for tracing costs or overhead. The following figure shows the workflow.

Figure 4. A tracing map that shows the UpstreamFunction Lambda function isn’t displayed on the trace map, because it’s configured to use PassThrough tracing mode after this change

If you want to see trace segments for your Lambda function, then you need to set the tracing mode to Active.

Active tracing configuration

When you configure your Lambda function to use active tracing mode, and if there is no sampling decision from the upstream request, Lambda samples requests at the rate of one request per second and 5% of further requests. If there is a decision not to sample, then Lambda respects this sampling decision.

To configure your Lambda function to use active tracing mode, complete the following steps:

  1. On the Lambda console, navigate to the AWS X-Ray section on the Lambda function’s configuration page, as described in the previous section.
  2. Turn on Active tracing, then choose Save, as shown in the following figure.

    Figure 5: Lambda console showing active tracing enabled

You can also use the AWS CLI to set this configuration:

aws lambda update-function-configuration --function-name YOUR_FUNCTION_NAME --tracing-config Mode=Active

With active tracing mode, you can always see traces for sampled requests for your Lambda function on the X-Ray console. This mode is particularly useful when you want to have complete visibility into the performance and behavior of your Lambda function. The following figure shows the workflow for upstream and downstream Lambda functions with active tracing enabled.

Figure 6. A trace map showing both the UpstreamFunction and DownstreamFunction Lambda functions. This is because both functions have active tracing enabled.

The following screenshot shows a full trace corresponding to the preceding trace workflow with both upstream and downstream Lambda functions. Detailed insights gained from comprehensive tracing can be invaluable for troubleshooting, performance optimization, and understanding the end-to-end behavior of your serverless application.

Figure 7. A full trace corresponding to the preceding trace map with both upstream and downstream Lambda functions

PassThrough tracing configuration without upstream sampling

When you configure your Lambda function to use PassThrough tracing mode, and the upstream service has sampling turned off, Lambda continues to propagate the tracing context without any modifications, and without generating traces.

To configure your Lambda function to use PassThrough tracing mode, complete the following steps:

  1. On the Lambda console, navigate to the AWS X-Ray section on the Lambda function’s configuration page.
  2. Under X-Ray, turn off Active tracing, then choose Save, as shown in the following figure.

    Figure 8. Lambda console showing active tracing disabled

This configuration remains the same in the updated PassThrough configuration and is particularly useful when you want to allow downstream services to make their own tracing decisions.

Conclusion

The new streamlined trace sampling behavior for AWS Lambda functions provides you with more control and flexibility over insights into your applications. Whether you choose to use PassThrough mode with upstream sampling on or off, or active tracing mode, you can now configure your Lambda functions to handle tracing in a way that best suits your application’s needs.

This update empowers you to optimize your tracing setup, balance tracing costs and benefits, and gain valuable insights into the performance and behavior of your serverless applications.

This change in tracing behavior now applies to all new and existing functions in all AWS Regions where Lambda and AWS X-Ray are available, at no further cost. To learn more about the new tracing sampling behavior for Lambda, see the post Visualize Lambda function invocations using AWS X-Ray.

For more serverless learning resources, visit Serverless Land.

How Smartsheet reduced latency and optimized costs in their serverless architecture

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-smartsheet-reduced-latency-and-optimized-costs-in-their-serverless-architecture/

Cloud software as a service (SaaS) companies are often looking for ways to enhance their architectures for performance and cost-efficiency. Serverless technologies offload infrastructure management, allowing development teams to focus on innovation and delivering business value. As application architectures grow and face more demanding requirements, continued optimization helps maximize both the technical and financial advantages of the serverless approach.

In this post, we discuss Smartsheet’s journey optimizing its serverless architecture. We explore the solution, the stringent requirements Smartsheet faced, and how they’ve achieved an over 80% latency reduction. This technical journey offers valuable insights for organizations looking to enhance their serverless architectures with proven enterprise-grade optimization techniques.

Solution overview

Smartsheet is a leading cloud-based enterprise work management platform, enabling millions of users worldwide to plan, manage, track, automate, and report on work at scale. At the core of the platform lies an event-driven architecture that processes real-time user activity across various document types. Given the collaborative nature of the platform, multiple users can work on these documents concurrently. Every document interaction triggers a series of events that must be processed with minimal latency to maintain data consistency and provide immediate feedback. Processing delays can impact user experience and productivity, making consistently low latency a fundamental business requirement.

Smartsheet’s traffic pattern is spiky during business hours and mostly dormant during nights and weekends. Within peak periods, traffic can fluctuate as users collaborate in real time. To efficiently manage dynamic workloads, which can surge from hundreds to tens of thousands of events per second within minutes, Smartsheet implements a serverless event processing architecture using services such as Amazon Simple Queue Service (Amazon SQS) and AWS Lambda. This architecture uses the elasticity of serverless services and the ability to automatically scale dynamically based on the traffic volume. It makes sure Smartsheet can efficiently handle sudden traffic surges while automatically scaling down during off-peak hours, optimizing for both performance and cost-efficiency.

The following diagram illustrates the high-level architecture of the Smartsheet event processing pipeline.

high-level architecture of the Smartsheet event processing pipeline

Optimization opportunity

Smartsheet uses Lambda functions to serve both batch jobs and API requests. The primary runtime used for building those functions is Java. Lambda automatically scales the number of execution environments allocated to your function on demand to accommodate traffic volume. When Lambda receives an incoming request, it attempts to serve it with an existing execution environment first. If no execution environments are available, the service initializes a new one. During initialization, the Smartsheet’s function code commonly sends several requests to external dependencies, such as databases and REST APIs, which might take time to reply.

The following diagram illustrates how Lambda functions reach out to external dependencies during initialization.

Lambda functions reach out to external dependencies during initialization

These tasks introduced execution environment initialization latency, commonly referred to as a cold start. Although cold starts typically affect less than 1% of requests, Smartsheet had stringent low latency requirements for their architecture to further prioritize the best possible end-user experience.

“To reduce customer request latency while keeping costs low, our engineering team utilized Lambda provisioned concurrency with auto scaling and Graviton, which resulted in an 83% reduction in P95 latency while providing a high quality of service as we continue to scale our platform and its limits,” says Abhishek Gurunathan, Sr Director of Engineering at Smartsheet.

Addressing the cold start with provisioned concurrency

To reduce cold start latency, the Smartsheet team adopted provisioned concurrency in their architecture, a capability that allows developers to specify the number of execution environments that Lambda should keep warm to instantly handle invocations. The following diagram illustrates the difference. Without provisioned concurrency, execution environments are created on demand, which means some invocations (typically less than 1%) need to wait for the execution environment to be created and initialization code to be run. With provisioned concurrency, Lambda creates execution environments and runs initialization code preemptively, making sure invocations are served by warm execution environments.

invocations are served by warm execution environments

Provisioned concurrency includes a dynamic spillover mechanism, making your serverless architecture highly resilient to traffic spikes. When incoming traffic exceeds the preconfigured provisioned concurrency, additional requests are automatically served by on-demand concurrency rather than being throttled. This provides seamless scalability and maintains service availability even during traffic surges, while still providing the performance benefits of pre-warmed execution environments for the majority of requests.

The Smartsheet team configured provisioned concurrency to match their historical P95 concurrency needs. This resulted in immediate improvements—the number of cold starts dropped dramatically and P95 invocation latency dropped by 83%. As the team monitored system performance, they quickly identified another architecture optimization opportunity—the Lambda functions were heavily used during work hours but had significantly fewer invocations at night and on weekends, as illustrated in the following graph.

Lambda functions were heavily used during work hours but had significantly fewer invocations at night and on weekends

Setting a static provisioned concurrency configuration worked great for busy periods, but was underutilized during off-times. The Smartsheet team wanted to further fine-tune their architecture and increase provisioned concurrency utilization rates to achieve higher cost-efficiency. This led them to look into provisioned concurrency auto scaling to match traffic patterns as well as adopting an AWS Graviton architecture.

Auto scaling provisioned concurrency and Graviton architecture

Two common approaches to enable provisioned concurrency are setting a static value and using auto scaling. With static configuration, you specify a fixed number of pre-initialized execution environments that remain continuously warm to serve invocations. This approach is highly effective for architectures that handle predictable traffic patterns. Unpredictable traffic patterns, however, can lead to under-provisioning during peak periods (with spillover to on-demand concurrency resulting in more cold starts) or underutilization during low-usage periods. To address that, provisioned concurrency with auto scaling dynamically adjusts the configuration based on utilization metrics, automatically scaling the number of execution environments up or down to match the actual demand. This dynamic approach optimizes for cost-efficiency and is particularly recommended for architectures with fluctuating traffic patterns.

The following figure compares static and dynamic provisioned concurrency.

static and dynamic provisioned concurrency

To further optimize the architecture for cost-efficiency, the Smartsheet team has implemented provisioned concurrency auto scaling based on utilization metrics. Smartsheet used an infrastructure as code (IaC) approach with Terraform to define auto scaling policies for maximum reusability across hundreds of functions. The policies track the LambdaProvisionedConcurrencyUtilization metric and define the scaling threshold according to the function purpose. For functions implementing interactive APIs, the auto scale threshold is 60% utilization to pre-provision execution environments early, keeping latency extra-low, and making functions more resilient towards traffic surges. For functions that implement asynchronous data processing, Smartsheet’s goal was to achieve the highest utilization rate and cost-efficiency, so they’ve defined the auto scale threshold at 90%.

The following diagram illustrates the architecture of auto scaling policies based on provisioned concurrency utilization rate and workload type.auto scaling policies based on provisioned concurrency utilization rate and workload type

Another optimization technique Smartsheet employed was switching the CPU architecture used by their Lambda functions from x86_64 to arm64 Graviton. To achieve this, Smartsheet adopted the ARM versions of Lambda layers they’ve used, such as Datadog and Lambda Insights extensions. This was required because binaries built using one architecture might be incompatible with a different one. Because Smartsheet functions were implemented with Java and packaged as JAR files, they didn’t have any compatibility issues when moving to Graviton. With Terraform used for codifying the infrastructure, this architecture switch was a simple property change in aws_lambda_function resources, as illustrated in the following code:

property change in aws_lambda_function resources

By switching to a Graviton architecture, Smartsheet saved 20% on function GB-second costs. See AWS Lambda pricing for details.

Best practices

Use the following techniques and best practices to optimize your serverless architectures, reduce cold starts, and increase cost-efficiency:

  • Fine-tune your Lambda functions to find the optimal balance between cost and performance. Increasing memory allocation also adds CPU capacity, which often means faster execution and can lead to reduced overall costs.
  • Use a Graviton2 architecture for compatible workloads to benefit from a better price-performance ratio. Depending on the workload type, switching to Graviton can yield up to 34% improvement.
  • Use provisioned concurrency and Lambda SnapStart to reduce cold starts in your serverless architectures. Start with static provisioned concurrency based on your historical concurrency requirements, monitor utilization, and introduce auto scaling into your architecture to achieve the optimal cost-performance profile.

Conclusion

Serverless architectures using services like Lambda and Amazon SQS offload the infrastructure management and scaling concerns to AWS, allowing teams to focus on innovation and delivering business value. As Smartsheet’s journey demonstrates, using provisioned concurrency and Graviton in your architectures can help significantly improve user experience by reducing latencies while also achieving better cost-efficiency, providing a practical blueprint for optimization across the organization. Whether you’re running large-scale enterprise applications or building new cloud solutions, these proven techniques can help you unlock similar performance gains and cost-efficiencies in your serverless architectures.

To learn more about serverless architectures, see Serverless Land.


About the authors

 

Simplifying Code Documentation with Amazon Q Developer

Post Syndicated from Jehu Gray original https://aws.amazon.com/blogs/devops/simplifying-code-documentation-with-amazon-q-developer/

In the fast-paced world of software development, maintaining comprehensive documentation often falls to the bottom of priority lists in favor of delivering functionality. Amazon Q Developer’s /doc agent changes this equation by automating README generation and updates. With this tool, the variable of time spent producing documentation is reduced to the point that it’s no longer a burden to the detriment of functionality.

Understanding Amazon Q Developer’s Documentation Generation

The /doc agent leverages generative AI to analyze your codebase and generate comprehensive documentation. Additionally, the agent respects your .gitignore file and excludes files you don’t want to be included in documentation review.

Solution Overview:
As an example, imagine a cloud infrastructure team at a technology consultancy had been working for weeks on their AWS DataSync project. The solution they built provided an elegant CDK implementation that automated data transfer between Amazon S3 buckets using AWS DataSync. The lead engineer had just finished implementing the final IAM role configurations when the product manager requested comprehensive documentation for the next day’s client handoff meeting. The team realized this would typically take hours of focused work. Instead, they decided to try Amazon Q Developer /doc agent.

Getting Started with /doc:

To begin using the /doc agent, you’ll need to:

  1. Set up Amazon Q: Follow these steps
  2. Open your IDE with the Amazon Q extension installed
  3. Click the Amazon Q icon to open the chat panel
  4. Enter /doc to start the documentation process
  5. Select your documentation task:
    1. Create a new README
    2. Update an existing README with recent code changes
This image shows the user entering /doc to start the documentation process

Figure 1 – Entering /doc to start the documentation process.

Example: Creating a New README:

For projects without documentation, simply select: Create a README. It will confirm the project you are creating the README for.

This image shows the user selecting the Create a README option

Figure 2 – Select the Create a README option.

Once you verify the folder and select yes, the agent begins creating the README document for the folder.
Here are the steps it works through: scanning the source files, summarizing the source files, and generating the documentation.

This image shows the user verifying the folder and select yes

Figure 3 – Verify the folder and select yes.

When the document is created, you can preview the README file. The agent then presents you with the ability to either accept the changes or request modifications before implementation.

This image shows the preview of the created README file.

Figure 4 – Preview the created README file.

If you choose to accept, the README file is added to your project.

This image shows the user accepting the changes so the README is added to your project folder

Figure 5 – Accept the changes so the README is added to your project folder.

 

Example: Updating Documentation with Code Changes

When your code evolves, you can keep the documentation synchronized by using /doc. The agent will review your recent code modifications and suggest appropriate documentation updates.

This image shows when the user selects update an existing README

Figure 6 – Select Update an existing README to make changes to a README file.

Then you can describe the changes you want the agent to make to your README file.

This image shows how you can describe changes you’d like the agent to make

Figure 7 – Describe changes to your README files.

For targeted documentation updates, you can provide specific instructions:

This image shows the user verifying the folder

Figure 8 – Verify the folder and select yes.

Once you’ve made the changes, the agent asks you to verify them by selecting yes.

This image shows the user verifying the changes made

Figure 9 – Verify the changes and select yes.

Advanced Documentation Management

Multi-step Documentation Refinement:

This image shows the steps in Documentation Refinement

Figure 10 – Multi-step Documentation Refinement.

 

Amazon Q Developer /doc agent allows for iterative improvement of your documentation through feedback loops. After generating initial documentation, you can:

  1. Review the generated content for gaps or inaccuracies
  2. Provide specific feedback to refine particular sections
  3. Request additional sections on complex topics
  4. Gradually build comprehensive documentation through multiple iterations

This iterative approach is particularly valuable for complex projects where documentation needs to evolve alongside the codebase.

Documentation for Specific Components

For modular projects, you can create targeted documentation at different levels:

  • Root-level README for project overview
  • Component-level READMEs for specific modules
  • Service-level documentation for microservices
  • API documentation for interfaces

By combining these documentation levels, you can maintain a hierarchical documentation structure that remains manageable and specific.

Handling Documentation Inheritance

This image shows how to handle Documentation Inheritance.

Figure 11 – Handling Documentation Inheritance.

When working with derived or extended codebases:

  1. Generate base documentation for the parent project
  2. Create specialized documentation for extensions
  3. Cross-reference related documentation to maintain consistency
  4. Use the /doc agent to update specific sections when inheritance patterns change

Documentation Syncing Strategy

This image shows Documentation Syncing Strategy

Figure 12 – Documentation Syncing Strategy.

For teams working on rapidly evolving projects:

  1. Establish a documentation update schedule aligned with sprints
  2. Assign documentation reviews as part of code review processes
  3. Use /doc to generate change summaries after significant updates
  4. Implement a verification process to ensure generated documentation accurately reflects code changes

Best Practices for /doc Agent

To improve results from documentation generation with Amazon Q Developer, follow these best practices:

  1. Optimize repository size: Amazon Q Developer supports documentation generation across your codebase, accommodating projects up to the specified size limits. While documentation for larger repositories may require additional processing time and could provide more generalized results, you have the option to request documentation for specific subsets of code or individual files to receive more detailed insights.
  2. Maintain high-quality code: The quality of documentation Amazon Q Developer generates improves significantly when your code is well-commented and organized, has meaningful naming conventions for programming entities, and follows standard coding conventions.
  3. Be specific with change requests: When requesting specific README changes in natural language, choose to update an existing README and select the option to make a specific change. After initial documentation generation, you can request additional modifications by describing exactly what updates you want.
  4. Craft effective change descriptions: When describing desired updates, include:
    1. Specific sections you want to modify
    2. Exact content you want to add or remove
    3. Particular issues that need correcting
    4. How project functionality should be reflected in the README
    5. References to content available in your codebase.
  5. Understand system limitations: Amazon Q Developer doesn’t have access to private or internal platforms and might lack knowledge of third-party tools, specialized software, or custom tooling in your code. Content requiring this knowledge won’t be documented automatically. In these cases, you’ll need to manually edit the README to include information Amazon Q Developer cannot generate.

Documentation Quotas and Limitations

When working with Amazon Q Developer /doc agent, be aware of these important constraints:

  • Document generations per task: There’s a limit to the number of feedback iterations allowed per documentation session. This quota resets each time you start a new documentation task.
  • File filtering: Amazon Q Developer filters out files or folders defined in your .gitignore file. This helps streamline the documentation process by focusing only on relevant code files.

Conclusion

Amazon Q Developer /doc agent transforms the documentation process from a tedious chore to an automated, efficient workflow. By generating and maintaining READMEs based on your actual code, it ensures documentation remains accurate and up-to-date without consuming precious development time.

As part of Amazon Q Developer’s free tier, the /doc agent is readily available to integrate into your development process. Start using it today to improve your project documentation and enhance team collaboration.

About the Authors:

Jehu Gray

Jehu Gray is a Prototyping Architect at Amazon Web Services where he helps customers design solutions that fits their needs. He enjoys exploring what’s possible with IaC.

Abiola Olanrewaju

Abiola Olanrewaju is a Solutions Architect at AWS, specializing in helping Financial services customers implement scalable solutions that drive business outcomes. He has a keen interest in Data Analytics, Security and Generative AI.

Adeogo Olajide

Adeogo is a Solutions Architect at AWS, where he supports GovTech customers and other public sector customers in their cloud transformation journey. He specializes in designing secure, scalable, and compliant architectures that help public sector organizations modernize their digital services. Outside of work, he enjoys playing and watching soccer.

Joyce Muya

Joyce Muya is a Solutions Architect at AWS where she supports Enterprise Engaged customers in the media and entertainment sector. She specializes in Analytics and AI/ML workloads.

Damola Oluyemo

Damola Oluyemo is a Solutions Architect at Amazon Web Services focused on Enterprise customers. He helps customers design cloud solutions while exploring the potential of Infrastructure as Code and generative AI in software development.