Developing strategies to navigate the evolving digital sovereignty landscape is a top priority for organizations operating across industries and in the public sector. With data privacy, security, and compliance requirements becoming increasingly complex, organizations are seeking cloud solutions that provide sovereign controls and flexibility. Recently, Max Peterson, Amazon Web Services (AWS) Vice President of Sovereign Cloud, sat down with Daniel Newman, CEO of The Futurum Group and co-founder of Six Five Media, to explore how customers are meeting their unique digital sovereignty needs with AWS. Their thought-provoking conversation delves into the factors that are driving digital sovereignty strategies, the key considerations for customers, and AWS offerings that are designed to deliver control, choice, security, and resilience in the cloud. The podcast includes a discussion of AWS innovations, including the AWS Nitro System, AWS Dedicated Local Zones, AWS Key Management Service External Key Store, and the upcoming AWS European Sovereign Cloud. Check out the episode to gain valuable insights that can help you effectively navigate the digital sovereignty landscape while unlocking the full potential of cloud computing.
Amazon Q Developer is a generative artificial intelligence (AI) powered conversational assistant that can help you understand, build, extend, and operate AWS applications. You can ask questions about AWS architecture, your AWS resources, best practices, documentation, support, and more.
With Amazon Q Developer in your IDE, you can write a comment in natural language that outlines a specific task, such as, “Upload a file with server-side encryption.” Based on this information, Amazon Q Developer recommends one or more code snippets directly in the IDE that can accomplish the task. You can quickly and easily accept the top suggestions (tab key), view more suggestions (arrow keys), or continue writing your own code.
However, Amazon Q Developer in the IDE is more than just a code completion plugin. Amazon Q Developer is a generative AI (GenAI) powered assistant for software development that can be used to have a conversation about your code, get code suggestions, or ask questions about building software. This provides the benefits of collaborative paired programming, powered by GenAI models that have been trained on billions of lines of code, from the Amazon internal code-base and publicly available sources.
The challenge
At the 2024 AWS Summit in Sydney, an exhilarating code challenge took center stage, pitting a Blue Team against a Red Team, with approximately 10 to 15 challengers in each team, in a battle of coding prowess. The challenge consisted of 20 tasks, starting with basic math and string manipulation, and progressively escalating in difficulty to include complex algorithms and intricate ciphers.
The Blue Team had a distinct advantage, leveraging the powerful capabilities of Amazon Q Developer, the most capable generative AI-powered assistant for software development. With Q Developer’s guidance, the Blue Team navigated increasingly complex tasks with ease, tapping into Q Developer’s vast knowledge base and problem-solving abilities. In contrast, the Red Team competed without assistance, relying solely on their own coding expertise and problem-solving skills to tackle daunting challenges.
As the competition unfolded, the two teams battled it out, each striving to outperform the other. The Blue Team’s efficient use of Amazon Q Developer proved to be a game-changer, allowing them to tackle the most challenging tasks with remarkable speed and accuracy. However, the Red Team’s sheer determination and technical prowess kept them in the running, showcasing their ability to think outside the box and devise innovative solutions.
The culmination of the code challenge was a thrilling finale, with both teams pushing the boundaries of their skills and ultimately leaving the audience in a state of admiration for their remarkable achievements.
The graph shows average completion times: Team Blue (“Q Developer”) completed more questions in less time across the board than Team Red (“Solo Coder”). Within the 1-hour time limit, Team Blue got all the way to Question 19, whereas Team Red only got to Question 16.
Some assumptions and caveats apply. People who consider themselves very experienced programmers were encouraged to choose Team Red and not use AI, to test themselves against Team Blue, who used AI. The code challenges were designed to test the output of applying logic. They were specifically designed to be passable without Amazon Q Developer, so that they tested how much Amazon Q Developer speeds up writing logical code. As a result, the code tasks worked well with Amazon Q Developer due to the nature and underlying training of the Amazon Q Developer models. Many people who attended the event were not Python programmers (we constrained the challenge to Python only), and walked away impressed at how much of the challenge they could complete.
The following is an example of one of the more complex questions competitors were given to solve:
Implement the rail fence cipher.
In the Rail Fence cipher, the message is written downwards on successive "rails" of an imaginary fence, then moving up when we get to the bottom (like a zig-zag). Finally the message is then read off in rows.
For example, using three "rails" and the message "WE ARE DISCOVERED FLEE AT ONCE", the cipherer writes out:
W . . . E . . . C . . . R . . . L . . . T . . . E
. E . R . D . S . O . E . E . F . E . A . O . C .
. . A . . . I . . . V . . . D . . . E . . . N . .
Then reads off: WECRLTEERDSOEEFEAOCAIVDEN
Given variable a. Use a three-rail fence cipher so that result is equal to the decoded message of variable a.
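One possible solution to this task can be sketched in Python as follows. This is an illustration only, not the challenge's reference answer; the function names `rail_pattern`, `encode_rail_fence`, and `decode_rail_fence` are chosen here for the example, and the sample value assigned to `a` is the cipher text from the worked example above.

```python
def rail_pattern(length, rails):
    """Return the rail index (0-based) visited at each position of the zig-zag."""
    if rails < 2:
        return [0] * length
    cycle = 2 * (rails - 1)  # one full down-and-up sweep of the fence
    pattern = []
    for i in range(length):
        pos = i % cycle
        pattern.append(pos if pos < rails else cycle - pos)
    return pattern


def encode_rail_fence(plaintext, rails=3):
    """Write the text along the zig-zag, then read it off rail by rail."""
    pattern = rail_pattern(len(plaintext), rails)
    return "".join(
        ch for rail in range(rails)
        for ch, r in zip(plaintext, pattern) if r == rail
    )


def decode_rail_fence(ciphertext, rails=3):
    """Invert the cipher by working out which position each character came from."""
    pattern = rail_pattern(len(ciphertext), rails)
    # A stable sort of positions by rail gives the order the cipher text was read in.
    order = sorted(range(len(ciphertext)), key=lambda i: pattern[i])
    plain = [""] * len(ciphertext)
    for ch, idx in zip(ciphertext, order):
        plain[idx] = ch
    return "".join(plain)


# Sample input taken from the worked example (spaces removed before encoding).
a = "WECRLTEERDSOEEFEAOCAIVDEN"
result = decode_rail_fence(a, 3)
print(result)
```

The decoder never rebuilds the fence grid. It only needs to know which rail each position belongs to, so the same helper serves both encoding and decoding.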
The questions were both algorithmic and logical in nature, which made them great for testing conversational natural language capability to solve questions using Amazon Q Developer, or by applying one’s own logic to write code to solve the question.
Top scoring individual per team:
Team | Total questions completed | Individual time (min)
With Q Developer (Blue Team) | 19 | 30.46
Solo Coder (Red Team) | 16 | 58.06
By comparing the top two competitors, and considering the solo coder was a highly experienced programmer versus the top Q Developer coder, who was a relatively new programmer not familiar with Python, you can see the efficiency gain when using Q Developer as an AI peer programmer. It took the entire 60 minutes for the solo coder to complete 16 questions, whereas the Q Developer coder got to the final question (Question 20, incomplete) in half of the time.
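To make the comparison concrete, the throughput of the two top scorers can be computed from the reported figures. The sketch below uses only the questions completed and times from the results above; the helper name `questions_per_minute` is introduced here for illustration.

```python
def questions_per_minute(questions_completed, minutes):
    """Average number of questions completed per minute."""
    return questions_completed / minutes


# Figures reported for the top scoring individual on each team.
blue_rate = questions_per_minute(19, 30.46)  # with Q Developer
red_rate = questions_per_minute(16, 58.06)   # solo coder

ratio = blue_rate / red_rate
print(f"Blue Team: {blue_rate:.2f} questions/min")
print(f"Red Team:  {red_rate:.2f} questions/min")
print(f"Throughput ratio (Blue / Red): {ratio:.2f}")
```

This compares a single top scorer per team from one event, so the ratio is indicative rather than a controlled measurement.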
Summary
Integrating advanced IDE features and adopting paired programming have significantly improved coding efficiency and quality. However, the introduction of Amazon Q Developer has taken this evolution to new heights. By tapping into Q Developer’s vast knowledge base and problem-solving capabilities, the Blue Team was able to navigate complex coding challenges with remarkable speed and accuracy, outperforming the unassisted Red Team. This highlights the transformative impact of leveraging generative AI as a collaborative pair programmer in modern software development, delivering greater efficiency, problem-solving, and, ultimately, higher-quality code. Get started with Amazon Q Developer for your IDE by installing the plugin and enabling your builder ID today.
On August 19th, 2024, Gartner published its first Magic Quadrant for AI Code Assistants, which includes Amazon Web Services (AWS). Amazon Q Developer qualified for inclusion, having launched in general availability on April 30, 2024. AWS was ranked as a Leader for its ability to execute and completeness of vision.
We believe this Leader placement reflects our rapid pace of innovation, which makes the whole software development lifecycle easier and increases developer productivity with enterprise-grade access controls and security.
The Gartner Magic Quadrant evaluates 12 AI code assistants based on their Ability to Execute, which measures a vendor’s capacity to deliver its products or services effectively, and Completeness of Vision, which assesses a vendor’s understanding of the market and its strategy for future growth, according to Gartner’s report, How Markets and Vendors Are Evaluated in Gartner Magic Quadrants.
Here is the graphical representation of the 2024 Gartner Magic Quadrant for AI Code Assistants.
Here is the quote from Gartner’s report:
Amazon Web Services (AWS) is a Leader in this Magic Quadrant. Its product, Amazon Q Developer (formerly CodeWhisperer), is focused on assisting and automating developer tasks using AI. For example, Amazon Q Developer helps with code suggestions and transformation, testing and security, as well as feature development. Its operations are geographically diverse, and its clients are of all sizes. AWS is focused on delivering AI-driven solutions that enhance the software development life cycle (SDLC), automating complex tasks, optimizing performance, ensuring security, and driving innovation.
My team focuses on creating content on Amazon Q Developer that directly supports software developers’ jobs-to-be-done, enabled and enhanced by generative AI in Amazon Q Developer Center and Community.aws.
I’ve had the chance to talk with our customers to ask why they choose Amazon Q Developer. They said it helps them accelerate and complete tasks across much more of the SDLC than general AI code assistants, from coding, testing, and upgrading, to troubleshooting, performing security scanning and fixes, optimizing AWS resources, and creating data engineering pipelines.
Here are the highlights that customers talked about most often:
Customizing code recommendations – You can get code recommendations based on your internal code base. Amazon Q Developer accelerates onboarding to a new code base to generate even more relevant inline code recommendations and chat responses (in preview) by making it aware of your internal libraries, APIs, best practices, and architectural patterns. Your organization’s administrators can securely connect Amazon Q Developer to your internal code bases to create multiple customizations. According to National Australia Bank (NAB), NAB has now added specific suggestions using the Amazon Q customization capability that are tailored to the NAB coding standards. They’re seeing increased acceptance rates of 60 percent with customization. To learn more, visit Customizing suggestions in the AWS documentation.
Upgrading your Java applications – Amazon Q Developer Agent for code transformation automates the process of upgrading and transforming your legacy Java applications. According to an internal Amazon study, Amazon has migrated tens of thousands of production applications from Java 8 or 11 to Java 17 with assistance from Amazon Q Developer. This represents a savings of over 4,500 years of development work for over a thousand developers (when compared to manual upgrades) and performance improvements worth $260 million in annual cost savings. Transformations from Windows to cross-platform .NET are also coming soon! To learn more, visit Upgrading language versions with the Amazon Q Developer Agent for code transformation in the AWS documentation.
Gartner Magic Quadrant for AI Code Assistants, Arun Batchu, Philip Walsh, Matt Brasier, Haritha Khandabattu, 19 August, 2024.
Gartner does not endorse any vendor, product or service depicted in its research publications and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
GARTNER is a registered trademark and service mark of Gartner and Magic Quadrant is a registered trademark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and are used herein with permission. All rights reserved.
The current scaling approach of Amazon Redshift Serverless increases your compute capacity based on the query queue time and scales down when the queuing reduces on the data warehouse. However, you might need to automatically scale compute resources based on factors like query complexity and data volume to meet price-performance targets, irrespective of query queuing. To address this requirement, Redshift Serverless launched the artificial intelligence (AI)-driven scaling and optimization feature, which scales the compute not only based on the queuing, but also factoring data volume and query complexity.
In this post, we describe how Redshift Serverless utilizes the new AI-driven scaling and optimization capabilities to address common use cases. This post also includes example SQLs, which you can run on your own Redshift Serverless data warehouse to experience the benefits of this feature.
Solution overview
The AI-powered scaling and optimization feature in Redshift Serverless provides a user-friendly visual slider to set your desired balance between price and performance. By moving the slider, you can choose between optimized for cost, balanced performance and cost, or optimized for performance. Based on where you position the slider, Amazon Redshift will automatically add or remove resources to ensure better behavior and perform other AI-driven optimizations like automatic materialized views and automatic table design optimization to meet your selected price-performance target.
The slider offers the following options:
Optimized for cost – Prioritizes cost savings. Redshift attempts to automatically scale up compute capacity when doing so doesn’t incur additional charges. It also attempts to scale down compute for lower cost, despite longer runtime.
Balanced – Offers balance between performance and cost. Redshift scales for performance with a moderate cost increase.
Optimized for performance – Prioritizes performance. Redshift scales aggressively for maximum performance, potentially incurring higher costs.
In the following sections, we illustrate how the AI-driven scaling and optimization feature can intelligently predict your workload compute needs and scale proactively for three scenarios:
Use case 1 – A long-running complex query. Compute scales based on query complexity.
Use case 2 – A sudden spike in ingestion volume (a three-fold increase, from 720 million to 2.1 billion). Compute scales based on data volume.
Use case 3 – A data lake query scanning large datasets (TBs). Compute scales based on the expected data to be scanned from the data lake. The expected data scan is predicted by machine learning (ML) models based on prior historical run statistics.
In the existing auto scaling mechanism, the use cases don’t increase compute capacity automatically unless queuing is identified across the instance.
Prerequisites
To follow along, complete the following prerequisites:
We use TPC-DS 1TB Cloud Data Warehouse Benchmark data to demonstrate this feature. Run the SQL statements to create tables and load the TPC-DS 1TB data.
Use case 1: Scale compute based on query complexity
The following query analyzes product sales across multiple channels such as websites, wholesale, and retail stores. This complex query typically takes about 25 minutes to run with the default 128 RPUs. Let’s run this workload on the preview workgroup created as part of prerequisites.
When a query is run for the first time, the AI scaling system may make a suboptimal decision regarding resource allocation or scaling because the system is still learning the query and data characteristics. However, the system learns from this experience, and when the same query is run again, it can make a more optimal scaling decision. Therefore, if the query didn’t scale during the first run, rerun it. You can monitor the RPU capacity used on the Redshift Serverless console or by querying the SYS_SERVERLESS_USAGE system view.
The results cache is turned off in the following queries to avoid fetching results from the cache.
SET enable_result_cache_for_session TO off;
with /* TPC-DS demo query */
ws as
(select d_year AS ws_sold_year, ws_item_sk, ws_bill_customer_sk
ws_customer_sk, sum(ws_quantity) ws_qty, sum(ws_wholesale_cost) ws_wc,
sum(ws_sales_price) ws_sp from web_sales left join web_returns on
wr_order_number=ws_order_number and ws_item_sk=wr_item_sk join date_dim
on ws_sold_date_sk = d_date_sk where wr_order_number is null group by
d_year, ws_item_sk, ws_bill_customer_sk ),
cs as
(select d_year AS cs_sold_year,
cs_item_sk, cs_bill_customer_sk cs_customer_sk, sum(cs_quantity) cs_qty,
sum(cs_wholesale_cost) cs_wc, sum(cs_sales_price) cs_sp from catalog_sales
left join catalog_returns on cr_order_number=cs_order_number and cs_item_sk=cr_item_sk
join date_dim on cs_sold_date_sk = d_date_sk where cr_order_number is
null group by d_year, cs_item_sk, cs_bill_customer_sk ),
ss as
(select
d_year AS ss_sold_year, ss_item_sk, ss_customer_sk, sum(ss_quantity)
ss_qty, sum(ss_wholesale_cost) ss_wc, sum(ss_sales_price) ss_sp
from store_sales left join store_returns on sr_ticket_number=ss_ticket_number
and ss_item_sk=sr_item_sk join date_dim on ss_sold_date_sk = d_date_sk
where sr_ticket_number is null group by d_year, ss_item_sk, ss_customer_sk
)
select
ss_customer_sk,round(ss_qty/(coalesce(ws_qty+cs_qty,1)),2)
ratio,ss_qty store_qty, ss_wc store_wholesale_cost, ss_sp store_sales_price,
coalesce(ws_qty,0)+coalesce(cs_qty,0) other_chan_qty,coalesce(ws_wc,0)+coalesce(cs_wc,0)
other_chan_wholesale_cost,coalesce(ws_sp,0)+coalesce(cs_sp,0) other_chan_sales_price
from ss left join ws on (ws_sold_year=ss_sold_year and ws_item_sk=ss_item_sk
and ws_customer_sk=ss_customer_sk)left join cs on (cs_sold_year=ss_sold_year
and cs_item_sk=ss_item_sk and cs_customer_sk=ss_customer_sk) where coalesce(ws_qty,0)>0
and coalesce(cs_qty, 0)>0 order by ss_customer_sk, ss_qty desc, ss_wc
desc, ss_sp desc, other_chan_qty, other_chan_wholesale_cost, other_chan_sales_price,
round(ss_qty/(coalesce(ws_qty+cs_qty,1)),2);
When the query is complete, run the following SQL to capture the start and end times of the query, which will be used in the next query:
select query_id,query_text,start_time,end_time, elapsed_time/1000000.0 duration_in_seconds
from sys_query_history
where query_text like '%TPC-DS demo query%'
and query_text not like '%sys_query_history%'
order by start_time desc
Let’s assess the compute scaled during the preceding start_time and end_time period. Replace start_time and end_time in the following query with the output of the preceding query:
select * from sys_serverless_usage
where end_time >= 'start_time'
and end_time <= DATEADD(minute,1,'end_time')
order by end_time asc
-- Example
--select * from sys_serverless_usage
--where end_time >= '2024-06-03 00:17:12.322353'
--and end_time <= DATEADD(minute,1,'2024-06-03 00:19:11.553218')
--order by end_time asc
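If you prefer to fill in the placeholders programmatically, a small helper can build the usage query from the two timestamps. This is a minimal sketch: the name `build_usage_query` is hypothetical, and it only produces the SQL text, leaving it to you to run the query with whichever client you already use.

```python
from datetime import datetime

# Same statement as above, with the two placeholders made explicit.
USAGE_QUERY_TEMPLATE = """select * from sys_serverless_usage
where end_time >= '{start}'
and end_time <= DATEADD(minute,1,'{end}')
order by end_time asc"""

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def build_usage_query(start_time: datetime, end_time: datetime) -> str:
    """Return the sys_serverless_usage query for one query's run window."""
    if end_time < start_time:
        raise ValueError("end_time must not be earlier than start_time")
    # Formatting datetime objects (rather than raw strings) keeps the
    # substituted values limited to well-formed timestamps.
    return USAGE_QUERY_TEMPLATE.format(
        start=start_time.strftime(TIMESTAMP_FORMAT),
        end=end_time.strftime(TIMESTAMP_FORMAT),
    )


# Timestamps from the commented example above.
query = build_usage_query(
    datetime(2024, 6, 3, 0, 17, 12, 322353),
    datetime(2024, 6, 3, 0, 19, 11, 553218),
)
print(query)
```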
The following screenshot shows an example output.
You can notice the increase in compute over the duration of this query. This demonstrates how Redshift Serverless scales based on query complexity.
Use case 2: Scale compute based on data volume
Let’s consider the web_sales ingestion job. For this example, your daily ingestion job processes 720 million records and completes in an average of 2 minutes. This is what you ingested in the prerequisite steps.
Due to some event (such as month-end processing), your volume increased three-fold and your ingestion job now needs to process 2.1 billion records. With the existing scaling approach, this would increase your ingestion job runtime unless the queue time is enough to invoke additional compute resources. But with AI-driven scaling, in performance optimized mode, Amazon Redshift automatically scales compute to complete your ingestion job within usual runtimes. This helps protect your ingestion SLAs.
Run the following job to ingest 2.1 billion records into the web_sales table:
copy web_sales from 's3://redshift-downloads/TPC-DS/2.13/3TB/web_sales/' iam_role default gzip delimiter '|' EMPTYASNULL region 'us-east-1';
Run the following query to compare the duration of ingesting 2.1 billion records and 720 million records. Both ingestion jobs completed in approximately the same time, despite the three-fold increase in volume.
select query_id,table_name,data_source,loaded_rows,duration/1000000.0 duration_in_seconds , start_time,end_time
from sys_load_history
where
table_name='web_sales'
order by start_time desc
Run the following query with the start times and end times from the previous output:
select * from sys_serverless_usage
where end_time >= 'start_time'
and end_time <= DATEADD(minute,1,'end_time')
order by end_time asc
The following is an example output. You can notice the increase in compute capacity for the ingestion job that processes 2.1 billion records. This illustrates how Redshift Serverless scaled based on data volume.
Use case 3: Scale data lake queries
In this use case, you create external tables pointing to TPC-DS 3TB data in an Amazon Simple Storage Service (Amazon S3) location. Then you run a query that scans a large volume of data to demonstrate how Redshift Serverless can automatically scale compute capacity as needed.
In the following SQL, provide the ARN of the default IAM role you attached in the prerequisites:
-- Create external schema
create external schema ext_tpcds_3t
from data catalog
database ext_tpcds_db
iam_role '<ARN of the default IAM role attached>'
create external database if not exists;
Create external tables by running DDL statements in the following SQL file. You should see seven external tables in the query editor under the ext_tpcds_3t schema, as shown in the following screenshot.
Run the following query using external tables. As mentioned in the first use case, if the query didn’t scale during the first run, it is recommended to rerun the query, because the system will have learned from the previous experience and can potentially provide better scaling and performance for the subsequent run.
The results cache is turned off in the following queries to avoid fetching results from the cache.
SET enable_result_cache_for_session TO off;
with /* TPC-DS demo data lake query */
ws as
(select d_year AS ws_sold_year, ws_item_sk, ws_bill_customer_sk
ws_customer_sk, sum(ws_quantity) ws_qty, sum(ws_wholesale_cost) ws_wc,
sum(ws_sales_price) ws_sp from ext_tpcds_3t.web_sales left join ext_tpcds_3t.web_returns on
wr_order_number=ws_order_number and ws_item_sk=wr_item_sk join ext_tpcds_3t.date_dim
on ws_sold_date_sk = d_date_sk where wr_order_number is null group by
d_year, ws_item_sk, ws_bill_customer_sk ),
cs as
(select d_year AS cs_sold_year,
cs_item_sk, cs_bill_customer_sk cs_customer_sk, sum(cs_quantity) cs_qty,
sum(cs_wholesale_cost) cs_wc, sum(cs_sales_price) cs_sp from ext_tpcds_3t.catalog_sales
left join ext_tpcds_3t.catalog_returns on cr_order_number=cs_order_number and cs_item_sk=cr_item_sk
join ext_tpcds_3t.date_dim on cs_sold_date_sk = d_date_sk where cr_order_number is
null group by d_year, cs_item_sk, cs_bill_customer_sk ),
ss as
(select
d_year AS ss_sold_year, ss_item_sk, ss_customer_sk, sum(ss_quantity)
ss_qty, sum(ss_wholesale_cost) ss_wc, sum(ss_sales_price) ss_sp
from ext_tpcds_3t.store_sales left join ext_tpcds_3t.store_returns on sr_ticket_number=ss_ticket_number
and ss_item_sk=sr_item_sk join ext_tpcds_3t.date_dim on ss_sold_date_sk = d_date_sk
where sr_ticket_number is null group by d_year, ss_item_sk, ss_customer_sk)
SELECT ss_customer_sk,round(ss_qty/(coalesce(ws_qty+cs_qty,1)),2)
ratio,ss_qty store_qty, ss_wc store_wholesale_cost, ss_sp store_sales_price,
coalesce(ws_qty,0)+coalesce(cs_qty,0) other_chan_qty,coalesce(ws_wc,0)+coalesce(cs_wc,0) other_chan_wholesale_cost,coalesce(ws_sp,0)+coalesce(cs_sp,0) other_chan_sales_price
FROM ss left join ws on (ws_sold_year=ss_sold_year and ws_item_sk=ss_item_sk and ws_customer_sk=ss_customer_sk) left join cs on (cs_sold_year=ss_sold_year and cs_item_sk=ss_item_sk and cs_customer_sk=ss_customer_sk)
where coalesce(ws_qty,0)>0
and coalesce(cs_qty, 0)>0
order by ss_customer_sk, ss_qty desc, ss_wc desc, ss_sp desc, other_chan_qty, other_chan_wholesale_cost, other_chan_sales_price, round(ss_qty/(coalesce(ws_qty+cs_qty,1)),2);
Review the total elapsed time of the query. You need the start_time and end_time from the results to feed into the next query.
select query_id,query_text,start_time,end_time, elapsed_time/1000000.0 duration_in_seconds
from sys_query_history
where query_text like '%TPC-DS demo data lake query%'
and query_text not like '%sys_query_history%'
order by start_time desc
Run the following query to see how compute scaled during the preceding start_time and end_time period. Replace start_time and end_time in the following query with the output of the preceding query:
select * from sys_serverless_usage
where end_time >= 'start_time'
and end_time <= DATEADD(minute,1,'end_time')
order by end_time asc
The following screenshot shows an example output.
The increased compute capacity for this data lake query shows that Redshift Serverless can scale to match the data being scanned. This demonstrates how Redshift Serverless can dynamically allocate resources based on query needs.
Considerations when choosing your price-performance target
You can use the price-performance slider to choose your desired price-performance target for your workload. The AI-driven scaling and optimizations provide holistic optimizations using the following models:
Query prediction models – These determine the actual resource needs (memory, CPU consumption, and so on) for each individual query
Scaling prediction models – These predict how the query would behave on different capacity sizes
Let’s consider a query that takes 7 minutes and costs $7. The following figure shows the query runtimes and cost with no scaling.
A given query might scale in a few different ways, as shown below. Based on the price-performance target you chose on the slider, AI-driven scaling predicts how the query trades off performance and cost, and scales it accordingly.
The slider options yield the following results:
Optimized for cost – When you choose Optimized for cost, the warehouse scales up only if doing so results in no additional cost, or a lower cost, to the user. In the preceding example, the superlinear scaling approach demonstrates this behavior. Scaling will only occur if it can be done in a cost-effective manner according to the scaling model predictions. If the scaling models predict that cost-optimized scaling isn’t possible for the given workload, then the warehouse won’t scale.
Balanced – With the Balanced option, the system will scale in favor of performance and there will be a cost increase, but it will be a limited increase in cost. In the preceding example, the linear scaling approach demonstrates this behavior.
Optimized for performance – With the Optimized for performance option, the system will scale in favor of performance even though the costs are higher and non-linear. In the preceding example, the sublinear scaling approach demonstrates this behavior. The closer the slider position is to the Optimized for performance position, the more sublinear scaling is permitted.
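The three scaling behaviors can be illustrated with a little arithmetic on the 7-minute, $7 query. The sketch below is a simplified model, not how Redshift Serverless computes charges: it assumes cost is proportional to capacity multiplied by runtime, and the 2x capacity multiplier and the scaled runtimes are hypothetical values chosen for illustration. The function names are also introduced here.

```python
import math


def scaled_cost(base_cost, base_runtime, capacity_multiplier, new_runtime):
    """Cost after scaling, assuming cost is proportional to capacity x runtime."""
    return base_cost * capacity_multiplier * new_runtime / base_runtime


def classify_scaling(base_runtime, capacity_multiplier, new_runtime):
    """Compare the achieved speedup with the extra capacity that was added."""
    speedup = base_runtime / new_runtime
    if math.isclose(speedup, capacity_multiplier):
        return "linear"
    return "superlinear" if speedup > capacity_multiplier else "sublinear"


BASE_RUNTIME_MIN = 7.0   # from the example: the query takes 7 minutes
BASE_COST_USD = 7.0      # from the example: the query costs $7
CAPACITY_MULTIPLIER = 2  # hypothetical: double the compute capacity

# Hypothetical runtimes after scaling, one for each behavior.
for new_runtime in (3.0, 3.5, 5.0):
    kind = classify_scaling(BASE_RUNTIME_MIN, CAPACITY_MULTIPLIER, new_runtime)
    cost = scaled_cost(BASE_COST_USD, BASE_RUNTIME_MIN, CAPACITY_MULTIPLIER, new_runtime)
    print(f"{new_runtime} min at {CAPACITY_MULTIPLIER}x capacity: {kind}, about ${cost:.2f}")
```

Under this simplified model, superlinear scaling lowers the cost, perfectly linear scaling leaves it unchanged, and sublinear scaling raises it. Real workloads rarely scale perfectly, so actual runtimes and costs will differ.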
The following are additional points to note:
The price-performance slider options are dynamic and can be changed at any time. However, the impact of a change isn’t realized immediately; it takes effect as the system learns how to better scale the current workload and any additional workloads.
The price-performance slider, Max capacity, and Max RPU-hours are designed to work together. Max capacity limits the maximum RPUs the data warehouse is allowed to scale to, and Max RPU-hours limits the maximum RPU hours it is allowed to consume. These controls are always honored and enforced regardless of the settings on the price-performance target slider.
The AI-driven scaling and optimization feature dynamically adjusts compute resources to optimize query runtime speed while adhering to your price-performance requirements. It considers factors such as query queueing, concurrency, volume, and complexity. The system can either run queries on a compute resource with lower concurrent queries or spin up additional compute resources to avoid queueing. The goal is to provide the best price-performance balance based on your choices.
Monitoring
You can monitor the RPU scaling in the following ways:
Review the RPU capacity used graph on the Amazon Redshift console.
Monitor the ComputeCapacity metric under AWS/Redshift-Serverless and Workgroup in Amazon CloudWatch.
Query the SYS_QUERY_HISTORY view, providing the specific query ID or query text to identify the time period. Use this time period to query the SYS_SERVERLESS_USAGE system view to find the compute_capacity value. The compute_capacity field shows the RPUs scaled during the query runtime.
Delete the Redshift Serverless associated namespace.
Conclusion
In this post, we discussed how to optimize your workloads to scale based on the changes in data volume and query complexity. We demonstrated an approach to implement more responsive, proactive scaling with the AI-driven scaling feature in Redshift Serverless. Try this feature in your environment, conduct a proof of concept on your specific workloads, and share your feedback with us.
About the Authors
Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.
Ashish Agrawal is a Principal Product Manager with Amazon Redshift, building cloud-based data warehouses and analytics cloud services. Ashish has over 25 years of experience in IT. Ashish has expertise in data warehouses, data lakes, and platform as a service. Ashish has been a speaker at worldwide technical conferences.
Davide Pagano is a Software Development Manager with Amazon Redshift based out of Palo Alto, specialized in building cloud-based data warehouses and analytics cloud services solutions. He has over 10 years of experience with databases, including 6 years focused on Amazon Redshift.
On November 1, 2023, the New York State Department of Financial Services (NYDFS) issued its Second Amendment (the Amendment) to its Cybersecurity Requirements for Financial Services Companies adopted in 2017, published within Section 500 of 23 NYCRR 500 (the Cybersecurity Requirements; the Cybersecurity Requirements as amended by the Amendment, the Amended Cybersecurity Requirements). In the introduction to its Cybersecurity Resource Center, the Department explains that the revisions are aimed at addressing the changes in the increasing sophistication of threat actors, the prevalence of and relative ease in running cyberattacks, and the availability of additional controls to manage cyber risks.
This blog post focuses on the revision to the encryption in transit requirement under section 500.15(a). It outlines the encryption capabilities and secure connectivity options offered by Amazon Web Services (AWS) to help customers demonstrate compliance with this updated requirement. The post also provides best practices guidance, emphasizing the shared responsibility model. This enables organizations to design robust data protection strategies that address not only the updated NYDFS encryption requirements but potentially also other security standards and regulatory requirements.
The target audience for this information includes security leaders, architects, engineers, and security operations team members and risk, compliance, and audit professionals.
Note that the information provided here is for informational purposes only; it is not legal or compliance advice and should not be relied on as legal or compliance advice. Customers are responsible for making their own independent assessments and should obtain appropriate advice from their own legal and compliance advisors regarding compliance with applicable NYDFS regulations.
500.15 Encryption of nonpublic information
The updated requirement in the Amendment states that:
As part of its cybersecurity program, each covered entity shall implement a written policy requiring encryption that meets industry standards, to protect nonpublic information held or transmitted by the covered entity both in transit over external networks and at rest.
To the extent a covered entity determines that encryption of nonpublic information at rest is infeasible, the covered entity may instead secure such nonpublic information using effective alternative compensating controls that have been reviewed and approved by the covered entity’s CISO in writing. The feasibility of encryption and effectiveness of the compensating controls shall be reviewed by the CISO at least annually.
This section of the Amendment removes the covered entity’s chief information security officer’s (CISO) discretion to approve compensating controls when encryption of nonpublic information in transit over external networks is deemed infeasible. The Amendment mandates that, effective November 2024, organizations must encrypt nonpublic information transmitted over external networks without the option of implementing alternative compensating controls. While the use of security best practices such as network segmentation, multi-factor authentication (MFA), and intrusion detection and prevention systems (IDS/IPS) can provide defense in depth, these compensating controls are no longer sufficient to replace encryption in transit over external networks for nonpublic information.
However, the Amendment still allows for the CISO to approve the use of alternative compensating controls where encryption of nonpublic information at rest is deemed infeasible. AWS is committed to providing industry-standard encryption services and capabilities to help protect customer data at rest in the cloud, offering customers the ability to add layers of security to their data at rest, providing scalable and efficient encryption features. This includes the following services:
Flexible key management options, including AWS Key Management Service (AWS KMS), which allow you to choose whether to have AWS manage the encryption keys or keep complete control over your keys.
Dedicated, hardware-based cryptographic key storage using AWS CloudHSM, to help you adhere to compliance requirements.
While the above highlights encryption-at-rest capabilities offered by AWS, the focus of this blog post is to provide guidance and best practice recommendations for encryption in transit.
AWS guidance and best practice recommendations
Cloud network traffic encompasses connections to and from the cloud and traffic between cloud service provider (CSP) services. From an organization’s perspective, CSP networks and data centers are deemed external because they aren’t under the organization’s direct control. The connection between the organization and a CSP, typically established over the internet or dedicated links, is considered an external network. Encrypting data in transit over these external networks is crucial and should be an integral part of an organization’s cybersecurity program.
AWS implements multiple mechanisms to help ensure the confidentiality and integrity of customer data during transit and at rest across various points within its environment. While AWS employs transparent encryption at various transit points, we strongly recommend incorporating encryption by design into your architecture. AWS provides robust encryption-in-transit capabilities to help you adhere to compliance requirements and mitigate the risks of unauthorized disclosure and modification of nonpublic information in transit over external networks.
Additionally, AWS recommends that financial services institutions adopt a secure by design (SbD) approach to implement architectures that are pre-tested from a security perspective. SbD helps establish control objectives, security baselines, security configurations, and audit capabilities for workloads running on AWS.
Security and Compliance is a shared responsibility between AWS and the customer. Shared responsibility can vary depending on the security configuration options for each service. You should carefully consider the services you choose because your organization’s responsibilities vary depending on the services used, the integration of those services into your IT environment, and applicable laws and regulations. AWS provides resources such as service user guides and AWS Customer Compliance Guides, which map security best practices for individual services to leading compliance frameworks, including NYDFS.
Protecting connections to and from AWS
We understand that customers place a high priority on privacy and data security. That’s why AWS gives you ownership and control over your data through services that allow you to determine where your content will be stored, secure your content in transit and at rest, and manage access to AWS services and resources for your users. When architecting workloads on AWS, classifying data based on its sensitivity, criticality, and compliance requirements is essential. Proper data classification allows you to implement appropriate security controls and data protection mechanisms, such as Transport Layer Security (TLS) at the application layer, access control measures, and secure network connectivity options for nonpublic information over external networks. When it comes to transmitting nonpublic information over external networks, it’s a recommended practice to identify network segments traversed by this data based on your network architecture. While AWS employs transparent encryption at various transit points, it’s advisable to implement encryption solutions at multiple layers of the OSI model to establish defense in depth and enhance end-to-end encryption capabilities. Although requirement 500.15 of the Amendment doesn’t mandate end-to-end encryption, implementing such controls can provide an added layer of security and can help demonstrate that nonpublic information is consistently encrypted during transit.
AWS offers several options to achieve this. While not every option provides end-to-end encryption on its own, using them in combination helps to ensure that nonpublic information doesn’t traverse open, public networks unprotected. These options include:
Client-side encryption of data before sending it to AWS
AWS Direct Connect with MACsec encryption
AWS Direct Connect provides direct connectivity to the AWS network through third-party colocation facilities, using a cross-connect between an AWS owned device and either a customer- or partner-owned device. Direct Connect can reduce network costs, increase bandwidth throughput, and provide a more consistent network experience than internet-based connections. Within Direct Connect connections (a physical construct) there will be one or more virtual interfaces (VIFs). These are logical entities and are reflected as industry-standard 802.1Q VLANs on the customer equipment terminating the Direct Connect connection. Depending on the type of VIF, they will use either public or private IP addressing. There are three different types of VIFs:
Public virtual interface – Establish connectivity between AWS public endpoints and your data center, office, or colocation environment.
Transit virtual interface – Establish private connectivity between AWS Transit Gateways and your data center, office, or colocation environment. AWS Transit Gateway is an AWS managed, highly available and scalable regional network transit hub used to interconnect Amazon Virtual Private Cloud (Amazon VPC) and customer networks.
Private virtual interface – Establish private connectivity between Amazon VPC resources and your data center, office, or colocation environment.
By default, a Direct Connect connection isn’t encrypted from your premises to the Direct Connect location because AWS cannot assume your on-premises device supports the MACsec protocol. With MACsec, Direct Connect delivers native, near line-rate, point-to-point encryption, ensuring that data communications between AWS and your corporate network remain protected. MACsec is supported on 10 Gbps and 100 Gbps dedicated Direct Connect connections at selected points of presence. Using Direct Connect with MACsec-enabled connections and combining it with the transparent physical network encryption offered by AWS from the Direct Connect location through the AWS backbone not only benefits you by allowing you to securely exchange data with AWS, but also enables you to use the highest available bandwidth. For additional information on MACsec support and cipher suites, see the MACsec section in the Direct Connect FAQs.
Figure 1: Sample architecture for using Direct Connect with MACsec encryption
In the sample architecture, Layer 2 encryption through MACsec only encrypts traffic between your on-premises systems and the AWS device in the Direct Connect location. To get closer to end-to-end encryption, you need to consider additional encryption at Layer 3, 4, or 7 that extends to the device where you’re comfortable having the packets decrypted. In the next section, let’s review an option for network layer encryption using AWS Site-to-Site VPN.
Direct Connect with Site-to-Site VPN
AWS Site-to-Site VPN is a fully managed service that creates a secure connection between your corporate network and your Amazon VPC using IP security (IPsec) tunnels over the internet. Data transferred between your VPC and the remote network routes over an encrypted VPN connection to help maintain the confidentiality and integrity of data in transit. Each VPN connection consists of two tunnels between a virtual private gateway or transit gateway on the AWS side and a customer gateway on the on-premises side. Each tunnel supports up to 1.25 Gbps of throughput. See Site-to-Site VPN quotas for more information.
You can use Site-to-Site VPN over Direct Connect to achieve a secure IPsec connection with the low latency and consistent network experience of Direct Connect when reaching resources in your Amazon VPCs.
Figure 2: Encrypted connections between the AWS Cloud and a customer’s network using VPN
While Direct Connect with MACsec and Site-to-Site VPN with IPsec can provide encryption at the data link and network layers respectively, they primarily secure the data in transit between your on-premises network and the AWS network boundary. To further enhance the coverage for end-to-end encryption, it is advisable to use TLS encryption. In the next section, let’s review mechanisms for securing API endpoints on AWS using TLS encryption.
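One way to act on the recommendation to identify the network segments that nonpublic information traverses is to write the path down as data and check each external segment for at least one encryption layer. The following is a minimal sketch of that idea only. The `Segment` class, the helper function, and the sample path (including the deliberately unprotected third segment) are hypothetical and chosen for illustration; they don't come from an AWS API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One hop on the path that nonpublic information travels (hypothetical model)."""
    name: str
    external: bool  # True if the segment is outside your direct control
    encryption_layers: List[str] = field(default_factory=list)


def unencrypted_external_segments(path: List[Segment]) -> List[str]:
    """Return the names of external segments that have no encryption layer."""
    return [s.name for s in path if s.external and not s.encryption_layers]


# Hypothetical path: the last segment is an intentional gap for demonstration.
path = [
    Segment("on-premises router to Direct Connect location", True,
            ["MACsec (Layer 2)", "IPsec VPN (Layer 3)"]),
    Segment("Direct Connect location to AWS Region", True,
            ["AWS physical-layer encryption", "IPsec VPN (Layer 3)"]),
    Segment("legacy batch job to partner over plain FTP", True, []),
]

gaps = unencrypted_external_segments(path)
print(gaps)
```

A list like this can double as documentation for auditors, because it records which control covers which segment and makes any gap explicit.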
Secure API endpoints
APIs act as the front door for applications to access data, business logic, or functionality from other applications and backend services.
While requests to public AWS service API endpoints use HTTPS by default, a few services, such as Amazon S3 and Amazon DynamoDB, allow using either HTTP or HTTPS. If the client or application chooses HTTP, the communication isn’t encrypted. Customers are responsible for enforcing HTTPS connections when using such AWS services. To help ensure secure communication, you can establish an identity perimeter by using the IAM policy condition key aws:SecureTransport in your IAM roles to evaluate the connection and mandate HTTPS usage.
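As an illustration of the aws:SecureTransport condition key, the following sketch builds a policy document that denies any Amazon S3 request that isn't sent over HTTPS. It uses a resource-based bucket policy, which is one common place to apply the key. The helper function and the bucket name are hypothetical, and the statement is a starting point to adapt, not a complete policy for your environment.

```python
import json


def deny_insecure_transport_policy(bucket_name: str) -> dict:
    """Build an S3 bucket policy that denies requests not sent over HTTPS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" when the request used plain HTTP.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# "example-nonpublic-data" is a hypothetical bucket name.
policy = deny_insecure_transport_policy("example-nonpublic-data")
print(json.dumps(policy, indent=2))
```

Because an explicit Deny overrides any Allow, this statement blocks HTTP requests regardless of what other policies grant.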
As enterprises increasingly adopt cloud computing and microservices architectures, teams frequently build and manage internal applications exposed as private API endpoints. Customers are responsible for managing the certificates on private customer-owned endpoints. AWS helps you deploy private customer-owned identities (that is, TLS certificates) through the use of AWS Certificate Manager (ACM) private certificate authorities (PCA) and the integration with AWS services that offer private customer-owned TLS termination endpoints.
ACM is a fully managed service that lets you provision, manage, and deploy public and private TLS certificates for use with AWS services and internal connected resources. ACM minimizes the time-consuming manual process of purchasing, uploading, and renewing TLS certificates. You can provide certificates for your integrated AWS services either by issuing them directly using ACM or by importing third-party certificates into the ACM management system. ACM offers two options for deploying managed X.509 certificates. You can choose the best one for your needs.
AWS Certificate Manager (ACM) – This service is for enterprise customers who need a secure web presence using TLS. ACM certificates are deployed through Elastic Load Balancing (ELB), Amazon CloudFront, Amazon API Gateway, and other integrated AWS services. The most common application of this type is a secure public website with significant traffic requirements. ACM also helps to simplify security management by automating the renewal of expiring certificates.
AWS Private Certificate Authority (Private CA) – This service is for enterprise customers building a public key infrastructure (PKI) inside the AWS Cloud and is intended for private use within an organization. With AWS Private CA, you can create your own certificate authority (CA) hierarchy and issue certificates with it for authenticating users, computers, applications, services, servers, and other devices. Certificates issued by a private CA cannot be used on the internet. For more information, see the AWS Private CA User Guide.
You can use a centralized API gateway service, such as Amazon API Gateway, to securely expose customer-owned private API endpoints. API Gateway is a fully managed service that allows developers to create, publish, maintain, monitor, and secure APIs at scale. With API Gateway, you can create RESTful APIs and WebSocket APIs, enabling near real-time, two-way communication applications. API Gateway operations must be encrypted in-transit using TLS, and require the use of HTTPS endpoints. You can use API Gateway to configure custom domains for your APIs using TLS certificates provisioned and managed by ACM. Developers can optionally choose a specific TLS version for their custom domain names. For use cases that require mutual TLS (mTLS) authentication, you can configure certificate-based mTLS authentication on your custom domains.
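On the client side, an application that calls such an endpoint can enforce its own TLS floor and, where mTLS is configured, present a client certificate. The following is a minimal sketch using the Python standard library `ssl` module. The choice of TLS 1.2 as the minimum version is an assumption made for illustration, and the function name and certificate file parameters are hypothetical.

```python
import ssl
from typing import Optional


def build_tls_client_context(client_cert: Optional[str] = None,
                             client_key: Optional[str] = None) -> ssl.SSLContext:
    """Create a client-side TLS context that verifies the server and sets a version floor."""
    # The default context verifies the server certificate chain and hostname.
    context = ssl.create_default_context()
    # Refuse to negotiate anything older than TLS 1.2 (assumed floor for this sketch).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # For mutual TLS, present a client certificate and private key (PEM file paths).
    if client_cert and client_key:
        context.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return context


context = build_tls_client_context()
print(context.minimum_version, context.verify_mode, context.check_hostname)
```

The resulting context can be passed to standard library clients, for example as the `context` argument of `urllib.request.urlopen` or `http.client.HTTPSConnection`.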
Pre-encryption of data to be sent to AWS
Depending on the risk profile and sensitivity of the data that’s being transferred to AWS, you might want to choose encrypting data in an application running on your corporate network before sending it to AWS (client-side encryption). AWS offers a variety of SDKs and client-side encryption libraries to help you encrypt and decrypt data in your applications. You can use these libraries with the cryptographic service provider of your choice, including AWS Key Management Service or AWS CloudHSM, but the libraries do not require an AWS service.
The AWS Encryption SDK is a client-side encryption library that you can use to encrypt and decrypt data in your application and is available in several programming languages, including a command-line interface. You can use the SDK to encrypt your data before you send it to an AWS service. The SDK offers advanced data protection features, including envelope encryption and additional authenticated data (AAD). It also offers secure, authenticated, symmetric key algorithm suites, such as 256-bit AES-GCM with key derivation and signing.
The AWS Database Encryption SDK is a set of software libraries developed in open source that enable you to include client-side encryption in your database design. The SDK provides record-level encryption solutions. You specify which fields are encrypted and which fields are included in the signatures that help ensure the authenticity of your data. Encrypting your sensitive data in transit and at rest helps ensure that your plaintext data isn’t available to a third party, including AWS. The AWS Database Encryption SDK for DynamoDB is designed especially for DynamoDB applications. It encrypts the attribute values in each table item using a unique encryption key. It then signs the item to protect it against unauthorized changes, such as adding or deleting attributes or swapping encrypted values. After you create and configure the required components, the SDK transparently encrypts and signs your table items when you add them to a table. It also verifies and decrypts them when you retrieve them. Searchable encryption in the AWS Database Encryption SDK enables you to search encrypted records without decrypting the entire database. This is accomplished by using beacons, which create a map between the plaintext value written to a field and the encrypted value that is stored in your database. For more information, see the AWS Database Encryption SDK Developer Guide.
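To make the beacon idea more concrete, the following sketch shows the general technique of a truncated keyed hash: a short HMAC tag is stored next to each encrypted value so that an equality search can be run on the tag instead of the plaintext. This is a simplified illustration of the concept using only the Python standard library, not the SDK's implementation. The key, the 16-bit tag length, the record layout, and the opaque ciphertext placeholders are all hypothetical.

```python
import hashlib
import hmac
from typing import Dict, List


def beacon(key: bytes, value: str, length_bits: int = 16) -> str:
    """Return the first `length_bits` bits of HMAC-SHA256(key, value) as hex."""
    if not 1 <= length_bits <= 256:
        raise ValueError("length_bits must be between 1 and 256")
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    # Keep only the leading bits; truncation makes distinct values collide on purpose.
    truncated = int.from_bytes(digest, "big") >> (256 - length_bits)
    return format(truncated, f"0{(length_bits + 3) // 4}x")


def find_candidates(records: List[Dict], key: bytes, search_value: str) -> List[Dict]:
    """Return records whose stored beacon matches the beacon of the search value."""
    target = beacon(key, search_value)
    return [r for r in records if hmac.compare_digest(r["beacon"], target)]


# Hypothetical key and records; the ciphertext fields are opaque placeholders.
beacon_key = b"hypothetical-beacon-key-material"
records = [
    {"id": 1, "ciphertext": b"<opaque-1>", "beacon": beacon(beacon_key, "alice@example.com")},
    {"id": 2, "ciphertext": b"<opaque-2>", "beacon": beacon(beacon_key, "bob@example.com")},
]

candidates = find_candidates(records, beacon_key, "alice@example.com")
print([r["id"] for r in candidates])
```

Because the tag is truncated, a search can return false positives, so the client still decrypts the candidate records and filters them locally. That trade-off limits how much the stored tag reveals about the plaintext.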
The Amazon S3 Encryption Client is a client-side encryption library that enables you to encrypt an object locally to help ensure its security before passing it to Amazon S3. It integrates seamlessly with the Amazon S3 APIs to provide a straightforward solution for client-side encryption of data before uploading to Amazon S3. After you instantiate the Amazon S3 Encryption Client, your objects are automatically encrypted and decrypted as part of your Amazon S3 PutObject and GetObject requests. Your objects are encrypted with a unique data key. You can use both the Amazon S3 Encryption Client and server-side encryption to encrypt your data. The Amazon S3 Encryption Client is supported in a variety of programming languages and supports industry-standard algorithms for encrypting objects and data keys. For more information, see the Amazon S3 Encryption Client developer guide.
Encryption in-transit inside AWS
AWS implements responsible and sophisticated technical and physical controls that are designed to help prevent unauthorized access to or disclosure of your content. To protect data in transit, traffic traversing through the AWS network that is outside of AWS physical control is transparently encrypted by AWS at the physical layer. This includes traffic between AWS Regions (except China Regions), traffic between Availability Zones, and between Direct Connect locations and Regions through the AWS backbone network.
Network segmentation
When you create an AWS account, AWS offers a virtual networking option to launch resources in a logically isolated virtual network, Amazon Virtual Private Cloud (Amazon VPC). A VPC is limited to a single AWS Region and every VPC has one or more subnets. VPCs can be connected externally using an internet gateway (IGW), VPC peering connection, VPN, Direct Connect, or Transit Gateways. Traffic within your VPC is considered internal because you have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.
As a customer, you maintain ownership of your data, and you select which AWS services can process, store, and host your data, and you choose the Regions in which your data is stored. AWS doesn’t automatically replicate data across Regions, unless you choose to do so. Data transmitted over the AWS global network between Regions and Availability Zones is automatically encrypted at the physical layer before leaving AWS secured facilities. Cross-Region traffic that uses Amazon VPC and Transit Gateway peering is automatically bulk-encrypted when it exits a Region.
Encryption between instances
AWS provides secure and private connectivity between Amazon Elastic Compute Cloud (Amazon EC2) instances of all types. The Nitro System, a combination of purpose-built server designs, data processors, system management components, and specialized firmware, is the underlying foundation for EC2 instances launched since the beginning of 2018. Instance types that use the offload capabilities of the underlying Nitro System hardware automatically encrypt in-transit traffic between instances. This encryption uses Authenticated Encryption with Associated Data (AEAD) algorithms with 256-bit encryption and has no impact on network performance. To support this additional in-transit traffic encryption between instances, instances must be of supported instance types, in the same Region, and in the same VPC or peered VPCs. For a list of supported instance types and additional requirements, see Encryption in transit.
Conclusion
The Second Amendment to the NYDFS Cybersecurity Regulation underscores the criticality of safeguarding nonpublic information during transmission over external networks. By mandating encryption for data in transit and eliminating the option for compensating controls, the Amendment reinforces the need for robust, industry-standard encryption measures to protect the confidentiality and integrity of sensitive information.
AWS provides a comprehensive suite of encryption services and secure connectivity options that enable you to design and implement robust data protection strategies. The transparent encryption mechanisms that AWS has built into services across its global network infrastructure, secure API endpoints with TLS encryption, and services such as Direct Connect with MACsec encryption and Site-to-Site VPN, can help you establish secure, encrypted pathways for transmitting nonpublic information over external networks.
By embracing the principles outlined in this blog post, financial services organizations can address not only the updated NYDFS encryption requirements for section 500.15(a) but can also potentially demonstrate their commitment to data security across other security standards and regulatory requirements.
Threat intelligence that can fend off security threats before they happen requires not just smarts, but the speed and worldwide scale that only AWS can offer.
Organizations around the world trust Amazon Web Services (AWS) with their most sensitive data. One of the ways we help secure data on AWS is with an industry-leading threat intelligence program where we identify and stop many kinds of malicious online activities that could harm or disrupt our customers or our infrastructure. Producing accurate, timely, actionable, and scalable threat intelligence is a responsibility we take very seriously, and is something we invest significant resources in.
Customers increasingly ask us where our threat intelligence comes from, what types of threats we see, how we act on what we observe, and what they need to do to protect themselves. Questions like these indicate that Chief Information Security Officers (CISOs)—whose roles have evolved from being primarily technical to now being a strategic, business-oriented function—understand that effective threat intelligence is critical to their organizations’ success and resilience. This blog post is the first of a series that begins to answer these questions and provides examples of how AWS threat intelligence protects our customers, partners, and other organizations.
High-fidelity threat intelligence that can only be achieved at the global scale of AWS
Every day across AWS infrastructure, we detect and thwart cyberattacks. With the largest public network footprint of any cloud provider, AWS has unparalleled insight into certain activities on the internet, in real time. For threat intelligence to have meaningful impact on security, large amounts of raw data from across the internet must be gathered and quickly analyzed. In addition, false positives must be purged. For example, threat intelligence findings could erroneously indicate an insider threat when an employee is logged accessing sensitive data after working hours, when in reality, that employee may have been tasked with a last-minute project and had to work overnight. Producing threat intelligence is very time consuming and requires substantial human and digital resources. Artificial intelligence (AI) and machine learning can help analysts sift through and analyze vast amounts of data. However, without the ability to collect and analyze relevant information across the entire internet, threat intelligence is not very useful. Even for organizations that are able to gather actionable threat intelligence on their own, without the reach of global-scale cloud infrastructure, it’s difficult or impossible for time-sensitive information to be collectively shared with others at a meaningful scale.
The AWS infrastructure radically transforms threat intelligence: the sheer number of intelligence signals (notifications generated by our security tools) we can observe significantly boosts threat intelligence accuracy, which we refer to as high fidelity. And we constantly improve our ability to observe and react to threat actors’ evolving tactics, techniques, and procedures (TTPs) as we discover and monitor potentially harmful activities through MadPot, our sophisticated globally-distributed network of honeypot threat sensors with automated response capabilities.
With our global network and internal tools such as MadPot, we receive and analyze thousands of different kinds of event signals in real time. For example, MadPot observes more than 100 million potential threats every day around the world, with approximately 500,000 of those observed activities classified as malicious. This means high-fidelity findings (pieces of relevant information) produce valuable threat intelligence that can be acted on quickly to protect customers around the world from harmful and malicious online activities. Our high-fidelity intelligence also generates real-time findings that are ingested into our intelligent threat detection security service Amazon GuardDuty, which automatically detects threats for millions of AWS accounts.
AWS’s Mithra ranks domain trustworthiness to help protect customers from threats
Let’s dive deeper. Identification of malicious domains (names that resolve to IP addresses on the internet) is crucial to effective threat intelligence. GuardDuty generates various kinds of findings (potential security issues such as anomalous behaviors) when AWS customers interact with domains, with each domain being assigned a reputation score derived from a variety of metrics that rank trustworthiness. Why this ranking? Because maintaining a high-quality list of malicious domain names is crucial to monitoring cybercriminal behavior so that we can protect customers. How do we accomplish the huge task of ranking? First, imagine a graph so large (perhaps one of the largest in existence) that it’s impossible for a human to view and comprehend the entirety of its contents, let alone derive usable insights.
Meet Mithra. Named after a mythological rising sun, Mithra is a massive internal neural network graph model, developed by AWS, that uses algorithms for threat intelligence. With its 3.5 billion nodes and 48 billion edges, Mithra’s reputation scoring system is tailored to identify malicious domains that customers come in contact with, so the domains can be ranked accordingly. We observe a significant number of DNS requests per day—up to 200 trillion in a single AWS Region alone—and Mithra detects an average of 182,000 new malicious domains daily. By assigning a reputation score that ranks every domain name queried within AWS on a daily basis, Mithra’s algorithms help AWS rely less on third parties for detecting emerging threats, and instead generate better knowledge, produced more quickly than would be possible if we used a third party.
Mithra is not only able to detect malicious domains with remarkable accuracy and fewer false positives, but this super graph is also capable of predicting malicious domains days, weeks, and sometimes even months before they show up on threat intel feeds from third parties. This world-class capability means that we can see and act on millions of security events and potential threats every day.
By scoring domain names, Mithra can be used in the following ways:
A high-confidence list of previously unknown malicious domain names can be used in security services like GuardDuty to help protect our customers. GuardDuty also allows customers to block malicious domains and get alerts for potential threats.
Services that use third-party threat feeds can use Mithra’s scores to significantly reduce false positives.
AWS security analysts can use scores for additional context as part of security investigations.
Sharing our high-fidelity threat intelligence with customers so they can protect themselves
Not only is our threat intelligence used to seamlessly enrich security services that AWS and our customers rely on, we also proactively reach out to share critical information with customers and other organizations that we believe may be targeted or potentially compromised by malicious actors. Sharing our threat intelligence enables recipients to assess information we provide, take steps to reduce their risk, and help prevent disruptions to their business.
For example, using our threat intelligence, we notify organizations around the world if we identify that their systems are potentially compromised by threat actors or appear to be running misconfigured systems vulnerable to exploits or abuse, such as open databases. Cybercriminals are constantly scanning the internet for exposed databases and other vulnerabilities, and the longer a database remains exposed, the higher the risk that malicious actors will discover and exploit it. In certain circumstances when we receive signals that suggest a third-party (non-customer) organization may be compromised by a threat actor, we also notify them because doing so can help head off further exploitation, which promotes a safer internet at large.
Often, when we alert customers and others to these kinds of issues, it’s the first time they become aware that they are potentially compromised. After we notify organizations, they can investigate and determine the steps they need to take to protect themselves and help prevent incidents that could cause disruptions to their organization or allow further exploitation. Our notifications often also include recommendations for actions organizations can take, such as to review security logs for specific domains and block them, implement mitigations, change configurations, conduct a forensic investigation, install the latest patches, or move infrastructure behind a network firewall. These proactive actions help organizations to get ahead of potential threats, rather than just reacting after an incident occurs.
Sometimes, the customers and other organizations we notify contribute information that in turn helps us assist others. After an investigation, if an affected organization provides us with related indicators of compromise (IOCs), this information can be used to improve our understanding of how a compromise occurred. This understanding can lead to critical insights we may be able to share with others, who can use it to take action to improve their security posture—a virtuous cycle that helps promote collaboration aimed at improving security. For example, information we receive may help us learn how a social engineering attack or particular phishing campaign was used to compromise an organization’s security to install malware on a victim’s system. Or, we may receive information about a zero-day vulnerability that was used to perpetrate an intrusion, or learn how a remote code execution (RCE) attack was used to run malicious code and other malware to steal an organization’s data. We can then use and share this intelligence to protect customers and other third parties. This type of collaboration and coordinated response is more effective when organizations work together and share resources, intelligence, and expertise.
Three examples of AWS high-fidelity threat intelligence in action
Example 1: We became aware of suspicious activity when our MadPot sensors indicated unusual network traffic known as backscatter (potentially unwanted or unintended network traffic that is often associated with a cyberattack) that contained known IOCs associated with a specific threat attempting to move across our infrastructure. The network traffic appeared to be originating from the IP space of a large multinational food service industry organization and flowing to Eastern Europe, suggesting potential malicious data exfiltration. Our threat intelligence team promptly contacted the security team at the affected organization, which wasn’t an AWS customer. They were already aware of the issue but believed they had successfully addressed and removed the threat from their IT environment. However, our sensors indicated that the threat was persistent and ongoing. We requested an immediate escalation, and during a late-night phone call, the AWS CISO shared real-time security logs with the CISO of the impacted organization to show that large amounts of data were still being suspiciously exfiltrated and that urgent action was necessary. The CISO of the affected company agreed and engaged their Incident Response (IR) team, which we worked with to successfully stop the threat.
Example 2: Earlier this year, Volexity published research detailing two zero-day vulnerabilities in the Ivanti Connect Secure VPN, resulting in the publication of CVE-2023-46805 (an authentication-bypass vulnerability) and CVE-2024-21887 (a command-injection vulnerability found in multiple web components). The U.S. Cybersecurity and Infrastructure Security Agency (CISA) issued a cybersecurity advisory on February 29, 2024 on this issue. Earlier this year, Amazon security teams enhanced our MadPot sensors to detect attempts by malicious actors to exploit these vulnerabilities. Using information obtained by the MadPot sensors, Amazon identified multiple active exploitation campaigns targeting vulnerable Ivanti Connect Secure VPNs. We also published related intelligence in the GuardDuty common vulnerabilities and exposures (CVE) feed, enabling our customers who use this service to detect and stop this activity if it is present in their environment. (For more on CVSS metrics, see the National Institute of Standards and Technology (NIST) Vulnerability Metrics.)
Example 3: Around the time Russia began its invasion of Ukraine in 2022, Amazon proactively identified infrastructure that Russian threat groups were creating to use for phishing campaigns against Ukrainian government services. Our intelligence findings were integrated into GuardDuty to automatically protect AWS customers while also providing the information to the Ukrainian government for their own protection. After the invasion, Amazon identified IOCs and TTPs of Russian cyber threat actors that appeared to target certain technology supply chains that could adversely affect Western businesses opposed to Russia’s actions. We worked with the targeted AWS customers to thwart potentially harmful activities and help prevent supply chain disruption from taking place.
AWS operates the most trusted cloud infrastructure on the planet, which gives us a unique view of the security landscape and the threats our customers face every day. We are encouraged by how our efforts to share our threat intelligence have helped customers and other organizations be more secure, and we are committed to finding even more ways to help. Upcoming posts in this series will include other threat intelligence topics such as mean time to defend, our internal tool Sonaris, and more.
Development teams adopt DevOps practices to increase the speed and quality of their software delivery. The DevOps Research and Assessment (DORA) metrics provide a popular method to measure progress towards that outcome. Using four key metrics, senior leaders can assess the current state of team maturity and address areas of optimization.
This blog post shows you how to make use of DORA metrics for your Amazon Web Services (AWS) environments. We share a sample solution which allows you to bootstrap automatic metric collection in your AWS accounts.
Benefits of collecting DORA metrics
DORA metrics offer insights into your development teams’ performance and capacity by measuring qualitative aspects of deployment speed and stability. They also indicate the teams’ ability to adapt by measuring the average time to recover from failure. This helps product owners in defining work priorities, establishing transparency on team maturity, and developing a realistic workload schedule. The metrics are appropriate for communication with senior leadership. They help commit leadership support to resolve systemic issues inhibiting team satisfaction and user experience.
Use case
This solution is applicable to the following use case:
Development teams have a multi-account AWS setup including a tooling account where the CI/CD tools are hosted, and an operations account for log aggregation and visualization.
Developers use GitHub code repositories and AWS CodePipeline to promote code changes across application environment accounts.
Service impairment resulting from system change is logged as an OpsItem in AWS Systems Manager OpsCenter.
Overview of solution
The four key DORA metrics
The ‘four keys’ measure team performance and ability to react to problems:
Deployment Frequency measures the frequency of successful change releases in your production environment.
Lead Time For Changes measures the average time for committed code to reach production.
Change Failure Rate measures how often changes in production lead to service incidents/failures, and is complementary to Mean Time Between Failure.
Mean Time To Recovery measures the average time from service interruption to full recovery.
The first two metrics focus on deployment speed, while the other two indicate deployment stability (Figure 1). We recommend that organizations set their own goals (that is, DORA metric targets) based on service criticality and customer needs. For a discussion of prior DORA benchmark data and what it reveals about the performance of development teams, consult How DORA Metrics Can Measure and Improve Performance.
For example, the Change Failure Rate focuses on changes that impair the production system. Limiting the calculation to tags (such as hotfixes) on pull requests would exclude issues related to the build process. It’s important to match system change records that lead to actual impairments in production. Limiting the calculation to the number of failed deployments from the deployment pipeline only considers deployments that didn’t reach production. We use AWS Systems Manager OpsCenter as the system of records for change-related outages, rather than relying solely on data from CI/CD tools.
Similarly, Mean Time To Recovery measures the duration from a service impairment in production to a successful pipeline run. We encourage teams to track both pipeline status and recovery time, as frequent pipeline failure can indicate insufficient local testing and potential pipeline engineering issues.
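To make the four definitions concrete, the following is a minimal sketch of how the metrics could be computed from deployment and incident records. The `Deployment` and `Incident` record shapes, the function name `four_keys`, and the sample data are illustrative assumptions and are not part of the sample solution, which performs the final calculation in Amazon CloudWatch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Deployment:
    """A successful production deployment (hypothetical record shape)."""
    first_commit_at: datetime
    deployed_at: datetime


@dataclass
class Incident:
    """A change-related production impairment, for example one OpsItem."""
    opened_at: datetime
    recovered_at: datetime


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def four_keys(deployments, incidents, period_days):
    """Return the four DORA metrics for one reporting period."""
    lead_times = [hours(d.deployed_at - d.first_commit_at) for d in deployments]
    recoveries = [hours(i.recovered_at - i.opened_at) for i in incidents]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "lead_time_hours": mean(lead_times) if lead_times else None,
        # Impairments in production divided by successful production deployments.
        "change_failure_rate": len(incidents) / len(deployments) if deployments else None,
        "mean_time_to_recovery_hours": mean(recoveries) if recoveries else None,
    }


# Made-up sample data: four deployments and one incident within one week.
t0 = datetime(2024, 1, 1, 9, 0)
deployments = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=8)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=6)),
    Deployment(t0 + timedelta(days=5), t0 + timedelta(days=5, hours=6)),
]
incidents = [Incident(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=3))]
metrics = four_keys(deployments, incidents, period_days=7)
print(metrics)
```

Note that the change failure rate here counts recorded production impairments rather than failed pipeline runs, which mirrors the choice to use OpsCenter as the system of record.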
Gathering DORA events
Our metric calculation process runs in four steps:
In the tooling account, we send events from CodePipeline to the default event bus of Amazon EventBridge.
Events are forwarded to custom event buses which process them according to the defined metrics and any filters we may have set up.
The custom event buses call AWS Lambda functions which forward metric data to Amazon CloudWatch. CloudWatch gives us an aggregated view of each of the metrics. From Amazon CloudWatch, you can send the metrics to another designated dashboard like Amazon Managed Grafana.
As part of the data collection, the Lambda function will also query GitHub for the relevant commit to calculate the lead time for changes metric. It will query AWS Systems Manager for OpsItem data for change failure rate and mean time to recovery metrics. You can create OpsItems manually as part of your change management process or configure CloudWatch alarms to create OpsItems automatically.
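The forwarding step can be sketched as a small Lambda handler. This is a simplified illustration, not the code shipped with the solution: the namespace `DORA`, the metric name `SuccessfulDeployments`, and the `Pipeline` dimension are assumptions chosen for the example. The event fields (`time`, `detail.pipeline`, `detail.state`) follow the CodePipeline execution state change event, and the returned dictionary matches the arguments of the CloudWatch `PutMetricData` API.

```python
from datetime import datetime


def build_deployment_metric(event, namespace="DORA"):
    """Turn a CodePipeline state-change event into PutMetricData arguments.

    Returns None for events that are not successful pipeline executions.
    The namespace, metric name, and dimension are illustrative assumptions.
    """
    detail = event.get("detail", {})
    if detail.get("state") != "SUCCEEDED":
        return None
    # EventBridge timestamps are ISO 8601 strings ending in "Z" (UTC).
    timestamp = datetime.fromisoformat(event["time"].replace("Z", "+00:00"))
    return {
        "Namespace": namespace,
        "MetricData": [{
            "MetricName": "SuccessfulDeployments",
            "Dimensions": [{"Name": "Pipeline", "Value": detail["pipeline"]}],
            "Timestamp": timestamp,
            "Value": 1,
            "Unit": "Count",
        }],
    }


def handler(event, context):
    """Lambda entry point: publish one data point per successful deployment."""
    params = build_deployment_metric(event)
    if params is not None:
        # Imported lazily so the builder above can be tested without AWS access.
        import boto3
        boto3.client("cloudwatch").put_metric_data(**params)
    return {"published": params is not None}
```

Keeping the payload construction separate from the API call makes the logic easy to unit test locally before deploying.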
Figure 2 visualizes these steps. This setup can be replicated to a group of accounts of one or multiple teams.
Figure 2. DORA metric setup for AWS CodePipeline deployments
Walkthrough
Follow these steps to deploy the solution in your AWS accounts.
Prerequisites
For this walkthrough, you should have the following prerequisites:
AWS accounts for tooling, operations, and application environments
Before you start deploying or working with this code base, there are a few configurations you need to complete in the constants.py file in the cdk/ directory. Open the file in your IDE and update the following constants:
TOOLING_ACCOUNT_ID & TOOLING_ACCOUNT_REGION: These represent the AWS account ID and AWS region for AWS CodePipeline (that is, your tooling account).
OPS_ACCOUNT_ID & OPS_ACCOUNT_REGION: These are for your operations account (used for centralized log aggregation and dashboard).
TOOLING_CROSS_ACCOUNT_LAMBDA_ROLE: The IAM Role for cross-account access that allows AWS Lambda to post metrics from your tooling account to your operations account/Amazon CloudWatch dashboard.
DEFAULT_MAIN_BRANCH: This is the default branch in your code repository that’s used to deploy to your production application environment. It is set to “main” by default, as we assumed feature-driven development (GitFlow) on the main branch; update if you use a different naming convention.
APP_PROD_STAGE_NAME: This is the name of your production stage and set to “DeployPROD” by default. It’s reserved for teams with trunk-based development.
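As an illustration, a filled-in constants.py might look like the sketch below. All values are placeholders. The role value in particular is hypothetical, and whether the project expects a role name or a full ARN is an assumption to verify against the repository. The `validate_account_ids` helper is not part of the code base; it is included only as a quick sanity check.

```python
import re

# Placeholder values only; replace them with your own account details.
TOOLING_ACCOUNT_ID = "111111111111"
TOOLING_ACCOUNT_REGION = "eu-west-1"
OPS_ACCOUNT_ID = "222222222222"
OPS_ACCOUNT_REGION = "eu-west-1"
# Hypothetical value; check the repository for the expected format (name or ARN).
TOOLING_CROSS_ACCOUNT_LAMBDA_ROLE = "dora-cross-account-lambda-role"
DEFAULT_MAIN_BRANCH = "main"
APP_PROD_STAGE_NAME = "DeployPROD"


def validate_account_ids(*account_ids):
    """Return the IDs that are not 12-digit strings, the format of AWS account IDs."""
    return [a for a in account_ids if not re.fullmatch(r"\d{12}", a)]


invalid = validate_account_ids(TOOLING_ACCOUNT_ID, OPS_ACCOUNT_ID)
if invalid:
    raise ValueError(f"Invalid AWS account IDs: {invalid}")
```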
Setting up the environment
To set up your environment on MacOS and Linux:
Create a virtual environment:
$ python3 -m venv .venv
Activate the virtual environment:
$ source .venv/bin/activate
Alternatively, to set up your environment on Windows:
Create a virtual environment:
% python -m venv .venv
Activate the virtual environment:
% .venv\Scripts\activate.bat
Install the required Python packages:
$ pip install -r requirements.txt
To configure the AWS Command Line Interface (AWS CLI):
Configure your user profile (for example, Ops for operations account, Tooling for tooling account). You can check user profile names in the credentials file.
Deploying the CloudFormation stacks
Switch directory
$ cd cdk
Bootstrap CDK
$ cdk bootstrap --profile Ops
Synthesize the AWS CloudFormation template for this project:
$ cdk synth
To deploy a specific stack (see Figure 3 for an overview), specify the stack name and AWS account number(s) in the following command:
To launch the other stacks in the Operations account (including DoraOpsGitHubLogsStack, DoraOpsDeploymentFrequencyStack, DoraOpsLeadTimeForChangeStack, DoraOpsChangeFailureRateStack, DoraOpsMeanTimeToRestoreStack, DoraOpsMetricsDashboardStack):
$ cdk deploy DoraOps* --profile Ops
The following figure shows the resources you’ll launch with each CloudFormation stack. This includes six AWS CloudFormation stacks in the operations account. The first stack sets up log integration for GitHub commit activity. Four stacks each contain a Lambda function which creates one of the DORA metrics. The sixth stack creates the consolidated dashboard in Amazon CloudWatch.
Figure 3. Resources provisioned with this solution
Testing the deployment
To run the provided tests:
$ pytest
Understanding what you’ve built
Deployed resources in tooling account
The DoraToolingEventBridgeStack includes Amazon EventBridge rules with a target of the central event bus in the operations account, plus an AWS IAM role with cross-account access to put events in the operations account. The event pattern for invoking our EventBridge rules listens for deployment state changes in AWS CodePipeline:
{
"detail-type": ["CodePipeline Pipeline Execution State Change"],
"source": ["aws.codepipeline"]
}
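To illustrate how this pattern selects events, here is a deliberately simplified matcher. It only handles top-level string fields with lists of allowed values, which is all the pattern above uses. Real EventBridge patterns also support nested keys and operators such as prefix matching, which this sketch ignores. The function name `matches` and the sample events are assumptions made for the example.

```python
EVENT_PATTERN = {
    "detail-type": ["CodePipeline Pipeline Execution State Change"],
    "source": ["aws.codepipeline"],
}


def matches(pattern, event):
    """Return True when every pattern key is present in the event and the
    event's value is one of the values listed for that key."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


pipeline_event = {
    "source": "aws.codepipeline",
    "detail-type": "CodePipeline Pipeline Execution State Change",
    "detail": {"pipeline": "my-pipeline", "state": "SUCCEEDED"},
}
other_event = {
    "source": "aws.codepipeline",
    "detail-type": "CodePipeline Stage Execution State Change",
    "detail": {"pipeline": "my-pipeline", "stage": "Build", "state": "STARTED"},
}
print(matches(EVENT_PATTERN, pipeline_event), matches(EVENT_PATTERN, other_event))
```

Because the pattern does not filter on state, every pipeline-level state change is forwarded, and any filtering by state would happen downstream.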
Deployed resources in operations account
The Lambda function for Deployment Frequency tracks the number of successful deployments to production, and posts the metric data to Amazon CloudWatch. You can add a dimension with the repository name in Amazon CloudWatch to filter on particular repositories/teams.
The Lambda function for the Lead Time For Change metric calculates the duration from the first commit to successful deployment in production. This covers all factors contributing to lead time for changes, including code reviews, build, test, as well as the deployment itself.
The Lambda function for Change Failure Rate keeps track of the count of successful deployments and the count of system impairment records (OpsItems) in production. It publishes both as metrics to Amazon CloudWatch, and CloudWatch calculates the ratio, as shown in the example below.
The Lambda function for Mean Time To Recovery keeps track of all deployments with status SUCCEEDED in production and whose repository branch name references an existing OpsItem ID. For every matching event, the function gets the creation time of the OpsItem record and posts the duration between OpsItem creation and successful re-deployment to the CloudWatch dashboard.
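The matching logic for Mean Time To Recovery can be sketched as follows. The branch naming convention (`hotfix/<OpsItem ID>`) is a hypothetical example, and the ID format (`oi-` followed by 12 hexadecimal characters) is an assumption about how OpsItem IDs look. In the deployed solution the creation time would come from AWS Systems Manager, whereas here it is passed in directly.

```python
import re
from datetime import datetime, timezone

# Assumed OpsItem ID format: "oi-" followed by 12 hexadecimal characters.
OPS_ITEM_ID = re.compile(r"oi-[0-9a-f]{12}")


def ops_item_id_from_branch(branch_name):
    """Return the OpsItem ID referenced in a branch name, or None if absent."""
    found = OPS_ITEM_ID.search(branch_name)
    return found.group(0) if found else None


def recovery_seconds(ops_item_created_at, deployment_succeeded_at):
    """Duration between OpsItem creation and the successful re-deployment."""
    return (deployment_succeeded_at - ops_item_created_at).total_seconds()


branch = "hotfix/oi-0123456789ab"  # hypothetical naming convention
created = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
redeployed = datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)
if ops_item_id_from_branch(branch):
    print(recovery_seconds(created, redeployed))
```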
All Lambda functions publish metric data to Amazon CloudWatch using the PutMetricData API. The final calculation of the four keys is performed on the CloudWatch dashboard. The solution includes a simple CloudWatch dashboard so you can validate the end-to-end data flow and confirm that it has deployed successfully:
Cleaning up
Remember to delete example resources if you no longer need them to avoid incurring future costs.
Alternatively, go to the CloudFormation console in each AWS account, select the stacks related to DORA and click on Delete. Confirm that the status of all DORA stacks is DELETE_COMPLETE.
Conclusion
DORA metrics provide a popular method to measure the speed and stability of your deployments. The solution in this blog post helps you bootstrap automatic metric collection in your AWS accounts. The four keys help you gain consensus on team performance and provide data points to back improvement suggestions. We recommend using the solution to gain leadership support for systemic issues inhibiting team satisfaction and user experience. To learn more about developer productivity research, we encourage you to also review alternative frameworks including DevEx and SPACE.
Customers often ask us if the Amazon Simple Email Service (SES) inbound capabilities they use with applications hosted on AWS infrastructure can also be used to process and automate employee email hosted on public services like Google Workspace and Microsoft 365. The answer has typically been “yes, but with some limitations”, as until now, SES inbound has been somewhat constrained by the fact that it didn’t support relaying messages for an existing domain. This limitation makes it very difficult to fully manage email flows across hybrid email environments.
Such conversations led the SES team to create Amazon Simple Email Service (SES) Mail Manager, which offers a set of capabilities that simplify managing large volumes of email communications within an organization. Mail Manager’s rule set conditions and actions can optimize routing for improved delivery and communication flow, both for incoming and outgoing emails. Mail Manager’s email security features can be augmented by optional add-ons from industry-leading, vetted third-party providers. Flexible archiving features help organizations meet stringent compliance and record-keeping requirements.
In this blog, we position Mail Manager as a central ingress gateway for a fictitious company, Nutrition.co, that is based on real-world AWS customers. We discuss the customer challenges and explain how to configure Mail Manager’s SMTP Relay action to intercept, archive then deliver emails destined for employees’ Google Workspace hosted Gmail and Microsoft 365 hosted Exchange Online mailboxes. Similar mail flows can be used to process, automate and archive emails destined for their AWS hosted apps.
You can learn more about all of Mail Manager’s capabilities here.
Customer background and use case
Our fictitious company, Nutrition.co, is an online retail business with multiple employee departments, including administration, marketing, sales and fulfillment. The company has acquired several smaller rivals that use both Google Workspace and Microsoft 365 to host their employee inboxes, and plans to consolidate all users onto the same domain (such as [email protected] and [email protected]). They also host several applications on Amazon Web Services (AWS) that use Amazon SES’ inbound capability to receive emails using a subdomain, customer-support.nutrition.co, such as orders@customer-support.nutrition.co and returns@customer-support.nutrition.co.
Nutrition.co is looking for a solution to unify all their email domain routing, security and archiving processes onto one centralized management system to simplify their email infrastructure. They want an approach that provides more flexibility to control which addresses and domains are used for apps and automation as well as employee mail. They also want to enhance email compliance and governance with a flexible solution for screening, processing and archiving inbound emails to both employees and applications, before delivering those emails to recipient inboxes on Google Workspace and Microsoft 365 and applications hosted on AWS.
The SES Mail Manager-based central ingress and egress gateway architecture we propose will allow Nutrition.co to manage their peer-to-peer and application-driven emails in one place, Amazon SES. It will simplify email security and management, and make it easy to unlock new cloud-enabled email use cases. The architecture can be modified to accommodate a wide variety of email infrastructure, including fully cloud hosted, on-premises, and hybrid mailbox hosting environments.
What is an Inbound SMTP Gateway?
An Inbound SMTP Gateway is an SMTP server that accepts inbound email via an Open Ingress Point, and then delivers those messages to another email environment’s inbound SMTP server. In the diagram below, Mail Manager is configured as an inbound SMTP Gateway:
Figure 1: Diagram of the inbound gateway mail flow to a mailbox hosting environment
“Inbound email” refers to email traffic flows where the originator of the message can be either a trusted (for example: the UK division of Nutrition.co) or an untrusted (for example: a Nutrition.co customer or vendor) entity. To send an email, the originating email system looks up the recipient domain’s MX record in the global DNS system to determine the address for the recipient’s inbound mail server. Once a connection is established on port 25, the originating server delivers the email message using the SMTP protocol typically using STARTTLS for transport level encryption. Inbound messages are typically authenticated using the SPF, DKIM, and DMARC industry standard protocols, which help ensure the messages are coming from the legitimate sender’s domain.
An Inbound SMTP gateway can act on messages, for example to process and/or archive, before passing them along to the end recipient’s email server. To learn more about archiving emails in transit, visit this blog.
Configuring Mail Manager as an Inbound SMTP Gateway
Before we can configure Mail Manager as an Inbound Gateway for Nutrition.co’s Google Workspace and Microsoft 365 hosted mailboxes, we need to “allow-list” Mail Manager in Nutrition.co’s Google Workspace and Microsoft 365 settings. Allow-listing in this context refers to configuring the hosted mailbox environments such that Mail Manager is not identified as the source of messages, but rather as an SMTP relay.
This configuration is necessary because the messages being relayed through Mail Manager originate from both trusted and untrusted senders. This mail flow will contain both wanted and, potentially, unwanted messages. Mail Manager is the intermediary, not the source of potentially unwanted email passing through Mail Manager’s Open Ingress Point before being relayed to the destination mailbox environment.
If Mail Manager is not allow-listed, inbound email that is relayed through Mail Manager’s Open Ingress Point will fail SPF checks because the IP addresses of the intermediary server are not authorized by the domain’s SPF policy. Since DMARC requires an aligned pass from either SPF or DKIM, messages from intermediary mail servers will fail the domain’s DMARC policy if they are not signed with a domain-aligned DKIM signature.
Mailbox hosting environments and their anti-spam algorithms rely on SPF, DKIM and DMARC for authenticating different inbound mail flow configurations before making an assessment about the message’s disposition. Properly authenticated messages, if not otherwise identified as unwanted by recipients and their security administrator, are delivered to Inboxes. Messages that are not authenticated are more likely to be treated as spam. Messages from intermediary servers can sometimes be mistaken as spoofed or unwanted messages.
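The interaction between SPF, DKIM and DMARC for relayed mail can be sketched in a few lines. This is a simplified model rather than a standards-complete implementation: alignment is approximated as "same domain or a subdomain relationship", whereas real receivers compare organizational domains using the Public Suffix List. The function names are assumptions made for the example.

```python
def aligned(domain, from_domain):
    """Relaxed alignment, simplified: equal domains or a subdomain relationship."""
    domain, from_domain = domain.lower(), from_domain.lower()
    return (
        domain == from_domain
        or domain.endswith("." + from_domain)
        or from_domain.endswith("." + domain)
    )


def dmarc_passes(from_domain, spf_pass, spf_domain, dkim_pass, dkim_domain):
    """DMARC passes when SPF or DKIM passes AND that result is aligned
    with the domain in the visible From header."""
    spf_ok = spf_pass and aligned(spf_domain, from_domain)
    dkim_ok = dkim_pass and aligned(dkim_domain, from_domain)
    return spf_ok or dkim_ok


# Direct delivery: the sending IP is authorized, so SPF passes.
direct = dmarc_passes("example.com", True, "example.com", False, "")
# Relayed, unsigned: SPF fails on the relay's IP and there is no DKIM signature.
relayed_unsigned = dmarc_passes("example.com", False, "example.com", False, "")
# Relayed, signed: SPF still fails, but the aligned DKIM signature survives relaying.
relayed_signed = dmarc_passes("example.com", False, "example.com", True, "mail.example.com")
print(direct, relayed_unsigned, relayed_signed)
```

The second case is the one that allow-listing addresses: without it, the receiver evaluates SPF against the relay's IP address instead of the original sender's.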
By allow-listing the egress IP addresses of the Mail Manager servers, Nutrition.co’s Google Workspace and Microsoft 365 hosting environments will be able to assess the correct SPF result when receiving inbound email from Mail Manager.
Note: Do not include Mail Manager’s IP addresses in the domain’s SPF policy. These IP addresses are shared by other Mail Manager customers, so including them in the domain’s SPF policy can introduce a security risk.
Note: It is also possible to use DKIM and ARC for allow-listing mail streams, but Gmail and Exchange Online both support IP allow-listing.
Note: Nutrition.co’s Google Workspace and Microsoft 365 hosting environments may still make a spam assessment about the messages under the context that Mail Manager is not the original sender, but this is not common.
Figure 2: Diagram of the SES Mail Manager architecture to accept inbound email via an open Ingress endpoint and configured with a Rule set condition to relay messages with the SMTP Relay action.
In the diagram above, the interaction points are as follows:
1. Email senders look in DNS to discover the MX record for example.com.
2. The value of the domain’s MX record is the A record of the Mail Manager Ingress endpoint. The Ingress endpoint is configured as an ‘open’ Ingress endpoint so that it can receive inbound email without requiring SMTP Auth.
3. The Ingress endpoint traffic policy is configured to allow and deny traffic.
4. The Rule Set conditions determine which messages are to be relayed.
5. The SMTP Relay action relays messages for recipients that are SES verified identities.
Configuring Mail Manager as an Inbound SMTP Gateway
Prerequisites
Access to the administrative console for Nutrition.co’s Google Workspace and Microsoft 365 hosted mailboxes
Access to the DNS zone hosting the MX records for the Nutrition.co’s domains
Step 1: Allow-list the regional Mail Manager IP addresses in Nutrition.co’s Google Workspace and Microsoft 365, and create the Mail Manager relay action(s) in the Amazon SES console.
If you do not configure the allow-list in Nutrition.co’s Google Workspace and Microsoft 365 hosted environments, those mailbox providers may reject the emails relayed from your Mail Manager environment as spam or send them to junk.
Step 1-a: Follow the instructions to allow-list Mail Manager to relay email to Nutrition.co’s Google Workspace and Microsoft 365 environments.
Figure 3: Screenshot of an SMTP Relay rule action configured for Microsoft 365 Exchange Online inbound receiving
Figure 4: Screenshot of an SMTP Relay rule action configured for Google Workspace Gmail inbound receiving
Because Nutrition.co hosts email in both Google Workspace and Microsoft 365, we must create SMTP Relay actions for both.
Step 2: In the SES console, verify Nutrition.co’s email domain, which is nutrition.co
SES needs to prove that Nutrition.co owns the domain of each of the recipient addresses before it will begin relaying inbound email. If you cannot verify ownership of the recipient email destinations, SES will not relay messages.
Follow the instructions to verify Nutrition.co’s SES domain identity for the recipient email addresses within Nutrition.co’s Google Workspace and Microsoft 365 environments. (*note that subdomains such as customer-support.nutrition.co inherit verification from the parent domain*).
Default action: Allow
(Optional) Add Policy statements, depending on your requirements. Choose the action to be taken when the filter conditions are met: Deny
While Nutrition.co does not want to apply additional security via the SMTP Relay gateway, Mail Manager supports both native capabilities and optional add-on subscriptions to 3rd party tools from vetted industry leaders such as Spamhaus and Abusix.
Figure 6: Screenshot of a traffic policy for accepting all email from the internet
Select the SMTP Relay that you created in Step 1-b and enable the **Preserve Mail From** option.
The ‘Preserve Mail From’ setting is necessary so that the mailbox provider can be configured to make the correct assessment of the message’s SPF policy evaluation, assuming that the allow-list configuration in Step 1 is complete.
Add any conditions and exceptions for each rule, depending on your needs.
You may want to create a condition for the SMTP Relay rule so that only messages destined for recipients within your domain are relayed to the appropriate SMTP Relay action, and choose a different action for the recipients who are not hosted in your environment, such as the Archive action.
If you have both Google Workspace and Microsoft 365 configured as SMTP Relay destinations, you may combine the SMTP Relay actions in a single rule if the conditions are the same, or create them as separate rules if the conditions need to be different.
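The effect of such a condition can be illustrated with a small routing function. This is only a conceptual sketch: the names `choose_action` and `RELAYED_DOMAINS`, and the `"relay"` and `"archive"` labels, are assumptions for the example and are not Mail Manager API identifiers. In practice you express the condition in the rule set in the SES console.

```python
# Domains whose mailboxes are hosted in the destination environments (assumed).
RELAYED_DOMAINS = {"nutrition.co"}


def choose_action(recipient, relayed_domains=RELAYED_DOMAINS):
    """Relay messages for recipients in hosted domains; archive everything else."""
    domain = recipient.rsplit("@", 1)[-1].lower()
    return "relay" if domain in relayed_domains else "archive"


print(choose_action("jane@nutrition.co"))
print(choose_action("someone@example.com"))
```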
Figure 7: A Mail Manager rule configured with an SMTP Relay action for Google Workspace and another SMTP Relay action for Microsoft 365
The Mail Manager Ingress point needs to be ‘Open’ for this use case because internet mail senders need to connect to port 25 and send without SMTP authentication for inbound mail flows.
Type: Open
Traffic policy: Choose the traffic policy that you created in step 3-a
Rule set: Choose the rule set that you created in step 3-b
After saving the ingress endpoint settings, you should see something similar to the following in the console.
Figure 8: Screenshot of an ‘open’ Mail Manager Ingress endpoint configured with a rule set and traffic policy
Step 4. Verify your configuration and change your domain’s MX record
Once you have finished configuring Mail Manager with an Inbound Gateway configuration you will have:
An Open ingress point that does not require authentication and has an open traffic policy to allow messages from the internet.
A Rule set with SMTP Relay actions that will relay inbound messages to Google Workspace and/or Microsoft 365.
Step 4-a: Test your configuration
Ingress point: You can test that the Ingress endpoint receives email by using an SMTP capable client application, such as “openssl s_client” from a host that allows for outbound port 25 connections to the A Record of your Open Ingress Point (many ISPs and cloud infrastructure providers block port 25 by default to stop the proliferation of spam on the internet). If you get a “250 OK” response from the SMTP transaction, the Ingress point is configured correctly.
Rule set: You can test your Rule set by sending a message to your Ingress endpoint that has a recipient destination that is both a verified domain, and a domain that is hosted by your mailbox environment. You may want to add the Archive and/or Save to S3 rule actions to occur prior to SMTP Relay. This enables you to view message headers and diagnose issues that may occur during the SMTP relay to the mailbox hosting environments.
Final delivery: You can test the entire mail flow by looking at the received messages in your mailbox hosting environment.
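As an alternative to openssl s_client, the ingress test can be scripted with Python's standard smtplib module. The sketch below is illustrative: the host name, sender, and recipient are placeholders you must replace, and it assumes the endpoint offers STARTTLS on port 25. The host running it must allow outbound connections on port 25.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Create a small plain-text test message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Mail Manager ingress test"
    msg.set_content("Test message for the open ingress endpoint.")
    return msg


def send_test_message(ingress_host, msg, timeout=10):
    """Connect on port 25, upgrade with STARTTLS, and send without SMTP auth.

    Returns the dict of refused recipients (empty when all were accepted).
    Raises an smtplib exception if the server rejects the transaction.
    """
    with smtplib.SMTP(ingress_host, 25, timeout=timeout) as smtp:
        smtp.starttls()
        return smtp.send_message(msg)


message = build_test_message("sender@example.com", "jane@nutrition.co")
# Placeholder host: use the A record of your own Open Ingress Point.
# refused = send_test_message("ingress.example.com", message)
```

An empty return value from `send_test_message` means the endpoint accepted the message for every recipient, which corresponds to the “250 OK” response described above.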
How to look at received messages in a mailbox hosting environment
Google Workspace – From within the Gmail interface, find the message and open the message menu options.
Choose “Show original”.
(The screenshot above shows the Gmail ‘Show original’ message headers. The Mail From address (also appears as the Return-path header, and envelope-from value in other headers) is preserved within the @gmail.com domain, and Gmail’s assessment of SPF correctly attributed the message as originating from 209.85.216.51 even though the message was relayed through 206.55.129.47. Since the 209.x.x.x address is in the SPF policy for gmail.com, the message passes SPF due to the allow-list configuration.)
Microsoft 365 – From within the Outlook on the Web interface, find the message and open the message menu options.
Choose “View message details”. You will see the message headers similar to the Gmail example above.
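If you save the original message source to a file, the same headers can be inspected programmatically. The following sketch uses Python's standard email module. The sample message, host names, and addresses are made up for illustration, and the parsing of the Authentication-Results header is intentionally simplistic.

```python
from email import message_from_string

# Made-up sample message source with the headers discussed above.
RAW = (
    "Return-Path: <sender@example.com>\n"
    "Authentication-Results: mx.example.net; spf=pass smtp.mailfrom=sender@example.com\n"
    "From: Sender <sender@example.com>\n"
    "To: jane@nutrition.co\n"
    "Subject: Hello\n"
    "\n"
    "Body text.\n"
)


def summarize_headers(raw_message):
    """Extract the preserved Mail From (Return-Path) and the SPF verdict."""
    msg = message_from_string(raw_message)
    spf = None
    for part in msg.get("Authentication-Results", "").split(";"):
        part = part.strip()
        if part.startswith("spf="):
            # "spf=pass smtp.mailfrom=..." -> "pass"
            spf = part.split()[0].split("=", 1)[1]
    return {"return_path": msg.get("Return-Path"), "spf": spf}


summary = summarize_headers(RAW)
print(summary)
```

A preserved Return-Path together with an SPF pass indicates that the ‘Preserve Mail From’ option and the allow-list configuration are both working.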
Step 4-b: Change the MX record for your domain.
Note: We recommend using a new subdomain so that you can test this mail flow configuration for a period of time prior to changing the MX record for the primary domain that is actively being used by end users and applications.
Once you have finished testing, you can change the MX record for the domain. The value of the MX record should be the **A Record** of the Open Ingress point along with the priority value.
Figure 13: A screenshot of an MX record configured in Amazon Route 53
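If you manage DNS in Amazon Route 53 and prefer to script the change, the MX update can be expressed as a change batch for the `ChangeResourceRecordSets` API. The sketch below only builds the request body. The domain, ingress host name, priority, and TTL are placeholder assumptions, and the helper name `mx_change_batch` is hypothetical.

```python
def mx_change_batch(domain, ingress_a_record, priority=10, ttl=300):
    """Build the ChangeBatch argument for Route 53's ChangeResourceRecordSets API."""
    return {
        "Comment": "Point inbound mail at the Mail Manager ingress endpoint",
        "Changes": [{
            "Action": "UPSERT",  # create the record, or replace it if it exists
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "MX",
                "TTL": ttl,
                # An MX value is the priority followed by the mail server name.
                "ResourceRecords": [{"Value": f"{priority} {ingress_a_record}"}],
            },
        }],
    }


# Placeholder values: a test subdomain and a stand-in for your ingress A record.
batch = mx_change_batch("test.nutrition.co", "ingress.example.com")
print(batch)
```

With boto3 installed and credentials configured, the batch could then be submitted with `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, where the hosted zone ID is your own. Using a test subdomain first, as recommended above, keeps production mail flow unaffected.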
Conclusion
In this blog post, we’ve explored how to leverage SES Mail Manager’s SMTP Relay action to simplify the handling of inbound email for organizations that use a mix of email hosting environments, specifically Google Workspace and Microsoft 365. By configuring Mail Manager as an inbound SMTP gateway, our fictitious customer, Nutrition.co, was able to centralize the management of their email flows, enhance security through features like traffic policies and rule sets, and ensure compliance through flexible archiving.
The key steps involved setting up allow-listing in the Google Workspace and Microsoft 365 environments, creating SMTP relay configurations in Mail Manager, and updating the MX record of Nutrition.co’s domain to point to the Mail Manager ingress endpoint. This allowed Nutrition.co to seamlessly route inbound emails destined for both their cloud-hosted employee mailboxes and their AWS-hosted applications, processing and archiving the messages before final delivery.
The flexibility of Mail Manager’s SMTP Relay action makes it a powerful tool for organizations looking to unify their email infrastructure, especially in hybrid environments. By acting as a centralized ingress and egress gateway, Mail Manager can help streamline email management, improve security, and unlock new cloud-enabled email use cases. As email continues to be a critical communication channel, solutions like Mail Manager will become increasingly important for businesses looking to maximize the value of their email ecosystem.
Please visit AWS re:Post to ask and find answers to questions about SES Mail Manager. Talk with your AWS account team if you are interested in exploring Mail Manager in more depth.
Jesse Thompson is an Email Deliverability Manager with the Amazon Simple Email Service team. His background is in enterprise development and operations, with a focus on email abuse mitigation and encouragement of authenticity practices with open standard protocols. Jesse’s favorite activity outside of technology is recreational curling.
Alexey Kurbatsky
Alexey is a Senior Software Development Engineer at AWS, specializing in building distributed and scalable services. Outside of work, he enjoys exploring nature through hiking as well as playing guitar.
Zip
Zip is a Sr. Specialist Solutions Architect at AWS, working with Amazon Pinpoint, Simple Email Service, and WorkMail. Outside of work he enjoys time with his family, cooking, mountain biking, boating, learning and beach plogging.
When designing Amazon Simple Email Service’s (SES) Mail Manager, we often heard from customers about the “PST-file problem” inherent with user-side mailbox-based archiving. This occurs when, for a variety of reasons, end users decide to archive their emails to local PST files or other local storage. These PST files are fragile and easily corrupted. Furthermore, they are subject to the backup practices of individual workstations. Lastly, PST files are readily portable and can be easily copied and moved outside the visibility of the email system and your IT and IP controls.
We developed Amazon Simple Email Service (SES) Mail Manager archiving features in response to this problem, and based on additional customer feedback: the need for consistent email retention behaviors, for all email. Customers also wanted the flexibility to determine which messages to archive, where to put them, and how long to retain those messages.
To make the feature applicable to the widest set of use cases, we designed Mail Manager to be able to archive any email traversing the SES service, not just those that have already been delivered to a user’s mailbox. This added flexibility ensures organizations can maintain a complete record of exactly those email communications they wish to preserve. Rather than require external tools to search and export Mail Manager’s archives, we built these functions directly into the SES console.
In fact, the entire Mail Manager archiving solution is fully managed by SES within the customer’s Mail Manager account, reducing the operational overhead traditionally associated with email archiving and compliance.
Figure 1 – Mail Manager Archiving
At the core of the SES Mail Manager archiving solution is the ability to capture and retain any message, regardless of its source or destination, as it flows through the service’s rules engine. This design approach ensures that every email message traversing Mail Manager can be subject to archiving and retention policies, rather than requiring organizations to manage different systems and tools for mail flowing through mail servers, internal relays and other email infrastructure. The result is a unified, comprehensive compliance solution that provides visibility and control over an organization’s email archiving.
Archiving on its own isn’t an innovation; it’s an email primitive – an essential capability that can be used to enable other, more complex solutions. Historically, retention of email was configured as a function of your on-premises mail server, where your mailboxes themselves were resident. Personally-authored emails were considered the high-value material to retain, and adding archiving as a function of mailbox configurations was the simplest approach.
In practice, we find that the mail captured at the mailbox server, or end user’s inbox, represents only a fraction of the mail a typical enterprise generates. As organizations grow, the number of applications generating Application To Person (A2P) messages tends to increase dramatically. Similarly, as corporate environments become more complex, SaaS-based solutions that are external to the primary email infrastructure often use email to update employees along with workflow-management systems. Much of that mail eludes archiving as it bypasses individual user mailboxes.
The SES strategy for archiving is to capture mail from anywhere, to anywhere, as long as it transits an ingress endpoint as part of your Mail Manager configuration. You have two choices. You can write those messages directly to an S3 bucket you control, and then ingest them into any other tool you like. Alternatively, you can send messages into a managed archive within Mail Manager, and gain access to search, export, and configurable retention features. By default, SES configures retention for 6 months, but it’s adjustable up to permanent retention for customers who require it.
Mail Manager’s archiving feature captures any message which matches your rule, or all messages traversing any ingress endpoint. You can choose to write all messages to or from your senior leadership team into one archive, or you can organize by other envelope metadata. The rules operate the same way whether the message is A2P or Person to Person (P2P), ensuring uniform policies and retention options.
With Mail Manager’s managed archives, you pay for each gigabyte ingested, indexed, and available for search, and a separate storage fee for each gigabyte retained every month. Note that the storage fee includes both the raw content of the messages, and the size of the computed index required for search and export functions.
For messages you write to your S3 buckets, you also have the option to invoke an S3 trigger that calls an AWS Lambda function to drive various automation workflows. Regulated industries might want to write all messages to S3 to take advantage of the Amazon S3 Glacier storage classes for very long-term storage.
You can even split your workload between Mail Manager’s managed archive, for emails you are likely to need readily discoverable, and the Write to S3 option, for content which you don’t expect to ever need to search with granularity, but still needs to be archived to “check the box” for your retention policy. In fact, AWS encourages such a builder-oriented approach, because it rewards thoughtful decisions and resource utilization, and conforms to the broad goal of consumption-based pricing, which Mail Manager embraces fully at every step.
Figure 2 – Rule Set with conditions for archiving
Mail Manager provides a more comprehensive, resilient archiving approach that increases both the overall scope of mail that can be captured, and the fidelity of the archived data. You don’t need any special adapters or plugins to capture mail from any source. All email that comes through your Mail Manager Ingress Endpoint can be archived.
Figure 3 – Create archive
Why not try Mail Manager today and experience the benefits of a centralized, scalable email archiving solution? You’ll pay only for the data you ingest and retain each month, without the fragility and visibility issues of user-managed archives. Visit the SES website to start your free trial of Mail Manager and take control of your organization’s critical email records. To start with Mail Manager, visit https://aws.amazon.com/ses/, click on Mail Manager, and set up your first workload today.
If you have any questions or need further guidance, feel free to reach out to us via the SES Forums or in the comments section of this blog post. We’re here to help you navigate the evolving email landscape and unlock the full potential of your Amazon SES investment.
About the Authors
Toby Weir-Jones
Toby is a Principal Product Manager for Amazon SES and WorkMail. He joined AWS in January 2021 and has significant experience in both business and consumer information security products and services. His focus on email solutions at SES is all about tackling a product that everyone uses and finding ways to bring innovation and improved performance to one of the most ubiquitous IT tools.
Brett Ezell
Brett is an Amazon Pinpoint and Amazon Simple Email Service Specialist Solutions Architect at AWS. As a Navy veteran, he joined AWS in 2020 through an AWS technical military apprenticeship program. When he isn’t deep diving into solutions for customer challenges, Brett spends his time collecting vinyl, attending live music, and training at the gym. An admitted comic book nerd, he feeds his addiction every Wednesday by combing through his local shop for new books.
Zip
Zip is a Sr. Specialist Solutions Architect at AWS, working with Amazon Pinpoint and Simple Email Service and WorkMail. Outside of work he enjoys time with his family, cooking, mountain biking, boating, learning and beach plogging.
In this blog post, you’ll learn how to use a new feature in AWS CodeDeploy to deploy your application one Availability Zone (AZ) at a time to help increase the operational resilience of your services through improved fault isolation.
Introducing change to a system can be a time of risk. Even the most advanced CI/CD systems with comprehensive testing and phased deployments can still promote a bad change into production. A common approach to reduce this risk is using fractional deployments and closely monitoring critical metrics like availability and latency to gauge a deployment’s success. If the deployment shows signs of failure, the CI/CD system initiates an
This blog post will show you how to leverage CodeDeploy zonal deployments as part of a holistic AZ independent (AZI) architecture strategy, both patterns that many AWS service teams follow. With this feature, you no longer need to distinguish between infrastructure or deployment failures in order to respond to the event. You can use the same observability tools and recovery techniques for both, which allows you to contain the scope of impact to a single AZ and mitigate the impact more quickly and with less complexity. First, let’s define what an AZI architecture is so we can understand how this feature in CodeDeploy supports it.
Availability Zone independence
Fault isolation is an architectural pattern that limits the scope of impact of failures by creating independent fault containers that don’t share fate. It also allows you to quickly recover from failures by shifting traffic or resources away from the impaired fault container. AWS provides a number of different fault isolation boundaries, but the ones most people are familiar with are AZs and Regions. When you build multi-AZ architectures, you can design your application to implement AZI that uses the fault boundary provided by AZs to keep the interaction of resources isolated to their respective AZ (to the greatest extent possible).
An Availability Zone independent (AZI) architecture implemented by disabling cross-zone load balancing and using an independent database read replica per AZ. Only database writes have to cross an AZ boundary.
The result is that the impacts from an impairment in one AZ don’t cascade to resources in other AZs, making the operation of your application in one AZ independent from events in the others. You should also monitor the health of each AZ independently, for example by looking at per-AZ load balancer HTTPCode_Target_5XX_Count metrics, or by sending synthetic requests to the load balancer nodes in each AZ and recording availability and latency metrics for each. When an event occurs that impacts your availability or latency in a single AZ, you can use capabilities like Amazon Route 53 Application Recovery Controller zonal shift to shift traffic away from that AZ to quickly mitigate impact, often in single-digit minutes.
Using zonal shift to move traffic away from the AZ experiencing a service impairment
Traditional deployment strategy challenges
During an event, SRE, engineering, or operations teams can spend a lot of time trying to figure out if the source of impact is an infrastructure problem or related to a failed deployment. Then, based on the identified cause, they may take different mitigation actions. Thus, precious time is spent investigating the source of impact and deciding on the appropriate mitigation plan.
When the cause is due to a failed deployment, traditionally rollbacks are used to mitigate the problem. But rollbacks, even when automated, take time to complete. For example, let’s say your deployment batches take 5 minutes to complete, you deploy in 10% batches, and you’re halfway through a deployment to 100 instances when the rollback is initiated. This means it’s going to take at least 25 minutes to finish the rollback (5 batches, each taking 5 minutes to re-deploy to). And it’s entirely possible during that time that instances where the new software was deployed continue to pass health checks, but result in errors being returned to your customers. In the worst case, if all instances had been deployed to, this event could last for almost an hour with customers being impacted during the entire rollback process. In some cases, deployments can’t be rolled back and have to be rolled forward, meaning a new, updated version needs to be deployed to fix the previous deployment. Writing the code for the new deployment and testing it adds to the recovery time of your system and can be error prone.
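The rollback arithmetic above can be sketched in a few lines. This is only a rough model: it assumes batches are rolled back one at a time and that each takes the same time as a forward deployment batch. The class and method names are illustrative, not part of CodeDeploy.

```java
/** Rough rollback-time estimate for a batched deployment (illustrative sketch only). */
final class RollbackEstimate {

    /**
     * Minutes needed to re-deploy every batch that already received the new version,
     * assuming sequential batches that each take minutesPerBatch.
     */
    static int rollbackMinutes(int batchPercent, int percentDeployed, int minutesPerBatch) {
        // Number of batches already deployed, rounded up to whole batches.
        int batchesToRollBack = (percentDeployed + batchPercent - 1) / batchPercent;
        return batchesToRollBack * minutesPerBatch;
    }

    public static void main(String[] args) {
        // Halfway through a 10%-batch deployment, 5 minutes per batch.
        System.out.println(rollbackMinutes(10, 50, 5));  // prints 25
        // Worst case: every instance already received the new version.
        System.out.println(rollbackMinutes(10, 100, 5)); // prints 50
    }
}
```

The second call corresponds to the "almost an hour" worst case described above.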
Additionally, if your unit of deployment includes multiple AZs, then your potential scope of impact from a failed deployment isn’t predictably bounded. For example, if your CodeDeploy deployment groups target Amazon Elastic Compute Cloud (Amazon EC2) instances based on tags or an Amazon EC2 Auto Scaling group that spans multiple AZs, then you could see impact across the whole Region, even if you’re using fractional deployments. There’s not a smaller fault container that helps consistently limit the scope of impact to a predictable size.
Let’s look at how we can overcome these two challenges by using zonal deployments with CodeDeploy.
Zonal deployments with AWS CodeDeploy
One of the best practices we follow at AWS, described in My CI/CD pipeline is my release captain, is performing fractional deployments aligned to intended fault isolation boundaries, like individual hosts, cells, AZs, and Regions. When we release a change, the deployment is separated into waves, which represent fault containers (like Regions) that are deployed to in parallel, and those are further separated into stages. Within a single Region, the deployment starts with a one-box environment, representing a single host, then moves on to fractional batches (like 10% at a time) inside a single AZ, waits for a period of bake time, moves on to the next AZ, and so on until we’ve completed rolling out the change.
Deployment stages aligned to intended fault isolation boundaries within a single deployment wave for one Region
By aligning each stage to an expected fault isolation boundary, we create well-defined fault containers that provide an understood and bounded scope of impact in the case that something goes wrong with a deployment. You can take advantage of this same deployment strategy in your own applications by using zonal deployments in CodeDeploy. To utilize this capability, you need to define a custom deployment configuration shown below.
Creating a zonal deployment configuration that deploys to 10% of the EC2 instances in each AZ at a time, one AZ at a time
This configuration defines a few important properties. First, it enables the zonal configuration, which ensures deployments are phased one AZ at a time. In this case, updates are deployed to batches of 10% of the instances in each AZ (see the minimum number of healthy instances per Availability Zone for more details on configuring this setting). Second, it defines a monitor duration, which is the bake time during which the effects of the changes are observed before moving on to the next AZ. This gives the new software enough use to surface potential bugs or problems before moving on. The value in this example is 900 seconds, or 15 minutes. You should ensure this value is longer than the time it takes for your alarms to trigger. For example, if you are using an M of N alarm for availability and/or latency that uses 3 data points out of 5 with 1-minute intervals, the alarm can take up to 5 minutes to evaluate, so you need to make sure your bake time is greater than 300 seconds. Otherwise, you might move on to the next AZ before your alarm has a chance to mark the deployment as unsuccessful. Finally, I’ve also defined a first zone monitor duration. This overrides the “monitor duration” for the first AZ being deployed to. This is useful because the first AZ acts as our canary or one-box environment, and we may want to wait additional time to be confident the deployment is successful before moving on to the second AZ.
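As a quick sanity check on the bake time, the relationship between the monitor duration and an M of N alarm can be sketched as follows. It treats the full N-period evaluation window as the floor for the bake time. Any extra margin for metric publication delay is something you would add yourself, and the names here are hypothetical.

```java
/** Checks that a bake time outlasts an M-of-N alarm's evaluation window (sketch only). */
final class BakeTimeCheck {

    /** Seconds spanned by the N evaluation periods the alarm looks at. */
    static int alarmWindowSeconds(int evaluationPeriods, int periodSeconds) {
        return evaluationPeriods * periodSeconds;
    }

    /** True when the monitor duration is strictly longer than the alarm window. */
    static boolean bakeTimeIsSufficient(int monitorDurationSeconds,
                                        int evaluationPeriods,
                                        int periodSeconds) {
        return monitorDurationSeconds > alarmWindowSeconds(evaluationPeriods, periodSeconds);
    }

    public static void main(String[] args) {
        // 900-second monitor duration against a 5-period alarm with 1-minute periods.
        System.out.println(bakeTimeIsSufficient(900, 5, 60)); // prints true
        // A 4-minute bake time would end before the alarm window does.
        System.out.println(bakeTimeIsSufficient(240, 5, 60)); // prints false
    }
}
```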
If your service is deployed behind a load balancer with cross-zone load balancing disabled (which is important to achieve AZI), carefully consider your batch size. The load balancer evenly splits traffic across AZs regardless of how many healthy hosts are in each AZ. Ensure your batch size is small enough that the temporary reduction in capacity during each batch doesn’t overwhelm the remaining instances in the same AZ. You can use the CodeDeploy minimum healthy hosts per AZ option to ensure there are enough healthy hosts in the AZ during a deployment batch, or the Elastic Load Balancing (ELB) target group minimum healthy target count with DNS failover to shift traffic away from the AZ if too few targets are present.
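To get a feel for the batch-size trade-off, here is a minimal sketch of how much extra load each remaining instance in an AZ absorbs while one batch is out of service. It assumes the AZ keeps receiving the same share of traffic (cross-zone load balancing disabled) and that batch sizes round up to whole instances. Both are simplifying assumptions, and the names are illustrative.

```java
/** Estimates per-instance load growth in one AZ during a deployment batch (sketch only). */
final class ZonalBatchLoad {

    /**
     * Factor by which load on each remaining instance grows while a batch is out of service.
     * Assumes the AZ's traffic share stays constant during the batch.
     */
    static double loadMultiplier(int instancesInAz, int batchPercent) {
        // Assumption: batch size rounds up to a whole number of instances.
        int outOfService = (int) Math.ceil(instancesInAz * batchPercent / 100.0);
        int remaining = instancesInAz - outOfService;
        if (remaining <= 0) {
            throw new IllegalArgumentException("Batch would take the whole AZ out of service");
        }
        return (double) instancesInAz / remaining;
    }

    public static void main(String[] args) {
        // 10 instances in the AZ with 10% batches: 9 instances share the AZ's traffic.
        System.out.printf("%.2f%n", loadMultiplier(10, 10));
        // 50% batches double the load on the remaining instances.
        System.out.printf("%.2f%n", loadMultiplier(10, 50));
    }
}
```

If the multiplier exceeds the headroom you normally run with, a smaller batch size or a higher minimum healthy hosts per AZ value is the safer choice.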
Recovering from a failed zonal deployment
When a failure occurs, the highest priority is mitigating the impact, not fixing the root cause. While an automated rollback can help achieve both for a failed deployment, using a zonal shift can improve your recovery time. Let’s take a simple dashboard like the following figure. The top graph shows your availability as perceived by customers through the Regional endpoint of your load balancer, such as https://load-balancer-name-and-id.elb.us-east-1.amazonaws.com. The graphs below it show the measured availability from Amazon CloudWatch Synthetics canaries that test the load balancer endpoints in each AZ using endpoints like https://us-east-1a.load-balancer-name-and-id.elb.us-east-1.amazonaws.com.
Dashboard showing impact in one AZ that affects the availability of the service
We can see that something starts impacting resources in AZ1 at 10:38, causing an availability drop. As we would expect, this impact is also seen by customers, shown in the top graph, but it’s unclear what the underlying cause of the availability drop is. Using the approach described in this post, it turns out that it doesn’t matter. Within a few minutes, at 10:41, the CloudWatch composite alarm monitoring the health of AZ1 transitions to the ALARM state and invokes a Lambda function that reads the alarm’s definition to get the AZ ID and ALB ARN involved, and initiates the zonal shift. It’s important that the alarm logic only reacts when a single AZ is impacted. If there were impact in more than one AZ, we would need to treat this as a Regional issue.
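The "single AZ only" rule can be illustrated with a small decision function. This is a sketch of the logic, not the Lambda function used in the post. The per-AZ alarm states are passed in as a plain map, the AZ IDs are example values, and the actual call that starts the zonal shift is deliberately left out.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

/** Decides whether a zonal shift is appropriate, given per-AZ alarm states (sketch only). */
final class ZonalShiftDecision {

    /**
     * Returns the AZ to shift traffic away from only when exactly one AZ is in alarm.
     * Zero impaired AZs means nothing to do; more than one suggests a Regional issue.
     */
    static Optional<String> azToShiftAwayFrom(Map<String, Boolean> azInAlarm) {
        List<String> impaired = azInAlarm.entrySet().stream()
                .filter(Map.Entry::getValue)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        return impaired.size() == 1 ? Optional.of(impaired.get(0)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Example AZ IDs; real values would come from the alarm definitions.
        Map<String, Boolean> states = Map.of(
                "use1-az1", true,
                "use1-az2", false,
                "use1-az3", false);
        System.out.println(azToShiftAwayFrom(states)); // prints Optional[use1-az1]
    }
}
```

Returning an empty result for multi-AZ impact keeps the automation from shifting traffic when the problem is not isolated to one fault container.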
After a failed deployment to AZ1, an automatically initiated zonal shift quickly mitigates the customer impact
Then, after a few more minutes, at 10:44, we can see availability from the customer perspective has gone back up to 100% by shifting traffic away from AZ1.
The impact of the failed deployment is mitigated by shifting traffic away from AZ1
It turns out the cause of impact in this case was a failed deployment, and we can see that our synthetic canaries still see the failure while the deployment is rolling back, but we’ve achieved our primary goal of quickly removing the impact to the customer experience. From the start of impact to mitigation, 6 minutes elapsed, which was significantly faster than waiting for the deployment to roll back completely. After the rollback is complete, at 10:58, 20 minutes after the start of the event, we can see the alarm transition back to the OK state and availability return to normal in AZ1, meaning we can end the zonal shift and return to normal operation.
After the deployment rollback is complete, the availability in AZ1 recovers and the zonal shift can be ended
Conclusion
Performing zonal deployments helps improve the effectiveness of AZI architectures. Aligning your deployments to your intended fault isolation boundaries, in this case AZs, creates a predictable scope of impact and helps prevent cascading failures. This in turn allows you to use a common set of observability and mitigation tools for both single-AZ infrastructure events and failed deployments, which can mitigate the impact faster than automated rollbacks. Additionally, removing the ambiguity operators face when selecting a recovery strategy further reduces recovery time and complexity. Learn more about zonal deployments in AWS CodeDeploy here.
When Amazon Web Services (AWS) launched Amazon Q Developer agent for code transformation as a preview last year to upgrade Java applications, we saw many organizations eager to significantly accelerate their Java upgrades. Previously, these upgrades were considered a daunting, time-consuming manual task requiring weeks if not months of effort. With Amazon Q Developer, organizations could significantly reduce that burden. Companies such as Toyota, Novacamp, Pragma and Persistent saw productivity gains not only from reducing the time the upgrades take, but also from redirecting the time saved to other business priorities in the software development lifecycle (SDLC). In addition, a small team of AWS developers upgraded over 1000 carefully chosen applications from multiple independent services. These applications used fewer than 10 dependencies and required minimal mandatory changes to the application code for the upgrade. While we saw a high degree of upgrade success for simpler applications, we also heard from customers who wanted even more capabilities in the Amazon Q Developer agent for code transformation. They expected the agent to upgrade their libraries to the latest major versions, replace deprecated API calls, and provide more explainability of changes made.
We added these and more capabilities to the agent at General Availability (GA), including clearer explanations of the code changes it makes. Today, we’re going into detail on three categories of what Amazon Q Developer can do for your Java upgrades: major version upgrades of popular frameworks, direct replacement of deprecated API calls on your behalf, and AI technology powered by Amazon Bedrock that can, for example, correct more of the compilation errors encountered when attempting the build in Java 17.
This blog post dives deeper into these three improvements to the product experience through an example application. You can download the application from this GitHub repository.
About the application
This is a Java 1.8 based microservice application that displays a list of free movies for the month, based on configuration stored in AWS AppConfig and retrieved using the AWS SDK. This application was first open sourced in 2020 and uses legacy versions of libraries such as Spring Boot 2.x, Log4j 2.13.x, Mockito 1.x, Javax and JUnit 4. You can download the sample project to try it for yourself.
(1) Popular framework upgrades
While Spring Boot version 2.7 is compatible with Java 17, Amazon Q Developer agent for code transformation can bring your applications up to version 3.2. This has been helpful because it can be time consuming to correct all the changes in annotations and code implementation that are no longer compatible when going into the new major version upgrade. Moreover, this version of Spring Boot provides improved observability, performance improvements, modernized security, and overall enhanced development experience. Let’s dive into some examples of where you see Amazon Q Developer accelerate some of this heavy lifting during the upgrade process.
When working with file uploads in Spring Boot v2, you would typically use the @RequestParam("file") annotation to bind the uploaded file to a method parameter. However, in Spring Boot v3.2, this approach has been updated to better align with the Jakarta EE specifications. Instead of using a plain String parameter, you’ll need to use the MultipartFile class from the org.springframework.web.multipart package. Here is an example before and after code transformation with Amazon Q Developer:
Java 8 code with Spring v2
@RequestMapping(value = "/movies/{movieId}/edit", method = POST)
public String processUpdateMovie(@Valid Movie movie, BindingResult result,
        @PathVariable("movieId") int movieId) {
    ...
}
Java 17 code upgraded by Amazon Q Developer with Spring v3
@PostMapping("/movies/{movieId}/edit")
public String processUpdateMovie(@Valid Movie movie, BindingResult result,
        @PathVariable int movieId) {
    ...
}
The following lists some common libraries we’ve seen customers use, with the corresponding compatible version in Java 8 and the version Q Developer can upgrade to. This list isn’t comprehensive; the intent is to show some common ones. It is up to date as of the time of this publication and could change in the future.
Let’s review a few examples from the Sample project.
Java 8 code with JUnit 4
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;
import com.amazonaws.samples.appconfig.movies.MoviesController;

@RunWith(SpringRunner.class)
@SpringBootTest
public class MovieTest {
    ...
}

@Before
public void setUp() {
    MockitoAnnotations.initMocks(this);
    moviesController = new MoviesController();
    moviesController.env = env;
}
One of the biggest challenges organizations face when upgrading to a newer version of Java is dealing with deprecated APIs. As new language features and libraries are introduced, older APIs are often marked as deprecated and eventually removed in subsequent releases. This can lead to compilation errors and compatibility issues, requiring developers to manually identify and replace these deprecated APIs throughout their codebase – a time-consuming and error-prone process.
In our sample application, you will see a variety of deprecated APIs that are addressed by Q Developer as an output of the transformation to Java 17. Here are some examples:
Java versions before 15 required explicit line terminators, string concatenations, and delimiters to embed multi-line code snippets. Java 15 introduced text blocks to simplify embedding code snippets and text sequences, which is particularly useful for literals like HTML, JSON, and SQL. Text blocks are an alternative string representation that can replace traditional double-quoted string literals, allowing multi-line strings without explicit line terminators or concatenations. Amazon Q Developer agent for code transformation migrates complex, hard-to-read multi-line strings to text blocks.
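To make the difference concrete, here is a small hand-written comparison. It is not output from Amazon Q Developer, and the JSON content is made up for illustration. Both methods return the same string: the first uses concatenation, and the second uses a text block.

```java
/** Illustrative comparison of a concatenated string and a Java 15+ text block. */
final class TextBlockExample {

    // Pre-Java 15 style: explicit line terminators, escapes, and concatenation.
    static String concatenated() {
        return "{\n"
                + "  \"title\": \"Example Movie\",\n"
                + "  \"free\": true\n"
                + "}\n";
    }

    // Text block: the same content, written the way it should appear.
    // Indentation shared by all lines, including the closing delimiter, is stripped.
    static String textBlock() {
        return """
            {
              "title": "Example Movie",
              "free": true
            }
            """;
    }

    public static void main(String[] args) {
        // Both methods produce identical strings.
        System.out.println(concatenated().equals(textBlock())); // prints true
    }
}
```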
Beyond the examples above from the sample app, Q Developer can address deprecated APIs across various domains, including primitive type constructors and conversions, character and string utilities, date and time handling, mathematical operations, networking and sockets, security and cryptography, concurrent programming and atomics, reflection and bytecode generation, as well as a significant number of deprecated methods related to Swing and AWT components. Whether it’s replacing outdated methods for handling dates, encoding URLs, or working with BigDecimal arithmetic, it can automatically update your code to use their modern equivalents. It also addresses deprecations in areas like multicast sockets, atomic variables, and even bytecode generation using ASM.
With the Q Developer agent for code transformation, we’ve made it easier than ever to handle deprecated APIs during your Java 17 migration. Q Developer is capable of automatically detecting and replacing deprecated API calls with their modern equivalents, saving you countless hours of manual effort and reducing the risk of introducing bugs or regressions.
(3) Unprecedented AI technology
Even after addressing deprecated APIs and framework upgrades, the process of migrating to a new Java version can still encounter compilation errors or unexpected behavior. These issues can arise from subtle changes in language semantics, incompatibilities between libraries, or other factors that are difficult to anticipate and diagnose.
To tackle this challenge, the Q Developer agent for code transformation leverages our self-debugging technology powered by Amazon Bedrock, which analyzes the context of compilation errors and is capable of implementing targeted code modifications to resolve them. Here are some examples.
For example, Amazon Q changed a dependency from the javax.security package to the java.security package and fixed the resulting compilation errors related to X509Certificate. It modified the code to obtain the X509Certificate from CertificateFactory instead of calling X509Certificate.getInstance directly.
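The shape of that change can be sketched with the standard java.security.cert API. This is a hand-written illustration rather than the agent's actual output. The class and method names are hypothetical, and the input in main is deliberately invalid to show the failure path.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.cert.CertificateException;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;

/** Parses an X.509 certificate through CertificateFactory (illustrative sketch). */
final class CertificateParsing {

    /**
     * Replaces the deprecated javax.security.cert.X509Certificate.getInstance(bytes) pattern:
     * the factory produces a java.security.cert.X509Certificate from DER or PEM bytes.
     */
    static X509Certificate parse(byte[] encoded) throws CertificateException {
        CertificateFactory factory = CertificateFactory.getInstance("X.509");
        return (X509Certificate) factory.generateCertificate(new ByteArrayInputStream(encoded));
    }

    public static void main(String[] args) {
        try {
            // Not a real certificate, so parsing is expected to fail.
            parse("not a certificate".getBytes(StandardCharsets.US_ASCII));
        } catch (CertificateException e) {
            System.out.println("Rejected invalid input: " + e.getMessage());
        }
    }
}
```

In real code, the bytes would come from a certificate file or a TLS session rather than a string literal.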
With this AI technology, Q Developer can automatically correct a wider range of issues that would otherwise require manual intervention, further streamlining the upgrade process and reducing the risk of costly delays or regressions.
Conclusion
Throughout this blog post, we explored the three major areas of improvement in the Amazon Q Developer agent for code transformation since its general availability: major version upgrades of popular frameworks, direct replacement of deprecated library calls, and leveraging our AI technology using Amazon Bedrock capabilities to correct compilation errors during Java 17 migration. By addressing these critical aspects, the Q Developer agent has become an even more powerful tool for organizations seeking to unlock the benefits of Java 17 while minimizing the time and effort required for application upgrades. As we continue to enhance the Q Developer agent based on customer feedback, we encourage you to explore the open source example application provided and experience firsthand how this tool can streamline your Java modernization journey. See Getting Started with Amazon Q Developer agent for code transformation for a step-by-step process for transforming a Java application with Q Developer.
About the authors
Jonathan Vogel
Jonathan is a Developer Advocate at AWS. He was a DevOps Specialist Solutions Architect at AWS for two years prior to taking on the Developer Advocate role. Prior to AWS, he practiced professional software development for over a decade. Jonathan enjoys music, birding and climbing rocks.
Venugopalan Vasudevan
Venugopalan is a Senior Specialist Solutions Architect at Amazon Web Services (AWS), where he specializes in AWS Generative AI services. His expertise lies in helping customers leverage cutting-edge services like Amazon Q, and Amazon Bedrock to streamline development processes, accelerate innovation, and drive digital transformation.
Operators, administrators, developers, and many other personas leveraging AWS come across multiple common issues when it comes to troubleshooting in the AWS Console. To help alleviate this burden, AWS released Amazon Q. Amazon Q is AWS’s generative AI-powered assistant that helps make your organizational data more accessible, write code, answer questions, generate content, solve problems, manage AWS resources, and take action. A component of Amazon Q is Amazon Q Developer. Amazon Q Developer reimagines your experience across the entire development lifecycle, including having the ability to help you understand errors and remediate them in the AWS Management Console. Additionally, Amazon Q also provides access to opening new AWS support cases to address your AWS questions if further troubleshooting help is needed.
In this blog post, we will highlight five troubleshooting examples with Amazon Q. Specific use cases that will be covered include: EC2 SSH connection issues, VPC Network troubleshooting, IAM Permission troubleshooting, AWS Lambda troubleshooting, and troubleshooting S3 errors.
Prerequisites
To follow along with these examples, the following prerequisites are required:
In this section, we will show an example of troubleshooting an EC2 SSH connection issue. If you haven’t already, please be sure to create an Amazon EC2 instance for the purpose of this walkthrough.
First, sign in to the AWS console and navigate to the us-west-2 Region, then click on the Amazon Q icon in the right sidebar of the AWS Management Console, as shown below in Figure 1.
Figure 1 – Opening Amazon Q chat in the console
With the Amazon Q chat open, we enter the following prompt below:
Prompt:
"Why cant I SSH into my EC2 instance <insert Instance ID here>?"
Note: you can obtain the instance ID from within EC2 service in the console.
We now get a response stating: “It looks like you need help with network connectivity issues. Amazon Q works with VPC Reachability Analyzer to provide an interactive generative AI experience for troubleshooting network connectivity issues. You can try the preview experience here (available in US East N. Virginia Region).”
Now, Amazon Q will run an analysis for connectivity between the internet and your EC2 instance. Find a sample response from Amazon Q below:
Figure 3 – Response from Amazon Q network troubleshooting
Toward the end of its explanation, Amazon Q states that it checked whether the security groups allow inbound traffic on port 22 and found that access was blocked.
Figure 4 – Response from Amazon Q network troubleshooting cont.
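The security group check that Amazon Q describes can be approximated locally. The following is a minimal sketch, not Amazon Q's implementation: it assumes rules shaped like the IpPermissions field returned by the EC2 describe_security_groups API, and it only looks at IPv4 CIDR ranges (IPv6 ranges, prefix lists, and security group references are ignored). The helper name allows_inbound_ssh and the sample rules are hypothetical.

```python
import ipaddress


def allows_inbound_ssh(ip_permissions, source_ip, port=22):
    """Return True if any rule permits TCP traffic from source_ip to the port."""
    source = ipaddress.ip_address(source_ip)
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            port_ok = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            port_ok = rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)
        else:
            continue  # other protocols (udp, icmp) cannot carry SSH
        if not port_ok:
            continue
        for ip_range in rule.get("IpRanges", []):
            if source in ipaddress.ip_network(ip_range["CidrIp"]):
                return True
    return False


# Hypothetical rule sets for illustration only
https_only = [
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]
with_ssh = https_only + [
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "203.0.113.0/24"}]},
]

print(allows_inbound_ssh(https_only, "203.0.113.10"))  # False
print(allows_inbound_ssh(with_ssh, "203.0.113.10"))    # True
```

In this sketch, the first rule set reproduces the blocked state from the walkthrough, and adding an inbound rule for port 22 from your own address range resolves it.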
In this section, we will show how to troubleshoot a VPC network connection issue.
In this example, I have two EC2 instances, Server-1-demo and Server-2-demo, in two separate VPCs, as shown below in figure 5. I want to use Amazon Q troubleshooting to understand why these two instances cannot communicate with each other.
Figure 5 – two EC2 instances
First, we navigate to the AWS console and click the Amazon Q icon in the right sidebar of the AWS Management Console, as shown below in figure 6.
Figure 6 – Opening Amazon Q chat in the console
Now, with the Amazon Q console chat open, I enter the following prompt to help understand the connectivity issue between the servers:
Prompt:
"Why can't my Server-1-demo communicate with Server-2-demo?"
Figure 7 – prompt for Amazon Q connectivity troubleshooting
Now, click the preview experience here hyperlink to be redirected to the Amazon Q network troubleshooting – preview. Amazon Q troubleshooting will now generate a response as shown below in Figure 8.
Figure 8 – connectivity troubleshooting response generated by Amazon Q
In the response, Amazon Q states, “It sounds like you are troubleshooting connectivity between Server-1-demo and Server-2-demo. Based on the previous context, these instances are in different VPCs which could explain why TCP testing previously did not resolve the issue, if a peering connection is not established between the VPCs.“
So, because the two instances are in different VPCs, we need to establish a VPC peering connection between those VPCs.
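As a rough illustration of that fix, the sketch below requests and accepts a peering connection with boto3. It assumes both VPCs are in the same account and Region and that boto3 and credentials are available; the function names and CIDR values are hypothetical. Because VPC peering requires non-overlapping CIDR blocks, a small local check is included, and that check is the only part that runs here.

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR blocks share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


def request_peering(requester_vpc_id, accepter_vpc_id):
    """Request and accept a peering connection (same account and Region assumed)."""
    import boto3  # imported lazily so the CIDR check runs without the AWS SDK

    ec2 = boto3.client("ec2")
    response = ec2.create_vpc_peering_connection(
        VpcId=requester_vpc_id, PeerVpcId=accepter_vpc_id
    )
    peering_id = response["VpcPeeringConnection"]["VpcPeeringConnectionId"]
    ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)
    return peering_id


# Hypothetical CIDR blocks for the two VPCs
print(cidrs_overlap("10.0.0.0/16", "10.1.0.0/16"))    # False: peering is possible
print(cidrs_overlap("10.0.0.0/16", "10.0.128.0/17"))  # True: peering would be rejected
```

A peering connection alone does not pass traffic: each VPC's route tables also need a route to the other VPC's CIDR block, and the security groups must allow the traffic.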
IAM Permission troubleshooting
Now, let’s take a look at how Amazon Q can help resolve IAM Permission issues.
In this example, I’m creating a cluster with Amazon Elastic Container Service (ECS). I chose to deploy my containers on Amazon EC2 instances, which prompted some configuration options, including whether I wanted an SSH Key pair. I chose to “Create a new key pair”.
Figure 9 – Configuring ECS key pair
That opens up a new tab in the EC2 console.
Figure 10 – Creating ECS key pair
But when I tried to create the SSH key pair, I got the error below:
Figure 11 – ECS console error
So, I clicked the link to “Troubleshoot with Amazon Q” which revealed an explanation as to why my user was not able to create the SSH key pair and the specific permissions that were missing.
Figure 12 – Amazon Q troubleshooting analysis
So, I clicked the “Help me resolve” link and got the following steps.
Figure 13 – Amazon Q troubleshooting resolution
Even though my user had permissions to use Amazon ECS, the user also needs certain permissions in the Amazon EC2 service, specifically ec2:CreateKeyPair. By only enabling the specific action required for this IAM user, your organization can follow the best practice of least privilege.
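A least-privilege policy for this case can be expressed as a small IAM policy document. The sketch below only builds and prints the JSON; the helper name is hypothetical, and the "Resource": "*" value is an assumption made for brevity that you may want to narrow to specific key pair ARNs.

```python
import json


def least_privilege_policy(actions):
    """Build an IAM policy document that allows only the listed actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": "*",  # assumption: narrow this where possible
            }
        ],
    }


policy = least_privilege_policy(["ec2:CreateKeyPair"])
print(json.dumps(policy, indent=2))
```

The resulting document can be attached to the IAM user or role as an inline or managed policy.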
Lambda troubleshooting
Another area where Amazon Q can help is with AWS Lambda errors during development work in the AWS Console. Users may run into issues such as missing configurations, missing environment variables, and code typos. Amazon Q can help you troubleshoot these issues with step-by-step guidance on how to fix them.
In this example, in the us-west-2 Region, we have created a new Lambda function called demo_function_blog in the console with the Python 3.12 runtime. The code below imports pandas, but the function is missing the Lambda layer for AWS SDK for pandas that provides it.
Lambda Code:
import json
import pandas as pd

def lambda_handler(event, context):
    data = {'Name': ['John', 'Jane', 'Jim'], 'Age': [25, 30, 35]}
    df = pd.DataFrame(data)
    print(df.head())  # print first five rows
    return {
        'statusCode': 200,
        'body': json.dumps("execution successful!")
    }
Now, in the Lambda console, we configure a test event called test-event to test this code, as shown below in figure 14.
Figure 14 – configuring test event
Now that the test event is created, we can move over to the Test tab in the Lambda console and click the Test button. We will then see an (intended) error, and we click the Troubleshoot with Amazon Q button, as shown below in figure 15.
Figure 15 – Lambda Error
Now we can see Amazon Q’s analysis of the issue. It states “It appears that the Lambda function is missing a dependency. The error message indicates that the function code requires the ‘pandas’ module, ….”. Click Help me resolve to get step-by-step instructions for the fix, as shown below in figure 16.
Figure 16 – Amazon Q Analysis
Amazon Q will then generate a step-by-step resolution for the error, as shown below in figure 17.
Figure 17 – Amazon Q Resolution
Following Amazon Q’s recommendations, we need to add a new Lambda layer for the pandas dependency, as shown below in figure 18:
Figure 18 – Updating lambda layer
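The same change can be scripted instead of made in the console. The sketch below is an assumption-laden illustration, not the steps Amazon Q generated: it uses the boto3 Lambda client, and the layer ARN shown is a made-up placeholder. Because update_function_configuration replaces the function's whole layer list, the helper first merges the new layer with the layers already attached.

```python
def merged_layers(current_layer_arns, new_layer_arn):
    """Return the existing layers plus the new one, without duplicates."""
    layers = list(current_layer_arns)
    if new_layer_arn not in layers:
        layers.append(new_layer_arn)
    return layers


def attach_layer(function_name, new_layer_arn):
    """Attach a layer to a function while keeping its current layers."""
    import boto3  # imported lazily so merged_layers runs without the AWS SDK

    client = boto3.client("lambda")
    config = client.get_function_configuration(FunctionName=function_name)
    current = [layer["Arn"] for layer in config.get("Layers", [])]
    client.update_function_configuration(
        FunctionName=function_name,
        Layers=merged_layers(current, new_layer_arn),
    )


# Placeholder ARN for illustration; use the ARN of the pandas layer in your Region
example_arn = "arn:aws:lambda:us-west-2:123456789012:layer:example-pandas-layer:1"
print(merged_layers([], example_arn))
```

With real values, attach_layer("demo_function_blog", layer_arn) would apply the update to the function from this walkthrough.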
Once updated, go to the Test tab once again and click Test. The function code should now run successfully as shown below in figure 19:
While working with Amazon S3, users might encounter errors that can disrupt the smooth functioning of their operations. Identifying and resolving these issues promptly is crucial for ensuring uninterrupted access to S3 resources. Amazon Q, a powerful tool, offers a seamless way to troubleshoot errors across various AWS services, including Amazon S3.
In this example, we use Amazon Q to troubleshoot an S3 replication rule configuration error. Imagine you’re attempting to configure a replication rule for an Amazon S3 bucket, and the configuration fails. You can turn to Amazon Q for assistance. If you receive an error that Amazon Q can help with, a Troubleshoot with Amazon Q button appears in the error message. Navigate to the Amazon S3 service in the console to follow along with this example if it applies to your use case.
Figure 20 – S3 console error
To troubleshoot with Amazon Q, choose Troubleshoot with Amazon Q. A window appears in which Amazon Q provides information about the error under the title “Analysis”.
Amazon Q diagnosed that the error occurred because versioning is not enabled for the source bucket specified. Versioning must be enabled on the source bucket in order to replicate objects from that bucket.
Amazon Q also provides an overview on how to resolve this error. To see detailed steps for how to resolve the error, choose Help me resolve.
Figure 21 – Amazon Q analysis
It can take several seconds for Amazon Q to generate instructions. After they appear, follow the instructions to resolve the error.
Figure 22 – Amazon Q Resolution
Here, Amazon Q recommends the following steps to resolve the error:
Navigate to the S3 console
Select the S3 bucket
Go to the Properties tab
Under Versioning, click Edit
Enable versioning on the bucket
Return to replication rule creation page
Retry creating replication rule
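If you prefer to script the fix, the versioning step above can be sketched with boto3. This is a minimal illustration that assumes boto3 and credentials are configured; the bucket name is hypothetical, and only the request-building helper runs here.

```python
def versioning_request(bucket_name):
    """Build the arguments for enabling versioning on a bucket."""
    return {
        "Bucket": bucket_name,
        "VersioningConfiguration": {"Status": "Enabled"},
    }


def enable_versioning(bucket_name):
    """Enable versioning and return the status S3 reports afterwards."""
    import boto3  # imported lazily so versioning_request runs without the AWS SDK

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(**versioning_request(bucket_name))
    return s3.get_bucket_versioning(Bucket=bucket_name).get("Status")


# Hypothetical bucket name for illustration
print(versioning_request("my-source-bucket"))
```

S3 replication also requires versioning on the destination bucket, so the same call may be needed there before retrying the replication rule.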
Conclusion
Amazon Q is a powerful AI-powered assistant that can greatly simplify troubleshooting of common issues across various AWS services, especially for Developers. Amazon Q provides detailed analysis and step-by-step guidance to resolve errors efficiently. By leveraging Amazon Q, AWS users can save significant time and effort in diagnosing and fixing problems, allowing them to focus more on building and innovating with AWS. Amazon Q represents a valuable addition to the AWS ecosystem, empowering users with enhanced support and streamlined troubleshooting capabilities.
We’re thrilled to announce that AWS has been named a Leader in the IDC MarketScape: Worldwide Analytic Stream Processing Software 2024 Vendor Assessment (doc #US51053123, March 2024).
We believe this recognition validates the power and performance of Apache Flink for real-time data processing, and how AWS is leading the way to help customers build and run fully managed Apache Flink applications. You can read the full report from IDC.
Unleashing real-time insights for your organization
Apache Flink’s robust architecture enables real-time data processing at scale, making it a favored choice among organizations for its efficiency and speed. With its advanced features for event time processing and state management, Apache Flink empowers users to build complex stream processing applications, making it indispensable for modern data-driven organizations. Managed Service for Apache Flink takes the complexity out of Apache Flink deployment and management, letting you focus on building game-changing applications. With Managed Service for Apache Flink, you can transform and analyze streaming data in real time using Apache Flink and integrate applications with other AWS services. There are no servers and clusters to manage, and there is no compute and storage infrastructure to set up. You pay only for the resources you use.
But what does this mean for your organizations and IT teams? The following are some use cases and benefits:
Faster insights, quicker action – Analyze data streams as they arrive, allowing you to react promptly to changing conditions and make informed decisions based on the latest information, achieving agility and competitiveness in dynamic markets.
Real-time fraud detection – Identify suspicious activity the moment it occurs, enabling proactive measures to protect your customers and revenue from potential financial losses, bolstering trust and security in your business operations.
Personalized customer interactions – Gain insights from user behavior in real time, enabling personalized experiences and the ability to proactively address potential issues before they impact customer satisfaction, fostering loyalty and enhancing brand reputation.
Data-driven optimization – Utilize real-time insights from sensor data and machine logs to streamline processes, identify inefficiencies, and optimize resource allocation, driving operational excellence and cost savings while maintaining peak performance.
Advanced AI – Continuously feed real-time data to your machine learning (ML) and generative artificial intelligence (AI) models, allowing them to adapt and personalize outputs for more relevant and impactful results.
Beyond the buzzword: Apache Flink in action
Apache Flink’s versatility extends beyond single use cases. The following are just a few examples of how our customers are taking advantage of its capabilities:
The National Hockey League is the second oldest of the four major professional team sports leagues in North America. Predicting events such as face-off winning probabilities during a live game is a complex task that requires processing a significant amount of quality historical data and data streams in real time. The NHL constructed the Face-off Probability model using Apache Flink. Managed Service for Apache Flink provides the underlying infrastructure for the Apache Flink applications, removing the need to self-manage an Apache Flink cluster and reducing maintenance complexity and costs.
Arity is a technology company focused on making transportation smarter, safer, and more useful. They transform massive amounts of data into actionable insights to help partners better predict risk and make smarter decisions in real time. Arity uses the managed ability of Managed Service for Apache Flink to transform and analyze streaming data in near real time using Apache Flink. On Managed Service for Apache Flink, Arity generates driving behavior insights based on collated driving data.
SOCAR is the leading Korean mobility company with strong competitiveness in car sharing. SOCAR solves mobility-related social problems, such as parking difficulties and traffic congestion, and changes the car ownership-oriented mobility habits in Korea.
Join the leaders in stream processing
By choosing Managed Service for Apache Flink, you’re joining a growing community of organizations who are unlocking the power of real-time data analysis. Get started today and see how Apache Flink can transform your data strategy, including powering the next generation of generative AI applications.
Ready to learn more?
Contact us today and discover how Apache Flink can empower your business.
About the author
Anna Montalat is the Product Marketing lead for AWS analytics and streaming data services, including Amazon Managed Streaming for Apache Kafka (MSK), Kinesis Data Streams, Kinesis Video Streams, Amazon Data Firehose, and Amazon Managed Service for Apache Flink, among others. She is passionate about bringing new and emerging technologies to market, working closely with service teams and enterprise customers. Outside of work, Anna skis through winter time and sails through summer.
RSA Conference 2024 drew 650 speakers, 600 exhibitors, and thousands of security practitioners from across the globe to the Moscone Center in San Francisco, California from May 6 through 9.
The keynote lineup was diverse, with 33 presentations featuring speakers ranging from WarGames actor Matthew Broderick, to public and private-sector luminaries such as Cybersecurity and Infrastructure Security Agency (CISA) Director Jen Easterly, U.S. Secretary of State Antony Blinken, security technologist Bruce Schneier, and cryptography experts Tal Rabin, Whitfield Diffie, and Adi Shamir.
Topics aligned with this year’s conference theme, “The art of possible,” and focused on actions we can take to revolutionize technology through innovation, while fortifying our defenses against an evolving threat landscape.
This post highlights three themes that caught our attention: artificial intelligence (AI) security, the Secure by Design approach to building products and services, and Chief Information Security Officer (CISO) collaboration.
AI security
Organizations in all industries have started building generative AI applications using large language models (LLMs) and other foundation models (FMs) to enhance customer experiences, transform operations, improve employee productivity, and create new revenue channels. So it’s not surprising that AI dominated conversations. Over 100 sessions touched on the topic, and the desire of attendees to understand AI technology and learn how to balance its risks and opportunities was clear.
“Discussions of artificial intelligence often swirl with mysticism regarding how an AI system functions. The reality is far more simple: AI is a type of software system.” — CISA
FMs and the applications built around them are often used with highly sensitive business data such as personal data, compliance data, operational data, and financial information to optimize the model’s output. As we explore the advantages of generative AI, protecting highly sensitive data and investments is a top priority. However, many organizations aren’t paying enough attention to security.
A joint generative AI security report released by Amazon Web Services (AWS) and the IBM Institute for Business Value during the conference found that 82% of business leaders view secure and trustworthy AI as essential for their operations, but only 24% are actively securing generative AI models and embedding security processes in AI development. In fact, nearly 70% say innovation takes precedence over security, despite concerns over threats and vulnerabilities (detailed in Figure 1).
Figure 1: Generative AI adoption concerns, Source: IBM Security
Because data and model weights—the numerical values models learn and adjust as they train—are incredibly valuable, organizations need them to stay protected, secure, and private, whether that means restricting access from an organization’s own administrators, customers, or cloud service provider, or protecting data from vulnerabilities in software running in the organization’s own environment.
There is no silver AI-security bullet, but as the report points out, there are proactive steps you can take to start protecting your organization and leveraging AI technology to improve your security posture:
Establish a governance, risk, and compliance (GRC) foundation. Trust in gen AI starts with new security governance models (Figure 2) that integrate and embed GRC capabilities into your AI initiatives, and include policies, processes, and controls that are aligned with your business objectives.
Figure 2: Updating governance, risk, and compliance models, Source: IBM Security
Strengthen your security culture. When we think of securing AI, it’s natural to focus on technical measures that can help protect the business. But organizations are made up of people—not technology. Educating employees at all levels of the organization can help avoid preventable harms such as prompt-based risks and unapproved tool use, and foster a resilient culture of cybersecurity that supports effective risk mitigation, incident detection and response, and continuous collaboration.
“You’ve got to understand early on that security can’t be effective if you’re running it like a project or a program. You really have to run it as an operational imperative—a core function of the business. That’s when magic can happen.” — Hart Rossman, Global Services Security Vice President at AWS
Engage with partners. Developing and securing AI solutions requires resources and skills that many organizations lack. Partners can provide you with comprehensive security support—whether that’s informing and advising you about generative AI, or augmenting your delivery and support capabilities. This can help make your engineers and your security controls more effective.
While many organizations purchase security products or solutions with embedded generative AI capabilities, nearly two-thirds, as detailed in Figure 3, report that their generative AI security capabilities come through some type of partner.
Figure 3: Most security gen AI capabilities are coming from third-party products or partners, Source: IBM Security
Tens of thousands of customers are using AWS, for example, to experiment and move transformative generative AI applications into production. AWS provides AI-powered tools and services, a Generative AI Innovation Center program, and an extensive network of AWS partners that have demonstrated expertise delivering machine learning (ML) and generative AI solutions. These resources can support your teams with hands-on help developing solutions mapped to your requirements, and a broader collection of knowledge they can use to help you make the nuanced decisions required for effective security.
Building secure software was a popular and related focus at the conference. Insecure design is ranked as the number four critical web application security concern on the Open Web Application Security Project (OWASP) Top 10.
The concept known as Secure by Design is gaining importance in the effort to mitigate vulnerabilities early, minimize risks, and recognize security as a core business requirement. Secure by Design builds off of security models such as Zero Trust, and aims to reduce the burden of cybersecurity and break the cycle of constantly creating and applying updates by developing products that are foundationally secure.
More than 60 technology companies—including AWS—signed CISA’s Secure by Design Pledge during RSA Conference as part of a collaborative push to put security first when designing products and services.
The pledge demonstrates a commitment to making measurable progress towards seven goals within a year:
Broaden the use of multi-factor authentication (MFA)
Reduce default passwords
Enable a significant reduction in the prevalence of one or more vulnerability classes
Increase the installation of security patches by customers
Publish a vulnerability disclosure policy (VDP)
Demonstrate transparency in vulnerability reporting
Strengthen the ability of customers to gather evidence of cybersecurity intrusions affecting products
“From day one, we have pioneered secure by design and secure by default practices in the cloud, so AWS is designed to be the most secure place for customers to run their workloads. We are committed to continuing to help organizations around the world elevate their security posture, and we look forward to collaborating with CISA and other stakeholders to further grow and promote security by design and default practices.” — Chris Betz, CISO at AWS
The need for security by design applies to AI like any other software system. To protect users and data, we need to build security into ML and AI with a Secure by Design approach that considers these technologies to be part of a larger software system, and weaves security into the AI pipeline.
Since models tend to have very high privileges and access to data, integrating an AI bill of materials (AI/ML BOM) and Cryptography Bill of Materials (CBOM) into BOM processes can help you catalog security-relevant information, and gain visibility into model components and data sources. Additionally, frameworks and standards such as the AI RMF 1.0, the HITRUST AI Assurance Program, and ISO/IEC 42001 can facilitate the incorporation of trustworthiness considerations into the design, development, and use of AI systems.
CISO collaboration
In the RSA Conference keynote session CISO Confidential: What Separates The Best From The Rest, Trellix CEO Bryan Palma and CISO Harold Rivas noted that there are approximately 32,000 global CISOs today—4 times more than 10 years ago. The challenges they face include staffing shortages, liability concerns, and a rapidly evolving threat landscape. According to research conducted by the Information Systems Security Association (ISSA), nearly half of organizations (46%) report that their cybersecurity team is understaffed, and more than 80% of CISOs recently surveyed by Trellix have experienced an increase in cybersecurity threats over the past six months. When asked what would most improve their organizations’ abilities to defend against these threats, their top answer was industry peers sharing insights and best practices.
Building trusted relationships with peers and technology partners can help you gain the knowledge you need to effectively communicate the story of risk to your board of directors, keep up with technology, and build success as a CISO.
AWS CISO Circles provide a forum for cybersecurity executives from organizations of all sizes and industries to share their challenges, insights, and best practices. CISOs come together in locations around the world to discuss the biggest security topics of the moment. With NDAs in place and the Chatham House Rule in effect, security leaders can feel free to speak their minds, ask questions, and get feedback from peers through candid conversations facilitated by AWS Security leaders.
“When it comes to security, community unlocks possibilities. CISO Circles give us an opportunity to deeply lean into CISOs’ concerns, and the topics that resonate with them. Chatham House Rule gives security leaders the confidence they need to speak openly and honestly with each other, and build a global community of knowledge-sharing and support.” — Clarke Rodgers, Director of Enterprise Strategy at AWS
At RSA Conference, CISO Circle attendees discussed the challenges of adopting generative AI. When asked whether CISOs or the business own generative AI risk for the organization, the consensus was that security can help with policies and recommendations, but the business should own the risk and decisions about how and when to use the technology. Some attendees noted that they took initial responsibility for generative AI risk, before transitioning ownership to an advisory board or committee comprised of leaders from their HR, legal, IT, finance, privacy, and compliance and ethics teams over time. Several CISOs expressed the belief that quickly taking ownership of generative AI risk before shepherding it to the right owner gave them a valuable opportunity to earn trust with their boards and executive peers, and to demonstrate business leadership during a time of uncertainty.
Embrace the art of possible
There are many more RSA Conference highlights on a wide range of additional topics, including post-quantum cryptography developments, identity and access management, data perimeters, threat modeling, cybersecurity budgets, and cyber insurance trends. If there’s one key takeaway, it’s that we should never underestimate what is possible from threat actors or defenders. By harnessing AI’s potential while addressing its risks, building foundationally secure products and services, and developing meaningful collaboration, we can collectively strengthen security and establish cyber resilience.
Join us to learn more about cloud security in the age of generative AI at AWS re:Inforce 2024, June 10–12 in Pennsylvania. Register today with the code SECBLOfnakb to receive a limited-time $150 USD discount, while supplies last.
Amazon Simple Email Service (SES) is a cloud-based email sending service provided by Amazon Web Services (AWS), handling both inbound and outbound email traffic for your applications. It allows you to send and receive email using SES’s reliable and cost-effective infrastructure without having to provision email servers yourself.
Managing multiple email workloads at scale can be a daunting task for organizations. From handling high volumes of emails to routing them efficiently, and ensuring uniform compliance with regulations, the challenges can be overwhelming. Managing different types of outbound emails, whether one-to-one user email, transactional or marketing emails generated from applications, also becomes challenging due to increasing concerns of security and compliance requirements. To help customers tackle these pain points, Amazon Web Services (AWS) has introduced a new feature to streamline inbound and outbound email management: SES Mail Manager.
The challenge: Managing different email flows efficiently with compliance and security in place
Efficiently routing and processing emails to the appropriate teams or systems while ensuring proper filtering, security, and compliance is a complex undertaking. Meanwhile, outbound email flows have become increasingly complex. Besides emails being sent between users, more and more emails are generated from different types of applications. On top of that, keeping up with security and compliance requirements is an ongoing task for all IT administrators and CISOs. Maintaining email integrations with existing business applications, providing scalability and redundancy to accommodate spikes, and facilitating long-term archiving and retrieval further compound the difficulties. Without a robust and scalable solution, organizations struggle to manage email communications effectively, hindering productivity and exposing themselves to risks.
Solution: Amazon SES Mail Manager
Amazon SES Mail Manager is a comprehensive solution with a powerful set of email gateway features that strengthens your organization’s email infrastructure. It simplifies email workflow management and streamlines compliance control, while integrating seamlessly with your existing systems. Mail Manager consolidates all incoming and outgoing email through a single control point. This allows you to apply unified tools, rules, and delivery behaviors across your entire email flow. The centralized approach improves reliability, security, and flexibility.
Some key capabilities include connecting different business applications, automating inbound email processing, managing outgoing emails, enhancing compliance through email archiving, and efficiently controlling overall email traffic. It provides a centralized hub to optimize email infrastructure, simplify processes, ensure compliance, and maintain a high degree of reliability and security.
Mail Manager features
Ingress Endpoints: Customizable SMTP endpoints for receiving emails
An ingress endpoint is a key infrastructure component that uses filtering policies and rules that you configure to determine which emails should be allowed into your organization and which ones should be rejected.
Amazon SES already offers a way to receive incoming emails from the internet through its SMTP interface, called SES Inbound. This provides a shared, regional SMTP endpoint that all SES customers can use to accept emails. Improving on this, Mail Manager introduces a more flexible and powerful approach to handling inbound email with Amazon SES, with different types of ingress endpoints.
Mail Manager now offers two options for customers: Open Ingress Endpoints and Authenticated Ingress Endpoints.
Open Ingress Endpoints allow you to create unique, customizable SMTP endpoints that give you control to accept or reject email messages based on your specific needs. Open Ingress Endpoints do not require domain verification to receive inbound emails. Simply point your domain’s MX record to the newly created Ingress Endpoint, and it will start receiving emails for that domain.
Authenticated Ingress Endpoints in Mail Manager also enable a new capability: allowing SES to accept emails from trusted external SMTP servers for further processing. Users can create a Traffic Policy to configure trusted external SMTP servers with either type of Ingress Endpoint. What’s different about Authenticated Ingress Endpoints is that users need to use SMTP authentication to send messages. Once provisioned, you obtain credentials to connect your existing email infrastructure to the Authenticated Ingress Endpoint as an outgoing email server.
For a step-by-step guide on how to create an ingress endpoint, refer to the documentation.
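To illustrate what connecting to an Authenticated Ingress Endpoint as an outgoing server might look like from an application, here is a minimal sketch using Python's standard smtplib. The host name, credentials, and addresses are placeholders, and the use of port 587 with STARTTLS is an assumption of this sketch; check the endpoint details in your own configuration. Only the message-building part runs here.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Create a plain-text email message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_authenticated_endpoint(msg, host, username, password, port=587):
    """Send a message through an SMTP endpoint that requires authentication."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()  # assumption: the endpoint supports STARTTLS on this port
        smtp.login(username, password)
        smtp.send_message(msg)


# Placeholder addresses for illustration
msg = build_message(
    "app@example.com", "support@example.com", "Test", "Hello from the relay sketch."
)
print(msg["Subject"])  # Test
```

With real values, send_via_authenticated_endpoint(msg, endpoint_host, username, password) would submit the message using the credentials obtained when the endpoint was provisioned.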
Traffic policies & policy statements
Traffic Policies enable fine-grained control over accepting or rejecting inbound email. A traffic policy is a container for policy statements that you assign to an ingress endpoint, so that it can sort the incoming mail by allowing or blocking specific types of email when the conditions of the policy statements are met. You have the option to set a maximum message size so that any email larger than that size is immediately discarded; this acts as a “first pass” filter when set. Next, you set either Allow or Deny as the default action that’s taken for email that falls outside of the conditions of your policy statements. This is the “catch all” action for the traffic policy.
Policy statements are also created with either an allow or block action that is taken when the statements’ conditions are met. You build the conditions by selecting an email protocol and a conditional operator for a value you enter that must be matched by the incoming message before the policy statement will allow or block it. The three conditions available in the policy statement are Recipient address, Sender IP range and TLS protocol version. Each policy statement can have multiple conditions.
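The evaluation order described above (size filter first, then policy statements, then the default action) can be modeled in a few lines. This is a conceptual sketch only, not the Mail Manager API: all class and field names are hypothetical, and two behaviors are assumptions of the sketch, namely that every condition in a statement must match and that the first matching statement decides the outcome.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Optional


def _version(text):
    """Turn a version string such as "1.2" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


@dataclass
class Email:
    recipient: str
    sender_ip: str
    tls_version: str
    size_bytes: int


@dataclass
class PolicyStatement:
    action: str  # "ALLOW" or "DENY"
    recipient_endswith: Optional[str] = None
    sender_ip_in: Optional[str] = None
    tls_at_least: Optional[str] = None

    def matches(self, email):
        checks = []
        if self.recipient_endswith is not None:
            checks.append(email.recipient.endswith(self.recipient_endswith))
        if self.sender_ip_in is not None:
            network = ipaddress.ip_network(self.sender_ip_in)
            checks.append(ipaddress.ip_address(email.sender_ip) in network)
        if self.tls_at_least is not None:
            checks.append(_version(email.tls_version) >= _version(self.tls_at_least))
        return bool(checks) and all(checks)  # assumption: all conditions must hold


@dataclass
class TrafficPolicy:
    default_action: str
    statements: list = field(default_factory=list)
    max_message_size: Optional[int] = None

    def evaluate(self, email):
        # "First pass" filter: oversized mail is rejected before any statement runs
        if self.max_message_size is not None and email.size_bytes > self.max_message_size:
            return "DENY"
        for statement in self.statements:  # assumption: first match wins
            if statement.matches(email):
                return statement.action
        return self.default_action  # the "catch all" action


policy = TrafficPolicy(
    default_action="DENY",
    max_message_size=10 * 1024 * 1024,
    statements=[
        PolicyStatement("DENY", sender_ip_in="198.51.100.0/24"),
        PolicyStatement("ALLOW", recipient_endswith="@example.com", tls_at_least="1.2"),
    ],
)
print(policy.evaluate(Email("sales@example.com", "203.0.113.5", "1.3", 2000)))   # ALLOW
print(policy.evaluate(Email("sales@example.com", "198.51.100.7", "1.3", 2000)))  # DENY
```

The three condition types in the sketch mirror the ones named above: recipient address, sender IP range, and TLS protocol version.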
Rule Sets & rules
Taking Control with Rule Sets: After traffic policies permit certain messages into Mail Manager, customers use rule sets to apply custom processing logic for routing, optional functions, archiving, and delivery. You can add multiple Rules within a Rule Set. You can specify the order in which Rules within a Rule Set are evaluated, as well as the order of Actions within each Rule. Each Rule consists of:
Conditions: Criteria that an email must match for the Rule to be applied.
Exceptions: Criteria that, if matched, will exempt the email from the Rule.
Actions: The operations to be performed on emails that meet the Rule’s Conditions and don’t match any Exceptions.
Recipient-Oriented Processing:
Mail Manager processes emails in a recipient-oriented manner, meaning rules are applied separately for each recipient. In traditional email gateways, rules are typically applied at the message level, affecting all recipients uniformly. For instance, if a rule in a traditional gateway adds a header for emails addressed to [email protected], all recipients of that email will see the header. This can lead to unintended side effects where actions meant for one recipient affect others. With Mail Manager, only emails to [email protected] will have the header, ensuring rules are specific to each recipient.
Additionally, Mail Manager allows rules to be applied to all recipients when needed, such as using the Subject header as a rule condition. This flexibility provides greater precision and control in email processing, allowing rules administrators to tailor the application of rules to meet specific requirements for individual recipients or for the entire email.
By enabling both recipient-oriented and message-oriented approaches, Mail Manager enhances privacy, compliance, and security by preventing unintended data exposure and ensuring actions are applied only where intended.
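To make the difference concrete, the following Python sketch models recipient-oriented processing: each recipient gets its own copy of the message, and rules (conditions, exceptions, actions) are evaluated against that copy. All names here are hypothetical and the model is deliberately simplified; it is not how Mail Manager is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    subject: str
    recipients: List[str]


@dataclass
class Delivery:
    """One recipient's copy of a message, with its own headers."""
    recipient: str
    subject: str
    headers: Dict[str, str] = field(default_factory=dict)


Predicate = Callable[[Delivery], bool]
Action = Callable[[Delivery], None]


def add_header(name: str, value: str) -> Action:
    def action(delivery: Delivery) -> None:
        delivery.headers[name] = value
    return action


@dataclass
class Rule:
    conditions: List[Predicate]
    actions: List[Action]
    exceptions: List[Predicate] = field(default_factory=list)

    def applies_to(self, delivery: Delivery) -> bool:
        # A rule applies when all conditions match and no exception matches.
        return (all(c(delivery) for c in self.conditions)
                and not any(x(delivery) for x in self.exceptions))


def process(message: Message, rules: List[Rule]) -> List[Delivery]:
    deliveries = []
    for recipient in message.recipients:
        # Rules run once per recipient, so actions touch only that copy.
        delivery = Delivery(recipient, message.subject)
        for rule in rules:  # rules are evaluated in list order
            if rule.applies_to(delivery):
                for action in rule.actions:  # actions run in list order
                    action(delivery)
        deliveries.append(delivery)
    return deliveries


rules = [
    # Recipient-scoped rule: only this recipient's copy gets the header.
    Rule(conditions=[lambda d: d.recipient == "finance@example.com"],
         actions=[add_header("X-Team", "finance")]),
    # Message-scoped rule: the Subject is shared, so every copy matches.
    Rule(conditions=[lambda d: "invoice" in d.subject.lower()],
         actions=[add_header("X-Category", "billing")]),
]

copies = process(Message("Invoice 42", ["finance@example.com", "alice@example.com"]), rules)
```

In this example the finance copy carries both headers, while the copy for the second recipient carries only the subject-based header.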
Flexible Conditions and Actions: Mail Manager’s Rule Sets offer a powerful expression language for defining conditions based on various email properties, such as recipient address, TLS version, source IP, subject header, and more. Rule Sets also support a wide range of Actions, such as adding headers, relaying to an external SMTP server, and archiving.
With these capabilities, Rule Sets enable you to build sophisticated, automated email processing workflows tailored to your organization’s needs.
SMTP Relay
Mail Manager’s SMTP Relay functionality allows you to integrate your inbound email processing workflows with external email infrastructure, such as on-premises Microsoft Exchange servers or third-party email gateways. It routes each email to the appropriate server based on predefined criteria, optimizing the journey of every email.
How SMTP Relay Works:
Define an SMTP Relay – First, you create an SMTP Relay resource within Mail Manager, specifying the details of the external SMTP server you want to relay emails to, such as the server hostname, port, and authentication credentials (if required).
Create a Rule with the SMTP Relay Action – Next, within a Rule Set, you create a Rule that includes the “SMTP Relay” action, selecting the SMTP Relay resource you defined earlier.
Configure the Rule Conditions – You then set the conditions for this Rule, determining which incoming emails should be relayed to the external SMTP server. For example, you could set a condition to relay all email destined for a specific domain (e.g., “@gmail.com”).
Assign the Rule Set to an Ingress Endpoint – Finally, you assign the Rule Set containing this Rule to one or more Ingress Endpoints.
When an email matching the Rule’s conditions is received by the Ingress Endpoint, Mail Manager will automatically relay that email to the external SMTP server specified in the SMTP Relay resource.
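The four steps above wire three resources together: a relay, a rule that points at it, and an ingress endpoint that carries the rule set. The Python sketch below models only that wiring and the resulting routing decision. The class names, field names, and host name are hypothetical placeholders, not the Mail Manager API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SmtpRelay:
    # Step 1: details of the external SMTP server.
    name: str
    server_name: str
    server_port: int
    credentials_ref: Optional[str] = None  # hypothetical pointer to stored credentials


@dataclass(frozen=True)
class RelayRule:
    # Steps 2 and 3: a condition plus the relay to use when it matches.
    recipient_suffix: str
    relay: SmtpRelay


@dataclass
class IngressEndpoint:
    # Step 4: the endpoint that the rule set is assigned to.
    name: str
    rule_set: List[RelayRule]

    def route(self, recipient: str) -> Optional[SmtpRelay]:
        """Return the relay for the first matching rule, or None if no rule matches."""
        for rule in self.rule_set:
            if recipient.lower().endswith(rule.recipient_suffix.lower()):
                return rule.relay
        return None


exchange = SmtpRelay("onprem-exchange", "mail.corp.example", 25)
endpoint = IngressEndpoint("inbound", [RelayRule("@corp.example", exchange)])
```

Calling `endpoint.route("bob@corp.example")` returns the `exchange` relay, and a recipient in any other domain returns `None`, meaning the message is not relayed by this rule set.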
Use Cases for SMTP Relay:
Processing layer for incoming emails: After the rules engine has processed incoming emails, relay them from Mail Manager to your email server, whether it is on premises or a cloud email system.
Supporting hybrid and migration: In hybrid email environments where some mailboxes are hosted on-premises and others are in the cloud (e.g., Microsoft 365 or Google Workspace), SMTP relay allows for seamless communication between the two environments. During email migration projects, SMTP relay can be used to temporarily route emails between the old and new email platforms, ensuring that no messages are lost during the transition period.
Mailbox resilience: By terminating MX at Mail Manager and then configuring rules for delivery to one or more mailbox providers, you can maintain resilient mailbox delivery if your primary mailbox provider is impaired. There are no DNS propagation delays; you change the delivery rule and fail over to your other system right away.
Enforcement layer: Integrate Mail Manager with third-party email services or gateways by relaying emails to their SMTP endpoints, leveraging their capabilities to enforce additional policies or security measures while maintaining control with Mail Manager.
Inter-Server Communication: SMTP relay facilitates communication between different internal email servers or systems within the organization’s network, ensuring seamless delivery of emails across various domains or platforms.
Load balancing and redundancy: Distribute email traffic across multiple servers or gateways to optimize performance and resource utilization, ensuring high availability and fault tolerance.
With SMTP Relay, Mail Manager acts as a flexible email processing layer, allowing you to incorporate its powerful capabilities while maintaining and extending your current email infrastructure investments.
Email Archiving
As organizations face increasing regulatory and compliance requirements around email retention, Mail Manager provides a robust email archiving solution. The archiving feature allows you to securely store and easily search through your email data, ensuring you meet your archiving obligations.
How Email Archiving Works:
You create an archive resource within Mail Manager, specifying the desired retention period for your archived emails.
Within a Rule Set, you create a Rule that includes the “Archive” action, selecting the Archive resource you defined earlier.
You then set the conditions for this Rule, determining which incoming emails should be archived. For example, you could archive all emails sent to a specific department’s email alias.
Finally, you assign the Rule Set containing this archiving Rule to one or more Ingress Endpoints.
Now, when an email matching the Rule’s conditions is received by the Ingress Endpoint, Mail Manager will automatically archive a verbatim copy of that email to the designated Archive resource.
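As a rough mental model of an archive resource with a retention period, the sketch below stores verbatim copies, supports a simple search, and drops copies once they are older than the retention period. It is a toy in-memory illustration with hypothetical names; it says nothing about how Mail Manager stores, indexes, or expires archived email.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class ArchivedEmail:
    recipient: str
    subject: str
    raw: bytes  # verbatim copy of the original message
    archived_at: datetime


@dataclass
class Archive:
    retention: timedelta
    items: List[ArchivedEmail] = field(default_factory=list)

    def store(self, recipient: str, subject: str, raw: bytes, now: datetime) -> None:
        self.items.append(ArchivedEmail(recipient, subject, raw, now))

    def search(self, text: str) -> List[ArchivedEmail]:
        """Case-insensitive match on subject or recipient."""
        needle = text.lower()
        return [m for m in self.items
                if needle in m.subject.lower() or needle in m.recipient.lower()]

    def expire(self, now: datetime) -> int:
        """Drop copies older than the retention period; return how many were removed."""
        kept = [m for m in self.items if now - m.archived_at <= self.retention]
        removed = len(self.items) - len(kept)
        self.items = kept
        return removed


archive = Archive(retention=timedelta(days=365))
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
archive.store("legal@example.com", "Contract renewal", b"Subject: Contract renewal\r\n\r\nBody", t0)
```

Here `archive.search("contract")` finds the stored copy, and calling `archive.expire(...)` with a date more than 365 days after `t0` removes it.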
Mail Manager’s archiving capabilities offer several advantages for organizations:
Archiving stores email data in a secure, durable, and searchable archive, meeting regulatory requirements for email retention and enabling efficient audits.
Utilize powerful search filters to locate specific emails within your archive, and export search results for further analysis or legal purposes.
Reduce the storage burden on your mail servers by archiving emails to Mail Manager’s scalable and cost-effective storage solution.
Set customizable retention periods for your archives, ensuring important email data is preserved for as long as needed.
By integrating email archiving into your Mail Manager workflows, you can maintain a comprehensive, searchable, and compliant email archive without the hassle of managing additional infrastructure.
Email Add-ons
Mail Manager offers a suite of specialized security tools, called Email Add-ons, that allow you to enhance your email security posture and tailor your inbound email workflows to your specific needs. Add-ons can be used as conditions within Traffic Policies to control which emails are allowed into your Ingress Endpoints, or as conditions within Rule Sets to determine the actions taken on specific email types. These Add-ons are certified security intelligence and enforcement solutions from vetted providers, ready to be integrated directly into your Mail Manager environment (e.g., Spamhaus Domain Block List, Abusix Mail Intelligence, Trend Micro Virus Scanning).
Email Add-Ons provide a flexible and modular approach to email security, enabling you to select and combine the solutions that best fit your unique use cases. Instead of investing in a monolithic product that may not fully align with your requirements, you can choose from a range of Add-ons and pay only for the capabilities you need, on a metered-price basis. Once you’ve subscribed to an Email Add-on from the Mail Manager console, you can seamlessly incorporate it into your email workflows.
Email Add-ons extend Mail Manager’s core threat intelligence and security enforcement features on a per-workload basis, ensuring you have the right level of protection without over-provisioning resources. Within the Mail Manager console, you can explore detailed product descriptions, key benefits, and pricing information for each Add-on, empowering you to make informed decisions.
Key benefits of Add-ons:
Immediate use: no separate setup/integration work required.
Cost effective: pay only for what you need and consume, and turn Add-ons on and off as required.
Granular deployment: enable Add-ons through an individual traffic policy or rule action.
Conclusion:
Amazon SES Mail Manager introduces advanced email routing and archiving features, providing significant benefits to customers. With customizable SMTP endpoints and recipient-oriented rule processing, customers gain precise control over email traffic, ensuring that rules are applied specifically to each recipient. The enhanced traffic policies improve email security and compliance, while the robust SMTP relay functionality seamlessly integrates with existing systems, ensuring efficient email routing and processing. Mail Manager’s archiving capabilities help meet regulatory requirements and simplify data management. Overall, Mail Manager streamlines email operations, optimizes infrastructure, and enhances reliability, security, and compliance, offering a powerful solution for managing complex email workflows.
About the Authors:
Jessica Fan is a Senior Product Manager at AWS, striving to improve the experience for Amazon SES customers. Outside of work, she enjoys long distance running, biking and bouldering.
Vinay Ujjini is an Amazon Pinpoint and Amazon Simple Email Service Worldwide Principal Specialist Solutions Architect at AWS. He has been solving customers’ omni-channel challenges for over 15 years. He is an avid sports enthusiast and, in his spare time, enjoys playing tennis and cricket.
Amazon Web Services (AWS) continues to believe it’s essential that our customers have control over their data and choices for how they secure and manage that data in the cloud. AWS gives customers the flexibility to choose how and where they want to run their workloads, including a proven track record of innovation to support specialized workloads around the world. While many customers are able to meet their stringent security, sovereignty, and privacy requirements using our existing sovereign-by-design AWS Regions, we know there’s not a one-size-fits-all solution. AWS continues to innovate based on the criteria we know are most important to our customers to give them more choice and more control. Last year we announced the AWS European Sovereign Cloud, a new independent cloud for Europe, designed to give public sector organizations and customers in highly regulated industries further choice to meet their unique sovereignty needs. Today, we’re excited to share more details about the AWS European Sovereign Cloud roadmap so that customers and partners can start planning. The AWS European Sovereign Cloud is planning to launch its first AWS Region in the State of Brandenburg, Germany by the end of 2025. Available to all AWS customers, this effort is backed by a €7.8B investment in infrastructure, jobs creation, and skills development.
The AWS European Sovereign Cloud will utilize the full power of AWS with the same familiar architecture, expansive service portfolio, and APIs that customers use today. This means that customers using the AWS European Sovereign Cloud will get the benefits of AWS infrastructure including industry-leading security, availability, performance, and resilience. We offer a broad set of services, including a full suite of databases, compute, storage, analytics, machine learning and AI, networking, mobile, developer tools, IoT, security, and enterprise applications. Today, customers can start building applications in any existing Region and simply move them to the AWS European Sovereign Cloud when the first Region launches in 2025. Partners in the AWS Partner Network, which features more than 130,000 partners, already provide a range of offerings in our existing AWS Regions to help customers meet requirements and will now be able to seamlessly deploy applications on the AWS European Sovereign Cloud.
More control, more choice
Like our existing Regions, the AWS European Sovereign Cloud will be powered by the AWS Nitro System. The Nitro System is an unparalleled computing backbone for AWS, with security and performance at its core. Its specialized hardware and associated firmware are designed to enforce restrictions so that nobody, including anyone in AWS, can access customer workloads or data running on Amazon Elastic Compute Cloud (Amazon EC2) Nitro based instances. The design of the Nitro System has been validated by the NCC Group, an independent cybersecurity firm. The controls that help prevent operator access are so fundamental to the Nitro System that we’ve added them in our AWS Service Terms to provide an additional contractual assurance to all of our customers.
To date, we have launched 33 Regions around the globe with our secure and sovereign-by-design approach. Customers come to AWS because they want to migrate to and build on a secure cloud foundation. Customers who need to comply with European data residency requirements have the choice to deploy their data to any of our eight existing Regions in Europe (Ireland, Frankfurt, London, Paris, Stockholm, Milan, Zurich, and Spain) to keep their data securely in Europe.
For customers who need to meet additional stringent operational autonomy and data residency requirements within the European Union (EU), the AWS European Sovereign Cloud will be available as another option, with infrastructure wholly located within the EU and operated independently from existing Regions. The AWS European Sovereign Cloud will allow customers to keep all customer data and the metadata they create (such as the roles, permissions, resource labels, and configurations they use to run AWS) in the EU. Customers who need options to address stringent isolation and in-country data residency needs will be able to use AWS Dedicated Local Zones or AWS Outposts to deploy AWS European Sovereign Cloud infrastructure in locations they select. We continue to work with our customers and partners to shape the AWS European Sovereign Cloud, applying learnings from our engagements with European regulators and national cybersecurity authorities.
Continued investment in Europe
Over the last 25 years, we’ve driven economic development through our investment in infrastructure, jobs, and skills in communities and countries across Europe. Since 2010, Amazon has invested more than €150 billion in the EU, and we’re proud to employ more than 150,000 people in permanent roles across the European Single Market.
AWS now plans to invest €7.8 billion in the AWS European Sovereign Cloud by 2040, building on our long-term commitment to Europe and ongoing support of the region’s sovereignty needs. This long-term investment is expected to lead to a ripple effect in the local cloud community through accelerating productivity gains, empowering the digital transformation of businesses, empowering the AWS Partner Network (APN), upskilling the cloud and digital workforce, developing renewable energy projects, and creating a positive impact in the communities where AWS operates. In total, the AWS planned investment is estimated to contribute €17.2 billion to Germany’s total Gross Domestic Product (GDP) through 2040, and support an average 2,800 full-time equivalent jobs in local German businesses each year. These positions, including construction, facility maintenance, engineering, telecommunications, and other jobs within the broader local economy, are part of the AWS data center supply chain.
In addition, AWS is also creating new highly skilled permanent roles to build and operate the AWS European Sovereign Cloud. These jobs will include software engineers, systems developers, and solutions architects. This is part of our commitment that all day-to-day operations of the AWS European Sovereign Cloud will be controlled exclusively by personnel located in the EU, including access to data centers, technical support, and customer service.
In Germany, we also collaborate with local communities on long-term, innovative programs that will have a lasting impact in the areas where our infrastructure is located. This includes developing cloud workforce and education initiatives for learners of all ages, helping to solve for the skills gap and prepare for the tech jobs of the future. For example, last year AWS partnered with Siemens AG to design the first apprenticeship program for AWS data centers in Germany, launched the first national cloud computing certification with the German Chamber of Commerce (DIHK), and established the AWS Skills to Jobs Tech Alliance in Germany. We will work closely with local partners to roll out these skills programs and make sure they are tailored to regional needs.
“High performing, reliable, and secure infrastructure is the most important prerequisite for an increasingly digitalized economy and society. Brandenburg is making progress here. In recent years, we have set on a course to invest in modern and sustainable data center infrastructure in our state, strengthening Brandenburg as a business location. State-of-the-art data centers for secure cloud computing are the basis for a strong digital economy. I am pleased Amazon Web Services (AWS) has chosen Brandenburg for a long-term investment in its cloud computing infrastructure for the AWS European Sovereign Cloud.”
— Brandenburg’s Minister of Economic Affairs, Prof. Dr. Jörg Steinbach
Build confidently with AWS
For customers that are early in their cloud adoption journey and are considering the AWS European Sovereign Cloud, we provide a wide range of resources to help adopt the cloud effectively. From lifting and shifting workloads to migrating entire data centers, customers get the organizational, operational, and technical capabilities needed for a successful migration to AWS. For example, we offer the AWS Cloud Adoption Framework (AWS CAF) to provide best practices for organizations to develop an efficient and effective plan for cloud adoption, and AWS Migration Hub to help assess migration needs, define migration and modernization strategy, and leverage automation. We frequently host AWS events, webinars, and workshops focused on cloud adoption and migration strategies, where customers can learn from AWS experts and connect with other customers and partners.
We’re committed to giving customers more control and more choice to help meet their unique digital sovereignty needs, without compromising on the full power of AWS. The AWS European Sovereign Cloud is a testament to this. To help customers and partners continue to plan and build, we will share additional updates as we drive towards launch. You can discover more about the AWS European Sovereign Cloud on our European Digital Sovereignty website.
AWS European Sovereign Cloud by the end of 2025: AWS plans to invest €7.8 billion
Amazon Web Services (AWS) is convinced that it is essential for customers to have control over their data and choices for how they secure and manage that data in the cloud. Customers can therefore choose flexibly how and where they run their workloads. This includes a long track record of innovation to support specialized workloads around the world. Many customers can already meet their stringent security, sovereignty, and privacy requirements with our AWS Regions under the sovereign-by-design approach. But we also know that there is no one-size-fits-all solution. AWS therefore continues to innovate based on the criteria that matter most to our customers, giving them more choice and more control. Against this background, we announced the AWS European Sovereign Cloud last year. It is a new, independent cloud for Europe, intended to help public sector organizations and customers in highly regulated industries meet evolving digital sovereignty requirements.
Today, we are pleased to share further details about the AWS European Sovereign Cloud roadmap so that our customers and partners can begin their planning. The first Region of the AWS European Sovereign Cloud is planned to launch in Brandenburg by the end of 2025. This offering is available to all AWS customers and is backed by an investment of €7.8 billion in infrastructure, job creation, and skills development.
The AWS European Sovereign Cloud in Brandenburg offers the full power of AWS, with the familiar architecture, the extensive portfolio of services, and the same APIs that millions of customers already know. This means that customers of the AWS European Sovereign Cloud benefit, with full independence, from the familiar advantages of AWS infrastructure, including industry-leading security, availability, performance, and resilience.
AWS customers have access to a broad range of services, including an extensive offering of databases, compute, storage, analytics, machine learning (ML) and artificial intelligence (AI), networking, mobile applications, developer tools, Internet of Things (IoT), security, and enterprise applications. Customers can already build applications in any existing Region today and simply move them to the AWS European Sovereign Cloud once the first AWS Region launches in 2025. The partners in the AWS Partner Network (APN), which comprises more than 130,000 partners, already provide a range of offerings in the existing AWS Regions. In doing so, they help customers meet their requirements and easily deploy applications in the AWS European Sovereign Cloud.
More control, greater choice
Like our existing Regions, the AWS European Sovereign Cloud uses the AWS Nitro System. This is a computing backbone for AWS with security and performance at its core. The specialized hardware and associated firmware are designed to enforce strict restrictions so that nobody, including AWS itself, can access customer workloads or data running on Amazon Elastic Compute Cloud (Amazon EC2) Nitro-based instances. This design has been validated by the NCC Group, an independent cybersecurity firm. The controls that prevent operator access are fundamental to the Nitro System. We have therefore added them to our AWS Service Terms to give all of our customers this additional contractual assurance.
To date, we have launched 33 Regions around the globe with our secure, sovereign-by-design approach. Our customers use AWS because they want to migrate to and build on a secure cloud environment. For customers who need to meet European data residency requirements, AWS offers the option of processing their data in one of our eight existing Regions in Europe: Ireland, Frankfurt, London, Paris, Stockholm, Milan, Zurich, and Spain. This allows them to keep their data securely within Europe.
If customers need to meet additional requirements for operational autonomy and data residency within the European Union, the AWS European Sovereign Cloud is available as a further option. Its infrastructure is located entirely in the EU and is operated independently from the existing Regions. It enables AWS customers to keep their customer content and the metadata they create in the EU, such as roles, permissions, resource labels, and configurations used to run AWS.
If customers need further options to achieve isolation and meet strict in-country data residency requirements, they can use AWS Dedicated Local Zones or AWS Outposts. In this way, they can deploy AWS European Sovereign Cloud infrastructure in the location of their choice. We continue to work with our customers and partners to shape the AWS European Sovereign Cloud so that it meets their requirements. In doing so, we also draw on feedback from our discussions with European regulators and national cybersecurity authorities.
“A functioning, reliable, and secure infrastructure is the most important prerequisite for an increasingly digitalized economy and society. Brandenburg is leading the way here. In recent years, we have set a decisive course to expand investment in modern and sustainable data center infrastructure in our state and thereby strengthen Brandenburg as a business location. State-of-the-art data centers for secure cloud computing are the basis for a digital economy. For our digital sovereignty, it is important that computing services are provided locally in Germany. I am pleased that Amazon Web Services has chosen Brandenburg for a long-term investment in its cloud computing infrastructure for the AWS European Sovereign Cloud.”
Prof. Dr.-Ing. Jörg Steinbach, Brandenburg’s Minister of Economic Affairs
Continued investment in Europe
Over the past 25 years, we have driven economic development in European countries and communities and invested in infrastructure, jobs, and skills development. Since 2010, Amazon has invested more than €150 billion in the European Union, and we are proud to employ more than 150,000 people in permanent roles across the European Single Market.
AWS plans to invest €7.8 billion in the AWS European Sovereign Cloud by 2040. This investment is part of AWS’s long-term efforts to support Europe’s need for digital sovereignty. With this long-term investment, AWS is triggering a multiplier effect for cloud computing in Europe. It will advance the digital transformation of public administration and businesses, strengthen the AWS Partner Network (APN), increase the number of cloud and digital professionals, advance renewable energy projects, and have a positive impact in the communities where AWS operates. In total, the planned AWS investment is expected to contribute €17.2 billion to Germany’s gross domestic product through 2040 and to contribute to the creation of 2,800 full-time jobs at regional businesses. These jobs in construction, facility maintenance, engineering, telecommunications, and the broader regional economy are part of the AWS data center supply chain.
In addition, AWS will create new roles for highly skilled permanent staff, such as software engineers, systems engineers, and solutions architects, to build and operate the AWS European Sovereign Cloud. This investment in additional personnel underscores our commitment that the entire operation of this sovereign cloud environment, from data center access control through technical support to customer service, will be controlled and managed exclusively by personnel within the European Union.
In Germany, AWS also works with local stakeholders on long-term, innovative programs. These are intended to have a lasting positive impact on the communities where the company’s infrastructure is located. AWS focuses on developing cloud professionals and training initiatives for learners of all ages. These measures help address the skills shortage and prepare for the tech jobs of the future. Last year, for example, AWS worked with Siemens AG to develop the first apprenticeship program for AWS data centers in Germany. The company also developed the nationally standardized “Cloud Business Expert” certificate course in cooperation with the German Chamber of Commerce (DIHK) and launched the AWS Skills to Jobs Tech Alliance in Germany. AWS will work with local partners to offer training and continuing education programs tailored to local needs.
Build confidently with AWS
For customers who are still at the beginning of their cloud journey and are considering the AWS European Sovereign Cloud, we offer a wide range of resources to make the move to the cloud effective. Whether individual workloads are to be moved or entire data centers migrated, customers get from us the organizational, operational, and technical capabilities needed for a successful migration to AWS. For example, we offer the AWS Cloud Adoption Framework (AWS CAF), which supports organizations with best practices in developing an efficient and effective cloud adoption plan. AWS Migration Hub also helps assess migration needs, define a migration and modernization strategy, and use automation. In addition, we regularly host AWS events, webinars, and workshops on cloud adoption and migration strategy, where customers can learn from AWS experts and connect with other customers and partners.
We strive to offer our customers more control and more options so that they can meet their individual digital sovereignty requirements without having to give up the full power of AWS.
To support customers and partners in their continued planning and development, we will provide additional updates on an ongoing basis as we work toward the launch of the AWS European Sovereign Cloud. You can learn more about the AWS European Sovereign Cloud on our European Digital Sovereignty website.
As data analytics use cases grow, factors of scalability and concurrency become crucial for businesses. Your analytic solution architecture should be able to handle large data volumes at high concurrency and without compromising speed, thereby delivering a scalable high-performance analytics environment.
Amazon Redshift Serverless provides a fully managed, petabyte-scale, auto scaling cloud data warehouse to support high-concurrency analytics. It offers data analysts, developers, and scientists a fast, flexible analytic environment to gain insights from their data with optimal price-performance. Redshift Serverless auto scales during usage spikes, helping enterprises meet changing business demands cost-effectively. You can benefit from this simplicity without changing your existing analytics and business intelligence (BI) applications.
To help meet demanding performance needs like high concurrency, usage spikes, and fast query response times while optimizing costs, this post proposes using Redshift Serverless. The proposed solution aims to address three key performance requirements:
Support thousands of concurrent connections with high availability by using multiple Redshift Serverless endpoints behind a Network Load Balancer
Accommodate hundreds of concurrent queries with low-latency service level agreements through scalable and distributed workgroups
Enable subsecond response times for short queries against large datasets using the fast query processing of Amazon Redshift
The suggested architecture uses multiple Redshift Serverless endpoints accessed through a single Network Load Balancer client endpoint. The Network Load Balancer evenly distributes incoming requests across workgroups. This improves performance and reduces latency by scaling out resources to meet high throughput and low latency demands.
Solution overview
The following diagram outlines a Redshift Serverless architecture with multiple Amazon Redshift managed VPC endpoints behind a Network Load Balancer.
The following are the main components of this architecture:
Amazon Redshift data sharing – This allows you to securely share live data across Redshift clusters, workgroups, AWS accounts, and AWS Regions without manually moving or copying the data. Users can see up-to-date and consistent information in Amazon Redshift as soon as it’s updated. With Amazon Redshift data sharing, the ingestion can be done at the producer or consumer endpoint, allowing the other consumer endpoints to read and write the same data and thereby enabling horizontal scaling.
Network Load Balancer – This serves as the single point of contact for clients. The load balancer distributes incoming traffic across multiple targets, such as Redshift Serverless managed VPC endpoints. This increases the availability, scalability, and performance of your application. You can add one or more listeners to your load balancer. A listener checks for connection requests from clients, using the protocol and port that you configure, and forwards requests to a target group. A target group routes requests to one or more registered targets, such as Redshift Serverless managed VPC endpoints, using the protocol and the port number that you specify.
VPC – Redshift Serverless is provisioned in a VPC. By creating a Redshift managed VPC endpoint, you enable private access to Redshift Serverless from applications in another VPC. This design allows you to scale by having multiple VPCs as needed. The VPC endpoint provides a dedicated private IP address for each Redshift Serverless workgroup, which is registered as a target in the Network Load Balancer’s target group.
Create an Amazon Redshift managed VPC endpoint
Complete the following steps to create the Amazon Redshift managed VPC endpoint:
On the Redshift Serverless console, choose Workgroup configuration in the navigation pane.
Choose a workgroup from the list.
On the Data access tab, in the Redshift managed VPC endpoints section, choose Create endpoint.
Enter the endpoint name. Create a name that is meaningful for your organization.
The AWS account ID will be populated. This is your 12-digit account ID.
Choose a VPC where the endpoint will be created.
Choose a subnet ID. In the most common use case, this is a subnet where you have a client that you want to connect to your Redshift Serverless instance.
Choose which VPC security groups to add. Each security group acts as a virtual firewall to control inbound and outbound traffic to resources protected by the security group, such as specific virtual desktop instances.
The following screenshot shows an example of this workgroup. Note down the IP address to use during the creation of the target group.
Repeat these steps to create all your Redshift Serverless workgroups.
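The console steps above can also be scripted. The following is a minimal sketch, assuming a client that behaves like boto3.client("redshift-serverless"); the function names, workgroup name, subnet ID, and security group ID are hypothetical placeholders. It creates the endpoint and then reads back the private IP addresses to note down for the target group.

```python
from typing import Any


def create_managed_endpoint(client: Any, workgroup_name: str, endpoint_name: str,
                            subnet_ids: list, security_group_ids: list) -> dict:
    """Create a Redshift managed VPC endpoint for one workgroup.

    `client` is assumed to expose the redshift-serverless API,
    for example boto3.client("redshift-serverless").
    """
    response = client.create_endpoint_access(
        endpointName=endpoint_name,
        workgroupName=workgroup_name,
        subnetIds=subnet_ids,
        vpcSecurityGroupIds=security_group_ids,
    )
    return response["endpoint"]


def endpoint_private_ips(client: Any, endpoint_name: str) -> list:
    """Return the private IPs of the endpoint's network interfaces."""
    endpoint = client.get_endpoint_access(endpointName=endpoint_name)["endpoint"]
    interfaces = endpoint.get("vpcEndpoint", {}).get("networkInterfaces", [])
    return [nic["privateIpAddress"] for nic in interfaces]
```

The IP addresses may only be populated once the endpoint has finished creating, so in practice you would likely call endpoint_private_ips after the endpoint becomes active.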
Add VPC endpoints for the target group for the Network Load Balancer
To add these VPC endpoints to the target group for the Network Load Balancer using Amazon Elastic Compute Cloud (Amazon EC2), complete the following steps:
On the Amazon EC2 console, choose Target groups under Load Balancing in the navigation pane.
Choose Create target group.
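For Choose a target type, select IP addresses to register targets by IP address. The targets in this solution are the private IP addresses of the Redshift managed VPC endpoints that you noted earlier, so the Instances option (which registers targets by instance ID) does not apply.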
For Target group name, enter a name for the target group.
For Protocol, choose TCP or TCP_UDP.
For Port, use 5439 (Amazon Redshift port).
For IP address type, choose IPv4 or IPv6. This option is available only if the target type is Instances or IP addresses and the protocol is TCP or TLS.
You must associate an IPv6 target group with a dual-stack load balancer. All targets in the target group must have the same IP address type. You can’t change the IP address type of a target group after you create it.
For VPC, choose the VPC with the targets to register.
Leave the default selections for the Health checks section, Attributes section, and Tags section.
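As a scripted counterpart to these steps, here is a minimal sketch assuming a client that behaves like boto3.client("elbv2"). The function name and the values passed in are hypothetical; the protocol, port, and target type follow the steps above.

```python
from typing import Any

REDSHIFT_PORT = 5439  # Amazon Redshift default port


def create_redshift_target_group(elbv2: Any, name: str, vpc_id: str,
                                 endpoint_ips: list) -> str:
    """Create a TCP target group and register the endpoint IPs in it.

    `elbv2` is assumed to expose the Elastic Load Balancing v2 API,
    for example boto3.client("elbv2").
    """
    created = elbv2.create_target_group(
        Name=name,
        Protocol="TCP",
        Port=REDSHIFT_PORT,
        VpcId=vpc_id,
        TargetType="ip",          # register targets by IP address
        IpAddressType="ipv4",
    )
    target_group_arn = created["TargetGroups"][0]["TargetGroupArn"]
    # One target per Redshift managed VPC endpoint IP noted earlier.
    elbv2.register_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": ip, "Port": REDSHIFT_PORT} for ip in endpoint_ips],
    )
    return target_group_arn
```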
Create a load balancer
After you create the target group, you can create your load balancer. We recommend using port 5439 (Amazon Redshift default port) for it.
The Network Load Balancer serves as a single-access endpoint and will be used on connections to reach Amazon Redshift. This allows you to add more Redshift Serverless workgroups and increase the concurrency transparently.
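The load balancer and its listener can be sketched the same way. This example assumes a client that behaves like boto3.client("elbv2"); the function name is hypothetical, and the internal scheme is an assumption made for illustration (the post does not say whether the load balancer is internal or internet-facing).

```python
from typing import Any


def create_nlb_with_listener(elbv2: Any, name: str, subnet_ids: list,
                             target_group_arn: str, port: int = 5439) -> str:
    """Create a Network Load Balancer that forwards TCP traffic on `port`
    to the given target group, and return the load balancer DNS name."""
    created = elbv2.create_load_balancer(
        Name=name,
        Subnets=subnet_ids,
        Type="network",
        Scheme="internal",  # assumption: clients connect from inside the VPC
    )
    load_balancer = created["LoadBalancers"][0]
    elbv2.create_listener(
        LoadBalancerArn=load_balancer["LoadBalancerArn"],
        Protocol="TCP",
        Port=port,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )
    # Clients use this DNS name as their single Amazon Redshift host.
    return load_balancer["DNSName"]
```

Adding another workgroup later then only requires registering its endpoint IP in the target group; the DNS name that clients use stays the same.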
Testing the solution
We tested this architecture to run three BI reports with the TPC-DS dataset (cloud benchmark dataset) as our data. Amazon Redshift includes this dataset for free when you choose to load sample data (sample_data_dev database). The installation also provides the queries to test the setup.
Among all the queries from TPC-DS benchmark, we chose the following three to use as our report queries. We changed the first two report queries to use a CREATE TABLE AS SELECT (CTAS) query on temporary tables instead of the WITH clause to emulate options you can see on a typical BI tool. For our testing, we also disabled the result cache to make sure that Amazon Redshift would run the queries every time.
The set of queries contains the creation of temporary tables, a join between those tables, and the cleanup. The cleanup step drops tables. This isn’t needed because they’re deleted at the end of the session, but this aims to simulate all that the BI tool does.
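To make the shape of one report session concrete, here is a small sketch. The SQL below is a hypothetical report written for illustration, not one of the three TPC-DS queries used in the test; only the pattern (result cache disabled, CTAS on temporary tables, a join, then cleanup) comes from the description above.

```python
# Session-level setting that turns off the Amazon Redshift result cache.
DISABLE_RESULT_CACHE = "SET enable_result_cache_for_session TO off;"

# Hypothetical report steps (NOT the authors' queries): two CTAS statements
# on temporary tables, a join between them, and explicit cleanup.
EXAMPLE_REPORT = [
    "CREATE TEMP TABLE tmp_sales AS "
    "SELECT ss_item_sk, SUM(ss_net_paid) AS total_paid "
    "FROM store_sales GROUP BY ss_item_sk;",
    "CREATE TEMP TABLE tmp_items AS SELECT i_item_sk, i_category FROM item;",
    "SELECT i.i_category, SUM(s.total_paid) AS category_total "
    "FROM tmp_sales s JOIN tmp_items i ON s.ss_item_sk = i.i_item_sk "
    "GROUP BY i.i_category;",
    "DROP TABLE tmp_sales;",
    "DROP TABLE tmp_items;",
]


def session_statements(report_steps: list) -> list:
    """Return the statements one session runs, with caching disabled first."""
    return [DISABLE_RESULT_CACHE, *report_steps]
```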
For the tests, we used the following configurations:
Test 1 – A single 96 RPU Redshift Serverless vs. three workgroups at 32 RPU each
Test 2 – A single 48 RPU Redshift Serverless vs. three workgroups at 16 RPU each
We tested three reports by spawning 100 sessions per report (300 total). There were 14 statements across the three reports (4,200 total). All sessions were triggered simultaneously.
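One way to trigger all sessions at the same moment is to hold every worker at a barrier until the full set is ready. The following is a sketch of such a harness, not the tooling used for the test; run_session is a hypothetical callable that would open a connection through the Network Load Balancer and run one report's statements.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_sessions_simultaneously(reports: list, sessions_per_report: int,
                                run_session: Callable) -> list:
    """Run `sessions_per_report` sessions of every report at the same time."""
    jobs = [report for report in reports for _ in range(sessions_per_report)]
    barrier = threading.Barrier(len(jobs))

    def worker(report):
        barrier.wait()  # release every session together
        return run_session(report)

    # One thread per session so that all of them can reach the barrier.
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        return list(pool.map(worker, jobs))
```

With three reports and sessions_per_report=100, this starts the 300 sessions described above.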
The following table summarizes the tables used in the test.
| Table Name | Row Count |
| --- | --- |
| Catalog_page | 93,744 |
| Catalog_sales | 23,064,768 |
| Customer_address | 50,000 |
| Customer | 100,000 |
| Date_dim | 73,049 |
| Item | 144,000 |
| Promotion | 2,400 |
| Store_returns | 4,600,224 |
| Store_sales | 46,086,464 |
| Store | 96 |
| Web_returns | 1,148,208 |
| Web_sales | 11,510,144 |
| Web_site | 240 |
Some tables were modified by ingesting more data than what the TPC-DS schema offers on Amazon Redshift. Data was reinserted into those tables to increase their size.
Test results
The following table summarizes our test results.
TEST 1

| Configuration | Time Consumed | Number of Queries | Cost | Max Scaled RPU | Performance |
| --- | --- | --- | --- | --- | --- |
| Single: 96 RPUs | 0:02:06 | 2,100 | $6 | 279 | Base |
| Parallel: 3x 32 RPUs | 0:01:06 | 2,100 | $1.20 | 96 | 48.03% |
| Parallel 1 (32 RPU) | 0:01:03 | 688 | $0.40 | 32 | 50.10% |
| Parallel 2 (32 RPU) | 0:01:03 | 703 | $0.40 | 32 | 50.13% |
| Parallel 3 (32 RPU) | 0:01:06 | 709 | $0.40 | 32 | 48.03% |

TEST 2

| Configuration | Time Consumed | Number of Queries | Cost | Max Scaled RPU | Performance |
| --- | --- | --- | --- | --- | --- |
| Single: 48 RPUs | 0:01:55 | 2,100 | $3.30 | 168 | Base |
| Parallel: 3x 16 RPUs | 0:01:47 | 2,100 | $1.90 | 96 | 6.77% |
| Parallel 1 (16 RPU) | 0:01:47 | 712 | $0.70 | 36 | 6.77% |
| Parallel 2 (16 RPU) | 0:01:44 | 696 | $0.50 | 25 | 9.13% |
| Parallel 3 (16 RPU) | 0:01:46 | 692 | $0.70 | 35 | 7.79% |
The preceding table shows that the parallel setup was faster than the single workgroup at a lower cost. Also, in our tests, even though the parallel setup in Test 1 had double the base capacity of the parallel setup in Test 2, its cost was still 36% lower and it ran 39% faster. Based on these results, we can conclude that for workloads that have high throughput (I/O), low latency, and high concurrency requirements, this architecture is cost-efficient and performant. Refer to the AWS Pricing Calculator for Network Load Balancer and VPC endpoints pricing.
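The comparison between the two parallel setups can be reproduced from the published times and costs. This is a worked sketch using only the values reported for the Test 1 and Test 2 parallel runs; because those times are rounded to whole seconds, the results land close to, but not exactly on, the quoted percentages, which were presumably computed from unrounded timings.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def pct_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100


# Parallel rows from the results: 3x 32 RPUs (Test 1) and 3x 16 RPUs (Test 2).
test1_parallel = {"time": "0:01:06", "cost": 1.20}
test2_parallel = {"time": "0:01:47", "cost": 1.90}

cost_saving = pct_reduction(test2_parallel["cost"], test1_parallel["cost"])
time_saving = pct_reduction(to_seconds(test2_parallel["time"]),
                            to_seconds(test1_parallel["time"]))

print(f"cost: {cost_saving:.1f}% lower, time: {time_saving:.1f}% lower")
```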
Redshift Serverless automatically scales the capacity to deliver optimal performance during periods of peak workloads including spikes in concurrency of the workload. This is evident from the maximum scaled RPU results in the preceding table.
Recently released features of Redshift Serverless such as MaxRPU and AI-driven scaling were not used for this test. These new features can increase the price-performance of the workload even further.
We recommend enabling cross-zone load balancing on the Network Load Balancer because it distributes requests from clients to registered targets. Enabling cross-zone load balancing will help balance the requests among the Redshift Serverless managed VPC endpoints irrespective of the Availability Zone they are configured in. Also, if the Network Load Balancer receives traffic from only one server (same IP), you should always use an odd number of Redshift Serverless managed VPC endpoints behind the Network Load Balancer.
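Cross-zone load balancing is a load balancer attribute, so it can be enabled with a single API call. A minimal sketch, assuming a client that behaves like boto3.client("elbv2"); the function name is hypothetical.

```python
from typing import Any


def enable_cross_zone(elbv2: Any, load_balancer_arn: str) -> None:
    """Turn on cross-zone load balancing for the given load balancer."""
    elbv2.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=[
            {"Key": "load_balancing.cross_zone.enabled", "Value": "true"},
        ],
    )
```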
Conclusion
In this post, we discussed a scalable architecture that increases the throughput of Redshift Serverless in low latency, high concurrency scenarios. Having multiple Redshift Serverless workgroups behind a Network Load Balancer can deliver a horizontally scalable solution at the best price-performance.
Additionally, Redshift Serverless uses AI techniques (currently in preview) to scale automatically with workload changes across all key dimensions—such as data volume changes, concurrent users, and query complexity—to meet and maintain your price-performance targets.
We hope this post provides you with valuable guidance. We welcome any thoughts or questions in the comments section.
About the Authors
Ricardo Serafim is a Senior Analytics Specialist Solutions Architect at AWS.
Harshida Patel is an Analytics Specialist Principal Solutions Architect at AWS.
Urvish Shah is a Senior Database Engineer at Amazon Redshift. He has more than a decade of experience working on databases, data warehousing, and analytics. Outside of work, he enjoys cooking, travelling, and spending time with his daughter.
Amol Gaikaiwari is a Sr. Redshift Specialist focused on helping customers realize their business outcomes with optimal Redshift price-performance. He loves to simplify data pipelines and enhance capabilities through adoption of the latest Redshift features.
AWS Regions provide fault isolation boundaries that prevent correlated failure and contain the impact from AWS service impairments to a single Region when they occur. You can use these fault boundaries to build multi-Region applications that consist of independent, fault-isolated replicas in each Region that limit shared fate scenarios. This allows you to build multi-Region applications and leverage a spectrum of approaches from backup and restore to pilot light to active/active to implement your multi-Region architecture. However, applications typically don’t operate in isolation; consider both the components you will use and their dependencies as part of your failover strategy. Generally, multiple applications make up what we refer to as a user story, a specific capability offered to an end user, like “posting a picture and caption on a social media app” or “checking out on an e-commerce site”. Because of this, you should develop an organizational multi-Region failover strategy that provides the necessary coordination and consistency to make your approach successful.
Overview
There are four high-level strategies that organizations can pick from to guide a multi-Region approach:
Component-level failover
Individual application failover
Dependency graph failover
Entire application portfolio failover
These strategies move from the most granular to the coarsest approach. Each strategy has tradeoffs and addresses different challenges, including flexibility of failover decision making, testability of the failover combinations, presence of modal behavior, and organizational investment in planning and implementation. By the end of this post, you will be able to identify the pros and cons of each strategy so you can make intentional choices about which you select for your multi-Region failover solution.
Component-level failover
Applications are made up of multiple components, including their infrastructure, code and config, data stores, and dependencies. The component-level failover strategy helps you recover from individual component impairments. This means that when a single component is impaired, the application will fail over to a component hosted in a different Region. Consider the application in Figure 1. When the Amazon Simple Storage Service (Amazon S3) resources used by the application experience elevated error rates or higher latency, the application fails over to use data from an S3 bucket in its secondary Region.
Figure 1. When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region.
This strategy gives the most autonomy and flexibility to individual applications, but has four main tradeoffs:
It adds latency by using resources in a second Region because they are physically further away. This gives the application multiple modes of behavior, lower latency when all components are in one Region, and higher latency when the components are split between Regions. Modal behavior can produce unexpected and undesirable results.
It introduces the possibility for inconsistent data if asynchronous replication is used in the data store.
It typically requires a runtime update of the application’s configuration to switch a component to a different Region, which can be unreliable during a failure scenario.
There are 2N-1 possible configurations (where N is the number of components in the application) of the application, which can make every possible combination in an application difficult to test.
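The modal behavior described in these tradeoffs can be seen in a small sketch of component-level failover for a single read path. The two reader callables are hypothetical stand-ins for, say, an S3 read in the primary Region and one in the secondary Region; this illustrates the pattern and is not a recommended implementation.

```python
from typing import Callable, Tuple


def read_with_regional_fallback(key: str,
                                read_primary: Callable[[str], bytes],
                                read_secondary: Callable[[str], bytes]
                                ) -> Tuple[bytes, str]:
    """Read from the primary Region, falling back to the secondary on error.

    Returns the data and the Region that served it, which makes the two
    modes of behavior (local versus cross-Region) visible to the caller.
    """
    try:
        return read_primary(key), "primary"
    except Exception:
        # Cross-Region mode: higher latency, and possibly stale data if the
        # data store replicates asynchronously.
        return read_secondary(key), "secondary"
```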
Individual application failover
The next strategy allows individual applications to make an autonomous decision to fail over all of its components together, shown in Figure 2. This removes the latency tradeoff from the previous strategy by keeping all of the application components in the same Region. It also significantly reduces the complexity by only having two possible configurations per application. Additionally, applications can be failed over to another Region without updating their configuration by using approaches like Amazon Route 53 DNS failover, removing the unreliability of runtime configuration updates.
Figure 2. Application 3 experiences an impairment and fails over to the secondary Region
However, allowing individual applications to make their own failover decision can introduce the same modal behavior we saw with component-level failover, just in a different dimension. In the worst case, 50% of the applications in a user story could fail over while 50% don’t, meaning every application interaction could be a cross-Region request, shown in Figure 3.
Figure 3. The worst-case scenario of allowing applications to make failover decisions independently
Additionally, while this approach removes the complexity of the component failover approach, it still exhibits a level of similar complexity, albeit smaller, by having 2N-1 combinations of application locations across Regions, also making this approach difficult to test and coordinate.
Dependency graph failover
To solve the complexity of the previous strategy, you might decide to coordinate failover of all applications that support a user story as a single unit. We call this a dependency graph and it ensures that all applications that interact with each other will always be in the same Region, as shown in Figure 4.
Figure 4. A dependency graph of applications that all support user story “A”
While this solves the previous latency, modal behavior, and complexity tradeoffs, it comes with its own challenges. In a portfolio with multiple user stories and applications, this graph can be very large and discovering each dependency, especially infrequently used ones, can be difficult. In fact, seemingly unrelated dependency graphs can be connected by a single vertex that is shared between them, as shown in Figure 5.
Figure 5. Two unrelated user stories share a dependency on Application 4, requiring both dependency graphs to fail over if either experiences an impairment
For example, if every user story you provide depends on a single authentication and authorization system, when one graph of applications needs to fail over, then so does the entire authorization system. In turn, every other user story that depends on that authorization system needs to fail over as well. To mitigate this, you might implement independent replicas of these types of applications in each Region, if possible, to remove edges from the dependency graph.
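Finding which user stories must fail over together amounts to finding connected components in the dependency graph. The following sketch assumes a simple mapping from user story to the applications it uses; the story and application names are hypothetical. Applications that have independent replicas in each Region are ignored when linking stories, which models removing those edges from the graph.

```python
from collections import defaultdict


def failover_units(story_apps: dict, regional_replicas: frozenset = frozenset()) -> list:
    """Group user stories that share an application into failover units."""
    app_to_stories = defaultdict(set)
    for story, apps in story_apps.items():
        for app in apps:
            if app not in regional_replicas:
                app_to_stories[app].add(story)

    # Two stories are neighbors when they share at least one application.
    neighbors = {story: set() for story in story_apps}
    for stories in app_to_stories.values():
        for story in stories:
            neighbors[story] |= stories - {story}

    units, seen = [], set()
    for story in story_apps:
        if story in seen:
            continue
        unit, stack = set(), [story]
        while stack:  # depth-first walk over linked stories
            current = stack.pop()
            if current in unit:
                continue
            unit.add(current)
            stack.extend(neighbors[current] - unit)
        seen |= unit
        units.append(unit)
    return units


stories = {"A": {"app1", "app2", "app4"}, "B": {"app4", "app5"}}
print(failover_units(stories))                                  # one shared unit
print(failover_units(stories, regional_replicas=frozenset({"app4"})))  # two units
```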
Entire portfolio failover
The final strategy is failing over an entire application portfolio, whether or not applications are impacted or have any interaction with those that are, as shown in Figure 6. This strategy helps remove the operational burden of creating and maintaining dependency graphs for every user story your business supports.
Figure 6. Every user story fails over together regardless of observed impact from a failure
The major tradeoff is the organizational investment to create multi-Region capabilities for every application – you might not have made that broad investment in the other strategies. You can make this strategy slightly more granular by implementing it for specific application tiers, for example, failing over all tier-1 applications together, as long as you know there aren’t dependencies across applications of different criticality.
You can also combine this approach with the second strategy. Let individual applications make failover decisions until you see broad enough impact, or impact from the modal behavior, that you decide to make all applications fail over to your secondary Region to mitigate the effects.
Conclusion
This blog post has looked at four different high-level approaches for creating an organizational multi-Region failover strategy.
Each strategy optimizes for different outcomes. Component-level failover gives you the highest degree of flexibility without organizational capabilities or coordination, but introduces the most complexity and bimodal behavior. Individual application failover optimizes for less complexity in failover combinations than component-level while still maintaining decentralized flexibility in failover decision making. Dependency graph failover optimizes for only needing to fail over the minimum set of applications to support a capability, which removes the presence of modal behavior while requiring more organizational investment to do so. Finally, portfolio failover optimizes for not needing to maintain dependency graphs, but requires significant additional investment to build a multi-Region capability for every application.
Creating the strategy can be an iterative journey. You might start with allowing individual applications to make failover decisions while you build toward a future state of managing failover of independent dependency graphs. For more information on creating multi-Region architectures, see AWS Multi-Region Fundamentals and Disaster Recovery of Workloads on AWS.
Generative artificial intelligence (generative AI) is a type of AI used to generate content, including conversations, images, videos, and music. Generative AI can be used directly to build customer-facing features (a chatbot or an image generator), or it can serve as an underlying component in a more complex system. For example, it can generate embeddings (or compressed representations) or any other artifact necessary to improve downstream machine learning (ML) models or back-end services.
With the advent of generative AI, it’s fundamental to understand what it is, how it works under the hood, and which options are available for putting it into production. In some cases, it can also be helpful to move closer to the underlying model in order to fine tune or drive domain-specific improvements. With this edition of Let’s Architect!, we’ll cover these topics and share an initial set of methodologies to put generative AI into production. We’ll start with a broad introduction to the domain and then share a mix of videos, blogs, and hands-on workshops.
Many teams are turning to open source tools running on Kubernetes to help accelerate their ML and generative AI journeys. In this video session, experts discuss why Kubernetes is ideal for ML, then tackle challenges like dependency management and security. You will learn how tools like Ray, JupyterHub, Argo Workflows, and Karpenter can accelerate your path to building and deploying generative AI applications on Amazon Elastic Kubernetes Service (Amazon EKS). A real-world example showcases how Adobe leveraged Amazon EKS to achieve faster time-to-market and reduced costs. You will also be introduced to Data on EKS, a new AWS project offering best practices for deploying various data workloads on Amazon EKS.
This video session aims to provide an in-depth exploration of the emerging concepts in generative AI. By delving into practical applications and detailing best practices for implementation, the session offers a concrete understanding that empowers businesses to harness the full potential of these technologies. You can gain valuable insights into navigating the complexities of generative AI, equipping you with the knowledge and strategies necessary to stay ahead of the curve and capitalize on the transformative power of these new methods. If you want to dive even deeper, check this generative AI best practices post.
Working with AI/ML workloads and generative AI in a production environment requires appropriate system design and careful considerations for tenant separation in the context of SaaS. You’ll need to think about how the different tenants are mapped to models, how inferencing is scaled, how solutions are integrated with other upstream/downstream services, and how large language models (LLMs) can be fine-tuned to meet tenant-specific needs.
This video drills down into the concept of multi-tenancy for AI/ML workloads, including the common design, performance, isolation, and experience challenges that you can find during your journey. You will also become familiar with concepts like RAG (used to enrich the LLMs with contextual information) and fine tuning through practical examples.
DevOps Research and Assessment (DORA) metrics, which measure critical DevOps performance indicators like lead time, are essential to engineering practices, as shown in the Accelerate book’s research. By leveraging generative AI technology, the zAdviser Enterprise platform can now offer in-depth insights and actionable recommendations to help organizations optimize their DevOps practices and drive continuous improvement. This blog demonstrates how generative AI can go beyond language or image generation, applying to a wide spectrum of domains.
Figure 4. Generative AI is used to provide summarization, analysis, and recommendations for improvement based on the DORA metrics.
Hands-on Generative AI: AWS workshops
Getting hands on is often the best way to understand how everything works in practice and create the mental model to connect theoretical foundations with some real-world applications.
Generative AI on Amazon SageMaker shows how you can build, train, and deploy generative AI models. You can learn about options to fine-tune, use out-of-the-box existing models, or even customize the existing open source models based on your needs.
Building with Amazon Bedrock and LangChain demonstrates how an existing fully managed service provided by AWS can be used when you work with foundation models, covering a wide variety of use cases. Also, if you want a quick guide for prompt engineering, you can check out the PartyRock lab in the workshop.
Figure 5. An image replacement example that you can find in the workshop.
See you next time!
Thanks for reading! We hope you got some insight into the applications of generative AI and discovered new strategies for using it. In the next blog, we will dive deeper into machine learning.
To revisit any of our previous posts or explore the entire series, visit the Let’s Architect! page.
Our customers depend on Amazon Web Services (AWS) for their mission-critical applications and most sensitive data. Every day, the world’s fastest-growing startups, largest enterprises, and most trusted governmental organizations are choosing AWS as the place to run their technology infrastructure. They choose us because security has been our top priority from day one. We designed AWS from its foundation to be the most secure way for our customers to run their workloads, and we’ve built our internal culture around security as a business imperative.
While technical security measures are important, organizations are made up of people. A recent report from the Cyber Safety Review Board (CSRB) makes it clear that a deficient security culture can be a root cause for avoidable errors that allow intrusions to succeed and remain undetected.
Security is our top priority
Our security culture starts at the top, and it extends through every part of our organization. Over eight years ago, we made the decision for our security team to report directly to our CEO. This structural design redefined how we build security into the culture of AWS and informs everyone at the company that security is our top priority by providing direct visibility to senior leadership. We empower our service teams to fully own the security of their services and scale security best practices and programs so our customers have the confidence to innovate on AWS.
We believe that there are four key principles to building a strong culture of security:
Security is built into our organizational structure
At AWS, we view security as a core function of our business, deeply connected to our mission objectives. This goes beyond good intentions—it’s embedded directly into our organizational structure. At Amazon, we make an intentional choice for all our security teams to report directly to the CEO while also being deeply embedded in our respective business units. The goal is to build security into the structural fabric of how we make decisions. Every week, the AWS leadership team, led by our CEO, meets with my team to discuss security and ensure we’re making the right choices on tactical and strategic security issues and course-correcting when needed. We report internally on operational metrics that tie our security culture to the impact that it has on our customers, connecting data to business outcomes and providing an opportunity for leadership to engage and ask questions. This support for security from the top levels of executive leadership helps us reinforce the idea that security is accelerating our business outcomes and improving our customers’ experiences rather than acting as a roadblock.
Security is everyone’s job
AWS operates with a strong ownership model built around our culture of security. Ownership is one of our key Leadership Principles at Amazon. Employees in every role receive regular training and reinforcement of the message that security is everyone’s job. Every service and product team is fully responsible for the security of the service or capability that they deliver. Security is built into every product roadmap, engineering plan, and weekly stand-up meeting, just as much as capabilities, performance, cost, and other core responsibilities of the builder team. The best security is not something that can be “bolted on” at the end of a process or on the outside of a system; rather, security is integral and foundational.
AWS business leaders prioritize building products and services that are designed to be secure. At the same time, they strive to create an environment that encourages employees to identify and escalate potential security concerns even when uncertain about whether there is an actual issue. Escalation is a normal part of how we work in AWS, and our practice of escalation provides a “security reporting safe space” to everyone. Our teams and individuals are encouraged to report and escalate any possible security issues or concerns with a high-priority ticket to the security team. We would much rather hear about a possible security concern and investigate it, regardless of whether it is unlikely or not. Our employees know that we welcome reports even for things that turn out to be nonissues.
Distributing security expertise and ownership across AWS
Our central AWS Security team provides a number of critical capabilities and services that support and enable our engineering and service teams to fulfill their security responsibilities effectively. Our central team provides training, consultation, threat-modeling tools, automated code-scanning frameworks and tools, design reviews, penetration testing, automated API test frameworks, and—in the end—a final security review of each new service or new feature. The security reviewer is empowered to make a go or no-go decision with respect to each and every release. If a service or feature does not pass the security review process in the first review, we dive deep to understand why so we can improve processes and catch issues earlier in development. But, releasing something that’s not ready would be an even bigger failure, so we err on the side of maintaining our high security bar and always trying to deliver to the high standards that our customers expect and rely on.
One important mechanism to distribute security ownership that we’ve developed over the years is the Security Guardians program. The Security Guardians program trains, develops, and empowers service team developers in each two-pizza team to be security ambassadors, or Guardians, within the product teams. At a high level, Guardians are the “security conscience” of each team. They make sure that security considerations for a product are made earlier and more often, helping their peers build and ship their product faster, while working closely with the central security team to help ensure the security bar remains high at AWS. Security Guardians feel empowered by being part of a cross-organizational community while also playing a critical role for the team and for AWS as a whole.
Scaling security through innovation
Another way we scale security across our culture at AWS is through innovation. We innovate to build tools and processes to help all of our people be as effective as possible and maintain focus. We use artificial intelligence (AI) to accelerate our secure software development process, as well as new generative AI–powered features in Amazon Inspector, Amazon Detective, AWS Config, and Amazon CodeWhisperer that complement the human skillset by helping people make better security decisions, using a broader collection of knowledge. This pattern of combining sophisticated tooling with skilled engineers is highly effective because it positions people to make the nuanced decisions required for effective security.
For large organizations, it can take years to assess every scenario and prove systems are secure. Even then, their systems are constantly changing. Our automated reasoning tools use mathematical logic to answer critical questions about infrastructure to detect misconfigurations that could potentially expose data. This provable security provides higher assurance in the security of the cloud and in the cloud. We apply automated reasoning in key service areas such as storage, networking, virtualization, identity, and cryptography. Amazon scientists and engineers also use automated reasoning to prove the correctness of critical internal systems. We process over a billion mathematical queries per day that power AWS Identity and Access Management Access Analyzer, Amazon Simple Storage Service (Amazon S3) Block Public Access, and other security offerings. AWS is the first and only cloud provider to use automated reasoning at this scale.
Advancing the future of cloud security
At AWS, we care deeply about our culture of security. We’re consistently working backwards from our customers and investing in raising the bar on our security tools and capabilities. For example, AWS enables encryption of everything. AWS Key Management Service (AWS KMS) is the first and only highly scalable, cloud-native key management system that is also FIPS 140-2 Level 3 certified. No one can retrieve customer plaintext keys, not even the most privileged admins within AWS. With the AWS Nitro System, which is the foundation of the AWS compute service Amazon Elastic Compute Cloud (Amazon EC2), we designed and delivered a first-of-its-kind innovation, still unique in the industry, to maximize the security of customers’ workloads. The Nitro System provides industry-leading privacy and isolation for all their compute needs, including GPU-based computing for the latest generative AI systems. No one, not even the most privileged admins within AWS, can access a customer’s workloads or data in Nitro-based EC2 instances.
We continue to innovate on behalf of our customers so they can move quickly, securely, and with confidence to enable their businesses, and our track record in the area of cloud security is second to none. That said, cybersecurity challenges continue to evolve, and while we’re proud of our achievements to date, we’re committed to constant improvement as we innovate and advance our technologies and our culture of security.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.