Optimize cost and performance for Amazon MWAA

Post Syndicated from Sriharsh Adari original https://aws.amazon.com/blogs/big-data/optimize-cost-and-performance-for-amazon-mwaa/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed service for Apache Airflow that allows you to orchestrate data pipelines and workflows at scale. With Amazon MWAA, you can design Directed Acyclic Graphs (DAGs) that describe your workflows without managing the operational burden of scaling the infrastructure. In this post, we provide guidance on how you can optimize performance and save cost by following best practices.

Amazon MWAA environments include four Airflow components hosted on groups of AWS compute resources: the scheduler that schedules the work, the workers that implement the work, the web server that provides the UI, and the metadata database that keeps track of state. For intermittent or varying workloads, optimizing costs while maintaining price and performance is crucial. This post outlines best practices to achieve cost optimization and efficient performance in Amazon MWAA environments, with detailed explanations and examples. It may not be necessary to apply all of these best practices for a given Amazon MWAA workload; you can selectively choose and implement relevant and applicable principles for your specific workloads.

Right-sizing your Amazon MWAA environment

Right-sizing your Amazon MWAA environment makes sure you have an environment that is able to concurrently scale across your different workloads to provide the best price-performance. The environment class you choose for your Amazon MWAA environment determines the size and the number of concurrent tasks supported by the worker nodes. In Amazon MWAA, you can choose from five different environment classes. In this section, we discuss the steps you can follow to right-size your Amazon MWAA environment.

Monitor resource utilization

The first step in right-sizing your Amazon MWAA environment is to monitor the resource utilization of your existing setup. You can monitor the underlying components of your environments using Amazon CloudWatch, which collects raw data and processes data into readable, near real-time metrics. With these environment metrics, you have greater visibility into key performance indicators to help you appropriately size your environments and debug issues with your workflows. Based on the concurrent tasks needed for your workload, you can adjust the environment size as well as the maximum and minimum workers needed. CloudWatch will provide CPU and memory utilization for all the underlying AWS services utilize by Amazon MWAA. Refer to Container, queue, and database metrics for Amazon MWAA for additional details on available metrics for Amazon MWAA. These metrics also include the number of base workers, additional workers, schedulers, and web servers.

Analyze your workload patterns

Next, take a deep dive into your workflow patterns. Examine DAG schedules, task concurrency, and task runtimes. Monitor CPU/memory usage during peak periods. Query CloudWatch metrics and Airflow logs. Identify long-running tasks, bottlenecks, and resource-intensive operations for optimal environment sizing. Understanding the resource demands of your workload will help you make informed decisions about the appropriate Amazon MWAA environment class to use.

Choose the right environment class

Match requirements to Amazon MWAA environment class specifications (mw1.small to mw1.2xlarge) that can handle your workload efficiently. You can vertically scale up or scale down an existing environment through an API, the AWS Command Line Interface (AWS CLI), or the AWS Management Console. Be aware that a change in the environment class requires a scheduled downtime.

Fine tune configuration parameters

Fine-tuning configuration parameters in Apache Airflow is crucial for optimizing workflow performance and cost reductions. It allows you to tune settings such as Auto scaling, parallelism, logging, and DAG code optimizations.

Auto scaling

Amazon MWAA supports worker auto scaling, which automatically adjusts the number of running worker and web server nodes based on your workload demands. You can specify the minimum and maximum number of Airflow workers that run in your environment. For worker node auto scaling, Amazon MWAA uses RunningTasks and QueuedTasks metrics, where (tasks running + tasks queued) / (tasks per worker) = (required workers). If the required number of workers is greater than the current number of running workers, Amazon MWAA will add additional worker instances using AWS Fargate, up to the maximum value specified by the maximum worker configuration.

Auto scaling in Amazon MWAA will gracefully downscale when there are more additional workers than required. For example, let’s assume a large Amazon MWAA environment with a minimum of 1 worker and a maximum of 10, where each large Amazon MWAA worker can support up to 20 tasks. Let’s say, each day at 8:00 AM, DAGs start up that use 190 concurrent tasks. Amazon MWAA will automatically scale to 10 workers, because the required workers = 190 requested tasks (some running, some queued) / 20 (tasks per worker) = 9.5 workers, rounded up to 10. At 10:00 AM, half of the tasks complete, leaving 85 running. Amazon MWAA will then downscale to 6 workers (95 tasks/20 tasks per worker = 5.25 workers, rounded up to 6). Any workers that are still running tasks remain protected during downscaling until they’re complete, and no tasks will be interrupted. As the queued and running tasks decrease, Amazon MWAA will remove workers without affecting running tasks, down to the minimum specified worker count.

Web server auto scaling in Amazon MWAA allows you to automatically scale the number of web servers based on CPU utilization and active connection count. Amazon MWAA makes sure your Airflow environment can seamlessly accommodate increased demand, whether from REST API requests, AWS CLI usage, or more concurrent Airflow UI users. You can specify the maximum and minimum web server count while configuring your Amazon MWAA environment.

Logging and metrics

In this section, we discuss the steps to select and set the appropriate log configurations and CloudWatch metrics.

Choose the right log levels

If enabled, Amazon MWAA will send Airflow logs to CloudWatch. You can view the logs to determine Airflow task delays or workflow errors without the need for additional third-party tools. You need to enable logging to view Airflow DAG processing, tasks, scheduler, web server, and worker logs. You can enable Airflow logs at the INFO, WARNING, ERROR, or CRITICAL level. When you choose a log level, Amazon MWAA sends logs for that level and higher levels of severity. Standard CloudWatch logs charges apply, so reducing log levels where possible can reduce overall costs. Use the most appropriate log level based on environment, such as INFO for dev and UAT, and ERROR for production.

Set appropriate log retention policy

By default, logs are kept indefinitely and never expire. To reduce CloudWatch cost, you can adjust the retention policy for each log group.

Choose required CloudWatch metrics

You can choose which Airflow metrics are sent to CloudWatch by using the Amazon MWAA configuration option metrics.statsd_allow_list. Refer to the complete list of available metrics. Some metrics such as schedule_delay and duration_success are published per DAG, whereas others such as ti.finish are published per task per DAG.

Therefore, the cumulative number of DAGs and tasks directly influence your CloudWatch metric ingestion costs. To control CloudWatch costs, choose to publish selective metrics. For example, the following will only publish metrics that start with scheduler and executor:

metrics.statsd_allow_list = scheduler,executor

We recommend using metrics.statsd_allow_list with metrics.metrics_use_pattern_match.

An effective practice is to utilize regular expression (regex) pattern matching against the entire metric name instead of only matching the prefix at the beginning of the name.

Monitor CloudWatch dashboards and set up alarms

Create a custom dashboard in CloudWatch and add alarms for a particular metric to monitor the health status of your Amazon MWAA environment. Configuring alarms allows you to proactively monitor the health of the environment.

Optimize AWS Secrets Manager invocations

Airflow has a mechanism to store secrets such as variables and connection information. By default, these secrets are stored in the Airflow meta database. Airflow users can optionally configure a centrally managed location for secrets, such as AWS Secrets Manager. When specified, Airflow will first check this alternate secrets backend when a connection or variable is requested. If the alternate backend contains the needed value, it is returned; if not, Airflow will check the meta database for the value and return that instead. One of the factors affecting the cost to use Secrets Manager is the number of API calls made to it.

On the Amazon MWAA console, you can configure the backend Secrets Manager path for the connections and variables that will be used by Airflow. By default, Airflow searches for all connections and variables in the configured backend. To reduce the number of API calls Amazon MWAA makes to Secrets Manager on your behalf, configure it to use a lookup pattern. By specifying a pattern, you narrow the possible paths that Airflow will look at. This will help in lowering your costs when using Secrets Manager with Amazon MWAA.

To use a secrets cache, enable AIRFLOW_SECRETS_USE_CACHE with TTL to help to reduce the Secrets Manager API calls.

For example, if you want to only look up a specific subset of connections, variables, or config in Secrets Manager, set the relevant *_lookup_pattern parameter. This parameter takes a regex as a string as value. To lookup connections starting with m in Secrets Manager, your configuration file should look like the following code:

[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs =

{
  "connections_prefix": "airflow/connections",
  "connections_lookup_pattern": "^m",
  "profile_name": "default"
}

DAG code optimization

Schedulers and workers are two components that are involved in parsing the DAG. After the scheduler parses the DAG and places it in a queue, the worker picks up the DAG from the queue. At the point, all the worker knows is the DAG_id and the Python file, along with some other info. The worker has to parse the Python file in order to run the task.

DAG parsing is run twice, once by the scheduler and then by the worker. Because the workers are also parsing the DAG, the amount of time it takes for the code to parse dictates the number of workers needed, which adds cost of running those workers.

For example, for a total of 200 DAGs having 10 tasks each, taking 60 seconds per task to parse, we can calculate the following:

  • Total tasks across all DAGs = 2,000
  • Time per task = 60 seconds + 20 seconds (parse DAG)
  • Total time = 2000 * 80 = 160,000 seconds
  • Total time per worker = 72,000 seconds
  • Number of workers needs = Total time/Total time per worker = 160,000/72,000 = ~3

Now, let’s increase the time taken to parse the DAGs to 100 seconds:

  • Total tasks across all DAGs = 2,000
  • Time per task = 60 seconds + 100 seconds
  • Total time = 2,000 *160 = 320,000 seconds
  • Total time per worker = 72,000 seconds
  • Number of workers needs = Total time/Total time per worker = 320,000/72,000 = ~5

As you can see, when the DAG parsing time increased from 20 seconds to 100 seconds, the number of worker nodes needed increased from 3 to 5, thereby adding compute cost.

To reduce the time it takes for parsing the code, follow the best practices in the subsequent sections.

Remove top-level imports

Code imports will run every time the DAG is parsed. If you don’t need the libraries being imported to create the DAG objects, move the import to the task level instead of defining it at the top. After it’s defined in the task, the import will be called only when the task is run.

Avoid multiple calls to databases like the meta database or external system database. Variables are used within the DAG that are defined in the meta database or a backend system like Secrets Manager. Use templating (Jinja) wherein calls to populate the variables are only made at task runtime and not at task parsing time.

For example, see the following code:

import pendulum
from airflow import DAG
from airflow.decorators import task
import numpy as np  # <-- DON'T DO THAT!

with DAG(
    dag_id="example_python_operator",
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example"],
) as dag:

    @task()
    def print_array():
        """Print Numpy array."""
        import numpy as np  # <-- INSTEAD DO THIS!
        a = np.arange(15).reshape(3, 5)
        print(a)
        return a
    print_array()

The following code is another example:

# Bad example
from airflow.models import Variable

foo_var = Variable.get("foo")  # DON'T DO THAT

bash_use_variable_bad_1 = BashOperator(
    task_id="bash_use_variable_bad_1", bash_command="echo variable foo=${foo_env}", env={"foo_env": foo_var}
)

bash_use_variable_bad_2 = BashOperator(
    task_id="bash_use_variable_bad_2",
    bash_command=f"echo variable foo=${Variable.get('foo')}",  # DON'T DO THAT
)

bash_use_variable_bad_3 = BashOperator(
    task_id="bash_use_variable_bad_3",
    bash_command="echo variable foo=${foo_env}",
    env={"foo_env": Variable.get("foo")},  # DON'T DO THAT
)

# Good example
bash_use_variable_good = BashOperator(
    task_id="bash_use_variable_good",
    bash_command="echo variable foo=${foo_env}",
    env={"foo_env": "{{ var.value.get('foo') }}"},
)

@task
def my_task():
    var = Variable.get("foo")  # this is fine, because func my_task called only run task, not scan DAGs.
print(var)

Writing DAGs

Complex DAGs with a large number of tasks and dependencies between them can impact performance of scheduling. One way to keep your Airflow instance performant and well utilized is to simplify and optimize your DAGs.

For example, a DAG that has simple linear structure A → B → C will experience less delays in task scheduling than a DAG that has a deeply nested tree structure with an exponentially growing number of dependent tasks.

Dynamic DAGs

In the following example, a DAG is defined with hardcoded table names from a database. A developer has to define N number of DAGs for N number of tables in a database.

# Bad example
dag_params = getData()
no_of_dags = int(dag_params["no_of_dags"]['N'])
# build a dag for each number in no_of_dags
for n in range(no_of_dags):
    dag_id = 'dynperf_t1_{}'.format(str(n))
default_args = {'owner': 'airflow','start_date': datetime(2022, 2, 2, 12, n)}

To reduce verbose and error-prone work, use dynamic DAGs. The following definition of the DAG is created after querying a database catalog, and creates as many DAGs dynamically as there are tables in the database. This achieves the same objective with less code.

def getData():
    client = boto3.client('dynamodb’)
    response = client.get_item(
        TableName="mwaa-dag-creation",
        Key={'key': {'S': 'mwaa’}}
    )
    return response["Item"]

Stagger DAG schedules

Running all DAGs simultaneously or within a short interval in your environment can result in a higher number of worker nodes required to process the tasks, thereby increasing compute costs. For business scenarios where the workload is not time-sensitive, consider spreading the schedule of DAG runs in a way that maximizes the utilization of available worker resources.

DAG folder parsing

Simpler DAGs are usually only in a single Python file; more complex DAGs might be spread across multiple files and have dependencies that should be shipped with them. You can either do this all inside of the DAG_FOLDER , with a standard filesystem layout, or you can package the DAG and all of its Python files up as a single .zip file. Airflow will look into all the directories and files in the DAG_FOLDER. Using the .airflowignore file specifies which directories or files Airflow should intentionally ignore. This will increase the efficiency of finding a DAG within a directory, improving parsing times.

Deferrable operators

You can run deferrable operators on Amazon MWAA. Deferrable operators have the ability to suspend themselves and free up the worker slot. No tasks in the worker means fewer required worker resources, which can lower the worker cost.

For example, let’s assume you’re using a large number of sensors that wait for something to occur and occupy worker node slots. By making the sensors deferrable and using worker auto scaling improvements to aggressively downscale workers, you will immediately see an impact where fewer worker nodes are needed, saving on worker node costs.

Dynamic Task Mapping

Dynamic Task Mapping allows a way for a workflow to create a number of tasks at runtime based on current data, rather than the DAG author having to know in advance how many tasks would be needed. This is similar to defining your tasks in a for loop, but instead of having the DAG file fetch the data and do that itself, the scheduler can do this based on the output of a previous task. Right before a mapped task is run, the scheduler will create N copies of the task, one for each input.

Stop and start the environment

You can stop and start your Amazon MWAA environment based on your workload requirements, which will result in cost savings. You can perform the action manually or automate stopping and starting Amazon MWAA environments. Refer to Automating stopping and starting Amazon MWAA environments to reduce cost to learn how to automate the stop and start of your Amazon MWAA environment retaining metadata.

Conclusion

In conclusion, implementing performance optimization best practices for Amazon MWAA can significantly reduce overall costs while maintaining optimal performance and reliability. Key strategies include right-sizing environment classes based on CloudWatch metrics, managing logging and monitoring costs, using lookup patterns with Secrets Manager, optimizing DAG code, and selectively stopping and starting environments based on workload demands. Continuously monitoring and adjusting these settings as workloads evolve can maximize your cost-efficiency.


About the Authors

Sriharsh Adari is a Senior Solutions Architect at AWS, where he helps customers work backward from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core area of expertise includes technology strategy, data analytics, and data science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing Tabla.

Retina Satish is a Solutions Architect at AWS, bringing her expertise in data analytics and generative AI. She collaborates with customers to understand business challenges and architect innovative, data-driven solutions using cutting-edge technologies. She is dedicated to delivering secure, scalable, and cost-effective solutions that drive digital transformation.

Jeetendra Vaidya is a Senior Solutions Architect at AWS, bringing his expertise to the realms of AI/ML, serverless, and data analytics domains. He is passionate about assisting customers in architecting secure, scalable, reliable, and cost-effective solutions.

Optimize your workloads with Amazon Redshift Serverless AI-driven scaling and optimization

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/optimize-your-workloads-with-amazon-redshift-serverless-ai-driven-scaling-and-optimization/

The current scaling approach of Amazon Redshift Serverless increases your compute capacity based on the query queue time and scales down when the queuing reduces on the data warehouse. However, you might need to automatically scale compute resources based on factors like query complexity and data volume to meet price-performance targets, irrespective of query queuing. To address this requirement, Redshift Serverless launched the artificial intelligence (AI)-driven scaling and optimization feature, which scales the compute not only based on the queuing, but also factoring data volume and query complexity.

In this post, we describe how Redshift Serverless utilizes the new AI-driven scaling and optimization capabilities to address common use cases. This post also includes example SQLs, which you can run on your own Redshift Serverless data warehouse to experience the benefits of this feature.

Solution overview

The AI-powered scaling and optimization feature in Redshift Serverless provides a user-friendly visual slider to set your desired balance between price and performance. By moving the slider, you can choose between optimized for cost, balanced performance and cost, or optimized for performance. Based on where you position the slider, Amazon Redshift will automatically add or remove resources to ensure better behavior and perform other AI-driven optimizations like automatic materialized views and automatic table design optimization to meet your selected price-performance target.

Price Performance Slider

The slider offers the following options:

  • Optimized for cost – Prioritizes cost savings. Redshift attempts to automatically scale up compute capacity when doing so and doesn’t incur additional charges. And it will also attempt to scale down compute for lower cost, despite longer runtime.
  • Balanced – Offers balance between performance and cost. Redshift scales for performance with a moderate cost increase.
  • Optimized for performance – Prioritizes performance. Redshift scales aggressively for maximum performance, potentially incurring higher costs.

In the following sections, we illustrate how the AI-driven scaling and optimization feature can intelligently predict your workload compute needs and scale proactively for three scenarios:

  • Use case 1 – A long-running complex query. Compute scales based on query complexity.
  • Use case 2 – A sudden spike in ingestion volume (a three-fold increase, from 720 million to 2.1 billion). Compute scales based on data volume.
  • Use case 3 – A data lake query scanning large datasets (TBs). Compute scales based on the expected data to be scanned from the data lake. The expected data scan is predicted by machine learning (ML) models based on prior historical run statistics.

In the existing auto scaling mechanism, the use cases don’t increase compute capacity automatically unless queuing is identified across the instance.

Prerequisites

To follow along, complete the following prerequisites:

  1. Create a Redshift Serverless workgroup in preview mode. For instructions, see Creating a preview workgroup.
  2. While creating the preview group, choose Performance and Cost Controls and Price-performance target, and adjust the slider to Optimized for performance. For more information, refer to Amazon Redshift adds new AI capabilities, including Amazon Q, to boost efficiency and productivity.
  3. Set up an AWS Identity and Access Management (IAM) role as the default IAM role. Refer to Managing IAM roles created for a cluster using the console for instructions.
  4. We use TPC-DS 1TB Cloud Data Warehouse Benchmark data to demonstrate this feature. Run the SQL statements to create tables and load the TPC-DS 1TB data.

Use case 1: Scale compute based on query complexity

The following query analyzes product sales across multiple channels such as websites, wholesale, and retail stores. This complex query typically takes about 25 minutes to run with the default 128 RPUs. Let’s run this workload on the preview workgroup created as part of prerequisites.

When a query is run for the first time, the AI scaling system may make a suboptimal decision regarding resource allocation or scaling as the system is still learning the query and data characteristics. However, the system learns from this experience, and when the same query is run again, it can make a more optimal scaling decision. Therefore, if the query didn’t scale during the first run, it is recommended to rerun the query. You can monitor the RPU capacity used on the Redshift Serverless console or by querying the SYS_SERVERLSS_USAGE system view.

The results cache is turned off in the following queries to avoid fetching results from the cache.

SET enable_result_cache_for_session TO off;
with /* TPC-DS demo query */
    ws as
    (select d_year AS ws_sold_year, ws_item_sk,    ws_bill_customer_sk
     ws_customer_sk,    sum(ws_quantity) ws_qty,    sum(ws_wholesale_cost) ws_wc,
        sum(ws_sales_price) ws_sp   from web_sales   left join web_returns on
     wr_order_number=ws_order_number and ws_item_sk=wr_item_sk   join date_dim
     on ws_sold_date_sk = d_date_sk   where wr_order_number is null   group by
     d_year, ws_item_sk, ws_bill_customer_sk   ),
    cs as  
    (select d_year AS cs_sold_year,
     cs_item_sk,    cs_bill_customer_sk cs_customer_sk,    sum(cs_quantity) cs_qty,
        sum(cs_wholesale_cost) cs_wc,    sum(cs_sales_price) cs_sp   from catalog_sales
       left join catalog_returns on cr_order_number=cs_order_number and cs_item_sk=cr_item_sk
       join date_dim on cs_sold_date_sk = d_date_sk   where cr_order_number is
     null   group by d_year, cs_item_sk, cs_bill_customer_sk   ),
    ss as  
    (select
     d_year AS ss_sold_year, ss_item_sk,    ss_customer_sk,    sum(ss_quantity)
     ss_qty,    sum(ss_wholesale_cost) ss_wc,    sum(ss_sales_price) ss_sp
       from store_sales left join store_returns on sr_ticket_number=ss_ticket_number
     and ss_item_sk=sr_item_sk   join date_dim on ss_sold_date_sk = d_date_sk
       where sr_ticket_number is null   group by d_year, ss_item_sk, ss_customer_sk
       ) 
       
       select 
       ss_customer_sk,round(ss_qty/(coalesce(ws_qty+cs_qty,1)),2)
     ratio,ss_qty store_qty, ss_wc store_wholesale_cost, ss_sp store_sales_price,
    coalesce(ws_qty,0)+coalesce(cs_qty,0) other_chan_qty,coalesce(ws_wc,0)+coalesce(cs_wc,0)
     other_chan_wholesale_cost,coalesce(ws_sp,0)+coalesce(cs_sp,0) other_chan_sales_price
    from ss left join ws on (ws_sold_year=ss_sold_year and ws_item_sk=ss_item_sk
     and ws_customer_sk=ss_customer_sk)left join cs on (cs_sold_year=ss_sold_year
     and cs_item_sk=cs_item_sk and cs_customer_sk=ss_customer_sk)where coalesce(ws_qty,0)>0
     and coalesce(cs_qty, 0)>0 order by   ss_customer_sk,  ss_qty desc, ss_wc
     desc, ss_sp desc,  other_chan_qty,  other_chan_wholesale_cost,  other_chan_sales_price,
      round(ss_qty/(coalesce(ws_qty+cs_qty,1)),2);

When the query is complete, run the following SQL to capture the start and end times of the query, which will be used in the next query:

select query_id,query_text,start_time,end_time, elapsed_time/1000000.0 duration_in_seconds
from sys_query_history
where query_text like '%TPC-DS demo query%'
and query_text not like '%sys_query_history%'
order by start_time desc

Let’s assess the compute scaled during the preceding start_time and end_time period. Replace start_time and end_time in the following query with the output of the preceding query:

select * from sys_serverless_usage
where end_time >= 'start_time'
and end_time <= DATEADD(minute,1,'end_time')
order by end_time asc

-- Example
--select * from sys_serverless_usage
--where end_time >= '2024-06-03 00:17:12.322353'
--and end_time <= DATEADD(minute,1,'2024-06-03 00:19:11.553218')
--order by end_time asc

The following screenshot shows an example output.

Use Case 1 output

You can notice the increase in compute over the duration of this query. This demonstrates how Redshift Serverless scales based on query complexity.

Use case 2: Scale compute based on data volume

Let’s consider the web_sales ingestion job. For this example, your daily ingestion job processes 720 million records and completes in an average of 2 minutes. This is what you ingested in the prerequisite steps.

Due to some event (such as month end processing), your volumes increased by three times and now your ingestion job needs to process 2.1 billion records. In an existing scaling approach, this would increase your ingestion job runtime unless the queue time is enough to invoke additional compute resources. But with AI-driven scaling, in performance optimized mode, Amazon Redshift automatically scales compute to complete your ingestion job within usual runtimes. This helps protect your ingestion SLAs.

Run the following job to ingest 2.1 billion records into the web_sales table:

copy web_sales from 's3://redshift-downloads/TPC-DS/2.13/3TB/web_sales/' iam_role default gzip delimiter '|' EMPTYASNULL region 'us-east-1';

Run the following query to compare the duration of ingesting 2.1 billion records and 720 million records. Both ingestion jobs completed in approximately a similar time, despite the three-fold increase in volume.

select query_id,table_name,data_source,loaded_rows,duration/1000000.0 duration_in_seconds , start_time,end_time
from sys_load_history
where
table_name='web_sales'
order by start_time desc

Run the following query with the start times and end times from the previous output:

select * from sys_serverless_usage
where end_time >= 'start_time'
and end_time <= DATEADD(minute,1,'end_time')
order by end_time asc

The following is an example output. You can notice the increase in compute capacity for the ingestion job that processes 2.1 billion records. This illustrates how Redshift Serverless scaled based on data volume.

Use Case 2 Output

Use case 3: Scale data lake queries

In this use case, you create external tables pointing to TPC-DS 3TB data in an Amazon Simple Storage Service (Amazon S3) location. Then you run a query that scans a large volume of data to demonstrate how Redshift Serverless can automatically scale compute capacity as needed.

In the following SQL, provide the ARN of the default IAM role you attached in the prerequisites:

-- Create external schema
create external schema ext_tpcds_3t
from data catalog
database ext_tpcds_db
iam_role '<ARN of the default IAM role attached>'
create external database if not exists;

Create external tables by running DDL statements in the following SQL file. You should see seven external tables in the query editor under the ext_tpcds_3t schema, as shown in the following screenshot.

External Tables

Run the following query using external tables. As mentioned in the first use case, if the query didn’t scale during the first run, it is recommended to rerun the query, because the system will have learned from the previous experience and can potentially provide better scaling and performance for the subsequent run.

The results cache is turned off in the following queries to avoid fetching results from the cache.

SET enable_result_cache_for_session TO off;

with /* TPC-DS demo data lake query */

ws as
(select d_year AS ws_sold_year, ws_item_sk, ws_bill_customer_sk
ws_customer_sk,    sum(ws_quantity) ws_qty,    sum(ws_wholesale_cost) ws_wc,
sum(ws_sales_price) ws_sp   from ext_tpcds_3t.web_sales   left join ext_tpcds_3t.web_returns on
wr_order_number=ws_order_number and ws_item_sk=wr_item_sk   join ext_tpcds_3t.date_dim
on ws_sold_date_sk = d_date_sk   where wr_order_number is null   group by
d_year, ws_item_sk, ws_bill_customer_sk   ),

cs as
(select d_year AS cs_sold_year,
cs_item_sk,    cs_bill_customer_sk cs_customer_sk,    sum(cs_quantity) cs_qty,
sum(cs_wholesale_cost) cs_wc,    sum(cs_sales_price) cs_sp   from ext_tpcds_3t.catalog_sales
left join ext_tpcds_3t.catalog_returns on cr_order_number=cs_order_number and cs_item_sk=cr_item_sk
join ext_tpcds_3t.date_dim on cs_sold_date_sk = d_date_sk   where cr_order_number is
null   group by d_year, cs_item_sk, cs_bill_customer_sk   ),

ss as
(select
d_year AS ss_sold_year, ss_item_sk,    ss_customer_sk,    sum(ss_quantity)
ss_qty,    sum(ss_wholesale_cost) ss_wc,    sum(ss_sales_price) ss_sp
from ext_tpcds_3t.store_sales left join ext_tpcds_3t.store_returns on sr_ticket_number=ss_ticket_number
and ss_item_sk=sr_item_sk   join ext_tpcds_3t.date_dim on ss_sold_date_sk = d_date_sk
where sr_ticket_number is null   group by d_year, ss_item_sk, ss_customer_sk)

SELECT           ss_customer_sk,round(ss_qty/(coalesce(ws_qty+cs_qty,1)),2)
ratio,ss_qty store_qty, ss_wc store_wholesale_cost, ss_sp store_sales_price,
coalesce(ws_qty,0)+coalesce(cs_qty,0) other_chan_qty,coalesce(ws_wc,0)+coalesce(cs_wc,0)    other_chan_wholesale_cost,coalesce(ws_sp,0)+coalesce(cs_sp,0) other_chan_sales_price
FROM ss left join ws on (ws_sold_year=ss_sold_year and ws_item_sk=ss_item_sk and ws_customer_sk=ss_customer_sk)left join cs on (cs_sold_year=ss_sold_year and cs_item_sk=cs_item_sk and cs_customer_sk=ss_customer_sk)
where coalesce(ws_qty,0)>0
and coalesce(cs_qty, 0)>0
order by   ss_customer_sk,  ss_qty desc, ss_wc desc, ss_sp desc,  other_chan_qty,  other_chan_wholesale_cost,  other_chan_sales_price,     round(ss_qty/(coalesce(ws_qty+cs_qty,1)),2);

Review the total elapsed time of the query. You need the start_time and end_time from the results to feed into the next query.

select query_id,query_text,start_time,end_time, elapsed_time/1000000.0 duration_in_seconds
from sys_query_history
where query_text like '%TPC-DS demo data lake query%'
and query_text not like '%sys_query_history%'
order by start_time desc

Run the following query to see how compute scaled during the preceding start_time and end_time period. Replace start_time and end_time in the following query from the output of the preceding query:

select * from sys_serverless_usage
where end_time >= 'start_time'
and end_time <= DATEADD(minute,1,'end_time')
order by end_time asc

The following screenshot shows an example output.

Use Case 3 Output

The increased compute capacity for this data lake query shows that Redshift Serverless can scale to match the data being scanned. This demonstrates how Redshift Serverless can dynamically allocate resources based on query needs.

Considerations when choosing your price-performance target

You can use the price-performance slider to choose your desired price-performance target for your workload. The AI-driven scaling and optimizations provide holistic optimizations using the following models:

  • Query prediction models – These determine the actual resource needs (memory, CPU consumption, and so on) for each individual query
  • Scaling prediction models – These predict how the query would behave on different capacity sizes

Let’s consider a query that takes 7 minutes and costs $7. The following figure shows the query runtimes and cost with no scaling.

Scaling Type Example

A given query might scale in a few different ways, as shown below. Based on the price-performance target you chose on the slider, AI-driven scaling predicts how the query trades off performance and cost, and scales it accordingly.

Scaling Types

The slider options yield the following results:

  • Optimized for cost – When you choose Optimized for cost, the warehouse scales up if there is no additional cost or lesser costs to the user. In the preceding example, the superlinear scaling approach demonstrates this behavior. Scaling will only occur if it can be done in a cost-effective manner according to the scaling model predictions. If the scaling models predict that cost-optimized scaling isn’t possible for the given workload, then the warehouse won’t scale.
  • Balanced – With the Balanced option, the system will scale in favor of performance and there will be a cost increase, but it will be a limited increase in cost. In the preceding example, the linear scaling approach demonstrates this behavior.
  • Optimized for performance – With the Optimized for performance option, the system will scale in favor of performance even though the costs are higher and non-linear. In the preceding example, the sublinear scaling approach demonstrates this behavior. The closer the slider position is to the Optimized for performance position, the more sublinear scaling is permitted.

The following are additional points to note:

  • The price-performance slider options are dynamic and they can be changed anytime. However, the impact of these changes will not be realized immediately. The impact of this is effective as the system learns how to scale the current workload and any additional workloads better.
  • The price-performance slider options, Max capacity and Max RPU-hours are designed to work together. Max capacity and Max RPU-hours are the controls to limit maximum RPUs the data warehouse allowed to scale and maximum RPU hours allowed to consume respectively. These controls are always honored and enforced regardless of the settings on the price-performance target slider.
  • The AI-driven scaling and optimization feature dynamically adjusts compute resources to optimize query runtime speed while adhering to your price-performance requirements. It considers factors such as query queueing, concurrency, volume, and complexity. The system can either run queries on a compute resource with lower concurrent queries or spin up additional compute resources to avoid queueing. The goal is to provide the best price-performance balance based on your choices.

Monitoring

You can monitor the RPU scaling in the following ways:

  • Review the RPU capacity used graph on the Amazon Redshift console.
  • Monitor the ComputeCapacity metric under AWS/Redshift-Serverless and Workgroup in Amazon CloudWatch.
  • Query the SYS_QUERY_HISTORY view, providing the specific query ID or query text to identify the time period. Use this time period to query the SYS_SERVERLSS_USAGE system view to find the compute_capacity The compute_capacity field will show the RPUs scaled during the query runtime.

Refer to Configure monitoring, limits, and alarms in Amazon Redshift Serverless to keep costs predictable for the step-by-step instructions on using these approaches.

Clean up

Complete the following steps to delete the resources you created to avoid unexpected costs:

  1. Delete the Redshift Serverless workgroup.
  2. Delete the Redshift Serverless associated namespace.

Conclusion

In this post, we discussed how to optimize your workloads to scale based on the changes in data volume and query complexity. We demonstrated an approach to implement more responsive, proactive scaling with the AI-driven scaling feature in Redshift Serverless. Try this feature in your environment, conduct a proof of concept on your specific workloads, and share your feedback with us.


About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Ashish Agrawal is a Principal Product Manager with Amazon Redshift, building cloud-based data warehouses and analytics cloud services. Ashish has over 25 years of experience in IT. Ashish has expertise in data warehouses, data lakes, and platform as a service. Ashish has been a speaker at worldwide technical conferences.

Davide Pagano is a Software Development Manager with Amazon Redshift based out of Palo Alto, specialized in building cloud-based data warehouses and analytics cloud services solutions. He has over 10 years of experience with databases, out of which 6 years of experience tailored to Amazon Redshift.

[$] Modernizing openSUSE installation with Agama

Post Syndicated from jzb original https://lwn.net/Articles/974969/

Linux installers receive a disproportionate amount of attention
compared to the amount of time that most users spend with them. Ideally,
a user spends only a few minutes using the installer, versus years using
the distribution after it is installed. Yet, the installer sets the
first impression, and if it fails to do its job, little else matters.
Installers also have to continually evolve to keep pace with new
hardware, changes in distribution packaging (such as image-based Linux
distributions), and so forth. Along those lines, the SUSE team that maintains the
venerable YaST installer has
decided it’s time to start (almost) fresh with a new Linux installer
project, called Agama,
for new projects. YaST is not going away as an administration tool,
but it is likely to be relieved of installer duties at some point.

Encryption in transit over external networks: AWS guidance for NYDFS and beyond

Post Syndicated from Aravind Gopaluni original https://aws.amazon.com/blogs/security/encryption-in-transit-over-external-networks-aws-guidance-for-nydfs-and-beyond/

On November 1, 2023, the New York State Department of Financial Services (NYDFS) issued its Second Amendment (the Amendment) to its Cybersecurity Requirements for Financial Services Companies adopted in 2017, published within Section 500 of 23 NYCRR 500 (the Cybersecurity Requirements; the Cybersecurity Requirements as amended by the Amendment, the Amended Cybersecurity Requirements). In the introduction to its Cybersecurity Resource Center, the Department explains that the revisions are aimed at addressing the changes in the increasing sophistication of threat actors, the prevalence of and relative ease in running cyberattacks, and the availability of additional controls to manage cyber risks.

This blog post focuses on the revision to the encryption in transit requirement under section 500.15(a). It outlines the encryption capabilities and secure connectivity options offered by Amazon Web Services (AWS) to help customers demonstrate compliance with this updated requirement. The post also provides best practices guidance, emphasizing the shared responsibility model. This enables organizations to design robust data protection strategies that address not only the updated NYDFS encryption requirements but potentially also other security standards and regulatory requirements.

The target audience for this information includes security leaders, architects, engineers, and security operations team members and risk, compliance, and audit professionals.

Note that the information provided here is for informational purposes only; it is not legal or compliance advice and should not be relied on as legal or compliance advice. Customers are responsible for making their own independent assessments and should obtain appropriate advice from their own legal and compliance advisors regarding compliance with applicable NYDFS regulations.

500.15 Encryption of nonpublic information

The updated requirement in the Amendment states that:

  1. As part of its cybersecurity program, each covered entity shall implement a written policy requiring encryption that meets industry standards, to protect nonpublic information held or transmitted by the covered entity both in transit over external networks and at rest.
  2. To the extent a covered entity determines that encryption of nonpublic information at rest is infeasible, the covered entity may instead secure such nonpublic information using effective alternative compensating controls that have been reviewed and approved by the covered entity’s CISO in writing. The feasibility of encryption and effectiveness of the compensating controls shall be reviewed by the CISO at least annually.

This section of the Amendment removes the covered entity’s chief information security officer’s (CISO) discretion to approve compensating controls when encryption of nonpublic information in transit over external networks is deemed infeasible. The Amendment mandates that, effective November 2024, organizations must encrypt nonpublic information transmitted over external networks without the option of implementing alternative compensating controls. While the use of security best practices such as network segmentation, multi-factor authentication (MFA), and intrusion detection and prevention systems (IDS/IPS) can provide defense in depth, these compensating controls are no longer sufficient to replace encryption in transit over external networks for nonpublic information.

However, the Amendment still allows for the CISO to approve the use of alternative compensating controls where encryption of nonpublic information at rest is deemed infeasible. AWS is committed to providing industry-standard encryption services and capabilities to help protect customer data at rest in the cloud, offering customers the ability to add layers of security to their data at rest, providing scalable and efficient encryption features. This includes the following services:

While the above highlights encryption-at-rest capabilities offered by AWS, the focus of this blog post is to provide guidance and best practice recommendations for encryption in transit.

AWS guidance and best practice recommendations

Cloud network traffic encompasses connections to and from the cloud and traffic between cloud service provider (CSP) services. From an organization’s perspective, CSP networks and data centers are deemed external because they aren’t under the organization’s direct control. The connection between the organization and a CSP, typically established over the internet or dedicated links, is considered an external network. Encrypting data in transit over these external networks is crucial and should be an integral part of an organization’s cybersecurity program.

AWS implements multiple mechanisms to help ensure the confidentiality and integrity of customer data during transit and at rest across various points within its environment. While AWS employs transparent encryption at various transit points, we strongly recommend incorporating encryption by design into your architecture. AWS provides robust encryption-in-transit capabilities to help you adhere to compliance requirements and mitigate the risks of unauthorized disclosure and modification of nonpublic information in transit over external networks.

Additionally, AWS recommends that financial services institutions adopt a secure by design (SbD) approach to implement architectures that are pre-tested from a security perspective. SbD helps establish control objectives, security baselines, security configurations, and audit capabilities for workloads running on AWS.

Security and Compliance is a shared responsibility between AWS and the customer. Shared responsibility can vary depending on the security configuration options for each service. You should carefully consider the services you choose because your organization’s responsibilities vary depending on the services used, the integration of those services into your IT environment, and applicable laws and regulations. AWS provides resources such as service user guides and AWS Customer Compliance Guides, which map security best practices for individual services to leading compliance frameworks, including NYDFS.

Protecting connections to and from AWS

We understand that customers place a high priority on privacy and data security. That’s why AWS gives you ownership and control over your data through services that allow you to determine where your content will be stored, secure your content in transit and at rest, and manage access to AWS services and resources for your users. When architecting workloads on AWS, classifying data based on its sensitivity, criticality, and compliance requirements is essential. Proper data classification allows you to implement appropriate security controls and data protection mechanisms, such as Transport Layer Security (TLS) at the application layer, access control measures, and secure network connectivity options for nonpublic information over external networks. When it comes to transmitting nonpublic information over external networks, it’s a recommended practice to identify network segments traversed by this data based on your network architecture. While AWS employs transparent encryption at various transit points, it’s advisable to implement encryption solutions at multiple layers of the OSI model to establish defense in depth and enhance end-to-end encryption capabilities. Although requirement 500.15 of the Amendment doesn’t mandate end-to-end encryption, implementing such controls can provide an added layer of security and can help demonstrate that nonpublic information is consistently encrypted during transit.

AWS offers several options to achieve this. While not every option provides end-to-end encryption on its own, using them in combination helps to ensure that nonpublic information doesn’t traverse open, public networks unprotected. These options include:

  • Using AWS Direct Connect with IEEE 802.1AE MAC Security Standard (MACsec) encryption
  • VPN connections
  • Secure API endpoints
  • Client-side encryption of data before sending it to AWS

AWS Direct Connect with MACsec encryption

AWS Direct Connect provides direct connectivity to the AWS network through third-party colocation facilities, using a cross-connect between an AWS owned device and either a customer- or partner-owned device. Direct Connect can reduce network costs, increase bandwidth throughput, and provide a more consistent network experience than internet-based connections. Within Direct Connect connections (a physical construct) there will be one or more virtual interfaces (VIFs). These are logical entities and are reflected as industry-standard 802.1Q VLANs on the customer equipment terminating the Direct Connect connection. Depending on the type of VIF, they will use either public or private IP addressing. There are three different types of VIFs:

  • Public virtual interface – Establish connectivity between AWS public endpoints and your data center, office, or colocation environment.
  • Transit virtual interface – Establish private connectivity between AWS Transit Gateways and your data center, office, or colocation environment. Transit Gateways is an AWS managed high availability and scalability regional network transit hub used to interconnect Amazon Virtual Private Cloud (Amazon VPC) and customer networks.
  • Private virtual interface – Establish private connectivity between Amazon VPC resources and your data center, office, or colocation environment.

By default, a Direct Connect connection isn’t encrypted from your premises to the Direct Connect location because AWS cannot assume your on-premises device supports the MACsec protocol. With MACsec, Direct Connect delivers native, near line-rate, point-to-point encryption, ensuring that data communications between AWS and your corporate network remain protected. MACsec is supported on 10 Gbps and 100 Gbps dedicated Direct Connect connections at selected points of presence. Using Direct Connect with MACsec-enabled connections and combining it with the transparent physical network encryption offered by AWS from the Direct Connect location through the AWS backbone not only benefits you by allowing you to securely exchange data with AWS, but also enables you to use the highest available bandwidth. For additional information on MACsec support and cipher suites, see the MACsec section in the Direct Connect FAQs.

Figure 1 illustrates a sample reference architecture for securing traffic from corporate network to your VPCs over Direct Connect with MACsec and AWS Transit Gateways.

Figure 1: Sample architecture for using Direct Connect with MACsec encryption

Figure 1: Sample architecture for using Direct Connect with MACsec encryption

In the sample architecture, you can see that Layer 2 encryption through MACsec only encrypts the traffic from your on-premises systems to the AWS device in the Direct Connect location, and therefore you need to consider additional encryption solutions at Layer 3, 4, or 7 to get closer to end-to-end encryption to the device where you’re comfortable for the packets to be decrypted. In the next section, let’s review an option for using network layer encryption using AWS Site-to-Site VPN.

Direct Connect with Site-to-Site VPN

AWS Site-to-Site VPN is a fully managed service that creates a secure connection between your corporate network and your Amazon VPC using IP security (IPsec) tunnels over the internet. Data transferred between your VPC and the remote network routes over an encrypted VPN connection to help maintain the confidentiality and integrity of data in transit. Each VPN connection consists of two tunnels between a virtual private gateway or transit gateway on the AWS side and a customer gateway on the on-premises side. Each tunnel supports a maximum throughput of up to 1.25 Gbps. See Site-to-Site VPN quotas for more information.

You can use Site-to-Site VPN over Direct Connect to achieve secure IPsec connection with the low latency and consistent network experience of Direct Connect when reaching resources in your Amazon VPCs.

Figure 2 illustrates a sample reference architecture for establishing end-to-end IPsec-encrypted connections between your networks and Transit Gateway over a private dedicated connection.

Figure 2: Encrypted connections between the AWS Cloud and a customer’s network using VPN

Figure 2: Encrypted connections between the AWS Cloud and a customer’s network using VPN

While Direct Connect with MACsec and Site-to-Site VPN with IPsec can provide encryption at the physical and network layers respectively, they primarily secure the data in transit between your on-premises network and the AWS network boundary. To further enhance the coverage for end-to-end encryption, it is advisable to use TLS encryption. In the next section, let’s review mechanisms for securing API endpoints on AWS using TLS encryption.

Secure API endpoints

APIs act as the front door for applications to access data, business logic, or functionality from other applications and backend services.

AWS enables you to establish secure, encrypted connections to its services using public AWS service API endpoints. Public AWS owned service API endpoints (AWS managed services like Amazon Simple Queue Service (Amazon SQS), AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), others) have certificates that are owned and deployed by AWS. By default, requests to these public endpoints use HTTPS. To align with evolving technology and regulatory standards for TLS, as of February 27, 2024, AWS has updated its TLS policy to require a minimum of TLS 1.2, thereby deprecating support for TLS 1.0 and 1.1 versions on AWS service API endpoints across each of our AWS Regions and Availability Zones.

Additionally, to enhance connection performance, AWS has begun enabling TLS version 1.3 globally for its service API endpoints. If you’re using the AWS SDKs or AWS Command Line Interface (AWS CLI), you will automatically benefit from TLS 1.3 after a service enables it.

While requests to public AWS service API endpoints use HTTPS by default, a few services, such as Amazon S3 and Amazon DynamoDB, allow using either HTTP or HTTPS. If the client or application chooses HTTP, the communication isn’t encrypted. Customers are responsible for enforcing HTTPS connections when using such AWS services. To help ensure secure communication, you can establish an identity perimeter by using the IAM policy condition key aws:SecureTransport in your IAM roles to evaluate the connection and mandate HTTPS usage.

As enterprises increasingly adopt cloud computing and microservices architectures, teams frequently build and manage internal applications exposed as private API endpoints. Customers are responsible for managing the certificates on private customer-owned endpoints. AWS helps you deploy private customer-owned identities (that is, TLS certificates) through the use of AWS Certificate Manager (ACM) private certificate authorities (PCA) and the integration with AWS services that offer private customer-owned TLS termination endpoints.

ACM is a fully managed service that lets you provision, manage, and deploy public and private TLS certificates for use with AWS services and internal connected resources. ACM minimizes the time-consuming manual process of purchasing, uploading, and renewing TLS certificates. You can provide certificates for your integrated AWS services either by issuing them directly using ACM or by importing third-party certificates into the ACM management system. ACM offers two options for deploying managed X.509 certificates. You can choose the best one for your needs.

  • AWS Certificate Manager (ACM) – This service is for enterprise customers who need a secure web presence using TLS. ACM certificates are deployed through Elastic Load Balancing (ELB), Amazon CloudFront, Amazon API Gateway, and other integrated AWS services. The most common application of this type is a secure public website with significant traffic requirements. ACM also helps to simplify security management by automating the renewal of expiring certificates.
  • AWS Private Certificate Authority (Private CA) – This service is for enterprise customers building a public key infrastructure (PKI) inside the AWS Cloud and is intended for private use within an organization. With AWS Private CA, you can create your own certificate authority (CA) hierarchy and issue certificates with it for authenticating users, computers, applications, services, servers, and other devices. Certificates issued by a private CA cannot be used on the internet. For more information, see the AWS Private CA User Guide.

You can use a centralized API gateway service, such as Amazon API Gateway, to securely expose customer-owned private API endpoints. API Gateway is a fully managed service that allows developers to create, publish, maintain, monitor, and secure APIs at scale. With API Gateway, you can create RESTful APIs and WebSocket APIs, enabling near real-time, two-way communication applications. API Gateway operations must be encrypted in-transit using TLS, and require the use of HTTPS endpoints. You can use API Gateway to configure custom domains for your APIs using TLS certificates provisioned and managed by ACM. Developers can optionally choose a specific TLS version for their custom domain names. For use cases that require mutual TLS (mTLS) authentication, you can configure certificate-based mTLS authentication on your custom domains.

Pre-encryption of data to be sent to AWS

Depending on the risk profile and sensitivity of the data that’s being transferred to AWS, you might want to choose encrypting data in an application running on your corporate network before sending it to AWS (client-side encryption). AWS offers a variety of SDKs and client-side encryption libraries to help you encrypt and decrypt data in your applications. You can use these libraries with the cryptographic service provider of your choice, including AWS Key Management Service or AWS CloudHSM, but the libraries do not require an AWS service.

  • The AWS Encryption SDK is a client-side encryption library that you can use to encrypt and decrypt data in your application and is available in several programming languages, including a command-line interface. You can use the SDK to encrypt your data before you send it to an AWS service. The SDK offers advanced data protection features, including envelope encryption and additional authenticated data (AAD). It also offers secure, authenticated, symmetric key algorithm suites, such as 256-bit AES-GCM with key derivation and signing.
  • The AWS Database Encryption SDK is a set of software libraries developed in open source that enable you to include client-side encryption in your database design. The SDK provides record-level encryption solutions. You specify which fields are encrypted and which fields are included in the signatures that help ensure the authenticity of your data. Encrypting your sensitive data in transit and at rest helps ensure that your plaintext data isn’t available to a third party, including AWS. The AWS Database Encryption SDK for DynamoDB is designed especially for DynamoDB applications. It encrypts the attribute values in each table item using a unique encryption key. It then signs the item to protect it against unauthorized changes, such as adding or deleting attributes or swapping encrypted values. After you create and configure the required components, the SDK transparently encrypts and signs your table items when you add them to a table. It also verifies and decrypts them when you retrieve them. Searchable encryption in the AWS Database Encryption SDK enables you search encrypted records without decrypting the entire database. This is accomplished by using beacons, which create a map between the plaintext value written to a field and the encrypted value that is stored in your database. For more information, see the AWS Database Encryption SDK Developer Guide.
  • The Amazon S3 Encryption Client is a client-side encryption library that enables you to encrypt an object locally to help ensure its security before passing it to Amazon S3. It integrates seamlessly with the Amazon S3 APIs to provide a straightforward solution for client-side encryption of data before uploading to Amazon S3. After you instantiate the Amazon S3 Encryption Client, your objects are automatically encrypted and decrypted as part of your Amazon S3 PutObject and GetObject requests. Your objects are encrypted with a unique data key. You can use both the Amazon S3 Encryption Client and server-side encryption to encrypt your data. The Amazon S3 Encryption Client is supported in a variety of programming languages and supports industry-standard algorithms for encrypting objects and data keys. For more information, see the Amazon S3 Encryption Client developer guide.

Encryption in-transit inside AWS

AWS implements responsible and sophisticated technical and physical controls that are designed to help prevent unauthorized access to or disclosure of your content. To protect data in transit, traffic traversing through the AWS network that is outside of AWS physical control is transparently encrypted by AWS at the physical layer. This includes traffic between AWS Regions (except China Regions), traffic between Availability Zones, and between Direct Connect locations and Regions through the AWS backbone network.

Network segmentation

When you create an AWS account, AWS offers a virtual networking option to launch resources in a logically isolated virtual private network (VPN), Amazon Virtual Private Cloud (Amazon VPC). A VPC is limited to a single AWS Region and every VPC has one or more subnets. VPCs can be connected externally using an internet gateway (IGW), VPC peering connection, VPN, Direct Connect, or Transit Gateways. Traffic within the your VPC is considered internal because you have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.

As a customer, you maintain ownership of your data, and you select which AWS services can process, store, and host your data, and you choose the Regions in which your data is stored. AWS doesn’t automatically replicate data across Regions, unless the you choose to do so. Data transmitted over the AWS global network between Regions and Availability Zones is automatically encrypted at the physical layer before leaving AWS secured facilities. Cross-Region traffic that uses Amazon VPC and Transit Gateway peering is automatically bulk-encrypted when it exits a Region.

Encryption between instances

AWS provides secure and private connectivity between Amazon Elastic Compute Cloud (Amazon EC2) instances of all types. The Nitro System is the underlying foundation for modern Amazon EC2 instances. It’s a combination of purpose-built server designs, data processors, system management components, and specialized firmware that provides the underlying foundation for EC2 instances launched since the beginning of 2018. Instance types that use the offload capabilities of the underlying Nitro System hardware automatically encrypt in-transit traffic between instances. This encryption uses Authenticated Encryption with Associated Data (AEAD) algorithms, with 256-bit encryption and has no impact on network performance. To support this additional in-transit traffic encryption between instances, instances must be of supported instance types, in the same Region, and in the same VPC or peered VPCs. For a list of supported instance types and additional requirements, see Encryption in transit.

Conclusion

The second Amendment to the NYDFS Cybersecurity Regulation underscores the criticality of safeguarding nonpublic information during transmission over external networks. By mandating encryption for data in transit and eliminating the option for compensating controls, the Amendment reinforces the need for robust, industry-standard encryption measures to protect the confidentiality and integrity of sensitive information.

AWS provides a comprehensive suite of encryption services and secure connectivity options that enable you to design and implement robust data protection strategies. The transparent encryption mechanisms that AWS has built into services across its global network infrastructure, secure API endpoints with TLS encryption, and services such as Direct Connect with MACsec encryption and Site-to-Site VPN, can help you establish secure, encrypted pathways for transmitting nonpublic information over external networks.

By embracing the principles outlined in this blog post, financial services organizations can address not only the updated NYDFS encryption requirements for section 500.15(a) but can also potentially demonstrate their commitment to data security across other security standards and regulatory requirements.

For further reading on considerations for AWS customers regarding adherence to the Second Amendment to the NYDFS Cybersecurity Regulation, see the AWS Compliance Guide to NYDFS Cybersecurity Regulation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Financial Services re:Post and AWS Security, Identity, & Compliance re:Post ,or contact AWS Support.
 

Aravind Gopaluni
Aravind Gopaluni

Aravind is a Senior Security Solutions Architect at AWS, helping financial services customers navigate ever-evolving cloud security and compliance needs. With over 20 years of experience, he has honed his expertise in delivering robust solutions to numerous global enterprises. Away from the world of cybersecurity, he cherishes traveling and exploring cuisines with his family.
Stephen Eschbach
Stephen Eschbach

Stephen is a Senior Compliance Specialist at AWS, helping financial services customers meet their security and compliance objectives on AWS. With over 18 years of experience in enterprise risk, IT GRC, and IT regulatory compliance, Stephen has worked and consulted for several global financial services companies. Outside of work, Stephen enjoys family time, kids’ sports, fishing, golf, and Texas BBQ.

Story of an Undercover CIA Agent who Penetrated Al Qaeda

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/08/story-of-an-undercover-cia-agent-who-penetrated-al-qaeda.html

Rolling Stone has a long investigative story (non-paywalled version here) about a CIA agent who spent years posing as an Islamic radical.

Unrelated, but also in the “real life spies” file: a fake Sudanese diving resort run by Mossad.

Get ready for Moonhack 2024: Projects on climate change

Post Syndicated from Isabel Ronaldson original https://www.raspberrypi.org/blog/moonhack-2024/

Moonhack is a free, international coding challenge for young people run online every year by Code Club Australia, powered by our partner the Telstra Foundation. The yearly challenge is open to young people worldwide, and in 2023, over 44,500 young people registered to take part.

A Moonhack 2024 logo.

Moonhack 2024 runs from 14 to 31 October. This year’s theme is taken from World Space Week 2024: climate change. As always, the projects cater for everyone from brand-new beginners to more experienced coders. And young people have a chance to win a prize for their submitted project!

We caught up with Kaye North, Community and Engagement Manager at Code Club Australia, to find out more.

What to expect from Moonhack in 2024

For this year’s projects, Kaye told us that she collaborated with farmers, scientists, and young people from across Australia to cover diverse topics related to climate change and space. The projects will help participants learn about topics from how people who work in agriculture use climate data to increase crop yields and practise sustainable farming, to the impact of rising global temperatures on sea life populations.

An illustration depicting various elements related to the environment and sustainability.

Kaye also hopes to help young people understand the role of satellite data related to climate change, such as the data NASA collects and shares via satellite. Satellite data on rising sea levels, called out in United Nations Sustainable Development Goal 13, forms the basis of one of the Moonhack projects this year.

Moonhack participants will be able to code with Scratch, micro:bit, or Python. They can also take on a project brief where they may choose their favourite programming language and even include physical computing if they wish.

A computing classroom filled with learners.

All six projects will be available from 1 September when registration opens, and projects can be submitted until 30 November.

Inspiring young people to create a better future

Climate change is an issue that affects everyone, and for many young people it’s a source of concern. Kaye’s aim this year is to show small changes young people can make to contribute to a big, global impact.

“Moonhack’s question this year is ‘Can we create calls to action through our coding to influence others to make better choices, or even inform them of things that they didn’t know that they can share with others?’” – Kaye North, Code Club Australia

Moonhack support for volunteers, teachers and parents

This year’s Moonhack includes new resources to help educators and mentors who are supporting young people to take part:

Get your young coders involved: Key info

  • Registration for Moonhack 2024 opens on 1 September
  • The challenge runs from 14 to 31 October, and projects can be submitted until 30 November
  • Participation is free and open to any young coder worldwide, whether they are part of a Code Club or not
  • Everyone from beginners to advanced coders can participate
  • The six projects for Moonhack 2024 will be available in around 30 languages

To find out more, visit the Moonhack website and sign up to the Moonhack newsletter.

Code Club Australia is powered by the Telstra Foundation as part of a strategic partnership with us at the Raspberry Pi Foundation.

The post Get ready for Moonhack 2024: Projects on climate change appeared first on Raspberry Pi Foundation.

[$] Python subinterpreters and free-threading

Post Syndicated from jake original https://lwn.net/Articles/985041/

At
PyCon 2024 in Pittsburgh,
Pennsylvania, Anthony Shaw looked at the various kinds of parallelism
available to Python programs. There have been two major developments on
the parallel-execution front over the last few years, with the effort to
provide subinterpreters, each with its own
global interpreter lock (GIL), along with the work to remove the GIL entirely. In the talk, he
explored the two approaches to try to give attendees a sense of how to make
the right choice for their applications.

NIST’s first post-quantum standards

Post Syndicated from Luke Valenta original https://blog.cloudflare.com/nists-first-post-quantum-standards


On August 13th, 2024, the US National Institute of Standards and Technology (NIST) published the first three cryptographic standards designed to resist an attack from quantum computers: ML-KEM, ML-DSA, and SLH-DSA. This announcement marks a significant milestone for ensuring that today’s communications remain secure in a future world where large-scale quantum computers are a reality.

In this blog post, we briefly discuss the significance of NIST’s recent announcement, how we expect the ecosystem to evolve given these new standards, and the next steps we are taking. For a deeper dive, see our March 2024 blog post.

Why are quantum computers a threat?

Cryptography is a fundamental aspect of modern technology, securing everything from online communications to financial transactions. For instance, when visiting this blog, your web browser used cryptography to establish a secure communication channel to Cloudflare’s server to ensure that you’re really talking to Cloudflare (and not an impersonator), and that the conversation remains private from eavesdroppers.

Much of the cryptography in widespread use today is based on mathematical puzzles (like factoring very large numbers) which are computationally out of reach for classical (non-quantum) computers. We could likely continue to use traditional cryptography for decades to come if not for the advent of quantum computers, devices that use properties of quantum mechanics to perform certain specialized calculations much more efficiently than traditional computers. Unfortunately, those specialized calculations include solving the mathematical puzzles upon which most widely deployed cryptography depends.

As of today, no quantum computers exist that are large and stable enough to break today’s cryptography, but experts predict that it’s only a matter of time until such a cryptographically-relevant quantum computer (CRQC) exists. For instance, more than a quarter of interviewed experts in a 2023 survey expect that a CRQC is more likely than not to appear in the next decade.

What is being done about the quantum threat?

In recognition of the quantum threat, the US National Institute of Standards and Technology (NIST) launched a public competition in 2016 to solicit, evaluate, and standardize new “post-quantum” cryptographic schemes that are designed to be resistant to attacks from quantum computers. On August 13, 2024, NIST published the final standards for the first three post-quantum algorithms to come out of the competition: ML-KEM for key agreement, and ML-DSA and SLH-DSA for digital signatures. A fourth standard based on FALCON is planned for release in late 2024 and will be dubbed FN-DSA, short for FFT (fast-Fourier transform) over NTRU-Lattice-Based Digital Signature Algorithm.

The publication of the final standards marks a significant milestone in an eight-year global community effort managed by NIST to prepare for the arrival of quantum computers. Teams of cryptographers from around the world jointly submitted 82 algorithms to the first round of the competition in 2017. After years of evaluation and cryptanalysis from the global cryptography community, NIST winnowed the algorithms under consideration down through several rounds until they decided upon the first four algorithms to standardize, which they announced in 2022.

This has been a monumental effort, and we would like to extend our gratitude to NIST and all the cryptographers and engineers across academia and industry that participated.

Security was a primary concern in the selection process, but algorithms also need to be performant enough to be deployed in real-world systems. Cloudflare’s involvement in the NIST competition began in 2019 when we performed experiments with industry partners to evaluate how algorithms under consideration performed when deployed on the open Internet. Gaining practical experience with the new algorithms was a crucial part of the evaluation process, and helped to identify and remove obstacles for deploying the final standards.

Having standardized algorithms is a significant step, but migrating systems to use these new algorithms is going to require a multi-year effort. To understand the effort involved, let’s look at two classes of traditional cryptography that are susceptible to quantum attacks: key agreement and digital signatures.

Key agreement allows two parties that have never communicated before to establish a shared secret over an insecure communication channel (like the Internet). The parties can then use this shared secret to encrypt future communications between them. An adversary may be able to observe the encrypted communication going over the network, but without access to the shared secret they cannot decrypt and “see inside” the encrypted packets.

However, in what is known as the “harvest now, decrypt later” threat model, an adversary can store encrypted data until some point in the future when they gain access to a sufficiently large quantum computer, and then can decrypt at their leisure. Thus, today’s communication is already at risk from a future quantum adversary, and it is urgent that we upgrade systems to use post-quantum key agreement as soon as possible.

In 2022, soon after NIST announced the first set of algorithms to be standardized, Cloudflare worked with industry partners to deploy a preliminary version of ML-KEM to protect traffic arriving at Cloudflare’s servers (and our internal systems), both to pave the way for adoption of the final standard and to start protecting traffic as soon as possible. As of mid-August 2024, over 16% of human-generated requests to Cloudflare’s servers are already protected with post-quantum key agreement.

Percentage of human traffic to Cloudflare protected by X25519Kyber, a preliminary version of ML-KEM as shown on Cloudflare Radar.

Other players in the tech industry have deployed post-quantum key agreement as well, including Google, Apple, Meta, and Signal.

Signatures are crucial to ensure that you’re communicating with who you think you are communicating. In the web public key infrastructure (WebPKI), signatures are used in certificates to prove that a website operator is the rightful owner of a domain. The threat model for signatures is different than for key agreement. An adversary capable of forging a digital signature could carry out an active attack to impersonate a web server to a client, but today’s communication is not yet at risk.

While the migration to post-quantum signatures is less urgent than the migration for key agreement (since traffic is only at risk once CRQCs exist), it is much more challenging. Consider, for instance, the number of parties involved. In key agreement, only two parties need to support a new key agreement protocol: the client and the server. In the WebPKI, there are many more parties involved, from library developers, to browsers, to server operators, to certificate authorities, to hardware manufacturers. Furthermore, post-quantum signatures are much larger than we’re used to from traditional signatures. For more details on the tradeoffs between the different signature algorithms, deployment challenges, and out-of-the-box solutions see our previous blog post.

Reaching consensus on the right approach for migrating to post-quantum signatures is going to require extensive effort and coordination among stakeholders. However, that work is already well underway. For instance, in 2021 we ran large scale experiments to understand the feasibility of post-quantum signatures in the WebPKI, and we have more studies planned.

What’s next?

Now that NIST has published the first set of standards for post-quantum cryptography, what comes next?

In 2022, Cloudflare deployed a preliminary version of the ML-KEM key agreement algorithm, Kyber, which is now used to protect double-digit percentages of requests to Cloudflare’s network. We use a hybrid with X25519, to hedge against future advances in cryptanalysis and implementation vulnerabilities. In coordination with industry partners at the NIST NCCoE and IETF, we will upgrade our systems to support the final ML-KEM standard, again using a hybrid. We will slowly phase out support for the pre-standard version X25519Kyber768 after clients have moved to the ML-KEM-768 hybrid, and will quickly phase out X25519Kyber512, which hasn’t seen real-world usage.

Now that the final standards are available, we expect to see widespread adoption of ML-KEM industry-wide as support is added in software and hardware, and post-quantum becomes the new default for key agreement. Organizations should look into upgrading their systems to use post-quantum key agreement as soon as possible to protect their data from future quantum-capable adversaries. Check if your browser already supports post-quantum key agreement by visiting pq.cloudflareresearch.com, and if you’re a Cloudflare customer, see how you can enable post-quantum key agreement support to your origin today.

Adoption of the newly-standardized post-quantum signatures ML-DSA and SLH-DSA will take longer as stakeholders work to reach consensus on the migration path. We expect the first post-quantum certificates to be available in 2026, but not to be enabled by default. Organizations should prepare for a future flip-the-switch migration to post-quantum signatures, but there is no need to flip the switch just yet.

We’ll continue to provide updates in this blog and at pq.cloudflareresearch.com. Don’t hesitate to reach out to us at [email protected] with any questions.

Command Like a Pro with New Backblaze B2 CLI Enhancements

Post Syndicated from Bala Krishna Gangisetty original https://www.backblaze.com/blog/command-like-a-pro-with-new-backblaze-b2-cli-enhancements/

An image of a computer monitor with the words B2 Command Line Interface Tool Version 4.1.0

The tools you use impact your efficiency, productivity, and the quality of your work. That’s true whether you’re a carpenter looking for the best saw blades, a chef choosing high-quality knives, or a developer or programmer investing in top-notch software. The B2 Command Line Interface (CLI) is one tool that you can use to interact with B2 Cloud Storage, and some recent improvements make it a more powerful, intuitive part of your arsenal. 

It’s been a while since our last blog about the Backblaze B2 Command Line Tool (B2 CLI for short). Today, we’re sharing more details on the key enhancements and new features as part of the B2 CLI version 4.1.0.

Let’s dive into the highlights of these changes and explore how they can elevate your B2 CLI experience.

User experience enhancements

1. A new nested command structure

Gone are the days of sifting through a long list of commands to find what you need. The B2 command structure has been revamped to be more intuitive and organized. With the new nested command structure, related commands are logically grouped together. The new structure looks like b2 <resource>. It makes it easier for you to locate and utilize the functionality you require. Whether you’re managing files, buckets, keys, or accounts, commands are now categorized in a way that aligns with their functions. This gives you a clearer and more concise enhanced user experience.

An image listing the usage tags for the Backblaze B2 CLI
New command structure.

2. Streamlined ls and rm commands

Why use two when one will do? The b2 ls and b2 rm commands can now accept a single cohesive string, B2 URI (e.g., b2://bucketName/path), instead of two separate positional arguments, giving you enhanced consistency and usability. It simplifies the command syntax and reduces potential for errors by eliminating the chance of misplacing or mistyping one of the separate arguments. And it ensures that the bucket and file path are always correctly associated with each other. This change minimizes confusion and helps to avoid common mistakes that can occur with multiple arguments.

In addition, some commands, such as b2 file large parts, accept a B2 ID URI (e.g. b2id://4_zf1f51fb…), which specifies a file by its unique identifier (a.k.a. Fguid).

Some redundant commands have also been deprecated with the introduction of the B2 and B2 URIs. For example: download-file-by-id and download-file-by-name functionality is available through b2 file download b2://bucketname/path and b2 file download b2id://fileid command.

3. Enhanced credential management

To enhance security and performance, the CLI will no longer persist credentials on disk if they are passed through B2_* environment variables (that is, B2_application_key_id and B2_application_key). This reduces the risk of unauthorized access to your credentials and improves the overall security of your environment.

At the same time, it’s important that security is balanced with performance. To address this, you can persist your credentials to local cache and can continue using local cache for better performance. You can explicitly choose to persist your credentials by using the b2 account authorize command. 

By eliminating the automatic persistence of credentials from environment variables and providing a clear method to manage local caching, you now have a balanced approach that keeps your data secure while ensuring efficient CLI operations.

4. Transition to kebab-case flags

Previously CLI flags had mixed camelCase and kebab-case styles. Users needed to remember the style to use it along with the name for the option. But kebab-case, where words are separated by hyphens (e.g., --my-flag), offers a clearer and more straightforward way to read and interpret flags. We’ve transitioned all CLI flags to --kebab-case. This style not only enhances readability, making it easier to understand complex commands at a glance, but also makes it easy to remember. It’s particularly beneficial when flags are composed of multiple words, as it reduces visual clutter and makes the flag names more accessible.

5. Simplified listing with ls

Ever wondered how to list all your buckets in one go? Now, you can call b2 ls without any arguments to do this. Whether you’re managing multiple buckets or just need a quick overview of your entire bucket inventory, the ability to list all buckets with a single command saves you time and effort. The enhancement to the b2 ls command is all about making your life easier. (As an aside, it’s also the quickest way to check that Backblaze B2 is correctly configured and you’re using the right set of credentials.)

6. Handy aliases for common flags

Why go the long way when you can take shortcuts? You can now use -r as an alias for the --recursive argument and -q for the --quiet argument. These shortcuts make your command-line interactions quicker and more efficient. You can get things done with fewer presses.

7. Global quiet mode

The --quiet option is now available for all commands, allowing you to suppress all messages printed to stdout and stderr. This is particularly useful for scripting and automation, where you want to minimize output.

8. Autocomplete

This enhancement for the B2 CLI means that you no longer have to remember and type out lengthy command arguments or options manually. As you start typing a command, the CLI will provide you with suggestions for completing the command, options, and arguments based on the context of your input. This can significantly save up your time and help you avoid typos or incorrect entries.

New features to boost your productivity

In addition to the CLI enhancements, we’ve also recently announced a few new features and capabilities for Backblaze B2, including:

  • Event Notifications: Event Notifications helps you automate workflows and integrate Backblaze B2 with other tools and systems. You can now manage Event Notification rules through b2 bucket notification-rule commands directly from the CLI. The feature is available in public preview. If you’re interested, check out the announcement and sign up here.  
  • Unhide files with ease: Previously, if you needed to reverse the hiding of a file, the process could be cumbersome or require multiple steps. To restore hidden files, using the b2 file unhide command is now as simple as it sounds. You only need to specify the file you want to unhide, and the command will handle the rest. This ensures that you can quickly and accurately restore file visibility without unnecessary complications. Whether you’ve hidden previous backup files and need to access them again, or when reorganizing your storage or adjusting file visibility for different users, or if you unintentionally hide files and need to make them visible for auditing or review purposes, you can use this command swiftly.
  • Custom file upload timestamps: You can now enable custom file upload timestamps on your account, enabling you to preserve original upload times for your files. This feature is ideal for maintaining accurate records for compliance and reporting, and it gives you greater control over the file metadata. If you’d like to enable the feature, please reach out to Backblaze Support.

In addition to the above highlights, we’ve implemented crucial fixes to improve the stability and reliability of the CLI. We’ve also made several improvements to our documentation, ensuring you have the guidance you need right at your fingertips.

Start using the new features today

The easier we can make your CLI experience, the easier your job becomes and the more you can get out of Backblaze B2. Install or upgrade the B2 CLI today to take advantage of all the new features.

As always, we value your feedback. If you have any thoughts or experiences to share as you start using the new enhancements and features, please let us know in the comments or submit feedback via our Product Portal. Your input is crucial in helping us continue to improve and innovate.

Happy coding, and enjoy the new B2 CLI offerings!

The post Command Like a Pro with New Backblaze B2 CLI Enhancements appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

AWS Lambda introduces recursive loop detection APIs

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/aws-lambda-introduces-recursive-loop-detection-apis/

This post is written by James Ngai, Senior Product Manager, AWS Lambda, and Aneel Murari, Senior Specialist SA, Serverless.

Today, AWS Lambda is announcing new recursive loop detection APIs that allow you to set recursive loop detection configuration on individual Lambda functions. This allows you to turn off recursive loop detection on functions that intentionally use recursive patterns, avoiding disruption of these workloads. You can use these APIs to avoid disruption to any intentionally recursive workflows as Lambda expands support of recursive loop detection to other AWS services.

Overview

AWS Lambda functions are triggered in response to events generated by various AWS services. These Lambda functions may interact with other AWS services by invoking the corresponding service APIs. Typically, the service and resource that generates the triggering event is distinct from the service and resource that the Lambda function calls. However, due to coding errors or configuration issues, there may be situations where these two resources are the same, leading to an infinite or recursive loop. Such misconfigurations can result in runaway workloads, which can incur unplanned usage and charges to your AWS account. For example, a Lambda function processes messages from an Amazon Simple Notification Service (SNS) topic but then puts the resulting notification back to the same SNS topic. This causes an infinite loop.

Lambda provides a built-in preventative guardrail that detects and stops functions running in a recursive or infinite loop between Lambda, Amazon Simple Queue Service (SQS), and SNS. This feature, known as recursive loop detection, is enabled by default for all Lambda functions. This serves as a protective mechanism against unintended usage and unexpected billing from runaway workloads.

Lambda uses an AWS X-Ray trace header primitive called “lineage” to track the number of times a function has been invoked with an event. When your function code sends an event using a supported AWS SDK version, Lambda increments the counter in the lineage header. If your function is then invoked with the same triggering event more than 16 times, Lambda stops the next invocation for that event and emits an Amazon CloudWatch RecursiveInvocationsDropped metric. If the function is invoked synchronously, Lambda returns a RecursiveInvocationException to the caller. For asynchronous invocations, Lambda sends the event to a dead-letter queue or on-failure destination if one is configured.

You do not need to configure active X-Ray tracing for this feature to work. For more information on this feature and an example scenario, please refer to Detecting and stopping recursive loops in AWS Lambda functions.

Although AWS generally discourages this practice due to the possibility of runaway workloads, some customers intentionally employ recursive patterns in their workflows. Previously, customers that run workloads that intentionally use recursive patterns could only opt-out of recursive loop detection on a per-account basis by contacting AWS Support. With these new APIs, customers can selectively opt-out of recursive loop detection on individual functions while maintaining this preventative guardrail for the remaining functions in their account that do not use recursive code.

Today we are introducing two new API actions for recursive loop detection:

  • GetFunctionRecursiveConfig returns details about a function’s recursive loop detection configuration.
  • PutFunctionRecursiveConfig sets the recursive loop detection configuration for a function. By default, recursive loop detection is turned ON for all functions.

How to use the new recursive loop detection APIs

You can configure recursive loop detection for Lambda functions through the Lambda Console, the AWS CLI, or Infrastructure as Code tools like AWS CloudFormation, AWS Serverless Application Model (AWS SAM), or AWS Cloud Development Kit (CDK). This new configuration option is supported in AWS SAM CLI version 1.123.0 and CDK v2.153.0.

If you turn recursive loop detection off for a function, the metric for RecursiveInvocationsDropped is no longer emitted for that function.

Turning off recursive loop detection on your function means that Lambda no longer prevents recursive invocations caused by misconfiguration. This may lead to unexpected usage and billing to your AWS account. You should explore alternate ways of architecting your workload that do not use recursive patterns. AWS recommends you exercise caution when turning off this guardrail feature.

Setting recursive loop detection configuration on a function using the Lambda Console

You can get recursive loop detection configuration in the AWS Lambda console:

  1. In the AWS Lambda Console, navigate to the Functions page. Select the function that uses intentionally recursive patterns.
  2. Select Configuration. You can find recursive loop detection controls under the Concurrency and recursion detection section.
  3. Recursive loop detection controls in the Lambda console

    Recursive loop detection controls in the Lambda console

  4. Recursive loop detection is turned on by default for all functions. You can change the recursive loop detection configuration of a function by choosing Edit.
  5. To turn off recursive loop detection for a function, select Allow recursive loops and select Save.
Setting to allow recursive loops

Setting to allow recursive loops

Setting recursive loop detection configuration using the AWS CLI

You can get the current recursion loop detection configuration of a Lambda function by using the following CLI command:

aws lambda get-function-recursion-config \
--region $AWS_REGION \
--function-name $FUNCTION_NAME

You can update the recursion loop detection configuration for a Lambda function by using the following CLI command:

aws lambda put-function-recursion-config \
--region $AWS_REGION \
--function-name $FUNCTION_NAME \
--recursive-loop Allow|Terminate

Make sure to set appropriate values for AWS_REGION and FUNCTION_NAME in the previous commands. Setting the put-function-recursion-config parameter to Allow turns off the default behavior of detecting recursive loops. Set this value to Terminate to switch back to default behavior.

Setting recursive loop detection configuration using AWS CloudFormation

You can control the recursive loop detection configuration for a Lambda function by setting the RecursiveLoop resource property in CloudFormation. Setting the value of this property to Allow turns off the default behavior of automatically detecting recursive loops. Set this property to Terminate if you want to switch it back to the default behavior. The following CloudFormation snippet shows RecursiveLoop set to Allow.

LambdaFunction:
    Type: AWS::Lambda::Function                                                                                                                                                                                    
    Properties:                                                                                                                                                                                       
      Code:                                                                                                                                                                                          
        S3Bucket:S3_BUCKET                                                                                                                                                                            
        S3Key: S3_KEY      
      Handler: com.example.App::handleRequest                                                                                                                                                        
      MemorySize: 1024
      Role:                                                                                                                                                                                          
        Fn::GetAtt:                                                                                                                                                                                  
        - LambdaFunctionRole                                                                                                                                                                         
        - Arn                                                                                                                                                                               
      Runtime: java17
      RecursiveLoop : Allow                                                                                                                                                                                                                                                                           
      Timeout: 20                                                                                                                                                                        
      TracingConfig:                                                                                                                                                                               
        Mode: Active                                                                                                                                                                                        

Extending recursive loop detection to additional AWS services

Today, recursive loop detection detects and stops loops between Lambda, SQS, and SNS after approximately 16 invocations. Lambda plans to extend support for recursive loop detection to additional AWS services. Using the APIs, you can turn off recursive loop detection for specific functions that use recursive patterns so that they are not impacted when Lambda expands recursive loop detection to additional AWS services in the future.

One way you can identify functions that use recursive patterns is by using the CloudWatch metric RecursiveInvocationsDropped.

  1. Set a CloudWatch alarm on all Lambda functions for the CloudWatch metric RecursiveInvocationsDropped. Configure the alarm to trigger when the metric is greater than a threshold of zero. Refer to CloudWatch documentation to set alarms. You can use the following CLI command to set this alarm:
  2. aws cloudwatch put-metric-alarm --alarm-name lambda-recursive-alarm --metric-name RecursiveInvocationsDropped --namespace AWS/Lambda --statistic Sum --period 60 --threshold 0 --comparison-operator GreaterThanOrEqualToThreshold --evaluation-periods 1 --alarm-actions $arn-of-sns-notification-topic
  3. When Lambda detects recursive invocations, it will emit the RecursiveInvocationsDropped metric, which will trigger the alarm. Note that Lambda will only detect and stop recursive invocations if all the services within the loop support recursive loop detection.
  4. Navigate to the CloudWatch Console and determine which function has emitted the RecursiveInvocationsDropped metric. On the Browse tab, under Metrics, choose to view metrics By Function Name and search for RecursiveInvocationsDropped. This will list all functions that have emitted that metric.
  5. RecursiveInvocationsDropped metric

    RecursiveInvocationsDropped metric

  6. Determine if recursion is the intended pattern for that function. If so, use the recursive loop detection API to turn off recursive loop detection for this function.

Conclusion

Lambda recursive loop detection automatically detects and stops recursive invocations between Lambda and supported services, preventing runaway workloads. In most cases, you should architect your workloads to avoid any recursive loops. In rare and special circumstances, you may want to turn off the default behavior on a case-by-case basis. The recursive loop detection APIs allow you to set recursive loop detection configuration on individual functions.

This feature is available in all AWS Regions where Lambda supports recursive loop detection.

To learn more about these APIs, refer to the AWS Lambda API Reference.

For more serverless learning resources, visit Serverless Land

Reducing long-term logging expenses by 4,800% with Amazon OpenSearch Service

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/reducing-long-term-logging-expenses-by-4800-with-amazon-opensearch-service/

When you use Amazon OpenSearch Service for time-bound data like server logs, service logs, application logs, clickstreams, or event streams, storage cost is one of the primary drivers for the overall cost of your solution. Over the last year, OpenSearch Service has released features that have opened up new possibilities for storing your log data in various tiers, enabling you to trade off data latency, durability, and availability. In October 2023, OpenSearch Service announced support for im4gn data nodes, with NVMe SSD storage of up to 30 TB. In November 2023, OpenSearch Service introduced or1, the OpenSearch-optimized instance family, which delivers up to 30% price-performance improvement over existing instances in internal benchmarks and uses Amazon Simple Storage Service (Amazon S3) to provide 11 nines of durability. Finally, in May 2024, OpenSearch Service announced general availability for Amazon OpenSearch Service zero-ETL integration with Amazon S3. These new features join OpenSearch’s existing UltraWarm instances, which provide an up to 90% reduction in storage cost per GB, and UltraWarm’s cold storage option, which lets you detach UltraWarm indexes and durably store rarely accessed data in Amazon S3.

This post works through an example to help you understand the trade-offs available in cost, latency, throughput, data durability and availability, retention, and data access, so that you can choose the right deployment to maximize the value of your data and minimize the cost.

Examine your requirements

When designing your logging solution, you need a clear definition of your requirements as a prerequisite to making smart trade-offs. Carefully examine your requirements for latency, durability, availability, and cost. Additionally, consider which data you choose to send to OpenSearch Service, how long you retain data, and how you plan to access that data.

For the purposes of this discussion, we divide OpenSearch instance storage into two classes: ephemeral backed storage and Amazon S3 backed storage. The ephemeral backed storage class includes OpenSearch nodes that use Nonvolatile Memory Express SSDs (NVMe SSDs) and Amazon Elastic Block Store (Amazon EBS) volumes. The Amazon S3 backed storage class includes UltraWarm nodes, UltraWarm cold storage, or1 instances, and Amazon S3 storage you access with the service’s zero-ETL with Amazon S3. When designing your logging solution, consider the following:

  • Latency – if you need results in milliseconds, then you must use ephemeral backed storage. If seconds or minutes are acceptable, you can lower your cost by using Amazon S3 backed storage.
  • Throughput – As a general rule, ephemeral backed storage instances will provide higher throughput. Instances that have NVMe SSDs, like the im4gn, generally provide the best throughput, with EBS volumes providing good throughput. or1 instances take advantage of Amazon EBS storage for primary shards while using Amazon S3 with segment replication to reduce the compute cost of replication, thereby offering indexing throughput that can match or even exceed NVMe-based instances.
  • Data durability – Data stored in the hot tier (you deploy these as data nodes) has the lowest latency, and also the lowest durability. OpenSearch Service provides automated recovery of data in the hot tier through replicas, which provide durability with added cost. Data that OpenSearch stores in Amazon S3 (UltraWarm, UltraWarm cold storage, zero-ETL with Amazon S3, and or1 instances) gets the benefit of 11 nines of durability from Amazon S3.
  • Data availabilityBest practices dictate that you use replicas for data in ephemeral backed storage. When you have at least one replica, you can continue to access all of your data, even during a node failure. However, each replica adds a multiple of cost. If you can tolerate temporary unavailability, you can reduce replicas through or1 instances, with Amazon S3 backed storage.
  • Retention – Data in all storage tiers incurs cost. The longer you retain data for analysis, the more cumulative cost you incur for each GB of that data. Identify the maximum amount of time you must retain data before it loses all value. In some cases, compliance requirements may restrict your retention window.
  • Data access – Amazon S3 backed storage instances generally have a much higher storage to compute ratio, providing cost savings but with insufficient compute for high-volume workloads. If you have high query volume or your queries span a large volume of data, ephemeral backed storage is the right choice. Direct query (Amazon S3 backed storage) is perfect for large volume queries for infrequently queried data.

As you consider your requirements along these dimensions, your answers will guide your choices for implementation. To help you make trade-offs, we work through an extended example in the following sections.

OpenSearch Service cost model

To understand how to cost an OpenSearch Service deployment, you need to understand the cost dimensions. OpenSearch Service has two different deployment options: managed clusters and serverless. This post considers managed clusters only, because Amazon OpenSearch Serverless already tiers data and manages storage for you. When you use managed clusters, you configure data nodes, UltraWarm nodes, and cluster manager nodes, selecting Amazon Elastic Compute Cloud (Amazon EC2) instance types for each of these functions. OpenSearch Service deploys and manages these nodes for you, providing OpenSearch and OpenSearch Dashboards through a REST endpoint. You can choose Amazon EBS backed instances or instances with NVMe SSD drives. OpenSearch Service charges an hourly cost for the instances in your managed cluster. If you choose Amazon EBS backed instances, the service will charge you for the storage provisioned, and any provisioned IOPs you configure. If you choose or1 nodes, UltraWarm nodes, or UltraWarm cold storage, OpenSearch Service charges for the Amazon S3 storage consumed. Finally, the service charges for data transferred out.

Example use case

We use an example use case to examine the trade-offs in cost and performance. The cost and sizing of this example are based on best practices, and are directional in nature. Although you can expect to see similar savings, all workloads are unique and your actual costs may vary substantially from what we present in this post.

For our use case, Fizzywig, a fictitious company, is a large soft drink manufacturer. They have many plants for producing their beverages, with copious logging from their manufacturing line. They started out small, with an all-hot deployment and generating 10 GB of logs daily. Today, that has grown to 3 TB of log data daily, and management is mandating a reduction in cost. Fizzywig uses their log data for event debugging and analysis, as well as historical analysis over one year of log data. Let’s compute the cost of storing and using that data in OpenSearch Service.

Ephemeral backed storage deployments

Fizzywig’s current deployment is 189 r6g.12xlarge.search data nodes (no UltraWarm tier), with ephemeral backed storage. When you index data in OpenSearch Service, OpenSearch builds and stores index data structures that are usually about 10% larger than the source data, and you need to leave 25% free storage space for operating overhead. Three TB of daily source data will use 4.125 TB of storage for the first (primary) copy, including overhead. Fizzywig follows best practices, using two replica copies for maximum data durability and availability, with the OpenSearch Service Multi-AZ with Standby option, increasing the storage need to 12.375 TB per day. To store 1 year of data, multiply by 365 days to get 4.5 PB of storage needed.

To provision this much storage, they could also choose im4gn.16xlarge.search instances, or or1.16.xlarge.search instances. The following table gives the instance counts for each of these instance types, and with one, two, or three copies of the data.

. Max Storage (GB)
per Node

Primary

(1 Copy)

Primary + Replica

(2 Copies)

Primary + 2 Replicas

(3 Copies)

im4gn.16xlarge.search 30,000 52 104 156
or1.16xlarge.search 36,000 42 84 126
r6g.12xlarge.search 24,000 63 126 189

The preceding table and the following discussion are strictly based on storage needs. or1 instances and im4gn instances both provide higher throughput than r6g instances, which will reduce cost further. The amount of compute saved varies between 10–40% depending on the workload and the instance type. These savings do not pass straight through to the bottom line; they require scaling and modification of the index and shard strategy to fully realize them. The preceding table and subsequent calculations take the general assumption that these deployments are over-provisioned on compute, and are storage-bound. You would see more savings for or1 and im4gn, compared with r6g, if you had to scale higher for compute.

The following table represents the total cluster costs for the three different instance types across the three different data storage sizes specified. These are based on on-demand US East (N. Virginia) AWS Region costs and include instance hours, Amazon S3 cost for the or1 instances, and Amazon EBS storage costs for the or1 and r6g instances.

.

Primary

(1 Copy)

Primary + Replica

(2 Copies)

Primary + 2 Replicas

(3 Copies)

im4gn.16xlarge.search $3,977,145 $7,954,290 $11,931,435
or1.16xlarge.search $4,691,952 $9,354,996 $14,018,041
r6g.12xlarge.search $4,420,585 $8,841,170 $13,261,755

This table gives you the one-copy, two-copy, and three-copy costs (including Amazon S3 and Amazon EBS costs, where applicable) for this 4.5 PB workload. For this post, “one copy” refers to the first copy of your data, with the replication factor set to zero. “Two copies” includes a replica copy of all of the data, and “three copies” includes a primary and two replicas. As you can see, each replica adds a multiple of cost to the solution. Of course, each replica adds availability and durability to the data. With one copy (primary only), you would lose data in the case of a single node outage (with an exception for or1 instances). With one replica, you might lose some or all data in a two-node outage. With two replicas, you could lose data only in a three-node outage.

The or1 instances are an exception to this rule. or1 instances can support a one-copy deployment. These instances use Amazon S3 as a backing store, writing all index data to Amazon S3, as a means of replication, and for durability. Because all acknowledged writes are persisted in Amazon S3, you can run with a single copy, but with the risk of losing availability of your data in case of a node outage. If a data node becomes unavailable, any impacted indexes will be unavailable (red) during the recovery window (usually 10–20 minutes). Carefully evaluate whether you can tolerate this unavailability with your customers as well as your system (for example, your ingestion pipeline buffer). If so, you can drop your cost from $14 million to $4.7 million based on the one-copy (primary) column illustrated in the preceding table.

Reserved Instances

OpenSearch Service supports Reserved Instances (RIs), with 1-year and 3-year terms, with no up-front cost (NURI), partial up-front cost (PURI), or all up-front cost (AURI). All reserved instance commitments lower cost, with 3-year, all up-front RIs providing the deepest discount. Applying a 3-year AURI discount, annual costs for Fizzywig’s workload gives costs as shown in the following table.

. Primary Primary + Replica Primary + 2 Replicas
im4gn.16xlarge.search $1,909,076 $3,818,152 $5,727,228
or1.16xlarge.search $3,413,371 $6,826,742 $10,240,113
r6g.12xlarge.search $3,268,074 $6,536,148 $9,804,222

RIs provide a straightforward way to save cost, with no code or architecture changes. Adopting RIs for this workload brings the im4gn cost for three copies down to $5.7 million, and the one-copy cost for or1 instances down to $3.2 million.

Amazon S3 backed storage deployments

The preceding deployments are useful as a baseline and for comparison. In actuality, you would choose one of the Amazon S3 backed storage options to keep costs manageable.

OpenSearch Service UltraWarm instances store all data in Amazon S3, using UltraWarm nodes as a hot cache on top of this full dataset. UltraWarm works best for interactive querying of data in small time-bound slices, such as running multiple queries against 1 day of data from 6 months ago. Evaluate your access patterns carefully and consider whether UltraWarm’s cache-like behavior will serve you well. UltraWarm first-query latency scales with the amount of data you need to query.

When designing an OpenSearch Service domain for UltraWarm, you need to decide on your hot retention window and your warm retention window. Most OpenSearch Service customers use a hot retention window that varies between 7–14 days, with warm retention making up the rest of the full retention period. For our Fizzywig scenario, we use 14 days hot retention and 351 days of UltraWarm retention. We also use a two-copy (primary and one replica) deployment in the hot tier.

The 14-day, hot storage need (based on a daily ingestion rate of 4.125 TB) is 115.5 TB. You can deploy six instances of any of the three instance types to support this indexing and storage. UltraWarm stores a single replica in Amazon S3, and doesn’t need additional storage overhead, making your 351-day storage need 1.158 PiB. You can support this with 58 UltraWarm1.large.search instances. The following table gives the total cost for this deployment, with 3-year AURIs for the hot tier. The or1 instances’ Amazon S3 cost is rolled into the S3 column.

. Hot UltraWarm S3 Total
im4gn.16xlarge.search $220,278 $1,361,654 $333,590 $1,915,523
or1.16xlarge.search $337,696 $1,361,654 $418,136 $2,117,487
r6g.12xlarge.search $270,410 $1,361,654 $333,590 $1,965,655

You can further reduce the cost by moving data to UltraWarm cold storage. Cold storage reduces cost by reducing availability of the data—to query the data, you must issue an API call to reattach the target indexes to the UltraWarm tier. A typical pattern for 1 year of data keeps 14 days hot, 76 days in UltraWarm, and 275 days in cold storage. Following this pattern, you use 6 hot nodes and 13 UltraWarm1.large.search nodes. The following table illustrates the cost to run Fizzywig’s 3 TB daily workload. The or1 cost for Amazon S3 usage is rolled into the UltraWarm nodes + S3 column.

. Hot UltraWarm nodes + S3 Cold Total
im4gn.16xlarge.search $220,278 $377,429 $261,360 $859,067
or1.16xlarge.search $337,696 $461,975 $261,360 $1,061,031
r6g.12xlarge.search $270,410 $377,429 $261,360 $909,199

By employing Amazon S3 backed storage options, you’re able to reduce cost even further, with a single-copy or1 deployment at $337,000, and a maximum of $1 million annually with or1 instances.

OpenSearch Service zero-ETL for Amazon S3

When you use OpenSearch Service zero-ETL for Amazon S3, you keep all your secondary and older data in Amazon S3. Secondary data is the higher-volume data that has lower value for direct inspection, such as VPC Flow Logs and WAF logs. For these deployments, you keep the majority of infrequently queried data in Amazon S3, and only the most recent data in your hot tier. In some cases, you sample your secondary data, keeping a percentage in the hot tier as well. Fizzywig decides that they want to have 7 days of all of their data in the hot tier. They will access the rest with direct query (DQ).

When you use direct query, you can store your data in JSON, Parquet, and CSV formats. Parquet format is optimal for direct query and provides about 75% compression on the data. Fizzywig is using Amazon OpenSearch Ingestion, which can write Parquet format data directly to Amazon S3. Their 3 TB of daily source data compresses to 750 GB of daily Parquet data. OpenSearch Service maintains a pool of compute units for direct query. You are billed hourly for these OpenSearch Compute Units (OCUs), scaling based on the amount of data you access. For this conversation, we assume that Fizzywig will have some debugging sessions and run 50 queries daily over one day worth of data (750 GB). The following table summarizes the annual cost to run Fizzywig’s 3 TB daily workload, 7 days hot, 358 days in Amazon S3.

. Hot DQ Cost OR1 S3 Raw Data S3 Total
im4gn.16xlarge.search $220,278 $2,195 $0 $65,772 $288,245
or1.16xlarge.search $337,696 $2,195 $84,546 $65,772 $490,209
r6g.12xlarge.search $270,410 $2,195 $0 $65,772 $338,377

That’s quite a journey! Fizzywig’s cost for logging has come down from as high as $14 million annually to as low as $288,000 annually using direct query with zero-ETL from Amazon S3. That’s a savings of 4,800%!

Sampling and compression

In this post, we have looked at one data footprint to let you focus on data size, and the trade-offs you can make depending on how you want to access that data. OpenSearch has additional features that can further change the economics by reducing the amount of data you store.

For logs workloads, you can employ OpenSearch Ingestion sampling to reduce the size of data you send to OpenSearch Service. Sampling is appropriate when your data as a whole has statistical characteristics where a part can be representative of the whole. For example, if you’re running an observability workload, you can often send as little as 10% of your data to get a representative sampling of the traces of request handling in your system.

You can further employ a compression algorithm for your workloads. OpenSearch Service recently released support for Zstandard (zstd) compression that can bring higher compression rates and lower decompression latencies as compared to the default, best compression.

Conclusion

With OpenSearch Service, Fizzywig was able to balance cost, latency, throughput, durability and availability, data retention, and preferred access patterns. They were able to save 4,800% for their logging solution, and management was thrilled.

Across the board, im4gn comes out with the lowest absolute dollar amounts. However, there are a couple of caveats. First, or1 instances can provide higher throughput, especially for write-intensive workloads. This may mean additional savings through reduced need for compute. Additionally, with or1’s added durability, you can maintain availability and durability with lower replication, and therefore lower cost. Another factor to consider is RAM; the r6g instances provide additional RAM, which speeds up queries for lower latency. When coupled with UltraWarm, and with different hot/warm/cold ratios, r6g instances can also be an excellent choice.

Do you have a high-volume, logging workload? Have you benefitted from some or all of these methods? Let us know!


About the Author

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have vector, search, and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor’s of the Arts from the University of Pennsylvania, and a Master’s of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.