Amazon Redshift: Lower price, higher performance

Post Syndicated from Stefan Gromoll original https://aws.amazon.com/blogs/big-data/amazon-redshift-lower-price-higher-performance/

Like virtually all customers, you want to spend as little as possible while getting the best possible performance. This means you need to pay attention to price-performance. With Amazon Redshift, you can have your cake and eat it too! On real-world workloads, Amazon Redshift delivers up to 4.9 times lower cost per user and up to 7.9 times better price-performance than other cloud data warehouses, using advanced techniques such as concurrency scaling to support hundreds of concurrent users, enhanced string encoding for faster query performance, and Amazon Redshift Serverless performance enhancements. Read on to understand why price-performance matters and why it is best thought of as a measure of how much it costs to achieve a particular level of workload performance: in other words, the return on investment (ROI) of performance.

Because both price and performance enter into the price-performance calculation, there are two ways to think about price-performance. The first way is to hold price constant: if you have $1 to spend, how much performance do you get from your data warehouse? A database with better price-performance will deliver better performance for each $1 spent. Therefore, when holding price constant and comparing two data warehouses that cost the same, the one with better price-performance will run your queries faster. The second way to look at price-performance is to hold performance constant: if you need your workload to finish in 10 minutes, what will it cost? A database with better price-performance will run your workload in 10 minutes at a lower cost. Therefore, when holding performance constant and comparing two data warehouses sized to deliver the same performance, the one with better price-performance will cost less and save you money.
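One rough way to formalize these two views (a sketch of the definitions above, not a formula taken from any benchmark specification) is:

\[
\text{performance per dollar} = \frac{\text{queries completed per hour}}{\text{price per hour}}
\qquad\qquad
\text{cost at fixed performance} = \text{price per hour} \times \text{hours to finish the workload}
\]

A warehouse with better price-performance scores higher on the first quantity for the same spend, and lower on the second for the same workload.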

Finally, another important aspect of price-performance is predictability. Knowing how much your data warehouse is going to cost as the number of users grows is crucial for planning. Your data warehouse should not only deliver the best price-performance today, but also scale predictably and keep delivering the best price-performance as more users and workloads are added. An ideal data warehouse scales linearly: scaling it to deliver twice the query throughput should ideally cost twice as much, or less.

In this post, we share performance results to illustrate how Amazon Redshift delivers significantly better price-performance compared to leading alternative cloud data warehouses. This means that if you spend the same amount on Amazon Redshift as you would on one of these other data warehouses, you will get better performance with Amazon Redshift. Alternatively, if you size your Redshift cluster to deliver the same performance, you will see lower costs compared to these alternatives.

Price-performance for real-world workloads

You can use Amazon Redshift to power a wide variety of workloads: batch processing of complex extract, transform, and load (ETL)-based reports, real-time streaming analytics, low-latency business intelligence (BI) dashboards that serve hundreds or even thousands of concurrent users with subsecond response times, and everything in between. One of the ways we continually improve price-performance for our customers is by constantly reviewing software and hardware performance telemetry from the Redshift fleet, looking for opportunities and customer use cases where we can further improve Amazon Redshift performance.

Some recent examples of performance optimizations driven by fleet telemetry include:

  • String query optimizations – By analyzing how Amazon Redshift processed different data types in the Redshift fleet, we found that optimizing string-heavy queries would bring significant benefit to our customers’ workloads. (We discuss this in more detail later in this post.)
  • Automated materialized views – We found that Amazon Redshift customers often run many queries that have common subquery patterns. For example, several different queries may join the same three tables using the same join condition. Amazon Redshift can now automatically create and maintain materialized views and then transparently rewrite queries to use them, through the machine learning-based automated materialized views feature in Amazon Redshift. When enabled, automated materialized views can transparently increase query performance for repetitive queries without any user intervention. (Note that automated materialized views were not used in any of the benchmark results discussed in this post.)
  • High-concurrency workloads – A growing use case we see is using Amazon Redshift to serve dashboard-like workloads. These workloads are characterized by desired query response times of single-digit seconds or less, with tens or hundreds of concurrent users running queries simultaneously with a spiky and often unpredictable usage pattern. The prototypical example of this is an Amazon Redshift-backed BI dashboard that has a spike in traffic Monday mornings when a large number of users start their week.

High-concurrency workloads in particular have very broad applicability: most data warehouse workloads run with some degree of concurrency, and it’s not uncommon for hundreds or even thousands of users to run queries on Amazon Redshift at the same time. Amazon Redshift was designed to keep query response times predictable and fast. Redshift Serverless does this automatically for you by adding and removing compute as needed to keep query response times fast and predictable. This means a Redshift Serverless-backed dashboard that loads quickly when it’s being accessed by one or two users will continue to load quickly even when many users are loading it at the same time.

To simulate this type of workload, we used a benchmark derived from TPC-DS with a 100 GB data set. TPC-DS is an industry-standard benchmark that includes a variety of typical data warehouse queries. At this relatively small scale of 100 GB, queries in this benchmark run on Redshift Serverless in an average of a few seconds, which is representative of what users loading an interactive BI dashboard would expect. We ran from 1 to 200 concurrent streams of this benchmark, simulating 1 to 200 users trying to load a dashboard at the same time. We also repeated the test against several popular alternative cloud data warehouses that support scaling out automatically (if you’re familiar with the post Amazon Redshift continues its price-performance leadership, note that we didn’t include Competitor A because it doesn’t support automatic scale-out). We measured average query response time, meaning how long a user would wait for their queries to finish (or their dashboard to load). The results are shown in the following chart.

Competitor B scales well until around 64 concurrent queries, at which point it is unable to provide additional compute and queries begin to queue, leading to increased query response times. Although Competitor C is able to scale automatically, it scales to lower query throughput than both Amazon Redshift and Competitor B and is not able to keep query runtimes low. In addition, it doesn’t support queueing queries when it runs out of compute, which prevents it from scaling beyond around 128 concurrent users. Additional queries submitted beyond this point are rejected by the system.

Here, Redshift Serverless is able to keep the query response time relatively consistent at around 5 seconds even when hundreds of users are running queries at the same time. The average query response times for Competitors B and C increase steadily as load on the warehouses increases, which results in users having to wait longer (up to 16 seconds) for their queries to return when the data warehouse is busy. This means that if a user is trying to refresh a dashboard (which may even submit several concurrent queries when reloaded), Amazon Redshift would be able to keep dashboard load times far more consistent even if the dashboard is being loaded by tens or hundreds of other users at the same time.

Because Amazon Redshift is able to deliver very high query throughput for short queries (as we wrote about in Amazon Redshift continues its price-performance leadership), it’s also able to handle these higher concurrencies when scaling out more efficiently, and therefore at a significantly lower cost. To quantify this, we look at the price-performance using published on-demand pricing for each of the warehouses in the preceding test, shown in the following chart. It’s worth noting that using Reserved Instances (RIs), especially 3-year RIs purchased with the all upfront payment option, offers the lowest cost for running Amazon Redshift on provisioned clusters, resulting in the best relative price-performance compared to on-demand or other RI options.

So not only is Amazon Redshift able to deliver better performance at higher concurrencies, it’s able to do so at a significantly lower cost. Each data point in the price-performance chart represents the cost to run the benchmark at the specified concurrency. Because the price-performance is linear, we can divide the cost to run the benchmark at any concurrency by that concurrency (the number of concurrent users in the chart) to determine how much each additional user costs for this particular benchmark.

The preceding results are straightforward to replicate. All queries used in the benchmark are available in our GitHub repository and performance is measured by launching a data warehouse, enabling Concurrency Scaling on Amazon Redshift (or the corresponding auto scaling feature on other warehouses), loading the data out of the box (no manual tuning or database-specific setup), and then running a concurrent stream of queries at concurrencies from 1–200 in steps of 32 on each data warehouse. The same GitHub repo references pregenerated (and unmodified) TPC-DS data in Amazon Simple Storage Service (Amazon S3) at various scales using the official TPC-DS data generation kit.
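If you replicate this on a provisioned Redshift cluster, you can also confirm afterward how much concurrency scaling was actually used. The following is a minimal sketch that assumes the documented SVCS_CONCURRENCY_SCALING_USAGE system view is available in your cluster:

-- Inspect recent concurrency scaling activity and accrued usage
-- (assumes the SVCS_CONCURRENCY_SCALING_USAGE system view is present)
SELECT *
FROM svcs_concurrency_scaling_usage
ORDER BY end_time DESC;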

Optimizing string-heavy workloads

As mentioned earlier, the Amazon Redshift team is continuously looking for new opportunities to deliver even better price-performance for our customers. One improvement we recently launched that significantly improved performance is an optimization that accelerates the performance of queries over string data. For example, you might want to find the total revenue generated from retail stores located in New York City with a query like SELECT sum(price) FROM sales WHERE city = 'New York'. This query is applying a predicate over string data (city = 'New York'). As you can imagine, string data processing is ubiquitous in data warehouse applications.

To quantify how often customers’ workloads access strings, we conducted a detailed analysis of string data type usage using fleet telemetry of tens of thousands of customer clusters managed by Amazon Redshift. Our analysis indicates that in 90% of the clusters, string columns constitute at least 30% of all the columns, and in 50% of the clusters, string columns constitute at least 50% of all the columns. Moreover, a majority of all queries run on the Amazon Redshift cloud data warehouse platform access at least one string column. Another important factor is that string data is very often low cardinality, meaning the columns contain a relatively small set of unique values. For example, although an orders table representing sales data may contain billions of rows, an order_status column within that table might contain only a few unique values across those billions of rows, such as pending, in process, and completed.

As of this writing, most string columns in Amazon Redshift are compressed with the LZO or ZSTD algorithms. These are good general-purpose compression algorithms, but they aren’t designed to take advantage of low-cardinality string data. In particular, they require that data be decompressed before being operated on, and they are less efficient in their use of hardware memory bandwidth. For low-cardinality data, there is another type of encoding that is better suited: BYTEDICT. This encoding uses a dictionary-encoding scheme that allows the database engine to operate directly over compressed data without decompressing it first.
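As a minimal sketch (the table and column names are hypothetical, echoing the order_status example above), a low-cardinality string column can be declared with BYTEDICT encoding when the table is created:

-- Hypothetical table: order_status holds only a handful of distinct values
-- across billions of rows, making it a good BYTEDICT candidate
CREATE TABLE orders (
    order_id     BIGINT,
    order_total  DECIMAL(12,2),
    order_status VARCHAR(16) ENCODE BYTEDICT
);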

To further improve price-performance for string-heavy workloads, Amazon Redshift is now introducing additional performance enhancements that make scans and predicate evaluations over low-cardinality string columns encoded as BYTEDICT between 5 and 63 times faster (see the results in the next section) than over alternative compression encodings such as LZO or ZSTD. Amazon Redshift achieves this improvement by vectorizing scans over lightweight, CPU-efficient, BYTEDICT-encoded, low-cardinality string columns. These string-processing optimizations make effective use of the memory bandwidth afforded by modern hardware, enabling real-time analytics over string data. These newly introduced performance capabilities are optimal for low-cardinality string columns with up to a few hundred unique string values.

You can automatically benefit from this new high-performance string enhancement by enabling automatic table optimization in your Amazon Redshift data warehouse. If you don’t have automatic table optimization enabled on your tables, you can receive recommendations from Amazon Redshift Advisor in the Amazon Redshift console about a string column’s suitability for BYTEDICT encoding. You can also define new tables that have low-cardinality string columns with BYTEDICT encoding. String enhancements in Amazon Redshift are now available in all AWS Regions where Amazon Redshift is available.
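For existing tables, the column encoding can also be changed in place. The following is a sketch with hypothetical names; in practice, let automatic table optimization or Advisor recommendations guide which columns to convert:

-- Hypothetical example: switch an existing low-cardinality string column
-- to BYTEDICT encoding without recreating the table
ALTER TABLE orders ALTER COLUMN order_status ENCODE BYTEDICT;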

Performance results

To measure the performance impact of our string enhancements, we generated a 10 TB (terabyte) dataset that consisted of low-cardinality string data. We generated three versions of the data using short, medium, and long strings, corresponding to the 25th, 50th, and 75th percentiles of string lengths from Amazon Redshift fleet telemetry. We loaded this data into Amazon Redshift twice, encoding it in one case with LZO compression and in the other with BYTEDICT compression. Finally, we measured the performance of scan-heavy queries that return many rows (90% of the table), a medium number of rows (50% of the table), and a few rows (1% of the table) over these low-cardinality string datasets. The performance results are summarized in the following chart.

Queries with predicates that match a high percentage of rows saw improvements of 5–30 times with the new vectorized BYTEDICT encoding compared to LZO, whereas queries with predicates that match a low percentage of rows saw improvements of 10–63 times in this internal benchmark.

Redshift Serverless price-performance

In addition to the high-concurrency performance results presented in this post, we also used the TPC-DS-derived Cloud Data Warehouse benchmark to compare the price-performance of Redshift Serverless to other data warehouses using a larger 3TB dataset. We chose data warehouses that were priced similarly, in this case within 10% of $32 per hour using publicly available on-demand pricing. These results show that, like Amazon Redshift RA3 instances, Redshift Serverless delivers better price-performance compared to other leading cloud data warehouses. As always, these results can be replicated by using our SQL scripts in our GitHub repository.

We encourage you to run a proof of concept with your own workloads; this is the best way to see how Amazon Redshift can meet your data analytics needs.

Find the best price-performance for your workloads

The benchmarks used in this post are derived from the industry-standard TPC-DS benchmark, and have the following characteristics:

  • The schema and data are used unmodified from TPC-DS.
  • The queries are generated using the official TPC-DS kit with query parameters generated using the default random seed of the TPC-DS kit. TPC-approved query variants are used for a warehouse if the warehouse doesn’t support the SQL dialect of the default TPC-DS query.
  • The test includes the 99 TPC-DS SELECT queries. It doesn’t include maintenance and throughput steps.
  • For the single-concurrency 3TB test, three power runs were performed, and the best run was taken for each data warehouse.
  • Price-performance for the TPC-DS queries is calculated as cost per hour (USD) times the benchmark runtime in hours, which is equivalent to the cost to run the benchmark. As noted earlier, the latest published on-demand pricing is used for all data warehouses, not Reserved Instance pricing.
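As a short worked example of this calculation with illustrative numbers (the hourly price echoes the roughly $32 per hour configurations mentioned earlier; the 15-minute runtime is hypothetical):

\[
\text{price-performance} = \$32/\text{hour} \times 0.25\ \text{hours} = \$8 \text{ to run the benchmark}
\]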

We call this the Cloud Data Warehouse benchmark, and you can easily reproduce the preceding benchmark results using the scripts, queries, and data available in our GitHub repository. It is derived from the TPC-DS benchmarks as described in this post, and as such is not comparable to published TPC-DS results, because the results of our tests don’t comply with the official specification.

Conclusion

Amazon Redshift is committed to delivering the industry’s best price-performance for the widest variety of workloads. Redshift Serverless scales linearly with the best (lowest) price-performance, supporting hundreds of concurrent users while maintaining consistent query response times. Based on the test results discussed in this post, Amazon Redshift delivers up to 2.6 times better price-performance at the same level of concurrency compared to the nearest competitor (Competitor B). As mentioned earlier, using Reserved Instances with the 3-year all upfront option gives you the lowest cost to run Amazon Redshift, resulting in even better relative price-performance compared to the on-demand pricing we used in this post. Our approach to continuous performance improvement combines customer obsession, to understand customer use cases and their scalability bottlenecks, with continuous analysis of fleet data to identify opportunities for significant performance optimizations.

Each workload has unique characteristics, so if you’re just getting started, a proof of concept is the best way to understand how Amazon Redshift can lower your costs while delivering better performance. When running your own proof of concept, it’s important to focus on the right metrics—query throughput (number of queries per hour), response time, and price-performance. You can make a data-driven decision by running a proof of concept on your own or with assistance from AWS or a system integration and consulting partner.

To stay up to date with the latest developments in Amazon Redshift, follow the What’s New in Amazon Redshift feed.


About the authors

Stefan Gromoll is a Senior Performance Engineer with Amazon Redshift team where he is responsible for measuring and improving Redshift performance. In his spare time, he enjoys cooking, playing with his three boys, and chopping firewood.

Ravi Animi is a Senior Product Management leader in the Amazon Redshift team and manages several functional areas of the Amazon Redshift cloud data warehouse service including performance, spatial analytics, streaming ingestion and migration strategies. He has experience with relational databases, multi-dimensional databases, IoT technologies, storage and compute infrastructure services and more recently as a startup founder using AI/deep learning, computer vision, and robotics.

Aamer Shah is a Senior Engineer in the Amazon Redshift Service team.

Sanket Hase is a Software Development Manager in the Amazon Redshift Service team.

Orestis Polychroniou is a Principal Engineer in the Amazon Redshift Service team.

How many people took advantage of getting an ЕЗОК (European Health Insurance Card) remotely?

Post Syndicated from Боян Юруков original https://yurukov.net/blog/2023/ezok-stats/

At the end of June I told a story that began back in January 2022. It concerns the possibility of having a European Health Insurance Card (ЕЗОК) issued entirely remotely. The newspaper Сега later added that no electronic signature is needed; a personal identification code (ПИК) from the National Revenue Agency (НАП) is enough to submit the application. This simplified the process considerably.

Since then, quite a few people have shared their experience on social media and in the comments here. Once the summer holidays were over, I decided to ask НЗОК (the National Health Insurance Fund) how many people had actually used this option compared to the total number of cards issued. Here is what I learned.

In January 2022, a single application was submitted through ССЕВ (the secure electronic delivery system): mine. After three months of insisting, I received my card in April. That was when I first wrote about the case on Twitter. According to the data provided, the next applications came only in August: 12 in total, three of which were again from me, for the rest of my family. After that there were between 2 and 11 per month from November through April 2023. In May this year I submitted another 4 applications to renew our cards; another 13 people applied then.

At the end of June I published my article on how the process works, and 58 people used it. In July the text was picked up by several media outlets and 195 people applied; in August and September, another 209 in total. Since I showed that this is possible and should be an option, more than 400 people have filled in the application and received their cards by courier.

For comparison, from the beginning of 2022 to the end of September 2023, 283,625 cards were issued at a counter. Assuming that some of them were reissued during that period, this means at least 164 thousand people stood in line twice, almost entirely at branches of the fund's partner banks, to collect their cards from НЗОК.

Renewed cards account for 60% of all cards issued; any card for a person who has ever held an ЕЗОК counts as a renewal. There was a noticeable jump in newly issued cards this summer, partly because of a media campaign about how useful the cards actually are. Interestingly, a fair number of cards are issued throughout the year, not only in the summer months when most travel abroad happens. The curve matches quite well the one for people leaving the country for vacation: as expected, there is a peak in June and July, that is, just before the summer holidays.

Another interesting observation is that a sizable share of the cards issued in 2022 were not reissued soon after their one-year validity expired in 2023; the peak this summer consists mostly of new cards. There is, however, at least one group of about 100 thousand people who renew their cards regularly. You can download all the numbers from НЗОК's response here.

This again raises the question of why any of this is necessary. НЗОК's long-standing explanation is that it prevents abuse. I asked how many card holders had lost their health insurance rights while holding an active card. They told me they don't know. The reason is that they don't track this actively, only when a claim to cover treatment is made. It also turned out that НЗОК has no real-time view of who is paying their health insurance contributions and who isn't: they have to check each person individually for a specific period through an interface of НАП. As you can imagine, this is absurd on many levels.

What they did answer, however, is that of the 140,669 applications submitted by any means in 2022, 2,371, or 1.69%, were refused because of interrupted health insurance rights. That number, and the corresponding percentage, has fallen significantly since the ЕЗОК was introduced: from 5,662 in 2013, through 4,374 in 2018, to about half now compared to 10 years ago. This renders the short validity period of these cards pointless, given that the cost of issuing them every year, distributing them, and the time people lose is greater than the potential losses to the fund.

Until someone at НЗОК starts thinking about this and other far more serious flaws in the system, what the huge number of people losing time at counters and banks can do is use the remote option. What the Ministry of e-Government on one side and the health authorities on the other could help with is to create a simple form that fills everything in automatically, especially given that 90% of the information required by the regulation is already available to the administration ex officio, which makes the regulation itself unlawful.

As of now, under 0.78% of the ЕЗОК cards issued after my article were requested through ССЕВ. I'm not a fan of optimizing processes that shouldn't exist in the first place, but this is a useful exercise in communication and, frankly, in educating our own institutions. I have submitted address registrations, birth certificate requests, freedom-of-information (ЗДОИ) requests, and other documents in the same way.


Let’s Architect! Designing systems for stream data processing

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-designing-systems-for-stream-data-processing/

From customer interactions on e-commerce platforms to social media trends, and from sensor data in internet of things (IoT) devices to financial market updates, streaming data encompasses a vast array of information. The ability to handle this real-time flow often distinguishes successful organizations from their competitors. Harnessing the potential of streaming data processing offers organizations an opportunity to stay at the forefront of their industries, make data-informed decisions with unprecedented agility, and gain invaluable insights into customer behavior and operational efficiency.

AWS provides a foundation for building robust and reliable data pipelines that efficiently transport streaming data, eliminating the intricacies of infrastructure management. This shift empowers engineers to focus their talents and energies on creating business value, rather than spending their time managing infrastructure.

Build Modern Data Streaming Architectures on AWS

In a world of exploding data, traditional on-premises analytics struggle to scale and become cost-prohibitive. A modern data architecture on AWS offers a solution. It lets organizations easily access and analyze their data and break down data silos, all while keeping data secure. This empowers real-time insights and versatile applications, from live dashboards to data lakes and warehouses, transforming the way we harness data.

This whitepaper guides you through implementing this architecture, focusing on streaming technologies. It simplifies data collection, management, and analysis, offering three movement patterns to glean insights from near real-time data using AWS’s tailored analytics services. The future of data analytics has arrived.

Take me to this whitepaper!

A serverless streaming data pipeline using Amazon Kinesis and AWS Glue


Lab: Streaming Data Analytics

In this workshop, you’ll see how to process data in real-time, using streaming and micro-batching technologies in the context of anomaly detection. You will also learn how to integrate Apache Kafka on Amazon Managed Streaming for Apache Kafka (Amazon MSK) with an Apache Flink consumer to process and aggregate the events for reporting purposes.

Take me to this workshop

A cloud architecture used for ingestion and stream processing on AWS


Publishing real-time financial data feeds using Kafka

Streaming architectures built on Apache Kafka follow the publish/subscribe paradigm: producers publish events to topics via a write operation and the consumers read the events.

This video describes how to offer a real-time financial data feed as a service on AWS. By using Amazon MSK, you can work with Kafka to allow consumers to subscribe to message topics containing the data of interest. The session drills down into the best design practices for working with Kafka and the techniques for establishing hybrid connectivity for working at a global scale.

Take me to this video

The topics in Apache Kafka are partitioned for better scaling and replicated for resiliency


How Samsung modernized architecture for real-time analytics

The Samsung SmartThings story is a compelling case study in how businesses can modernize and optimize their streaming data analytics, relieve the burden of infrastructure management, and embrace a future of real-time insights. After Samsung migrated to Amazon Managed Service for Apache Flink, the development team’s focus shifted from the tedium of infrastructure upkeep to the realm of delivering tangible business value. This change enabled them to harness the full potential of a fully managed stream-processing platform.

Take me to this video

The architecture Samsung used in their real-time analytics system


See you next time!

Thanks for reading! Next time, we’ll talk about tools for developers. To find all the posts from this series, check the Let’s Architect! page.

An automated approach to perform an in-place engine upgrade in Amazon OpenSearch Service

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/big-data/an-automated-approach-to-perform-an-in-place-engine-upgrade-in-amazon-opensearch-service/

Software upgrades bring new features and better performance, and keep you current with the software provider. However, upgrades for software services can be difficult to complete successfully, especially when you can’t tolerate downtime and when the new version’s APIs introduce breaking changes and deprecations that you must remediate. This post shows you how to upgrade from the Elasticsearch engine to the OpenSearch engine on Amazon OpenSearch Service without needing an intermediate upgrade to Elasticsearch 7.10.

OpenSearch Service supports OpenSearch as an engine, with versions in the 1.x through 2.x series. The service also supports legacy versions of Elasticsearch, versions 1.x through 7.10. Although OpenSearch brings many improvements over earlier engines, it can feel daunting to consider not only upgrading versions, but also changing engines in the process. The good news is that OpenSearch 1.0 is wire compatible with Elasticsearch 7.10, making engine changes straightforward. If you’re running a version of Elasticsearch in the 6.x or early 7.x series on OpenSearch Service, you might think you need to upgrade to Elasticsearch 7.10 first, and then upgrade to OpenSearch 1.3. However, you can upgrade an existing Elasticsearch engine running 6.8, 7.1, 7.2, 7.4, 7.9, or 7.10 in OpenSearch Service directly to the OpenSearch 1.3 engine.

OpenSearch Service runs through a series of steps before and during an actual upgrade:

  • Validation before starting an upgrade
  • Preparing the setup configuration for the desired version
  • Provisioning new nodes with the same hardware configuration
  • Moving shards from old nodes to newly provisioned nodes
  • Removing older nodes and old node references from OpenSearch endpoints

During an upgrade, AWS takes care of the undifferentiated heavy lifting of provisioning, deploying, and moving the data to the new domain. You are responsible for making sure there are no breaking changes that affect the data migration and movement to the newer version of the OpenSearch domain. In this post, we discuss the things you must modify and verify before and after running an upgrade from Elasticsearch versions 6.8, 7.1, 7.2, 7.4, 7.9, or 7.10 to OpenSearch 1.3 on OpenSearch Service.

Pre-upgrade breaking changes

The following are pre-upgrade breaking changes:

  • Dependency check for language clients and libraries – If you’re using the open-source high-level language clients from Elastic, for example the Java, Go, or Python client libraries, AWS recommends moving to the open-source OpenSearch versions of these clients. (If you don’t use a high-level language client, you can skip this step.) The following are a few steps to perform a dependency check:
    • Determine the client library – Choose an appropriate client library compatible with your programming language. Refer to OpenSearch language clients for a list of all supported client libraries.
    • Add dependencies and resolve conflicts – Update your project’s dependency management system with the necessary dependencies specified by the client library. If your project already has dependencies that conflict with the OpenSearch client library dependencies, you may encounter dependency conflicts. In such cases, you need to resolve the conflicts manually.
    • Test and verify the client – Test the OpenSearch client functionality by establishing a connection, performing some basic operations (like indexing and searching), and verifying the results.
  • Removal of mapping types – Multiple types within an index were deprecated in Elasticsearch version 6.x, and completely removed in version 7.0 or later. OpenSearch indexes can only contain one mapping type. From OpenSearch version 2.x onward, the mapping _type must be _doc. You must check and fix the mapping before upgrading to OpenSearch 1.3.

Complete the following steps to identify and fix mapping issues:

  1. Navigate to dev tools and use the following GET <index> mapping API to fetch the mapping information for all the indexes:
GET /index-name/_mapping

The mapping response will contain a JSON structure that represents the mapping for your index.

  2. Look for the top-level keys in the response JSON; each key represents a custom type within the index.

The _doc type is the default type in Elasticsearch 7.x and OpenSearch Service 1.x, but you may see additional types that you defined in earlier versions of Elasticsearch. The following is an example response for an index with two custom types, type1 and type2.

Note that indexes created in 5.x will continue to function in 6.x as they did in 5.x, but indexes created in 6.x only allow a single type per index.

{
  "myindex": {
    "mappings": {
      "type1": {
        "properties": {
          "field1_type1": {
            "type": "text"
          },
          "field2_type1": {
            "type": "integer"
          }
        }
      },
      "type2": {
        "properties": {
          "field1_type2": {
            "type": "keyword"
          },
          "field2_type2": {
            "type": "date"
          }
        }
      }
    }
  }
}

To fix the multiple mapping types in your existing domain, you need to reindex the data, creating one index for each mapping type. This is a crucial step in the migration process because OpenSearch doesn’t support multiple types within a single index. In the next steps, we convert an index that has multiple mapping types into two separate indexes, each using the _doc type.

  1. You can unify the mapping by using your existing index name as a root and adding the type as a suffix. For example, the following code creates two indexes with myindex as the root name and type1 and type2 as suffixes:
    # Create an index for "type1"
    PUT /myindex_type1
    
    # Create an index for "type2"
    PUT /myindex_type2

  2. Use the _reindex API to reindex the data from the original index into the two new indexes. Alternatively, you can reload the data from its source, if you keep it in another system.
    POST _reindex
    {
      "source": {
        "index": "myindex",
        "type": "type1"  
      },
      "dest": {
        "index": "myindex_type1",
        "type": "_doc"  
      }
    }
    POST _reindex
    {
      "source": {
        "index": "myindex",
        "type": "type2"  
      },
      "dest": {
        "index": "myindex_type2",
        "type": "_doc"  
      }
    }

  3. If your application was previously querying the original index with multiple types, you’ll need to update your queries to specify the new indexes with _doc as the type. For example, if your client was querying using myindex, which has been reindexed to myindex_type1 and myindex_type2, then change your clients to point to myindex*, which will query across both indexes (see the example calls after this list).
  4. After you have verified that the data is successfully reindexed and your application is working as expected with the new indexes, you should delete the original index before starting the upgrade, because it won’t be supported in the new version. Be cautious when deleting data and make sure you have backups if necessary.
  5. As part of this upgrade, Kibana will be replaced with OpenSearch Dashboards. When you’re done with the upgrade in the next step, you should encourage your users to use the new endpoint, which will be /_dashboards. If you use a custom endpoint, be sure to update it to point to /_dashboards.
  6. We recommend that you update your AWS Identity and Access Management (IAM) policies to use the renamed API operations. However, OpenSearch Service will continue to respect existing policies by internally replicating the old API permissions. Service control policies (SCPs) introduce an additional layer of complexity compared to standard IAM. To prevent your SCP policies from breaking, you need to add both the old and the new API operations to each of your SCP policies.
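As a quick illustration of steps 3 and 4 (using the hypothetical myindex* pattern and index names from the example above), a wildcard search queries both new indexes, and the original index can be deleted once the reindexed data is verified:

GET /myindex*/_search
{
  "query": {
    "match_all": {}
  }
}

DELETE /myindex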

Start the upgrade

The upgrade process is irreversible and can’t be paused or cancelled. You should make sure that everything will go smoothly by running a proof of concept (POC) check. By building a POC, you ensure data preservation, avoid compatibility issues, prevent unwanted bugs, and mitigate risks. You can run a POC with a small domain on the new version, and with a small subset of your data. Deploy and run any front-end code that communicates with OpenSearch Service via API against your test domain. Use your ingest pipeline to send data to your test domain as well, ensuring nothing breaks. Import or rebuild dashboards, alerts, anomaly detectors, and so on. This simple precaution can make your upgrade experience smoother and trouble-free while minimizing potential disruptions.

When OpenSearch Service starts the upgrade, it can take from 15 minutes to several hours to complete. OpenSearch Dashboards might be unavailable during some or all of the duration of the upgrade.

In this section, we show how you can start an upgrade using the AWS Management Console. You can also run the upgrade using the AWS Command Line Interface (AWS CLI) or AWS SDK. For more information, refer to Starting an upgrade (CLI) and Starting an upgrade (SDK).

  1. Take a manual snapshot of your domain.

This snapshot serves as a backup that you can restore on a new domain if you want to return to using the prior version.

  2. On the OpenSearch Service console, choose the domain that you want to upgrade.
  3. On the Actions menu, choose Upgrade.
  4. Select the target version as OpenSearch 1.3. If you’re upgrading to an OpenSearch version, you’ll be able to enable compatibility mode. If you enable this setting, OpenSearch reports its version as 7.10 to allow Elastic’s open-source clients and plugins like Logstash to continue working with your OpenSearch Service domain.
  5. Choose Check Upgrade Eligibility, which helps you identify any breaking changes you still need to fix before running the upgrade.
  6. Choose Upgrade.
  7. Monitor the status of the upgrade on the domain dashboard.

The following graphic gives a quick demonstration of running and monitoring the upgrade via the preceding steps.

Post-upgrade changes and validations

Now that you have successfully upgraded your domain to OpenSearch Service 1.3, be sure to make the following changes:

  • Custom endpoint – A custom endpoint for your OpenSearch Service domain makes it straightforward to refer to your OpenSearch and OpenSearch Dashboards URLs. You can include your company’s branding or just use a shorter, easier-to-remember endpoint than the standard one. In OpenSearch Service, Kibana has been replaced by OpenSearch Dashboards. After you upgrade a domain from the Elasticsearch engine to the OpenSearch engine, the /_plugin/kibana endpoint changes to /_dashboards. If you use the Kibana endpoint with a custom domain, you need to update it to use the new /_dashboards endpoint.
  • SAML authentication for OpenSearch Dashboards – After you upgrade your domain to OpenSearch Service, you need to change all Kibana URLs configured in your identity provider (IdP) from /_plugin/kibana to /_dashboards. The most common URLs are assertion consumer service (ACS) URLs and recipient URLs.

Conclusion

This post discussed what to consider when planning to upgrade the existing Elasticsearch engine in your OpenSearch Service domain to the OpenSearch engine. OpenSearch Service continues to support older engine versions, including open-source versions of Elasticsearch from 1.x through 7.10. By moving forward to the OpenSearch engine, you can take advantage of bug fixes, performance improvements, new service features, and expanded instance types for your data nodes, such as AWS Graviton2-based instances.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the Authors

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Harsh Bansal is a Solutions Architect at AWS with analytics as his area of specialty. He has been building solutions to help organizations make data-driven decisions.

Create, train, and deploy Amazon Redshift ML model integrating features from Amazon SageMaker Feature Store

Post Syndicated from Anirban Sinha original https://aws.amazon.com/blogs/big-data/create-train-and-deploy-amazon-redshift-ml-model-integrating-features-from-amazon-sagemaker-feature-store/

Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. Data analysts and database developers want to use this data to train machine learning (ML) models, which can then be used to generate insights on new data for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies. Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using SQL commands familiar to many roles such as executives, business analysts, and data analysts. We covered in a previous post how you can use data in Amazon Redshift to train models in Amazon SageMaker, a fully managed ML service, and then make predictions within your Redshift data warehouse.

Redshift ML currently supports ML algorithms such as XGBoost, multilayer perceptron (MLP), KMEANS, and Linear Learner. Additionally, you can import existing SageMaker models into Amazon Redshift for in-database inference or remotely invoke a SageMaker endpoint.

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. However, one challenge in training a production-ready ML model using SageMaker Feature Store is access to a diverse set of features that aren’t always owned and maintained by the team that is building the model. For example, an ML model to identify fraudulent financial transactions needs access to both identifying (device type, browser) and transaction (amount, credit or debit, and so on) related features. As a data scientist building an ML model, you may have access to the identifying information but not the transaction information, and having access to a feature store solves this.

In this post, we discuss the combined feature store pattern, which allows teams to maintain their own local feature stores using a local Redshift table while still being able to access shared features from the centralized feature store. In a local feature store, you can store sensitive data that can’t be shared across the organization for regulatory and compliance reasons.

We also show you how to use familiar SQL statements to create and train ML models by combining shared features from the centralized store with local features and use these models to make in-database predictions on new data for use cases such as fraud risk scoring.

Overview of solution

For this post, we create an ML model to predict if a transaction is fraudulent or not, given the transaction record. To build this, we need to engineer features that describe an individual credit card’s spending pattern, such as the number of transactions or the average transaction amount, and also information about the merchant, the cardholder, the device used to make the payment, and any other data that may be relevant to detecting fraud.

To get started, we need an Amazon Redshift Serverless data warehouse with the Redshift ML feature enabled and an Amazon SageMaker Studio environment with access to SageMaker Feature Store. For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

We also need an offline feature store to store features in feature groups. The offline store uses an Amazon Simple Storage Service (Amazon S3) bucket for storage and can also fetch data using Amazon Athena queries. For an introduction to SageMaker Feature Store and instructions on setting it up, see Getting started with Amazon SageMaker Feature Store.

The following diagram illustrates the solution architecture.

The workflow contains the following steps:

  1. Create the offline feature group in SageMaker Feature Store and ingest data into the feature group.
  2. Create a Redshift table and load local feature data into the table.
  3. Create an external schema for Amazon Redshift Spectrum to access the offline store data stored in Amazon S3 using the AWS Glue Data Catalog.
  4. Train and validate a fraud risk scoring ML model using local feature data and external offline feature store data.
  5. Use the offline feature store and local store for inference.

Dataset

To demonstrate this use case, we use a synthetic dataset with two tables: identity and transactions. The two tables can be joined on the TransactionID column. The transaction table contains information about a particular transaction, such as the amount and whether a credit or debit card was used, and the identity table contains information about the user, such as device type and browser. A transaction always exists in the transaction table, but might not have a matching record in the identity table.

The following is an example of the transactions dataset.

The following is an example of the identity dataset.

Let’s assume that across the organization, data science teams centrally manage the identity data and process it to extract features in a centralized offline feature store. The data warehouse team ingests and analyzes transaction data in a Redshift table, owned by them.

We work through this use case to understand how the data warehouse team can securely retrieve the latest features from the identity feature group and join it with transaction data in Amazon Redshift to create a feature set for training and inferencing a fraud detection model.

Create the offline feature group and ingest data

To start, we set up SageMaker Feature Store, create a feature group for the identity dataset, inspect and process the dataset, and ingest some sample data. We then prepare the transaction features from the transaction data and store it in Amazon S3 for further loading into the Redshift table.

Alternatively, you can author features using Amazon SageMaker Data Wrangler, create feature groups in SageMaker Feature Store, and ingest features in batches using an Amazon SageMaker Processing job with a notebook exported from SageMaker Data Wrangler. This mode allows for batch ingestion into the offline store.

Let’s explore some of the key steps in this section.

  1. Download the sample notebook.
  2. On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  3. Locate your notebook instance and choose Open Jupyter.
  4. Choose Upload and upload the notebook you just downloaded.
  5. Open the notebook sagemaker_featurestore_fraud_redshiftml_python_sdk.ipynb.
  6. Follow the instructions and run all the cells up to the Cleanup Resources section.

The following are key steps from the notebook:

  1. We create a Pandas DataFrame with the initial CSV data. We apply feature transformations for this dataset.
    identity_data = pd.read_csv(io.BytesIO(identity_data_object["Body"].read()))
    transaction_data = pd.read_csv(io.BytesIO(transaction_data_object["Body"].read()))
    
    identity_data = identity_data.round(5)
    transaction_data = transaction_data.round(5)
    
    identity_data = identity_data.fillna(0)
    transaction_data = transaction_data.fillna(0)
    
    # Feature transformations for this dataset are applied 
    # One hot encode card4, card6
    encoded_card_bank = pd.get_dummies(transaction_data["card4"], prefix="card_bank")
    encoded_card_type = pd.get_dummies(transaction_data["card6"], prefix="card_type")
    
    transformed_transaction_data = pd.concat(
        [transaction_data, encoded_card_type, encoded_card_bank], axis=1
    )

  2. We store the processed and transformed transaction dataset in an S3 bucket. This transaction data will be loaded later in the Redshift table for building the local feature store.
    transformed_transaction_data.to_csv("transformed_transaction_data.csv", header=False, index=False)
    s3_client.upload_file("transformed_transaction_data.csv", default_s3_bucket_name, prefix + "/training_input/transformed_transaction_data.csv")

  3. Next, we need a record identifier name and an event time feature name. In our fraud detection example, the column of interest is TransactionID. EventTime can be appended to your data when no timestamp is available. In the following code, you can see how these variables are set, and then EventTime is appended to both features’ data.
    # record identifier and event time feature names
    record_identifier_feature_name = "TransactionID"
    event_time_feature_name = "EventTime"
    
    # append EventTime feature
    identity_data[event_time_feature_name] = pd.Series(
        [current_time_sec] * len(identity_data), dtype="float64"
    )

  4. We then create and ingest the data into the feature group using the SageMaker SDK FeatureGroup.ingest API. This is a small dataset and therefore can be loaded into a Pandas DataFrame. When we work with large amounts of data and millions of rows, there are other scalable mechanisms to ingest data into SageMaker Feature Store, such as batch ingestion with Apache Spark.
    identity_feature_group_name = "identity-feature-group"

    # Load feature definitions into the feature group. The SageMaker Feature Store
    # Python SDK auto-detects the data schema from the input DataFrame; this must
    # happen before create() is called.
    identity_feature_group.load_feature_definitions(data_frame=identity_data)

    identity_feature_group.create(
        s3_uri=<S3_Path_Feature_Store>,
        record_identifier_name=record_identifier_feature_name,
        event_time_feature_name=event_time_feature_name,
        role_arn=<role_arn>,
        enable_online_store=False,
    )

    identity_feature_group.ingest(data_frame=identity_data, max_workers=3, wait=True)
    

  5. We can verify that data has been ingested into the feature group by running Athena queries in the notebook or running queries on the Athena console.

At this point, the identity feature group is created in an offline feature store with historical data persisted in Amazon S3. SageMaker Feature Store automatically creates an AWS Glue Data Catalog for the offline store, which enables us to run SQL queries against the offline data using Athena or Redshift Spectrum.
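For example, a quick validation query in Athena against the auto-created Glue table might look like the following sketch; the table name here matches the auto-generated name used later in this post, and yours will differ:

SELECT *
FROM "sagemaker_featurestore"."identity_feature_group_1680208535"
LIMIT 10;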

Create a Redshift table and load local feature data

To build a Redshift ML model, we build a training dataset joining the identity data and transaction data using SQL queries. The identity data is in a centralized feature store where the historical set of records is persisted in Amazon S3. The transaction data is local feature data for training that needs to be made available in the Redshift table.

Let’s explore how to create the schema and load the processed transaction data from Amazon S3 into a Redshift table.

  1. Create the customer_transaction table and load daily transaction data into the table, which you’ll use to train the ML model:
    DROP TABLE IF EXISTS customer_transaction;
    CREATE TABLE customer_transaction (
      TransactionID INT,    
      isFraud INT,  
      TransactionDT INT,    
      TransactionAmt decimal(10,2), 
      card1 INT,    
      card2 decimal(10,2),card3 decimal(10,2),  
      card4 VARCHAR(20),card5 decimal(10,2),    
      card6 VARCHAR(20),    
      B1 INT,B2 INT,B3 INT,B4 INT,B5 INT,B6 INT,
      B7 INT,B8 INT,B9 INT,B10 INT,B11 INT,B12 INT,
      F1 INT,F2 INT,F3 INT,F4 INT,F5 INT,F6 INT,
      F7 INT,F8 INT,F9 INT,F10 INT,F11 INT,F12 INT,
      F13 INT,F14 INT,F15 INT,F16 INT,F17 INT,  
      N1 VARCHAR(20),N2 VARCHAR(20),N3 VARCHAR(20), 
      N4 VARCHAR(20),N5 VARCHAR(20),N6 VARCHAR(20), 
      N7 VARCHAR(20),N8 VARCHAR(20),N9 VARCHAR(20), 
      card_type_0  boolean,
      card_type_credit boolean,
      card_type_debit  boolean,
      card_bank_0  boolean,
      card_bank_american_express boolean,
      card_bank_discover  boolean,
      card_bank_mastercard  boolean,
      card_bank_visa boolean  
    );

  2. Load the sample data by using the following command. Replace your Region and S3 path as appropriate. You will find the S3 path in the S3 Bucket Setup For The OfflineStore section in the notebook or by checking the dataset_uri_prefix in the notebook.
    COPY customer_transaction
    FROM '<s3path>/transformed_transaction_data.csv' 
    IAM_ROLE default delimiter ',' 
    region 'your-region';

Now that we have created a local feature store for the transaction data, we focus on integrating a centralized feature store with Amazon Redshift to access the identity data.

Create an external schema for Redshift Spectrum to access the offline store data

We have created a centralized feature store for identity features, and we can access this offline feature store using services such as Redshift Spectrum. When the identity data is available through the Redshift Spectrum table, we can create a training dataset with feature values from the identity feature group and customer_transaction, joining on the TransactionId column.

This section provides an overview of how to enable Redshift Spectrum to query data directly from files on Amazon S3 through an external database in an AWS Glue Data Catalog.

  1. First, check that the identity-feature-group table is present in the Data Catalog under the sagemaker_featurestore database.
  2. Using Redshift Query Editor V2, create an external schema using the following command:
    CREATE EXTERNAL SCHEMA sagemaker_featurestore
    FROM DATA CATALOG
    DATABASE 'sagemaker_featurestore'
    IAM_ROLE default
    create external database if not exists;

All the tables, including identity-feature-group external tables, are visible under the sagemaker_featurestore external schema. In Redshift Query Editor v2, you can check the contents of the external schema.

  3. Run the following query to sample a few records—note that your table name may be different:
    Select * from sagemaker_featurestore.identity_feature_group_1680208535 limit 10;

  4. Create a view to join the latest data from identity-feature-group and customer_transaction on the TransactionId column. Be sure to change the external table name in the following statement to match your own external table name:
    create or replace view public.credit_fraud_detection_v
    AS select  "isfraud",
            "transactiondt",
            "transactionamt",
            "card1","card2","card3","card5",
             case when "card_type_credit" = 'False' then 0 else 1 end as card_type_credit,
             case when "card_type_debit" = 'False' then 0 else 1 end as card_type_debit,
             case when "card_bank_american_express" = 'False' then 0 else 1 end as card_bank_american_express,
             case when "card_bank_discover" = 'False' then 0 else 1 end as card_bank_discover,
             case when "card_bank_mastercard" = 'False' then 0 else 1 end as card_bank_mastercard,
             case when "card_bank_visa" = 'False' then 0 else 1 end as card_bank_visa,
            "id_01","id_02","id_03","id_04","id_05"
    from public.customer_transaction ct left join sagemaker_featurestore.identity_feature_group_1680208535 id
    on id.transactionid = ct.transactionid with no schema binding;

Train and validate the fraud risk scoring ML model

Redshift ML gives you the flexibility to specify your own algorithms and model types and also to provide your own advanced parameters, which can include preprocessors, problem type, and hyperparameters. In this post, we create a custom model by specifying AUTO OFF and the model type XGBOOST. By turning AUTO OFF and using XGBoost, we provide the necessary inputs for SageMaker to train the model; a benefit of this can be faster training times. XGBoost is an open-source version of the gradient boosted trees algorithm. For more details on XGBoost, refer to Build XGBoost models with Amazon Redshift ML.

We train the model using 80% of the dataset by filtering on transactiondt < 12517618. The other 20% will be used for inference. A centralized feature store is useful in providing the latest supplementary data for training requests. Note that you will need to provide an S3 bucket name in the CREATE MODEL statement. It will take approximately 10 minutes to create the model.

CREATE MODEL frauddetection_xgboost
FROM (select  "isfraud",
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05"
from credit_fraud_detection_v where transactiondt < 12517618
)
TARGET isfraud
FUNCTION ml_fn_frauddetection_xgboost
IAM_ROLE default
AUTO OFF
MODEL_TYPE XGBOOST
OBJECTIVE 'binary:logistic'
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT(NUM_ROUND '100')
SETTINGS (S3_BUCKET <s3_bucket>);

When you run the create model command, it will complete quickly in Amazon Redshift while the model training is happening in the background using SageMaker. You can check the status of the model by running a show model command:

show model frauddetection_xgboost;

The output of the show model command shows that the model state is TRAINING. It also shows other information such as the model type and the training job name that SageMaker assigned.
After a few minutes, we run the show model command again:

show model frauddetection_xgboost;

Now the output shows the model state is READY. We can also see the train:error score here; a value of 0 indicates a good fit on the training data. Now that the model is trained, we can use it for running inference queries.

Use the offline feature store and local store for inference

We can use the SQL function to apply the ML model to data in queries, reports, and dashboards. Let’s use the function ml_fn_frauddetection_xgboost created by our model against our test dataset by filtering where transactiondt >=12517618, to predict whether a transaction is fraudulent or not. SageMaker Feature Store can be useful in supplementing data for inference requests.

Run the following query to predict whether transactions are fraudulent or not:

select  "isfraud" as "Actual",
        ml_fn_frauddetection_xgboost(
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05") as "Predicted"
from credit_fraud_detection_v where transactiondt >= 12517618;

For binary and multi-class classification problems, we compute the accuracy as the model metric. Accuracy can be calculated based on the following:

accuracy = (sum (actual == predicted)/total) *100

Let’s apply the preceding code to our use case to find the accuracy of the model. We use the test data (transactiondt >= 12517618) to test the accuracy, and use the newly created function ml_fn_frauddetection_xgboost to predict and take the columns other than the target and label as the input:

-- check accuracy 
WITH infer_data AS (
SELECT "isfraud" AS label,
ml_fn_frauddetection_xgboost(
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05") AS predicted,
CASE 
   WHEN label IS NULL
       THEN 0
   ELSE label
   END AS actual,
CASE 
   WHEN actual = predicted
       THEN 1::INT
   ELSE 0::INT
   END AS correct
FROM credit_fraud_detection_v where transactiondt >= 12517618),
aggr_data AS (
SELECT SUM(correct) AS num_correct,
COUNT(*) AS total
FROM infer_data) 

SELECT (num_correct::FLOAT / total::FLOAT) AS accuracy FROM aggr_data;

Clean up

As a final step, clean up the resources:

  1. Delete the Redshift cluster.
  2. Run the Cleanup Resources section of your notebook.

Conclusion

Redshift ML enables you to bring machine learning to your data, powering fast and informed decision-making. SageMaker Feature Store provides a purpose-built feature management solution to help organizations scale ML development across business units and data science teams.

In this post, we showed how you can train an XGBoost model using Redshift ML with data spread across SageMaker Feature Store and a Redshift table. Additionally, we showed how you can make inferences on a trained model to detect fraud using Amazon Redshift SQL commands.


About the authors

Anirban Sinha is a Senior Technical Account Manager at AWS. He is passionate about building scalable data warehouses and big data solutions working closely with customers. He works with large ISVs customers, in helping them build and operate secure, resilient, scalable, and high-performance SaaS applications in the cloud.

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS. He has more than 25 years of experience implementing large-scale data warehouse solutions. He is passionate about helping customers through their cloud journey and using the power of ML within their data warehouse.

Gaurav Singh is a Senior Solutions Architect at AWS, specializing in AI/ML and Generative AI. Based in Pune, India, he focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. In his spare time, Gaurav loves to explore nature, read, and run.

The GERB organized crime group runs rampant in Tsarevo even after causing the floods: a fatal workplace accident and illegal venues. The prosecutor's office and the mayor cover for GERB's oligarch in Tsarevo

Post Syndicated from Николай Марченко original https://bivol.bg/zlopoluka-sech-postroiki.html

Thursday, October 26, 2023


The "shadow mayor" of Tsarevo, Angel Tsigularov, has once again been "let off the hook" instead of being held accountable, with the help of the local authorities and the Prosecutor's Office of the Republic of Bulgaria…

Intel Shows Granite Rapids and Sierra Forest Motherboards at OCP Summit 2023

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/intel-shows-granite-rapids-and-sierra-forest-motherboards-at-ocp-summit-2023-qct-wistron/

Intel showed off upcoming Granite Rapids and Sierra Forest Xeon motherboards at OCP Summit 2023; we also look at upcoming server trends.

The post Intel Shows Granite Rapids and Sierra Forest Motherboards at OCP Summit 2023 appeared first on ServeTheHome.

[$] Better string handling for the kernel

Post Syndicated from corbet original https://lwn.net/Articles/948408/

The C programming language is replete with features that seemed like a good
idea at the time (and perhaps even were good ideas then) that have not aged
well. Most would likely agree that string handling, and the use of
NUL-terminated strings, is one of those. Kernel developers have, for
years, tried to improve the handling of strings in an attempt to slow the
flow of bugs and vulnerabilities that result from mistakes in that area.
Now there is an early discussion on the idea of moving away from
NUL-terminated strings in much of the kernel.

Security updates for Thursday

Post Syndicated from corbet original https://lwn.net/Articles/948930/

Security updates have been issued by Debian (firefox-esr and xorg-server), Fedora (firefox, mbedtls, nodejs18, nodejs20, and xen), Gentoo (libinput, unifi, and USBView), Mageia (python-nltk), Oracle (linux-firmware), Red Hat (nginx:1.22), SUSE (chromium, firefox, java-11-openjdk, jetty-minimal, nghttp2, nodejs18, webkit2gtk3, and zlib), and Ubuntu (linux, linux-lowlatency, linux-oracle-5.15, vim, and xorg-server, xwayland).

Introducing HAR Sanitizer: secure HAR sharing

Post Syndicated from Kenny Johnson original http://blog.cloudflare.com/introducing-har-sanitizer-secure-har-sharing/


Introducing HAR Sanitizer: secure HAR sharing

On Wednesday, October 18, 2023, Cloudflare’s Security Incident Response Team (SIRT) discovered an attack on our systems that originated from an authentication token stolen from one of Okta’s support systems. No Cloudflare customer information or systems were impacted by the incident, thanks to the real-time detection and rapid action of our SIRT in tandem with our Zero Trust security posture and use of hardware keys. With that said, we’d rather not repeat the experience — and so we have built a new security tool that can help organizations render this type of attack obsolete for good.

The bad actor in the Okta breach compromised user sessions by capturing session tokens from administrators at Cloudflare and other impacted organizations. They did this by infiltrating Okta’s customer support system and stealing one of the most common mechanisms for troubleshooting — an HTTP Archive (HAR) file.

HAR files contain a record of a user’s browser session, a kind of step-by-step audit, that a user can share with someone like a help desk agent to diagnose an issue. However, the file can also contain sensitive information that can be used to launch an attack.

As a follow-up to the Okta breach, we are making a HAR file sanitizer available to everyone, not just Cloudflare customers, at no cost. We are publishing this tool under an open source license and are making it available to any support, engineering or security team. At Cloudflare, we are committed to making the Internet a better place and using HAR files without the threat of stolen sessions should be part of the future of the Internet.

HAR Files – a look back in time

Imagine being able to rewind time and revisit every single step a user took during a web session, scrutinizing each request and the responses the browser received.

HAR (HTTP Archive) files are a JSON formatted archive file of a web browser’s interaction with a web application. HAR files provide a detailed snapshot of every request, including headers, cookies, and other types of data sent to a web server by the browser. This makes them an invaluable resource to troubleshoot web application issues especially for complex, layered web applications.

The snapshot that a HAR file captures can contain the following information:

Complete Request and Response Headers: Every piece of data sent and received, including method types (GET, POST, etc.), status codes, URLs, cookies, and more.

Payload Content: Details of what was actually exchanged between the client and server, which can be essential for diagnosing issues related to data submission or retrieval.

Timing Information: Precise timing breakdowns of each phase – from DNS lookup, connection time, SSL handshake, to content download – giving insight into performance bottlenecks.

This information can be difficult to gather from an application’s logs due to the diverse nature of devices, browsers and networks used to access an application. A user would need to take dozens of manual steps. A HAR file gives them a one-click option to share diagnostic information with another party. The file is also standard, providing the developers, support teams, and administrators on the other side of the exchange with a consistent input to their own tooling. This minimizes the frustrating back-and-forth where teams try to recreate a user-reported problem, ensuring that everyone is, quite literally, on the same page.

HAR files as an attack vector

HAR files, while powerful, come with a cautionary note. Within the set of information they contain, session cookies make them a target for malicious actors.

The Role of Session Cookies

Before diving into the risks, it’s crucial to understand the role of session cookies. A session cookie is sent from a server and stored on a user’s browser to maintain stateful information across web sessions for that user. In simpler terms, it’s how the browser keeps you logged into an application for a period of time even if you close the page. Generally, these cookies live in local memory on a user’s browser and are not often shared. However, a HAR file is one of the most common ways that a session cookie could be inadvertently shared.
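
To make this concrete, here is a heavily trimmed, hypothetical fragment of a HAR file (field names follow the HAR 1.2 format); note how the session token is captured both in the Cookie header and in the request’s cookies array:

{
  "log": {
    "version": "1.2",
    "entries": [
      {
        "request": {
          "method": "GET",
          "url": "https://app.example.com/dashboard",
          "headers": [
            { "name": "Cookie", "value": "session=eyJhbGciOiJIUzI1NiJ9.<redacted>" }
          ],
          "cookies": [
            { "name": "session", "value": "eyJhbGciOiJIUzI1NiJ9.<redacted>" }
          ]
        },
        "response": { "status": 200, "headers": [], "cookies": [] }
      }
    ]
  }
}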

If a HAR file with a valid session cookie is shared, then there are a number of potential security threats that user, and company, may be exposed to:

Unauthorized Access: The biggest risk is unauthorized access. If a HAR file with a session cookie lands in the wrong hands, it grants entry to the user’s account for that application. For platforms that store personal data or financial details, the consequences of such a breach can be catastrophic. Especially if the session cookie of a user with administrative or elevated permissions is stolen.

Session Hijacking: Armed with a session cookie, attackers can impersonate legitimate users, a tactic known as session hijacking. This can lead to a range of malicious activities, from spreading misinformation to siphoning off funds.

Persistent Exposure: Unlike other forms of data, a session cookie’s exposure risk doesn’t necessarily end when a user session does. Depending on the cookie’s lifespan, malicious actors could gain prolonged access, repeatedly compromising a user’s digital interactions.

Gateway to Further Attacks: With access to a user’s session, especially an administrator’s, attackers can probe for other vulnerabilities, exploit platform weaknesses, or jump to other applications.

Mitigating the impact of a stolen HAR file

Thankfully, there are ways to render a HAR file inert even if stolen by an attacker. One of the most effective methods is to “sanitize” a HAR file of any session related information before sharing it for debugging purposes.

The HAR sanitizer we are introducing today allows a user to upload any HAR file, and the tool will strip out any session-related cookies or JSON Web Tokens (JWT). The tool is built entirely on Cloudflare Workers, and all sanitization is done client-side, which means Cloudflare never sees the full contents of the session token.
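
As a minimal sketch of the idea (not the actual HAR Sanitizer source), client-side sanitization of a parsed HAR object could look like the following; the redaction rules are deliberately simplified for illustration:

// Redacts session-bearing fields from a parsed HAR object, entirely client-side.
const SENSITIVE_HEADERS = new Set(["cookie", "set-cookie", "authorization"]);
const JWT_PATTERN = /eyJ[\w-]+\.[\w-]+\.[\w-]+/g; // rough shape of a JWT

function sanitizeHar(har) {
  for (const entry of har.log.entries ?? []) {
    for (const part of [entry.request, entry.response]) {
      if (!part) continue;
      part.cookies = []; // drop captured cookie objects
      part.headers = (part.headers ?? []).map((h) =>
        SENSITIVE_HEADERS.has(h.name.toLowerCase())
          ? { ...h, value: "REDACTED" }
          : h
      );
    }
    if (entry.response?.content?.text) {
      // Scrub anything that looks like a JWT from captured response bodies.
      entry.response.content.text =
        entry.response.content.text.replace(JWT_PATTERN, "REDACTED.JWT");
    }
  }
  return har;
}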

Just enough sanitization

By default, the sanitizer will remove all session-related cookies and tokens — but there are some cases where these are essential for troubleshooting. For these scenarios, we are implementing a way to conditionally strip “just enough” data from the HAR file to render them safe, while still giving support teams the information they need.

The first product we’ve optimized the HAR sanitizer for is Cloudflare Access. Access relies on a user’s JWT — a compact token often used for secure authentication — to verify that a user should have access to the requested resource. This means a JWT plays a crucial role in troubleshooting issues with Cloudflare Access. We have tuned the HAR sanitizer to strip the cryptographic signature out of the Access JWT, rendering it inert, while still providing useful information for internal admins and Cloudflare support to debug issues.
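
For illustration only (this is not the sanitizer’s actual code), stripping the signature from a JWT while keeping its claims readable can be as simple as dropping the third dot-separated segment:

// A JWT is three base64url segments: header.payload.signature.
// Removing the signature keeps the claims inspectable but makes the token unusable.
function neutralizeJwt(token) {
  const [header, payload] = token.split(".");
  return `${header}.${payload}.`; // empty signature segment
}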

Because HAR files can include a diverse array of data types, selectively sanitizing them is not a case of ‘one size fits all’. We will continue to expand support for other popular authentication tools to ensure we strip out “just enough” information.

What’s next

Over the coming months, we will launch additional security controls in Cloudflare Zero Trust to further mitigate attacks stemming from session tokens stolen from HAR files. This will include:

  • Enhanced Data Loss Prevention (DLP) file type scanning to include HAR file and session token detections, to ensure users in your organization can not share unsanitized files.
  • Expanded API CASB scanning to detect HAR files with session tokens in collaboration tools like Zendesk, Jira, Drive and O365.
  • Automated HAR sanitization of data in popular collaboration tools.

As always, we continue to expand our Cloudflare One Zero Trust suite to protect organizations of all sizes against an ever-evolving array of threats. Ready to get started? Sign up here to begin using Cloudflare One at no cost for teams of up to 50 users.

Email Routing subdomain support, new APIs and security protocols

Post Syndicated from Celso Martinho original http://blog.cloudflare.com/email-routing-subdomains/


Email Routing subdomain support, new APIs and security protocols

It’s been two years since we announced Email Routing, our solution to create custom email addresses for your domains and route incoming emails to your preferred mailbox. Since then, the team has worked hard to evolve the product and add more powerful features to meet our users’ expectations. Examples include Route to Workers, which allows you to process your Emails programmatically using Workers scripts, Public APIs, Audit Logs, or DMARC Management.

We also made significant progress in supporting more email security extensions and protocols, protecting our customers from unwanted traffic, and keeping our IP space reputation for email egress impeccable to maximize our deliverability rates to whatever inbox upstream provider you chose.

Since leaving beta, Email Routing has grown into one of our most popular products; it’s used by more than one million different customer zones globally, and we forward around 20 million messages daily to every major email platform out there. Our product is mature, robust enough for general usage, and suitable for any production environment. And it keeps evolving: today, we announce three new features that will help make Email Routing more secure, flexible, and powerful than ever.

New security protocols

The SMTP email protocol has been around since the early 80s. Naturally, it wasn’t designed with the best security practices and requirements in mind, at least not the ones that the Internet expects today. For that reason, several protocol revisions and extensions have been standardized and adopted by the community over the years. Cloudflare is known for being an early adopter of promising emerging technologies; Email Routing already supports things like SPF, DKIM signatures, DMARC policy enforcement, TLS transport, STARTTLS, and IPv6 egress, to name a few. Today, we are introducing support for two new standards to help increase email security and improve deliverability to third-party upstream email providers.

ARC

Authenticated Received Chain (ARC) is an email authentication system designed to allow an intermediate email server (such as Email Routing) to preserve email authentication results. In other words, with ARC, we can securely preserve the results of validating sender authentication mechanisms like SPF and DKIM, which we support when the email is received, and transport that information to the upstream provider when we forward the message. ARC establishes a chain of trust with all the hops the message has passed through. So, if it was tampered with or changed in one of the hops, it is possible to see where by following that chain.

We began rolling out ARC support to Email Routing a few weeks ago. Here’s how it works:

As you can see, [email protected] sends an Email to [email protected], an Email Routing address, which in turn is forwarded to the final address, [email protected].

Email Routing will use @example.com’s DMARC policy to check the SPF and DKIM alignments (SPF, DKIM, and DMARC help authenticate email senders by verifying that the emails came from the domain that they claim to be from). It then stores this authentication result by adding an ARC-Authentication-Results header in the message:

ARC-Authentication-Results: i=1; mx.cloudflare.net; dkim=pass header.d=cloudflare.com header.s=example09082023 header.b=IRdayjbb; dmarc=pass header.from=example.com policy.dmarc=reject; spf=none (mx.cloudflare.net: no SPF records found for [email protected]) smtp.helo=smtp.example.com; spf=pass (mx.cloudflare.net: domain of [email protected] designates 2a00:1440:4824:20::32e as permitted sender) [email protected]; arc=none smtp.remote-ip=2a00:1440:4824:20::32e

Then we take a snapshot of all the headers and the body of the original message, and we generate an ARC-Message-Signature header with a DKIM-like cryptographic signature (in fact, ARC uses the same DKIM keys):

ARC-Message-Signature: i=1; a=rsa-sha256; s=2022; d=email.cloudflare.net; c=relaxed/relaxed; h=To:Date:Subject:From:reply-to:cc:resent-date:resent-from:resent-to :resent-cc:in-reply-to:references:list-id:list-help:list-unsubscribe :list-subscribe:list-post:list-owner:list-archive; t=1697709687; bh=sN/+...aNbf==;

Finally, before forwarding the message to [email protected], Email Routing generates the ARC-Seal header, another DKIM-like signature, composed of the ARC-Authentication-Results and ARC-Message-Signature headers, and cryptographically “seals” the message:

ARC-Seal: i=1; a=rsa-sha256; s=2022; d=email.cloudflare.net; cv=none; b=Lx35lY6..t4g==;

When Gmail receives the message from Email Routing, it not only normally authenticates the last hop domain.example domain (Email Routing uses SRS), but it also checks the ARC seal header, which provides the authentication results of the original sender.

ARC increases the traceability of the message path through email intermediaries, allowing for more informed delivery decisions by those who receive emails as well as higher deliverability rates for those who transport them, like Email Routing. It has been adopted by all the major email providers like Gmail and Microsoft. You can read more about the ARC protocol in the RFC8617.

MTA-STS

As we said earlier, SMTP is an old protocol. Initially, email communications were done in the clear: plain text and unencrypted. At some point in the late 90s, the email provider community standardized STARTTLS, also known as Opportunistic TLS. The STARTTLS extension allowed a client in an SMTP session to upgrade to TLS-encrypted communications.

While at the time this seemed like a step in the right direction, we later found out that because a STARTTLS session begins over an unencrypted plain-text connection that can be hijacked, the protocol is susceptible to man-in-the-middle attacks.

A few years ago MTA Strict Transport Security (MTA-STS) was introduced by email service providers including Microsoft, Google and Yahoo as a solution to protect against downgrade and man-in-the-middle attacks in SMTP sessions, as well as solving the lack of security-first communication standards in email.

Suppose that example.com uses Email Routing. Here’s how you can enable MTA-STS for it.

First, log in to the Cloudflare dashboard and select your account and zone. Then go to DNS > Records and create a new CNAME record with the name “_mta-sts” that points to Cloudflare’s record “_mta-sts.mx.cloudflare.net”. Make sure to disable the proxy mode.

Confirm that the record was created:

$ dig txt _mta-sts.example.com
_mta-sts.example.com.	300	IN	CNAME	_mta-sts.mx.cloudflare.net.
_mta-sts.mx.cloudflare.net. 300	IN	TXT	"v=STSv1; id=20230615T153000;"

This tells the other end client that is trying to connect to us that we support MTA-STS.

Next you need an HTTPS endpoint at mta-sts.example.com to serve your policy file. This file defines the mail servers in the domain that use MTA-STS. The reason why HTTPS is used here instead of DNS is because not everyone uses DNSSEC yet, so we want to avoid another MITM attack vector.

To do this you need to deploy a very simple Worker that allows Email clients to pull Cloudflare’s Email Routing policy file using the “well-known” URI convention. Go to your Account > Workers & Pages and press Create Application. Pick the “MTA-STS” template from the list.

This Worker simply proxies https://mta-sts.mx.cloudflare.net/.well-known/mta-sts.txt to your own domain. After deploying it, go to the Worker configuration, then Triggers > Custom Domains and Add Custom Domain.
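
For reference, a minimal Worker that does this proxying could look like the sketch below (the actual MTA-STS template may differ):

// Serves Cloudflare's Email Routing MTA-STS policy from your own mta-sts subdomain.
export default {
  async fetch(request) {
    const url = new URL(request.url);
    if (url.pathname === "/.well-known/mta-sts.txt") {
      return fetch("https://mta-sts.mx.cloudflare.net/.well-known/mta-sts.txt");
    }
    return new Response("Not found", { status: 404 });
  },
};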

You can then confirm that your policy file is working:

$ curl https://mta-sts.example.com/.well-known/mta-sts.txt
version: STSv1
mode: enforce
mx: *.mx.cloudflare.net
max_age: 86400

This says that we enforce MTA-STS. Capable email clients will only deliver email to this domain over a secure connection to the specified MX servers. If no secure connection can be established the email will not be delivered.

Email Routing also supports MTA-STS upstream, which greatly improves security when forwarding your Emails to service providers like Gmail or Microsoft, and others.

While enabling MTA-STS involves a few steps today, we plan to simplify things for you and automatically configure MTA-STS for your domains from the Email Routing dashboard as a future improvement.

Sending emails and replies from Workers

Last year we announced Email Workers, allowing anyone using Email Routing to associate a Worker script to an Email address rule, and programmatically process their incoming emails in any way they want. Workers is our serverless compute platform; it provides hundreds of features and APIs, like databases and storage. Email Workers opened doors to a flood of use cases and applications that weren’t possible before, like implementing allow/block lists, advanced rules, notifications to messaging applications, honeypot aggregators, and more.

Still, you could only act on the incoming email event. You could read and process the email message, you could even manipulate and create some headers, but you couldn’t rewrite the body of the message or create new emails from scratch.

Today we’re announcing two new powerful Email Workers APIs that will further enhance what you can do with Email Routing and Workers.

Send emails from Workers

Now you can send an email from scratch from any Worker, whenever you want and not just when you receive incoming messages, to any email address verified on Email Routing under your account. Here are a few practical examples where sending email from Workers to your verified addresses can be helpful:

  • Daily digests with the news from your favorite publications.
  • Alert messages whenever the weather conditions are adverse.
  • Automatic notifications when systems complete tasks.
  • Receive a message composed of the inputs of a form online on a contact page.

Let’s see a simple example of a Worker sending an email. First you need to create “send_email” bindings in your wrangler.toml configuration:

send_email = [
    {type = "send_email", name = "EMAIL_OUT"}
 ]

And then creating a new message and sending it in a Workers is as simple as:

import { EmailMessage } from "cloudflare:email";
import { createMimeMessage } from "mimetext";

export default {
 async fetch(request, env) {
   const msg = createMimeMessage();
   msg.setSender({ name: "Workers AI story", addr: "[email protected]" });
   msg.setRecipient("[email protected]");
   msg.setSubject("An email generated in a worker");
   msg.addMessage({
       contentType: 'text/plain',
       data: `Congratulations, you just sent an email from a worker.`
   });

   var message = new EmailMessage(
     "[email protected]",
     "[email protected]",
     msg.asRaw()
   );
   try {
     await env.EMAIL_OUT.send(message);
   } catch (e) {
     return new Response(e.message);
   }

   return new Response("email sent!");
 },
};

This example makes use of mimetext, an open-source raw email message generator.

Again, for security reasons, you can only send emails to the addresses for which you confirmed ownership in Email Routing under your Cloudflare account. If you’re looking for sending email campaigns or newsletters to destination addresses that you do not control or larger subscription groups, you should consider other options like our MailChannels integration.

Since sending emails from Workers is not tied to the EmailEvent, you can send them from any type of Worker, including Cron Triggers and Durable Objects, whenever you want; you control all the logic (see the sketch below).
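
As an illustration, a scheduled (Cron Trigger) Worker could reuse the EMAIL_OUT binding from the earlier example to send a daily digest; the addresses below are placeholders and must be verified destination addresses in Email Routing:

import { EmailMessage } from "cloudflare:email";
import { createMimeMessage } from "mimetext";

export default {
  // Runs on the schedule defined in wrangler.toml, for example: crons = ["0 8 * * *"]
  async scheduled(event, env, ctx) {
    const msg = createMimeMessage();
    msg.setSender({ name: "Daily digest", addr: "digest@example.com" });
    msg.setRecipient("me@example.com"); // must be a verified destination address
    msg.setSubject("Your daily digest");
    msg.addMessage({ contentType: "text/plain", data: "Nothing new today." });

    await env.EMAIL_OUT.send(
      new EmailMessage("digest@example.com", "me@example.com", msg.asRaw())
    );
  },
};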

Reply to emails

One of our most-requested features has been to provide a way to programmatically respond to incoming emails. It has been possible to do this with Email Workers in a very limited capacity by returning a permanent SMTP error message — but this may or may not be visible to the end user depending on the client implementation.

export default {
  async email(message, env, ctx) {
      message.setReject("Address not allowed");
  }
}

As of today, you can now truly reply to incoming emails with another new message and implement smart auto-responders programmatically, adding any content and context in the main body of the message. Think of a customer support email automatically generating a ticket and returning the link to the sender, an out-of-office reply with instructions when you’re on vacation, or a detailed explanation of why you rejected an email. Here’s a code example:
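
Here is a minimal sketch of such an auto-responder, assuming the message.reply() API and the mimetext helper used in the earlier example; the addresses and wording are placeholders:

import { EmailMessage } from "cloudflare:email";
import { createMimeMessage } from "mimetext";

export default {
  async email(message, env, ctx) {
    const msg = createMimeMessage();
    // The In-Reply-To header of the reply must match the Message-ID of the incoming message.
    msg.setHeader("In-Reply-To", message.headers.get("Message-ID"));
    msg.setSender({ name: "Auto-responder", addr: "support@example.com" });
    msg.setRecipient(message.from); // the reply recipient must match the incoming sender
    msg.setSubject("We received your message");
    msg.addMessage({
      contentType: "text/plain",
      data: "Thanks for reaching out. A ticket has been opened for you.",
    });

    await message.reply(
      new EmailMessage("support@example.com", message.from, msg.asRaw())
    );
  },
};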

To mitigate security risks and abuse, replying to incoming emails has a few requirements:

  • The incoming email has to have valid DMARC.
  • The email can only be replied to once.
  • The In-Reply-To header of the reply message must match the Message-ID of the incoming message.
  • The recipient of the reply must match the incoming sender.
  • The outgoing sender domain must match the same domain that received the email.

If these and other internal conditions are not met, then reply() will fail with an exception; otherwise, you can freely compose your reply message and send it back to the original sender.

For more information, the documentation for these APIs is available in our Developer Docs.

Subdomains support

This is a big one.

Email Routing is a zone-level feature. A zone has a top-level domain (the same as the zone name) and it can have subdomains (managed under the DNS feature). As an example, I can have the example.com zone, and then the mail.example.com and corp.example.com subdomains under it. However, we can only use Email Routing with the top-level domain of the zone, example.com in this example. While this is fine for the vast majority of use cases, some customers — particularly bigger organizations with complex email requirements — have asked for more flexibility.

This changes today. Now you can use Email Routing with any subdomain of any zone in your account. To make this possible we redesigned the dashboard UI experience to make it easier to get you started and manage all your Email Routing domains and subdomains, rules and destination addresses in one single place. Let’s see how it works.

To add Email Routing features to a new subdomain, log in to the Cloudflare dashboard and select your account and zone. Then go to Email > Email Routing > Settings and click “Add subdomain”.

Once the subdomain is added and the DNS records are configured, you can see it in the Settings list under the Subdomains section:

Now you can go to Email > Email Routing > Routing rules and create new custom addresses that will show you the option of using either the top domain of the zone or any other configured subdomain.

After the new custom address for the subdomain is created you can see it in the list with all the other addresses, and manage it from there.

It’s this easy.

Final words

We hope you enjoy the new features that we are announcing today. Still, we want to be clear: there are no changes in pricing, and Email Routing is still free for Cloudflare customers.

Ever since Email Routing was launched, we’ve been listening to customers’ feedback and trying to adjust our roadmap to both our requirements and their own ideas and requests. Email shouldn’t be difficult; our goal is to listen, learn and keep improving the service with better, more powerful features.

You can find detailed information about the new features and more in our Email Routing Developer Docs.

If you have any questions or feedback about Email Routing, please come see us in the Cloudflare Community and the Cloudflare Discord.

DDoS threat report for 2023 Q3

Post Syndicated from Omer Yoachimik original http://blog.cloudflare.com/ddos-threat-report-2023-q3/


DDoS threat report for 2023 Q3

Welcome to the third DDoS threat report of 2023. DDoS attacks, or distributed denial-of-service attacks, are a type of cyber attack that aims to disrupt websites (and other types of Internet properties) to make them unavailable for legitimate users by overwhelming them with more traffic than they can handle — similar to a driver stuck in a traffic jam on the way to the grocery store.

We see a lot of DDoS attacks of all types and sizes, and our network is one of the largest in the world spanning more than 300 cities in over 100 countries. Through this network we serve over 64 million HTTP requests per second at peak and about 2.3 billion DNS queries every day. On average, we mitigate 140 billion cyber threats each day. This colossal amount of data gives us a unique vantage point to understand the threat landscape and provide the community access to insightful and actionable DDoS trends.

In recent weeks, we’ve also observed a surge in DDoS attacks and other cyber attacks against Israeli newspaper and media websites, as well as financial institutions and government websites. Palestinian websites have also seen a significant increase in DDoS attacks. View the full coverage here.

HTTP DDoS attacks against Israeli websites using Cloudflare

The global DDoS threat landscape

In the third quarter of 2023, Cloudflare faced one of the most sophisticated and persistent DDoS attack campaigns in recorded history.

  1. Cloudflare mitigated thousands of hyper-volumetric HTTP DDoS attacks, 89 of which exceeded 100 million requests per second (rps), with the largest peaking at 201 million rps — a figure three times higher than the previous largest attack on record (71M rps).
  2. The campaign contributed to an overall increase of 65% in HTTP DDoS attack traffic in Q3 compared to the previous quarter. Similarly, L3/4 DDoS attacks also increased by 14%.
  3. Gaming and Gambling companies were bombarded with the largest volume of HTTP DDoS attack traffic, overtaking the Cryptocurrency industry from last quarter.

Reminder: an interactive version of this report is also available as a Cloudflare Radar Report. On Radar, you can also dive deeper and explore traffic trends, attacks, outages and many more insights for your specific industry, network and country.

HTTP DDoS attacks and hyper-volumetric attacks

An HTTP DDoS attack is a DDoS attack over the Hypertext Transfer Protocol (HTTP). It targets HTTP Internet properties such as mobile application servers, ecommerce websites, and API gateways.

Illustration of an HTTP DDoS attack

HTTP/2, which accounts for 62% of HTTP traffic, is a version of the protocol that’s meant to improve application performance. The downside is that HTTP/2 can also help improve a botnet’s performance.

Distribution of HTTP versions by Radar

Campaign of hyper-volumetric DDoS attacks exploiting HTTP/2 Rapid Resets

Starting in late August 2023, Cloudflare and various other vendors were subject to a sophisticated and persistent DDoS attack campaign that exploited the HTTP/2 Rapid Reset vulnerability (CVE-2023-44487).

Illustration of an HTTP/2 Rapid Reset DDoS attack

The DDoS campaign included thousands of hyper-volumetric DDoS attacks over HTTP/2 that peaked in the range of millions of requests per second. The average attack rate was 30M rps. Approximately 89 of the attacks peaked above 100M rps and the largest one we saw hit 201M rps.

HTTP/2 Rapid Reset campaign of hyper-volumetric DDoS attacks

Cloudflare’s systems automatically detected and mitigated the vast majority of attacks. We deployed emergency countermeasures and improved our mitigation systems’ efficacy and efficiency to ensure the availability of our network and of our customers’.

Check out our engineering blog that dives deep into the land of HTTP/2, what we learned and what actions we took to make the Internet safer.

Hyper-volumetric DDoS attacks enabled by VM-based botnets

As we’ve seen in this campaign and previous ones, botnets that leverage cloud computing platforms and exploit HTTP/2 are able to generate up to 5,000x more force per botnet node. This allowed them to launch hyper-volumetric DDoS attacks with a small botnet of just 5,000-20,000 nodes. To put that into perspective, in the past, IoT-based botnets consisted of fleets of millions of nodes and barely managed to reach a few million requests per second.

Comparison of an Internet of Things (IoT) based botnet and a Virtual Machine (VM) based botnet

When analyzing the two-month-long DDoS campaign, we can see that Cloudflare infrastructure was the main target of the attacks. More specifically, 19% of all attacks targeted Cloudflare websites and infrastructure. Another 18% targeted Gaming companies, and 10% targeted well known VoIP providers.

Top industries targeted by the HTTP/2 Rapid Reset DDoS attacks

HTTP DDoS attack traffic increased by 65%

The attack campaign contributed to an overall increase in the amount of attack traffic. Last quarter, the volume of HTTP DDoS attacks increased by 15% QoQ. This quarter, it grew even more. Attack volume increased by 65% QoQ to a staggering total of 8.9 trillion HTTP DDoS requests that Cloudflare systems automatically detected and mitigated.

Aggregated volume of HTTP DDoS attack requests by quarter

Alongside the 65% increase in HTTP DDoS attacks, we also saw a minor increase of 14% in L3/4 DDoS attacks — similar to the figures we saw in the first quarter of this year.

L3/4 DDoS attack by quarter

Top sources of HTTP DDoS attacks

When comparing the global and country-specific HTTP DDoS attack request volume, we see that the US remains the largest source of HTTP DDoS attacks. One out of every 25 HTTP DDoS requests originated from the US. China remains in second place. Brazil replaced Germany as the third-largest source of HTTP DDoS attacks, as Germany fell to fourth place.

HTTP DDoS attacks: Top sources compared to all attack traffic

Some countries naturally receive more traffic due to various factors such as the population and Internet usage, and therefore also receive/generate more attacks. So while it’s interesting to understand the total amount of attack traffic originating from or targeting a given country, it is also helpful to remove that bias by normalizing the attack traffic by all traffic to a given country.

When doing so, we see a different pattern. The US doesn’t even make it into the top ten. Instead, Mozambique is in first place (again). One out of every five HTTP requests that originated from Mozambique was part of HTTP DDoS attack traffic.

Egypt remains in second place — approximately 13% of requests originating from Egypt were part of an HTTP DDoS attack. Libya and China follow as the third and fourth-largest source of HTTP DDoS attacks.

HTTP DDoS attacks: Top sources compared to their own traffic

Top sources of L3/4 DDoS attacks

When we look at the origins of L3/4 DDoS attacks, we ignore the source IP address because it can be spoofed. Instead, we rely on the location of Cloudflare’s data center where the traffic was ingested. Thanks to our large network and global coverage, we’re able to achieve geographical accuracy to understand where attacks come from.

In Q3, approximately 36% of all L3/4 DDoS attack traffic that we saw originated from the US. Far behind, Germany came in second place with 8%, and the UK followed in third place with almost 5%.

L3/4 DDoS attacks: Top sources compared to all attack traffic

When normalizing the data, we see that Vietnam dropped to the second-largest source of L3/4 DDoS attacks after being first for two consecutive quarters. New Caledonia, a French territory comprising dozens of islands in the South Pacific, grabbed the first place. Two out of every four bytes ingested in Cloudflare’s data centers in New Caledonia were attacks.

L3/4 DDoS attacks: Top sources compared to their own traffic

Top attacked industries by HTTP DDoS attacks

In terms of absolute volume of HTTP DDoS attack traffic, the Gaming and Gambling industry jumps to first place overtaking the Cryptocurrency industry. Over 5% of all HTTP DDoS attack traffic that Cloudflare saw targeted the Gaming and Gambling industry.

HTTP DDoS attacks: Top attacked industries compared to all attack traffic

The Gaming and Gambling industry has long been one of the most attacked industries compared to others. But when we look at the HTTP DDoS attack traffic relative to each specific industry, we see a different picture. The Gaming and Gambling industry has so much user traffic that, despite being the most attacked industry by volume, it doesn’t even make it into the top ten when we put it into the per-industry context.

Instead, what we see is that the Mining and Metals industry was targeted by the most attacks compared to its total traffic — 17.46% of all traffic to Mining and Metals companies were DDoS attack traffic.

Following closely in second place, 17.41% of all traffic to Non-profits were HTTP DDoS attacks. Many of these attacks are directed at more than 2,400 Non-profit and independent media organizations in 111 countries that Cloudflare protects for free as part of Project Galileo, which celebrated its ninth anniversary this year. Over the past quarter alone, Cloudflare mitigated an average of 180.5 million cyber threats against Galileo-protected websites every day.

HTTP DDoS attacks: Top attacked industries compared to their own traffic

Pharmaceuticals, Biotechnology and Health companies came in third, and US Federal Government websites in fourth place. Almost one out of every 10 HTTP requests to US Federal Government Internet properties was part of an attack. Cryptocurrency came in fifth place, with Farming and Fishery not far behind.

Top attacked industries by region

Now let’s dive deeper to understand which industries were targeted the most in each region.

HTTP DDoS attacks: Top industries targeted by HTTP DDoS attacks by region

Regional deepdives

Africa

After two consecutive quarters as the most attacked industry, the Telecommunications industry dropped from first place to fourth. Media Production companies were the most attacked industry in Africa. The Banking, Financial Services and Insurance (BFSI) industry follows as the second most attacked. Gaming and Gambling companies in third.

Asia

The Cryptocurrency industry remains the most attacked in APAC for the second consecutive quarter. Gaming and Gambling came in second place. Information Technology and Services companies in third.

Europe

For the fourth consecutive quarter, the Gaming and Gambling industry remains the most attacked industry in Europe. Retail companies came in second, and Computer Software companies in third.

Latin America

Farming was the most targeted industry in Latin America in Q3. It accounted for a whopping 53% of all attacks towards Latin America. Far behind, Gaming and Gambling companies were the second most targeted. Civic and Social Organizations were in third.

Middle East

Retail companies were the most targeted in the Middle East in Q3. Computer Software companies came in second and the Gaming and Gambling industry in third.

North America

After two consecutive quarters in first place, the Marketing and Advertising industry dropped to second. Computer Software took the lead. In third place, Telecommunications companies.

Oceania

The Telecommunications industry was, by far, the most targeted in Oceania in Q3 — over 45% of all attacks to Oceania. Cryptocurrency and Computer Software companies came in second and third places respectively.

Top attacked industries by L3/4 DDoS attacks

When descending the layers of the OSI model, the Internet networks and services that were most targeted belonged to the Information Technology and Services industry. Almost 35% of all L3/4 DDoS attack traffic (in bytes) targeted the Information Technology and Internet industry.

Far behind, Telecommunication companies came in second with a mere share of 3%. Gaming and Gambling came in third, Banking, Financial Services and Insurance companies (BFSI) in fourth.

L3/4 DDoS attacks: Top attacked industries compared to all attack traffic

When comparing the attacks on industries to all traffic for that specific industry, we see that the Music industry jumps to the first place, followed by Computer and Network Security companies, Information Technology and Internet companies and Aviation and Aerospace.

L3/4 DDoS attacks: Top attacked industries compared to their own traffic

Top attacked countries by HTTP DDoS attacks

When examining the total volume of attack traffic, the US remains the main target of HTTP DDoS attacks. Almost 5% of all HTTP DDoS attack traffic targeted the US. Singapore came in second and China in third.

HTTP DDoS attacks: Top attacked countries compared to all traffic

If we normalize the data per country and region and divide the attack traffic by the total traffic, we get a different picture. The top three most attacked countries are Island nations.

Anguilla, a small set of islands east of Puerto Rico, jumps to the first place as the most attacked country. Over 75% of all traffic to Anguilla websites were HTTP DDoS attacks. In second place, American Samoa, a group of islands east of Fiji. In third, the British Virgin Islands.

In fourth place, Algeria, and then Kenya, Russia, Vietnam, Singapore, Belize, and Japan.

HTTP DDoS attacks: Top attacked countries compared to their own traffic

Top attacked countries by L3/4 DDoS attacks

For the second consecutive quarter, Chinese Internet networks and services remain the most targeted by L3/4 DDoS attacks. These China-bound attacks account for 29% of all attacks we saw in Q3.

Far, far behind, the US came in second place (3.5%) and Taiwan in third place (3%).

L3/4 DDoS attacks: Top attacked countries compared to all traffic

When normalizing the amount of attack traffic compared to all traffic to a country, China remains in first place and the US disappears from the top ten. Cloudflare saw that 73% of traffic to China Internet networks were attacks. However, the normalized ranking changes from second place on, with the Netherlands receiving the second-highest proportion of attack traffic (representing 35% of the country’s overall traffic), closely followed by Thailand, Taiwan and Brazil.

L3/4 DDoS attacks: Top attacked countries compared to their own traffic

Top attack vectors

The Domain Name System, or DNS, serves as the phone book of the Internet. DNS helps translate the human-friendly website address (e.g., www.cloudflare.com) to a machine-friendly IP address (e.g., 104.16.124.96). By disrupting DNS servers, attackers impact the machines’ ability to connect to a website, and by doing so making websites unavailable to users.

For the second consecutive quarter, DNS-based DDoS attacks were the most common. Almost 47% of all attacks were DNS-based. This represents a 44% increase compared to the previous quarter. SYN floods remain in second place, followed by RST floods, UDP floods, and Mirai attacks.

Top attack vectors

Emerging threats – reduced, reused and recycled

Aside from the most common attack vectors, we also saw significant increases in lesser-known attack vectors. These tend to be very volatile, as threat actors try to “reduce, reuse and recycle” older attack vectors. They are typically UDP-based protocols that can be exploited to launch amplification and reflection DDoS attacks.

One well-known tactic that we continue to see is the use of amplification/reflection attacks. In this attack method, the attacker bounces traffic off of servers, and aims the responses towards their victim. Attackers are able to aim the bounced traffic to their victim by various methods such as IP spoofing.

Reflection can also be achieved differently, in an attack known as a ‘DNS Laundering attack’. In a DNS Laundering attack, the attacker will query subdomains of a domain that is managed by the victim’s DNS server. The prefix that defines the subdomain is randomized and is never used more than once or twice in such an attack. Due to the randomization element, recursive DNS servers will never have a cached response and will need to forward the query to the victim’s authoritative DNS server. The authoritative DNS server is then bombarded with so many queries that it cannot serve legitimate queries, or even crashes altogether.

Illustration of a reflection and amplification attack

Overall in Q3, Multicast DNS (mDNS) based DDoS attacks were the attack method that increased the most. In second place were attacks that exploit the Constrained Application Protocol (CoAP), and in third, the Encapsulating Security Payload (ESP). Let’s get to know these attack vectors a little better.

Main emerging threats

mDNS DDoS attacks increased by 456%

Multicast DNS (mDNS) is a UDP-based protocol that is used in local networks for service/device discovery. Vulnerable mDNS servers respond to unicast queries originating outside the local network, which are ‘spoofed’ (altered) with the victim’s source address. This results in amplification attacks. In Q3, we noticed a large increase of mDNS attacks; a 456% increase compared to the previous quarter.

CoAP DDoS attacks increased by 387%

The Constrained Application Protocol (CoAP) is designed for use in simple electronics and enables communication between devices in a low-power and lightweight manner. However, it can be abused for DDoS attacks via IP spoofing or amplification, as malicious actors exploit its multicast support or leverage poorly configured CoAP devices to generate large amounts of unwanted network traffic. This can lead to service disruption or overloading of the targeted systems, making them unavailable to legitimate users.

ESP DDoS attacks increased by 303%

The Encapsulating Security Payload (ESP) protocol is part of IPsec and provides confidentiality, authentication, and integrity to network communications. However, it could potentially be abused in DDoS attacks if malicious actors exploit misconfigured or vulnerable systems to reflect or amplify traffic towards a target, leading to service disruption. Like with other protocols, securing and properly configuring the systems using ESP is crucial to mitigate the risks of DDoS attacks.

Ransom DDoS attacks

Occasionally, DDoS attacks are carried out to extort ransom payments. We’ve been surveying Cloudflare customers for over three years now, and have been tracking the occurrence of Ransom DDoS attack events.

Comparison of Ransomware and Ransom DDoS attacks

Unlike Ransomware attacks, where victims typically fall prey to downloading a malicious file or clicking on a compromised email link which locks, deletes, or leaks their files until a ransom is paid, Ransom DDoS attacks can be much simpler for threat actors to execute. Ransom DDoS attacks bypass the need for deceptive tactics such as luring victims into opening dubious emails or clicking on fraudulent links, and they don’t necessitate a breach into the network or access to corporate resources.

Over the past quarter, reports of Ransom DDoS attacks continued to decrease. Approximately 8% of respondents reported being threatened or subject to Ransom DDoS attacks, continuing a decline we’ve been tracking throughout the year. Hopefully this is because threat actors have realized that organizations will not pay them (which is our recommendation).

Ransom DDoS attacks by quarter

However, keep in mind that this is also very seasonal, and we can expect an increase in ransom DDoS attacks during the months of November and December. If we look at Q4 numbers from the past three years, we can see that Ransom DDoS attacks have been significantly increasing YoY in November. In previous Q4s, it reached a point where one out of every four respondents reported being subject to Ransom DDoS attacks.

Improving your defenses in the era of hyper-volumetric DDoS attacks

In the past quarter, we saw an unprecedented surge in DDoS attack traffic. This surge was largely driven by the hyper-volumetric HTTP/2 DDoS attack campaign.

Cloudflare customers using our HTTP reverse proxy, i.e. our CDN/WAF services, are already protected from these and other HTTP DDoS attacks. Cloudflare customers that are using non-HTTP services and organizations that are not using Cloudflare at all are strongly encouraged to use an automated, always-on HTTP DDoS Protection service for their HTTP applications.

It’s important to remember that security is a process, not a single product or flip of a switch. On top of our automated DDoS protection systems, we offer comprehensive bundled features such as firewall, bot detection, API protection, and caching to bolster your defenses. Our multi-layered approach optimizes your security posture and minimizes potential impact. We’ve also put together a list of recommendations to help you optimize your defenses against DDoS attacks, and you can follow our step-by-step wizards to secure your applications and prevent DDoS attacks.


Report methodologies
Learn more about our methodologies and how we generate these insights: https://developers.cloudflare.com/radar/reference/quarterly-ddos-reports

New NSA Information from (and About) Snowden

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/10/new-nsa-information-from-and-about-snowden.html

Interesting article about the Snowden documents, including comments from former Guardian editor Ewen MacAskill

MacAskill, who shared the Pulitzer Prize for Public Service with Glenn Greenwald and Laura Poitras for their journalistic work on the Snowden files, retired from The Guardian in 2018. He told Computer Weekly that:

  • As far as he knows, a copy of the documents is still locked in the New York Times office. Although the files are in the New York Times office, The Guardian retains responsibility for them.
  • As to why the New York Times has not published them in a decade, MacAskill maintains “this is a complicated issue.” “There is, at the very least, a case to be made for keeping them for future generations of historians,” he said.
  • Why was only 1% of the Snowden archive published by the journalists who had full access to it? Ewen MacAskill replied: “The main reason for only a small percentage—though, given the mass of documents, 1% is still a lot—was diminishing interest.”

[…]

The Guardian’s journalist did not recall seeing the three revelations published by Computer Weekly, summarized below:

  • The NSA listed Cavium, an American semiconductor company marketing Central Processing Units (CPUs)—the main processor in a computer which runs the operating system and applications—as a successful example of a “SIGINT-enabled” CPU supplier. Cavium, now owned by Marvell, said it does not implement back doors for any government.
  • The NSA compromised lawful Russian interception infrastructure, SORM. The NSA archive contains slides showing two Russian officers wearing jackets with a slogan written in Cyrillic: “You talk, we listen.” The NSA and/or GCHQ has also compromised key lawful interception systems.
  • Among example targets of its mass-surveillance programme, PRISM, the NSA listed the Tibetan government in exile.

Those three pieces of info come from Jake Appelbaum’s Ph.D. thesis.

Човешката роля на учителя ще е незаменима в бъдещето. Разговор с Женя Лазарова

Post Syndicated from Надежда Цекулова original https://www.toest.bg/choveshkata-rolya-na-uchitelya-shte-e-nezamenima-v-budesheteto-razgovor-s-zhenya-lazarova/

Човешката роля на учителя ще е незаменима в бъдещето. Разговор с Женя Лазарова

Zhenya Lazarova is a neuroscientist and psychologist. She completed a master's degree and defended a doctorate at Oxford. Her dissertation examines the processes through which the brain forms new knowledge. The conclusions she reaches, together with her experience of different educational systems, have made modernizing education in a way that respects the natural processes of learning her cause.

Her work as a consultant in the public policy and strategy department of Deloitte London has also made her an advocate of the conviction that any change in large public systems such as education must rest on the achievements of science, while also taking into account the interests of all participants in the system and the different kinds of expertise needed to create effective innovations and good practices.

Can systems like education develop in a data-driven way?

I am a scientist and a business consultant. These are two professions that are 100% based on data – empirical data, information about good practices, data on where you stand, what your local context is, what the feedback from users is; in other words, data that has to be collected continuously.

As a neuroscientist, I would start with the data, with the knowledge of how the learning process works in the first place. The brain learns best through experience, especially in childhood. The more senses are engaged while accumulating experience, the more durable the knowledge that can be formed. That is why abstract definitions or dry facts without backstory, context, and connections are very hard to remember – the brain cannot link them to any real experience, judges them as unnecessary, and discards them very easily.

In many of your appearances you speak about rote learning (memorizing a set of facts) as the foundation on which traditional education was built back in the Middle Ages, and which is losing its effectiveness in the 21st century. Yet it remains a cornerstone of education. Isn't rejecting it a serious revolution?

In my TED Talk I show photographs of different spaces from a hundred years ago – what a library looked like a hundred years ago, what a production environment looked like, and a classroom. Between the classroom of a hundred years ago and today there is almost no difference, while every other aspect of life has changed a great deal – child labor, access to knowledge, science. Education is the only one that struggles to overcome this legacy.

Education today still bears the marks of its origins, which have long ceased to be effective or even useful. Some of them date from the era of religious and monastery cell schools – for example, the need to memorize a random sequence of facts that you sometimes don't even understand. Whoever could recite the Bible by heart in Latin or Ancient Greek was judged the smartest and would not have to do heavy manual labor.

Later, standardization was layered on as an approach. During the industrial revolution, workers had to be trained with identical knowledge and skills. And the last aspect that has historically shaped education is discipline, or blind obedience. These were needed for military training, whose purpose is not to form people who think for themselves.

These approaches visibly do not match the needs of children in today's world. Standardization destroys the capacity for creativity and creative thinking, as well as the opportunity to develop each child's unique talents. Rote learning does not develop cognitive functions – it does not even develop memory. And obedience eliminates the possibility of developing personal responsibility for one's own actions.

You are saying that learning by heart does not develop memory?

Yes, categorically. Semantic memory is built on the interconnections between pieces of knowledge and their relevance, which do not form properly with crammed information, and it is forgotten far too easily. More importantly, this type of learning does not develop the intellect. I worked for several years at Mensa, the organization for people with high IQ. I spent a long time studying various intelligence tests, and I can say with absolute certainty that there is no intelligence test that measures the ability to learn by rote. There are tests of short-term memory, usually included in batteries that measure different kinds of intelligence – spatial, mathematical, verbal, and so on – most of which are ignored by this type of learning.

What is needed, then, for the intellect and the personality to develop?

What is needed is education that supports the whole development of a child – all of their needs: of the psyche, of brain development, of emotional and physical development. All these aspects are equally important for a child to grow into a well-rounded person, to have a fulfilling professional and personal life, in other words to be a happy person.

And it is crucial that learning be matched to the characteristics of each age.

Is learning different at different ages?

Very much so. The brain's different abilities develop at different speeds. The senses, cognitive abilities, and emotional and social maturation develop in different stages; at each age the brain needs to develop different skills and therefore learns in a different way. These are very intuitive concepts, and when I explain them, parents and students recognize them easily.

At an early age, children build their database of knowledge through sensory experiences. The child wants to see, to hear, to touch. That is why children are so curious, need stimulation, ask so many questions, and want to understand how all the things they see fit together – what is this called, let me touch it, does it make a sound, how is it used, and so on. This creates connections that at some point help the child begin to classify, to think about the important differences, on the basis of which we also determine which differences are unimportant.

For example, a dog does not become a non-dog depending on whether it barks louder or more quietly; but if it meows, that definitely makes it a non-dog. The principle sounds very elementary, but we are talking about building skills that are very important later in life and that are in fact what intelligence tests measure. At the moment, however, education does not develop them.

Up to the age of 12–13, implicit learning is very active – the ability to extract general principles from specific examples. If the first dog you ever see is black, you will probably assume that all dogs are black, until you start seeing other colors and adapt your knowledge. This is also why children learn languages more easily: when a child hears a sentence, they can extract the grammatical rule from it imperceptibly and by default. At a later age this kind of learning becomes much harder, because many critical periods in brain development close, including the ability to learn foreign languages the way we learn our mother tongue.

The capacity for abstraction comes even later, when you already have a large and interconnected database.

Can you give existing examples of taking age into account, to illustrate better what you are talking about?

For example, when children study Bay Ganyo in the seventh grade, they do not fully grasp the irony aimed at this character. They see a character who is Bulgarian and does not behave well abroad – everyone mocks him. At that age children still learn implicitly, and the result is that they attribute Bay Ganyo's traits to all Bulgarians, even absorbing them into their own identity. A character like that should not be studied before the high school stage; he should be introduced later, as a next step, once the child has reached the emotional and social maturity needed to interpret and understand him as a critique of certain types in society.

What does building on prior learning mean here? At the moment one chapter of „Бай Ганьо“ is taught in the seventh grade, and the whole work in high school.

I certainly do not mean pulling a single chapter out of a work. In the Bay Ganyo example, I mean that children should be socially and emotionally mature enough to understand the ideological concepts and the irony, and to be able to make proper sense of the character without this harming their psychological development and self-respect. And to make full sense of this character, of course, the whole work has to be read.

I am talking about building up skills in a way that is, quite literally, physiologically appropriate.

Knowledge is not linear, nor is there a universal path to it. What is universal and linear is physiological and psychological development, and that is what progression in the educational system should be based on. The path to knowledge should be flexible and leave room for personalization according to the child's interests and abilities. This is a proven approach that works very well. There are educational systems that are highly personalized, aligned with children's individual interests, and offer many choices to the child and their parents. A linear approach to knowledge does not match the natural way in which people form knowledge.

What is the connection between the outdated principles of mass education and children's academic results?

Learning needs constant, high-quality feedback. Academic results are not sufficient feedback – we are not even entirely clear what exactly we are measuring with them.

Take the national external assessment, for example. It is supposed to measure the quality of schools' work, but in practice it measures the quality of private tutoring, or parents' ability to afford it, along with other factors external to the system. So schools constantly collect some data, but in practice there are very few adequate measures. We do not measure the extent to which education helps children develop into successful people.

In today's world, success is determined by social-emotional maturity – both professional success and personal well-being. There are many examples of people who leave school at 16 with dyslexia, attention deficit, or other neurodevelopmental particularities. They do not feel successful in an academic environment, but outside the system they succeed precisely because of their social-emotional skills. Any factual knowledge can be acquired later in life, but social-emotional skills are very hard to catch up on. And when it comes to these skills, we not only collect no data – we do not consider them part of education at all.

The other kind of feedback – from the student to the teacher and to the Ministry of Education and Science – is also missing.

How should children learn in order to build modern skills?

For me, there is an artificial dividing line between pedagogical science on one side and neuroscience and psychology on the other. Cognitive psychology and neuroscience have been studying and expanding our understanding of how people learn for decades – these are the basic sciences that pedagogy should be built on. This is one of the big barriers we have to clear in order to apply the scientific advances of the past 20 years. Bulgaria should not be at the back of the queue here, because in many systems this transition is already underway. Part of my thesis is that a number of serious and deep social problems are rooted in our education and can only be solved if we address them properly. But for that we need more intellectual self-confidence that we can develop good solutions here ourselves – there is no need to keep waiting for someone abroad to hand them to us ready-made.

I have heard teachers say, though, that their job is to teach, not to raise children.

I have heard it too, yes. But social-emotional skills go much deeper than upbringing. Upbringing has turned into a kind of social lubricant, an imitation of personal maturity. We have learned to imitate this social maturity without truly developing it internally. We teach children to say "thank you," but we do not teach them how to feel real gratitude; we teach them to say "excuse me," but we do not teach them to respect other people's space and boundaries. Developing these skills cannot be left entirely outside education.

If we look at the current situation in Bulgaria, some experts say that building or cultivating social-emotional skills is not the school's job but the family's; others say it is already being done in school; and a third group says it is written into the law but not being done. To figure out who is right, we have to look at the data – in this case, the data on our children's mental health. It shows that Bulgarian children and young people rank among the worst on negative mental health indicators, such as aggression and bullying at school, cyberbullying, alcohol and cigarette consumption, marijuana and drug use, early sexualization, anxiety, and so on.

So what does the data tell us about the experts who believe that social-emotional learning is not the school's job, or conversely that it is already being done? It tells us that whatever is being done right now is not producing results. It tells us that children have an enormous need that is not being addressed. The short answer to the question of whose responsibility it is – the school's or the family's – is: both. It has to happen in partnership. In my view this is the biggest challenge and the biggest priority for education right now.

Doesn't this burden teachers with too many expectations?

Rather, it changes the expectations placed on teachers. I believe that in modern education the teacher's role is to be a mentor, to guide children, and to be the human role model. The parent and the teacher are the most authoritative adults in a child's life; from them children absorb many of the principles we are trying to build into them as life skills. The human role of the teacher will be ever more irreplaceable in the future, because it cannot be performed by artificial intelligence. Most knowledge can already be obtained from a variety of sources, and the teacher will be less needed as a carrier of factual knowledge. But they will be key to passing on the human skills that make us human and that we must preserve in our children.

For example, a teacher must have emotional regulation skills in order to teach children emotional regulation. If a teacher shouts in class but can give children a definition of emotional regulation, because their job is "only to teach," that creates distrust and will not lead to the result we would like – children acquiring the skill of emotional regulation and being able to apply it in the right situations, rather than merely memorizing its definition. Of course, teachers themselves must also be adequately supported in developing the skills we aim to pass on to the next generation.

You said there are systems where the transition from rote learning to social-emotional skills has already begun. Can we draw on their experience of how to adapt the system to the new? Or, more concretely, how do we change the adults who have to change the system?

That is the million-dollar question. Joking aside, there are many systems, as well as informal forms of education, where various good practices can be observed. I would not say there is a single fully renewed educational system aligned with everything science knows. But the Scandinavian countries, Wales, and Japan, for example, are far ahead in implementing social-emotional learning.

Can we take away the message that we should not be afraid to change the educational system?

Yes, in principle we should not be afraid of change and development, but at the same time we are evolutionarily wired to be afraid. Every change triggers a natural reaction of resistance, because having to adapt to something new requires more resources.

The way forward is to keep looking for ways to overcome this natural resistance, and that is essential in any public policy that involves large-scale change.


We live in a time when learning whatever you want, the moment you want it, is easier than at any point in human history. Despite this, or perhaps precisely because of it, formal education is going through an identity crisis. In the column „Разговори за образованието“ ("Conversations about Education"), Nadezhda Tsekulova and her interlocutors look for the philosophy, meaning, and forms of what we call "education" in the third decade of the 21st century.

What Did We Sow in Our Palms, for a Flower to Sprout

Post Syndicated from Toest original https://www.toest.bg/kakvo-posyahme-v-dlanite-si-tsvete-da-ponikne/


A flower to teach us water and drought. We sowed ourselves,
we watered ourselves
with our dearest, kindest intentions.

What did we sow in our palms, for a cloud to sprout.
To walk in the cloud with our reckless
picnics, to stagger downward
with ozone in our chests.

We sowed the air, inhaled in the haste
between two dyings. Back then you
had stopped inside your dreams, pricked
on the treacherous spindle; do you remember
that I came – was it you I was to save,
or you me.

What did we sow in our palms, for a fruit to shine on us.
A fruit to shine on us up in the blue above the hushed
hospitals; and how brightly we would drift
all autumn long. We sowed the sun,
the flower's drought.

What did we sow in our loving palms,
for a jaw to sprout. A jaw of steel
to pierce our throat.
We sowed ourselves, we watered ourselves
with our dearest, kindest intentions.


Nikola Petrov (b. 1987) is the author of the books „Въжеиграч“ (Zhanet 45, 2012), „Бяс/бяло“ (Literaturen Vestnik Foundation, 2017), and „Не са чудовища“ (DA Poetry Publishing, 2021; winner of the Peroto Award and nominated for the Ivan Nikolov Award). Since 2015 he has taken part in numerous editions of the performance "Actors versus Poets" at the Sfumato Theatre Workshop. His poetry has been translated into English, Spanish, Italian, Russian, and Belarusian. He works as a media analyst.


According to Ekaterina Yosifova, "whoever reads a poem in the morning… bears the other hours" of the day well. Convinced that poetry keeps our minds awake and our hearts open, at the end of every month we offer you one poem. Because even in the most troubled times, a good poem is good news.

Enable cost-efficient operational analytics with Amazon OpenSearch Ingestion

Post Syndicated from Mikhail Vaynshteyn original https://aws.amazon.com/blogs/big-data/enable-cost-efficient-operational-analytics-with-amazon-opensearch-ingestion/

As the scale and complexity of microservices and distributed applications continue to expand, customers are seeking guidance for building cost-efficient infrastructure supporting operational analytics use cases. Operational analytics is a popular use case with Amazon OpenSearch Service. A few of the defining characteristics of these use cases are ingesting a high volume of time series data and a relatively low volume of querying, alerting, and running analytics on ingested data for real-time insights. Although OpenSearch Service is capable of ingesting petabytes of data across storage tiers, you still have to provision capacity to migrate between hot and warm tiers. This adds to the cost of provisioned OpenSearch Service domains.

The time series data often contains logs or telemetry data from various sources with different values and needs. That is, logs from some sources need to be available in a hot storage tier longer, whereas logs from other sources can tolerate a delay in querying and have other, less demanding requirements. Until now, customers were building external ingestion systems with the Amazon Kinesis family of services, Amazon Simple Queue Service (Amazon SQS), AWS Lambda, custom code, and other similar solutions. Although these solutions enable ingestion of operational data with various requirements, they add to the cost of ingestion.

In general, operational analytics workloads use anomaly detection to aid domain operations. This assumes that the data is already present in OpenSearch Service and the cost of ingestion is already borne.

With the addition of a few recent features of Amazon OpenSearch Ingestion, a fully managed serverless pipeline for OpenSearch Service, you can effectively address each of these cost points and build a cost-effective solution. In this post, we outline a solution that does the following:

  • Uses conditional routing in Amazon OpenSearch Ingestion to send logs with specific attributes to Amazon OpenSearch Service, while archiving all events in Amazon S3 where they can be queried with Amazon Athena
  • Uses in-stream anomaly detection with OpenSearch Ingestion, thereby removing the cost associated with compute needed for anomaly detection after ingestion

In this post, we use a VPC flow logs use case to demonstrate the solution. The solution and pattern presented in this post are equally applicable to larger operational analytics and observability use cases.

Solution overview

We use VPC flow logs to capture IP traffic and trigger processing notifications to the OpenSearch Ingestion pipeline. The pipeline filters the data, routes the data, and detects anomalies. The raw data will be stored in Amazon S3 for archival purposes, then the pipeline will detect anomalies in the data in near-real time using the Random Cut Forest (RCF) algorithm and send those data records to OpenSearch Service. The raw data stored in Amazon S3 can be inexpensively retained for an extended period of time using tiered storage and queried using the Athena query engine, and also visualized using Amazon QuickSight or other data visualization services. Although this walkthrough uses VPC flow log data, the same pattern applies for use with AWS CloudTrail, Amazon CloudWatch, any log files as well as any OpenTelemetry events, and custom producers.

The following is a diagram of the solution that we configure in this post.

In the following sections, we provide a walkthrough for configuring this solution.

The patterns and procedures presented in this post have been validated with the current version of OpenSearch Ingestion and the Data Prepper open-source project version 2.4.

Prerequisites

Complete the following prerequisite steps:

  1. We use a VPC for demonstration purposes to generate data. Set up VPC flow logs to publish logs to an S3 bucket in text format. To optimize S3 storage costs, create a lifecycle configuration on the S3 bucket to transition the VPC flow logs to different tiers or expire processed logs (see the example CLI commands after this list). Make a note of the S3 bucket name you configured to use in later steps.
  2. Set up an OpenSearch Service domain. Make a note of the domain URL. The domain can be either public or VPC based, which is the preferred configuration.
  3. Create an S3 bucket for storing archived events, and make a note of S3 bucket name. Configure a resource-based policy allowing OpenSearch Ingestion to archive logs and Athena to read the logs.
  4. Configure an AWS Identity and Access Management (IAM) role or separate IAM roles allowing OpenSearch Ingestion to interact with Amazon SQS and Amazon S3. For instructions, refer to Configure the pipeline role.
  5. Configure Athena or validate that Athena is configured on your account. For instructions, refer to Getting started.
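If you want to script step 1, the following is a minimal sketch using the AWS CLI. The VPC ID and bucket name are placeholders for illustration, and the 30-day expiration is just an example; adjust both to your environment and retention needs.

# Publish VPC flow logs to an existing S3 bucket in the default text format
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::example-vpc-flow-logs-bucket

# Expire processed flow log objects after 30 days to control storage costs
aws s3api put-bucket-lifecycle-configuration \
  --bucket example-vpc-flow-logs-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-processed-vpc-flow-logs",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 30}
    }]
  }'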

Configure an SQS notification

VPC flow logs will write data in Amazon S3. After each file is written, Amazon S3 will send an SQS notification to notify the OpenSearch Ingestion pipeline that the file is ready for processing.

If the data is already stored in Amazon S3, you can use the S3 scan capability for a one-time or scheduled loading of data through the OpenSearch Ingestion pipeline.

Use AWS CloudShell to issue the following commands to create the SQS queues VpcFlowLogsNotification and VpcFlowLogsNotifications-DLQ that we use for this walkthrough.

Create a dead-letter queue with the following code:

export SQS_DLQ_URL=$(aws sqs create-queue --queue-name VpcFlowLogsNotifications-DLQ | jq -r '.QueueUrl')

echo $SQS_DLQ_URL 

export SQS_DLQ_ARN=$(aws sqs get-queue-attributes --queue-url $SQS_DLQ_URL --attribute-names QueueArn | jq -r '.Attributes.QueueArn') 

echo $SQS_DLQ_ARN

Create an SQS queue to receive events from Amazon S3 with the following code:

export SQS_URL=$(aws sqs create-queue --queue-name VpcFlowLogsNotification --attributes '{
"RedrivePolicy": 
"{\"deadLetterTargetArn\":\"'$SQS_DLQ_ARN'\",\"maxReceiveCount\":\"2\"}", 
"Policy": 
  "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"s3.amazonaws.com\"}, \"Action\":\"SQS:SendMessage\",\"Resource\":\"*\"}]}" 
}' | jq -r '.QueueUrl')

echo $SQS_URL

To configure the S3 bucket to send events to the SQS queue, use the following code (provide the name of your S3 bucket used for storing VPC flow logs):

# The S3 notification configuration requires the queue ARN (not the URL), so look it up first
export SQS_ARN=$(aws sqs get-queue-attributes --queue-url $SQS_URL --attribute-names QueueArn | jq -r '.Attributes.QueueArn')

aws s3api put-bucket-notification-configuration --bucket __BUCKET_NAME__ --notification-configuration '{
     "QueueConfigurations": [
         {
             "QueueArn": "'$SQS_ARN'",
             "Events": [
                 "s3:ObjectCreated:*"
             ]
         }
     ]
}'

Create the OpenSearch Ingestion pipeline

Now that you have configured Amazon SQS and the S3 bucket notifications, you can configure the OpenSearch Ingestion pipeline.

  1. On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create pipeline.

  3. For Pipeline name, enter a name (for this post, we use stream-analytics-pipeline).
  4. For Pipeline configuration, enter the following code:
version: "2"
entry-pipeline:
  source:
     s3:
       notification_type: sqs
       compression: gzip
       codec:
         newline:
       sqs:
         queue_url: "<strong>__SQS_QUEUE_URL__</strong>"
         visibility_timeout: 180s
       aws:
        region: "<strong>__REGION__</strong>"
        sts_role_arn: "<strong>__STS_ROLE_ARN__</strong>"
  
  processor:
  sink:
    - pipeline:
        name: "archive-pipeline"
    - pipeline:
        name: "data-processing-pipeline"

data-processing-pipeline:
    source: 
        pipeline:
            name: "entry-pipeline"
    processor:
    - grok:
        tags_on_match_failure: [ "grok_match_failure" ]
        match:
          message: [ "%{VPC_FLOW_LOG}" ]
    route:
        - icmp_traffic: '/protocol == 1'
    sink:
        - pipeline:
            name : "icmp-pipeline"
            routes:
                - "icmp_traffic"
        - pipeline:
            name: "analytics-pipeline"
    

archive-pipeline:
  source:
    pipeline:
      name: entry-pipeline
  processor:
  sink:
    - s3:
        aws:
          region: "<strong>__REGION__</strong>"
          sts_role_arn: "<strong>__STS_ROLE_ARN__</strong>"
        max_retries: 16
        bucket: "<strong>__AWS_S3_BUCKET_ARCHIVE__</strong>"
        object_key:
          path_prefix: "vpc-flow-logs-archive/year=%{yyyy}/month=%{MM}/day=%{dd}/"
        threshold:
          maximum_size: 50mb
          event_collect_timeout: 300s
        codec:
          parquet:
            auto_schema: true
      
analytics-pipeline:
  source:
    pipeline:
      name: "data-processing-pipeline"
  processor:
    - drop_events:
        drop_when: "hasTags(\"grok_match_failure\") or \"/log-status\" == \"NODATA\""
    - date:
        from_time_received: true
        destination: "@timestamp"
    - aggregate:
        identification_keys: ["srcaddr", "dstaddr"]
        action:
          tail_sampler:
            percent: 20.0
            wait_period: "60s"
            condition: '/action != "ACCEPT"'
    - anomaly_detector:
        identification_keys: ["srcaddr","dstaddr"]
        keys: ["bytes"]
        verbose: true
        mode:
          random_cut_forest:
  sink:
    - opensearch:
        hosts: [ "<strong>__AMAZON_OPENSEARCH_DOMAIN_URL__</strong>" ]
        index: "flow-logs-anomalies"
        aws:
          sts_role_arn: "<strong>__STS_ROLE_ARN__</strong>"
          region: "<strong>__REGION__</strong>"
          
icmp-pipeline:
  source:
    pipeline:
      name: "data-processing-pipeline"
  processor:
  sink:
    - opensearch:
        hosts: [ "<strong>__AMAZON_OPENSEARCH_DOMAIN_URL__</strong>" ]
        index: "sensitive-icmp-traffic"
        aws:
          sts_role_arn: "<strong>__STS_ROLE_ARN__</strong>"
          region: "<strong>__REGION__</strong>"</code>

Replace the variables in the preceding code with resources in your account:

    • __SQS_QUEUE_URL__ – URL of Amazon SQS for Amazon S3 events
    • __STS_ROLE_ARN__ – AWS Security Token Service (AWS STS) role for the pipeline resources to assume
    • __AWS_S3_BUCKET_ARCHIVE__ – S3 bucket for archiving processed events
    • __AMAZON_OPENSEARCH_DOMAIN_URL__ – URL of OpenSearch Service domain
    • __REGION__ – Region (for example, us-east-1)
  5. In the Network settings section, specify your network access. For this walkthrough, we use VPC access: provide the VPC, the private subnets that have connectivity with the OpenSearch Service domain, and the security groups.
  6. Leave the other settings at their default values, and choose Next.
  7. Review the configuration and choose Create pipeline.

It will take a few minutes for OpenSearch Service to provision the environment. While the environment is being provisioned, we’ll walk you through the pipeline configuration. The entry-pipeline section listens for SQS notifications about newly arrived files and triggers reading of the compressed VPC flow log files:

…
entry-pipeline:
  source:
     s3:
…

The pipeline branches into two sub-pipelines. The first stores original messages for archival purposes in Amazon S3 in read-optimized Parquet format; the other applies analytics and routes events to the OpenSearch Service domain for fast querying and alerting:

…
  sink:
    - pipeline:
        name: "archive-pipeline"
    - pipeline:
        name: "data-processing-pipeline"
… 

The pipeline archive-pipeline aggregates messages into 50 MB chunks or flushes them after the 300-second event_collect_timeout, whichever comes first, and writes a Parquet file to Amazon S3 with the schema inferred from the messages. A key prefix is also added to help with partitioning and query optimization when reading a collection of files using Athena.

…
sink:
    - s3:
…
        object_key:
          path_prefix: " vpc-flow-logs-archive/year=%{yyyy}/month=%{MM}/day=%{dd}/"
        threshold:
          maximum_size: 50mb
          event_collect_timeout: 300s
        codec:
          parquet:
            auto_schema: true
…

Now that we have reviewed the basics, we focus on the pipeline that detects anomalies and sends only high-value messages that deviate from the norm to OpenSearch Service. It also stores Internet Control Message Protocol (ICMP) messages in OpenSearch Service.

We applied a grok processor to parse the message using a predefined regex for parsing VPC flow logs, and also tagged all unparsable messages with the grok_match_failure tag, which we use to remove headers and other records that can’t be parsed:

…
    processor:
    - grok:
        tags_on_match_failure: [ "grok_match_failure" ]
        match:
          message: [ "%{VPC_FLOW_LOG}" ]
…

We then routed all messages with protocol identifier 1 (ICMP) to icmp-pipeline, and all messages, regardless of protocol, to analytics-pipeline for anomaly detection:

…
   route:
        - icmp_traffic: '/protocol == 1'
    sink:
        - pipeline:
            name : "icmp-pipeline"
            routes:
                - "icmp_traffic"
        - pipeline:
            name: "analytics-pipeline"
…

In the analytics pipeline, we dropped all records that couldn't be parsed, using the hasTags method with the tag we assigned at parse time. We also removed all records that don't contain useful data for anomaly detection (those with a log-status of NODATA).

…
  - drop_events:
        drop_when: "hasTags(\"grok_match_failure\") or \"/log-status\" == \"NODATA\""		
…

Then we applied probabilistic sampling using the tail_sampler processor to the accepted messages, grouped by source and destination address, and sent those to the sink together with all messages that were not accepted. This reduces the volume of messages within the selected cardinality keys while keeping every message that wasn't accepted and a representative sample of the messages that were.

…
aggregate:
        identification_keys: ["srcaddr", "dstaddr"]
        action:
          tail_sampler:
            percent: 20.0
            wait_period: "60s"
            condition: '/action != "ACCEPT"'
…

Then we used the anomaly detector processor to identify anomalies within the cardinality key pairs, which in our example are the source and destination addresses. The anomaly detector processor creates and trains RCF models for a hashed value of the keys, then uses those models to determine whether newly arriving messages are anomalous relative to the trained data. In our demonstration, we use the bytes value to detect anomalies:

…
anomaly_detector:
        identification_keys: ["srcaddr","dstaddr"]
        keys: ["bytes"]
        verbose: true
        mode:
          random_cut_forest:
…

We set verbose: true to instruct the detector to emit a message every time an anomaly is detected. You can also configure a non-default sample_size for training the model if the default doesn't fit your traffic volume.

When an anomaly is detected, the anomaly detector returns the complete record and adds the "deviation_from_expected" and "grade" attributes, which signify the deviation value and the severity of the anomaly. You can use these values to determine how such messages are routed to OpenSearch Service, and use per-document monitoring capabilities in OpenSearch Service to alert on specific conditions.

Currently, OpenSearch Ingestion creates up to 5,000 distinct models based on cardinality key values per compute unit. You can observe this limit using the anomaly_detector.RCFInstances.value CloudWatch metric. It's important to choose cardinality keys whose number of distinct value combinations stays within this constraint. As development of the Data Prepper open-source project and OpenSearch Ingestion continues, more configuration options will be added to offer greater flexibility around model training and memory management.

The OpenSearch Ingestion pipeline exposes the anomaly_detector.cardinalityOverflow.count metric through CloudWatch. This metric indicates the number of key-value pairs that weren't processed by the anomaly detection processor during a period of time because the maximum number of RCFInstances per compute unit was reached. To avoid this constraint, you can scale out the number of compute units to provide additional capacity for hosting more RCFInstances.
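As a rough sketch of how you might watch these metrics from the command line, the following commands assume the pipeline metrics are published under the AWS/OSIS CloudWatch namespace; the namespace, the exact metric name format, and the dimensions are assumptions here, so verify them for your pipeline before relying on this.

# Discover the anomaly detector metrics reported for your pipeline (namespace is an assumption)
aws cloudwatch list-metrics --namespace "AWS/OSIS" \
  --query 'Metrics[?contains(MetricName, `anomaly_detector`)]'

# Pull the cardinality overflow count for the last hour (substitute the metric name returned above)
aws cloudwatch get-metric-statistics \
  --namespace "AWS/OSIS" \
  --metric-name "analytics-pipeline.anomaly_detector.cardinalityOverflow.count" \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum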

In the last sink, the pipeline writes records with detected anomalies along with deviation_from_expected and grade attributes to the OpenSearch Service domain:

…
sink:
    - opensearch:
        hosts: [ "__AMAZON_OPENSEARCH_DOMAIN_URL__" ]
        index: "anomalies"
…

Because only anomaly records are being routed and written to the OpenSearch Service domain, we are able to significantly reduce the size of our domain and optimize the cost of our sample observability infrastructure.

Another sink was used for storing all ICMP records in a separate index in the OpenSearch Service domain:

…
sink:
    - opensearch:
        hosts: [ "__AMAZON_OPENSEARCH_DOMAIN_URL__" ]
        index: " sensitive-icmp-traffic"
…

Query archived data from Amazon S3 using Athena

In this section, we review the configuration of Athena for querying archived events data stored in Amazon S3. Complete the following steps:

  1. Navigate to the Athena query editor and create a new database called vpc-flow-logs-archive using the following command:
CREATE DATABASE `vpc-flow-logs-archive`
  2. On the Database menu, choose vpc-flow-logs-archive.
  3. In the query editor, enter the following command to create a table (provide the S3 bucket used for archiving processed events). For simplicity, we create the table without partitions for this walkthrough.
CREATE EXTERNAL TABLE `vpc-flow-logs-data`(
  `message` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://__AWS_S3_BUCKET_ARCHIVE__'
TBLPROPERTIES (
  'classification'='parquet', 
  'compressionType'='none'
)
  4. Run the following query to validate that you can query the archived VPC flow log data:
SELECT * FROM "vpc-flow-logs-archive"."vpc-flow-logs-data" LIMIT 10;

Because archived data is stored in its original format, it helps avoid issues related to format conversion. Athena will query and display records in the original format. Often, however, you only need a subset of columns or parts of each message. You can use the regexp_split function in Athena to split the message into columns and retrieve specific ones. Run the following query to see the source and destination address groupings from the VPC flow log data:

SELECT srcaddr, dstaddr FROM (
   SELECT regexp_split(message, ' ')[4] AS srcaddr, 
          regexp_split(message, ' ')[5] AS dstaddr, 
          regexp_split(message, ' ')[14] AS status  FROM "vpc-flow-logs-archive"."vpc-flow-logs-data" 
) WHERE status = 'OK' 
GROUP BY srcaddr, dstaddr 
ORDER BY srcaddr, dstaddr LIMIT 10;

This demonstrates that you can query all archived events with Athena, analyzing the data in its original raw format. Athena is priced based on the amount of data scanned. Because the data is stored in a read-optimized format and can be partitioned, this enables further cost optimization for on-demand querying of archived streaming and observability data.
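If you want to run the validation query outside the Athena console, a minimal AWS CLI sketch such as the following could be used; the S3 output location is a placeholder bucket for query results and must exist in your account.

# Start the validation query; results land in the specified S3 output location
QUERY_ID=$(aws athena start-query-execution \
  --query-string 'SELECT * FROM "vpc-flow-logs-archive"."vpc-flow-logs-data" LIMIT 10' \
  --result-configuration OutputLocation=s3://example-athena-query-results/ \
  | jq -r '.QueryExecutionId')

# Check the execution state, then fetch the results once it reports SUCCEEDED
aws athena get-query-execution --query-execution-id $QUERY_ID | jq -r '.QueryExecution.Status.State'
aws athena get-query-results --query-execution-id $QUERY_ID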

Clean up

To avoid incurring future charges, delete the following resources created as part of this post:

  • OpenSearch Service domain
  • OpenSearch Ingestion pipeline
  • SQS queues
  • VPC flow logs configuration
  • All data stored in Amazon S3

Conclusion

In this post, we demonstrated how to use OpenSearch Ingestion pipelines to build a cost-optimized infrastructure for log analytics and observability events. We used routing, filtering, aggregation, and anomaly detection in an OpenSearch Ingestion pipeline, enabling you to downsize your OpenSearch Service domain and create a cost-optimized observability infrastructure. For our example, we used a data sample with 1.5 million events with a pipeline distilling to 1,300 events with predicted anomalies based on source and destination IP pairs. This metric demonstrates that the pipeline identified that less than 0.1% of events were of high importance, and routed those to OpenSearch Service for visualization and alerting needs. This translates to lower resource utilization in OpenSearch Service domains and can lead to provisioning of smaller OpenSearch Service environments.

We encourage you to use OpenSearch Ingestion pipelines to create your purpose-built and cost-optimized observability infrastructure that uses OpenSearch Service for storing and alerting on high-value events. If you have comments or feedback, please leave them in the comments section.


About the Authors

Mikhail Vaynshteyn is a Solutions Architect with Amazon Web Services. Mikhail works with healthcare and life sciences customers to build solutions that help improve patients’ outcomes. Mikhail specializes in data analytics services.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Mask and redact sensitive data published to Amazon SNS using managed and custom data identifiers

Post Syndicated from Otavio Ferreira original https://aws.amazon.com/blogs/security/mask-and-redact-sensitive-data-published-to-amazon-sns-using-managed-and-custom-data-identifiers/

Today, we’re announcing a new capability for Amazon Simple Notification Service (Amazon SNS) message data protection. In this post, we show you how you can use this new capability to create custom data identifiers to detect and protect domain-specific sensitive data, such as your company’s employee IDs. Previously, you could only use managed data identifiers to detect and protect common sensitive data, such as names, addresses, and credit card numbers.

Overview

Amazon SNS is a serverless messaging service that provides topics for push-based, many-to-many messaging for decoupling distributed systems, microservices, and event-driven serverless applications. As applications become more complex, it can become challenging for topic owners to manage the data flowing through their topics. These applications might inadvertently start sending sensitive data to topics, increasing regulatory risk. To mitigate the risk, you can use message data protection to protect sensitive application data using built-in, no-code, scalable capabilities.

To discover and protect data flowing through SNS topics with message data protection, you can associate data protection policies with your topics. Within these policies, you can write statements that define which types of sensitive data you want to discover and protect. Within each policy statement, you can then define whether you want to act on data flowing inbound to an SNS topic or outbound to an SNS subscription, the AWS accounts or specific AWS Identity and Access Management (IAM) principals the statement applies to, and the actions you want to take on the sensitive data found.

Now, message data protection provides three actions to help you protect your data. First, the audit operation reports on the amount of sensitive data found. Second, the deny operation helps prevent the publishing or delivery of payloads that contain sensitive data. Third, the de-identify operation can mask or redact the sensitive data detected. These no-code operations can help you adhere to a variety of compliance regulations, such as Health Insurance Portability and Accountability Act (HIPAA), Federal Risk and Authorization Management Program (FedRAMP), General Data Protection Regulation (GDPR), and Payment Card Industry Data Security Standard (PCI DSS).

This message data protection feature coexists with the message data encryption feature in SNS, both contributing to an enhanced security posture of your messaging workloads.

Managed and custom data identifiers

After you add a data protection policy to your SNS topic, message data protection uses pattern matching and machine learning models to scan your messages for sensitive data, then enforces the data protection policy in real time. The types of sensitive data are referred to as data identifiers. These data identifiers can be either managed by Amazon Web Services (AWS) or custom to your domain.

Managed data identifiers (MDI) are organized into five categories covering common types of sensitive data.

In a data protection policy statement, you refer to a managed data identifier using its Amazon Resource Name (ARN), as follows:

{
    "Name": "__example_data_protection_policy",
    "Description": "This policy protects sensitive data in expense reports",
    "Version": "2021-06-01",
    "Statement": [{
        "DataIdentifier": [
            "arn:aws:dataprotection::aws:data-identifier/CreditCardNumber"
        ],
        "..."
    }]
}

Custom data identifiers (CDI), on the other hand, enable you to define custom regular expressions in the data protection policy itself, then refer to them from policy statements. Using custom data identifiers, you can scan for business-specific sensitive data that managed data identifiers can't detect. For example, you can use a custom data identifier to look for company-specific employee IDs in SNS message payloads. Internally, SNS has guardrails to make sure custom data identifiers are safe and that they add only low single-digit millisecond latency to message processing.

In a data protection policy statement, you refer to a custom data identifier using only the name that you have given it, as follows:

{
    "Name": "__example_data_protection_policy",
    "Description": "This policy protects sensitive data in expense reports",
    "Version": "2021-06-01",
    "Configuration": {
        "CustomDataIdentifier": [{
            "Name": "MyCompanyEmployeeId", "Regex": "EID-\d{9}-US"
        }]
    },
    "Statement": [{
        "DataIdentifier": [
            "arn:aws:dataprotection::aws:data-identifier/CreditCardNumber",
            "MyCompanyEmployeeId"
        ],
        "..."
    }]
}

Note that custom data identifiers can be used in conjunction with managed data identifiers, as part of the same data protection policy statement. In the preceding example, both MyCompanyEmployeeId and CreditCardNumber are in scope.

For more information, see Data Identifiers, in the SNS Developer Guide.

Inbound and outbound data directions

In addition to the DataIdentifier property, each policy statement also sets the DataDirection property (whose value can be either Inbound or Outbound) as well as the Principal property (whose value can be any combination of AWS accounts, IAM users, and IAM roles).

When you use message data protection for data de-identification and set DataDirection to Inbound, instances of DataIdentifier published by the Principal are masked or redacted before the payload is ingested into the SNS topic. This means that every endpoint subscribed to the topic receives the same modified payload.

When you set DataDirection to Outbound, on the other hand, the payload is ingested into the SNS topic as-is. Then, instances of DataIdentifier are either masked, redacted, or kept as-is for each subscribing Principal in isolation. This means that each endpoint subscribed to the SNS topic might receive a different payload from the topic, with different sensitive data de-identified, according to the data access permissions of its Principal.

The following snippet expands the example data protection policy to include the DataDirection and Principal properties.

{
    "Name": "__example_data_protection_policy",
    "Description": "This policy protects sensitive data in expense reports",
    "Version": "2021-06-01",
    "Configuration": {
        "CustomDataIdentifier": [{
            "Name": "MyCompanyEmployeeId", "Regex": "EID-\d{9}-US"
        }]
    },
    "Statement": [{
        "DataIdentifier": [
            "MyCompanyEmployeeId",
            "arn:aws:dataprotection::aws:data-identifier/CreditCardNumber"
        ],
        "DataDirection": "Outbound",
        "Principal": [ "arn:aws:iam::123456789012:role/ReportingApplicationRole" ],
        "..."
    }]
}

In this example, ReportingApplicationRole is the authenticated IAM principal that called the SNS Subscribe API at subscription creation time. For more information, see How do I determine the IAM principals for my data protection policy? in the SNS Developer Guide.

Operations for data de-identification

To complete the policy statement, you need to set the Operation property, which informs the SNS topic of the action that it should take when it finds instances of DataIdentifier in the outbound payload.

The following snippet expands the data protection policy to include the Operation property, in this case using the Deidentify object, which in turn supports masking and redaction.

{
    "Name": "__example_data_protection_policy",
    "Description": "This policy protects sensitive data in expense reports",
    "Version": "2021-06-01",
    "Configuration": {
        "CustomDataIdentifier": [{
            "Name": "MyCompanyEmployeeId", "Regex": "EID-\d{9}-US"
        }]
    },
    "Statement": [{
        "Principal": [
            "arn:aws:iam::123456789012:role/ReportingApplicationRole"
        ],
        "DataDirection": "Outbound",
        "DataIdentifier": [
            "MyCompanyEmployeeId",
            "arn:aws:dataprotection::aws:data-identifier/CreditCardNumber"
        ],
        "Operation": { "Deidentify": { "MaskConfig": { "MaskWithCharacter": "#" } } }
    }]
}

In this example, the MaskConfig object instructs the SNS topic to mask instances of CreditCardNumber in Outbound messages to subscriptions created by ReportingApplicationRole, using the MaskWithCharacter value, which in this case is the hash symbol (#). Alternatively, you could have used the RedactConfig object instead, which would have instructed the SNS topic to simply cut the sensitive data off the payload.

The following snippet shows how the outbound payload is masked, in real time, by the SNS topic.

// original message published to the topic:
My credit card number is 4539894458086459

// masked message delivered to subscriptions created by ReportingApplicationRole:
My credit card number is ################

For more information, see Data Protection Policy Operations, in the SNS Developer Guide.

Applying data de-identification in a use case

Consider a company where managers use an internal expense report management application where expense reports from employees can be reviewed and approved. Initially, this application depended only on an internal payment application, which in turn connected to an external payment gateway. However, this workload eventually became more complex, because the company started also paying expense reports filed by external contractors. At that point, the company built a mobile application that external contractors could use to view their approved expense reports. An important business requirement for this mobile application was that specific financial and PII data needed to be de-identified in the externally displayed expense reports. Specifically, both the credit card number used for the payment and the internal employee ID that approved the payment had to be masked.

Figure 1: Expense report processing application

To distribute the approved expense reports to both the payment application and the reporting application that backed the mobile application, the company used an SNS topic with a data protection policy. The policy has only one statement, which masks credit card numbers and employee IDs found in the payload. This statement applies only to the IAM role that the company used for subscribing the AWS Lambda function of the reporting application to the SNS topic. This access permission configuration enabled the Lambda function from the payment application to continue receiving the raw data from the SNS topic.

The data protection policy from the previous section addresses this use case. Thus, when a message representing an expense report is published to the SNS topic, the Lambda function in the payment application receives the message as-is, whereas the Lambda function in the reporting application receives the message with the financial and PII data masked.

Deploying the resources

You can apply a data protection policy to an SNS topic using the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS SDK, or AWS CloudFormation.
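For example, if you keep a data protection policy like the one from the previous section in a local JSON file, a minimal AWS CLI sketch for attaching it to a topic could look like the following; the topic ARN and file name are placeholders for illustration.

# Attach (or replace) the data protection policy on an existing SNS topic
aws sns put-data-protection-policy \
  --resource-arn arn:aws:sns:us-east-1:123456789012:ApprovalTopic \
  --data-protection-policy file://expense-report-data-protection-policy.json

# Retrieve the policy to confirm it was applied
aws sns get-data-protection-policy \
  --resource-arn arn:aws:sns:us-east-1:123456789012:ApprovalTopic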

To automate the provisioning of the resources and the data protection policy of the example expense management use case, we’re going to use CloudFormation templates. You have two options for deploying the resources:

Deploy using the individual CloudFormation templates in sequence

  1. Prerequisites template: This first template provisions two IAM roles with a managed policy that enables them to create SNS subscriptions and configure the subscriber Lambda functions. You will use these provisioned IAM roles in steps 3 and 4 that follow.
  2. Topic owner template: The second template provisions the SNS topic along with its access policy and data protection policy.
  3. Payment subscriber template: The third template provisions the Lambda function and the corresponding SNS subscription that make up the Payment application stack. When prompted, select the PaymentApplicationRole in the Permissions panel before running the template. Moreover, the CloudFormation console will require you to acknowledge that a CloudFormation transform might require access capabilities.
  4. Reporting subscriber template: The final template provisions the Lambda function and the SNS subscription that make up the Reporting application stack. When prompted, select the ReportingApplicationRole in the Permissions panel before running the template. Moreover, the CloudFormation console will require, once again, that you acknowledge that a CloudFormation transform might require access capabilities.
Figure 2: Select IAM role

Now that the application stacks have been deployed, you’re ready to start testing.

Testing the data de-identification operation

Use the following steps to test the example expense management use case.

  1. In the Amazon SNS console, select the ApprovalTopic, then choose to publish a message to it.
  2. In the SNS message body field, enter the following message payload, representing an external contractor expense report, then choose to publish this message:
    {
        "expense": {
            "currency": "USD",
            "amount": 175.99,
            "category": "Office Supplies",
            "status": "Approved",
            "created_at": "2023-10-17T20:03:44+0000",
            "updated_at": "2023-10-19T14:21:51+0000"
        },
        "payment": {
            "credit_card_network": "Visa",
            "credit_card_number": "4539894458086459"
        },
        "reviewer": {
            "employee_id": "EID-123456789-US",
            "employee_location": "Seattle, USA"
        },
        "contractor": {
            "employee_id": "CID-000012348-CA",
            "employee_location": "Vancouver, CAN"
        }
    }
    

  3. In the CloudWatch console, select the log group for the PaymentLambdaFunction, then choose to view its latest log stream. Now look for the log stream entry that shows the message payload received by the Lambda function. You will see that no data has been masked in this payload, as the payment application requires raw financial data to process the credit card transaction.
  4. Still in the CloudWatch console, select the log group for the ReportingLambdaFunction, then choose to view its latest log stream. Now look for the log stream entry that shows the message payload received by this Lambda function. You will see that the values for properties credit_card_number and employee_id have been masked, protecting the financial data from leaking into the external reporting application.
    {
        "expense": {
            "currency": "USD",
            "amount": 175.99,
            "category": "Office Supplies",
            "status": "Approved",
            "created_at": "2023-10-17T20:03:44+0000",
            "updated_at": "2023-10-19T14:21:51+0000"
        },
        "payment": {
            "credit_card_network": "Visa",
            "credit_card_number": "################"
        },
        "reviewer": {
            "employee_id": "################",
            "employee_location": "Seattle, USA"
        },
        "contractor": {
            "employee_id": "CID-000012348-CA",
            "employee_location": "Vancouver, CAN"
        }
    }
    

As shown, different subscribers received different versions of the message payload, according to their sensitive data access permissions.
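If you prefer to check the delivered payloads from the command line instead of the CloudWatch console, a sketch like the following could be used. It assumes the log groups follow the default /aws/lambda/<function-name> naming and that the functions are literally named PaymentLambdaFunction and ReportingLambdaFunction; CloudFormation-generated names usually include a stack prefix and suffix, so adjust the log group names to the actual ones in your stacks.

# Most recent payloads logged by the reporting subscriber (masked fields expected)
aws logs filter-log-events \
  --log-group-name /aws/lambda/ReportingLambdaFunction \
  --limit 20 \
  --query 'events[].message'

# Compare with the payment subscriber, which receives the raw payload
aws logs filter-log-events \
  --log-group-name /aws/lambda/PaymentLambdaFunction \
  --limit 20 \
  --query 'events[].message'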

Cleaning up the resources

After testing, avoid incurring usage charges by deleting the resources that you created. Open the CloudFormation console and delete the four CloudFormation stacks that you created during the walkthrough.

Conclusion

This post showed how you can use Amazon SNS message data protection to discover and protect sensitive data published to or delivered from your SNS topics. The example use case shows how to create a data protection policy that masks messages delivered to specific subscribers if the payloads contain financial or personally identifiable information.

For more details, see message data protection in the SNS Developer Guide. For information on costs, see SNS pricing.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS re:Post or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Otavio Ferreira

Otavio is the GM for Amazon SNS, and has been leading the service since 2016, responsible for software engineering, product management, technical program management, and technical operations. Otavio has spoken at AWS conferences—AWS re:Invent and AWS Summit—and written a number of articles for the AWS Compute and AWS Security blogs.
