Preparing for Unknown Risks: How to Better Prepare for Risks You Can’t See Yet

Post Syndicated from Rapid7 original https://blog.rapid7.com/2024/08/22/preparing-for-unknown-risks-how-to-better-prepare-for-risks-you-cant-see-yet/


As security professionals we’re used to dealing with unknowns and unpredictability. We understand that it’s impossible to always know what’s around the corner. It’s not just about external threats and the big breaches splashed across the news headlines. On one hand, we’re combating threat actors attempting to steal information, money or simply trying to cause havoc. On the other, we’re trying to better understand employee behaviour amidst the myriad of applications they use on a daily basis; always vigilant for any suspicious activity. And while it certainly makes our jobs interesting, unpredictability runs contrary to how the organisations we protect prefer to operate.

Predicting what’s going to happen in our cyber world is nearly impossible. A greater challenge is explaining this to stakeholders and conveying how difficult it is to get (and stay) one step ahead of threat actors. We’re paid to understand this, yet it can often feel like shooting in the dark when anticipating the next strike.

Senior leadership teams thrive on certainty and predictability. So how do you plan and manage this?

Focus on what you can control

Ultimately, you can only control what’s in front of you: the tools, applications and services the business uses to operate. While this might seem obvious, many people spend a considerable amount of time and energy on things they can’t influence.

Your time is best spent focusing on what’s visible and within reach. Begin by identifying the crown jewels of your organisation — understanding the scope of your environment and what exactly you’re protecting. Then, implement controls and monitor for abnormalities.

Regularly conduct comprehensive risk assessments and vulnerability scans to identify potential weaknesses in your organisation’s IT infrastructure. This helps uncover existing vulnerabilities and potential entry points for cyber threats, particularly in areas where the ‘crown jewels’ are held!

Leverage threat modelling

Threat modelling provides very useful analysis, unique to your organisation. Various factors determine your threat model including industry, compliance and regulations and finally, customers. Using your threat model as a guide, you can get a clear picture of the unique risks your business faces and design controls around those. These insights can also inform your approach to Table Top Exercises, preparing you for potential incidents.

While predicting a threat actor’s next steps is challenging, gathering and understanding this information through these exercises can enhance your ability to anticipate future threats. After all, identifying unknowns is crucial.

With a clear focus on what you’re protecting, you’re now able to analyse and draw learnings from past events, which are often a good predictor of future occurrences. While threat actors are often portrayed as volatile and unpredictable (and this is true in some cases), they’re only human – and humans are creatures of habit. Recognizing patterns in their behaviour can provide valuable insights.

This is where threat intelligence gathering is extremely useful. Make sure you stay informed about the latest cyber threats and attack trends by monitoring reputable sources of threat intelligence. Understanding which trends and patterns have occurred in the past may help you better predict the types of threats or vulnerabilities your organisation could be subject to in the future.

How Rapid7 can help – Threat Command

Threats can come from any direction. Rapid7’s Threat Command scans the clear, deep, and dark webs for potential dangers before they affect your organisation. It provides contextualised alerts on threats affecting your business, proactively researching malware, tactics, techniques, and procedures (TTPs), phishing scams, and other threat actors. Threat Command replaces point solutions with an all-in-one external threat intelligence, digital risk protection, indicators of compromise (IOCs) management, and remediation solution.

Find out more.

Proactive profiling

Conducting risk assessments, vulnerability scans and gathering threat intelligence helps you to understand the ‘cyber profile’ of your organisation. This preparation helps you anticipate the types of threats typically used against similar-sized organisations or those in your industry. Trends and patterns do emerge. For example, our Ransomware Data Disclosure Report found that internal financial data was leaked 71% of the time in the healthcare and pharmaceutical sectors, more than in any other industry, including financial services.

Tailored strategies for different organisations

Threat actors focus on ‘big fish’ because they’re often newsworthy and recognizable – threat actors have egos too! Large organisations should consider strong encryption and network segmentation to contain potential threats. Prioritise data types for additional protection.

For smaller organisations, where an online presence is critical but public profile is lower, backup and recovery are essential in case systems are locked or shut down. Ensure software and systems are up to date with the latest security patches to prevent threats exploiting known vulnerabilities. Automate this process to keep it off the to-do list.

Building a detailed picture of your data and crown jewels allows you to reduce risks and build cyber resilience, identifying potential unknowns along the way.

How Rapid7 can help – Managed Detection and Response

Managed Detection and Response (MDR) services accelerate your team’s incident-response capabilities with end-to-end service. Acting as a seamless extension of your team, our experts monitor your business 24/7/365. They leverage proprietary technology and analytics to keep your business safe against advanced threats. You can also gain access to our award-winning VRM technology to perform unlimited scans of your in-scope environment to spot vulnerabilities before they’re exploited by threat actors.

Find out more.

Communication is key

But don’t forget: communication is key. Organisations crave predictability, and cybersecurity can often appear to be a ‘black box’ to those unfamiliar with it. Transparent lines of communication and regular updates mean you can paint a clear picture of potential risks that could impact your business (not to mention the business benefits of investing in security).

Proactivity is essential. With so much happening in our field, it can be tempting to simply react and respond to what’s going on around us. However, insisting on weekly updates with your stakeholders and keeping them informed of your work will make managing a crisis more bearable. This way, if something unpredictable happens, it won’t be a complete surprise, and you’ll be better prepared to manage it and your senior leaders.

Backblaze Network Stats: Ingress Trends and What They Tell Us About Backup Behaviors

Post Syndicated from Brent Nowak original https://www.backblaze.com/blog/backblaze-network-stats-ingress-trends-and-what-they-tell-us-about-backup-behaviors/

An image with a background pattern of trend lines and the words "Network Stats Ingress Trends and what they tell us"

Every day, thousands of Backblaze customers create and update files. These changes make their way into our system to be securely stored. Sometimes they are sent to us immediately, while other times the differentials are batched up into a job that runs at a scheduled time. 

In this post, I’m sampling three points in our network where we take in a lot of ingress traffic off of the internet, and we’re going to explore some of the trends that we see. 

Reading the ingress tea leaves

So, why do we care about ingress trends? In short, it helps us with capacity planning, and it also tells us a lot about how people use cloud storage. We often think of planning in longer terms—weeks, months, or years. Here I wanted to focus on some of the patterns that we see during a shorter period; for example, a single day or a significant date, like the end of the calendar month. There are some interesting patterns we see in our client behavior that keep us on our toes when we are performing capacity planning.

We currently have two product offerings that have different usage and traffic patterns:

  • Backblaze B2 Cloud Storage: Ingress and egress, high variance in traffic levels throughout the day, hour, and at the start of month. 
  • Backblaze Computer Backup: Heavy ingress, with a small variance in traffic levels during the business day or weekday vs. weekend.

Since humans are using our system, we see very human quirks in our traffic profiles. For example, we humans like round numbers! We notice that a lot of backup jobs kick off at midnight local or UTC, or fire off at the top of the hour, or trigger on the first of the month. This means we see spikes of network traffic during these periods. Additionally, a lot of new content gets created during the day and then queued up to be uploaded to us in an overnight backup job.

Scope and terms

Today we’re going to look at ingress traffic, which means we’re monitoring uploads from both Backblaze Computer Backup and Backblaze B2 into our environment. We’ll save downloads, traffic coming out of Backblaze, for analysis in future posts.

One common term that you’ll see on our graphs is the 95th percentile. The 95th percentile number is the point that 95% of all measurements fall under and only 5% exceed. This is a very typical method to use for monitoring, billing, and trend analysis in the telecom industry. It maps to a standard bell curve, and tells you that you’re capturing the vast majority of usage for planning purposes.

A chart displaying a bell curve and percentiles
A standard bell curve. Source.

In one of our monitoring systems, we are sampling and recording the utilization on our network links and computing a 95th percentile over a five minute period.
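To make the 95th percentile concrete, here is a minimal Python sketch using the nearest-rank method. The function name and the sample readings are invented for illustration, and the nearest-rank method is an assumption; the post does not say which percentile method the monitoring system uses.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Return the smallest sample such that at least pct% of samples are <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the sample that sits at the requested percentile.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical link utilisation readings (Gbps) containing one short burst.
readings = [40, 42, 41, 39, 43, 44, 40, 41, 42, 38,
            41, 40, 43, 42, 39, 41, 40, 42, 95, 41]

p95 = percentile_nearest_rank(readings, 95)
peak = max(readings)
print(f"95th percentile: {p95} Gbps, peak: {peak} Gbps")
```

With these made-up readings the single burst sets the peak but does not move the 95th percentile, which is why the measure is convenient for planning around sustained usage.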

With these items defined, let’s get into the data with some charts!

Sample 1: One-month trend

In this first sample, we see that the majority of our daily traffic falls within a nice range. What stands out here is the clock tick over from February to March, where we see a spike of ingress traffic that is outside the expected daily range.

A chart displaying a sample of ingress trends over one month.

Taking that same dataset, let’s take a closer look at the end of the month and zoom in on the calendar change into March.

Adding a vertical red line on 00:00 UTC where the month changes over, we see that there must be a lot of automated jobs that kick in right at the clock changeover into the new month.

A chart showing ingress trends over 7 days.

Sample 2: Top of the hour

Taking a look at another traffic sample from another point in our network, we see very distinct traffic patterns on the top of almost every hour.

A chart showing ingress trends over 24 hours

Sample 3: Pacific Time Zone working hours

Here’s a sample of traffic in our US-West region. During the business day on the West Coast, we see a lull in traffic, with a pickup after the business day is done. This makes sense to us as there are jobs that backup daily content that start to send traffic to us overnight.

A chart showing ingress trends over three days.

What does this mean for you?

It’s very interesting to see the impact of humans in our network traffic and the patterns that emerge. Generally we humans create and modify things during the day, and we like to back them up overnight for safekeeping. And we also like round numbers—people tend to send data at the top of the hour, midnight, or end of the month.

All of these elements are very important in how we, at Backblaze, capacity plan and balance traffic over transit links. We do a lot of work to make sure that no matter what time of day or day of the month, you can reliably get your data into Backblaze.

But, you might also look at this data and take away a meaningful conclusion: Much like choosing to go to the grocery store at 10:30 a.m. on a Tuesday versus fighting the after-work rush at 6:00 p.m., scheduling jobs on the 15, 30, or 45 minute mark or mid-month instead of at the end of the month would mean you’re up against less traffic, which is never a bad thing (and it also smooths out our ingress, which we wouldn’t be mad about either).
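As a rough illustration of that scheduling tip, the Python sketch below picks the next start time that lands on a 15, 30, or 45 minute mark rather than the top of the hour. The function name and the constant are hypothetical; a real backup tool would have its own scheduler settings.

```python
from datetime import datetime, timedelta

# Minute marks that avoid the top-of-the-hour rush described above.
OFF_PEAK_MINUTES = (15, 30, 45)

def next_off_peak_start(now):
    """Return the first time strictly after `now` that falls on an off-peak minute mark."""
    base = now.replace(second=0, microsecond=0)
    for minute in OFF_PEAK_MINUTES:
        candidate = base.replace(minute=minute)
        if candidate > now:
            return candidate
    # Already past the :45 mark: roll over to :15 of the next hour.
    return base.replace(minute=0) + timedelta(hours=1, minutes=OFF_PEAK_MINUTES[0])

# A job that would otherwise fire at the month rollover moves to 00:15.
print(next_off_peak_start(datetime(2024, 3, 1, 0, 0)))
```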

At the end of the day, however you choose to schedule your jobs works for us. We’re just glad we’re able to store and protect our customers’ data reliably and affordably, and we’re happy to pass along any tips and tricks for a better, less congested, backup experience as well.

Thanks for reading, and stay tuned for more graphs and commentary on how we strive to build a reliable, scalable, and forward-looking network to serve our customers’ needs.

The post Backblaze Network Stats: Ingress Trends and What They Tell Us About Backup Behaviors appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

[$] A review of file descriptor memory safety in the kernel

Post Syndicated from daroc original https://lwn.net/Articles/985853/

On July 30, Al Viro sent a patch set to the linux-fsdevel mailing list with a comprehensive cover letter explaining his recent work on ensuring that the kernel’s internal representation of file descriptors is used correctly in the kernel. File descriptors are ubiquitous; many system calls need to handle them. Viro’s review identified a few existing bugs, and may prevent more in the future. He also had suggestions for ways to keep uses consistent throughout the kernel.

Garrett: What is an SBAT and why does everyone suddenly care

Post Syndicated from corbet original https://lwn.net/Articles/986844/

Matthew Garrett describes the role of the Secure Boot Advanced Targeting mechanism and how it played into the recent Windows upgrade problems.

So why is this suddenly relevant? SBAT was developed
collaboratively between the Linux community and Microsoft, and
Microsoft chose to push a Windows update that told systems not to
trust versions of grub with a security generation below a certain
level. This was because those versions of grub had genuine security
vulnerabilities that would allow an attacker to compromise the
Windows secure boot chain, and we’ve seen real world examples of
malware wanting to do that.

Security updates for Thursday

Post Syndicated from jake original https://lwn.net/Articles/986841/

Security updates have been issued by AlmaLinux (.NET 8.0, bind, bind9.16, curl, edk2, firefox, gnome-shell, grafana, jose, krb5, libreoffice, mod_auth_openidc:2.3, orc, pcs, poppler, python-setuptools, python-urllib3, python3.11-setuptools, python3.12-setuptools, thunderbird, tomcat, and wget), Fedora (webkitgtk), SUSE (apache2, glib2, and roundcubemail), and Ubuntu (kernel, linux, linux-aws, linux-aws-5.15, linux-azure, linux-azure-5.15,
linux-azure-fde, linux-azure-fde-5.15, linux-gcp, linux-gcp-5.15,
linux-gke, linux-gkeop, linux-gkeop-5.15, linux-hwe-5.15, linux-ibm,
linux-ibm-5.15, linux-intel-iotg, linux-intel-iotg-5.15, linux-kvm,
linux-lowlatency, linux-lowlatency-hwe-5.15, linux-nvidia, linux-oracle,
linux-raspi, linux, linux-aws, linux-azure, linux-bluefield, linux-gcp, linux-gcp-5.4,
linux-gkeop, linux-hwe-5.4, linux-ibm, linux-ibm-5.4, linux-kvm,
linux-oracle, linux-oracle-5.4, linux-raspi, linux-xilinx-zynqmp, linux, linux-aws, linux-azure, linux-gcp, linux-gke, linux-ibm,
linux-lowlatency, linux-nvidia, linux-nvidia-6.8, linux-nvidia-lowlatency,
linux-oem-6.8, linux-oracle, linux-raspi, linux, linux-aws, linux-kvm, linux-lts-xenial, linux, linux-gcp, linux-gcp-4.15, linux-hwe, linux-kvm, linux-aws, linux-aws-hwe, linux-bluefield, linux-hwe-5.15, linux-raspi-5.4, and qemu).

Go wild: Wildcard support in Rules and a new open-source wildcard crate

Post Syndicated from Nikita Cano original https://blog.cloudflare.com/wildcard-rules


Back in 2012, we introduced Page Rules, a pioneering feature that gave Cloudflare users unprecedented control over how their web traffic was managed. At the time, this was a significant leap forward, enabling users to define patterns for specific URLs and adjust Cloudflare features on a page-by-page basis. The ability to apply such precise configurations through a simple, user-friendly interface was a major advancement, establishing Page Rules as a cornerstone of our platform.

Page Rules allowed users to implement a variety of actions, including redirects, which automatically send visitors from one URL to another. Redirects are crucial for maintaining a seamless user experience on the Internet, whether it’s guiding users from outdated links to new content or managing traffic during site migrations.

As the Internet has evolved, so too have the needs of our users. The demand for greater flexibility, higher performance, and more advanced capabilities led to the development of the Ruleset Engine, a powerful framework designed to handle complex rule evaluations with unmatched speed and precision.

In September 2022, we announced and released Single Redirects as a modern replacement for the URL Forwarding feature of Page Rules. Built on top of the Ruleset Engine, this new product offered a powerful syntax and enhanced performance.

Despite the enhancements, one of the most consistent pieces of feedback from our users was the need for wildcard matching and expansion, also known as globbing. This feature is essential for creating dynamic and flexible URL patterns, allowing users to manage a broader range of scenarios with ease.

Today we are excited to announce that wildcard support is now available across our Ruleset Engine-based products, including Cache Rules, Compression Rules, Configuration Rules, Custom Errors, Origin Rules, Redirect Rules, Snippets, Transform Rules, Web Application Firewall (WAF), Waiting Room, and more.

Understanding wildcards

Wildcard pattern matching allows users to employ an asterisk (`*`) in a string to match certain patterns. For example, a single pattern like `https://example.com/*/t*st` can cover multiple URLs such as `https://example.com/en/test`, `https://example.com/images/toast`, and `https://example.com/blog/trust`.

Once a segment is captured, it can be used in another expression by referencing the matched wildcard with the `${<X>}` syntax, where `<X>` indicates the index of a matched pattern. This is particularly useful in URL forwarding. For instance, the URL pattern `https://example.com/*/t*st` can redirect to `https://${1}.example.com/t${2}st`, allowing dynamic and flexible URL redirection. This setup ensures that `https://example.com/uk/test` is forwarded to `https://uk.example.com/test`, `https://example.com/images/toast` to `https://images.example.com/toast`, and so on.
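The capture-and-substitute idea can be sketched in a few lines of Python. This is an illustration only, not the Ruleset Engine implementation: the helper names `wildcard_to_regex` and `expand` are hypothetical, the pattern is translated to a regular expression for brevity, and how ambiguous inputs are split between several asterisks is an assumption (greedy here).

```python
import re

def wildcard_to_regex(pattern):
    """Translate a pattern in which `*` matches any run of characters into a compiled regex."""
    literals = pattern.split("*")
    # Escape the literal parts and turn every asterisk into a capturing group.
    return re.compile("(.*)".join(re.escape(part) for part in literals), re.DOTALL)

def expand(pattern, target, url):
    """Return `target` with ${1}, ${2}, ... filled in from `url`, or None if `url` does not match."""
    match = wildcard_to_regex(pattern).fullmatch(url)
    if match is None:
        return None
    # Replace each ${N} with the text captured by the N-th asterisk.
    return re.sub(r"\$\{(\d+)\}", lambda ref: match.group(int(ref.group(1))), target)

pattern = "https://example.com/*/t*st"
target = "https://${1}.example.com/t${2}st"
for url in ("https://example.com/uk/test", "https://example.com/images/toast"):
    print(url, "->", expand(pattern, target, url))
```

Running it on the two URLs from the text reproduces the redirects described above; a URL that does not fit the pattern yields `None`.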

Challenges with Single Redirects

In Page Rules, redirecting from an old URI path to a new one looked like this:

  • Source URL: `https://example.com/old-path/*`

  • Target URL: `https://example.com/new-path/$1`

In comparison, replicating this behaviour in Single Redirects without wildcards required a more complex approach:

  • Filter: `(http.host eq "example.com" and starts_with(http.request.uri.path, "/old-path/"))`

  • Expression: `concat("/new-path/", substring(http.request.uri.path, 10))` (where 10 is the length of `/old-path/`)

This complexity created unnecessary overhead and difficulty, especially for users without access to regular expressions (regex) or the technical expertise to come up with expressions that use nested functions.

Wildcard support in Ruleset Engine

With the introduction of wildcard support across our Ruleset Engine-based products, users can now take advantage of the power and flexibility of the Ruleset Engine through simpler and more intuitive configurations. This enhancement ensures high performance while making it easier to create dynamic and flexible URL patterns and beyond.

What’s new?

1) Operators “wildcard” and “strict wildcard” in Ruleset Engine:

  • “wildcard” (case insensitive): Matches patterns regardless of case (e.g., “test” and “TesT” are treated the same, similar to Page Rules).

  • “strict wildcard” (case sensitive): Matches patterns exactly, respecting case differences (e.g., “test” won’t match “TesT”).

Both operators can be applied to any string field available in the Ruleset Engine, including full URI, host, headers, cookies, user-agent, country, and more.

This example demonstrates the use of the “wildcard” operator in a Web Application Firewall (WAF) rule applied to the User Agent field. This rule matches any incoming request where the User Agent string contains patterns starting with “Mozilla/” and includes specific elements like “Macintosh; Intel Mac OS “, “Gecko/”, and “Firefox/”. Importantly, the wildcard operator is case insensitive, so it captures variations like “mozilla” and “Mozilla” without requiring exact matches.

2) Function `wildcard_replace()` in Single Redirects:

In Single Redirects, the `wildcard_replace()` function allows you to use matched segments in redirect URL targets.

Consider the URL pattern `https://example.com/*/t*st` mentioned earlier. Using `wildcard_replace()`, you can now set the target URL to `https://${1}.example.com/t${2}st` and dynamically redirect URLs like `https://example.com/uk/test` to `https://uk.example.com/test` and `https://example.com/images/toast` to `https://images.example.com/toast`.

3) Simplified UI in Single Redirects:

We understand that not everyone wants to use advanced Ruleset Engine functions, especially for simple URL patterns. That’s why we’ve introduced an easy and intuitive UI for Single Redirects called “wildcard pattern”. This new interface, available under the Rules > Redirect Rules tab of the zone dashboard, lets you specify request and target URL wildcard patterns in seconds without needing to delve into complex functions, much like Page Rules.

How we built it

The Ruleset Engine powering Cloudflare Rules products is written in Rust. When adding wildcard support, we first explored existing Rust crates for wildcard matching.

We considered using the popular `regex` crate, known for its robustness. However, it requires converting wildcard patterns into regular expressions (e.g., `*` to `.*`, and `?` to `.`) and escaping other characters that are special in regex patterns, which adds complexity.

We also looked at the `wildmatch` crate, which is designed specifically for wildcard matching and has a couple of advantages over `regex`. The most obvious one is that there is no need to convert wildcard patterns to regular expressions. More importantly, `wildmatch` can handle complex patterns efficiently: wildcard matching takes quadratic time in the worst case, that is, time proportional to the length of the pattern multiplied by the length of the input string. To be more specific, the time complexity is O(p + ℓ + s ⋅ ℓ), where p is the length of the wildcard pattern, ℓ the length of the input string, and s the number of asterisk metacharacters in the pattern. (If you are not familiar with big O notation, it is a way to express how an algorithm consumes a resource, in this case time, as the input size changes.) In the Ruleset Engine, we limit the number of asterisk metacharacters in the pattern to a maximum of 8. This ensures we will have good performance and limits the impact of a bad actor trying to consume too much CPU time by targeting extremely complicated patterns and input strings.
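For readers who want to see what an iterative wildcard matcher looks like, here is a minimal Python sketch of the classic two-pointer approach that backtracks only to the most recent asterisk. It is not the code of `wildmatch` or of the Ruleset Engine; the function name is hypothetical, and rejecting patterns with more than 8 asterisks up front is just one assumed way of applying the limit mentioned above.

```python
MAX_STARS = 8  # the limit quoted in the text; enforcing it by raising is an assumption

def wildcard_match(pattern, text):
    """Return True if `text` matches `pattern`, where `*` matches any run of characters."""
    if pattern.count("*") > MAX_STARS:
        raise ValueError("too many asterisks in pattern")
    p = t = 0
    star_p = -1  # pattern index just after the most recent `*`, or -1 if none seen
    star_t = 0   # text index from which that `*` is currently being retried
    while t < len(text):
        if p < len(pattern) and pattern[p] == "*":
            # Remember the asterisk and first try matching it against nothing.
            star_p = p + 1
            star_t = t
            p = star_p
        elif p < len(pattern) and pattern[p] == text[t]:
            p += 1
            t += 1
        elif star_p != -1:
            # Mismatch: let the last asterisk absorb one more character and retry.
            star_t += 1
            t = star_t
            p = star_p
        else:
            return False
    # The text is consumed; whatever remains of the pattern must be asterisks only.
    while p < len(pattern) and pattern[p] == "*":
        p += 1
    return p == len(pattern)

print(wildcard_match("https://example.com/*/t*st", "https://example.com/blog/trust"))
```

Because a mismatch only ever rewinds to the latest asterisk, each asterisk can trigger at most one rescan of the remaining input, which is where the s ⋅ ℓ term in the bound comes from.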

Unfortunately, `wildmatch` did not meet all our requirements. Ruleset Engine uses byte-oriented matching, and `wildmatch` works only on UTF-8 strings. We also have to support escape sequences: for example, you should be able to represent a literal `*` in the pattern with `\*`.
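The two requirements above, byte-oriented input and escape sequences, can be illustrated with a small Python tokenizer that works on `bytes` rather than text. The names `STAR` and `tokenize` are hypothetical, and treating a backslash before any byte as "take the next byte literally" is an assumption; the text only confirms that `\*` stands for a literal asterisk.

```python
STAR = object()  # sentinel token standing for an unescaped `*`

def tokenize(pattern: bytes):
    """Split a byte pattern into a list of literal `bytes` chunks and STAR tokens."""
    tokens = []
    literal = bytearray()
    i = 0
    while i < len(pattern):
        byte = pattern[i]
        if byte == ord("\\"):
            if i + 1 >= len(pattern):
                raise ValueError("dangling backslash at end of pattern")
            # The escaped byte is copied as-is, so `\*` becomes a literal asterisk.
            literal.append(pattern[i + 1])
            i += 2
            continue
        if byte == ord("*"):
            # Flush the literal collected so far, then record the wildcard.
            if literal:
                tokens.append(bytes(literal))
                literal.clear()
            tokens.append(STAR)
        else:
            literal.append(byte)
        i += 1
    if literal:
        tokens.append(bytes(literal))
    return tokens

# The input does not have to be valid UTF-8: 0xFF is handled like any other byte.
print(tokenize(rb"rate\*limit/*" + b"\xff"))
```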

Last but not least, to implement the `wildcard_replace()` function we needed not only to be able to match, but also to be able to replace parts of strings with captured segments. This is necessary to dynamically create HTTP redirects based on the source URL. For example, to redirect a request from `https://example.com/*/page/*` to `https://example.com/products/${1}?page=${2}`, you should be able to define the target URL using an expression like this:

wildcard_replace(
http.request.full_uri, 
"https://example.com/*/page/*", 
"https://example.com/products/${1}?page=${2}"
)

This means that in order to implement this function in the Ruleset Engine, we also need our wildcard matching implementation to capture the input substrings that match the wildcard’s metacharacters.

Given these requirements, we decided to build our own wildcard matching crate. The implementation is based on Kurt’s 2016 iterative algorithm, with optimizations from Krauss’ 2014 algorithm. (You can find more information about the algorithm here). Our implementation supports byte-oriented matching, escape sequences, and capturing matched segments for further processing.

Cloudflare’s `wildcard` crate is now available and is open-source. You can find the source repository here. Contributions are welcome!

FAQs and resources

For more details on using wildcards in Rules products, please refer to our updated Ruleset Engine documentation.

We value your feedback and invite you to share your thoughts in our community forums. Your input directly influences our product and design decisions, helping us make Rules products even better.

Additionally, check out our `wildcard` crate implementation and contribute to its development.

Conclusion

The new wildcard functionality in Rules is available to all plans and is completely free. This feature is rolling out immediately, and no beta access registration is required.

We are thrilled to offer this much-requested feature and look forward to seeing how you leverage wildcards in your Rules configurations. Try it now and experience the enhanced flexibility and performance. Your feedback is invaluable to us, so please let us know in the community how this new feature works for you!

The taps in the villages of Rodopi are dry, but the problem is not a lack of rain

Post Syndicated from VassilKendov original https://kendov.com/%D0%B2%D0%BE%D0%B4%D0%B0-%D0%B2-%D1%87%D0%B5%D1%88%D0%BC%D0%B8%D1%82%D0%B5-%D0%BD%D0%B0-%D1%81%D0%B5%D0%BB%D0%B0%D1%82%D0%B0-%D0%BE%D1%82-%D1%80%D0%BE%D0%B4%D0%BE%D0%BF%D0%B8-%D0%BD%D1%8F%D0%BC%D0%B0/


Much has been written about the Maya. Their civilisation still keeps its secrets, although some of their customs have been studied quite thoroughly.
For them, the solution to every natural problem came down to sacrifices. During a prolonged drought, for example, the rulers had the task of begging the gods for rain. They in turn passed the ball to their subjects and organised public sacrifices. In one of the sacred caves at Chichen Itza, where the bodies from the sacrifices were thrown, 127 bodies have been found, and 80% of them were boys aged 3 to 11. My personal guess is that these were the children of political opponents.

But what happened when the rulers failed to obtain rain from the gods?
Quite simply: the rulers and their families were sacrificed, and a new ruler was chosen.
I don't mean to imply anything, but I think the Maya were a rather wise people. That is why we admire the achievements of their civilisation to this day.

Let us now see how we, with our achievements 1,000 years after them, ought to deal with a lack of water in the 21st century.

Pipes are out of sight and no good for PR

Perhaps one of the most important reasons the water problem does not get solved is that the pipes are underground and cannot be seen. On the other hand, digging up and closing streets causes residents a good deal of inconvenience, and that is not good. We all know how important PR is for the mayor of Rodopi Municipality. It is a different matter when it is a lamp, or a street, or some renovated building… You walk along and bump into it every day. You cannot fail to see it. And you say to yourself, “Our mayor is a good one. Look how many things he does for the village.” And the water? Well, once the rains come, everything will sort itself out.
Fine, but it has been like this for 5 years.

The administration has no water problem

The employees of Rodopi Municipality, every one of them, have homes in Plovdiv. And in Plovdiv there is no water problem.
Wagging tongues say that the chairman of the Municipal Council, Mr. Vlado Marinov, who lives in the village of Boykovo, has no water problem at his guest house. But the rest of Boykovo has that problem every summer.
I am a resident of Boykovo myself. Last Sunday I went for a walk on the mountain above the village and reached the neighbouring village, Sitovo. It has no water either. The fountains in the forest, however, are all gushing with water. Evidently there is water, and rain is not a factor. I also took a look at the old catchments (new ones were built years ago). Well, they have water, and some are even overflowing. However I look at it, there is more than enough water in the forest. In the villages, though, there is none, which leads me to another thought.

In Boykovo specifically, I looked at the pipes that lead from the catchments to the village. They are rather narrow and they are old. Two weeks ago, while the forest road was being patched up for the village fair, the pipe was burst, and it is now visible above ground. In my opinion it is 1.5 inches, but I cannot swear to it. I was not carrying calipers to measure it. Yet the water and sewerage (ViK) network in Boykovo was replaced years ago with entirely new pipes, which are considerably wider than those that deliver the water from the catchment. How clogged they are is a separate question. I do know, however, that ViK wants to be granted a right of way over, or ownership of, the land through which the pipes from the catchments run, so that it can replace them. And it is right. Many of the pipes run through private properties, because they were laid under communism. How do you imagine ViK digging on private property?

And here we turn to the Water Act

Art. 10, para. 4 (2) The policy for the operation and reconstruction of water and sewerage infrastructure is carried out by the mayor of the municipality.

para. 1 The municipal council adopts a Programme for the development of the water and sewerage sector

para. 2 The mayor of the municipality develops the Programme and submits it for public discussion.

And since that is the law, we keep looking to see what else the municipality has failed to do



The money set aside in the budget is not actually used

At the meetings where the budget is discussed, I am almost always alone, with few exceptions. I remember how residents of the village of Markovo, organised by the mayor, Ms. Terzieva, came to one such meeting and accordingly received the largest share of the money from the loan that had been drawn, for the repair of several streets in the village.
Never mind that everyone will be repaying the loan; the money goes to Markovo.

THAT IS HOW IT SHOULD BE! Money should follow those who take the initiative!

Yet Markovo, too, now has water rationing. I don't feel like digging through the budget again (if someone pays, I will of course dig it up for them), but the recent budgets included capital expenditure for water collectors that were supposed to solve Markovo's water problem. And so it went for several years. The land had been allocated and, according to Mayor Mihaylov, a project existed, but somehow the collectors are nowhere to be seen.
We all know how complicated the mayor's job is, but according to the law above, water policy is carried out by him. That is his job. It ought to be a priority as well. It is all very nice that we have LED lamps or celebrations with folk dancing in every village, but the water…

Perhaps few people know that the aqueduct at the Komatevo junction in Plovdiv supplied Trimontium with water precisely from the lands of Markovo. At that time, according to historians' estimates, between 60,000 and 80,000 people lived in Trimontium.
About 5,500 people live in the present-day village of Markovo. Of course, these are people with much higher water consumption than the average resident of ancient Trimontium. The village has many houses with swimming pools and lawns that need irrigation, so the water consumption of these 5,500 may well be greater than that of the ancient inhabitants, even if there were more than 60,000 of them.
But the opposite is also true. Today we do not build aqueducts; we lay pipes in trenches dug with loaders and excavators. Water losses ought to be far smaller than with the aqueduct, and water can be carried over many kilometres thanks to electricity and pumping stations.

It can be done, but someone has to have the vision to do it, and 5 years are evidently not enough for that vision in Rodopi Municipality. They are not Romans, after all, and they are here only until the next elections.

Population growth

Here is what Ms Terzieva, mayor of the village of Markovo, says:

“The recognition is great; this is a very important award for any public figure who has decided to work for the good of the people. Markovo is a cause. The village is growing considerably; it now has about 5,500 residents. The trend over the last 4-5 years has been for young families to move to settlements close to the big city. Markovo is only 5 km from Plovdiv, which makes it an attractive place,” Terzieva told Bulgaria ON AIR.

I think that if my son, who is in sixth grade, read these lines, he would immediately understand that the infrastructure will not withstand this growth and that investment is needed to improve it. As for the municipal administration, I cannot vouch for what conclusions they draw from this information.

And since investment requires funds, we come to the next apparently unsolvable problem

How much money is needed, where it should come from, and what we missed

I cannot answer this question off the cuff. I can make an educated guess, though.
From experience I know that replacing the sewerage of a village like Markovo would cost about 30 million leva. A treatment plant is extra. Rodopi Municipality’s latest budget, for 2024, is 52.5 million.

In short, without a loan the water supply problem will not be solved. It is simply beyond the capacity of Rodopi Municipality, both financially and, as is evident, administratively.

The villages of Rodopi Municipality are supplied with water by ViK Plovdiv (the water and sewerage utility). Plovdiv itself has about 500,000 inhabitants, yet it has no water problem. Its budget, however, is 684 million. Its ability to obtain credit is accordingly also 10 times better than that of Rodopi Municipality.
It is another matter that Rodopi Municipality spent its loan not on water infrastructure but on lamps and roads. Nothing wrong with that, of course, but it destroys the possibility of obtaining a loan for water and sewerage infrastructure.

By now people ought to learn the word “PRIORITY” when it comes to spending. Clearly, streets and lamps are visible, but you cannot drink them. I can drive on a bad road and make do with an old lamp, but with no water in the tap… We might as well grab the washtubs and go do the laundry in the river. No need to worry about the municipal administration: they go home to Plovdiv, where everything is available.

And here is a rhetorical question for the residents of the village of Belashtitsa: “How do you expect money for infrastructure on Rodopi Municipality’s budget? Didn’t you refuse to be within the boundaries of Plovdiv?”

The political side of things

If I were to stoop to the level of the municipal administration, I would ask, in true GERB fashion: “And how do you expect us to give you state money for water and sewerage when you have elected a communist mayor?”
As you know, there is never enough money. Hundreds of other villages, with mayors close to those in power, are in the same position as the villages of Rodopi Municipality. Which village’s water should we fix first? Brestovitsa’s (with a communist mayor), or that of some other village whose mayor is loyal to another political force?

To conclude, I will be satisfied if people come away with the understanding that solving the water problem is not a one-off act. Many preconditions are needed for it to happen. To my mind, this order is the best:

Vote for people with vision – Attend Municipal Council meetings and ask your questions, however uncomfortable they may be – Unite around people with vision rather than PR mania – Organise protests in front of Rodopi Municipality – Do not stay silent when trolls rage on social media and try to pull the wool over your eyes.

Otherwise, the “White Butterfly” custom will also do the job until next summer, when the problem will be the same.

Vasil Kendov, a resident of waterless Boykovo


The post “There is no water in the taps of the Rodopi villages, but the problem is not a lack of rain” appeared first on Kendov.com.

Migrating from Datadog to Zabbix with Custom Metric Submission

Post Syndicated from Chris Board original https://blog.zabbix.com/migrating-from-datadog-to-zabbix-with-custom-metric-submission/28620/

For a few years, I’ve been monitoring 3 Digital Ocean servers with Datadog and using Datadog DogStatsD to submit custom metrics to Datadog. I am a big fan of Datadog and will continue recommending them. However, it became a bit too expensive for my needs, so I started looking for alternative options.

I decided to go down the self-hosted route as that was the least expensive option. I decided to go with Zabbix.

If you don’t know, Zabbix is a completely free, open source, and enterprise-ready monitoring service with a vast range of integrations for all of your monitoring needs. You can choose to install it on-premises or in the cloud.

I went with a $24 Basic Droplet in Digital Ocean with Regular SSD, which is actually below the minimum requirements that Zabbix specifies. It has been working fine and resource usage is minimal (around 40% RAM and 4% CPU use).

When you create a host to monitor, you assign templates. The templates are integrations you want to monitor, such as Apache, MySQL, and general Zabbix agent metrics like server performance (CPU, RAM, IO, etc.).

There were some things I had to create manually (including process monitoring) as I couldn’t find a built-in way of doing it. Datadog had live process monitoring, so you could create a monitor that looks for a particular process and alerts if that process isn’t running.

Zabbix didn’t seem to have anything like this (that I could find) so I created custom templates and a custom shell script to look for the process name using the ps command (on Linux).
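As a rough illustration of the kind of check described above, here is a minimal Python sketch. The author's actual script is a shell script and is not shown in the post, so the function names and the exact `ps -eo comm=` invocation are assumptions made for this example.

```python
import subprocess


def process_in_listing(process_name: str, ps_output: str) -> bool:
    """Return True if any line of the `ps` output equals the command name."""
    names = (line.strip() for line in ps_output.splitlines())
    return any(name == process_name for name in names)


def is_process_running(process_name: str) -> bool:
    """Ask `ps` for the command name of every process and look for a match."""
    # -e selects all processes; -o comm= prints only the command name, no header.
    result = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    )
    return process_in_listing(process_name, result.stdout)
```

The result could then be reported to Zabbix as a 1/0 value that a trigger alerts on; how the author wired this into the custom template is not stated.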

Another important function I needed was custom metric submission. This was originally done via the Datadog DogStatsD libraries available in pretty much any language, either as official libraries or via community versions. This would submit UDP data to the agent running locally on the server, and the agent would submit it to your Datadog account.

I didn’t want to rewrite all my apps to be able to send data to Zabbix, so I built a conversion tool. It’s a small app I built in C# that listens on the same UDP socket as the Datadog agent (obviously, you’ll need to have the Datadog agent turned off). It receives the data from the Datadog DogStatsD libraries as normal, converts the Datadog UDP data, and submits an HTTP request to the Zabbix server via its API.
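To give a feel for the conversion step, the sketch below parses a single DogStatsD datagram into a structured value. It is written in Python rather than C# and is not the author's code; the `Metric` class and `parse_dogstatsd` function are hypothetical names, and only the common `name:value|type|@rate|#tags` layout is handled.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    value: float
    kind: str                      # e.g. "c" (counter), "g" (gauge), "ms" (timer)
    sample_rate: float = 1.0
    tags: dict = field(default_factory=dict)


def parse_dogstatsd(datagram: str) -> Metric:
    """Parse one DogStatsD line: name:value|type[|@rate][|#tag:val,tag2]."""
    head, *sections = datagram.strip().split("|")
    name, _, raw_value = head.partition(":")
    if not name or not raw_value or not sections:
        raise ValueError(f"malformed datagram: {datagram!r}")
    metric = Metric(name=name, value=float(raw_value), kind=sections[0])
    for section in sections[1:]:
        if section.startswith("@"):        # optional sample rate
            metric.sample_rate = float(section[1:])
        elif section.startswith("#"):      # optional comma-separated tags
            for tag in section[1:].split(","):
                key, _, val = tag.partition(":")
                metric.tags[key] = val
    return metric
```

A converter would read such lines from the UDP socket the DogStatsD client libraries send to (port 8125 by default) and forward each parsed metric to Zabbix. The forwarding call is deliberately left out, because the post does not say which Zabbix API method the app uses.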

After everything was installed, I re-created in Zabbix the various dashboards that I had in Datadog.

In terms of access and configuration, all of the metrics are sent over the private interfaces of each droplet. Nothing is available via the public interface.

Logging into the Zabbix web portal is done via a Cloudflare Tunnel, which lets me reach the portal over the private interface. The tunnel runs on each of the servers for fault tolerance. This provides multiple levels of authentication, as you have to authenticate to Cloudflare and then authenticate with Zabbix.

This post was designed as an overview to show that it is possible to migrate from Datadog to Zabbix fairly easily, with a small amount of development involved to convert Datadog custom metrics to Zabbix via the C# app.

The C# app isn’t publicly available, but if there is some demand for it I can look at open sourcing it. If you want a full rundown of how I migrated and set up the Zabbix server and the servers being monitored, please let me know and I can do a more in-depth blog post!

 

 

The post Migrating from Datadog to Zabbix with Custom Metric Submission appeared first on Zabbix Blog.

What the fuck is an SBAT and why does everyone suddenly care

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/70348.html

Short version: Secure Boot Advanced Targeting and if that’s enough for you you can skip the rest you’re welcome.

Long version: When UEFI Secure Boot was specified, everyone involved was, well, a touch naive. The basic security model of Secure Boot is that all the code that ends up running in a kernel-level privileged environment should be validated before execution – the firmware verifies the bootloader, the bootloader verifies the kernel, the kernel verifies any additional runtime loaded kernel code, and now we have a trusted environment to impose any other security policy we want. Obviously people might screw up, but the spec included a way to revoke any signed components that turned out not to be trustworthy: simply add the hash of the untrustworthy code to a variable, and then refuse to load anything with that hash even if it’s signed with a trusted key.

Unfortunately, as it turns out, scale. Every Linux distribution that works in the Secure Boot ecosystem generates their own bootloader binaries, and each of them has a different hash. If there’s a vulnerability identified in the source code for said bootloader, there’s a large number of different binaries that need to be revoked. And, well, the storage available to store the variable containing all these hashes is limited. There’s simply not enough space to add a new set of hashes every time it turns out that grub (a bootloader initially written for a simpler time when there was no boot security and which has several separate image parsers and also a font parser and look you know where this is going) has another mechanism for a hostile actor to cause it to execute arbitrary code, so another solution was needed.

And that solution is SBAT. The general concept behind SBAT is pretty straightforward. Every important component in the boot chain declares a security generation that’s incorporated into the signed binary. When a vulnerability is identified and fixed, that generation is incremented. An update can then be pushed that defines a minimum generation – boot components will look at the next item in the chain, compare its name and generation number to the ones stored in a firmware variable, and decide whether or not to execute it based on that. Instead of having to revoke a large number of individual hashes, it becomes possible to push one update that simply says “Any version of grub with a security generation below this number is considered untrustworthy”.
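The comparison described above can be modelled in a few lines. This is a deliberately simplified Python sketch, not shim's implementation: it assumes each entry is a comma-separated line whose first two fields are the component name and its generation, ignores all further fields, and uses hypothetical function names.

```python
def parse_sbat(text: str) -> dict:
    """Map component name -> generation from 'name,generation[,...]' lines."""
    generations = {}
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        fields = line.split(",")
        generations[fields[0].strip()] = int(fields[1])
    return generations


def is_allowed(binary_sbat: str, revocations: str) -> bool:
    """Reject the binary if any component is below its minimum generation."""
    minimums = parse_sbat(revocations)
    for component, generation in parse_sbat(binary_sbat).items():
        # Components with no stated minimum are not restricted.
        if generation < minimums.get(component, 0):
            return False
    return True
```

With a revocation list of `grub,3`, a grub binary that declares `grub,2` is refused while one that declares `grub,3` is allowed, without either binary's hash ever being listed.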

So why is this suddenly relevant? SBAT was developed collaboratively between the Linux community and Microsoft, and Microsoft chose to push a Windows update that told systems not to trust versions of grub with a security generation below a certain level. This was because those versions of grub had genuine security vulnerabilities that would allow an attacker to compromise the Windows secure boot chain, and we’ve seen real world examples of malware wanting to do that (Black Lotus did so using a vulnerability in the Windows bootloader, but a vulnerability in grub would be just as viable for this). Viewed purely from a security perspective, this was a legitimate thing to want to do.

(An aside: the “Something has gone seriously wrong” message that’s associated with people having a bad time as a result of this update? That’s a message from shim, not any Microsoft code. Shim pays attention to SBAT updates in order to avoid violating the security assumptions made by other bootloaders on the system, so even though it was Microsoft that pushed the SBAT update, it’s the Linux bootloader that refuses to run old versions of grub as a result. This is absolutely working as intended)

The problem we’ve ended up in is that several Linux distributions had not shipped versions of grub with a newer security generation, and so those versions of grub are assumed to be insecure (it’s worth noting that grub is signed by individual distributions, not Microsoft, so there’s no externally introduced lag here). Microsoft’s stated intention was that Windows Update would only apply the SBAT update to systems that were Windows-only, and any dual-boot setups would instead be left vulnerable to attack until the installed distro updated its grub and shipped an SBAT update itself. Unfortunately, as is now obvious, that didn’t work as intended and at least some dual-boot setups applied the update and that distribution’s Shim refused to boot that distribution’s grub.

What’s the summary? Microsoft (understandably) didn’t want it to be possible to attack Windows by using a vulnerable version of grub that could be tricked into executing arbitrary code and then introduce a bootkit into the Windows kernel during boot. Microsoft did this by pushing a Windows Update that updated the SBAT variable to indicate that known-vulnerable versions of grub shouldn’t be allowed to boot on those systems. The distribution-provided Shim first-stage bootloader read this variable, read the SBAT section from the installed copy of grub, realised these conflicted, and refused to boot grub with the “Something has gone seriously wrong” message. This update was not supposed to apply to dual-boot systems, but did anyway. Basically:

1) Microsoft applied an update to systems where that update shouldn’t have been applied
2) Some Linux distros failed to update their grub code and SBAT security generation when exploitable security vulnerabilities were identified in grub

The outcome is that some people can’t boot their systems. I think there’s plenty of blame here. Microsoft should have done more testing to ensure that dual-boot setups could be identified accurately. But also distributions shipping signed bootloaders should make sure that they’re updating those and updating the security generation to match, because otherwise they’re shipping a vector that can be used to attack other operating systems and that’s kind of a violation of the social contract around all of this.

It’s unfortunate that the victims here are largely end users faced with a system that suddenly refuses to boot the OS they want to boot. That should never happen. I don’t think asking arbitrary end users whether they want secure boot updates is likely to result in good outcomes, and while I vaguely tend towards UEFI Secure Boot not being something that benefits most end users it’s also a thing you really don’t want to discover you want after the fact so I have sympathy for it being default on, so I do sympathise with Microsoft’s choices here, other than the failed attempt to avoid the update on dual boot systems.

Anyway. I was extremely involved in the implementation of this for Linux back in 2012 and wrote the first prototype of Shim (which is now a massively better bootloader maintained by a wider set of people and that I haven’t touched in years), so if you want to blame an individual please do feel free to blame me. This is something that shouldn’t have happened, and unless you’re either Microsoft or a Linux distribution it’s not your fault. I’m sorry.


Now open — AWS Asia Pacific (Malaysia) Region

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/now-open-aws-asia-pacific-malaysia-region/

In March of last year, Jeff Barr announced the plan for an AWS Region in Malaysia. Today, I’m pleased to share the general availability of the AWS Asia Pacific (Malaysia) Region with three Availability Zones and API name ap-southeast-5.

The AWS Asia Pacific (Malaysia) Region is the first infrastructure Region in Malaysia and the thirteenth Region in Asia Pacific, joining the existing Asia Pacific Regions in Hong Kong, Hyderabad, Jakarta, Melbourne, Mumbai, Osaka, Seoul, Singapore, Sydney, and Tokyo and the Mainland China Beijing and Ningxia Regions.

The Petronas Twin Towers in the heart of Kuala Lumpur’s central business district.

The new AWS Region in Malaysia will play a pivotal role in supporting the Malaysian government’s strategic Madani Economy Framework. This initiative aims to improve the living standards of all Malaysians by 2030 while supporting innovation in Malaysia and across ASEAN. The construction and operation of the new AWS Region is estimated to add approximately $12.1 billion (MYR 57.3 billion) to Malaysia’s gross domestic product (GDP) and will support an average of more than 3,500 full-time equivalent jobs at external businesses annually through 2038.

The AWS Region in Malaysia will help to meet the high demand for cloud services while supporting innovation in Malaysia and across Southeast Asia.

AWS in Malaysia
In 2016, Amazon Web Services (AWS) established a presence with its first AWS office in Malaysia. Since then, AWS has provided continuous investments in infrastructure and technology to help drive digital transformations in Malaysia in support of hundreds of thousands of active customers each month.

Amazon CloudFront – In 2017, AWS announced the launch of the first edge location in Malaysia, which helps improve performance and availability for end users. Today, there are four Amazon CloudFront locations in Malaysia.

AWS Direct Connect – To continue helping our customers in Malaysia improve application performance, secure data, and reduce networking costs, in 2017, AWS announced the opening of additional Direct Connect locations in Malaysia. Today, there are two AWS Direct Connect locations in Malaysia.

AWS Outposts – As a fully managed service that extends AWS infrastructure and AWS services, AWS Outposts is ideal for applications that need to run on-premises to meet low latency requirements. Since 2020, customers in Malaysia have been able to order AWS Outposts to be installed at their datacenters and on-premises locations.

AWS customers in Malaysia
Cloud adoption in Malaysia has been steadily gaining momentum in recent years. Here are some examples of AWS customers in Malaysia and how they are using AWS for various workloads:

PayNet – PayNet is Malaysia’s national payments network and shared central infrastructure for the financial market in Malaysia. PayNet uses AWS to run critical national payment workloads, including the MyDebit online cashless payments system and e-payment reporting.

Pos Malaysia Berhad (Pos Malaysia) – Pos Malaysia is the national post and parcel service provider, holding the sole mandate to deliver services under the universal postal service obligation for Malaysia. They migrated critical applications to AWS, which increased their business agility and ability to deliver enhanced customer experiences. Also, they scaled their compute capacity to handle deliveries to more than 11 million addresses and a network of more than 3,500 retail touchpoints using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Block Store (Amazon EBS), ensuring disruption-free services.

Deriv – Deriv, one of the world’s largest online brokers, is using Amazon Q Business to increase productivity, efficiency, and innovation in its operations across customer support, marketing, and recruiting departments. With Amazon Q Business, Deriv has been able to boost productivity and reduce onboarding time by 45 percent.

Asia Pacific University – As one of the leading tech universities in Malaysia, Asia Pacific University (APU) uses AWS serverless technology such as Lambda to reduce operational costs. The automated scalability of AWS services has led to high availability and faster deployment that ensure APU’s applications and services are accessible to the students and staff at all times, enhancing the overall user experience. 

Aerodyne – Aerodyne Group is a DT3 (Drone Tech, Data Tech, and Digital Transformation) solutions provider of drone-based enterprise solutions. They’re running their DRONOS software as a service (SaaS) platform on AWS to help drone operators worldwide grow their businesses.

Building cloud skills together
AWS and various organizations in Malaysia have been working closely to build necessary cloud skills for builders in Malaysia. Here are some of the initiatives:

Program AKAR powered by AWS re/Start – Program AKAR is the first financial services-aligned cloud skills program initiated by AWS and PayNet. This new program aims to bridge the growing skills gap in Malaysia’s digital economy by equipping university students with transferrable skills for careers in the sector. As part of this initial collaboration, PayNet, AWS re/Start, and WEPS have committed to starting the program with 100 students in 2024, with the first 50 from Asia Pacific University serving as a pilot. 

AWS Academy – AWS Academy aims to bridge the gap between industry and academia by preparing students for industry-recognized certifications and careers in the cloud with a free and ready-to-teach cloud computing curriculum. AWS Academy currently runs courses in 48 Malaysian universities, covering various domains. Since 2018, 23,000 students have been trained through this program.

AWS Skills Guild at PETRONAS – PETRONAS, a global energy and solutions provider with a presence in over 50 countries, has been an AWS customer since 2014. AWS is also collaborating with PETRONAS to train their employees using the AWS Skills Guild program.

AWS’s contribution to sustainability in Malaysia
With The Climate Pledge, Amazon is committed to reaching net-zero carbon across its business by 2040 and is on a path to powering its operations with 100 percent renewable energy by 2025.

In September 2023, AWS announced its collaboration with Petronas and Gentari, a global clean energy company, to accelerate sustainability and decarbonization efforts in the global energy transition. Shortly after, in December 2023, AWS customer PKT Logistics Group became the first Malaysian company to join over 300 global companies in The Climate Pledge to accelerate the world’s path to net-zero carbon.

In July 2024, AWS and Zero Waste Management collaborated on the first-ever AWS InCommunities Malaysia initiative, Green Wira Programme, to train educators to build sustainability initiatives in schools to advance Malaysia’s sustainable future.

Amazon is committed to investing and innovating across its businesses to help create a more sustainable future.

Things to know
AWS Community in Malaysia – Malaysia is also home to one AWS Hero, nine AWS Community Builders and about 9,000 community members of three AWS User Groups in various cities in Malaysia. If you’re interested in joining AWS User Groups Malaysia, visit their Meetup and Facebook pages.

AWS Global footprint – With this launch, AWS now spans 108 Availability Zones within 34 geographic Regions around the world. We have also announced plans for 18 more Availability Zones and six more AWS Regions in Mexico, New Zealand, the Kingdom of Saudi Arabia, Taiwan, Thailand, and the AWS European Sovereign Cloud.

Available now – The new Asia Pacific (Malaysia) Region is ready to support your business, and you can find a detailed list of the services available in this Region on the AWS Services by Region page.

To learn more, please visit the AWS Global Infrastructure page, and start building on ap-southeast-5!

Happy building!
— Donnie

“Something has gone seriously wrong,” dual-boot systems warn after Microsoft update (Ars Technica)

Post Syndicated from jzb original https://lwn.net/Articles/986659/

Ars Technica covers
a recent update
that is causing problems for users with systems that dual-boot Windows
and Linux.

“Note that Windows says this update won’t apply to systems that
dual-boot Windows and Linux,” one frustrated person wrote. “This
obviously isn’t true, and likely depends on your system configuration
and the distribution being run. It appears to have made some linux efi
shim bootloaders incompatible with microcrap efi bootloaders (that’s
why shifting from MS efi to ‘other OS’ in efi setup works). It appears
that Mint has a shim version that MS SBAT doesn’t recognize.”

The reports indicate that multiple distributions, including Debian,
Ubuntu, Linux Mint, Zorin OS, and Puppy Linux, are all
affected. Microsoft has yet to acknowledge the error publicly, explain
how it wasn’t detected during testing, or provide technical guidance
to those affected. Company representatives didn’t respond to an email
seeking answers.

Górny: Gentoo: profiles and keywords rather than releases

Post Syndicated from jzb original https://lwn.net/Articles/986655/

Gentoo developer Michał Górny has written a lengthy blog
post
that explains how Gentoo approaches releases:

Gentoo is something of a hybrid, as it combines the best of both
worlds. It is a rolling release distribution with a single shared
repository that is available to all users. However, within this
repository we use a keywording system to provide a choice between
stable and testing packages, to facilitate both production and
development systems (with some extra flexibility), and versioned
profiles to tackle major lock-step upgrades.

Optimize cost and performance for Amazon MWAA

Post Syndicated from Sriharsh Adari original https://aws.amazon.com/blogs/big-data/optimize-cost-and-performance-for-amazon-mwaa/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed service for Apache Airflow that allows you to orchestrate data pipelines and workflows at scale. With Amazon MWAA, you can design Directed Acyclic Graphs (DAGs) that describe your workflows without managing the operational burden of scaling the infrastructure. In this post, we provide guidance on how you can optimize performance and save cost by following best practices.

Amazon MWAA environments include four Airflow components hosted on groups of AWS compute resources: the scheduler that schedules the work, the workers that implement the work, the web server that provides the UI, and the metadata database that keeps track of state. For intermittent or varying workloads, optimizing costs while maintaining performance is crucial. This post outlines best practices to achieve cost optimization and efficient performance in Amazon MWAA environments, with detailed explanations and examples. It may not be necessary to apply all of these best practices for a given Amazon MWAA workload; you can selectively choose and implement relevant and applicable principles for your specific workloads.

Right-sizing your Amazon MWAA environment

Right-sizing your Amazon MWAA environment makes sure you have an environment that is able to concurrently scale across your different workloads to provide the best price-performance. The environment class you choose for your Amazon MWAA environment determines the size and the number of concurrent tasks supported by the worker nodes. In Amazon MWAA, you can choose from five different environment classes. In this section, we discuss the steps you can follow to right-size your Amazon MWAA environment.

Monitor resource utilization

The first step in right-sizing your Amazon MWAA environment is to monitor the resource utilization of your existing setup. You can monitor the underlying components of your environments using Amazon CloudWatch, which collects raw data and processes data into readable, near real-time metrics. With these environment metrics, you have greater visibility into key performance indicators to help you appropriately size your environments and debug issues with your workflows. Based on the concurrent tasks needed for your workload, you can adjust the environment size as well as the maximum and minimum workers needed. CloudWatch will provide CPU and memory utilization for all the underlying AWS services utilized by Amazon MWAA. Refer to Container, queue, and database metrics for Amazon MWAA for additional details on available metrics for Amazon MWAA. These metrics also include the number of base workers, additional workers, schedulers, and web servers.

Analyze your workload patterns

Next, take a deep dive into your workflow patterns. Examine DAG schedules, task concurrency, and task runtimes. Monitor CPU/memory usage during peak periods. Query CloudWatch metrics and Airflow logs. Identify long-running tasks, bottlenecks, and resource-intensive operations for optimal environment sizing. Understanding the resource demands of your workload will help you make informed decisions about the appropriate Amazon MWAA environment class to use.

Choose the right environment class

Match requirements to Amazon MWAA environment class specifications (mw1.small to mw1.2xlarge) that can handle your workload efficiently. You can vertically scale up or scale down an existing environment through an API, the AWS Command Line Interface (AWS CLI), or the AWS Management Console. Be aware that a change in the environment class requires a scheduled downtime.

Fine tune configuration parameters

Fine-tuning configuration parameters in Apache Airflow is crucial for optimizing workflow performance and reducing cost. It allows you to tune settings such as auto scaling, parallelism, logging, and DAG code optimizations.

Auto scaling

Amazon MWAA supports worker auto scaling, which automatically adjusts the number of running worker and web server nodes based on your workload demands. You can specify the minimum and maximum number of Airflow workers that run in your environment. For worker node auto scaling, Amazon MWAA uses RunningTasks and QueuedTasks metrics, where (tasks running + tasks queued) / (tasks per worker) = (required workers). If the required number of workers is greater than the current number of running workers, Amazon MWAA will add additional worker instances using AWS Fargate, up to the maximum value specified by the maximum worker configuration.

Auto scaling in Amazon MWAA will gracefully downscale when there are more additional workers than required. For example, let’s assume a large Amazon MWAA environment with a minimum of 1 worker and a maximum of 10, where each large Amazon MWAA worker can support up to 20 tasks. Let’s say, each day at 8:00 AM, DAGs start up that use 190 concurrent tasks. Amazon MWAA will automatically scale to 10 workers, because the required workers = 190 requested tasks (some running, some queued) / 20 (tasks per worker) = 9.5 workers, rounded up to 10. At 10:00 AM, half of the tasks complete, leaving 95 running. Amazon MWAA will then downscale to 5 workers (95 tasks / 20 tasks per worker = 4.75 workers, rounded up to 5). Any workers that are still running tasks remain protected during downscaling until they’re complete, and no tasks will be interrupted. As the queued and running tasks decrease, Amazon MWAA will remove workers without affecting running tasks, down to the minimum specified worker count.
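The worker calculation can be written out as a small function. This is only a sketch of the arithmetic given in the text, not Amazon MWAA's internal scaling logic; the function and parameter names are chosen for illustration.

```python
import math


def required_workers(running: int, queued: int, tasks_per_worker: int,
                     min_workers: int, max_workers: int) -> int:
    """Workers needed for the current load, clamped to the configured range."""
    # (tasks running + tasks queued) / (tasks per worker), rounded up.
    needed = math.ceil((running + queued) / tasks_per_worker)
    return max(min_workers, min(needed, max_workers))
```

With the figures from the 8:00 AM example (190 running or queued tasks, 20 tasks per worker, a minimum of 1 and a maximum of 10), the function returns 10 workers.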

Web server auto scaling in Amazon MWAA allows you to automatically scale the number of web servers based on CPU utilization and active connection count. Amazon MWAA makes sure your Airflow environment can seamlessly accommodate increased demand, whether from REST API requests, AWS CLI usage, or more concurrent Airflow UI users. You can specify the maximum and minimum web server count while configuring your Amazon MWAA environment.

Logging and metrics

In this section, we discuss the steps to select and set the appropriate log configurations and CloudWatch metrics.

Choose the right log levels

If enabled, Amazon MWAA will send Airflow logs to CloudWatch. You can view the logs to determine Airflow task delays or workflow errors without the need for additional third-party tools. You need to enable logging to view Airflow DAG processing, tasks, scheduler, web server, and worker logs. You can enable Airflow logs at the INFO, WARNING, ERROR, or CRITICAL level. When you choose a log level, Amazon MWAA sends logs for that level and higher levels of severity. Standard CloudWatch logs charges apply, so reducing log levels where possible can reduce overall costs. Use the most appropriate log level based on environment, such as INFO for dev and UAT, and ERROR for production.

Set appropriate log retention policy

By default, logs are kept indefinitely and never expire. To reduce CloudWatch cost, you can adjust the retention policy for each log group.
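One way to apply a retention policy across an environment's log groups is through the CloudWatch Logs API. The sketch below assumes a boto3 CloudWatch Logs client is passed in and that the environment's log groups share a common name prefix; the helper name and the prefix shown in the usage note are hypothetical, and pagination of the listing is left out for brevity.

```python
def set_log_retention(logs_client, log_group_prefix: str, days: int) -> list:
    """Apply a retention period to every log group matching the prefix."""
    # Only the first page of results is handled in this sketch.
    response = logs_client.describe_log_groups(logGroupNamePrefix=log_group_prefix)
    updated = []
    for group in response.get("logGroups", []):
        logs_client.put_retention_policy(
            logGroupName=group["logGroupName"], retentionInDays=days
        )
        updated.append(group["logGroupName"])
    return updated
```

It could be called as `set_log_retention(boto3.client("logs"), "airflow-MyEnvironment-", 30)`, where the prefix is a placeholder for your own environment's log group names and 30 days is an arbitrary choice.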

Choose required CloudWatch metrics

You can choose which Airflow metrics are sent to CloudWatch by using the Amazon MWAA configuration option metrics.statsd_allow_list. Refer to the complete list of available metrics. Some metrics such as schedule_delay and duration_success are published per DAG, whereas others such as ti.finish are published per task per DAG.

Therefore, the cumulative number of DAGs and tasks directly influences your CloudWatch metric ingestion costs. To control CloudWatch costs, choose to publish selective metrics. For example, the following will only publish metrics that start with scheduler and executor:

metrics.statsd_allow_list = scheduler,executor

We recommend using metrics.statsd_allow_list with metrics.metrics_use_pattern_match.

An effective practice is to utilize regular expression (regex) pattern matching against the entire metric name instead of only matching the prefix at the beginning of the name.
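
The difference between prefix matching and regex matching can be illustrated with plain Python. This sketch does not reproduce Airflow's own filtering code; the two helper functions and the sample metric names are assumptions chosen only to show how the two matching styles select different metrics.

```python
import re
from typing import Iterable, List


def prefix_filter(names: Iterable[str], allow_list: str) -> List[str]:
    """Keep names that start with any comma-separated prefix."""
    prefixes = tuple(p.strip() for p in allow_list.split(","))
    return [n for n in names if n.startswith(prefixes)]


def pattern_filter(names: Iterable[str], allow_list: str) -> List[str]:
    """Keep names where any comma-separated regex matches somewhere in the name."""
    patterns = [re.compile(p.strip()) for p in allow_list.split(",")]
    return [n for n in names if any(p.search(n) for p in patterns)]


# Sample metric names, made up for illustration.
metrics = [
    "scheduler.heartbeat",
    "executor.open_slots",
    "dagrun.schedule_delay.sales_dag",
    "ti.finish.sales_dag.load.success",
]

print(prefix_filter(metrics, "scheduler,executor"))
print(pattern_filter(metrics, r"schedule_delay,\.success$"))
```

With prefix matching, only the first two names are kept. With regex matching, the per-DAG and per-task metrics can be selected by a fragment in the middle or at the end of the name.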

Monitor CloudWatch dashboards and set up alarms

Create a custom dashboard in CloudWatch and add alarms for a particular metric to monitor the health status of your Amazon MWAA environment. Configuring alarms allows you to proactively monitor the health of the environment.

Optimize AWS Secrets Manager invocations

Airflow has a mechanism to store secrets such as variables and connection information. By default, these secrets are stored in the Airflow meta database. Airflow users can optionally configure a centrally managed location for secrets, such as AWS Secrets Manager. When specified, Airflow will first check this alternate secrets backend when a connection or variable is requested. If the alternate backend contains the needed value, it is returned; if not, Airflow will check the meta database for the value and return that instead. One of the factors affecting the cost to use Secrets Manager is the number of API calls made to it.

On the Amazon MWAA console, you can configure the backend Secrets Manager path for the connections and variables that will be used by Airflow. By default, Airflow searches for all connections and variables in the configured backend. To reduce the number of API calls Amazon MWAA makes to Secrets Manager on your behalf, configure it to use a lookup pattern. By specifying a pattern, you narrow the possible paths that Airflow will look at. This will help in lowering your costs when using Secrets Manager with Amazon MWAA.

To use a secrets cache, enable AIRFLOW_SECRETS_USE_CACHE with a TTL, which reduces the number of Secrets Manager API calls.

For example, if you want to look up only a specific subset of connections, variables, or config in Secrets Manager, set the relevant *_lookup_pattern parameter. This parameter takes a regex string as its value. To look up connections starting with m in Secrets Manager, your configuration file should look like the following code:

[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {
  "connections_prefix": "airflow/connections",
  "connections_lookup_pattern": "^m",
  "profile_name": "default"
  }
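
To see what a lookup pattern such as ^m filters out, the following sketch applies the regex to a list of connection IDs. It only illustrates the regex itself, not the provider's internal code; the helper name and the connection IDs are hypothetical.

```python
import re
from typing import Iterable, List, Optional


def ids_sent_to_backend(conn_ids: Iterable[str],
                        lookup_pattern: Optional[str]) -> List[str]:
    """Return the connection IDs that would still be looked up in the secrets backend."""
    if lookup_pattern is None:
        return list(conn_ids)  # no pattern: every ID is looked up
    regex = re.compile(lookup_pattern)
    return [c for c in conn_ids if regex.search(c)]


# Hypothetical connection IDs.
conn_ids = ["mysql_reporting", "mssql_orders", "redshift_dw", "s3_landing"]
print(ids_sent_to_backend(conn_ids, "^m"))   # only the IDs starting with "m"
print(ids_sent_to_backend(conn_ids, None))   # all four IDs
```

In this example the pattern halves the number of IDs that would trigger a Secrets Manager call; the other two would be resolved from the meta database instead.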

DAG code optimization

Schedulers and workers are two components that are involved in parsing the DAG. After the scheduler parses the DAG and places it in a queue, the worker picks up the DAG from the queue. At that point, all the worker knows is the DAG_id and the Python file, along with some other info. The worker has to parse the Python file in order to run the task.

DAG parsing is run twice, once by the scheduler and then by the worker. Because the workers are also parsing the DAG, the amount of time it takes for the code to parse dictates the number of workers needed, which adds to the cost of running those workers.

For example, for a total of 200 DAGs having 10 tasks each, with each task taking 60 seconds plus 20 seconds to parse the DAG, we can calculate the following:

  • Total tasks across all DAGs = 2,000
  • Time per task = 60 seconds + 20 seconds (parse DAG)
  • Total time = 2,000 * 80 = 160,000 seconds
  • Total time per worker = 72,000 seconds
  • Number of workers needed = Total time/Total time per worker = 160,000/72,000 = 2.2, rounded up to 3

Now, let’s increase the time taken to parse the DAGs to 100 seconds:

  • Total tasks across all DAGs = 2,000
  • Time per task = 60 seconds + 100 seconds
  • Total time = 2,000 * 160 = 320,000 seconds
  • Total time per worker = 72,000 seconds
  • Number of workers needed = Total time/Total time per worker = 320,000/72,000 = 4.4, rounded up to 5

As you can see, when the DAG parsing time increased from 20 seconds to 100 seconds, the number of worker nodes needed increased from 3 to 5, thereby adding compute cost.
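
The two calculations above can be reproduced with a small helper, which makes it easy to try other parse times. This is a sketch of the arithmetic only; the function and parameter names are made up for illustration, and the 72,000 seconds per worker is taken directly from the example.

```python
import math


def workers_needed(num_dags: int, tasks_per_dag: int, task_seconds: int,
                   parse_seconds: int, seconds_per_worker: int) -> int:
    """Estimate workers from total task time, including DAG parse time per task."""
    total_tasks = num_dags * tasks_per_dag
    total_seconds = total_tasks * (task_seconds + parse_seconds)
    # Round up, because a fraction of a worker still requires a whole worker.
    return math.ceil(total_seconds / seconds_per_worker)


print(workers_needed(200, 10, 60, 20, 72_000))   # 160,000 / 72,000, rounded up to 3
print(workers_needed(200, 10, 60, 100, 72_000))  # 320,000 / 72,000, rounded up to 5
```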

To reduce the time it takes for parsing the code, follow the best practices in the subsequent sections.

Remove top-level imports

Code imports will run every time the DAG is parsed. If you don’t need the libraries being imported to create the DAG objects, move the import to the task level instead of defining it at the top. After it’s defined in the task, the import will be called only when the task is run.

Avoid multiple calls to databases such as the meta database or an external system database. DAGs often use variables that are defined in the meta database or in a backend system like Secrets Manager. Use Jinja templating so that the calls to populate those variables are made only at task runtime and not at DAG parsing time.

For example, see the following code:

import pendulum
from airflow import DAG
from airflow.decorators import task
import numpy as np  # <-- DON'T DO THAT!

with DAG(
    dag_id="example_python_operator",
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example"],
) as dag:

    @task()
    def print_array():
        """Print Numpy array."""
        import numpy as np  # <-- INSTEAD DO THIS!
        a = np.arange(15).reshape(3, 5)
        print(a)
        return a
    print_array()

The following code is another example:

# Bad example
from airflow.decorators import task
from airflow.models import Variable
from airflow.operators.bash import BashOperator

foo_var = Variable.get("foo")  # DON'T DO THAT

bash_use_variable_bad_1 = BashOperator(
    task_id="bash_use_variable_bad_1", bash_command="echo variable foo=${foo_env}", env={"foo_env": foo_var}
)

bash_use_variable_bad_2 = BashOperator(
    task_id="bash_use_variable_bad_2",
    bash_command=f"echo variable foo=${Variable.get('foo')}",  # DON'T DO THAT
)

bash_use_variable_bad_3 = BashOperator(
    task_id="bash_use_variable_bad_3",
    bash_command="echo variable foo=${foo_env}",
    env={"foo_env": Variable.get("foo")},  # DON'T DO THAT
)

# Good example
bash_use_variable_good = BashOperator(
    task_id="bash_use_variable_good",
    bash_command="echo variable foo=${foo_env}",
    env={"foo_env": "{{ var.value.get('foo') }}"},
)

@task
def my_task():
    # This is fine: my_task is called only when the task runs, not when the DAG file is parsed.
    var = Variable.get("foo")
    print(var)

Writing DAGs

Complex DAGs with a large number of tasks and dependencies between them can impact performance of scheduling. One way to keep your Airflow instance performant and well utilized is to simplify and optimize your DAGs.

For example, a DAG that has a simple linear structure A → B → C will experience fewer delays in task scheduling than a DAG that has a deeply nested tree structure with an exponentially growing number of dependent tasks.

Dynamic DAGs

In the following example, a DAG is defined with hardcoded table names from a database. A developer has to define N number of DAGs for N number of tables in a database.

# Bad example
from datetime import datetime

dag_params = getData()
no_of_dags = int(dag_params["no_of_dags"]['N'])
# build a dag for each number in no_of_dags
for n in range(no_of_dags):
    dag_id = 'dynperf_t1_{}'.format(str(n))
    default_args = {'owner': 'airflow', 'start_date': datetime(2022, 2, 2, 12, n)}

To reduce verbose and error-prone work, use dynamic DAGs. The following definition of the DAG is created after querying a database catalog, and creates as many DAGs dynamically as there are tables in the database. This achieves the same objective with less code.

import boto3


def getData():
    # Read the DAG creation parameters from a DynamoDB item.
    client = boto3.client('dynamodb')
    response = client.get_item(
        TableName="mwaa-dag-creation",
        Key={'key': {'S': 'mwaa'}}
    )
    return response["Item"]

Stagger DAG schedules

Running all DAGs simultaneously or within a short interval in your environment can result in a higher number of worker nodes required to process the tasks, thereby increasing compute costs. For business scenarios where the workload is not time-sensitive, consider spreading the schedule of DAG runs in a way that maximizes the utilization of available worker resources.
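
One simple way to stagger schedules is to spread the start times of a group of DAGs evenly across a time window. The sketch below generates daily cron expressions under that assumption; the function name, the DAG IDs, and the two-hour window are hypothetical, and the resulting strings would still need to be set as each DAG's schedule.

```python
from typing import Dict, List


def staggered_crons(dag_ids: List[str], start_hour: int,
                    window_minutes: int) -> Dict[str, str]:
    """Assign each DAG a daily cron schedule, spaced evenly across the window."""
    schedules = {}
    for i, dag_id in enumerate(dag_ids):
        # Minutes after the window start at which this DAG should begin.
        offset = i * window_minutes // len(dag_ids)
        hour = (start_hour + offset // 60) % 24
        minute = offset % 60
        schedules[dag_id] = f"{minute} {hour} * * *"
    return schedules


# Hypothetical DAG IDs, spread over two hours starting at 02:00.
print(staggered_crons(["ingest", "transform", "report", "export"],
                      start_hour=2, window_minutes=120))
```

Here the four DAGs start 30 minutes apart instead of all at 02:00, so the peak number of concurrent tasks, and therefore the number of workers needed at once, is lower.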

DAG folder parsing

Simpler DAGs are usually only in a single Python file; more complex DAGs might be spread across multiple files and have dependencies that should be shipped with them. You can either do this all inside of the DAG_FOLDER, with a standard filesystem layout, or you can package the DAG and all of its Python files up as a single .zip file. Airflow will look into all the directories and files in the DAG_FOLDER. Use an .airflowignore file to specify which directories or files Airflow should intentionally ignore. This will increase the efficiency of finding a DAG within a directory, improving parsing times.

Deferrable operators

You can run deferrable operators on Amazon MWAA. Deferrable operators have the ability to suspend themselves and free up the worker slot. Fewer occupied worker slots mean fewer required worker resources, which can lower the worker cost.

For example, let’s assume you’re using a large number of sensors that wait for something to occur and occupy worker node slots. By making the sensors deferrable and using worker auto scaling improvements to aggressively downscale workers, you will immediately see an impact where fewer worker nodes are needed, saving on worker node costs.

Dynamic Task Mapping

Dynamic Task Mapping provides a way for a workflow to create a number of tasks at runtime based on current data, rather than the DAG author having to know in advance how many tasks would be needed. This is similar to defining your tasks in a for loop, but instead of having the DAG file fetch the data and do that itself, the scheduler can do this based on the output of a previous task. Right before a mapped task is run, the scheduler will create N copies of the task, one for each input.

Stop and start the environment

You can stop and start your Amazon MWAA environment based on your workload requirements, which will result in cost savings. You can perform the action manually or automate stopping and starting Amazon MWAA environments. Refer to Automating stopping and starting Amazon MWAA environments to reduce cost to learn how to automate the stop and start of your Amazon MWAA environment while retaining metadata.

Conclusion

In conclusion, implementing performance optimization best practices for Amazon MWAA can significantly reduce overall costs while maintaining optimal performance and reliability. Key strategies include right-sizing environment classes based on CloudWatch metrics, managing logging and monitoring costs, using lookup patterns with Secrets Manager, optimizing DAG code, and selectively stopping and starting environments based on workload demands. Continuously monitoring and adjusting these settings as workloads evolve can maximize your cost-efficiency.


About the Authors

Sriharsh Adari is a Senior Solutions Architect at AWS, where he helps customers work backward from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core area of expertise includes technology strategy, data analytics, and data science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing Tabla.

Retina Satish is a Solutions Architect at AWS, bringing her expertise in data analytics and generative AI. She collaborates with customers to understand business challenges and architect innovative, data-driven solutions using cutting-edge technologies. She is dedicated to delivering secure, scalable, and cost-effective solutions that drive digital transformation.

Jeetendra Vaidya is a Senior Solutions Architect at AWS, bringing his expertise to the realms of AI/ML, serverless, and data analytics domains. He is passionate about assisting customers in architecting secure, scalable, reliable, and cost-effective solutions.