Security updates for Thursday

Post Syndicated from jzb original https://lwn.net/Articles/1077536/

Security updates have been issued by AlmaLinux (.NET 10.0, .NET 8.0, .NET 9.0, podman, poppler, and postgresql-jdbc), Debian (chromium, jackson-core, libdbi-perl, and libinput), Fedora (httpd, rust, and xmlstarlet), Mageia (openssh, postfix, and roundcubemail), Oracle (frr, kernel, libyang, n, postgresql-jdbc, and unbound), Red Hat (.NET 10.0, .NET 8.0, .NET 9.0, redis, and redis:7), SUSE (agama-web-ui, cockpit, cosign, glibc, google-cloud-sap-agent, google-osconfig-agent, kanidm, kernel, kubernetes, kubernetes1.23, kubernetes1.24, kubernetes1.25, kubernetes1.27, kubernetes1.28, libpodofo-devel, libyang, NetworkManager-libreswan, openCryptoki, python311-pypdf, rclone, steampipe, wicked, and xen), and Ubuntu (exim4, libcrypt-saltedhash-perl, libhttp-daemon-perl, samba, and uriparser).

Criminal AI-as-a-Service in 2026: How the Underground Market Is Operationalizing Cybercrime

Post Syndicated from Jeremy Makowski original https://www.rapid7.com/blog/post/tr-criminal-ai-underground-market-operationalizing-cybercrime-2026

Introduction

The underground market for criminally oriented generative AI has moved beyond the early hype surrounding ‘malicious chatbots.’ The gradual integration of AI as a productivity layer within cybercrime operations has become the dominant story: while fully autonomous AI hacking systems are possible, attackers are not embracing them as expected. Instead, threat actors are increasingly using AI to accelerate routine, but operationally significant, tasks to scale their operations. Drafting phishing lures, profiling targets, debugging code, generating forged documents, modifying malware, translating victim communications, and processing stolen data at scale were once time-consuming activities that AI has made significantly easier. AI does not replace cybercriminals; it lowers friction, increases speed, and expands the range of actors able to perform tasks that previously required more time, skill, or external support.

AI is being absorbed into criminal tradecraft, embedding itself in social engineering, fraud enablement, impersonation, identity abuse, and post-breach data exploitation. The market supporting this demand is not a single coherent product category, but a broader ecosystem of jailbreak wrappers, Telegram-based bots, prompt packs, open-weight model deployments, stolen AI accounts, and hijacked API keys. Their importance lies less in technical elegance than in usability. They provide criminals with accessible, repeatable, and commercially packaged ways to apply AI to operational problems.

This ecosystem should not be mistaken for a stable or fully mature criminal market. Compared with more established sectors, criminal AI remains volatile, uneven, and heavily exposed to hype. Some services offer genuine operational utility while others are little more than repackaged public models marketed at inflated prices. Many are short-lived, deceptive, or opportunistic rebrands. 

Even so, the demand is real. The core shift is not the arrival of a single dominant criminal model, but the commercialization of access to AI-enabled criminal capability. The strategic significance of criminal AI lies in compressing time, lowering skill barriers, improving communication quality, and scaling existing criminal workflows.

Criminal AI-as-a-Service

The defining features of this market have less to do with technical novelty than with the packaging and monetization of access. By early 2026, many underground services were marketed through familiar commercial mechanisms like subscriptions, private support channels, Telegram-based delivery, gated communities, and promises of uncensored output, privacy, or reduced logging. These are clear signs of SaaS-style commercialization, albeit far less mature or stable than in the legitimate software market.

The market is best understood as “Criminal AI-as-a-Service.” Most offerings do not appear to rely on original foundational models built by threat actors. Instead, they typically depend on jailbreaks, wrappers around commercial services, fine-tuned open-weight models, repackaged interfaces, or modular combinations of existing capabilities.

Pricing patterns suggest growing commercialization, but not a stable market structure. Entry-level access may be inexpensive, while premium services can be marketed at significantly higher rates with promises of priority support or additional functionality. These prices should be treated as indicative, not definitive (Figures 1 and 2). They are highly volatile and shaped by takedowns, fraud, rebranding, and shifting demand. 

At the lower end, free tools and stolen access to legitimate AI services often remain the default. In the middle of the market, recurring subscriptions are increasingly common. At the upper end, some services claim to use more modular or self-hosted architectures to reduce dependence on mainstream platforms. Together, these patterns point to a market that is becoming more operationalized, even if it remains unstable and hype-driven.

xanthorox-pricing.png
Figure 1: Xanthorox’s pricing

wormGPT-pricing.png
Figure 2: WormGPT’s pricing

Main criminal AI tool families

The criminal AI ecosystem is defined by several distinct tool families that reflect how threat actors adopt, package, and market generative AI for illicit use. Some platforms function as fraud-enabling assistants, others as uncensored Telegram-native chatbots, modular offensive frameworks, or low-barrier tools aimed at novice users. Examining these categories is more useful than focusing solely on individual brand names, as it reveals the market’s underlying operational logic. That logic is based on how these tools are distributed, which users they target, and which stages of the criminal workflow they are designed to support. 

Overall, the market is increasingly splitting into two complementary directions. At one end are low-cost, mass-market tools that help less experienced actors produce phishing content, scam scripts, malware prompts, forged material, and social engineering narratives at scale. At the other end are more specialized platforms that integrate AI into execution workflows, supporting targeting, automation, and operational optimization for fewer but more precise attacks. This volume-versus-precision dynamic shows that criminal AI is no longer only about accelerating malicious content generation; it is also becoming a way to make illicit operations more scalable, quieter, and strategically targeted.

FraudGPT 

This tool family represents the fraud-shop distribution model for criminal AI. Emerging in mid-2023 for a few hundred dollars per month, its longevity on the black market stems from its positioning as an “all-in-one” operational assistant rather than a simple programming tool. Most buyers are not using it to engineer highly complex malware; instead, they treat it as a productivity engine to orchestrate the entire fraud chain.

Threat actors use it to systematically design lookalike phishing pages, scrape target data, draft convincing spear-phishing lures, and generate scam scripts. Even as the underlying architecture has evolved away from standalone models and toward basic wrappers around legitimate, jailbroken corporate APIs, FraudGPT remains a staple of the underground economy because it effectively democratizes advanced social engineering, allowing entry-level scammers to execute highly localized, grammatically flawless, and high-volume fraud operations (Figure 3).

FraudGPT-website.png
Figure 3: FraudGPT’s website

GhostGPT 

This tool family reflects the Telegram-native distribution model. Its reported selling points — uncensored output, ease of access, and reduced operational friction — illustrate the convenience and perceived safety many criminal buyers claim to value most. However, like many tools in this category, independent verification of its capabilities is limited, and its significance lies more in what it signals about buyer preferences than in any confirmed technical differentiation.

WormGPT

This tool family serves as the ultimate case study in the power and persistence of criminal branding. While the original, headline-grabbing tool was officially shut down by its creator in August 2023 following intense law enforcement and media exposure, the name has essentially become a generic dark-web trademark for unrestricted AI. The market is saturated with opportunistic copycats, such as “WormGPT v4” and various Telegram bots trading on the name. 

Threat intelligence analysis of these modern variants reveals that they share zero code with the original system; instead, they are highly volatile marketing shells, often basic API wrappers around commercial models like Grok or Mixtral that use specialized system prompts to bypass safety guardrails. WormGPT’s relevance in 2026 lies not in its technical uniqueness but in its sociological impact. It is an entry-level gateway tool used by script kiddies and sophisticated actors alike to quickly generate functional exploit scripts, craft persuasive business email compromise (BEC) lures, and scale offensive workflows (Figure 4).

WormGPT_s-website.png
Figure 4: WormGPT’s website

KawaiiGPT 

This is a freely accessible or low-cost criminally oriented AI chatbot/tool marketed in underground spaces to generate or support illicit content and cybercrime-related tasks. Its use highlights the problem of low-barrier access in the criminal LLM market. Its relevance does not lie in any demonstrated advanced capability and there is little evidence that it provides meaningful technical sophistication beyond basic generative AI functions. Rather, KawaiiGPT is important as an example of how free or near-free tools can normalize AI-assisted offending among less experienced users. Its significance is therefore sociological rather than technical as it lowers the threshold for participation, makes AI-assisted offending appear accessible and low-risk, and introduces novice actors to workflows such as phishing text generation, fraud scripting, impersonation, and other forms of low-level cybercrime support.

BruteForceAI 

This tool family represents a meaningfully different category from the chatbot-style tools that dominate criminal AI branding. BruteForceAI prioritizes precision over content generation. It integrates large language models for intelligent form analysis and sophisticated multi-threaded attack execution. This distinction matters. The broader trend it reflects is one of attackers making fewer, better-targeted attempts rather than relying on brute volume. AI here is not a content tool. It is an execution layer, and the shift from noisy credential stuffing to quiet, optimized targeting is strategically more significant than any individual tool name (Figure 5).

BruteforceAI-program.png
Figure 5: BruteForceAI program

Xanthorox 

This tool family represents the modular criminal AI platform model. Its significance lies in how it is marketed. Public reporting describes it as more than another “evil chatbot,” with claims around coding support, multiple model components, and broader operational utility. Still, Xanthorox should be framed cautiously. It is better treated as an emerging or ambitiously marketed platform than as a universally verified flagship of the underground market (Figure 6).

Xanthorox-website.png
Figure 6: Xanthorox’s website

The wide variety of smaller adversarial AI tools in 2026, including names like DarkGPT, EscapeGPT, WolfGPT, Evil-GPT, XXXGPT, and BadGPT, should be viewed with caution. These brands do not constitute a coherent or reliable category; instead, they often function as short-lived rebrandings or simple interfaces built on public or open-source models. In many cases, these are “scam-of-the-month” services hosted on Telegram, designed to capitalize on hype, with entry-level memberships starting at a few dozen dollars. However, they should not be dismissed outright, as some do offer genuinely uncensored output or serve as testing grounds for malicious exploits. The bottom line in 2026 is that the brand name matters less than the underlying architecture. Most “GPT” labels are disposable marketing shells used to evade takedown measures or rebuild credibility after a service failure.

What truly defines the threat is the infrastructure supporting them. While entry-level tiers cost very little, professional-grade systems can cost thousands of dollars. At this level, the value isn’t in the name, but in the technical setup: the specific model used, how the service is delivered, the reliability of the operator, and how well it connects with other criminal tools like phishing kits, stealers, and ransomware support. Ultimately, the market has shifted toward operationalizing AI, focusing on tools that can automate and maximize the efficiency of entire illicit workflows.

Stolen AI accounts as an overlooked criminal market

One of the most important and still underappreciated developments in this landscape is the resale and abuse of legitimate AI access. This pattern is not new. Every widely adopted and commercially valuable technology eventually generates a secondary criminal market around stolen credentials, compromised accounts, and unauthorized access. AI is now following the same trajectory. Threat actors do not rely only on underground “dark AI” tools. They also misuse mainstream AI platforms directly.

However, the abuse of stolen AI accounts and hijacked API keys may be more consequential than many earlier credential markets. Access to legitimate AI services can provide threat actors with scalable cognitive and operational capabilities, not just access to a single platform or dataset. A compromised AI account may enable faster reconnaissance, multilingual targeting, automated content production, code generation, malware troubleshooting, and the refinement of phishing or fraud workflows. Hijacked API keys may also allow actors to consume compute resources at the victim’s expense, bypass usage restrictions tied to their own identities, and access more capable models or enterprise-grade infrastructure. In this sense, stolen AI access is not merely another credential commodity. It can function as an operational force multiplier across multiple stages of the attack lifecycle, making its abuse both expected and potentially more impactful than many traditional forms of account compromise (Figures 7 and 8).

Stolen-AI-accounts-for-sale-cybercrime-forum.png
Figure 7: Stolen AI accounts for sale on a cybercrime forum

More-stolen-AI-accounts-for-sale-cybercrime-forum.png
Figure 8: More stolen AI accounts for sale on a cybercrime forum

The impact on organizations can be serious as AI accounts may contain proprietary information such as prompts, uploaded files, source code, legal drafts, customer data, internal summaries, product plans, meeting notes, investigative material, or strategic analysis. If compromised, the exposure extends beyond the credential itself. Enterprise AI accounts and AI-related access tokens should therefore be treated like cloud credentials, developer secrets, email accounts, or administrative SaaS access.

Deepfake services: From impersonation to KYC bypass

Deepfake services have become one of the criminal AI market’s most important adjacent segments, particularly in fraud, synthetic identity creation, onboarding abuse, and KYC bypass. These services are marketed not as experimental technologies, but as practical fraud enablers. Common offerings include face swaps, voice cloning, fake selfie generation, synthetic profiles, document manipulation, virtual camera injection, video-call impersonation, and full onboarding bypass packages (Figure 9). Their significance stems from the fact that many digital platforms continue to rely heavily on remote identity verification and visual trust cues.

The purpose of bypassing KYC controls is to create, validate, or access accounts that should not exist or should not be available to the offender. Once established, such accounts can support money laundering, mule activity, romance scams, investment fraud, payment abuse, sanctions evasion, account resale, and marketplace manipulation. The threat is no longer limited to static fake images. Attackers can combine face swaps, synthetic video, animated media, and virtual camera injection to impersonate real individuals during onboarding or verification.

Deepfake services also strengthen broader fraud operations. Romance scams, fake recruitment schemes, executive impersonation, vendor fraud, and investment scams all become more persuasive when synthetic voice or video is added to the deception chain. These services should therefore be understood as part of the same criminal AI capability stack. LLMs generate scripts, refine pretexts, localize language, and support interaction at scale. Stolen data enhances personalization. Deepfake tools add the visual and audio layer that increases trust and makes deception harder to detect. Together, these capabilities form a more complete deception architecture.

Deepfake-KYC-bypass-service-advertisement.png
Figure 9: Cybercrime forum advertisement for a deepfake KYC bypass service

Organizational impact and defensive priorities

For organizations, the impact of AI-enabled cybercrime is both economic and operational. The main concern is not the sudden arrival of fully autonomous AI hacking, but the steady increase in attacker productivity, deception quality, operational flexibility, and post-compromise efficiency.

This last concern is important to note. Once attackers obtain data, AI can help them review it more quickly and more systematically. Models can summarize large document sets, identify sensitive or monetizable material, extract victim-specific details, and support tailored extortion or fraud. This does not require a purpose-built criminal model. It requires access to a capable model, relevant data, and a clear criminal objective.

At the same time, enterprise AI environments are becoming part of the attack surface. AI accounts, API keys, prompts, uploaded files, connectors, retrieval systems, internal knowledge bases, and agentic workflows can all expose sensitive business information if they are compromised, misused, or poorly governed. These assets should therefore be managed with the same seriousness as other critical systems, including clear ownership, least-privilege access, logging, monitoring, retention rules, and periodic access reviews.

Organizations should respond by treating criminal AI as a challenge of trust, identity, workflow security, and data governance, rather than only as a malware issue. High-risk business processes should be reinforced with stronger approval controls, transaction verification, segregation of duties, and out-of-band confirmation, especially for financial transfers, access changes, sensitive data requests, and executive communications.
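The approval controls described above can be illustrated with a minimal Python sketch. It assumes a hypothetical `PaymentRequest` record and an illustrative `HIGH_VALUE_THRESHOLD`; neither comes from the original report, and a real workflow would live in the organization's own finance or ticketing system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PaymentRequest:
    """Hypothetical record for one outgoing payment request."""
    requester: str
    approver: Optional[str]        # None until someone approves
    amount: float
    changes_bank_details: bool     # request alters the payee's bank details
    confirmed_out_of_band: bool    # e.g. call-back to a number already on file


# Illustrative threshold only; each organization would set its own.
HIGH_VALUE_THRESHOLD = 10_000.0


def release_blockers(req: PaymentRequest) -> List[str]:
    """Return the reasons a request must not be released yet."""
    blockers = []
    # Segregation of duties: a second, different person must approve.
    if req.approver is None:
        blockers.append("no approver")
    elif req.approver == req.requester:
        blockers.append("approver is the requester")
    # Out-of-band confirmation for high-risk requests.
    high_risk = req.changes_bank_details or req.amount >= HIGH_VALUE_THRESHOLD
    if high_risk and not req.confirmed_out_of_band:
        blockers.append("missing out-of-band confirmation")
    return blockers
```

The point of the sketch is that the check never inspects how convincing the request looks. A polished, AI-written message is still blocked if the process conditions are not met.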

Phishing and fraud defenses must also adapt. Poor grammar and obvious language errors are no longer reliable indicators of malicious activity. Organizations should assume that many adversaries can now generate polished, localized, and credible communications at scale. Detection should therefore rely more heavily on behavioral indicators, sender validation, process anomalies, identity verification, and transaction integrity than on superficial language cues.
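As a rough illustration of scoring on behavior and process rather than language quality, the Python sketch below combines a few signals into a risk score. The signal names, weights, and threshold are all assumptions made for illustration; they are not taken from the report or from any specific product.

```python
from dataclasses import dataclass


@dataclass
class MessageSignals:
    """Hypothetical, pre-extracted signals about one inbound message."""
    sender_auth_passed: bool       # outcome of the mail gateway's sender validation
    reply_to_matches_sender: bool  # Reply-To domain equals From domain
    first_time_sender: bool        # no prior correspondence with this address
    requests_payment_change: bool  # asks to change bank details or move funds
    bypasses_normal_process: bool  # asks to skip approval or use a new channel


# Illustrative weights only; a real deployment would tune these on its own data.
WEIGHTS = {
    "sender_auth_failed": 3,
    "reply_to_mismatch": 2,
    "first_time_sender": 1,
    "payment_change": 3,
    "process_bypass": 3,
}


def risk_score(s: MessageSignals) -> int:
    """Sum the weights of every risky signal that is present."""
    score = 0
    if not s.sender_auth_passed:
        score += WEIGHTS["sender_auth_failed"]
    if not s.reply_to_matches_sender:
        score += WEIGHTS["reply_to_mismatch"]
    if s.first_time_sender:
        score += WEIGHTS["first_time_sender"]
    if s.requests_payment_change:
        score += WEIGHTS["payment_change"]
    if s.bypasses_normal_process:
        score += WEIGHTS["process_bypass"]
    return score


def needs_verification(s: MessageSignals, threshold: int = 5) -> bool:
    """Flag the message for identity or transaction verification."""
    return risk_score(s) >= threshold
```

Note that grammar and tone are deliberately absent from the inputs: a flawless message from a validated sender is still flagged if it asks for a payment change outside the normal process.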

At the same time, organizations should prepare for AI-assisted post-breach exploitation by improving data minimization, segmentation, access controls, monitoring, logging, and incident response planning. They should also monitor the broader underground capability stack, including jailbreak services, stolen AI accounts, and synthetic media tooling, because these increasingly shape attacker tradecraft in practice.

The market will likely see more bundling of text generation, translation, impersonation, data analysis, and synthetic media into a single criminal offering. It will also likely see continued abuse of legitimate AI platforms alongside wrapper-based underground services. The ecosystem will likely remain uneven, opportunistic, and hype-heavy, while becoming strategically important because it makes cybercrime easier to execute and scale, and harder to detect. For organizations, the main risk is not only higher financial loss, but also the growing operational strain created by AI-assisted attacks that are faster, more scalable, and harder to triage.

Enterprise AI accounts, API keys, prompts, uploaded files, connectors, retrieval systems, internal knowledge bases, and agentic workflows should be managed as critical assets, with clear ownership, least-privilege access, logging, monitoring, retention rules, and periodic access reviews. Sensitive data should be exposed to AI systems only when there is a clear business need, especially when AI tools connect to email, cloud storage, code repositories, customer databases, financial systems, or external services. High-risk AI connectors and workflows should be inventoried, risk-ranked, and monitored for abnormal access, bulk data movement, privilege escalation, or unauthorized agent actions.
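One way to picture the inventory and risk-ranking step is the small Python sketch below. The `Connector` fields, the 1 to 3 sensitivity scale, and the scoring formula are assumptions chosen for illustration, not a method described in the report.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Connector:
    """Hypothetical inventory entry for one AI connector or workflow."""
    name: str
    data_sensitivity: int  # assumed scale: 1 (public) to 3 (regulated or secret)
    can_write: bool        # connector can modify data or take actions
    external: bool         # connector reaches services outside the organization


def risk_rank(c: Connector) -> int:
    """Illustrative score: sensitivity dominates, write and external access add risk."""
    score = c.data_sensitivity * 2
    if c.can_write:
        score += 2
    if c.external:
        score += 1
    return score


def prioritize(connectors: List[Connector]) -> List[Connector]:
    """Order the inventory so the riskiest connectors are reviewed first."""
    return sorted(connectors, key=risk_rank, reverse=True)
```

For example, a connector with write access to email would rank above a read-only link to a public wiki, which suggests where monitoring for bulk data movement or unauthorized agent actions should start.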

As phishing tactics improve, core controls should include MFA, phishing-resistant authentication, conditional access, DLP, EDR/XDR, API security monitoring, secrets scanning, prompt and output filtering, and model-access controls. Incident response plans should also cover stolen AI accounts, exposed prompts, compromised API keys, leaked embeddings, abused connectors, and sensitive data retained in AI workspaces.
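Of the controls listed, secrets scanning is the easiest to sketch in a few lines. The Python example below uses a generic heuristic, long key-like tokens with high Shannon entropy, rather than any vendor's real key format. The token pattern and the `min_entropy` value are assumptions; production scanners typically add provider-specific rules and allowlists.

```python
import math
import re
from collections import Counter
from typing import List

# Assumption: candidate secrets are runs of 24 or more key-like characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9_\-]{24,}")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_candidate_secrets(text: str, min_entropy: float = 4.0) -> List[str]:
    """Return long, random-looking tokens that may be leaked keys."""
    return [
        match.group(0)
        for match in TOKEN_RE.finditer(text)
        if shannon_entropy(match.group(0)) >= min_entropy
    ]
```

A scan like this could run over source repositories, shared prompts, or exported AI workspace content. Anything it flags is only a candidate and still needs human or provider-side confirmation before a key is rotated.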

The organizations best positioned for the next phase will be those that integrate AI risk into existing security governance rather than treating it as a separate technical issue. As criminal use of AI becomes part of everyday attacker tradecraft, resilience will depend on the ability to verify identity, control access, protect data flows, monitor AI-enabled workflows, and maintain human oversight over high-impact decisions. The future defensive priority is therefore not to predict every AI-enabled attack, but to build security architectures that remain reliable when attackers become faster, more persuasive, and more efficient.

Enhanced License Plate Tracking

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2026/06/enhanced-license-plate-tracking.html

The surveillance company Leonardo wants more data:

A surveillance company plans to add sensors to automatic license plate readers (ALPRs) that would mean the devices, as well as capture the license plate of passing vehicles, would also sweep up unique identifiers of mobile phones, wearables, and other Bluetooth-enabled devices in those cars, potentially letting law enforcement identify specific drivers or passengers.

The technology, called SignalTrace, would turn ALPR cameras from devices focused on tracking cars to ones that can more readily track the location of particular people. ALPR cameras have become a commonly deployed technology all across the U.S.; SignalTrace would make some of those cameras capable of collecting much more data.

Yes, it’s bad that more companies are collecting this level of surveillance data. But all of this pales in comparison to the type and quantity of data our smartphones already collect about us.

Alternate link.

[$] LWN.net Weekly Edition for June 11, 2026

Post Syndicated from jzb original https://lwn.net/Articles/1076254/

Inside this week’s LWN.net Weekly Edition:

  • Front: Suspicious AI activity in Fedora; fork() + exec(); splice() + vmsplice(); BPF loop verification; fanotify; trusted publishing.
  • Briefs: CA age bill; Bundler cooldowns; insecure code completion; Asahi and macOS 27 beta; Buildroot 2026.05; Ubuntu MATE; rsync 3.4.4; Quotes; …
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Dell Pro Max 16 Plus Review A More Mobile NVIDIA RTX Pro 5000 Blackwell System

Post Syndicated from Ryan Smith original https://www.servethehome.com/dell-pro-max-16-plus-review-intel-nvidia-rtx-pro-5000-blackwell-system/

Sparing no expense, Dell’s flagship workstation laptop, the Pro Max 16 Plus, aims to deliver as much performance as is possible in a 16-inch laptop while still being modestly portable.

The post Dell Pro Max 16 Plus Review A More Mobile NVIDIA RTX Pro 5000 Blackwell System appeared first on ServeTheHome.

Nevzorov for Possible Dual Use

Post Syndicated from Светла Енчева original https://www.toest.bg/nevzorov-za-vuzmozhna-dvoyna-upotreba/


Have you ever thought about how both the state and society tend not to notice certain things that are so visible they could poke our eyes out? At some point it is as if someone presses our button and we all notice at once. Then the mass wondering begins: where have the institutions been until now, where has society been, where have the media been, where have we ourselves been looking?

Near Varna, an illegal settlement of more than 100 buildings has been under construction for several years.

This has been going on since at least 2023, under two mayors (Ivan Portnih and Blagomir Kotsev) and several governments. The settlement in the Baba Alino protected area was even advertised three years ago. But the scale of the lawlessness became a leading topic only at the end of May 2026.

Meanwhile, all sorts of institutions at various levels are involved in the case, and their representatives claim they did nothing wrong. More precisely, that is how the representatives of the institutions that bother to say anything on the matter justify themselves. Those do not include DANS (the State Agency for National Security) or Ukraine’s ambassador, both of whom also have a connection to the case. But let us take things in order.

Scattered signs

In fact, it is not entirely accurate to claim that no one has publicly raised the alarm about the illegal construction in Baba Alino. That is why it is very interesting to trace how selectively such warnings get heard.

As early as October 2023, the Regional Forestry Directorate and the State Forestry Enterprise in Varna had a report of illegal logging, in connection with which they carried out an inspection and notified the prosecutor’s office. More reports followed, and pre-trial proceedings were opened.

In March 2025, Varna’s then deputy mayor Iliya Koev (dismissed by Blagomir Kotsev on June 9, 2026) spoke on BNT about the illegal logging in the area and promised that the Municipality would take measures. The position of the company was also sought, and it promised that everything would be legal. However, neither Koev nor the BNT reporter mentioned the name of the company, KUB, or that of the investor, Oleg Nevzorov. Nor was there any mention that buildings had already been built in the area.

The abbreviation KUB was brought into the public sphere by Vazrazhdane

in September 2025, first at a session of the parliamentary Committee for Oversight of the Security Services, and then for months at sittings of the National Assembly. Vazrazhdane, incidentally, attached the label “criminal Ukrainian group” to Nevzorov’s company.

In October 2025, BIRD and the Dnevnik journalist Spas Spasov mentioned Nevzorov in connection with the arrest of Blagomir Kotsev, suggesting that he was the secret witness against the Varna mayor. This is the context in which Nevzorov was mentioned during this period. He himself denies being the secret witness in question. But a few days before Kotsev’s arrest, DANS issued an order to expel Nevzorov from the country. Days later the order was withdrawn, behavior highly atypical for this institution, whose decisions are not subject to any oversight.

The timing

In short, what is happening in Baba Alino and who is doing it was publicly known as early as 2025. However, it was a topic mostly for the pro-Russian party Vazrazhdane, which saw a convenient occasion to incriminate the “bad” Ukrainians, while Nevzorov was mentioned mainly in the context of Blagomir Kotsev’s arrest. The exception is Spas Spasov, who drew attention to the illegal construction as early as October 2025.

Ironically, the illegal settlement near Varna became widely discussed only when Mayor Blagomir Kotsev finally decided to make the case public. This was immediately used against him by the government of “Progressive Bulgaria.”

It is unlikely that Rumen Radev is learning only now about the illegal activity of Nevzorov’s company, all the more so because, while he held the presidency, he had access to the mysteriously revoked DANS report. Had he used the case in the election campaign, it would have been a blow against his main political competitors, GERB and PP–DB, which did not do what was needed to stop the lawlessness. And against DPS, which turns out to be connected to practically every major illegality.

All in all, there seem to be no innocents in this story. Everyone along the chain is responsible, whether by issuing documents with false content, by inaction, or by insufficiently decisive action.

Before the elections, however, Radev positioned himself across the widest possible perimeter in order to attract more voters. And besides being usable against his political competitors,

the Baba Alino case can also be used against Ukraine.

This is because the illegal construction is being carried out by a company of a Ukrainian businessman who, on top of everything, is also active in an organization supporting Ukrainian refugees. He has taken part in public events that were also attended by Ukraine’s ambassador, Olesya Ilashchuk.

In addition, Interior Minister Ivan Dermendzhiev claims that Ilashchuk intervened in connection with the DANS order for Nevzorov’s extradition. How exactly she intervened is not known, but Dermendzhiev’s words leave the impression that she interceded on Nevzorov’s behalf. How an ambassador of a foreign state could influence a decision of the Bulgarian intelligence services is also unclear.

The illegal settlement in Baba Alino is becoming a leading topic precisely now, when Radev’s government is setting a course toward changing Bulgaria’s geopolitical orientation.

For now, this course is mainly for domestic consumption and aims at a gradual change in society’s attitudes. Among the ways of achieving this change are the designation of enemies and the justification of anti-democratic regimes.

An anti-democratic checklist
The beginning of the end of democracy will not be marked by arriving tanks. In other words, if you are waiting for “the tanks to come,” it will not happen. There will be “fewer obstacles,” “more efficiency,” and many silenced topics. On the red lights that blink before democracy goes out, by Svetla Encheva.

During the period in which public discourse in Bulgaria has been occupied mainly with Baba Alino, and many media outlets have used the cliché “Ukrainian group” introduced by Vazrazhdane, several seemingly unrelated events took place which, taken together, paint an overall picture of the kind of Bulgaria the new government is striving for.

On June 1, the “Banner of Peace” Assembly was restored. It is a children’s festival created on the idea of Lyudmila Zhivkova, the daughter of the socialist dictator Todor Zhivkov. Today the foundation that organizes the Assembly is headed by Lyudmila Zhivkova’s daughter, Evgenia. Now as then, the festival is not least a political event demonstrating a particular geopolitical orientation. And the “peace” is the “Russian” one. Whose else would it be, to put it in Radev’s style.

A week later the state devoted itself to its good relations with China. President Iliana Yotova, Prime Minister Rumen Radev, and Deputy Prime Minister Galab Donev separately held meetings with Shen Yiqin, State Councilor of the People’s Republic of China, and promised to deepen Bulgaria’s relations with the Asian socialist country. Besides producing a large share of the things that are sold and trying to expand its influence around the world, China is a state that systematically violates human rights, imposes censorship, and tracks its citizens’ every step.

And then, on June 9, Defense Minister Dimitar Stoyanov announced that Bulgaria would no longer send weapons to Ukraine. Whether it will actually stop, or will send them quietly, as in 2022, when the former BSP chair Kornelia Ninova, a minister at the time, supposedly was not supplying any, is another question. What matters is the public message, at a moment when “Ukrainian groups” are being talked about everywhere.

KUB with a false bottom

One of the main features of discrimination is the attribution of collective guilt. Although most perpetrators of crimes are men, nobody speaks of “male crime,” but when a Roma person breaks the law, it becomes “Roma crime.” And public anger is often directed at all Roma.

In the same way, the illegal settlement in Baba Alino becomes an occasion to assign guilt to all Ukrainians, helped along by the persistent clichés “Ukrainian gang” and “criminal Ukrainian gang.” And the fact that Oleg Nevzorov has assisted Ukrainian refugees is used to turn public opinion against them as well.

In this situation, what goes unsaid is also very important.

When an equals sign is placed between Nevzorov, his country of origin, and his fellow citizens, what usually drops out of sight is that the Ukrainian state is investigating him and his relatives for crimes: construction fraud in Odesa, unrepaid loans, document forgery, and acquiring weapons with a forged certificate.

A Kapital article by Spas Spasov also sheds light on Nevzorov’s political orientation. In 2020 he ran for mayor of the Tairove municipality in Odesa for the pro-Russian party Pobeda Palchevskogo (“Palchevsky’s Victory”), which he also financed. The party is named after its leader, Andrey Palchevsky. He in turn is linked to another pro-Russian party (Opposition Platform – For Life), whose leader Nataliya Korolevska is currently wanted by the Ukrainian authorities. The same Korolevska is linked to the association United Women, as we learn from another article by Spas Spasov, from July 2025. The association, of which Nevzorov is a sponsor, both helps Ukrainian refugee women and has management board members with pro-Russian views.

BIRD also mentions indications of links to the Russian services on the part of Nevzorov’s associate, the Georgian Joni Chitadze (deported from Bulgaria under the DANS order that had also included Nevzorov, before DANS revoked the measure against Nevzorov).

How can you be for Russia and at the same time help refugees from Ukraine? Well, evidently you can. As in Charlie Chaplin’s “The Kid”: the child breaks windows, and Chaplin’s character repairs them. By attacking their homeland, Russia drives out millions of Ukrainians, and then its people take care of those driven out. And keep an eye on them. A closed loop. If something goes wrong, it is chalked up to Ukraine (“criminal Ukrainian gang”), and Russia stays clean.

Remember the Bulgarians convicted in Great Britain for being Russian spies? The trial against them did not turn into an attack on Bulgaria, even though two members of the group (Katrin Ivanova and Biser Dzhambazov) had been helping their compatriots in the United Kingdom. Nor were the revelations used as a campaign against the BSP, although Dzhambazov had been a party member and the two had named their organization the Bulgarian Social Platform, or BSP.

This is how it is not the facts in themselves, but the use made of them, that sets the direction of public discourse.

What remains is the aftertaste of an active measure.

Besides being used in the context of Bulgaria’s geopolitical reorientation, the case of the illegal settlement in Baba Alino plays another role too. It successfully diverts public attention from the tragedy near Petrohan and Okolchitsa (also used for political ends), in which six people died, a child among them, and around which the questions still far outnumber the answers.

Incidentally, DANS is involved in both cases, but that is not the most important thing. More important is how public attention can be shaped and steered. How institutions, media, and society alike (with insignificant exceptions) are blind to something enormous for years on end. And then suddenly they see, but in one particular way and in one particular direction.

And so on until the next “discovery” of something that will be used for yet another settling of scores. And to divert attention from something else.

Cover image: A very real elephant in the room that nobody sees because everyone is busy drinking tea. Sydney, Australia, March 1939.

The punctuation of parentheticals, by the way, is anything but an aside

Post Syndicated from original https://www.toest.bg/punktuatsiiata-na-vmetnatite-chasti-mezhdu-drugoto-nikak-ne-e-mezhdu-drugoto/

Has it ever happened to you that you start writing a text mostly because a problem has been occupying you, you have some explanation but not in detail, and you want to finally get it clear for yourself? It happens to me almost every time I start an article for the “Portsia ezik” (“A Portion of Language”) column. It simply has to be interesting to me too to arrive at an answer I don’t know, to get to the philosophy behind things, or to systematize them in my mind. Of course, I hope it is interesting and useful for readers as well, since it has to do with language.

So, for some time I have been occupied by the problem of parenthetical expressions (вметнати части) in Bulgarian and linking words (connectors, linkers) in English. They overlap, but there are also quite a few differences. And why do we need to compare them, you will probably ask. The reason is purely practical: their punctuation follows different rules, and my observations show that

the Bulgarian equivalents of English linking words (for example освен това, “moreover”) are often wrongly set off with a comma from the other words in the sentence.

What do we insert and how do we link?

In Bulgarian grammar, parenthetical expressions are words and phrases that are not proper parts of the simple sentence (subject, predicate, object, etc.). They fall into two main groups according to meaning. And here I start to wonder whether to bother you with this at all, because it has no bearing, really no bearing whatsoever, on punctuation. All right, let’s do it for completeness, and for one more reason.

The first group includes parentheticals with which we express our personal attitude to what is said, for example: може би (maybe), вероятно (probably), за съжаление (unfortunately), очевидно (obviously), наистина (indeed), всъщност (actually), впрочем (by the way), естествено (naturally), разбира се (of course), според мен (in my opinion).

The second group contains the words and phrases with which we connect to what has already been said, summarize, enumerate facts, and contrast, for example: следователно (therefore); значи (so); общо взето (on the whole); в крайна сметка (in the end); например (for example); първо… второ… трето… (first… second… third…); от една страна… от друга страна (on the one hand… on the other hand); напротив (on the contrary); обаче (however).

It is logical to think that the parentheticals in the second group are like English linking words, but that is not quite so. Say, besides this and as a result are typical linking words, yet their Bulgarian equivalents освен това and в резултат (на това) are not treated as parentheticals. There are examples of the reverse too, and many of them: the English analogues of the first-group parentheticals are not linking words in the classical sense, although if you have studied the language and reached level B2, you will have met at least personally and in my opinion for expressing an opinion in the relevant lesson¹.

What does the punctuation of parentheticals depend on?

Our parentheticals again fall into two groups: some are set off with a comma (or with two commas if they are in the middle of the sentence), and the others are not. The criterion is whether they are always used as parentheticals, or can function both as parentheticals and as the familiar predicate, adverbial, object, or attribute. It sounds complicated and abstract, so I will give examples in the hope that it remains merely complicated.

Тази работа изисква внимание и не може да се върши между другото. (This job requires attention and cannot be done in passing.)
Тази работа изисква внимание, между другото, и не може да се върши през пръсти. (This job requires attention, by the way, and cannot be done carelessly.)

In the first sentence между другото is a proper part of the sentence, an adverbial (how is it to be done?), while in the second the phrase is used as a parenthetical (with it we signal that we are inserting, adding some information). That is why we set off между другото with commas here: because there are sentences in which it need not be a parenthetical.

Among the parentheticals written with a comma there are some verbs and expressions containing verbs. That is logical, because verbs can also be predicates in sentences. But when разбира се (of course), значи (so), изглежда (it seems), моля (please), така да се каже (so to speak), да речем (let’s say), да кажем (say) are parentheticals, we set them off with a comma. I will give examples with изглежда, because the commas are very often omitted:

Целият свят изглежда полудял. (predicate: The whole world looks mad.)
Целият свят, изглежда, е полудял. (parenthetical: The whole world, it seems, has gone mad.)

To the parentheticals listed above that are set off with a comma we should add the following more common ones: от една страна… от друга страна; първо… второ… трето; обратно (conversely); напротив; естествено; за съжаление; честно казано (frankly).

With the other group of parentheticals no comma is used. There is simply no need, because language, like a great director, has assigned only one role to these mediocre actors. Still, there are better-known ones among them too: всъщност, впрочем, може би, наистина, според мен, обаче, например, действително (really), вероятно, по всяка вероятност (in all probability), навярно (presumably), следователно, като че ли (as if), сякаш (as though), в крайна сметка. Here are two examples:

Вероятно някои вулкани не са угаснали, а само задрямали. (Probably some volcanoes are not extinct but only dozing.)
Постоянно говорим за изкуствения интелект, но какво знаем за него всъщност? (We constantly talk about artificial intelligence, but what do we actually know about it?)

Where are the pitfalls?

1. The punctuation of English linking words has nothing in common with the punctuation of Bulgarian parentheticals. English linking words usually come at the start of the sentence and are followed by a comma.

In the end, the stronger team won the match.
In my view, the government is not decisive enough.
Our company’s results are getting worse. Therefore, we need to make some changes.

If we translate the sentences into Bulgarian, all the commas drop out, because the equivalents of the linking words can only be parentheticals and nothing else:

В крайна сметка по-силният отбор спечели мача.
Според мен правителството не е достатъчно решително.
Резултатите на нашата компания се влошават. Следователно трябва да направим промени.

2. The Bulgarian equivalents of some English linking words are not treated as parentheticals and accordingly are not set off with a comma in Bulgarian sentences. These include, say: furthermore, moreover, besides (this), all rendered as освен това; nevertheless, rendered as въпреки това; and according to (source), rendered as според (source).

There was a massive traffic jam in the city this morning. Nevertheless, I managed to get to the meeting right on time.
Тази сутрин имаше голямо задръстване в града. Въпреки това успях да стигна навреме за срещата.

3. We noted above that parentheticals such as всъщност, впрочем, може би, вероятно, по всяка вероятност are not set off with commas. In English, however, actually, by the way, and in all probability are accompanied by commas at the beginning, in the middle, and at the end of the sentence².

Впрочем кога затваря фитнес залата?
By the way, when does the gym close?

4. Please pay special attention to the word обаче. Even if you are not that well versed in English punctuation, I assume you know that however is a linking word enclosed in commas. The Bulgarian обаче is a parenthetical that is not set off with commas. A complicating circumstance is that обаче can also be a conjunction, in which case a comma is written before it. More explanations and examples on the punctuation of the word can be found here.

Where are the weak spots in the rules?

Naturally, I will allow myself to comment only on the two main rules for our parentheticals. In my opinion it is highly unrealistic to ask ourselves every time whether they can also be proper parts of the sentence or not, and to write them with or without a comma accordingly. That was hardly the codifier’s intention either. Rather, the expectation is that, since the specific parentheticals have been divided in two and listed, we check which group each word or expression belongs to (and possibly memorize the punctuation of the frequently used ones).

That works, but Bulgarian has other parentheticals besides the examples accompanying the rules. Here are some words and expressions whose punctuation we have to rack our brains over if we want to use them: несъмнено, безсъмнено, безспорно (undoubtedly, indisputably), явно (evidently), моля (please), с една дума/с две думи (in a word / in a couple of words), в допълнение (in addition), за жалост (regrettably), за щастие (fortunately), за радост (happily), по дяволите (damn it), в общи линии (broadly), като цяло (as a whole), най-малкото (at the very least). (Having written по дяволите, I realized that swear words and curses, on the whole, ought to be parentheticals in the sentence, so you should think twice there too if you are inclined to express yourself obscenely in writing.)

Take as an example the sentence Несъмнено новото откритие ще намери приложение в практиката (“Undoubtedly the new discovery will find practical application”). Is there a sentence in which несъмнено is not a parenthetical? Yes, we can simply transform the previous one: Приложението на новото откритие в практиката е несъмнено (“The practical application of the new discovery is beyond doubt”; несъмнено is part of a compound nominal predicate). Therefore in the first sentence we must put a comma: Несъмнено, новото откритие ще намери приложение в практиката. Personally I would spare it, because it is out of place. If we move несъмнено to the middle, the commas will be even more out of place in my view, although this is certainly a parenthetical, showing our confidence in what we are saying: Новото откритие, несъмнено, ще намери приложение в практиката.

Why does this happen? I think the parentheticals that are set off with a comma (разбира се, напротив, за съжаление, etc.) in practice stand out from the other words in the sentence both in meaning and in intonation (they attract the logical stress and are pronounced with pauses), whereas those that do not require a comma (впрочем, може би, навярно, etc.) fit more smoothly into the flow of speech. These, rather than the syntactic criterion on which the two main rules rest, are the reasons we do or do not use commas with different parentheticals. Now perhaps it becomes clear why setting off несъмнено with commas looks out of place although it is theoretically correct: the word is not stressed when we say the sentence.

Nor will I keep quiet about something that has long been a thorn in my side: we are told that the parentheticals наистина and действително are not set off with a comma because they cannot be parts of the sentence. On the contrary, they can, and the academic explanatory dictionary records such uses, which are ignored for no clear reason. Here is an example:

Иване, наистина ли ще се жениш? (“Ivan, are you really getting married?” That is: do you actually intend to, or are these just rumors?)

So, when the word is used as a parenthetical, a comma should be placed. All the more so if it is accompanied by the conjunction и:

И наистина, вчера Иван се ожени. (“And indeed, Ivan got married yesterday.” That is: I confirm it is true that Ivan got married.)

I suppose the explanation for this “oversight” may be the following: it is very often hard to determine when words like наистина and действително are used as parentheticals and when as proper parts of the sentence.³

Finishing this article, I decided to count how many parentheticals I have used so far. It came to 18 (not counting the examples, of course; there, now it is 19 with of course). If I had not known the rules for their punctuation, I would have made quite a few mistakes. You too probably use a fair number of parenthetical words and expressions when you write, and you may even be guided by English punctuation if it is your working language. Whether we like it or not, its patterns influence us, and that influence will not weaken. It remains to be seen how resilient Bulgarian language patterns, punctuation patterns included, will prove to be.

1 In this article we will not go into detail about English linking words, which are often taken to include discourse markers, or else the two terms are used interchangeably. Our aim is to stress the differences between punctuation patterns in Bulgarian and English, so we also give examples with expressions that are not linking words in the narrow sense of the term, as well as with adverbs corresponding to our parentheticals.

2 An exception holds for actually: in the middle of a sentence it is not enclosed in commas.

3 The example of напротив is analogous, but with the opposite sign. This word is listed by the codifier as a parenthetical that is set off with a comma, so there must be sentences in which it can be a proper part of the sentence. However hard I try, I cannot think of such a sentence in contemporary standard Bulgarian.


Language (език, which also means “tongue”) can be tasty off the plate too: that Bulgarian language we have spoken since childhood and to which we swear our love around May 24. Yet in essence it is a means of communication, and to serve us well it changes constantly. Let us look at it in its dynamics and try to understand what is happening and why, what the driving mechanisms are, and how they relate to social processes. And since the task is not easy, we will do it gradually, in portions.

Larson: Are insecure code completions a vulnerability?

Post Syndicated from jzb original https://lwn.net/Articles/1077413/

Seth Larson, the Python Software Foundation’s security
developer-in-residence, has written about the difficulty in
classifying insecure code completion in the PyCharm IDE using its
Full Line code completion plugin. Larson discovered that the plugin,
which uses a local “deep learning module” to offer code completions,
suggests code that would lead to severe vulnerabilities. He was unsure
whether it warranted a CVE or not, however:

I reported this behavior to JetBrains for “Full Line Code Completion” v253.29346.142
and clearly their support staff weren’t certain whether this defect
was a security vulnerability or not either. When I asked to
publish a blog post about this behavior after they confirmed
this report wasn’t a “direct security vulnerability” (which
I agree with) but then was asked not to publicize my report and referred to
PyCharm’s Coordinated Disclosure Policy
so… which is it? Security vulnerability or not?

I ended up waiting the 90 days anyway and I didn’t hear back with
any substantive update from the development team. I double-checked
again today using “Full Line Code Completion” v261.24374.152 and the
behavior is identical, suggesting the same insecure code for both
contexts.

This isn’t meant to be a specific dig at PyCharm or JetBrains, I
have no-doubt that examples like this exist in every code generation
model available.

Automated Threat Hunting: Turning Threat Intelligence into Executable Hunt Plans

Post Syndicated from Blake McDermott original https://www.rapid7.com/blog/post/ai-automated-threat-hunting-turns-threat-intelligence-into-executable-hunt-plans

Blake McDermott is Senior Threat Hunter at Rapid7.

Every week, threat hunt teams are faced with a steady flow of blogs, advisories, and DFIR reports containing valuable intelligence about adversary behaviors, tactics, techniques, and procedures. The challenge is turning that intelligence into repeatable, behavior-based hunting logic quickly enough to be useful. Indicators of compromise still have value, but they age quickly. Behavioral detections give defenders a better way to look for how attackers operate, rather than relying only on what they leave behind.

To help solve this, Rapid7’s Internal Security team built an automated threat hunting pipeline that transforms threat intelligence reporting into structured, executable hunt plans. The pipeline uses large language models to extract adversary behaviors, map them to MITRE ATT&CK techniques, generate detection queries across multiple tools, and support analyst-ready briefings in minutes rather than days.

Why manual threat hunting does not scale

A single threat intelligence report can describe dozens of adversary behaviors across multiple ATT&CK techniques. Translating that report into useful hunt logic often requires an analyst to read the full source, identify relevant behaviors, map them to ATT&CK, write queries for each security tool, validate syntax, execute searches, and triage the results.

For a report covering 40 to 50 techniques, that process can consume much of a working week. When multiple high-quality reports land at once, manual hunting quickly becomes unsustainable. The goal of this project was to reduce the mechanical work involved in building hunt plans, while keeping analysts in control of validation, interpretation, and decision-making.

How the automated threat hunting pipeline works

The pipeline runs in four stages, each designed to be inspectable, repeatable, and easy for analysts to refine over time.

Stage 1: Threat intelligence ingestion

The pipeline accepts a threat intelligence blog or report via URL or pasted text. It extracts the core article body, removes navigation and boilerplate content, and validates the material to ensure there is enough substance for analysis. This creates a clean input for the model and reduces the risk of irrelevant page content influencing the output.
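As a rough illustration of the cleaning and validation step, here is a minimal sketch. It is not Rapid7’s implementation: the boilerplate patterns, the `MIN_WORDS` threshold, and both function names are assumptions made for this example.

```python
import re

MIN_WORDS = 150  # assumed threshold; tune for your own sources

# Hypothetical patterns for lines that are usually page chrome, not article body.
BOILERPLATE = re.compile(
    r"^(share|subscribe|sign in|log in|cookie|related posts?|back to top)\b",
    re.IGNORECASE,
)


def clean_article(raw_text: str) -> str:
    """Drop blank lines and lines that look like navigation or boilerplate."""
    kept = []
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped or BOILERPLATE.match(stripped):
            continue
        kept.append(stripped)
    return "\n".join(kept)


def has_enough_substance(text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True when the cleaned text is long enough to be worth analysing."""
    return len(text.split()) >= min_words
```

A real ingestion stage would likely need HTML-aware extraction rather than line patterns; the point here is only that cleaning and a substance check happen before anything is sent to the model.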

Stage 2: ATT&CK technique extraction

The cleaned content is then sent to a large language model with a structured prompt that instructs it to act as a MITRE ATT&CK analyst. The model identifies adversary techniques referenced in the report and returns each one with its technique ID, technique name, tactic category, and a short summary of how the threat actor used it.

The prompt is tuned to focus on offensive behaviors and adversary tradecraft. Defensive recommendations, control guidance, and mitigation strategies are excluded from this specific workflow so the output reflects what the attacker did, rather than what defenders should implement in response. That focus helps preserve the hunting value of the source material while leaving room for separate workflows that generate defensive recommendations or control improvements.

For example, when applied to a Rapid7 threat research report on BPFdoor activity in telecom networks, the pipeline identified 16 techniques across seven ATT&CK tactics, including Initial Access, Persistence, Defense Evasion, Credential Access, Collection, Command and Control, and Execution. That structured extraction became the foundation for a hunt plan with detection coverage across InsightIDR, Velociraptor, and Sigma, giving analysts a faster path from source intelligence to behavior-based hunting logic.
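The post does not show the model’s output format, so the following is only a sketch of how a structured reply might be parsed. It assumes the model returns a JSON array with the hypothetical keys `technique_id`, `name`, `tactic`, and `summary`, matching the four fields described above.

```python
import json
import re
from dataclasses import dataclass

# Shape check only (e.g. T1059 or T1059.004); it does not confirm that the
# ID actually exists in the ATT&CK knowledge base.
TECHNIQUE_ID = re.compile(r"^T\d{4}(\.\d{3})?$")


@dataclass(frozen=True)
class Technique:
    technique_id: str
    name: str
    tactic: str
    summary: str


def parse_techniques(model_output: str) -> list[Technique]:
    """Parse the model's JSON reply, keeping only entries with a well-formed ID."""
    techniques = []
    for item in json.loads(model_output):
        technique_id = str(item.get("technique_id", "")).strip()
        if not TECHNIQUE_ID.match(technique_id):
            continue  # skip malformed IDs rather than pass them downstream
        techniques.append(
            Technique(
                technique_id=technique_id,
                name=item.get("name", ""),
                tactic=item.get("tactic", ""),
                summary=item.get("summary", ""),
            )
        )
    return techniques
```

Rejecting malformed entries at this point keeps later stages from generating queries for technique IDs that cannot be valid.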

Stage 3: Detection query generation

For each identified technique, the pipeline generates detection content across several tools and formats. This includes LEQL queries for InsightIDR, targeting activity such as process execution, authentication events, network connections, and file modifications. It also includes Velociraptor VQL queries and artifact recommendations for live host interrogation, Sigma rules that can be shared across teams or converted into other SIEM formats, and YARA rules where relevant.

Every generated query is reviewed by an analyst before use. LLMs can accelerate drafting and reduce repetitive work, but analyst validation remains essential for accuracy, syntax, and operational fit.

Stage 4: Hunt plan assembly

The pipeline assembles a structured markdown hunt plan organized by ATT&CK tactic. Each report includes an executive summary, an IOC sweep section when indicators are present, and a behavioral hunting section containing generated queries in fenced code blocks with clear explanations of what each query is designed to detect. This gives analysts a consistent output they can inspect, edit, execute, and reuse.
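A minimal sketch of the assembly step might look like the following. The entry keys and heading layout are assumptions for illustration, and the IOC sweep section is left out for brevity.

```python
from collections import defaultdict

FENCE = "`" * 3  # built at runtime so this example stays a single code block


def build_hunt_plan(title, summary, entries):
    """Render a markdown hunt plan grouped by ATT&CK tactic.

    `entries` is a list of dicts with the hypothetical keys
    tactic, technique_id, name, explanation, and query.
    """
    by_tactic = defaultdict(list)
    for entry in entries:
        by_tactic[entry["tactic"]].append(entry)

    lines = [f"# Hunt plan: {title}", "", "## Executive summary", "", summary, ""]
    for tactic in sorted(by_tactic):
        lines += [f"## {tactic}", ""]
        for entry in by_tactic[tactic]:
            lines += [
                f"### {entry['technique_id']} {entry['name']}",
                "",
                entry["explanation"],  # what the query is designed to detect
                "",
                FENCE,
                entry["query"],
                FENCE,
                "",
            ]
    return "\n".join(lines)
```

Tactics are sorted alphabetically here only to make the output deterministic; ordering them by kill-chain position would be an equally reasonable choice.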

Building a reusable detection query library

A key design decision was the introduction of a persistent query cache. Each technique’s generated queries are saved as standalone markdown files, creating a growing library of reusable detection content.

This cache reduces cost and execution time because techniques seen in previous reports can be loaded from the library rather than regenerated. It also creates a practical feedback loop: analysts can correct, tune, and improve cached queries over time, and those improvements persist across future hunt plans.
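The caching behavior described above can be sketched as a file-per-technique lookup. The directory layout, the file naming, and the `generate` callback are assumptions; the post only says that queries are saved as standalone markdown files.

```python
from pathlib import Path
from typing import Callable


def cached_queries(cache_dir: Path, technique_id: str,
                   generate: Callable[[str], str]) -> str:
    """Return cached markdown for a technique, generating it only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{technique_id}.md"
    if path.exists():
        # Cache hit: any analyst edits made to this file are returned as-is.
        return path.read_text(encoding="utf-8")
    content = generate(technique_id)  # the expensive step, e.g. an LLM call
    path.write_text(content, encoding="utf-8")
    return content
```

Because a hit returns the file contents untouched, an analyst who tunes `T1059.004.md` by hand sees that correction reused in every later hunt plan that references the technique.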

By tracking which reports and campaigns reference each technique, the team can build an organic view of recurring adversary behavior and identify which techniques appear across multiple actors or campaigns. Over time, this helps narrow the focus to behaviors most relevant to the environment, providing useful context.

Executing hunts and analyzing results

Once a hunt plan has been reviewed and validated, a separate process executes approved queries against InsightIDR. Results are then parsed and summarized into a briefing that highlights which queries returned results, why those results may matter, which findings may require immediate investigation, and how the activity relates to the threat actor’s known tradecraft.
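As a simplified illustration of the first part of that briefing (which queries returned results), here is a sketch that assumes results have already been parsed into a mapping from query label to returned rows. That input shape is hypothetical and is not the InsightIDR response format.

```python
def summarize_results(results):
    """Split executed queries into those with hits and those without.

    `results` maps a query label to the list of rows it returned
    (a hypothetical shape; a real API response would need parsing first).
    """
    with_hits = {label: len(rows) for label, rows in results.items() if rows}
    no_hits = sorted(label for label, rows in results.items() if not rows)
    # Largest result sets first; this is a display order, not a risk ranking.
    ranked = sorted(with_hits.items(), key=lambda item: item[1], reverse=True)
    return {"with_hits": ranked, "no_hits": no_hits}
```

Deciding why a result matters, or whether it needs immediate investigation, is the part the post assigns to the model-generated briefing and the analyst, and is deliberately not attempted here.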

Analysts can then ask follow-up questions conversationally, such as which findings should be prioritized, which hosts or users require deeper review, or how results should be interpreted based on risk.

Velociraptor queries are still executed manually because of the level of access involved. Given the potential impact of live host interrogation, the team made the deliberate decision to keep that execution under direct analyst control.

Practical use cases for automated threat hunting

The pipeline has already proven useful across several hunting scenarios. For advanced threat actor reporting, it can process DFIR reports and APT advisories to quickly determine whether known tradecraft appears in the environment. For insider threat hunting, it can be adapted to focus on data movement, anomalous access patterns, staging, and exfiltration behaviors. For security hardening, it can process reports about common persistence mechanisms and misconfigurations to validate whether the environment is exposed to known attack paths.

Across each use case, the value comes from shortening the path between intelligence and action.

Automating the repetitive work, not the expertise

By automating the repetitive work of reading reports, mapping techniques, and drafting queries, analysts can spend more time interpreting results, understanding context, and making decisions. The pipeline turns a daily flood of threat intelligence into structured, queryable, and continuously improving detection content. What previously required hours or days of manual effort can now be completed in minutes, while the underlying library compounds in value with every report processed.

Choosing the right workflow orchestration service for your use case: Amazon MWAA and AWS Step Functions

Post Syndicated from Rajkumar Raghuwanshi original https://aws.amazon.com/blogs/big-data/choosing-the-right-workflow-orchestration-service-for-your-use-case-amazon-mwaa-and-aws-step-functions/

Whether you’re processing financial data, managing e-commerce orders, or training machine learning (ML) models, efficiently coordinating complex processes is essential. Amazon Web Services (AWS) offers two services for workflow orchestration: Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and AWS Step Functions.

This post explores how to select the right workflow orchestration service based on your specific use case requirements. We’ll examine key workflow characteristics, present real-world scenarios, and provide practical guidance to help you make an informed decision for your particular needs.

Understanding workflow orchestration requirements

Before exploring specific services, consider the key dimensions that influence workflow orchestration needs:

  • Data statefulness: Does your workflow process independent units of work (stateless) or create dependencies where each step modifies data from previous steps (stateful)?
  • Execution duration: Are your workflows short-lived (seconds to minutes) or long-running (hours to days)?
  • Scheduling requirements: Do you need built-in time-based execution or rely primarily on event triggers?
  • Recovery capabilities: How critical is the ability to restart from specific failure points rather than reprocessing entirely?
  • Integration complexity: What systems, services, and data sources need to be coordinated?
  • Security and access control: Do you need fine-grained permissions for different workflow components?

Let’s explore how these requirements map to real-world use cases and the appropriate orchestration solutions.

Use case: Enterprise data analytics pipeline

This scenario illustrates how Amazon MWAA handles complex, stateful data pipelines with built-in scheduling and granular recovery.

Business challenge

A global financial services company processes massive volumes of transaction data daily, requiring sophisticated data analytics capabilities. Their requirements include:

  • Processing 5-10 TB of financial transaction data daily
  • Running complex extract, transform, and load (ETL) jobs with multiple transformation stages
  • Generating regulatory reports for compliance use cases
  • Supporting both scheduled batch processing and event-driven workflows
  • Handling long-running jobs that can take up to 12 hours
  • Ensuring data consistency and integrity throughout the pipeline

Workflow characteristics

  • Data statefulness: Highly stateful workflows where each processing step modifies transaction data, creating dependencies throughout the pipeline
  • Execution duration: Long-running processes lasting 2-12 hours
  • Scheduling needs: Mixed time-based and event-driven patterns
  • Recovery requirements: Critical ability to resume from specific failure points
  • Integration complexity: Orchestrates multiple AWS services and external systems

Solution: Amazon Managed Workflows for Apache Airflow (Amazon MWAA)

For this enterprise data analytics scenario, Amazon MWAA provides capabilities that align well with these requirements:

Stateful workflow management

MWAA excels at managing complex, stateful data pipelines where data consistency is critical. When processing terabytes of financial data, MWAA’s ability to resume from the last successful checkpoint helps prevent costly reprocessing and maintain data integrity.

The following code example demonstrates how to structure a complex financial ETL pipeline in MWAA:

# Example: Complex ETL pipeline with proper dependency management (Airflow 2.x)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def placeholder(**context):
    """Stand-in callable; replace with the real logic for each step."""


# Creating the tasks inside the DAG context attaches them to this DAG
with DAG(
    'financial_etl_pipeline',
    schedule_interval='0 2 * * *',  # Daily at 2 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Define tasks (python_callable is a placeholder in every task below)
    extract_transactions = PythonOperator(task_id='extract_transactions', python_callable=placeholder)
    extract_market_data = PythonOperator(task_id='extract_market_data', python_callable=placeholder)
    transform_data = PythonOperator(task_id='transform_data', python_callable=placeholder)
    load_warehouse = PythonOperator(task_id='load_warehouse', python_callable=placeholder)
    generate_reports = PythonOperator(task_id='generate_reports', python_callable=placeholder)

    # Express complex dependencies clearly
    [extract_transactions, extract_market_data] >> transform_data >> [load_warehouse, generate_reports]

This Directed Acyclic Graph (DAG) shows how to define task dependencies for parallel data extraction followed by sequential transformation and loading operations. The >> operator clearly defines the workflow dependencies. Transformation only begins after both extraction tasks complete successfully.
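To make the resulting execution order concrete without needing an Airflow installation, the same dependencies can be walked with Python’s standard `graphlib` module. This is only an illustration of the ordering the graph implies, not how the Airflow scheduler is implemented; the `execution_waves` helper is a name invented for this sketch.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it waits for, mirroring the >> chain above.
dependencies = {
    "transform_data": {"extract_transactions", "extract_market_data"},
    "load_warehouse": {"transform_data"},
    "generate_reports": {"transform_data"},
}


def execution_waves(deps):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable display order
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(dependencies), start=1):
    print(number, wave)
```

The first wave contains both extraction tasks, the second contains only `transform_data`, and the third contains `load_warehouse` and `generate_reports`, which matches the description of parallel extraction followed by transformation.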

Built-in scheduling capabilities

MWAA includes native scheduling capabilities, making it straightforward to set up recurring workflows without additional services. The schedule_interval parameter in the DAG definition provides flexible scheduling options using cron syntax.

Granular recovery and resume control

During production incidents, operations teams can use the MWAA web interface to restart or bypass specific steps with a few clicks. This capability is important for stateful applications where restarting the entire workflow could compromise data consistency.

The MWAA web interface provides a visual representation of the workflow execution, allowing operators to:

  • Identify failed tasks
  • Examine task logs for troubleshooting
  • Clear the status of specific tasks
  • Restart execution from specific points

Figure 1: A Directed Acyclic Graph (DAG) in MWAA showing parallel execution of Amazon Redshift Data API tasks. If any task fails, you can re-run specific tasks rather than restarting from the beginning.
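When a task is re-run, the tasks that depend on it usually need to run again too. The sketch below computes that set for the ETL example from earlier in this section. It is a plain-Python illustration of the idea, with a hand-written `downstream` mapping and a hypothetical `tasks_to_rerun` helper, and does not call any Airflow API.

```python
from collections import deque

# Downstream edges for the ETL example: task -> tasks that depend on it.
downstream = {
    "extract_transactions": ["transform_data"],
    "extract_market_data": ["transform_data"],
    "transform_data": ["load_warehouse", "generate_reports"],
    "load_warehouse": [],
    "generate_reports": [],
}


def tasks_to_rerun(failed_task, edges):
    """Return the failed task plus everything downstream of it."""
    seen = {failed_task}
    queue = deque([failed_task])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

If `transform_data` fails, only it and the two tasks after it are selected; both extraction tasks keep their successful state, which is what avoids reprocessing the full pipeline.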

Comprehensive monitoring and operational control

MWAA’s metadata server maintains comprehensive execution logs, enabling organizations to build operational dashboards for:

  • Real-time workflow monitoring
  • Task completion rate tracking
  • Pipeline execution pattern analysis
  • Optimization opportunity identification

Implementation considerations

  • Infrastructure planning: MWAA requires some capacity planning, but you can set minimum and maximum worker counts so that automatic scaling handles variable workloads.
  • Security model: MWAA uses a shared execution role across DAGs, but you can implement additional security through resource-level policies and separate environments for different teams.
  • Cost predictability: The worker-hour pricing model provides predictable costs for long-running jobs, making budget planning more straightforward.

Use case: Real-time serverless application orchestration

This scenario shows how AWS Step Functions handles event-driven, serverless workflows that need to scale automatically with unpredictable traffic.

Business challenge

An e-commerce platform needs to orchestrate real-time order processing workflows that can handle thousands of concurrent orders during peak shopping periods. Their requirements include:

  • Processing customer orders in real time (targeting sub-second response times)
  • Coordinating payment validation, inventory checks, and fulfillment
  • Integrating with multiple AWS services (AWS Lambda, Amazon Simple Queue Service (Amazon SQS), Amazon Simple Notification Service (Amazon SNS), Amazon DynamoDB)
  • Handling traffic spikes during promotional events
  • Implementing approval workflows for high-value orders
  • Maintaining cost efficiency during variable load periods

Workflow characteristics

  • Data statefulness: Primarily stateless processing where each customer order represents an independent transaction
  • Execution duration: Rapid, real-time processing with sub-second to few-minute response times
  • Event-driven nature: Core architectural pattern where workflows are triggered by specific customer actions
  • Integration requirements: Extensive coordination with AWS serverless services
  • Scalability needs: Highly unpredictable traffic patterns requiring automatic scaling

Solution: AWS Step Functions

For this real-time e-commerce scenario, AWS Step Functions provides capabilities that align well with these requirements:

Serverless architecture and automatic scaling

Step Functions automatically scales to handle traffic spikes without infrastructure management. During peak shopping events like Black Friday, the service handles increased load without manual intervention.

Event-driven workflow execution

Step Functions is designed for order-triggered workflows that need immediate execution. The following JSON definition shows how to structure an e-commerce order processing workflow:

{
  "Comment": "E-commerce Order Processing Workflow",
  "StartAt": "ValidatePayment",
  "States": {
    "ValidatePayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:region:account:function:ValidatePayment",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Next": "CheckInventory"
    },
    "CheckInventory": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "CheckWarehouse1",
          "States": {
            "CheckWarehouse1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:region:account:function:CheckWarehouse",
              "End": true
            }
          }
        },
        {
          "StartAt": "CheckWarehouse2", 
          "States": {
            "CheckWarehouse2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:region:account:function:CheckWarehouse",
              "End": true
            }
          }
        }
      ],
      "Next": "ProcessOrder"
    },
    "ProcessOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:region:account:function:ProcessOrder",
      "End": true
    }
  }
}

This Step Functions definition demonstrates several key capabilities:

  • The ValidatePayment state includes built-in retry logic with exponential backoff
  • The CheckInventory state uses parallel execution to simultaneously check multiple warehouses
  • Each Lambda function is called via its Amazon Resource Name (ARN), providing direct integration with AWS services
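As a worked example of the retry settings in ValidatePayment, the following sketch computes the wait before each retry. It assumes the documented Step Functions behavior in which IntervalSeconds is the wait before the first retry and each later wait is multiplied by BackoffRate. The retry_delays helper is an illustrative name, and optional settings such as MaxDelaySeconds and jitter are ignored.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the wait in seconds before each retry attempt."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


# Values taken from the Retry block of the ValidatePayment state.
delays = retry_delays(interval_seconds=2, max_attempts=3, backoff_rate=2.0)
print(delays)       # [2.0, 4.0, 8.0]
print(sum(delays))  # 14.0
```

With these settings, a payment validation that keeps failing is retried after roughly 2, 4, and 8 seconds, adding about 14 seconds of waiting before the state fails.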

Figure 2: A complex workflow in AWS Step Functions, involving multiple stages of data processing. The parallel execution doesn’t allow resuming from a specific mid-execution step, but the branching structure provides automated error handling and recovery.

Native AWS service integration

Step Functions provides direct integration with Lambda functions, SQS queues, SNS topics, and DynamoDB, eliminating the need for custom connectors or additional infrastructure components.

Cost-effective pay-per-use model

The pay-per-execution pricing model aligns with variable order volumes, keeping costs minimal during slow periods while scaling automatically during busy times.

Human approval workflow support

Step Functions supports human approval steps, making it suitable for high-value order workflows that require manual review or approval processes.

Implementation considerations

  • Error handling: Built-in retry mechanisms and error handling patterns help provide reliable order processing with configurable retry policies.
  • Visual monitoring: The Step Functions console provides real-time visibility into order processing status, enabling quick identification of bottlenecks.
  • Security model: You can assign fine-grained AWS Identity and Access Management (IAM) roles per step, so that payment processing functions have different permissions than inventory management functions.

Choosing the right workflow orchestration service

When selecting between Amazon MWAA and AWS Step Functions, consider these workflow characteristics:

Consider Amazon MWAA when your use case involves:

  • Complex stateful data processing where workflows modify data state and require recovery mechanisms to maintain consistency
  • Long-running batch jobs executing for hours or days where computational investment is substantial
  • Built-in scheduling requirements where regular batch processing needs time-based orchestration
  • Granular recovery needs where resuming from specific failure points is business-critical
  • Complex task dependencies involving sophisticated relationships between workflow tasks
  • Existing Apache Airflow expertise where teams have substantial investment in Apache Airflow knowledge

Consider AWS Step Functions when your use case involves:

  • Event-driven serverless workflows triggered by external events requiring immediate response
  • Stateless processing where each workflow execution operates independently
  • Short to medium duration tasks completing within minutes to hours
  • Heavy AWS service integration involving extensive coordination with Lambda functions and other AWS services
  • Human approval workflows requiring manual intervention or decision-making
  • Variable load patterns with unpredictable traffic requiring automatic scaling

Decision framework

To help guide your decision process, consider the following questions:

Figure 3: Decision tree guiding through key considerations for choosing between Amazon MWAA and AWS Step Functions based on workflow characteristics.

Figure 4: Comprehensive comparison between Amazon MWAA and AWS Step Functions, highlighting decision factors for choosing the right workflow orchestration service.

Conclusion

Both Amazon Managed Workflows for Apache Airflow and AWS Step Functions are workflow orchestration services, each designed to address specific use case requirements. By understanding your workflow characteristics and aligning them with the strengths of each service, you can make an informed decision that supports your business needs.

For complex, stateful workflows with long execution times and sophisticated recovery requirements, Amazon MWAA provides robust capabilities. For event-driven, serverless workflows with tight AWS integration and variable load patterns, AWS Step Functions is a strong fit.

Remember that these services are not mutually exclusive. Many organizations use both to address different workflow orchestration needs across their application portfolio. By focusing on your specific use case requirements, you can select the right tool for each job and build resilient, efficient workflow orchestration solutions on AWS.

If you have questions or feedback about choosing between these services, leave a comment.


About the authors

Rajkumar Raghuwanshi

Rajkumar is a Delivery Consultant, within AWS Professional Services, specializing in helping customers design and optimize their data and analytics workloads on AWS. With expertise spanning database modernization, data migration, and analytics architecture, he builds scalable, cloud-native solutions that enable customers to unlock the full value of their data.

Shuvajit Ghosh

Shuvajit is a Delivery Consultant – Data & Analytics within AWS Professional Services, with over a decade of experience architecting enterprise-scale data warehouses, lakehouse platforms, and modern data ecosystems. He specializes in data lakehouse architectures, end-to-end ETL/ELT pipeline design, data lineage, and container-based solutions using services like Amazon Redshift, Amazon OpenSearch Service, AWS Glue, Lake Formation, Apache Iceberg, dbt, and Amazon MWAA.

Nishad Mankar

Nishad is a Delivery Consultant with AWS Professional Services, passionate about helping customers harness the power of data on the cloud. He brings deep expertise in analytics architecture, data platform modernization, and database migration, enabling organizations to build robust, scalable solutions on AWS. From architecting modern data pipelines to optimizing complex workloads, Nishad partners closely with customers to accelerate their cloud journey and deliver measurable business outcomes.

Real-time CDC from Aurora PostgreSQL to Amazon S3 Tables using Debezium and Firehose

Post Syndicated from Chintan Agrawal original https://aws.amazon.com/blogs/big-data/real-time-cdc-from-aurora-postgresql-to-amazon-s3-tables-using-debezium-and-firehose/

Enterprises running transactional workloads on Amazon Aurora PostgreSQL-Compatible Edition (Aurora PostgreSQL) need their operational data available for analytics. However, analytical queries and cross-database joins compete for resources on OLTP-optimized clusters. Batch exports introduce latency, and when data spans multiple Aurora clusters, there’s no straightforward way to join datasets or run cross-domain analytics. Real-time change data capture (CDC) addresses this by streaming row-level changes into a separate analytics layer. However, most CDC approaches write append-only records that require downstream consumers to reconstruct current state from the change log.

In this post, we show you how to build a CDC pipeline that delivers query-ready Iceberg tables directly. The pipeline captures inserts, updates, and deletes from Aurora PostgreSQL and applies them as row-level operations in Amazon S3 Tables, a capability of Amazon Simple Storage Service (Amazon S3). The destination tables always reflect the current state of the source database. You use Debezium on Amazon MSK Connect for change capture and Amazon Managed Streaming for Apache Kafka (Amazon MSK) for streaming. You also use AWS Lambda to transform CDC events and resolve operation semantics, and Amazon Data Firehose to deliver records into Iceberg tables. You deploy the infrastructure using the AWS Cloud Development Kit (AWS CDK).

Apache Iceberg supports row-level updates, deletes, ACID transactions, schema evolution, and time travel natively. S3 Tables handles Iceberg snapshot management and compaction automatically. With AWS Lake Formation for access control, multiple teams can query the tables through Amazon Athena, Amazon Redshift, or Amazon SageMaker Unified Studio.

Solution overview

The following diagram shows the architecture of the CDC pipeline.

Figure 1. CDC pipeline architecture from Aurora PostgreSQL to Amazon S3 Tables.

The pipeline uses six components:

  1. Aurora PostgreSQL to Debezium. Debezium runs on MSK Connect in your VPC and uses PostgreSQL’s native logical replication to stream row-level changes from the write-ahead log (WAL), with minimal impact on query performance.
  2. Debezium to Amazon MSK. The ByLogicalTableRouter SMT reroutes CDC events from multiple tables into a single topic (aurora.cdc.all-tables), retaining the source table name in each message.
  3. Amazon MSK to Firehose. Firehose connects to the MSK cluster using IAM access control over AWS PrivateLink and continuously polls the topic for new messages.
  4. Firehose to Lambda. For each batch, Firehose invokes the Lambda function to decode the Kafka message, flatten the Debezium envelope, and set otfMetadata routing with the destination table and operation type.
  5. Firehose to S3 Tables. Firehose reads the otfMetadata, routes each record to the correct Iceberg table, and performs the appropriate row-level operation using configured unique keys (for example, order_id for orders). S3 Tables handles compaction and snapshot management automatically.
  6. Query and access control. After data lands in S3 Tables, you can query the Iceberg tables with Amazon Athena, Amazon Redshift, or Amazon SageMaker Unified Studio, with AWS Lake Formation managing fine-grained access control.

Firehose supports one MSK topic per delivery stream. The single-topic routing pattern uses a Debezium SMT to consolidate multiple tables into one topic, and a Lambda function to route records to the correct destination. With this, you can serve multiple tables through one Firehose stream, reducing cost and operational complexity.

Debezium event transformation

Debezium produces CDC events in an envelope structure containing both the previous and current state of a row, along with metadata about the source database, table, and operation type. However, Firehose expects records in a flattened JSON format with routing metadata that indicates the target table and operation type.

The Lambda function bridges this gap by performing three operations on each record:

  1. Decode. When Firehose uses Amazon MSK as a source, it delivers the Kafka message value as a base64-encoded string in the kafkaRecordValue field. The function base64-decodes this field to obtain the raw Debezium JSON payload.
  2. Flatten and extract. The function pulls the row data from the Debezium envelope. For inserts and updates, it uses the after field (the row after the change). For deletes, it uses the before field, because the after field is null when a row is removed.
  3. Route. Sets the otfMetadata block with destinationTableName (extracted from the Debezium source.table field) and operation (mapped from Debezium’s single-character codes to Firehose’s operation types).

The following table shows how Debezium operation codes map to Firehose Iceberg operations:

| Debezium code | Meaning | Firehose operation |
| --- | --- | --- |
| c | Row created (insert) | insert |
| u | Row updated | update |
| d | Row deleted | delete |
| r | Snapshot read (initial load) | insert |

When Debezium starts with snapshot.mode=initial, it reads all existing rows and emits them as r (read) events. These represent rows that existed before CDC began, so they are mapped to insert to establish the baseline state in the destination tables.

For example, the function transforms this Debezium envelope:

{
  "op": "c",
  "before": null,
  "after": {"order_id": 1, "customer_id": 1, "total_amount": 299.99},
  "source": {"table": "orders", "db": "cdcdemo"}
}

Into a response record with routing metadata:

{
  "recordId": "<original-record-id>",
  "result": "Ok",
  "kafkaRecordValue": "<base64-encoded flattened row JSON>",
  "metadata": {
    "otfMetadata": {
      "destinationDatabaseName": "aurora_cdc",
      "destinationTableName": "orders",
      "operation": "insert"
    }
  }
}

The kafkaRecordValue contains the base64-encoded flattened row data (for example, {"order_id": 1, "customer_id": 1, "total_amount": 299.99}), and the otfMetadata block tells Firehose which table to write to and which operation to perform.

With this routing metadata, a single Firehose stream can write to multiple destination tables. For more information, see Route incoming records to different Iceberg tables.
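The decode, flatten, and route steps can be sketched in Python as follows. This is a minimal illustration, not the function shipped in the sample repository. It assumes the Debezium envelope arrives without a schema wrapper and that the destination database is aurora_cdc. The names OPERATION_MAP, DESTINATION_DATABASE, and transform_record are hypothetical, as is the choice to mark unrecognized events as ProcessingFailed.

```python
import base64
import json

# Debezium operation codes mapped to Firehose Iceberg operations.
OPERATION_MAP = {"c": "insert", "u": "update", "d": "delete", "r": "insert"}

# Assumption: the destination database matches the aurora_cdc namespace.
DESTINATION_DATABASE = "aurora_cdc"


def transform_record(record):
    """Decode one Firehose record, flatten the envelope, and add routing metadata."""
    envelope = json.loads(base64.b64decode(record["kafkaRecordValue"])) or {}
    op_code = envelope.get("op")
    operation = OPERATION_MAP.get(op_code)
    # Deletes carry the row in "before"; other operations carry it in "after".
    row = envelope.get("before") if op_code == "d" else envelope.get("after")

    if operation is None or row is None:
        # Assumption: report unrecognized events as failed rather than
        # silently dropping them, and return the original payload unchanged.
        return {
            "recordId": record["recordId"],
            "result": "ProcessingFailed",
            "kafkaRecordValue": record["kafkaRecordValue"],
        }

    flattened = json.dumps(row).encode("utf-8")
    return {
        "recordId": record["recordId"],
        "result": "Ok",
        "kafkaRecordValue": base64.b64encode(flattened).decode("utf-8"),
        "metadata": {
            "otfMetadata": {
                "destinationDatabaseName": DESTINATION_DATABASE,
                "destinationTableName": envelope["source"]["table"],
                "operation": operation,
            }
        },
    }


def lambda_handler(event, context):
    """Entry point: return one result per record in the incoming batch."""
    return {"records": [transform_record(record) for record in event["records"]]}
```

Keeping the per-record logic in its own function makes it easy to test the mapping locally with hand-built envelopes before deploying.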

Walkthrough

The following sections walk you through building the CDC pipeline end to end. Before you begin, complete the prerequisites.

Prerequisites

Before you begin, make sure you have the following:

Step 1: Enable CDC in Aurora PostgreSQL

PostgreSQL supports change data capture through its logical replication framework, which allows database changes to be streamed from the write-ahead log (WAL). Debezium uses this mechanism to continuously read row-level changes and publish them to Kafka topics.

To enable logical replication in Aurora PostgreSQL, configure a custom DB cluster parameter group:

  1. Create a custom parameter group and set the following parameter: rds.logical_replication = 1.
  2. Apply the parameter group to your Aurora cluster and reboot the cluster for the change to take effect.
  3. Connect to your Aurora PostgreSQL cluster and create the source tables:
CREATE TABLE public.orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER,
    order_date VARCHAR(50),
    total_amount DECIMAL(12,2),
    status VARCHAR(50),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE public.products (
    product_id SERIAL PRIMARY KEY,
    product_name VARCHAR(255),
    category VARCHAR(100),
    price DECIMAL(10,2),
    stock_quantity INTEGER,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
  4. Create a publication that defines which tables are included in the change stream. Debezium automatically creates the logical replication slot when the connector starts for the first time, so you don’t need to create one manually.
CREATE PUBLICATION dbz_publication FOR TABLE public.orders, public.products;
  5. Verify the publication was created:
SELECT * FROM pg_publication WHERE pubname = 'dbz_publication';

You should see one row returned, confirming the publication is active.

Important: When the Debezium connector starts (Step 6), it creates a replication slot named debezium_slot. This slot retains WAL segments until consumed. If the connector is stopped for an extended period, WAL segments can accumulate and increase storage usage on the Aurora cluster. Monitor the ReplicationSlotDiskUsage Amazon CloudWatch metric for your Aurora cluster.

Step 2: Build and register the Debezium plugin

MSK Connect runs connectors using custom plugins that you upload to Amazon S3. In this step, you download the Debezium PostgreSQL connector, package it as a ZIP file, upload it to S3, and register it with MSK Connect.

First, create an S3 bucket for the plugin, or use an existing metadata management bucket:

aws s3 mb s3://<your-plugin-bucket> --region <your-region>

Download and package the Debezium connector:

DEBEZIUM_VERSION=2.7.3.Final
curl -LO "https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/${DEBEZIUM_VERSION}/debezium-connector-postgres-${DEBEZIUM_VERSION}-plugin.tar.gz"
mkdir -p debezium-plugin
tar -xzf debezium-connector-postgres-${DEBEZIUM_VERSION}-plugin.tar.gz -C debezium-plugin/
cd debezium-plugin && zip -r ../debezium-postgres-connector.zip . && cd ..
aws s3 cp debezium-postgres-connector.zip s3://<your-plugin-bucket>/plugins/

Register the plugin with MSK Connect:

aws kafkaconnect create-custom-plugin \
    --custom-plugin-name debezium-postgres-connector \
    --content-type ZIP \
    --location "s3Location={bucketArn=arn:aws:s3:::<your-plugin-bucket>,fileKey=plugins/debezium-postgres-connector.zip}"

Create a worker configuration that tells MSK Connect to serialize Kafka messages as JSON without schemas:

aws kafkaconnect create-worker-configuration \
    --name debezium-worker-config \
    --properties-file-content "$(echo -n 'key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false' | base64)"

Note the customPluginArn and workerConfigurationArn from the output. You need these for the CDK configuration in the next step.
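The --properties-file-content argument is the base64 encoding of the properties text. If you prefer to produce that value outside the shell, the following sketch does the equivalent in Python. WORKER_PROPERTIES and encode_properties are illustrative names. Python's b64encode returns a single line with no wrapping.

```python
import base64

# The same four properties passed to create-worker-configuration above.
WORKER_PROPERTIES = "\n".join([
    "key.converter=org.apache.kafka.connect.json.JsonConverter",
    "value.converter=org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable=false",
    "value.converter.schemas.enable=false",
])


def encode_properties(properties_text):
    """Return the base64 form of a worker properties file as one ASCII line."""
    return base64.b64encode(properties_text.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    print(encode_properties(WORKER_PROPERTIES))
```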

Note: The custom plugin and worker configuration are created through the AWS CLI because the Debezium connector JARs must be downloaded from the Debezium project and packaged manually. The remaining infrastructure is deployed using the AWS CDK in the following steps.

Step 3: Configure the CDK project

Clone the sample repository and install dependencies:

git clone https://github.com/aws-samples/sample-aurora-cdc-s3tables.git
cd sample-aurora-cdc-s3tables/cdk
npm install

Open cdk/lib/v2/config.ts and update the configuration values to match your environment:

export const CONFIG = {
account: '<your-account-id>',
region: '<your-region>',
// VPC - must match your Aurora cluster's VPC
vpcId: '<your-vpc-id>',
subnetIds: ['<subnet-1>', '<subnet-2>'],
auroraSecurityGroupId: '<aurora-security-group-id>',
// Aurora connection details
auroraEndpoint: '<aurora-cluster-endpoint>',
auroraPort: '5432',
auroraDbName: '<database-name>',
auroraUser: '<db-user>',
auroraSecretArn: '<secrets-manager-arn>',
// Debezium - use the ARNs from Step 2
debeziumPluginArn: '<customPluginArn-from-step-2>',
debeziumWorkerConfigArn: '<workerConfigurationArn-from-step-2>',
debeziumPluginBucket: '<your-plugin-bucket-name>',
debeziumTopicPrefix: 'aurora.cdc',
debeziumTables: 'public.orders,public.products',
// S3 Tables - the table bucket name must be unique within your account and Region
s3TablesBucketName: '<your-table-bucket-name>',
s3TablesNamespace: 'aurora_cdc',
tables: ['orders', 'products'],
tableKeys: { orders: 'order_id', products: 'product_id' },
// Firehose - general purpose S3 bucket for failed record backup
firehoseBackupBucket: '<your-backup-bucket-name>',
};

Key configuration notes:

  • auroraSecurityGroupId. The security group attached to your Aurora cluster. The CDK creates an MSK security group with ingress rules allowing traffic from this security group, and a reverse rule allowing MSK Connect workers to reach Aurora on port 5432.
  • tableKeys. The primary key column for each table. Firehose uses these to match incoming records against existing rows for update and delete operations in the Iceberg tables.
  • s3TablesBucketName. The name for your S3 table bucket. Table bucket names must be unique for your account in the chosen Region.
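To illustrate why the unique key matters, the following sketch applies a sequence of change operations to an in-memory table keyed by order_id. It is a simplified model of keyed row-level operations, not how Firehose or Iceberg implements them. The apply_change helper and the sample rows are hypothetical.

```python
def apply_change(table, key_column, operation, row):
    """Apply one insert, update, or delete to a dict keyed by the unique key."""
    key = row[key_column]
    if operation == "delete":
        table.pop(key, None)
    elif operation in ("insert", "update"):
        # The incoming row replaces any existing row with the same key.
        table[key] = dict(row)
    else:
        raise ValueError(f"unsupported operation: {operation}")
    return table


orders = {}
changes = [
    ("insert", {"order_id": 1, "status": "processing"}),
    ("insert", {"order_id": 2, "status": "processing"}),
    ("update", {"order_id": 1, "status": "shipped"}),
    ("delete", {"order_id": 2, "status": "processing"}),
]
for operation, row in changes:
    apply_change(orders, "order_id", operation, row)

print(orders)  # {1: {'order_id': 1, 'status': 'shipped'}}
```

After the four changes, only the current state remains: one row for order 1 with its latest status. An append-only log would instead hold all four events.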

Step 4: Deploy the CDK stacks

Deploy all six stacks with a single command. The CDK resolves the dependency order automatically:

npx cdk --app "npx ts-node bin/app-v2.ts" deploy --all

When prompted, review the AWS Identity and Access Management (IAM) changes and confirm the deployment. The CDK deploys the following stacks:

| Stack | What it creates |
| --- | --- |
| CdcMskCluster | Amazon MSK cluster (2x kafka.m5.large brokers) with dual authentication (IAM for Firehose, unauthenticated for Debezium), custom configuration with auto.create.topics.enable=true, security groups with ingress rules for Aurora and MSK Connect workers |
| CdcMskConnectIam | MSK Connect service execution role with permissions for Kafka cluster operations, VPC networking, S3 plugin access, and AWS Secrets Manager; Amazon CloudWatch Logs group for connector logs |
| CdcS3Tables | S3 table bucket, aurora_cdc namespace, two Iceberg tables (orders, products) with column schemas |
| CdcLambdaTransform | Lambda function for CDC event transformation and multi-table routing |
| CdcFirehoseRole | Firehose IAM role with permissions for Amazon MSK, S3 Tables, AWS Glue Data Catalog, AWS Lake Formation, VPC networking, and Lambda invocation |
| CdcFirehose | Firehose delivery stream with MSK as source (private connectivity through AWS PrivateLink), Lambda processing, Apache Iceberg Tables as destination with two table configurations, and S3 backup bucket for failed records |

The MSK cluster takes approximately 25 minutes to create. The Debezium connector takes approximately 5 minutes after the cluster is ready. You can monitor the deployment progress in the AWS CloudFormation console.

After the deployment completes, you can verify the resources in the AWS console. The S3 table bucket shows the two Iceberg tables in the aurora_cdc namespace.

Figure 2. S3 table bucket showing the orders and products Iceberg tables in the aurora_cdc namespace.

The Firehose delivery stream shows the MSK source, Lambda transformation, and Apache Iceberg Tables destination.

Figure 3. Amazon Data Firehose delivery stream with MSK source, Lambda transformation, and Apache Iceberg Tables destination.

The MSK cluster uses dual authentication (IAM for Firehose, unauthenticated for Debezium through TLS_PLAINTEXT), multi-VPC private connectivity for Firehose PrivateLink access, and auto.create.topics.enable=true so Debezium can create topics on first connect. VPC connectivity and the cluster resource policy are configured as CLI steps in Step 5.

Step 5: Enable MSK VPC connectivity, grant Lake Formation permissions, and apply MSK cluster policy

After the CDK deployment completes, enable multi-VPC private connectivity with IAM on the MSK cluster. Firehose requires this to create an AWS PrivateLink endpoint to the MSK brokers. This setting can’t be configured during cluster creation and must be applied as an update, which triggers a rolling broker restart (approximately 20–30 minutes).

# Get the cluster ARN and current version from the CdcMskCluster stack outputs
MSK_ARN=<msk-cluster-arn>
CLUSTER_VERSION=$(aws kafka describe-cluster-v2 \
    --cluster-arn $MSK_ARN \
    --region <your-region> \
    --query 'ClusterInfo.CurrentVersion' --output text)
# Enable VPC connectivity with IAM
aws kafka update-connectivity \
    --cluster-arn $MSK_ARN \
    --current-version $CLUSTER_VERSION \
    --connectivity-info '{"VpcConnectivity":{"ClientAuthentication":{"Sasl":{"Iam":{"Enabled":true}}}}}' \
    --region <your-region>

Wait for the cluster state to return to ACTIVE before proceeding:

aws kafka describe-cluster-v2 \
    --cluster-arn $MSK_ARN \
    --region <your-region> \
    --query 'ClusterInfo.State'
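Because the update can take 20–30 minutes, you might prefer to poll rather than rerun the command by hand. The following is a minimal sketch of a polling helper. The wait_for_state function and its parameters are hypothetical, and get_state stands in for any callable you supply that returns the current cluster state, for example by wrapping the describe-cluster-v2 command above.

```python
import time


def wait_for_state(get_state, target="ACTIVE", failed="FAILED",
                   interval_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_state() until it returns target; return the seconds waited."""
    waited = 0
    while True:
        state = get_state()
        if state == target:
            return waited
        if state == failed:
            raise RuntimeError(f"cluster entered {failed} state")
        if waited >= timeout_seconds:
            raise TimeoutError(f"cluster still {state} after {waited} seconds")
        sleep(interval_seconds)
        waited += interval_seconds


if __name__ == "__main__":
    # Stand-in for real API calls: two UPDATING responses, then ACTIVE.
    states = iter(["UPDATING", "UPDATING", "ACTIVE"])
    print(wait_for_state(lambda: next(states), sleep=lambda seconds: None))
```

The sleep parameter is injectable so that the loop can be exercised locally without waiting.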

Next, grant the Firehose IAM role permissions through AWS Lake Formation. S3 Tables uses a sub-catalog format for the CatalogId parameter, which differs from the standard AWS Glue Data Catalog. These permissions require a data lake administrator identity.

Grant database-level and table-level permissions to the Firehose role:

# Grant database-level permissions
aws lakeformation grant-permissions \
    --region <your-region> \
    --principal '{"DataLakePrincipalIdentifier": "<firehose-role-arn>"}' \
    --resource '{"Database": {"CatalogId": "<account-id>:s3tablescatalog/<table-bucket-name>", "Name": "aurora_cdc"}}' \
    --permissions '["ALL"]'
# Grant table-level permissions (wildcard for the tables in the namespace)
aws lakeformation grant-permissions \
    --region <your-region> \
    --principal '{"DataLakePrincipalIdentifier": "<firehose-role-arn>"}' \
    --resource '{"Table": {"CatalogId": "<account-id>:s3tablescatalog/<table-bucket-name>", "DatabaseName": "aurora_cdc", "TableWildcard": {}}}' \
    --permissions '["ALL"]'

Note the CatalogId format: <account-id>:s3tablescatalog/<table-bucket-name>. This is specific to S3 Tables and tells Lake Formation to look up permissions in the S3 Tables catalog rather than the default Glue Data Catalog. For more information, see Integrating Amazon S3 Tables with AWS analytics services.
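If you script these grants, building the CatalogId in one place avoids typos in the sub-catalog format. The following sketch assembles the two --resource payloads used above. The helper names and the sample account ID and bucket name are placeholders, not values from the sample repository.

```python
import json


def s3_tables_catalog_id(account_id, table_bucket_name):
    """Build the S3 Tables sub-catalog identifier used as CatalogId."""
    return f"{account_id}:s3tablescatalog/{table_bucket_name}"


def database_resource(account_id, table_bucket_name, namespace):
    """Resource payload for the database-level grant."""
    catalog_id = s3_tables_catalog_id(account_id, table_bucket_name)
    return {"Database": {"CatalogId": catalog_id, "Name": namespace}}


def table_wildcard_resource(account_id, table_bucket_name, namespace):
    """Resource payload for the grant on every table in the namespace."""
    catalog_id = s3_tables_catalog_id(account_id, table_bucket_name)
    return {"Table": {"CatalogId": catalog_id, "DatabaseName": namespace, "TableWildcard": {}}}


if __name__ == "__main__":
    # Placeholder account ID and bucket name for illustration only.
    print(json.dumps(database_resource("111122223333", "example-table-bucket", "aurora_cdc")))
    print(json.dumps(table_wildcard_resource("111122223333", "example-table-bucket", "aurora_cdc")))
```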

Next, attach a resource-based policy to the MSK cluster that grants the Firehose service principal permission to create VPC connections:

aws kafka put-cluster-policy \
    --cluster-arn <msk-cluster-arn> \
    --region <your-region> \
    --policy '{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "firehose.amazonaws.com"},
        "Action": ["kafka:CreateVpcConnection", "kafka:GetBootstrapBrokers", "kafka:DescribeClusterV2"],
        "Resource": "<msk-cluster-arn>"
      }]
    }'

You can find the <msk-cluster-arn> in the CdcMskCluster stack outputs from Step 4, and the <firehose-role-arn> in the CdcFirehoseRole stack outputs.

Step 6: Create the Debezium connector

With the MSK cluster running and Lake Formation permissions in place, create the Debezium connector using the MSK Connect API. The connector reads changes from Aurora PostgreSQL and publishes them to the MSK topic.

Firehose supports only one MSK topic per delivery stream, so each source table would otherwise need its own Firehose stream and VPC connection. To avoid this, the connector uses the Debezium ByLogicalTableRouter Single Message Transform (SMT) to route changes from multiple tables into a single topic (aurora.cdc.all-tables). The Lambda function then uses the source table name in each message to direct records to the correct Iceberg table. This single-topic pattern uses one Firehose stream for multiple tables, reducing cost and operational complexity.

First, retrieve the MSK bootstrap servers from the cluster:

aws kafka get-bootstrap-brokers \
    --cluster-arn <msk-cluster-arn> \
    --region <your-region>

Note the BootstrapBrokerString value (the PLAINTEXT brokers). Then create the connector:

aws kafkaconnect create-connector --cli-input-json '{
  "connectorName": "aurora-postgres-debezium-connector",
  "kafkaCluster": {
    "apacheKafkaCluster": {
      "bootstrapServers": "<bootstrap-servers>",
      "vpc": {
        "subnets": ["<subnet-1>", "<subnet-2>"],
        "securityGroups": ["<msk-security-group-id>"]
      }
    }
  },
  "kafkaClusterClientAuthentication": {"authenticationType": "NONE"},
  "kafkaClusterEncryptionInTransit": {"encryptionType": "PLAINTEXT"},
  "kafkaConnectVersion": "2.7.1",
  "plugins": [{"customPlugin": {"customPluginArn": "<custom-plugin-arn>", "revision": 1}}],
  "serviceExecutionRoleArn": "<msk-connect-service-role-arn>",
  "capacity": {"provisionedCapacity": {"mcuCount": 2, "workerCount": 2}},
  "workerConfiguration": {"workerConfigurationArn": "<worker-config-arn>", "revision": 1},
  "connectorConfiguration": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "<aurora-cluster-endpoint>",
    "database.port": "5432",
    "database.user": "<db-user>",
    "database.password": "<db-password>",
    "database.dbname": "<database-name>",
    "database.server.name": "aurora_cdc",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "dbz_publication",
    "table.include.list": "public.orders,public.products",
    "topic.prefix": "aurora.cdc",
    "schema.history.internal.kafka.topic": "schema-changes.aurora",
    "schema.history.internal.kafka.bootstrap.servers": "<bootstrap-servers>",
    "decimal.handling.mode": "string",
    "time.precision.mode": "adaptive_time_microseconds",
    "tombstones.on.delete": "false",
    "snapshot.mode": "initial",
    "publication.autocreate.mode": "filtered",
    "transforms": "Reroute",
    "transforms.Reroute.type": "io.debezium.transforms.ByLogicalTableRouter",
    "transforms.Reroute.topic.regex": "aurora\\.cdc\\.public\\.(.*)",
    "transforms.Reroute.topic.replacement": "aurora.cdc.all-tables"
  },
  "logDelivery": {
    "workerLogDelivery": {
      "cloudWatchLogs": {
        "enabled": true,
        "logGroup": "/aws/msk-connect/aurora-cdc-debezium"
      }
    }
  }
}'

The <msk-security-group-id> and <msk-connect-service-role-arn> can be found in the CdcMskCluster and CdcMskConnectIam stack outputs respectively. The ByLogicalTableRouter Single Message Transform routes CDC events from the monitored tables into a single topic (aurora.cdc.all-tables).
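The rerouting rule can be checked locally before deploying the connector. The following sketch uses Python's re module as a stand-in for the Java regular expressions that Debezium uses; for a simple pattern like this one the two behave the same. It assumes the intended pattern, once JSON escaping is removed, is aurora\.cdc\.public\.(.*). The reroute helper is illustrative and models only the topic name, not the key handling of the SMT.

```python
import re

# Pattern and replacement from the Reroute transform configuration.
TOPIC_REGEX = re.compile(r"aurora\.cdc\.public\.(.*)")
REPLACEMENT_TOPIC = "aurora.cdc.all-tables"


def reroute(topic):
    """Return the consolidated topic for matching names; leave others unchanged."""
    if TOPIC_REGEX.fullmatch(topic):
        return REPLACEMENT_TOPIC
    return topic


if __name__ == "__main__":
    for name in ("aurora.cdc.public.orders", "aurora.cdc.public.products", "schema-changes.aurora"):
        print(name, "->", reroute(name))
```

Both table topics map to aurora.cdc.all-tables, while a topic that does not match the pattern keeps its original name.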

Step 7: Verify the Debezium connector

After creating the connector, verify that it is running and has completed its initial snapshot.

aws kafkaconnect list-connectors --region <your-region> \
    --query 'connectors[?connectorName==`aurora-postgres-debezium-connector`].{Name:connectorName,State:connectorState}' \
    --output table

The connector state should show RUNNING, as shown in the following figure.

Figure 4. Debezium connector running on Amazon MSK Connect.


Check the CloudWatch Logs to confirm the snapshot completed:

aws logs tail /aws/msk-connect/aurora-cdc-debezium --follow --region <your-region>

You should see messages indicating the transition to streaming mode:

Finished exporting 0 records for table 'public.orders' (1 of 2 tables)
Finished exporting 0 records for table 'public.products' (2 of 2 tables)
Snapshot completed
Starting streaming

If the tables were empty when the connector started, the export count is 0. If you had existing data, the snapshot captures the existing rows as r (read) operations, which the Lambda function maps to insert operations in the Iceberg tables.
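The operation mapping described above can be sketched as a Firehose transformation handler. This is an illustration under stated assumptions, not the Lambda function shipped with the sample repository. It assumes each record's data field is a base64-encoded Debezium JSON event, that the destination table name equals the source table name, and that the destination namespace is aurora_cdc, as in the Athena queries later in this post. The helper name transform_record is hypothetical.

```python
import base64
import json

# Debezium operation codes mapped to the row-level operations named in this post.
OP_TO_ICEBERG = {"r": "insert", "c": "insert", "u": "update", "d": "delete"}

# Assumption: the namespace used in the Athena queries in this post.
DESTINATION_DATABASE = "aurora_cdc"


def transform_record(record: dict) -> dict:
    """Convert one Firehose record holding a Debezium event into a routed output record."""
    event = json.loads(base64.b64decode(record["data"]))
    # With converter schemas enabled the envelope sits under "payload"; otherwise it is top level.
    envelope = event.get("payload", event)
    operation = OP_TO_ICEBERG.get(envelope.get("op"))
    if operation is None:
        # Unknown or missing operation code: drop the record rather than guess.
        return {"recordId": record["recordId"], "result": "Dropped", "data": record["data"]}

    # Deletes carry the row image in "before"; every other operation uses "after".
    row = envelope["before"] if operation == "delete" else envelope["after"]
    return {
        "recordId": record["recordId"],
        "result": "Ok",
        "data": base64.b64encode(json.dumps(row).encode("utf-8")).decode("utf-8"),
        "metadata": {
            "otfMetadata": {
                "destinationDatabaseName": DESTINATION_DATABASE,
                "destinationTableName": envelope["source"]["table"],
                "operation": operation,
            }
        },
    }


def lambda_handler(event: dict, context: object) -> dict:
    """Firehose data-transformation entry point."""
    return {"records": [transform_record(r) for r in event["records"]]}
```

Because the table name comes from the event's source block, one handler can serve every table that shares the single topic.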

Verify that the Firehose delivery stream is active:

aws firehose describe-delivery-stream \
    --delivery-stream-name msk-to-s3tables-firehose \
    --region <your-region> \
    --query 'DeliveryStreamDescription.DeliveryStreamStatus'

The status should return ACTIVE.

Step 8: Test the pipeline

Insert test data into the Aurora PostgreSQL source tables. Each insert triggers a CDC event that flows through the pipeline: Aurora WAL to Debezium to MSK topic to Firehose to Lambda transform to S3 Tables.

-- Insert orders
INSERT INTO public.orders (customer_id, order_date, total_amount, status)
VALUES
(1, '2026-01-20', 299.99, 'shipped'),
(2, '2026-01-21', 149.50, 'processing'),
(1, '2026-01-22', 89.99, 'delivered');
-- Insert products
INSERT INTO public.products (product_name, category, price, stock_quantity)
VALUES
('Wireless Headphones', 'Electronics', 79.99, 150),
('Running Shoes', 'Sports', 129.99, 75),
('Coffee Maker', 'Kitchen', 49.99, 200);

This creates six records across two tables. Each record generates a Debezium CDC event with operation type c (create), which the Lambda function maps to an insert operation in the corresponding Iceberg table.
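The connector settings also shape how column values look inside each event. With decimal.handling.mode set to string, numeric columns such as total_amount arrive as strings, and Debezium encodes DATE columns as an integer number of days since 1970-01-01. The sketch below decodes a hypothetical after image for the first inserted order. It assumes order_date is a DATE column and that order_id is 1, neither of which is confirmed by this walkthrough.

```python
from datetime import date, timedelta
from decimal import Decimal

EPOCH = date(1970, 1, 1)


def decode_order_row(row: dict) -> dict:
    """Convert Debezium wire values for an orders row into native Python types."""
    decoded = dict(row)
    # decimal.handling.mode=string delivers numeric columns as strings such as "299.99".
    decoded["total_amount"] = Decimal(row["total_amount"])
    # Assumption: order_date is a DATE column, encoded as days since the Unix epoch.
    decoded["order_date"] = EPOCH + timedelta(days=row["order_date"])
    return decoded


# Hypothetical "after" image; the order_id value is assumed for illustration.
sample_after = {
    "order_id": 1,
    "customer_id": 1,
    "order_date": 20473,
    "total_amount": "299.99",
    "status": "shipped",
}
print(decode_order_row(sample_after))
```

Keeping amounts as Decimal rather than float preserves the exact value stored in PostgreSQL.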

Step 9: Verify data delivery

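Check the Firehose IncomingRecords metric to confirm that records are flowing through the delivery stream. The --start-time expression uses BSD/macOS date syntax (-v-10M). On GNU/Linux, use date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%S instead.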

aws cloudwatch get-metric-statistics \
    --namespace AWS/Firehose \
    --metric-name IncomingRecords \
    --dimensions Name=DeliveryStreamName,Value=msk-to-s3tables-firehose \
    --start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 60 --statistics Sum \
    --region <your-region>

You should see a Sum value of 6 or more. If the value is 0, wait another minute and retry. There can be a short delay between MSK topic delivery and Firehose metric reporting.

If records aren’t appearing, check the Firehose error output in the backup S3 bucket and the Lambda function’s CloudWatch Logs for transformation errors.
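If you save the command output as JSON, the check can be scripted. The following sketch builds the same ten-minute window without relying on platform-specific date flags and totals the Sum datapoints. The helper names are hypothetical, and the response shape assumes the standard Datapoints list returned by get-metric-statistics.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def metric_window(minutes: int = 10, now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (start, end) UTC timestamps in the format used by the CLI command."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M:%S"
    return start.strftime(fmt), end.strftime(fmt)


def total_incoming_records(response: dict) -> float:
    """Add up the per-period Sum values in a get-metric-statistics response."""
    return sum(point["Sum"] for point in response.get("Datapoints", []))


# Hypothetical response: two one-minute periods that together cover the six test rows.
sample_response = {
    "Label": "IncomingRecords",
    "Datapoints": [
        {"Timestamp": "2026-01-22T10:01:00Z", "Sum": 4.0, "Unit": "Count"},
        {"Timestamp": "2026-01-22T10:02:00Z", "Sum": 2.0, "Unit": "Count"},
    ],
}
print(metric_window())
print(total_incoming_records(sample_response))
```

Because the command uses a 60-second period, the records can be spread over several datapoints, so compare the total across the window with the expected count of 6 rather than any single datapoint.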

Step 10: Query data using Amazon Athena

With data delivered to S3 Tables, you can query the Iceberg tables using Amazon Athena. S3 Tables integrates with the AWS Glue Data Catalog as a sub-catalog, so you reference tables using the S3 Tables catalog format.

Tip: If records aren’t appearing in Athena, check the Firehose IncomingRecords CloudWatch metric and the Lambda function’s CloudWatch Logs for transformation errors.

Open the Athena console, select the AwsDataCatalog data source, and run the following queries:

SELECT * FROM "s3tablescatalog/<table-bucket-name>"."aurora_cdc"."products" LIMIT 10;
SELECT * FROM "s3tablescatalog/<table-bucket-name>"."aurora_cdc"."orders" LIMIT 10;

Replace <table-bucket-name> with your S3 table bucket name. You should see the records from the initial snapshot that Debezium captured when the connector started.

The following figures show the initial state of both tables as queried through Athena. At this point, the products and orders tables each contain seven records, captured during the Debezium initial snapshot.

Figure 5. Initial state of the products table in Amazon Athena, showing seven records captured from Aurora PostgreSQL through the CDC pipeline.


Figure 6. Initial state of the orders table in Amazon Athena, showing seven records captured from Aurora PostgreSQL through the CDC pipeline.


Now test that insert, update, and delete operations propagate correctly. Run the following statements in Aurora:

-- Insert new records
INSERT INTO public.products (product_name, category, price, stock_quantity)
VALUES ('Bluetooth Speaker', 'Electronics', 129.99, 90), ('Standing Desk', 'Furniture', 799.99, 20);
INSERT INTO public.orders (customer_id, order_date, total_amount, status)
VALUES (201, '2026-04-03', 149.99, 'NEW'), (202, '2026-04-03', 249.50, 'NEW'), (203, '2026-04-03', 79.90, 'NEW');
-- Update existing records
UPDATE public.products SET stock_quantity = 30, price = 549.99 WHERE product_name = 'Ergonomic Chair';
UPDATE public.orders SET status = 'DELIVERED' WHERE order_id = 201;
-- Delete a record
DELETE FROM public.products WHERE product_name = 'Test Widget';

Wait for the changes to propagate through the pipeline, then query Athena again. The following figures show the results after the insert, update, and delete operations have been applied.

In the products table, the Test Widget record (product_id 100) is no longer present because it was removed by the delete operation. The Ergonomic Chair row now reflects the updated price (549.99) and stock quantity (30). Two new records, Bluetooth Speaker and Standing Desk, appear with a later created_at timestamp, confirming they were inserted after the initial snapshot.

Figure 7. Products table after CDC operations. The Ergonomic Chair, Headphones, and Desk Lamp rows reflect updated values. Bluetooth Speaker and Standing Desk are newly inserted records. The Test Widget record has been removed by the delete operation.


In the orders table, order 100 now shows a status of SHIPPED and order 201 shows DELIVERED, reflecting the update operations. Three new orders (301, 302, 303) appear with status NEW and a later timestamp, confirming they were inserted after the initial load.

Figure 8. Orders table after CDC operations. Orders 100 and 201 reflect updated status values. Orders 301, 302, and 303 are newly inserted records.


This confirms that the pipeline correctly handles all three CDC operation types. Inserts, updates, and deletes are captured from the Aurora WAL by Debezium, routed through the single MSK topic, transformed by the Lambda function, and applied as row-level Iceberg operations by Firehose.
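As a mental model for the row-level semantics, the following toy sketch replays a few operations against an in-memory dictionary keyed by primary key. It illustrates the expected end state only and says nothing about how Firehose or Iceberg apply changes internally. The apply_change helper, the product_id of 1 for Ergonomic Chair, and the starting price and stock values are assumptions for illustration.

```python
def apply_change(table: dict, operation: str, row: dict, key: str = "product_id") -> None:
    """Apply one row-level operation to an in-memory table keyed by primary key."""
    if operation == "insert":
        table[row[key]] = dict(row)
    elif operation == "update":
        # Merge so that columns absent from the change keep their previous values.
        table[row[key]] = {**table.get(row[key], {}), **row}
    elif operation == "delete":
        table.pop(row[key], None)
    else:
        raise ValueError(f"unsupported operation: {operation}")


products = {}
changes = [
    ("insert", {"product_id": 1, "product_name": "Ergonomic Chair", "price": "599.99", "stock_quantity": 45}),
    ("insert", {"product_id": 100, "product_name": "Test Widget", "price": "9.99", "stock_quantity": 5}),
    ("update", {"product_id": 1, "price": "549.99", "stock_quantity": 30}),
    ("delete", {"product_id": 100}),
]
for operation, row in changes:
    apply_change(products, operation, row)
print(products)
```

After the replay, only the Ergonomic Chair row remains, carrying the updated price and stock quantity, which mirrors the result described for the products table.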

S3 Tables maintains the Iceberg tables automatically, compacting small data files and expiring old snapshots, so you don’t need to run manual maintenance operations.

You can also use Iceberg’s time travel capability to query the table as it existed before the updates:

SELECT * FROM "s3tablescatalog/<table-bucket-name>"."aurora_cdc"."orders"
FOR TIMESTAMP AS OF current_timestamp - interval '5' minute;

This returns the table as it existed five minutes earlier. If you run the query within five minutes of applying the changes, you see the original rows from before the updates.
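A relative interval shifts every time you rerun the query. To pin the query to a fixed point in time, you can generate the statement with an explicit timestamp literal, as in the sketch below. The time_travel_query helper and the my-table-bucket name are hypothetical, and the literal format follows the Athena time travel examples for Iceberg tables, so verify it against your engine version. Values are interpolated without escaping, so use trusted input only.

```python
from datetime import datetime, timezone


def time_travel_query(table_bucket: str, table: str, as_of: datetime) -> str:
    """Build an Athena time travel query pinned to a fixed UTC timestamp.

    as_of should be timezone-aware; it is converted to UTC before formatting.
    """
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return (
        f'SELECT * FROM "s3tablescatalog/{table_bucket}"."aurora_cdc"."{table}" '
        f"FOR TIMESTAMP AS OF TIMESTAMP '{stamp} UTC'"
    )


# Hypothetical bucket name and timestamp, chosen for illustration.
print(time_travel_query("my-table-bucket", "orders", datetime(2026, 4, 3, 10, 0, tzinfo=timezone.utc)))
```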

Cleaning up

To avoid ongoing charges, delete the resources in reverse dependency order.

Delete the CDK stacks:

cd cdk
npx cdk --app "npx ts-node bin/app-v2.ts" destroy --all

Delete the Debezium custom plugin and worker configuration that were created through the AWS CLI in Step 2:

aws kafkaconnect delete-custom-plugin --custom-plugin-arn <custom-plugin-arn>
aws kafkaconnect delete-worker-configuration --worker-configuration-arn <worker-config-arn>

Clean up the Aurora PostgreSQL replication resources:

SELECT pg_drop_replication_slot('debezium_slot');
DROP PUBLICATION dbz_publication;

Important: The replication slot (debezium_slot) was created automatically by Debezium. If you plan to redeploy the pipeline later, you don’t need to drop the slot and publication. However, the replication slot continues to retain WAL segments while the connector isn’t running, which can increase storage usage on the Aurora cluster.

The MSK cluster is the largest cost component of this solution and can’t be paused. It can only be deleted and recreated.

Conclusion

In this post, we showed you how to build a near real-time CDC pipeline from Aurora PostgreSQL to Apache Iceberg tables in Amazon S3 Tables. The key architectural decisions include:

  • Single-topic routing with multi-table delivery. The Debezium ByLogicalTableRouter SMT routes CDC events from multiple tables through one MSK topic, and the Lambda otfMetadata routing directs each record to the correct Iceberg table. This reduces VPC connection costs by using a single Firehose stream for inserts, updates, and deletes across multiple destination tables.
  • Fully managed CDC pipeline. MSK Connect runs Debezium, Firehose handles delivery with automatic retries, and S3 Tables manages Iceberg compaction and snapshots. The Lambda transform preserves CDC semantics by mapping Debezium operations to Iceberg row-level operations.
  • Governed lakehouse access. Lake Formation controls fine-grained access to the Iceberg tables, and data from multiple isolated Aurora clusters can be unified in a single S3 Tables namespace for cross-domain analytics.
  • Infrastructure as code. Six AWS CDK stacks deploy the core pipeline, with Lake Formation permissions, MSK cluster policy, and Debezium connector configured through documented CLI steps.

To get started, clone the sample repository and follow the walkthrough steps. For more information about the services used in this solution, see the Amazon MSK Developer Guide, Amazon Data Firehose Developer Guide, and Amazon S3 Tables User Guide.

We encourage you to try this solution and adapt it to your own CDC workloads. If you have questions or feedback, leave a comment on this post.


About the author

Chintan Agrawal


Chintan is a Solutions Architect with over 7 years of experience, specializing in analytics and healthcare. He is passionate about helping customers discover valuable insights from their data, and he builds solutions that help businesses make informed, data-driven decisions.
