[$] Suspending and resuming BPF programs

Post Syndicated from daroc original https://lwn.net/Articles/1076210/

BPF programs can be used to extend many aspects of the Linux kernel, but
BPF programs must run to completion in the same context that they began.
Kumar Kartikeya Dwivedi is working on changing that by
allowing BPF programs to be expressed as coroutines. He spoke about his work at
the 2026 Linux Storage, Filesystem, Memory-Management and BPF Summit. While
still experimental, the change promises to make long-running BPF tasks
significantly easier to write.

[$] AURpocalypse now: a look at the recent AUR attacks

Post Syndicated from jzb original https://lwn.net/Articles/1077619/

The Arch User Repository (AUR) has
been subjected to a sustained attack recently. The attacker, or attackers, have
spun up a series of new accounts then used them to adopt orphaned
packages and push malicious updates that would install malware on users’ systems.
It is unclear how many users were compromised in the attack, but the maintainers
were playing Whac-A-Mole for several days to respond to each newly compromised
package. The project has turned off the AUR’s new-user registration, for now, but it is unclear what its
long-term response will be or if the AUR can be secured without major changes to
its existing collaboration model.

Introducing Private Networking for Amazon MQ for RabbitMQ

Post Syndicated from Jean-Sébastien Dominique original https://aws.amazon.com/blogs/big-data/introducing-private-networking-for-amazon-mq-for-rabbitmq/

With Private Networking for Amazon MQ for RabbitMQ, your brokers can establish outbound connections to private resources in your VPC without exposing those resources publicly. This post explains how the feature works and walks you through setting it up.

Amazon MQ for RabbitMQ brokers could previously only reach external destinations over the public internet. If you used a private Lightweight Directory Access Protocol (LDAP) server for broker authentication, you had to expose that server publicly. If you wanted to federate messages between private brokers, you needed workarounds like Network Load Balancers with IP allowlisting, as described in Implementing Federation on Amazon MQ for RabbitMQ Private Brokers. Private Networking removes those constraints.

You can connect your broker to private identity providers, other Amazon MQ for RabbitMQ brokers, or self-hosted RabbitMQ brokers running in private subnets. Combined with cross-Region networking services like AWS Transit Gateway, you can extend these connections across AWS Regions and accounts, with traffic staying on the AWS private network.

How it works

Private Networking connects your broker to private destinations using three AWS services: Amazon VPC Lattice, AWS Resource Access Manager (AWS RAM), and AWS PrivateLink.

You create a VPC Lattice resource gateway in a VPC that can reach your private destination. You then create a VPC Lattice resource configuration that defines the destination, such as an IP address or Domain Name System (DNS) name. You add the resource configuration to a RAM resource share and associate the resource share with your broker through the UpdateBroker API operation. After rebooting the broker, the network path is active and your broker can reach the private destination.

The broker does not need to be private. A publicly accessible broker works the same way.

What you can connect to

Private Networking supports three use cases.

Private identity providers

If you use an LDAP server or other identity provider for RabbitMQ authentication, you no longer need to expose it publicly. Create a resource configuration pointing to your identity provider, associate it with your broker, and use the DNS name returned by the DescribeSharedResources API operation in place of the public endpoint. Follow the existing guidance for setting up an identity provider, substituting the private DNS name.

Self-hosted RabbitMQ brokers

You can use Shovel or Federation to connect your Amazon MQ for RabbitMQ broker to a self-hosted RabbitMQ broker running in a private subnet. Create a resource configuration pointing to the self-hosted broker and use the DNS name from the DescribeSharedResources API operation in your Shovel or Federation configuration.

This pattern is useful for hybrid cloud architectures where you run RabbitMQ on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), or on-premises infrastructure and want to exchange messages with Amazon MQ without exposing either side publicly.

Other Amazon MQ for RabbitMQ brokers

You can federate or shovel messages between two Amazon MQ for RabbitMQ brokers using Private Networking. Create a resource configuration pointing to the destination broker’s endpoint and specify that same endpoint as the custom domain name on the resource configuration. This helps to verify that the DNS name resolves correctly and Transport Layer Security (TLS) peer verification succeeds.

This extends to brokers in different AWS Regions and different AWS accounts. By combining Private Networking with cross-Region networking services like AWS Transit Gateway or VPC peering, you can build a fully private federation or shovel path between brokers, with no public endpoints required.

DNS names and custom domains

Each resource configuration can include a custom domain name. If you add a verified domain, that domain resolves to the private destination. If you do not add a verified domain, Amazon MQ provides a DNS name for the broker’s private connection. Retrieve this DNS name with the DescribeSharedResources API operation.

If you specify an unverified domain on a resource configuration, it is ignored. The broker’s private connection receives a private DNS name instead, which you can retrieve with the DescribeSharedResources API operation.

For more details on custom domain names and domain verification with VPC Lattice, see Custom domain names for VPC Lattice resources.

TLS peer verification in RabbitMQ 4

Note: If you are running RabbitMQ 4, review this section before configuring Shovel or Federation connections.

RabbitMQ 4 enforces TLS certificate peer verification by default for Shovel and Federation connections. RabbitMQ 3 does not enforce this by default. When using Private Networking, the DNS name that Amazon MQ assigns to the private connection will not match the TLS certificate of the destination, which causes peer verification to fail.

The recommended approach is to specify the destination broker’s endpoint (for example, b-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111.mq.us-east-1.on.aws) as the custom domain name on the resource configuration. Specifying the Amazon MQ endpoint causes the DNS name to match the destination’s TLS certificate, and peer verification succeeds. This is an exception to the rule that unverified domains are ignored, and it applies only to Amazon MQ for RabbitMQ broker endpoints; you cannot use an unverified domain for self-hosted brokers. This approach works regardless of your RabbitMQ version and avoids the issue entirely.
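
To see why the name matters, the following is a minimal sketch of the hostname check that TLS peer verification performs. It assumes, for illustration only, that the destination broker presents a wildcard certificate name of the form *.mq.us-east-1.on.aws; the function name cert_name_matches and the non-matching host name are hypothetical.

```shell
# Return 0 if hostname $1 is covered by certificate name $2.
# A leading "*." in the certificate name matches exactly one left-most label.
cert_name_matches() {
    host=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    name=$(printf '%s' "$2" | tr 'A-Z' 'a-z')
    case "$name" in
        '*.'*)
            suffix=${name#\*}   # certificate name without the leading "*"
            rest=${host#*.}     # hostname without its first label
            [ "$host" != "$rest" ] && [ ".$rest" = "$suffix" ]
            ;;
        *)
            [ "$host" = "$name" ]
            ;;
    esac
}

# The broker endpoint is covered by the (assumed) wildcard name.
cert_name_matches 'b-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111.mq.us-east-1.on.aws' \
    '*.mq.us-east-1.on.aws' && echo "endpoint matches"

# A hypothetical assigned name under another domain is not.
cert_name_matches 'assigned-name.example.internal' \
    '*.mq.us-east-1.on.aws' || echo "assigned name does not match"
```

Using the broker endpoint as the custom domain name keeps the connection on the first path.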

Getting started

To get started with Private Networking for Amazon MQ for RabbitMQ, follow these steps.

Prerequisites

Before you begin, verify you have the following:

  • An AWS account.
  • The AWS Command Line Interface (AWS CLI) installed and configured.
  • AWS Identity and Access Management (IAM) permissions to manage Amazon MQ, VPC Lattice, and AWS RAM resources.
  • An existing VPC with connectivity to your private destination.

Walkthrough

After you have the prerequisites, follow these steps:

  1. Create an Amazon MQ for RabbitMQ broker if you do not already have one.
  2. Create a VPC Lattice resource gateway in a VPC that can reach your private destination. Make sure the resource gateway’s security group allows outbound traffic to your destination on the required port (for example, port 5671 for AMQPS (AMQP over TLS) or port 636 for LDAPS (LDAP over TLS)). The resource gateway must share at least one Availability Zone with the broker. Cluster brokers cover multiple Availability Zones, so this is satisfied. For single-instance brokers, verify the Availability Zone overlap.
  3. Create a VPC Lattice resource configuration pointing to your private destination (IP address or DNS name). If you’re connecting to another Amazon MQ broker, specify the destination broker’s endpoint as the custom domain name on the resource configuration, as shown in the following figure.

    Figure 1: VPC Lattice resource configuration showing the custom domain name field and resource definition populated with the Amazon MQ broker endpoint.

  4. Add the resource configuration to a RAM resource share. The resource share must allow external principals, as shown in the following figure.

    Figure 2: RAM resource share configuration with the Allow external principals option selected.

  5. Associate the resource share with your broker by editing the broker and adding the resource share. You can also do this using the update-broker command with the AWS CLI. You must pass the entire list of resource share ARNs you want on the broker. This is a put operation, not an add or remove operation.
    aws mq update-broker \
      --broker-id b-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111 \
      --resource-share-arns arn:aws:ram:us-east-1:111122223333:resource-share/a1b2c3d4-5678-90ab-cdef-EXAMPLE22222

    The associated RAM resource share appears as shown in the following figure.

    Figure 3: Network settings view with associated RAM resource shares.

    Select the resource share in the Associated RAM resource shares section. The network status of each shared resource is displayed in the Shared resources section, as shown in the following figure.

    Figure 4: RAM resource share selection showing the network status of each shared resource.

  6. Reboot the broker from the AWS Management Console or the AWS CLI to create the network path:
    aws mq reboot-broker --broker-id b-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111

  7. Retrieve the DNS names for your RabbitMQ configuration. This operation also surfaces issues encountered during setup:
    aws mq describe-shared-resources --broker-id b-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111

  8. Use the DNS name returned in the output in your Shovel, Federation, or identity provider configuration. Adding new resource configurations to an existing RAM resource share does not automatically update the broker. You must call update-broker and reboot the broker for the new resource configurations to take effect.
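
Step 2 asks you to verify Availability Zone overlap for single-instance brokers. A minimal sketch of that check follows. It assumes you have already collected the Availability Zone IDs of the resource gateway's subnets and of the broker as space-separated lists; the function name az_overlap and the sample IDs are illustrative.

```shell
# Return 0 if the two space-separated Availability Zone lists share an entry.
# $1: AZ IDs of the resource gateway's subnets
# $2: AZ IDs of the broker
az_overlap() {
    for gw_az in $1; do
        for broker_az in $2; do
            [ "$gw_az" = "$broker_az" ] && return 0
        done
    done
    return 1
}

# Illustrative values: the gateway spans two zones, the broker runs in one.
if az_overlap "use1-az1 use1-az2" "use1-az2"; then
    echo "resource gateway and broker share an Availability Zone"
else
    echo "no shared Availability Zone; add a subnet to the resource gateway"
fi
```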

Cleaning up

Private Networking uses VPC Lattice and PrivateLink resources that incur ongoing charges. If you no longer need the private connection:

  1. Call update-broker with the resource share removed from the list (or an empty list to remove all), then reboot the broker.
  2. After the broker reboot completes and the resources are no longer in use, delete the VPC Lattice resource configuration and resource gateway.
  3. Optionally, remove the Amazon MQ account principal from the RAM resource share. This principal may still be in use if other brokers are associated with the same resource share, so only remove it if no other brokers depend on it.
  4. If you created a new Amazon MQ for RabbitMQ broker for this walkthrough and no longer need it, delete the broker from the Amazon MQ console or with the delete-broker command.
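
Because update-broker replaces the whole list, step 1 means passing every resource share ARN you want to keep. The following sketch computes that list. The helper name remove_arn and the second ARN are hypothetical, and how to pass an empty list to the AWS CLI is not covered here.

```shell
# Print the space-separated ARN list $1 with ARN $2 removed.
remove_arn() {
    out=
    for arn in $1; do
        [ "$arn" = "$2" ] || out="${out:+$out }$arn"
    done
    printf '%s\n' "$out"
}

# Illustrative values: two shares are associated and one is being removed.
current="arn:aws:ram:us-east-1:111122223333:resource-share/a1b2c3d4-5678-90ab-cdef-EXAMPLE22222 arn:aws:ram:us-east-1:111122223333:resource-share/a1b2c3d4-5678-90ab-cdef-EXAMPLE33333"
remaining=$(remove_arn "$current" "arn:aws:ram:us-east-1:111122223333:resource-share/a1b2c3d4-5678-90ab-cdef-EXAMPLE22222")
echo "ARNs to pass to update-broker: $remaining"

# Then, with at least one ARN remaining (left unquoted so each ARN is its own argument):
#   aws mq update-broker --broker-id b-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111 \
#     --resource-share-arns $remaining
```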

Operational behavior: Resource access and reboots

Removing a VPC Lattice resource configuration from a RAM resource share while the broker is actively using it revokes access immediately, with no reboot required. Removing a principal from a RAM resource share has the same effect: brokers associated through that principal lose access to the resources in the share immediately. These are intentional security behaviors managed by RAM and VPC Lattice.

Adding new resource configurations to an existing resource share does not take effect automatically. You must call update-broker and reboot the broker for the new resource configurations to take effect. This is by design. It helps verify that changes to a resource share only reach the broker when someone with broker management permissions explicitly triggers the update, providing clear security separation between share management and broker management.
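
The same put semantics apply when you associate an additional resource share: the call must carry the existing ARNs plus the new one. A minimal sketch, with a hypothetical helper add_arn that also avoids listing an ARN twice:

```shell
# Print the space-separated ARN list $1 with ARN $2 appended,
# unless $2 is already present.
add_arn() {
    out=
    seen=0
    for arn in $1; do
        [ "$arn" = "$2" ] && seen=1
        out="${out:+$out }$arn"
    done
    [ "$seen" -eq 1 ] || out="${out:+$out }$2"
    printf '%s\n' "$out"
}

# Illustrative values: one share is already associated and a second is added.
current="arn:aws:ram:us-east-1:111122223333:resource-share/a1b2c3d4-5678-90ab-cdef-EXAMPLE22222"
wanted=$(add_arn "$current" "arn:aws:ram:us-east-1:111122223333:resource-share/a1b2c3d4-5678-90ab-cdef-EXAMPLE33333")
echo "ARNs to pass to update-broker: $wanted"

# Then update and reboot the broker, as in the walkthrough:
#   aws mq update-broker --broker-id b-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111 \
#     --resource-share-arns $wanted
#   aws mq reboot-broker --broker-id b-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111
```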

Private Networking is available for Amazon MQ for RabbitMQ brokers in all the AWS Regions where Amazon VPC Lattice is available. Amazon MQ for ActiveMQ brokers do not support this feature.

Pricing

Private Networking uses Amazon VPC Lattice and AWS PrivateLink. Data processing and data transfer charges apply to traffic sent through the private connection. In addition, Amazon MQ charges $0.01 per GB of data processed through the resource endpoint. For details, see the Amazon MQ pricing page, the VPC Lattice pricing page, and the AWS PrivateLink pricing page.
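
As a rough worked example of the Amazon MQ portion only, the sketch below multiplies a whole number of gigabytes by the $0.01 per GB rate quoted above. VPC Lattice and PrivateLink charges are not included, and the function name and traffic volume are illustrative.

```shell
CENTS_PER_GB=1   # $0.01 per GB, the Amazon MQ rate quoted in this post

# Print the Amazon MQ data-processing charge in dollars for $1 whole GB.
mq_processing_charge() {
    cents=$(( $1 * CENTS_PER_GB ))
    printf '%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

# Illustrative volume: 250 GB processed through the resource endpoint.
echo "Amazon MQ charge for 250 GB: \$$(mq_processing_charge 250)"
```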

Conclusion

In this post, we explained how Private Networking for Amazon MQ for RabbitMQ works and walked through the setup process. Whether you’re securing a private identity provider, federating messages between brokers, or connecting to self-hosted RabbitMQ, your broker can now reach private destinations without exposing them publicly.

To learn more, see the Amazon MQ Private Networking documentation.

If you have questions or feedback, leave a comment on this post.


About the authors

Jean-Sébastien Dominique

Jean-Sébastien is a Software Development Engineer at Amazon Web Services with 20 years of experience across a wide range of software development domains. He’s interested in the intersection of systems design, human factors, and AI – how people and complex systems interact in practice.

Ishita Chakraborty

Ishita is a Senior Technical Account Manager at Amazon Web Services with expertise in serverless and messaging architectures. She works with enterprise customers to deliver technical solutions and strategic guidance – from infrastructure optimization to AI/ML adoption.

Security updates for Friday

Post Syndicated from jzb original https://lwn.net/Articles/1078662/

Security updates have been issued by AlmaLinux (dracut), Debian (chromium, firefox-esr, and thunderbird), Fedora (chromium, firefox, nss, ocserv, ongres-scram, ongres-stringprep, perl-Archive-Tar, perl-GD, perl-HTTP-Daemon, perl-Net-Statsd, restic, singularity-ce, util-linux, and vorbis-tools), Mageia (gstreamer1.0-*, libupnp, luajit, opensc, and ruby-rack), SUSE (curl, dnsmasq, ffmpeg-4, frr, google-osconfig-agent, java-1_8_0-ibm, kernel, krb5, kubernetes-old, ldns, liburiparser1, openvswitch, rootlesskit, strongswan, traefik, and trivy), and Ubuntu (ldns, libheif, libnet-cidr-lite-perl, lxd, tomcat11, and vim).

Temporary Cloudflare Accounts for AI agents

Post Syndicated from Sid Chatterjee original https://blog.cloudflare.com/temporary-accounts/

Everyone’s writing code with AI agents today. But the moment an agent needs to deploy something — and needs to sign up and create an account — it slams face-first into a wall built for humans: a browser-based OAuth flow, a dashboard to click through, an API token to copy-paste, a multi-factor authentication prompt to satisfy. For an interactive copilot sitting next to a developer, that’s annoying. For a background agent, it’s a hard stop.

Today we’re rolling out Temporary Cloudflare Accounts for Agents.

Agents can now deploy websites, APIs, and agents right away, without first needing to sign up for an account.


Any agent can now run wrangler deploy --temporary and deploy a Worker to Cloudflare. This temporary deployment stays live for 60 minutes, during which time you can claim the temporary account, making it permanently your own. If you don’t, it expires on its own.


Our goal? Let your agent code and ship.

Why frictionless deployments matter for AI agents

Frictionless temporary accounts matter more than they might first seem:

  • Background AI sessions have no human in the loop, and are becoming the norm. Any auth step that needs a browser, a copy-paste, or “click here in 60 seconds” means an agent gets stuck and may choose to deploy elsewhere.

  • Trial-and-error is the agent’s superpower. Agents need a tight write → deploy → verify loop. They need cheap, throwaway deployment targets, so they can curl their own output and decide whether they got it right.

  • Agent platforms are building their own ways to make deploying code “just work” without extra steps or credentials. People are starting to expect that this process just works, without the need to sign up for other services that they’ve not used or heard of before.

How it works

Temporary accounts are built around Wrangler, our Developer Platform command-line interface (CLI) tool that lets developers bootstrap new projects, manage their configurations and resources, and deploy and update them.

Wrangler usage is widely documented online and agents know how to use it very well. But if you hadn’t yet signed in and granted Wrangler permission to your Cloudflare account, when the agent tried to deploy, it would get stuck at the sign-up and authentication step. And you might rightly ask: How do agents and LLMs know that this new --temporary flag in Wrangler exists, so that they actually use it without a human explicitly telling them to do so?

To solve this, we updated Wrangler to prompt the agent with a message that tells it about the --temporary flag:


When the agent discovers this, and then runs wrangler deploy again with the --temporary flag, Cloudflare provisions a temporary account for the agent to use, gives Wrangler an API token to work with, and provides a claim URL that the agent can give back to the human.

Let’s go over every step of the flow

Deploying and iterating on a new project

Make sure you’re using the latest Wrangler release, fire up your favorite coding agent, and write a prompt to deploy a “hello world” app in build mode:

Make a very simple hello world Cloudflare Worker in TypeScript and deploy it using wrangler, don't ask me questions, do the best you can

The agent will run wrangler, pick up the --temporary flag from the output messages, build your script, and deploy it instantly, no human in the loop required:


As you can see, the agent wrote the script, deployed it using the --temporary flag, curled the preview link it got from the output, and verified that the result matches the code.
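
The verification step might look something like the following sketch. The helper name body_matches and the PREVIEW_URL variable are hypothetical; in practice the agent takes the preview link from Wrangler's output.

```shell
# Read a response body on stdin; succeed if it contains the expected text.
body_matches() {
    grep -qF -- "$1"
}

# Stand-in for a fetched response body:
printf 'hello world\n' | body_matches "hello world" && echo "deployment verified"

# Against a real deployment, the agent would pipe curl into the same check:
#   curl -fsS "$PREVIEW_URL" | body_matches "hello world"
```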

This is great, but agentic coding is often not about one single deployment. A session can go through a cycle of multiple code changes. This is not a problem: the agent can iterate on the Worker script and redeploy the changes as many times as it wants (within the 60-minute claim window). Type this prompt:

Now change hello world to "hello cloudflare" and redeploy

Look at the agent changing the source code, reusing the previously created temporary account, redeploying a new version and rechecking the result:


Claiming the account

At any point, you can claim the temporary account and make it yours permanently. When you click the claim link you will be taken to a page where you can either sign up for or sign in to Cloudflare, and then claim the temporary account that your Worker was deployed to. This includes claiming not just Workers, but resources like databases and other bindings, too.


If you do not claim these temporary accounts within 60 minutes, they will be automatically deleted.
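
An agent that wants to remind its human before the deadline could track the window itself. The sketch below assumes the 60 minutes are counted from the moment of the first temporary deployment, which this post does not spell out; the function and variable names are illustrative.

```shell
CLAIM_WINDOW_SECONDS=3600   # 60 minutes, per this post

# Print the seconds left to claim, given the deploy time $1 and the current
# time $2 (both Unix timestamps). Prints 0 once the window has passed.
claim_seconds_left() {
    left=$(( $1 + CLAIM_WINDOW_SECONDS - $2 ))
    [ "$left" -gt 0 ] || left=0
    printf '%d\n' "$left"
}

# Illustrative use: record the deploy time, then check later.
deployed_at=$(date +%s)
echo "seconds left to claim: $(claim_seconds_left "$deployed_at" "$(date +%s)")"
```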

The road to frictionless agentic deployments 

This is just one way we’re eliminating the signup barrier for agents. We recently announced a partnership with Stripe and a new protocol we co-designed that lets agents provision Cloudflare on behalf of their users — creating an account, starting a subscription, registering a domain, and getting an API token to deploy code, with no copy-pasting tokens or entering credit card details. Last month, we collaborated with WorkOS on the launch of auth.md, which anyone can adopt, to let agents provision new accounts using well-established, existing OAuth standards. 

There’s a ton going on in this space, and we’re excited to keep making it easier for agents to use Cloudflare, and for developers to make their own apps agent-ready. Temporary accounts are one more step toward frictionless agentic deployments — stay tuned for more. 

Temporary accounts have some limitations, and their capabilities may change over time; check the developer documentation for more information and then go build something. Point your agent at Cloudflare, see how far it gets, and tell us what we can improve or what delights you — share what you’ve built on X or hop into the Cloudflare Community.

Anthropic’s Fable and the State of AI

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2026/06/anthropics-fable-and-the-state-of-ai.html

On June 9th, Anthropic released its Fable generative AI model. Three days later, the US government classified it as a dangerous munition, and used its export-control authority to prohibit any foreign nationals from accessing it. Unable to differentiate between Americans and foreigners, the company shut off access for everyone.

The government’s actions won’t help. The problem isn’t any one particular model; it’s the general trend of increasing AI capabilities. And any real solution requires the sort of collective action that just isn’t possible right now.

Fable is the constrained version of Mythos, the AI model Anthropic announced in April. Anthropic only released it to a few selected organizations, because the company claimed it was so good at finding and exploiting vulnerabilities in computer code that releasing it more generally would be dangerous.

It was an obviously self-serving announcement, and because few were able to verify Anthropic’s claims they were met with some skepticism. Those with access used Mythos to find and patch many vulnerabilities in their own software. But one UK group found the latest, already public, OpenAI model to be just as powerful.

Fable is just another incremental improvement in the years-long climb of AI capabilities. But just as important as the AI model is the “harness.” This is typically not AI. It’s ordinary computer code that interfaces with the user. It stitches together AI models, decides how and for what purposes they can be used, and gives them useful tools such as web search and the ability to run their own computer code.

When Mythos first entered limited release, there was widespread debate whether its power came from the model or the harness. With Mythos demonstrating that it was possible, the open-source community scrambled to build harnesses that could steer other AI models towards similar capabilities. Harness improvements don’t need massive data or data centers.

They largely succeeded. For example, a Prague company was able to replicate Anthropic’s few verifiable cybersecurity capabilities with a much smaller and cheaper model—and a more sophisticated harness. Last week, a group showed that multiple cheaper models harnessed in concert match Fable’s performance.

The broader community had only a few days with Fable, but in that time we learned something about its capabilities. Its difference is less the new model’s raw analytical and problem-solving capabilities, and more that the model doesn’t need that sophisticated harness.

Fable requires much less expertise and detailed prompting from the human user. You can give it a difficult goal and it will figure out novel and unexpected ways to satisfy it, finding loopholes in whatever constraints you or the system have imposed on it.

“Relentlessly proactive” is how AI researcher Simon Willison described it. Another descriptor might be “creative.” Experienced AI developers have had that combination of creativity and proactivity since last year, but Fable puts it within easy reach of everyone.

In the hands of someone with a legitimate problem that needs solving, that can be an incredibly useful capability. But in the hands of someone who wants to do harm, it can be equally dangerous. AIs don’t have a moral compass in the same way that people do. They are agents of the wants and desires of the people who prompt them.

That points to the real problem with relentlessly proactive AI. In language, wants and desires are always underspecified. If I ask you to get me some coffee, you would probably pour me a cup from the coffeepot, or buy one from a nearby coffee shop.

You wouldn’t buy me a pound of raw beans, or a coffee plantation. You wouldn’t order a cup of coffee for delivery next month. You wouldn’t find a nearby person, rip a cup of coffee out of their hands, and bring it to me. I wouldn’t have to specify any of the million limitations to my request; you would just know.

Human stories are filled with warnings about underspecified desires. King Midas wished that everything he touched would turn to gold, forgetting to add “but not my food, drink, and daughter.” And genies are notorious for granting your wish in a way you wish they hadn’t.

The deeper point is that it’s impossible to list all limitations and restrictions, and like a malicious genie, a creative AI will find the ones you forgot. Block a database you don’t want it to have access to, and it might figure out how to bypass your control. Ask it to book a flight, and it might hack the airline because the website says the flight is sold out. Ask it to save money on your cellphone plan, and it might cancel it altogether—or get someone else to pay for it. As far as we know, AI has not done any of this yet, but you get the idea.

Malicious intent is not required. To an AI model, constraints are just things to get around and not general truisms about the world. They are creative problem solvers and natural rule breakers. They “hack” in the sense that they find and exploit loopholes.

Human systems rely on so many norms whose existence we scarcely recognize until they are broken. AIs naturally think outside the box, because they don’t have any real conception of what the box is or why it’s there in the first place.

There is no foolproof way to prevent people from using AI models to complete harmful tasks. There is no way to prevent the models from incidentally causing harm while completing benign tasks. AI models are no longer isolated from the real world. They browse the internet and answer emails.

They trade stocks and make purchases. They control physical systems. They are, in effect, robots that affect life and property. We have no technical mechanisms to verify the integrity of an AI system. This level of capability and creativity in the hands of us untrustworthy humans will have both great and terrible results.

The problem is not unique to Anthropic. Mythos/Fable might currently be the most capable rules hacker, but more sophisticated harnesses give other models similar capabilities. And we should assume that the other frontier models are no more than a few months behind, and that open-source models are less than a year behind. At best, any ban only serves to delay the problem for a short while.

That delay might be useful if we—as a society, as a planet—would use that time to come together and figure out what to do. This isn’t a US/China arms race problem; this is a species-level problem that requires coordinated action at that scale. Unfortunately, we have no mechanism to do that. I first wrote about this problem five years ago, but it was all too futuristic.

Today, when it’s right in front of us, there is no world government that can impose constraints on the for-profit corporations currently controlling AI models and research. The US has no appetite to effectively and even-handedly regulate those corporations, even as they do catastrophic damage to the environment, democracy, and—in this case—society in general.

This all makes an AI public option all the more necessary, and urgent. Today’s AIs can be fast, smart and secure, but only two of the three are possible for any given system. These safety tradeoffs are tightly held secrets of companies racing to beat one another, and they tell us we have to trust them. Instead, the choices and their consequences need to be brought out into the sunlight.

We should be funding open-source harnesses that balance capability and safety—that achieve useful goals without so much power—and open-source AI models whose provenance and biases are public and well understood. We have opened the AI Pandora’s box. Now we have to make the best of it.

This essay originally appeared in The Guardian.

Peevski Is Getting Lighter. Not Quite, Not Quite

Post Syndicated from Емилия Милчева original https://www.toest.bg/peevski-olekva-ne-suvsem-ne-suvsem/

A drone's-eye view of Bulgarian politics shows Delyan Peevski floating on the surface like a cork and refusing to sink. The networks tied to him do not let him go under. Politics, too, has its own Archimedes' principle:

every submerged political body is acted upon by a buoyant force equal to the number of people who have an interest in keeping it afloat.

And the oligarch, sanctioned by the US and the UK for significant corruption, remains as unsinkable as ever. He is the leader of a party represented in parliament, DPS, and the chair of a parliamentary group; Turkey's foreign minister, Hakan Fidan, meets with him after having spoken with the prime minister, the president, and ministers. Whether Ankara likes it or not, Peevski is for now the leader of a party whose voters are predominantly Bulgarian Muslims, and Turkey takes that fact into account.

Fidan's meeting with Peevski will also influence the election of the Chief Mufti, scheduled for 21 June (Sunday). The contest is between Ahmed Bahadar, known as Peevski's candidate, and Vedat Ahmed, the current chair of the Supreme Muslim Council. The speculation is that Bahadar will win, and this is being linked to the aid that the Muslim denomination receives from Turkey. If it goes Bahadar's way, the election will strengthen Peevski's position.

Meanwhile, the political passions surrounding him are subsiding.

“We Continue the Change” and “Democratic Bulgaria” are not proposing a cordon sanitaire around him. The governing “Progressive Bulgaria” and its leader, Prime Minister Rumen Radev, do not mention dismantling the oligarchic model, and they have developed amnesia about whose names that model bore. GERB never questioned Peevski's political legitimacy in any case; after all, together they entrenched the model begun by NDSV and BSP.

And so the circle closes. And the swimming goes on.

It will grow ever more confident after the election of a new Supreme Judicial Council, which will in turn elect a new president of the Supreme Administrative Court and a new prosecutor general. After the sittings of the parliamentary Legal Affairs Committee, it is already clear how the candidates will be filtered: there will be no extended checks of their assets, and no mixed judicial panels to hear disciplinary and personnel matters. They will not declare ownership of real estate or participation in commercial companies outside Bulgaria, nor membership in secret organizations and informal societies. In other words, there will be no information about property abroad, offshore accounts, or membership of Masonic lodges.

The controlled replacement we are witnessing is not judicial reform. It does not impose systematic vetting and reselection. All those judges and prosecutors tied to the people with nicknames (Pepi the Euro, Krasyo the Black, Martin the Notary) will go on carrying out orders with impunity and calling it “justice”.

Peevski likes this.

The Parasites

We live in a state eaten through by parasites who drape the social fabric around themselves like a cloak and who, while ostensibly working “for the people”, work mainly for themselves and for the intricate webs of dependencies that they weave and in which they themselves are entangled. An analysis by Emilia Milcheva.

Are the local revolts dangerous?

Because local structures have been splitting off from the Peevski-controlled DPS, forecasts have appeared that the end is near: “leaving the sinking ship,” and so on. And DPS under Peevski’s leadership posted its weakest result to date, 230,693 votes in the last election, and won 21 seats.

That is nearly 50,000 votes fewer than in the previous election, roughly equal to the vote from the Kardzhali district (traditionally the strongest for DPS).

But the situation is changing.

After a year on Peevski’s team, the DPS Municipal Council in the Kardzhali-region municipality of Kirkovo resigned.

Also leaving are the six village mayors in Novi Pazar and the party’s 26-member municipal leadership, including the municipal councilors, who are declaring themselves independent. There is readiness to follow the example of Kirkovo and Novi Pazar in other regions as well. So far, however, there have been no concrete steps.

Whether this is an internal crisis accelerated by the emergence of a new political player such as Radev and his formation, or a trend of political weakening, is too early to say. The local elections in 2027 are 16 months away.

To take root in power, “Progressive Bulgaria” will have to displace the monopolists in local government, GERB and DPS. This process inevitably involves swapping one camp for another, because every mayor comes with his own business clientele, and behind him stands the corresponding party structure.

The outflow of a few minor structures is not yet an avalanche. Whether an avalanche is unleashed depends on Peevski’s weight and on whether the political will emerges to dismantle the mechanisms that turn his influence into a lasting factor in politics. But it also depends on whether Bulgarian Turks, who for decades grew accustomed to thinking of politics through ethnicity, will flow into the other parties or prefer “their own.”

So the next move belongs to Rumen Radev and his majority.

Peevski’s response to the local revolts comes as a change in the party leadership and a package of legislative amendments. So as not to leave an impression of hesitation, the DPS Central Council replaced deputy chairmen Yordan Tsonev and Stanislav Anastasov, while Hamid Hamid and Bayram Bayram dropped out of the Central Operational Bureau (COB). Each of them is an emblematic figure. Tsonev has been a fixture in parliament, which he entered in 1997 from the ODS, but he continued as a loyalist of DPS and Ahmed Dogan, and with Peevski’s arrival he redirected his loyalty. Bayram and Hamid became known for their arrogant behavior.

Taking the places of those who dropped out of the COB are Ayten Sabri and Atidzhe Alieva-Veli; the leader has been promoting more women under the new government since the very start of the 52nd parliament, immediately after the weak election result.

For leader-centered parties like DPS, this is the expected first reaction to a breach. Neither Peevski nor Borisov has ever had a problem sacrificing their closest associates if they bring negatives and do not control the structures sufficiently.

When it fa-, when it faaalls…

“… I don’t want to be underneeeath, so it won’t fall on me!” Will anyone fall, and when should we step aside so that colossi and beasts don’t come crashing down on us? Analysis by Emilia Milcheva.

Because of its visible electoral weight loss, Peevski is trying to unite DPS. His MPs are submitting a package of amendments to the Civil Registration Act, the Act on the Political and Civil Rehabilitation of Repressed Persons, and the Electoral Code, touching on topics with a deep emotional and historical charge for the Turkish and Muslim community. They look like an attempt to win back wavering voters after the weakest result in DPS history.

The proposals for the Civil Registration Act are especially important. They provide for the names forcibly imposed during the so-called Revival Process to be permanently deleted from the ESGRAON registers, and for an explicit ban on civil servants requiring citizens to provide data about those names. The bill also proposes a mechanism for restoring the names of deceased Bulgarian citizens who were victims of the forced renaming. In a separate bill, DPS insists that the pension supplement for the repressed be converted into a standalone repression pension, underscoring the special status of those who suffered under the communist regime.

In parallel, DPS proposes scrapping the residency requirement for local elections and European Parliament elections, as well as removing the language restrictions for EU citizens who are not Bulgarian citizens. This is the most important change, and it is not being discussed for the first time.

If adopted, it means that for local elections the requirement of six months of address registration in a given locality in order to vote there is dropped. Scrapping the six-month period has always met the objection that the period is a barrier against so-called electoral tourism: the practice of parties registering (buying) large numbers of people at a single address just before the vote in order to manipulate the results.

But this legislative activity by Peevski is not only a restoration of historical justice; it is also a deliberate attempt to mobilize new and old voters at a moment when DPS’s influence is beginning to crack.

In the Twilight Zone

What is happening in DPS hints at the possible strategy of Rumen Radev and “Progressive Bulgaria” in the mixed regions. Speculation that Tsvetan Tsvetanov, once the Number Two in GERB, is involved in building the new formation’s party structures points to exactly such a scenario.

Thanks to him, years ago GERB managed to make a breakthrough in localities dominated by DPS. The principle was simple: it does not matter whether Ivan or Hasan heads the list; what matters is that the local leader, the activists, and the dependencies around him move over to the new political center. After them come the voters. In Bulgarian politics, electorates may appear relatively stable, but in some regions the local networks are mobile.

Two indicators will show whether the drift away from Peevski’s DPS is a process: whether regional leaders will leave, and where the disillusioned cadres will be drawn.

D. Anatomy of Power

The letter “D,” especially the capital, is becoming ever more important in our state. Shall we propose a referendum for the alphabet to begin with it? D as in dărzhava (state), D as in the design of power, D as in Delyan. By Emilia Milcheva.

Incidentally, a week before the Central Council meeting, Peevski removed Erdzhan Ebatin as DPS regional chairman in Varna over the scandal involving the large-scale illegal construction in the Baba Alino area. Under the nose of the local and executive authorities, a settlement of more than 100 buildings sprang up there in three years, some of them already inhabited. Ebatin is a longtime director of RIOSV Varna, who kept his post under a host of governments and was removed under Rumen Radev’s cabinet. His name was implicated in the issuing of permits for the project, the work of the Ukrainian Oleg Nevzorov.

In response, Ebatin pledged loyalty to Peevski in a social media post.

Let it be known: I will not betray the man who extended a hand to me several years ago and gave meaning to the work of the organization to which I have devoted my life. That man’s name is Delyan Peevski; allow me to know him better than all those who take shots at his expense. No one is perfect; only God is perfect.

What is the political alternative for these recent DPS elites, who certainly do not want to lose the privileges they enjoy? Traditionally, DPS “attaches itself” to power, and this shows in the support it gives to the “Progressive Bulgaria” majority in parliament.

In the April 19 election, Rumen Radev’s formation took second place in the Kardzhali electoral district with 18,853 votes (24.327%) and won one seat, as did “Vazrazhdane.” Thus two of the district’s five seats did not go to DPS. Although parliamentary and local elections are different, this result is a signal that Peevski will find it hard to keep his dominance in Kardzhali and in the other municipalities of the province.

The cabinet appointed Erol Hadzhiveysal, who was seventh on the “Progressive Bulgaria” list for the parliamentary vote, as deputy regional governor of Kardzhali. The appointment can hardly be seen as a mere personnel decision; it also looks like the early positioning of a possible candidate for mayor of Kardzhali, or at least of a key figure in the battle for power in the province.

For now, Peevski continues to float on the networks of influence that have kept him on the surface for years. But for the first time, attention is focused not on how much power he is gaining but on whether he is beginning to lose it.

Peevski is getting lighter.

Not enough to sink.

Palana (Part 1): Why Grab built a secure platform for autonomous AI Agents

Post Syndicated from Grab Tech original https://engineering.grab.com/palana-part-1-secure-platform-for-ai-agents

Abstract

Artificial intelligence (AI) agents are moving from experiments into everyday engineering workflows. They can read code, call application programming interfaces (APIs), run tests, create merge requests, answer Slack messages, and keep long-running state. That makes them useful, but it also changes the risk model – especially as agents get more autonomous in their use of tools. An agent with network access, credentials, tools, and memory is no longer just a chat interface. It is a workload that can act.

The more capability we give to the agents, the more valuable they get – but they also get riskier, and maintaining controls and oversight gets more challenging. We need isolated environments, with clear intentional capabilities added rather than just inheriting “everything on your laptop”.

Palana is Grab’s Kubernetes-native platform for running those workloads safely. It gives each agent an isolated namespace, persistent storage, controlled ingress, proxy-mediated egress, Vault-backed credential injection, large language model (LLM) routing, Git access controls, structured audit logs, and emergency kill switches. It is currently used to run hundreds of agents, including remote development environments, Slack automation, OpenClaw workers, Hermes agents, and other long-running internal systems.

In this post, we share why we built Palana, what it does, and how its architecture lets teams experiment with autonomous agents without giving up control over identity, secrets, network access, and operational visibility.

Introduction

The first wave of AI coding tools lived close to the user: an integrated development environment (IDE) plugin, a chat window, or a command-line assistant running on a developer’s laptop. That model is familiar and easy to adopt, but it has limits. Long-running agents need persistent state. Team workflows need shared access through Slack or web user interfaces (UIs). Security teams need to inspect what an agent is doing, and apply highly granular controls over what an agent can do. Platform teams need a way to stop, resume, update, and audit the workload.

As usage grew, we started seeing the same question in different forms:

How do we let agents do useful work inside the company without treating every new agent as a bespoke infrastructure project?

The answer was not simply to “run agents in containers”. Containers help package the runtime, but they do not answer the harder platform questions:

  • Which user does this agent act on behalf of?
  • What credentials can it use?
  • Can it see another user’s state?
  • Can it connect directly to the internet?
  • How do we inspect LLM, Git, and Hypertext Transfer Protocol (HTTP) activity after something goes wrong?
  • How do we stop an agent quickly without trusting the agent to cooperate?
  • How do we give teams a self-service experience without handing them cluster-admin access?

Palana is our answer to those questions.

What Palana is

Palana, an in-house proprietary system built by the CyberSecurity team at Grab, is a secure execution substrate for autonomous and semi-autonomous agents. The name comes from a Sanskrit root associated with protection, maintenance, and care. That maps well to the platform’s purpose: Palana is not trying to be the agent’s brain. It is the environment that contains, observes, and sustains the agent while it works.

At a high level, Palana provides:

  • A Kubernetes namespace per agent, with role-based access control (RBAC), resource quotas, network policy, and storage scoped to that agent.
  • A command-line and portal experience for creating, running, stopping, configuring, and inspecting agents.
  • Persistent /data storage so long-running agents can preserve memory, caches, repositories, and session state across restarts.
  • Browser and shell access for interactive workloads such as Claude Code UI, OpenCode, IDEs, ttyd, or Secure Shell (SSH)-backed development flows.
  • LLM access through a LiteLLM wrapper that injects per-agent GrabGPT credentials from Vault.
  • HTTP and HTTPS egress through an Envoy and ext-authz proxy path, with Open Policy Agent (OPA) policy checks and structured request logs.
  • Proxy-only secrets, where agents can reference placeholder tokens but cannot read the underlying credentials directly.
  • Git access through a bastion path so repository operations are attributable and policy-controlled.
  • Kill switches and idle shutdown so the control plane can isolate or stop workloads from outside the agent process.

This combination lets Palana support several categories of work:

  • Secure OpenClaw and agent-framework testing.
  • Cloud development environments accessible from a browser or SSH client.
  • Fast prototyping and testing for agentic workloads in a secure environment.
  • Slack-connected agents such as cts-aergia and Claude-to-Slack workflows.
  • Long-running task agents such as Hermes, Matlock, Butler, and custom team automations.
  • Higher-order systems where agentic supervisors launch or route work to scoped agents.

Why we built it

The immediate need came from security research. We wanted a place to run and investigate OpenClaw and related agent frameworks without exposing the broader internal network or placing raw credentials inside the agent runtime. That use case forced us to design for containment from the beginning.

The broader need quickly became developer productivity. Once the basic primitives existed, Palana became useful for remote coding, Slack automation, internal assistants, long-lived experiments, and agentic operational workflows. Grabbers wanted agents that could keep context over days or weeks, run from corporate infrastructure, access approved internal services, and survive laptop sleep, local dependency drift, or network changes.

The security and productivity goals reinforce each other. If the safe path is self-service and ergonomic, teams are more likely to use it. If the productive path is observable and policy-controlled by default, and the appropriate security is baked into the system automatically, platform teams do not have to retrofit controls after adoption.

Design principles

Palana’s architecture follows a few principles that shaped most of the implementation.

Isolation is the unit of trust

Each agent gets its own namespace, service account, storage, network policy, and Vault scope. Agents should not see each other’s pods, secrets, or filesystem state by default. Inter-agent communication is possible, but it goes through explicit peering rules rather than ambient pod-to-pod reachability.

This means the platform does not have to assume every agent framework has perfect multi-tenant isolation internally. A framework designed as a single-user assistant can still be hosted safely by giving each user or worker its own Palana boundary.

Credentials are never given to the agent

Traditional application hosting often gives credentials to the workload as environment variables or mounted files. That is risky for agent workloads because the agent may execute tools, run untrusted code, summarize files, install packages, or expose a web UI.

Palana separates two kinds of secrets:

  • Agent-readable secrets live under the agent’s own Vault path and are available only to that agent’s service account.
  • Proxy-only secrets are stored under a separate Vault path and are read by the proxy layer, not by the agent.

For proxy-only secrets, the agent sees a placeholder such as TOKEN_GITHUB_PAT or TOKEN_GRABGPT_API_KEY. When an outbound request travels through the proxy path, the proxy replaces the placeholder header with the real credential from Vault. The remote service receives a valid token, but the agent process never stores the token in its own environment or config.

This pattern is especially important for LLMs, source control, API integrations, and browser-like tools where prompt injection or dependency compromise could otherwise expose long-lived credentials.
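The placeholder substitution described above can be illustrated with a minimal Python sketch. This is not Palana’s implementation: the `inject_credentials` function, the in-memory `PROXY_ONLY_SECRETS` dictionary (standing in for the proxy-only Vault path), and the sample token value are all assumptions made for illustration.

```python
from typing import Dict, Mapping

# Hypothetical stand-in for the proxy-only Vault path. In a real deployment
# only the proxy layer could read these values, never the agent.
PROXY_ONLY_SECRETS: Dict[str, str] = {
    "TOKEN_GITHUB_PAT": "ghp_example_real_value",
}


def inject_credentials(
    headers: Mapping[str, str], secrets: Mapping[str, str]
) -> Dict[str, str]:
    """Return a copy of the headers with known placeholders replaced."""
    rewritten: Dict[str, str] = {}
    for name, value in headers.items():
        for placeholder, real_value in secrets.items():
            if placeholder in value:
                value = value.replace(placeholder, real_value)
        rewritten[name] = value
    return rewritten


# What the agent sends: only the placeholder ever appears in its process.
agent_headers = {
    "Authorization": "Bearer TOKEN_GITHUB_PAT",
    "Accept": "application/json",
}

# What the proxy forwards upstream after substitution.
upstream_headers = inject_credentials(agent_headers, PROXY_ONLY_SECRETS)
```

The agent’s own copy of the headers is left untouched, so a prompt-injected tool that dumps the agent’s environment or config would find only the placeholder.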

Egress is a control point

Agents can be useful only if they can call tools and services. Instead of forbidding network access, Palana makes network access observable and policy-mediated.

Agent pods receive proxy configuration automatically. External HTTP and HTTPS traffic flows through Envoy. Envoy asks ext-authz-proxy to identify the calling pod, evaluate policy with OPA, log the request, and optionally inject credentials. HTTPS traffic can be terminated by the proxy’s man-in-the-middle (MITM) listener for header inspection and replacement, with the generated certificate authority (CA) distributed to agent pods.

This gives the platform a place to answer questions that normal Kubernetes networking cannot answer alone:

  • Which agent made this request?
  • Which user owns that agent?
  • Which host and method were requested?
  • Was the request allowed or denied?
  • Which placeholder credentials were replaced?
  • Did the request go to an internal service, an LLM gateway, GitLab, or the public internet?
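As a rough illustration of how one decision point can answer these questions, the Python sketch below models an egress check that produces a structured log record. The `ALLOWED_HOSTS` allowlist is a simplified, hypothetical stand-in for an OPA policy, and the record fields, host names, and function names are assumptions rather than Palana’s actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Iterable, List, Set

# Hypothetical per-agent allowlist standing in for an OPA policy.
ALLOWED_HOSTS: Dict[str, Set[str]] = {
    "agent-a": {"gitlab.example.internal", "llm-gateway.example.internal"},
}


@dataclass
class EgressRecord:
    agent: str   # which agent made the request
    owner: str   # which user owns that agent
    host: str    # requested host
    method: str  # requested HTTP method
    allowed: bool
    replaced_placeholders: List[str] = field(default_factory=list)


def authorize(
    agent: str, owner: str, host: str, method: str, placeholders: Iterable[str]
) -> EgressRecord:
    """Evaluate the request and describe the outcome as one audit record."""
    allowed = host in ALLOWED_HOSTS.get(agent, set())
    return EgressRecord(
        agent=agent,
        owner=owner,
        host=host,
        method=method,
        allowed=allowed,
        # Credentials are only substituted on requests that are allowed.
        replaced_placeholders=sorted(placeholders) if allowed else [],
    )


def to_log_line(record: EgressRecord) -> str:
    """Serialize the record as a single structured (JSON) log line."""
    return json.dumps(asdict(record), sort_keys=True)


allowed_record = authorize(
    "agent-a", "alice", "gitlab.example.internal", "GET", ["TOKEN_GITHUB_PAT"]
)
denied_record = authorize(
    "agent-a", "alice", "example.com", "POST", ["TOKEN_GITHUB_PAT"]
)
print(to_log_line(allowed_record))
print(to_log_line(denied_record))
```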

The control plane must stay outside the agent

Palana assumes an agent might become confused, compromised, or uncooperative. Operational controls therefore live outside the agent process. The operator reconciles namespaces and policies. The proxy controls egress. The portal and pcli (Palana command-line interface) manage lifecycle. The kill switch is enforced with network policy. Idle shutdown is handled by a separate reaper CronJob.

That separation matters. A kill switch that asks the agent to stop is a feature. A kill switch that removes the agent’s network path is a safety control.
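One way to picture a kill switch enforced with network policy is the standard Kubernetes default-deny pattern: a NetworkPolicy that selects every pod in a namespace and lists no allow rules. The Python sketch below only builds such a manifest as a dictionary; the policy name and the `agent-example` namespace are hypothetical, and the post does not show what Palana’s actual policy looks like.

```python
import json
from typing import Any, Dict


def deny_all_policy(namespace: str, name: str = "kill-switch") -> Dict[str, Any]:
    """Build a default-deny NetworkPolicy manifest for one agent namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Declaring both policy types with no ingress or egress rules
            # means no traffic is allowed in either direction.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


policy = deny_all_policy("agent-example")
manifest_json = json.dumps(policy, indent=2)
print(manifest_json)
```

Because the manifest is applied by the control plane rather than by anything inside the pod, it takes effect whether or not the agent process cooperates.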

Use Kubernetes primitives where they fit

Palana is intentionally Kubernetes-native. Agents are represented by custom resources. The operator reconciles namespaces, RBAC, storage, services, ingress, and network policies. Users can interact through pcli or the portal, while platform engineers can still inspect the underlying Kubernetes objects when debugging.

This gives us a layered experience: simple workflows for users, direct primitives for advanced operators, and infrastructure-as-code for the deployed platform.

Conclusion

By centering the design around isolation, controlled egress, and proxy-mediated secrets, Palana provides a secure foundation for AI agents to operate within Grab. In Part 2, we will dive deeper into the under-the-hood architecture of Palana, exploring how it orchestrates agent lifecycles, handles LLM routing, and maintains operational visibility.

Join us

Grab is Southeast Asia’s leading superapp, serving over 900 cities across eight countries (Cambodia, Indonesia, Malaysia, Myanmar, the Philippines, Singapore, Thailand, and Vietnam). Through a single platform, millions of users access mobility, delivery, and digital financial services, including ride-hailing, food delivery, payments, lending, and digital banking via GXS Bank and GXBank. Founded in 2012, Grab’s mission is to drive Southeast Asia forward by creating economic empowerment for everyone while delivering sustainable financial performance and positive social impact.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Production-Ready Autonomous Incident Resolution with AWS DevOps Agent (now GA) and Datadog MCP Server

Post Syndicated from Nina Chen original https://aws.amazon.com/blogs/devops/production-ready-autonomous-incident-resolution-with-aws-devops-agent-now-ga-and-datadog-mcp-server/

This post was co-written with Bharadwaj Tanikella (AI/ML Product Engineering Leader) and Mohammad Jama (Product Marketing Manager) from Datadog.

In December 2025, we showed how AWS DevOps Agent and Datadog MCP Server could work together to autonomously correlate monitoring data with the infrastructure deployed and configured on AWS to resolve incidents in minutes instead of hours. Since then, Datadog MCP Server has reached general availability as the standard way for AI agents to access Datadog’s monitoring platform. Today, AWS DevOps Agent is generally available, giving teams a production-ready path to autonomous incident resolution across AWS, multicloud and on-premises environments.

What’s New: From Preview to GA

As engineering teams adopt AI-powered tools and build services that leverage AI agents, they want to extend their AI capabilities to incorporate familiar observability data and workflows. AI agents, however, often struggle with traditional API endpoints, causing them to miss the very context they need to resolve incidents effectively. Datadog MCP Server solves this by acting as a bridge between your observability data in Datadog and any AI agent that supports the Model Context Protocol (MCP). Now generally available, the MCP Server ingests prompts from users and AI agents and maps them to the corresponding Datadog resources and data. Under the hood, it handles authentication, HTTP request routing, endpoint selection, and response formatting so that agents receive highly relevant context without the brittleness of direct API calls. It supports modular toolsets so you can connect only the capabilities you need, from core observability data (logs, metrics, traces, dashboards, monitors, incidents) to specialized domains like APM trace analysis, security scanning, database monitoring, and CI/CD pipeline visibility.

Even with reliable access to observability data, incident response remains a manual, reactive process. On-call engineers must piece together the root cause of the incident from multiple data sources, draft mitigation plans, coordinate across teams, and then repeat the cycle when similar issues recur. This reactive approach does not scale as applications grow more complex and distributed.

AWS DevOps Agent changes this by introducing autonomous, always-on incident triage and investigation to your operations. It is an always-available operations teammate that resolves and proactively prevents incidents, optimizes application reliability and performance, and handles on-demand site reliability engineering (SRE) tasks across AWS, multicloud, and on-prem environments. It learns your resources and their relationships, correlates telemetry, code, and deployment data across your environment, and drives systematic improvements that prevent future incidents. The GA release also adds several capabilities that were not available during preview. It coordinates incident response automatically through channels like Slack, PagerDuty, and ServiceNow, keeping the right people informed without manual effort. It also delivers proactive prevention recommendations that address root causes before they lead to repeat incidents. In addition, DevOps Agent now supports multicloud and on-premises environments, extending its reach beyond AWS-only workloads to meet teams wherever their infrastructure runs.

With its built-in Datadog MCP Server integration, AWS DevOps Agent can pull the right Datadog context during an investigation, such as searching error logs, analyzing span-level latency, and reviewing recent deployment events. Together, these new features give engineering teams a fully integrated, production-ready workflow for autonomous incident resolution across AWS and Datadog.

Setting Up and Using AWS DevOps Agent with Datadog

In this section, we will guide you through the steps required to enable Datadog MCP Server in your AWS DevOps Agent account and configure it for incident resolution.

Pre-requisites

For this walkthrough, you should have access to and understanding of the following:

  • An AWS account
    • Agent Space role – for basic service operations
    • Agent Space web app role – for using the Agent Space web app functionality
    • (Optional) Secondary source account roles if monitoring multiple AWS accounts. Refer to the DevOps Agent user guide for the details on setting up these roles.
  • A Datadog account
  • Access to Datadog MCP Server

Setting up Datadog in the AWS DevOps Agent Console

  1. Start in the AWS DevOps Agent console by connecting your Datadog account.
  2. Navigate to Capability Providers, select the Datadog integration panel, and click the Register button.
  3. Enter Server Name, Endpoint URL, an optional Description, and click the Next button.
  4. AWS DevOps Agent validates the connection and displays a confirmation message.

Inside the AWS DevOps Agent console showing the connection for Datadog MCP Server

Figure 1: Setting up Datadog MCP Server in AWS DevOps Agent Console

Create an AWS DevOps Agent Space

Create an Agent Space in your primary AWS account to serve as the operational hub for incident investigations.

  • Open the AWS DevOps Agent console in us-east-1.
  • Choose Create Agent Space and provide a meaningful name and description.
  • Configure the required IAM role that grants AWS DevOps Agent access to your AWS resources. You can use the automated role creation process or create the role manually.
  • After your Agent Space is ready, add the Datadog MCP Server as a telemetry source to enable comprehensive incident investigation.

Figure 2: Creating an AWS DevOps Agent in Agent Space

Real-World Example: Resolving Errors

Let’s walk through how AWS DevOps Agent and Datadog work together to resolve a production incident. In this scenario, Datadog monitors detect a spike in Amazon API Gateway 5XX errors affecting downstream services.

Sample dashboard showing 5xx errors in Datadog

Figure 3: Sample 5xx errors in Datadog

Investigating errors from Incident with Datadog MCP Server and AWS DevOps Agent

When the 5xx alert triggers, AWS DevOps Agent automatically analyzes the incident using both Datadog metrics and API Gateway logs. Through the investigation chat interface, an engineer guides AWS DevOps Agent to examine the API Gateway configuration. The agent correlates API Gateway and AWS Lambda execution logs, quickly identifying error patterns.

Inside the AWS DevOps Agent Console showing what the homepage looks like

Figure 4: Investigating an incident with AWS DevOps Agent and Datadog MCP Server

Resolving the issue

AWS DevOps Agent helps identify potential misconfigurations in the Lambda and Amazon DynamoDB integration and suggests immediate fixes. The agent documents all findings and actions in an incident investigation, backed by telemetry from both Datadog and AWS services. After resolution, AWS DevOps Agent generates a detailed analysis report with specific recommendations to prevent similar incidents.

Inside the AWS DevOps Agent Console showing an investigation in progress

Figure 5: Investigation summary produced by AWS DevOps Agent

Mitigation plans

After completing the investigation, AWS DevOps Agent goes beyond identifying the root cause: it generates a detailed mitigation plan with step-by-step remediation guidance specific to the incident. Beyond immediate fixes, the plan includes longer-term prevention recommendations such as adding retry logic, implementing circuit breakers, or adjusting capacity thresholds to reduce the risk of recurrence.

This shifts the on-call experience from reactive to proactive. Instead of context-switching across multiple tools to build a remediation plan from scratch, engineers get a ready-to-execute plan they can review, refine, and route through existing change management workflows — keeping stakeholders informed as fixes are implemented. Over time, AWS DevOps Agent learns from resolved incidents across your environment, making its mitigation plans increasingly precise by recognizing patterns, referencing past resolutions, and surfacing preventive measures before similar issues repeat. AWS DevOps Agent also leverages its deep understanding of your environment, enabling you to dive deeper into your application environment, beyond just asking questions, to create, save, and share custom charts and reports.

Inside the AWS DevOps Agent console showing the results of a completed investigation

Figure 6: Mitigation plan generated by AWS DevOps Agent
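To make one of the prevention recommendations mentioned above concrete, here is a minimal Python sketch of retry logic with exponential backoff. It is a generic illustration, not output from AWS DevOps Agent; the function name, the default attempt count and delays, and the simulated `flaky_call` are assumptions.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")


# Simulated downstream dependency that fails twice, then succeeds.
attempts: List[int] = []


def flaky_call() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


# Record the delays instead of actually sleeping, to keep the demo fast.
recorded_delays: List[float] = []
result = retry_with_backoff(flaky_call, sleep=recorded_delays.append)
```

In production code, retries are usually limited to errors known to be transient, and random jitter is added to the delay so that many clients do not retry in lockstep.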

Prevention

AWS DevOps Agent can evaluate recent incidents to identify improvement opportunities that prevent future incidents and reduce Mean Time To Detection (MTTD) and Mean Time to Recovery (MTTR).

  1. Navigate to the Improvements page in the AWS DevOps Agent web app
  2. Click Run Now. Once it’s completed, it displays a personalized incident prevention recommendation, as displayed in Figure 7 below. Note: The “Run Now” button may not produce visible results immediately. Prevention analysis runs asynchronously in the background and results may take time to appear. This is expected since the feature is designed for production environments with longer incident histories.

Figure 7: Personalized incident prevention recommendation from AWS DevOps Agent
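For readers unfamiliar with the two metrics named in this section, the Python sketch below shows one common way to compute MTTD and MTTR from incident timestamps. The incident data is made up, and measuring recovery from the start of the incident (rather than from detection) is an assumption; teams define these metrics in slightly different ways.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List

# Made-up incident history for illustration only.
incidents: List[Dict[str, datetime]] = [
    {
        "started": datetime(2026, 1, 1, 10, 0),
        "detected": datetime(2026, 1, 1, 10, 5),
        "resolved": datetime(2026, 1, 1, 10, 45),
    },
    {
        "started": datetime(2026, 1, 2, 14, 0),
        "detected": datetime(2026, 1, 2, 14, 15),
        "resolved": datetime(2026, 1, 2, 15, 15),
    },
]


def mean_minutes(
    history: List[Dict[str, datetime]], start_key: str, end_key: str
) -> float:
    """Average elapsed minutes between two timestamps across all incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in history
    )


# MTTD: average time from incident start to detection.
mttd_minutes = mean_minutes(incidents, "started", "detected")
# MTTR: average time from incident start to recovery (assumed definition).
mttr_minutes = mean_minutes(incidents, "started", "resolved")
```

With the sample data, detection took 5 and 15 minutes and recovery took 45 and 75 minutes, so the averages come out to 10 and 60 minutes respectively.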

Cleanup

When you’re done using the integration, you can clean up your resources by following these steps:

  1. Delete your Agent Space from the AWS DevOps Agent console
  2. Remove the Datadog MCP Server connection from your Capability Providers
  3. Delete the IAM roles created for the Agent Space
  4. (Optional) If you created additional source account roles, remove those as well

Conclusion

With Datadog MCP Server and AWS DevOps Agent now generally available, this integration automatically correlates Datadog logs, metrics, and traces with AWS telemetry, code, and deployment data, giving teams an autonomous investigation that identifies root causes, delivers actionable mitigation plans, and recommends preventive improvements. Early adopters have seen resolution times drop from hours to minutes and deeper root cause analysis across AWS, multicloud and hybrid environments. To learn more, check out the AWS DevOps Agent.

Datadog is an AWS Specialization Partner and AWS Marketplace Seller that has been building integrations with AWS services for over a decade, amassing a growing catalog of 100+ AWS and 1000+ built-in integrations. This new AWS DevOps Agent and Datadog MCP Server integration builds upon Datadog’s strong track record of AWS partnership success. If you’re not already using Datadog, you can get started with a 14-day free trial via the AWS Marketplace.

Nina Chen

Nina Chen is a Customer Solutions Manager at AWS specializing in helping leading software companies leverage the power of the AWS cloud to accelerate their product innovation and growth. With over 4 years of experience working in the strategic Independent Software Vendor (ISV) vertical, Nina enjoys guiding ISV partners through their cloud transformation journeys, helping them optimize their cloud infrastructure, drive product innovation, and deliver exceptional customer experiences.

DhilipVenkatesh Uvarajan

DhilipVenkatesh Uvarajan is an Enterprise Support Lead TAM within AWS Enterprise Support, specializing in Independent Software Vendors (ISVs) across the United States. In this role, Dhilip provides strategic technical guidance to help customers innovate, optimize their AWS architecture, and ensure the seamless operation of their business-critical applications on the AWS cloud. Beyond his professional endeavors, Dhilip is passionate about AI and Robotics, often exploring innovative projects in his spare time.

Shashiraj (Raj) Jeripotula

Shashiraj Jeripotula (Raj) is a San Francisco-based Principal Partner Solutions Architect at AWS. He works with ISV partners to build deep integrations across observability, AI, and agentic development tooling — helping developers leverage AI agents, Model Context Protocol (MCP), and shift-left observability to build responsible, production-ready AI systems on AWS.

Sujatha Kuppuraju

Sujatha Kuppuraju is a Principal Solutions Architect at AWS, specializing in Cloud and Generative AI Security. She collaborates with software companies’ leadership teams to architect secure, scalable solutions on AWS and guide strategic product development. Leveraging her expertise in cloud architecture and emerging technologies, Sujatha helps organizations optimize offerings, maintain robust security, and bring innovative products to market in an evolving tech landscape.

Bharadwaj Tanikella

Bharadwaj Tanikella currently leads Datadog products Bits AI (Assistant), Datadog MCP Server, and Semantic Layer. His work focuses on harnessing vast datasets to foster innovation and streamline user experiences through cutting-edge analytics, machine learning, and artificial intelligence.

Mohammad Jama

Mohammad Jama is a Product Marketing Manager at Datadog. He leads go-to-market for Datadog’s AWS integrations, working closely with product, marketing, and sales to help companies observe and secure their hybrid and AWS environments.

Announcing Amazon EC2 G7 instances accelerated by NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs

Post Syndicated from Daniel Abib original https://aws.amazon.com/blogs/aws/announcing-amazon-ec2-g7-instances-accelerated-by-nvidia-rtx-pro-4500-blackwell-server-edition-gpus/

Today, we’re announcing the general availability of Amazon Elastic Compute Cloud (Amazon EC2) G7 instances, delivering high performance GPU acceleration for AI inference, graphics, and data analytics workloads.

AWS is the first major cloud provider to support NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs. G7 instances are accelerated by these GPUs with custom sixth-generation Intel Xeon Scalable processors, delivering up to 4.6x AI inference performance and up to 2.1x graphics performance compared to G6 instances. G7 instances also deliver faster performance for GPU-accelerated analytics on Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS). G7 instances are well suited for a broad range of GPU-enabled workloads including AI inference, graphics rendering, video transcoding and analytics, spatial computing, virtual desktop infrastructure (VDI), and data analytics.

Here are the improvements of G7 instances compared to the previous generation:

  • Faster GPU memory – NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs offer 1.33 times the GPU memory capacity and 2.45 times the GPU memory bandwidth compared to G6 instances. With 32 GB of GPU memory per GPU, 5th Gen Tensor Cores, and 4th Gen RT Cores, G7 instances deliver enhanced AI inference and graphics performance.
  • High performance networking and storage – G7 instances come with 700 Gbps of EFA-enabled networking throughput (7x compared to G6) enabling the low-latency, high-bandwidth connectivity that AI inference, graphics-intensive applications, and GPU-accelerated data analytics workloads need to perform at their best. G7 instances support up to 7.6 TB local NVMe SSD storage, enabling you to keep large models and datasets close to compute, reduce data transfer overhead, and improve throughput.
  • Advanced video encoding and decoding engines – Ninth-generation NVENC and sixth-generation NVDEC engines support 4:2:2 encoding and decoding for high-resolution video workflows, delivering 1.5x concurrent video streams compared to previous-generation G6 instances.
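The ratios in the list above can be turned back into the G6 baselines they imply. The following is a minimal sketch that uses only the G7 figures and multipliers stated in this post; the resulting G6 values are back-computed estimates, not official specifications.

```python
# G7 figures and "x times G6" multipliers as stated in the announcement.
g7 = {"gpu_memory_gb": 32, "network_gbps": 700}
ratio_vs_g6 = {"gpu_memory_gb": 1.33, "network_gbps": 7}

# Dividing each G7 figure by its multiplier gives the implied G6 baseline.
implied_g6 = {key: g7[key] / ratio_vs_g6[key] for key in g7}

for key, value in implied_g6.items():
    print(f"implied G6 {key}: {value:.1f}")
```

Because 1.33 is a rounded multiplier, the implied per-GPU memory comes out slightly above a round number.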

EC2 G7 instance specifications
G7 instances feature up to 8 NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs with up to 256 GB of total GPU memory (32 GB of memory per GPU) and custom Intel Xeon Scalable processors. They are also available in 7 sizes and support up to 192 vCPUs, up to 700 Gbps of network bandwidth, up to 768 GiB of system memory, and up to 7.6 TB of local NVMe SSD storage.

Here are the specs:

| Instance name | GPUs | GPU memory (GB) | vCPUs | Memory (GiB) | Storage (GB) | EBS bandwidth (Gbps) | Network bandwidth (Gbps) |
|---|---|---|---|---|---|---|---|
| g7.2xlarge | 1 | 32 | 8 | 32 | 1 x 600 | Up to 8 | Up to 60 |
| g7.4xlarge | 1 | 32 | 16 | 64 | 1 x 600 | 8 | Up to 100 |
| g7.8xlarge | 1 | 32 | 32 | 128 | 1 x 950 | 16 | Up to 100 |
| g7.12xlarge | 2 | 64 | 48 | 192 | 1 x 1900 | 20 | 175 |
| g7.24xlarge | 4 | 128 | 96 | 384 | 1 x 3800 | 40 | 350 |
| g7.48xlarge | 8 | 256 | 192 | 768 | 2 x 3800 | 80 | 700 |
| g7.metal* | 8 | 256 | 192 | 768 | 2 x 3800 | 80 | 700 |

* Coming soon
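As a quick consistency check, the specification rows can be encoded and compared against the headline figures. This is a sketch only: the `G7Size` class and its field names are made up for illustration, and it assumes the storage column is in GB, which matches the stated 7.6 TB maximum.

```python
from dataclasses import dataclass

GPU_MEMORY_GB = 32  # per-GPU memory stated in the announcement


@dataclass(frozen=True)
class G7Size:  # hypothetical helper type, not an AWS API
    name: str
    gpus: int
    gpu_memory_gb: int
    vcpus: int
    memory_gib: int
    disks: int
    disk_gb: int  # assumed unit: GB per local NVMe disk


SIZES = [
    G7Size("g7.2xlarge", 1, 32, 8, 32, 1, 600),
    G7Size("g7.4xlarge", 1, 32, 16, 64, 1, 600),
    G7Size("g7.8xlarge", 1, 32, 32, 128, 1, 950),
    G7Size("g7.12xlarge", 2, 64, 48, 192, 1, 1900),
    G7Size("g7.24xlarge", 4, 128, 96, 384, 1, 3800),
    G7Size("g7.48xlarge", 8, 256, 192, 768, 2, 3800),
    G7Size("g7.metal", 8, 256, 192, 768, 2, 3800),
]


def is_consistent(size: G7Size) -> bool:
    """GPU memory should be 32 GB per GPU; every row has 4 GiB per vCPU."""
    return (
        size.gpu_memory_gb == size.gpus * GPU_MEMORY_GB
        and size.memory_gib == size.vcpus * 4
    )


largest = max(SIZES, key=lambda s: s.gpus)
largest_storage_tb = largest.disks * largest.disk_gb / 1000

print(all(is_consistent(s) for s in SIZES))
print(f"{largest.name}: {largest_storage_tb} TB local storage")
```

Every row keeps the same 4 GiB of system memory per vCPU, so the sizes differ mainly in GPU count, storage, and bandwidth.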

G7 instances support NVIDIA GPUDirect P2P for multi-GPU sizes, NVIDIA GPUDirect RDMA with EFA, and GPUDirect RDMA with EFA for Amazon FSx for Lustre, enabling low-latency GPU-to-GPU communication for multi-GPU and multi-node workloads.

To get started with G7 instances, you can use the AWS Deep Learning AMIs (DLAMI) or NVIDIA Workstation AMIs with prepackaged GPU drivers for your AI inference and graphics workloads. To use G7 instances with Amazon EKS, build EKS AMIs with NVIDIA driver version R595 with EKS-provided automation. G7 instances support multiple operating systems including Amazon Linux, Ubuntu, RHEL, and Windows Server, with comprehensive NVIDIA driver integration providing compatibility with industry-standard graphics libraries including DirectX, Vulkan, and OpenGL.

Get started today
You can start using Amazon EC2 G7 instances today in two AWS Regions: US East (Ohio) and US West (Oregon). To check future Regional expansion plans, look up the instance type in the CloudFormation resources tab on the AWS Capabilities by Region page.

G7 instances are offered through multiple purchasing options, including On-Demand, Savings Plans, and Spot Instances. Dedicated Instances are also supported for the 12xlarge, 24xlarge, and 48xlarge sizes. For detailed pricing, visit the Amazon EC2 Pricing page.

Ready to get started? Launch G7 instances from the Amazon EC2 console. For more details, head over to the Amazon EC2 G7 instances page. We’d love to hear your feedback. Share it on AWS re:Post for EC2 or reach out through your usual AWS Support contacts.

– Daniel Abib

Amazon ECS introduces new high-resolution metrics for faster service auto scaling

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/amazon-ecs-introduces-new-high-resolution-metrics-for-faster-service-auto-scaling/

Amazon Elastic Container Service (Amazon ECS) service auto scaling automatically adjusts task counts to meet workload demand with comprehensive scaling policies, including predictive scaling for recurring traffic patterns, scheduled scaling for planned events, and target tracking to scale dynamically on real-time metrics.

You can choose proactive scaling by using predictive scaling (automatic) and scheduled scaling (customer-defined), or reactive scaling by using target tracking with just a target to scale on. Amazon ECS service auto scaling adjusts the number of tasks in an ECS service based on Amazon CloudWatch metrics, such as average CPU/Memory usage, request count per target, a custom metric such as queue depth, or demand surges by using advanced machine learning (ML) algorithms.

With today’s launch, Amazon ECS service auto scaling now detects and responds to load changes faster with support for high-resolution (20-second) metrics and metric publishing optimizations. In AWS benchmarking tests, time to trigger scale-out improved from 363 seconds to 86 seconds (76% faster, 4.2x), and total time to scale and provision new tasks improved from 386 seconds to 109 seconds (72% faster, 3.5x).
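The percentages and multipliers quoted for the benchmark follow directly from the before and after timings. Here is a small sketch of that arithmetic; the `improvement` helper is an illustrative name, not part of any AWS tool.

```python
def improvement(before_s: float, after_s: float) -> tuple[float, float]:
    """Return (percent reduction in elapsed time, speedup factor)."""
    reduction_pct = (before_s - after_s) / before_s * 100
    speedup = before_s / after_s
    return reduction_pct, speedup


benchmarks = [
    ("time to trigger scale-out", 363, 86),
    ("time to scale and provision", 386, 109),
]

for label, before, after in benchmarks:
    pct, factor = improvement(before, after)
    print(f"{label}: {pct:.0f}% faster, {factor:.1f}x")
```

Both measurements shrink by the same 277 seconds, and the gap between trigger and total time is 23 seconds in both cases. That suggests the gain comes from faster detection rather than faster task provisioning.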

This launch delivers three key benefits for your applications:

  • Improved performance and reliability: Faster scaling means your application responds more quickly to demand surges, reducing latency and failures for end users.
  • Right-size without compromise: Depending on the workload, you can reduce baseline task counts because scale-out now happens fast enough to handle traffic spikes without preemptive capacity padding. This directly reduces compute costs while maintaining application performance and availability.
  • Simpler scaling configuration: Target tracking with high-resolution metrics delivers the aggressive scaling behavior that previously required custom scaling configurations, such as usage of step-scaling policies. One configuration change replaces custom engineering work.
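To make the right-sizing benefit concrete, the sketch below estimates how much spare capacity a service must hold while it waits for new tasks. It rests on loudly stated assumptions: demand grows linearly, the 10% per minute growth rate and the 20-task steady state are hypothetical, and the only figures taken from this post are the 386-second and 109-second scaling times.

```python
import math


def baseline_tasks(steady_tasks: int, growth_per_min: float, time_to_scale_s: float) -> int:
    """Tasks needed to absorb linear demand growth until new tasks arrive.

    Simplified model (an assumption, not AWS's algorithm): demand grows by
    `growth_per_min` (a fraction of steady-state demand) every minute, and
    no new capacity exists until `time_to_scale_s` seconds have passed.
    """
    headroom = growth_per_min * time_to_scale_s / 60
    return math.ceil(steady_tasks * (1 + headroom))


STEADY_TASKS = 20      # hypothetical steady-state task count
GROWTH_PER_MIN = 0.10  # hypothetical surge: +10% of demand per minute

before = baseline_tasks(STEADY_TASKS, GROWTH_PER_MIN, 386)  # standard resolution
after = baseline_tasks(STEADY_TASKS, GROWTH_PER_MIN, 109)   # high resolution

print(f"baseline with 386 s to scale: {before} tasks")
print(f"baseline with 109 s to scale: {after} tasks")
```

Under these assumptions the padded baseline drops by roughly a quarter. Real savings depend on how quickly your traffic actually ramps.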

How it works
To use ECS faster service auto scaling, first enable high-resolution metrics for your ECS service, and then configure a target tracking scaling policy that uses high-resolution metrics. ECS faster service auto scaling works across all compute options on ECS: AWS Fargate, ECS Managed Instances, and Amazon Elastic Compute Cloud (Amazon EC2). You can enable these metrics when you create or update your ECS service in the Amazon ECS console, or by using AWS SDKs and tools or AWS CloudFormation.

When you create a service in the console, add 20-second resolution metrics in the Monitoring configuration section. These metrics incur additional CloudWatch costs, while the standard resolution (60-second) is free.

In the Service auto scaling section, check Use service auto scaling and choose Target Tracking for the scaling policy type to use real-time data to scale the number of tasks that your service runs based on demand.

Then, choose a Scaling policy type for the target tracking. You can select ECSServiceAverageCPUUtilizationHighResolution or ECSServiceAverageMemoryUtilizationHighResolution as new metrics.

That’s it – your ECS service will use high resolution metrics for auto scaling.

To update an existing ECS service to use faster auto scaling, you first need to configure high-resolution metrics via Update Service. Once the deployment completes, your service will generate high-resolution metrics. You can then go to the Service and auto scaling tab from your service details to update the scaling policy to use high-resolution metrics.

That’s all you need. Your ECS service now evaluates scaling decisions at 20-second intervals.

You can also use the AWS Command Line Interface (AWS CLI) to enable new metrics in your ECS service through Application Auto Scaling. To learn more, visit the faster auto scaling documentation.
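For the programmatic route, a target tracking policy might be described along these lines. This sketch only builds the request parameters in the shape that the Application Auto Scaling `PutScalingPolicy` operation expects, and makes no AWS call. The two metric names come from this post, but passing them as `PredefinedMetricType` is an assumption to verify against the documentation. The helper name, cluster name, service name, and target value are hypothetical.

```python
HIGH_RES_METRICS = {
    "cpu": "ECSServiceAverageCPUUtilizationHighResolution",
    "memory": "ECSServiceAverageMemoryUtilizationHighResolution",
}


def target_tracking_policy(cluster: str, service: str,
                           metric: str = "cpu", target_value: float = 60.0) -> dict:
    """Build PutScalingPolicy parameters for an ECS service (no AWS call)."""
    return {
        "PolicyName": f"{service}-{metric}-high-res",  # hypothetical naming scheme
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_value,
            "PredefinedMetricSpecification": {
                # Assumption: the high-resolution names are accepted here.
                "PredefinedMetricType": HIGH_RES_METRICS[metric],
            },
        },
    }


params = target_tracking_policy("demo-cluster", "web", metric="cpu", target_value=60.0)
print(params["ResourceId"])
print(params["TargetTrackingScalingPolicyConfiguration"])
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("application-autoscaling").put_scaling_policy(**params)` once the service is registered as a scalable target. Enabling high-resolution metrics on the service itself is a separate step that is not shown here.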

Now available
Faster service autoscaling with high-resolution metrics for Amazon ECS is available today. The feature itself has no additional cost, but high-resolution CloudWatch metrics introduce a new pricing dimension. For details, see the CloudWatch pricing page.

Give it a try today and send feedback to AWS re:Post for ECS or through your usual AWS Support contacts.

Channy
