AI and Civil Service Purges

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2025/02/ai-and-civil-service-purges.html

Donald Trump and Elon Musk’s chaotic approach to reform is upending government operations. Critical functions have been halted, tens of thousands of federal staffers are being encouraged to resign, and congressional mandates are being disregarded. The next phase: The Department of Government Efficiency reportedly wants to use AI to cut costs. According to The Washington Post, Musk’s group has started to run sensitive data from government systems through AI programs to analyze spending and determine what could be pruned. This may lead to the elimination of human jobs in favor of automation. As one government official who has been tracking Musk’s DOGE team told the Post, the ultimate aim is to use AI to replace “the human workforce with machines.” (Spokespeople for the White House and DOGE did not respond to requests for comment.)

Using AI to make government more efficient is a worthy pursuit, and this is not a new idea. The Biden administration disclosed more than 2,000 AI applications in development across the federal government. For example, FEMA has started using AI to help perform damage assessment in disaster areas. The Centers for Medicare and Medicaid Services has started using AI to look for fraudulent billing. The idea of replacing dedicated and principled civil servants with AI agents, however, is new—and complicated.

The civil service—the massive cadre of employees who operate government agencies—plays a vital role in translating laws and policy into the operation of society. New presidents can issue sweeping executive orders, but they often have no real effect until they actually change the behavior of public servants. Whether you think of these people as essential and inspiring do-gooders, boring bureaucratic functionaries, or as agents of a “deep state,” their sheer number and continuity act as ballast that resists institutional change.

This is why Trump and Musk’s actions are so significant. The more AI decision making is integrated into government, the easier change will be. If human workers are widely replaced with AI, executives will have unilateral authority to instantaneously alter the behavior of the government, profoundly raising the stakes for transitions of power in democracy. Trump’s unprecedented purge of the civil service might be the last time a president needs to replace the human beings in government in order to dictate its new functions. Future leaders may do so at the press of a button.

To be clear, the use of AI by the executive branch doesn’t have to be disastrous. In theory, it could allow new leadership to swiftly implement the wishes of its electorate. But this could go very badly in the hands of an authoritarian leader. AI systems concentrate power at the top, so they could allow an executive to effectuate change over sprawling bureaucracies instantaneously. Firing and replacing tens of thousands of human bureaucrats is a huge undertaking. Swapping one AI out for another, or modifying the rules that those AIs operate by, would be much simpler.

Social-welfare programs, if automated with AI, could be redirected to systematically benefit one group and disadvantage another with a single prompt change. Immigration-enforcement agencies could prioritize people for investigation and detainment with one instruction. Regulatory-enforcement agencies that monitor corporate behavior for malfeasance could turn their attention to, or away from, any given company on a whim.

Even if Congress were motivated to fight back against Trump and Musk, or against a future president seeking to bulldoze the will of the legislature, the absolute power to command AI agents would make it easier to subvert legislative intent. AI has the power to diminish representative politics. Written law is never fully determinative of the actions of government—there is always wiggle room for presidents, appointed leaders, and civil servants to exercise their own judgment. Whether intentional or not, whether charitably or not, each of these actors uses discretion. In human systems, that discretion is widely distributed across many individuals—people who, in the case of career civil servants, usually outlast presidencies.

Today, the AI ecosystem is dominated by a small number of corporations that decide how the most widely used AI models are designed, which data they are trained on, and which instructions they follow. Because their work is largely secretive and unaccountable to public interest, these tech companies are capable of making changes to the bias of AI systems—either generally or with aim at specific governmental use cases—that are invisible to the rest of us. And these private actors are both vulnerable to coercion by political leaders and self-interested in appealing to their favor. Musk himself created and funded xAI, now one of the world’s largest AI labs, with an explicitly ideological mandate to generate anti-“woke” AI and steer the wider AI industry in a similar direction.

But there’s a second way that AI’s transformation of government could go. AI development could happen inside of transparent and accountable public institutions, alongside its continued development by Big Tech. Applications of AI in democratic governments could be focused on benefitting public servants and the communities they serve by, for example, making it easier for non-English speakers to access government services, making ministerial tasks such as processing routine applications more efficient and reducing backlogs, or helping constituents weigh in on the policies deliberated by their representatives. Such AI integrations should be done gradually and carefully, with public oversight for their design and implementation and monitoring and guardrails to avoid unacceptable bias and harm.

Governments around the world are demonstrating how this could be done, though it’s early days. Taiwan has pioneered the use of AI models to facilitate deliberative democracy at an unprecedented scale. Singapore has been a leader in the development of public AI models, built transparently and with public-service use cases in mind. Canada has illustrated the role of disclosure and public input on the consideration of AI use cases in government. Even if you do not trust the current White House to follow any of these examples, U.S. states—which have much greater contact and influence over the daily lives of Americans than the federal government—could lead the way on this kind of responsible development and deployment of AI.

As the political theorist David Runciman has written, AI is just another in a long line of artificial “machines” used to govern how people live and act, not unlike corporations and states before it. AI doesn’t replace those older institutions, but it changes how they function. As the Trump administration forges stronger ties to Big Tech and AI developers, we need to recognize the potential of that partnership to steer the future of democratic governance—and act to make sure that it does not enable future authoritarians.

This essay was written with Nathan E. Sanders, and originally appeared in The Atlantic.

When Will Rumen Radev Burst Onto the Scene

Post Syndicated from Emilia Milcheva (Емилия Милчева) original https://www.toest.bg/koga-shte-izbuhne-rumen-radev/

Some questions are hit songs that stay a little longer on the stage of political show business. "When will Rumen Radev burst onto the scene (with a party)?" is one of them. The question has been circulating with greater or lesser force since the start of Radev's second term, which coincided with the political crisis and allowed him to stand out in the armor of people's protector, protagonist, and pillar of the nation.

The president, sometimes skillfully and sometimes more crudely when he imposes his interests, balances between the roles of arbiter and active player, using instability as the backdrop for his political solo. The question of a party venture of his own is a refrain that never fully disappears; it only fades, to return with a new arrangement at the next political tremor. And the tremors have become more frequent.

Usually, when asked whether he will debut with his own political project, Radev replies that he will say so himself when he decides, but he does not say "no." His answers are as convoluted as those of an ancient prophetess. He enjoys keeping up the tension, and in the television studios there is always some talking head or other who fires up the audience about how incompetent and inarticulate the politicians are, and how "the key lies with the president, and without a party of his we will keep watching the same thing."

Is the Rumen Radev Alternative Inevitable

The head of state's signals are easy to categorize:

"I am a president, not a party leader." Rumen Radev likes to emphasize that he stands ABOVE parties, and that his role is to be head of state, not to get involved in party squabbles. (Not that he doesn't: although his constitutional powers to appoint caretaker cabinets were curtailed, no such cabinet can exist without his consent, as the case of Goritsa Kozhareva showed.)

"People expect a new alternative; its emergence is inevitable." Although he does not clearly declare personal ambitions, he has repeatedly hinted that a "different type of politics" is needed, which leads analysts to see in this an indirect preparation of the ground for a future political project.

"Bulgaria needs change." This leitmotif has run through his public appearances since his first term, when Radev positioned himself as a critic of the status quo: whatever it is, he is against it. He has not been critical of the current government, but not even a month has passed since it was elected. Perhaps this is because, in statements before the cabinet was constituted, he insisted that a regular government was necessary for the sake of stability. Or because at least two of the four formations in the governing majority, BSP (after the removal of Korneliya Ninova) and Ima Takav Narod (There Is Such a People), sympathize with the president.

With his five caretaker governments in eight years, one could say that Rumen Radev has also been prime minister five times over: without a party, but with a stable approval rating, his own apparatus of former ministers, and even lobbyist contracts such as the one with the Turkish state company Botaş, which primarily serves Ankara's energy interests.

"A time for answers will come…" This is the ellipsis that does not close the topic; on the contrary, it leaves all options open. Some party projects have been attributed to presidential involvement, for example Bulgarian Rise, the party of one of his caretaker prime ministers, Stefan Yanev, but also We Continue the Change, since its leaders Kiril Petkov and Asen Vasilev were likewise invited to serve as caretaker ministers by the president, who provided a stage for their political ambitions.

A Sage in the Shadows? No, Thanks

What options does Rumen Radev have? He could choose the role of the "sage in the shadows," as Zhelyu Zhelev did when he set up the foundation bearing his name and founded the Balkan Political Club. Or he could create his own political project and actively enter the game with all the attendant risks, which another two-term president, Georgi Parvanov, experienced when he registered ABV.

It seems unlikely that Radev will set up a foundation and give lectures here and there. Over these ten years at Dondukov 2, which will be completed next year, power has come to fit him like a uniform tunic, and there are no indications that he wants to leave the stage. Politics made him famous, but he lost that natural and likable awkwardness typical of newcomers, which people nevertheless liked in 2017, along with his profession and his background, with which many Bulgarians identified.

At present there are quite a few figures around Radev: advisers, others in high positions in various institutions, and also people in the executive and local government, who have gained influence and weight thanks to their presence in the presidential orbit. They would not support his voluntary withdrawal, because their own well-being would be put at risk.

In Step with the New Status Quo

The moment for Radev to "burst onto the scene" before the presidential election in the autumn of 2026 is favorable, if an opportunity opens up with early parliamentary elections. Following the election of Donald Trump as president of the United States, a new security status quo is beginning to take shape in the world, and it does not bypass Europe.

In Bulgarian political reality, the term "Euro-Atlantic" dropped out of use some time ago. Trump is speaking personally with Russian President Vladimir Putin to negotiate peace in Ukraine, but for now Europe and even Ukraine itself appear to be excluded from this dialogue. Radev, whose pro-Russian sympathies are no secret, has found himself in step with the new trend set by the American president and will allow himself to deviate from the traditional Euro-Atlantic framework with impunity. He is also the only head of state in Bulgaria's most recent history whose foreign-policy positions have contradicted the official positions of an EU and NATO member state, and he has displayed them at high-level forums.

The possibility of early parliamentary elections could give him a chance to gather more support, especially among voters who feel distant from the traditional Western alliances and seek an alternative in a policy of neutrality or even rapprochement with Russia. That would mean a strategic rethinking of Bulgaria's foreign-policy orientation.

The left looks like natural terrain for him, but it is doubtful whether he could build a modern left-wing party, different from BSP, of the kind that Bulgarian politics genuinely needs.

If Radev rises with a party project of his own, he will threaten GERB's primacy. But the two could also come to terms, and the remaining political forces would be marginalized altogether. Their chance is to make effective use of the time until the next presidential election to position themselves as well as possible, including with their candidates for president and vice president. Where is Boyko Borisov in this tangle of hypotheses: at Dondukov 2 or in Bankya?

Radev is on the threshold of a decision: to leave power or to find a new path to it.

The time for his choice is approaching.

The ATS Group and a Regional Telecom Provider

Post Syndicated from Michael Kammer original https://blog.zabbix.com/the-ats-group-and-a-regional-telecom-provider/29671/

Our Premium Partners at the ATS Group have a regional telecom provider on the West Coast of the United States as one of their key clients. The provider covers a massive geographical area on a limited budget and serves thousands of (primarily rural) customers.

The Challenge

After recent price hikes by the “big-box” monitoring solutions, the provider needed an alternative with a more stable pricing model. Simply put, their budget was shrinking, but their software monitoring costs were expanding.

The provider had a large stock of non-traditional IT equipment that all needed to be monitored effectively, and they also had only one month to get all monitored devices and endpoints over to a new solution.

On top of that, many of the provider’s legacy systems were directly related to regulatory compliance and therefore needed to be operational from day one.

The Solution

The provider set about migrating to a complete and robust Zabbix 7.0 solution that would eliminate any foreseeable issues – even the loss of an entire data center.

There were a few initial hiccups in the implementation when it came to getting PostgreSQL set up with database proxies, but the ATS Group team quickly arrived at an architecture that the provider was happy with. The clear and easy-to-follow Zabbix documentation was of particular help.

The Results

The new Zabbix solution, as implemented, was able to monitor a number of things that had previously been challenging, including:

• Doors. The provider badly needed a solution for monitoring doors, including entrance and exit doors as well as cabinet doors in data centers. They had long-term compliance issues with doors sticking open, employees forgetting to close doors, etc. Zabbix made it easy to develop custom SNMP traps that send alerts in case of open doors, solving the issue.

• Weather. The provider’s services are available over a large and varied geographical area that encompasses multiple states. The ability of Zabbix to predict weather changes across this area has been an important added bonus, with the provider now being able to get future weather alerts that can be used to compare against equipment tolerance levels. Personnel can then be sent to affected areas in anticipation of weather events, instead of being purely reactive.

• SLAs. The provider functions as an ISP that provides internet access to customers in rural areas, many of whom may not have other means of accessing the world around them. As such, they not only feel a strong sense of duty to provide consistent uptime, but they are bound by a strict set of service level agreements (SLAs). With Zabbix, it’s possible to provide SLAs for some of the remote edge equipment involved by building an integration with ServiceNow.

In Conclusion

The telecom provider in question trusts Zabbix to guarantee rural broadband access for thousands of customers over an enormous geographic area. Zabbix not only gets the job done more effectively than other monitoring solutions, it does so at a fraction of the cost.

The post The ATS Group and a Regional Telecom Provider appeared first on Zabbix Blog.

Encryption for Everybody

Post Syndicated from Let's Encrypt original https://letsencrypt.org/2025/02/14/encryption-for-everybody.html

2025 marks ten years of Let’s Encrypt. Already this year we’ve taken steps to continue to deliver on our values of user privacy, efficiency, and innovation, all with the intent of continuing to deliver free TLS certificates to as many people as possible; to deliver encryption for everybody.

And while we’re excited about the technical progress we’ll make this year, we’re also going to celebrate this tenth anniversary by highlighting the people around the world who make our impact possible. It’s no small village.

From a community forum that has provided free technical support, to our roster of sponsors who provide vital funding, to the thousands of individual supporters who contribute financially to Let’s Encrypt each year, free TLS at Internet scale works because people have supported it year in, year out, for ten years.

Each month we’ll highlight a different set of people behind our “everybody.” Who do you want to see us highlight? What use cases of Let’s Encrypt have you seen that amazed you? What about our work do you hope we’ll continue or improve as we go forward? Let us know on LinkedIn, or drop a note to [email protected].

Encryption for Everybody is our unofficial tagline for this tenth anniversary year. What we love about it is that, yes, it captures our commitment to ensuring anyone around the world can easily get a cert for free. But more importantly, it captures the reality that technical innovation won’t work without people believing in it and supporting it. We’re grateful that, for ten years (and counting!), our community of supporters has made an impact on the lives of billions of Internet users, an impact that’s made the Web more secure and privacy respecting for everybody, everywhere.

Internet Security Research Group (ISRG) is the parent organization of Let’s Encrypt, Prossimo, and Divvi Up. ISRG is a 501(c)(3) nonprofit. If you’d like to support our work, please consider getting involved, donating, or encouraging your company to become a sponsor.

AWS CloudTrail network activity events for VPC endpoints now generally available

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/aws-cloudtrail-network-activity-events-for-vpc-endpoints-now-generally-available/

Today, I’m happy to announce the general availability of network activity events for Amazon Virtual Private Cloud (Amazon VPC) endpoints in AWS CloudTrail. This feature helps you to record and monitor AWS API activity traversing your VPC endpoints, helping you strengthen your data perimeter and implement better detective controls.

Previously, it was hard to detect potential data exfiltration attempts and unauthorized access to the resources within your network through VPC endpoints. While VPC endpoint policies could be configured to prevent access from external accounts, there was no built-in mechanism to log denied actions or detect when external credentials were used at a VPC endpoint. This often required you to build custom solutions to inspect and analyze TLS traffic, which could be operationally costly and negate the benefits of encrypted communications.

With this new capability, you can now opt in to log all AWS API activity passing through your VPC endpoints. CloudTrail records these events as a new event type called network activity events, which capture both control plane and data plane actions passing through a VPC endpoint.

Network activity events in CloudTrail provide several key benefits:

  • Comprehensive visibility – Log all API activity traversing VPC endpoints, regardless of the AWS account initiating the action.
  • External credential detection – Identify when credentials from outside your organization are accessing your VPC endpoint.
  • Data exfiltration prevention – Detect and investigate potential unauthorized data movement attempts.
  • Enhanced security monitoring – Gain insights into all AWS API activity at your VPC endpoints without the need to decrypt TLS traffic.
  • Visibility for regulatory compliance – Improve your ability to meet regulatory requirements by tracking all API activity passing through.

Getting started with network activity events for VPC endpoint logging
To enable network activity events, I go to the AWS CloudTrail console and choose Trails in the navigation pane. I choose Create trail to create a new one. I enter a name in the Trail name field and choose an Amazon Simple Storage Service (Amazon S3) bucket to store the event logs. When I create a trail in CloudTrail, I can specify an existing Amazon S3 bucket or create a new bucket to store my trail’s event logs.

If you set Log file SSE-KMS encryption to Enabled, you have two options: Choose New to create a new AWS Key Management Service (AWS KMS) key or choose Existing to choose an existing KMS key. If you chose New, you need to type an alias in the AWS KMS alias field. CloudTrail encrypts your log files with this KMS key and adds the policy for you. The KMS key and Amazon S3 must be in the same AWS Region. For this example, I use an existing KMS key. I enter the alias in the AWS KMS alias field and leave the rest as default for this demo. I choose Next for the next step.

In the Choose log events step, I choose Network activity events under Events. I choose the event source from the list of AWS services, such as cloudtrail.amazonaws.com, ec2.amazonaws.com, kms.amazonaws.com, s3.amazonaws.com, and secretsmanager.amazonaws.com. I add two network activity event sources for this demo. For the first source, I select the ec2.amazonaws.com option. For Log selector template, I can use templates for common use cases or create fine-grained filters for specific scenarios. For example, to log all API activities traversing the VPC endpoint, I can choose the Log all events template. For this demo, I choose the Log network activity access denied events template to log only access denied events. Optionally, I can enter a name in the Selector name field to identify the log selector template, such as Include network activity events for Amazon EC2.

As a second example, I choose Custom to create custom filters on multiple fields, such as eventName and vpcEndpointId. I can specify specific VPC endpoint IDs or filter the results to include only the VPC endpoints that match specific criteria. For Advanced event selectors, I choose vpcEndpointId from the Field dropdown, choose equals as Operator, and enter the VPC endpoint ID. When I expand the JSON view, I can see my event selectors as a JSON block. I choose Next and after reviewing the selections, I choose Create trail.
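The JSON block mentioned above can also be assembled in code. The following is a minimal sketch, not the console's exact output: the helper name `network_activity_selector`, the second event source, and the placeholder endpoint ID are assumptions made for illustration, and the field values `NetworkActivity` and `VpceAccessDenied` should be checked against the CloudTrail documentation before use. A list like this is the kind of value that the `AdvancedEventSelectors` parameter of the CloudTrail `PutEventSelectors` API expects.

```python
import json


def network_activity_selector(event_source, name, vpc_endpoint_id=None, denied_only=False):
    """Build one advanced event selector for network activity events.

    Hypothetical helper for illustration; it only builds a dictionary and
    makes no AWS calls.
    """
    field_selectors = [
        # Network activity events are selected by their event category.
        {"Field": "eventCategory", "Equals": ["NetworkActivity"]},
        {"Field": "eventSource", "Equals": [event_source]},
    ]
    if vpc_endpoint_id is not None:
        # Custom filter: restrict logging to a single VPC endpoint.
        field_selectors.append({"Field": "vpcEndpointId", "Equals": [vpc_endpoint_id]})
    if denied_only:
        # Mirrors the "access denied events" template (assumed error code value).
        field_selectors.append({"Field": "errorCode", "Equals": ["VpceAccessDenied"]})
    return {"Name": name, "FieldSelectors": field_selectors}


selectors = [
    network_activity_selector(
        "ec2.amazonaws.com",
        "Include network activity events for Amazon EC2",
        denied_only=True,
    ),
    # The second source and the endpoint ID below are placeholders.
    network_activity_selector(
        "kms.amazonaws.com",
        "Custom selector for one VPC endpoint",
        vpc_endpoint_id="vpce-EXAMPLE",
    ),
]

print(json.dumps(selectors, indent=2))
```

Keeping the selectors as plain data makes them easy to review or store in version control before they are applied to a trail.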

After it’s configured, CloudTrail will begin logging network activity events for my VPC endpoints, helping me analyze and act on this data. To analyze AWS CloudTrail network activity events, you can use the CloudTrail console, AWS Command Line Interface (AWS CLI), and AWS SDK to retrieve relevant logs. You can also use CloudTrail Lake to capture, store and analyze your network activity events. If you are using Trails, you can use Amazon Athena to query and filter these events based on specific criteria. Regular analysis of these events can help you maintain security, comply with regulations, and optimize your network infrastructure in AWS.
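As a rough illustration of that kind of analysis, the sketch below filters a CloudTrail log file for denied network activity events and counts them per VPC endpoint. The sample records are invented for the example, and the field names (`eventCategory`, `errorCode`, `vpcEndpointId`) and the value `VpceAccessDenied` are assumptions about the record format that should be verified against real log files; only the top-level `Records` array is relied on as standard CloudTrail log structure.

```python
import json
from collections import Counter

# Invented sample data standing in for one CloudTrail log file.
SAMPLE_LOG = json.dumps({
    "Records": [
        {"eventCategory": "NetworkActivity", "eventSource": "ec2.amazonaws.com",
         "eventName": "DescribeInstances", "errorCode": "VpceAccessDenied",
         "vpcEndpointId": "vpce-EXAMPLE1"},
        {"eventCategory": "NetworkActivity", "eventSource": "s3.amazonaws.com",
         "eventName": "ListBuckets", "vpcEndpointId": "vpce-EXAMPLE2"},
        {"eventCategory": "Management", "eventSource": "ec2.amazonaws.com",
         "eventName": "RunInstances"},
    ]
})


def denied_network_events(log_text):
    """Return network activity records that were denied at a VPC endpoint."""
    records = json.loads(log_text).get("Records", [])
    return [
        record for record in records
        if record.get("eventCategory") == "NetworkActivity"
        and record.get("errorCode") == "VpceAccessDenied"
    ]


denied = denied_network_events(SAMPLE_LOG)
# Count denied events per endpoint to see where denials concentrate.
denied_per_endpoint = Counter(record.get("vpcEndpointId") for record in denied)

for endpoint_id, count in denied_per_endpoint.items():
    print(f"{endpoint_id}: {count} denied event(s)")
```

The same filter could be expressed as an Athena query over the trail's S3 bucket; the Python version is only meant to show which fields carry the signal.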

Now available
CloudTrail network activity events for VPC endpoint logging provide you with a powerful tool to enhance your security posture, detect potential threats, and gain deeper insights into your VPC network traffic. This feature addresses your critical needs for comprehensive visibility and control over your AWS environments.

Network activity events for VPC endpoints are available in all commercial AWS Regions.

For pricing information, visit AWS CloudTrail pricing.

To get started with CloudTrail network activity events, visit AWS CloudTrail. For more information on CloudTrail and its features, refer to the AWS CloudTrail documentation.

— Esra

Migrate from Standard brokers to Express brokers in Amazon MSK using Amazon MSK Replicator

Post Syndicated from Subham Rakshit original https://aws.amazon.com/blogs/big-data/migrate-from-standard-brokers-to-express-brokers-in-amazon-msk-using-amazon-msk-replicator/

Amazon Managed Streaming for Apache Kafka (Amazon MSK) now offers a new broker type called Express brokers. It’s designed to deliver up to 3 times more throughput per broker, scale up to 20 times faster, and reduce recovery time by 90% compared to Standard brokers running Apache Kafka. Express brokers come preconfigured with Kafka best practices by default, support Kafka APIs, and provide the same low latency performance that Amazon MSK customers expect, so you can continue using existing client applications without any changes. Express brokers provide straightforward operations with hands-free storage management by offering unlimited storage without pre-provisioning, eliminating disk-related bottlenecks. To learn more about Express brokers, refer to Introducing Express brokers for Amazon MSK to deliver high throughput and faster scaling for your Kafka clusters.

Creating a new cluster with Express brokers is straightforward, as described in Amazon MSK Express brokers. However, if you have an existing MSK cluster, you need to migrate to a new Express-based cluster. In this post, we discuss how you should plan and perform the migration to Express brokers for your existing MSK workloads on Standard brokers. Express brokers offer a different user experience and a different shared responsibility boundary, so using them on an existing cluster is not possible. However, you can use Amazon MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster comprising Express brokers.

MSK Replicator offers a built-in replication capability to seamlessly replicate data from one cluster to another. It automatically scales the underlying resources, so you can replicate data on demand without having to monitor or scale capacity. MSK Replicator also replicates Kafka metadata, including topic configurations, access control lists (ACLs), and consumer group offsets.

In the following sections, we discuss how to use MSK Replicator to replicate the data from a Standard broker MSK cluster to an Express broker MSK cluster and the steps involved in migrating the client applications from the old cluster to the new cluster.

Planning your migration

Migrating from Standard brokers to Express brokers requires thorough planning and careful consideration of various factors. In this section, we discuss key aspects to address during the planning phase.

Assessing the source cluster’s infrastructure and needs

It’s crucial to evaluate the capacity and health of the current (source) cluster to make sure it can handle additional consumption during migration, because MSK Replicator will retrieve data from the source cluster. Key checks include:

    • CPU utilization – The combined CPU User and CPU System utilization per broker should remain below 60%.
    • Network throughput – The cluster-to-cluster replication process adds extra egress traffic, because it might need to replicate the existing data based on business requirements along with the incoming data. For instance, if the ingress volume is X GB/day and data is retained in the cluster for 2 days, replicating the data from the earliest offset would cause the total egress volume for replication to be 2X GB. The cluster must accommodate this increased egress volume.

Consider an example in which your existing source cluster has an average data ingress of 100 MBps and a peak data ingress of 400 MBps, with retention of 48 hours. Assume you have one consumer of the data you produce to your Kafka cluster, which means that your egress traffic will be the same as your ingress traffic. Based on this requirement, you can use the Amazon MSK sizing guide to calculate the broker capacity you need to safely handle this workload. In the spreadsheet, you will need to provide your average and maximum ingress/egress traffic in the cells, as shown in the following screenshot.

Because you need to replicate all the data produced in your Kafka cluster, the consumption will be higher than the regular workload. Taking this into account, your overall egress traffic will be at least twice the size of your ingress traffic.

However, when you run a replication tool, the resulting egress traffic will be higher than twice the ingress because you also need to replicate the existing data along with the new incoming data in the cluster. In the preceding example, you have an average ingress of 100 MBps and you retain data for 48 hours, which means that you have a total of approximately 17.3 TB (100 MBps × 48 hours) of existing data in your source cluster that needs to be copied over on top of the new data that’s coming through. Let’s further assume that your goal for the replicator is to catch up in 30 hours. In this case, your replicator needs to copy data at 260 MBps (100 MBps for ingress traffic + 160 MBps (17.3 TB/30 hours) for existing data) to catch up in 30 hours. The following figure illustrates this process.

Therefore, in the sizing guide’s egress cells, you need to add an additional 260 MBps to your average data out and peak data out to estimate the size of the cluster you should provision to complete the replication safely and on time.
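The arithmetic behind the 260 MBps figure can be reproduced with a few lines of shell. This is a minimal sketch that assumes decimal units (1 TB = 1,000,000 MB) and the example's figures; the variable names are illustrative and not part of any AWS tool.

```shell
#!/usr/bin/env bash
# Sketch: estimate the throughput the replicator needs in order to catch up.
INGRESS_MBPS=100      # average ingress into the source cluster
RETENTION_HOURS=48    # how long data is retained in the source cluster
CATCHUP_HOURS=30      # target time for the replicator to catch up

# Existing data to copy: ingress rate multiplied by the retention window.
BACKLOG_MB=$(( INGRESS_MBPS * RETENTION_HOURS * 3600 ))

# Rate needed to drain that backlog within the catch-up window.
BACKLOG_MBPS=$(( BACKLOG_MB / (CATCHUP_HOURS * 3600) ))

# The replicator must also keep up with the data that keeps arriving.
REPLICATOR_MBPS=$(( INGRESS_MBPS + BACKLOG_MBPS ))

echo "Backlog to copy: ${BACKLOG_MB} MB"
echo "Backlog drain rate: ${BACKLOG_MBPS} MBps"
echo "Replicator throughput: ${REPLICATOR_MBPS} MBps"
```

Shell arithmetic is integer-only, so with other inputs the drain rate is rounded down; round up by hand when sizing for real workloads.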

Replication tools act as a consumer of the source cluster, so the replication consumer can take up enough bandwidth to negatively impact the produce and consume requests of existing application clients. To control the replication consumer throughput, you can use a consumer-side Kafka quota in the source cluster to limit the replicator throughput. The replicator consumer is then throttled when it goes beyond the limit, which safeguards the other consumers. However, if the quota is set too low, the replication throughput will suffer and the replication might never finish. Based on the preceding example, set the quota for the replicator to at least 260 MBps; otherwise, the replication will not finish in 30 hours.
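As a rough illustration, a consumer quota can be set with the kafka-configs.sh tool that ships with Apache Kafka. Kafka enforces quotas per broker in bytes per second, so the sketch below divides the cluster-wide target across brokers. The broker count, the file paths, the BS_SRC variable (the source bootstrap address), and the principal name are all assumptions, and whether a quota can be keyed on the replicator's identity depends on your authentication setup.

```shell
#!/usr/bin/env bash
# Sketch: derive a per-broker consumer quota for the replicator and apply it.
# Assumptions: 3 brokers with evenly spread partition leaders, decimal megabytes.
KAFKA_BIN="${KAFKA_BIN:-/home/ec2-user/kafka/bin}"
CLIENT_CONFIG="${CLIENT_CONFIG:-/home/ec2-user/kafka/config/client_iam.properties}"

REPLICATOR_MBPS=260   # cluster-wide target from the example
BROKER_COUNT=3        # hypothetical broker count

# Per-broker quota in bytes per second, rounded up so the total is not below the target.
QUOTA_BYTES_PER_BROKER=$(( (REPLICATOR_MBPS * 1000 * 1000 + BROKER_COUNT - 1) / BROKER_COUNT ))

# set_replicator_quota <principal-name>
# The principal name is a placeholder for the identity the replicator authenticates as.
set_replicator_quota() {
  "$KAFKA_BIN/kafka-configs.sh" --bootstrap-server "$BS_SRC" \
    --command-config "$CLIENT_CONFIG" \
    --alter --add-config "consumer_byte_rate=${QUOTA_BYTES_PER_BROKER}" \
    --entity-type users --entity-name "$1"
}

echo "Per-broker consumer quota: ${QUOTA_BYTES_PER_BROKER} bytes/sec"
# Example (not run here): set_replicator_quota "<<REPLICATOR_PRINCIPAL>>"
```

If partition leaders are skewed, an even split underestimates what the busiest broker has to serve, so treat the per-broker number as a starting point.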

  • Volume throughput – Data replication might involve reading from the earliest offset (based on business requirement), impacting your primary storage volume, which in this case is Amazon Elastic Block Store (Amazon EBS). The VolumeReadBytes and VolumeWriteBytes metrics should be checked to make sure the source cluster volume throughput has additional bandwidth to handle any additional read from the disk. Depending on the cluster size and replication data volume, you should provision storage throughput in the cluster. With provisioned storage throughput, you can increase the Amazon EBS throughput up to 1000 MBps depending on the broker size. The maximum volume throughput can be specified depending on broker size and type, as mentioned in Manage storage throughput for Standard brokers in a Amazon MSK cluster. Based on the preceding example, the replicator will start reading from the disk and the volume throughput of 260 MBps will be shared across all the brokers. However, existing consumers can lag, which will cause reading from the disk, thereby increasing the storage read throughput. Also, there is storage write throughput due to incoming data from the producer. In this scenario, enabling provisioned storage throughput will increase the overall EBS volume throughput (read + write) so that existing producer and consumer performance doesn’t get impacted due to the replicator reading data from EBS volumes.
  • Balanced partitions – Make sure partitions are well-distributed across brokers, with no skewed leader partitions.

Depending on the assessment, you might need to vertically scale up or horizontally scale out the source cluster before migration.
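One way to spot skewed leader partitions is to tally the Leader field in the output of kafka-topics.sh --describe. The following is a minimal sketch; the helper name leaders_per_broker is hypothetical, and BS_SRC stands for the source cluster's bootstrap address.

```shell
#!/usr/bin/env bash
# Sketch: count partition leaders per broker from `kafka-topics.sh --describe` output.
# Reads the describe output on stdin and prints "<broker-id> <leader-count>" per line.
leaders_per_broker() {
  awk '{
         for (i = 1; i < NF; i++)
           if ($i == "Leader:") count[$(i + 1)]++
       }
       END { for (broker in count) print broker, count[broker] }' | sort -n
}

# Example (requires a reachable cluster, so it is left commented out):
# /home/ec2-user/kafka/bin/kafka-topics.sh --bootstrap-server "$BS_SRC" --describe \
#   --command-config /home/ec2-user/kafka/config/client_iam.properties \
#   | leaders_per_broker
```

Roughly equal counts across brokers suggest balanced leadership; a broker with a much higher count is a candidate for partition reassignment before migration.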

Assessing the target cluster’s infrastructure and needs

Use the same sizing tool to estimate the size of your Express broker cluster. Typically, fewer Express brokers might be needed compared to Standard brokers for the same workload because depending on the instance size, Express brokers allow up to three times more ingress throughput.

Configuring Express Brokers

Express brokers employ opinionated and optimized Kafka configurations, so it’s important to differentiate between configurations that are read-only and those that are read/write during planning. Read/write broker-level configurations should be configured separately as a pre-migration step in the target cluster. Although MSK Replicator will replicate most topic-level configurations, certain topic-level configurations are always set to default values in an Express cluster: replication-factor, min.insync.replicas, and unclean.leader.election.enable. If the default values differ from the source cluster, these configurations will be overridden.

As part of the metadata, MSK Replicator also copies certain ACL types, as mentioned in Metadata replication. It doesn’t explicitly copy the write ACLs except the deny ones. Therefore, if you’re using SASL/SCRAM or mTLS authentication with ACLs rather than AWS Identity and Access Management (IAM) authentication, write ACLs need to be explicitly created in the target cluster.
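For clusters that rely on Kafka ACLs, the missing write ACLs can be recreated with the kafka-acls.sh tool from Apache Kafka. This is a sketch under assumptions: the principal and topic names are examples, BS_TGT stands for the target bootstrap address, and CLIENT_CONFIG must point to a client properties file that matches your SASL/SCRAM or mTLS setup (the default path here is a placeholder).

```shell
#!/usr/bin/env bash
# Sketch: grant a write ACL on a topic in the target cluster.
KAFKA_BIN="${KAFKA_BIN:-/home/ec2-user/kafka/bin}"
CLIENT_CONFIG="${CLIENT_CONFIG:-/home/ec2-user/kafka/config/client.properties}"

# grant_write_acl <principal> <topic>
# <principal> uses the Kafka form "User:<name>".
grant_write_acl() {
  "$KAFKA_BIN/kafka-acls.sh" --bootstrap-server "$BS_TGT" \
    --command-config "$CLIENT_CONFIG" \
    --add --allow-principal "$1" --operation Write --topic "$2"
}

# Example (not run here): grant_write_acl "User:clickstream-producer" clickstream
```

Listing the ACLs in the source cluster first (kafka-acls.sh --list) gives you the set of principals and topics to recreate.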

Client connectivity to the target cluster

Deployment of the target cluster can occur within the same virtual private cloud (VPC) or a different one. Consider any changes to client connectivity, including updates to security groups and IAM policies, during the planning phase.

Migration strategy: All at once vs. wave

Two migration strategies can be adopted:

  • All at once – All topics are replicated to the target cluster simultaneously, and all clients are migrated at once. Although this approach simplifies the process, it generates significant egress traffic and involves risks to multiple clients if issues arise. However, if there is any failure, you can roll back by redirecting the clients to use the source cluster. It’s recommended to perform the cutover during non-business hours and communicate with stakeholders beforehand.
  • Wave – Migration is broken into phases, moving a subset of clients (based on business requirements) in each wave. After each phase, the target cluster’s performance can be evaluated before proceeding. This reduces risks and builds confidence in the migration but requires meticulous planning, especially for large clusters with many microservices.

Each strategy has its pros and cons. Choose the one that aligns best with your business needs. For insights, refer to Goldman Sachs’ migration strategy to move from on-premises Kafka to Amazon MSK.

Cutover plan

Although MSK Replicator facilitates seamless data replication with minimal downtime, it’s essential to devise a clear cutover plan. This includes coordinating with stakeholders, stopping producers and consumers in the source cluster, and restarting them in the target cluster. If a failure occurs, you can roll back by redirecting the clients to use the source cluster.

Schema registry

When migrating from a Standard broker to an Express broker cluster, schema registry considerations remain unaffected. Clients can continue using existing schemas for both producing and consuming data with Amazon MSK.

Solution overview

In this setup, two Amazon MSK provisioned clusters are deployed: one with Standard brokers (source) and the other with Express brokers (target). Both clusters are located in the same AWS Region and VPC, with IAM authentication enabled. MSK Replicator is used to replicate topics, data, and configurations from the source cluster to the target cluster. The replicator is configured to maintain identical topic names across both clusters, providing seamless replication without requiring client-side changes.

During the first phase, the source MSK cluster handles client requests. Producers write to the clickstream topic in the source cluster, and a consumer group with the group ID clickstream-consumer reads from the same topic. The following diagram illustrates this architecture.

When data replication to the target MSK cluster is complete, we need to evaluate the health of the target cluster. After confirming the cluster is healthy, we need to migrate the clients in a controlled manner. First, we need to stop the producers, reconfigure them to write to the target cluster, and then restart them. Then, we need to stop the consumers after they have processed all remaining records in the source cluster, reconfigure them to read from the target cluster, and restart them. The following diagram illustrates the new architecture.

After verifying that all clients are functioning correctly with the target cluster using Express brokers, we can safely decommission the source MSK cluster with Standard brokers and the MSK Replicator.

Deployment Steps

In this section, we discuss the step-by-step process to replicate data from an MSK Standard broker cluster to an Express broker cluster using MSK Replicator, as well as the client migration strategy. For the purposes of this post, the “all at once” migration strategy is used.

Provision the MSK cluster

Download the AWS CloudFormation template to provision the MSK cluster. Deploy the stack in us-east-1 with the stack name migration.

This will create the VPC, subnets, and two Amazon MSK provisioned clusters: one with Standard brokers (source) and another with Express brokers (target) within the VPC configured with IAM authentication. It will also create a Kafka client Amazon Elastic Compute Cloud (Amazon EC2) instance from which we can use the Kafka command line to create and view Kafka topics and produce and consume messages to and from the topic.

Configure the MSK client

On the Amazon EC2 console, connect to the EC2 instance named migration-KafkaClientInstance1 using Session Manager, a capability of AWS Systems Manager.

After you log in, you need to configure the source MSK cluster bootstrap address to create a topic and publish data to the cluster. You can get the bootstrap address for IAM authentication from the details page for the MSK cluster (migration-standard-broker-src-cluster) on the Amazon MSK console, under View Client Information. You also need to update the producer.properties and consumer.properties files to reflect the bootstrap address of the standard broker cluster.

sudo su - ec2-user

export BS_SRC=<<SOURCE_MSK_BOOTSTRAP_ADDRESS>>
sed -i "s/BOOTSTRAP_SERVERS_CONFIG=/BOOTSTRAP_SERVERS_CONFIG=${BS_SRC}/g" producer.properties 
sed -i "s/bootstrap.servers=/bootstrap.servers=${BS_SRC}/g" consumer.properties

Create a topic

Create a clickstream topic using the following commands:

/home/ec2-user/kafka/bin/kafka-topics.sh --bootstrap-server=$BS_SRC \
--create --replication-factor 3 --partitions 3 \
--topic clickstream \
--command-config=/home/ec2-user/kafka/config/client_iam.properties

Produce and consume messages to and from the topic

Run the clickstream producer to generate events in the clickstream topic:

cd /home/ec2-user/clickstream-producer-for-apache-kafka/

java -jar target/KafkaClickstreamClient-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/producer.properties -nt 8 -rf 3600 -iam \
-gsr -gsrr <<REGION>> -grn default-registry -gar

Open another Session Manager instance and from that shell, run the clickstream consumer to consume from the topic:

cd /home/ec2-user/clickstream-consumer-for-apache-kafka/

java -jar target/KafkaClickstreamConsumer-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/consumer.properties -nt 3 -rf 3600 -iam \
-gsr -gsrr <<REGION>> -grn default-registry

Keep the producer and consumer running. If not interrupted, the producer and consumer will run for 60 minutes before they exit. The -rf parameter controls how long, in seconds, the producer and consumer will run.

Create an MSK replicator

To create an MSK replicator, complete the following steps:

  1. On the Amazon MSK console, choose Replicators in the navigation pane.
  2. Choose Create replicator.
  3. In the Replicator details section, enter a name and optional description.

  4. In the Source cluster section, provide the following information:
    1. For Cluster region, choose us-east-1.
    2. For MSK cluster, enter the MSK cluster Amazon Resource Name (ARN) for the Standard broker.

After the source cluster is selected, the console automatically selects the subnets and the security group associated with the source cluster. You can also select additional security groups.

Make sure that the security groups have outbound rules to allow traffic to your cluster’s security groups. Also make sure that your cluster’s security groups have inbound rules that accept traffic from the replicator security groups provided here.

  5. In the Target cluster section, for MSK cluster, enter the MSK cluster ARN for the Express broker.

After the target cluster is selected, the console automatically selects the subnets and the security group associated with the target cluster. You can also select additional security groups.

Now let’s provide the replicator settings.

  6. In the Replicator settings section, provide the following information:
    1. For the purpose of the example, we keep the topics to replicate at the default value, which replicates all topics from the source cluster to the target cluster.
    2. For Replicator starting position, we configure it to replicate from the earliest offset, so that we can get all the events from the start of the source topics.
    3. To keep the topic names in the target cluster identical to those in the source cluster, we select Keep the same topic names for Copy settings. This makes sure that the MSK clients don’t need to add a prefix to the topic names.
    4. For this example, we keep the Consumer Group Replication setting as default (make sure it’s enabled to allow redirected clients to resume processing data from the last processed offset).
    5. We set Target Compression type as None.

The Amazon MSK console will automatically create the required IAM policies. If you’re deploying using the AWS Command Line Interface (AWS CLI), SDK, or AWS CloudFormation, you have to create the IAM policy and use it as per your deployment process.

  7. Choose Create to create the replicator.

The process will take around 15–20 minutes to deploy the replicator. When the MSK replicator is running, this will be reflected in the status.
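The replicator status can also be polled from the AWS CLI instead of the console. The following is a minimal sketch, assuming the AWS CLI is configured with permission to describe the replicator; the function names are illustrative and the replicator ARN is a placeholder you supply.

```shell
#!/usr/bin/env bash
# Sketch: poll an MSK replicator until it reaches the RUNNING state.

# replicator_state <replicator-arn>: prints the current state, for example CREATING or RUNNING.
replicator_state() {
  aws kafka describe-replicator --replicator-arn "$1" \
    --query 'ReplicatorState' --output text
}

# wait_for_replicator <replicator-arn> [poll-interval-seconds]
# Returns 0 once the replicator is RUNNING and 1 if it reports FAILED or the call fails.
wait_for_replicator() {
  wait_arn="$1"
  wait_interval="${2:-60}"
  while true; do
    wait_state=$(replicator_state "$wait_arn") || return 1
    echo "Replicator state: ${wait_state}"
    case "$wait_state" in
      RUNNING) return 0 ;;
      FAILED)  return 1 ;;
    esac
    sleep "$wait_interval"
  done
}

# Example (not run here): wait_for_replicator "<<REPLICATOR_ARN>>" 60
```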

Monitor replication

When the MSK replicator is up and running, monitor the MessageLag metric. This metric indicates how many messages are yet to be replicated from the source MSK cluster to the target MSK cluster. The MessageLag metric should come down to 0.

Migrate clients from source to target cluster

When the MessageLag metric reaches 0, it indicates that all messages have been replicated from the source MSK cluster to the target MSK cluster. At this stage, you can cut over client applications from the source to the target cluster. Before initiating this step, confirm the health of the target cluster by reviewing the Amazon MSK metrics in Amazon CloudWatch and making sure that the client applications are functioning properly. Then complete the following steps:

  1. Stop the producers writing data to the source (old) cluster with Standard brokers and reconfigure them to write to the target (new) cluster with Express brokers.
  2. Before migrating the consumers, make sure that the MaxOffsetLag metric for the consumers has dropped to 0, confirming that they have processed all existing data in the source cluster.
  3. When this condition is met, stop the consumers and reconfigure them to read from the target cluster.

Offset lag occurs when the consumer reads data more slowly than the producer writes it. The flat line in the following metric visualization shows that the producer has stopped producing to the source cluster while the consumer attached to it continues to consume the existing data. The consumer eventually consumes all the data, so the metric goes to 0.

  4. Now you can update the bootstrap address in producer.properties and consumer.properties to point to the target Express based MSK cluster. You can get the bootstrap address for IAM authentication from the MSK cluster (migration-express-broker-dest-cluster) on the Amazon MSK console under View Client Information.

export BS_TGT=<<TARGET_MSK_BOOTSTRAP_ADDRESS>>
sed -i "s/BOOTSTRAP_SERVERS_CONFIG=.*/BOOTSTRAP_SERVERS_CONFIG=${BS_TGT}/g" producer.properties
sed -i "s/bootstrap.servers=.*/bootstrap.servers=${BS_TGT}/g" consumer.properties

  5. Run the clickstream producer to generate events in the clickstream topic:

cd /home/ec2-user/clickstream-producer-for-apache-kafka/

java -jar target/KafkaClickstreamClient-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/producer.properties -nt 8 -rf 60 -iam \
-gsr -gsrr <<REGION>> -grn default-registry -gar

  6. Open another Session Manager instance and, from that shell, run the clickstream consumer to consume from the topic:

cd /home/ec2-user/clickstream-consumer-for-apache-kafka/

java -jar target/KafkaClickstreamConsumer-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/consumer.properties -nt 3 -rf 60 -iam \
-gsr -gsrr <<REGION>> -grn default-registry

We can see that the producers and consumers are now producing to and consuming from the target Express based MSK cluster. The producers and consumers will run for 60 seconds before they exit.

The following screenshot shows messages produced to the new Express based MSK cluster for 60 seconds.
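Before stopping the consumers on the source cluster, the MaxOffsetLag check from the cutover steps can be cross-checked from the client instance with kafka-consumer-groups.sh. The following sketch sums the LAG column of its describe output; the helper name total_lag is hypothetical, and the column layout is assumed to match the tool's standard header row.

```shell
#!/usr/bin/env bash
# Sketch: sum the LAG column from `kafka-consumer-groups.sh --describe` output.
# Reads the describe output on stdin and prints the total lag across partitions.
total_lag() {
  awk '
    # Until the header row is seen, look for the LAG column and skip the line.
    lag_col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i; next }
    # Data rows: add the lag value (a "-" placeholder counts as 0).
    NF >= lag_col { sum += $lag_col }
    END { print sum + 0 }'
}

# Example (requires a reachable cluster, so it is left commented out):
# /home/ec2-user/kafka/bin/kafka-consumer-groups.sh --bootstrap-server "$BS_SRC" \
#   --command-config /home/ec2-user/kafka/config/client_iam.properties \
#   --describe --group clickstream-consumer | total_lag
```

A total of 0 after the producers have stopped indicates that the consumer group has read everything in the source topic.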

Migrate stateful applications

Stateful applications such as Kafka Streams, KSQL, Apache Spark, and Apache Flink use their own checkpointing mechanisms to store consumer offsets instead of relying on Kafka’s consumer group offset mechanism. When migrating topics from a source cluster to a target cluster, the Kafka offsets in the source will differ from those in the target. As a result, migrating a stateful application along with its state requires careful consideration, because the existing offsets are incompatible with the target cluster’s offsets. Before migrating stateful applications, it is crucial to stop producers and make sure that consumer applications have processed all data from the source MSK cluster.

Migrate Kafka Streams and KSQL applications

Kafka Streams and KSQL store consumer offsets in internal changelog topics. It is advisable not to replicate these internal changelog topics to the target MSK cluster. Instead, the Kafka Streams application should be configured to start from the earliest offset of the source topics in the target cluster. This allows the state to be rebuilt. However, this method results in duplicate processing, because all the data in the topic is reprocessed. Therefore, the target destination (such as a database) must be idempotent to handle these duplicates effectively.

To optimize performance, Express brokers don’t allow segment.bytes to be configured. Therefore, the internal topics need to be manually created before the Kafka Streams application is migrated to the new Express based cluster. For more information, refer to Using Kafka Streams with MSK Express brokers and MSK Serverless.

Migrate Spark applications

Spark stores offsets in its checkpoint location, which should be a file system compatible with HDFS, such as Amazon Simple Storage Service (Amazon S3). After migrating the Spark application to the target MSK cluster, you should remove the checkpoint location, causing the Spark application to lose its state. To rebuild the state, configure the Spark application to start processing from the earliest offset of the source topics in the target cluster. This will lead to re-processing all the data from the start of the topic and therefore will generate duplicate data. Consequently, the target destination (such as a database) must be idempotent to effectively handle these duplicates.

Migrate Flink applications

Flink stores consumer offsets within the state of its Kafka source operator. When checkpoints are completed, the Kafka source commits the current consuming offset to provide consistency between Flink’s checkpoint state and the offsets committed on Kafka brokers. Unlike other systems, Flink applications don’t rely on the __consumer_offsets topic to track offsets; instead, they use the offsets stored in Flink’s state.

During Flink application migration, one approach is to start the application without a Savepoint. This approach discards the entire state and reverts to reading from the last committed offset of the consumer group. However, this prevents the application from accurately rebuilding the state of downstream Flink operators, leading to discrepancies in computation results. To address this, you can either avoid replicating the consumer group of the Flink application or assign a new consumer group to the application when restarting it in the target cluster. Additionally, configure the application to start reading from the earliest offset of the source topics. This enables re-processing all data from the source topics and rebuilding the state. However, this method will result in duplicate data, so the target system (such as a database) must be idempotent to handle these duplicates effectively.

Alternatively, you can reset the state of the Kafka source operator. Flink uses operator IDs (UIDs) to map the state to specific operators. When restarting the application from a Savepoint, Flink matches the state to operators based on their assigned IDs. It is recommended to assign a unique ID to each operator to enable seamless state restoration from Savepoints. To reset the state of the Kafka source operator, change its operator ID. Passing the operator ID as a parameter in a configuration file can simplify this process. Restart the Flink application with parameter --allowNonRestoredState (if you are running self-managed Flink). This will reset only the state of the Kafka source operator, leaving other operator states unaffected. As a result, the Kafka source operator resumes from the last committed offset of the consumer group, avoiding full reprocessing and state rebuilding. Although this might still produce some duplicates in the output, it results in no data loss. This approach is applicable only when using the DataStream API to build Flink applications.
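For self-managed Flink, the restart described above might look like the following sketch. The -s and --allowNonRestoredState options belong to the flink run command; the --kafka-source-uid application argument is purely hypothetical and only works if your application reads it and passes it to the Kafka source operator's uid.

```shell
#!/usr/bin/env bash
# Sketch: restart a Flink job from a Savepoint after changing the Kafka source operator ID.

# restart_with_new_source_uid <savepoint-path> <application-jar> <new-source-uid>
restart_with_new_source_uid() {
  # --allowNonRestoredState lets Flink skip the state saved under the old source ID.
  # --kafka-source-uid is a hypothetical argument consumed by the application itself.
  flink run -s "$1" --allowNonRestoredState "$2" --kafka-source-uid "$3"
}

# Example (not run here):
# restart_with_new_source_uid "<<SAVEPOINT_PATH>>" "<<APPLICATION_JAR>>" "kafka-source-v2"
```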

Conclusion

Migrating from a Standard broker MSK cluster to an Express broker MSK cluster using MSK Replicator provides a seamless, efficient transition with minimal downtime. By following the steps and strategies discussed in this post, you can take advantage of the high-performance, cost-effective benefits of Express brokers while maintaining data consistency and application uptime.

Ready to optimize your Kafka infrastructure? Start planning your migration to Amazon MSK Express brokers today and experience improved scalability, speed, and reliability. For more details, refer to the Amazon MSK Developer Guide.


About the Author

Subham Rakshit is a Senior Streaming Solutions Architect for Analytics at AWS based in the UK. He works with customers to design and build streaming architectures so they can get value from analyzing their streaming data. His two little daughters keep him occupied most of the time outside work, and he loves solving jigsaw puzzles with them. Connect with him on LinkedIn.

Foundational blocks of Amazon SageMaker Unified Studio: An admin’s guide to implement unified access to all your data, analytics, and AI

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/foundational-blocks-of-amazon-sagemaker-unified-studio-an-admins-guide-to-implement-unified-access-to-all-your-data-analytics-and-ai/

Amazon SageMaker Unified Studio (preview) provides a unified experience for using data, analytics, and AI capabilities. You can use familiar AWS services for model development, generative AI, data processing, and analytics—all within a single, governed environment. Users can now build, deploy, and execute end-to-end workflows from a single interface. SageMaker Unified Studio is built on the foundations of Amazon DataZone, where it uses domains to categorize and structure the data assets, while offering project-based collaboration features that allow teams to securely share artifacts and work together across various compute services. This experience allows multiple personas to seamlessly collaborate, while operating under appropriate access controls and governance policies.

In this post, we focus on the admin persona and deep dive into the foundational building blocks while implementing the self-service access to all your data.

Conceptual framework

SageMaker Unified Studio offers an integrated development experience organized into three distinct planes, each serving different personas and purposes within the development lifecycle. This architecture enables seamless collaboration while maintaining clear boundaries of responsibility.

As shown in the following figure, each plane represents a distinct layer of functionality that works in harmony with the others to create a complete data and machine learning (ML) solution.

[Figure: Foundational planes]

The planes are as follows:

  • Infrastructure plane – The infrastructure plane forms the foundation of SageMaker Unified Studio. Here administrators and domain owners of the organization provision the underlying infrastructure and define rules for users of the data factory plane to deploy the compute resources for data and ML operations in self-service mode. They can also decide to onboard existing resources or pre-create them. They can set up access controls and permissions to enforce and allocate resources to different teams and projects. This layer makes sure that all necessary computational resources are available and properly governed for downstream computation.
  • Data factory plane – The data factory plane functions like a sophisticated vending machine for compute resources, where data scientists and ML engineers can select and utilize preconfigured compute resources or deploy new ones. The data product developers, data engineers, and data scientists can create collaboration spaces and build data products by consuming infrastructure resources, with all the underlying complexity abstracted away.
  • Product experience plane – At the outermost layer, the product experience plane serves as a discovery and collaboration hub where business units (data producers and data consumers) can explore available data products from the asset catalog. This plane drives users to engage in data-driven conversations with knowledge and insights shared across the organization. Through the product experience plane, data product owners can use automated workflows to capture data lineage and data quality metrics and oversee access controls. They can track how their data products are being used and continuously improve the value proposition of their data assets.

In this post, we focus on the infrastructure plane deployment steps from an administrator’s perspective, outlining key responsibilities and actions required and how to configure and organize your assets under specific business units and teams and authorize policies during the initial setup phase.

Roles and responsibilities of the domain owner (admin) for the infrastructure plane

As shown in the following figure, the infrastructure plane revolves around three pivotal operational paradigms: onboard, organize, and authorize.

The details of the three essential functions in the foundational layer are as follows:

  • Onboard – The domain owner establishes a foundational environment by creating a domain, an organizational entity that connects your assets, users, resources, and code repository configurations. They can onboard the users who are authorized to access the self-serve unified studio, a browser-based web application where you can analyze, discover, catalog, govern, and share data in a self-serve manner. The admin can enable the necessary blueprints and create project profiles to set up the underlying data infrastructure. In a multi-account (Mesh) scenario, the admin can also onboard the business units by associating their AWS accounts.
  • Organize – Here the domain owner creates hierarchies to organize and isolate projects within individual business units. The method of creating hierarchical representation of business units or team-level organization is through domain units. This makes sure that each business unit takes ownership of their assets. The admin can also delegate ownership within these business units.
  • Authorize – The admin or owners of individual business units or line of business (domain unit owners) can manage user policies—project-specific policies that dictate certain actions these principals can perform under a domain unit.

Now that we have discussed the core functions, let’s delve into the workflow that brings these concepts together.

Process workflow (infrastructure plane)

In the following figure, we break down the roles and responsibilities, from domain owners to domain unit administrators, into a sequence of operations for deploying and managing the infrastructure.

[Figure: Process workflow]

The workflow consists of the following steps:

  1. The root domain owner (admin) creates a SageMaker Unified Studio domain from the console. After the domain is created, you get a SageMaker Unified Studio URL—a browser-based web application that can authenticate you with your AWS Identity and Access Management (IAM) user credentials or with credentials from your identity provider (IdP) through AWS IAM Identity Center or with your SAML credentials.
  2. As part of the onboarding process, the admin onboards single sign-on (SSO) users, SSO groups, and IAM users who are authorized to log in to SageMaker Unified Studio. IAM roles can be onboarded on the domain as well, but can be used for programmatic access only. During the quick setup deployment of the domain, default project profile templates are created. A project profile is a collection of blueprints that holds configurations of AWS tools and services. You can create the following project profiles:
    1. Generative AI application development – Provides you with the tooling capabilities to build generative AI applications using Amazon Bedrock foundation models (FMs) and tools.
    2. SQL analytics – Provides you with a SQL editor to query the data in Amazon SageMaker Lakehouse, Amazon Redshift, and Amazon Athena.
    3. Data analytics and AI-ML model development – Provides you with tools to build and orchestrate ML and generative AI models powered by AWS Glue, Athena, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), Amazon SageMaker AI, and SageMaker Lakehouse.
    4. Custom project profile – Provides capabilities to build custom templates that can bundle multiple blueprints with varied tooling capabilities to suit your business needs.

Admins can also authorize project profile templates to specific users and groups, enforcing the capability to control resource deployment based on user personas. By default, all users are authorized to use default project profiles. However, this can be changed by the admin to limit the access of certain project profiles to certain users and groups.

The quick setup also establishes a default Git connection to AWS CodeCommit for users to manage their code repository. However, you also have the option to create and enable new Git connections to GitHub, GitHub Enterprise Server, GitLab, and GitLab self-managed. The Free Tier release of Amazon Q is enabled by default for all users of the SageMaker Unified Studio domain. Amazon Q Developer Pro can be configured if IAM Identity Center is configured for users of the domain.

Finally, as part of the initial setup, the admin provides access to Amazon Bedrock serverless models.

In a multi-account scenario, the central admin associates AWS accounts, and the associated account admins accept the association and enable the blueprints for the project profiles that the central admin would create. Refer to the appendix at the end of this post for more details.

  3. To organize the data assets within the organization, the admin logs in to the SageMaker Unified Studio URL and creates domain units aligned with the business divisions.
  4. Each domain unit receives delegated ownership, enabling autonomous management of assets within their designated scope. This domain-based isolation provides clear boundaries while allowing unit owners to independently govern their assets and enforce relevant policies.

Steps 3 and 4 are optional as part of the quick deployment setup. Users can directly log in to SageMaker Unified Studio to build data products for their business use case if domain units are not an immediate requirement. If no domain units are created, all users and groups fall back under the root domain level and authorization policies are applied on the root domain.

Behind the scenes

While users interact with a streamlined project creation interface in SageMaker Unified Studio, a sophisticated orchestration of components operates beneath the surface. This abstraction allows the admin to deploy infrastructure through simple selections while the system handles resource provisioning automatically. Let’s examine the underlying process behind the scenes, as illustrated in the following figure.

conceptual diagram of blueprints

This workflow consists of the following steps:

  1. Administrators enable the blueprints containing the AWS CloudFormation templates that have information on how to create and set up the underlying data infrastructure. These blueprints are automatically enabled during the quick setup deployment.
  2. Project profiles bundle these blueprint configurations into templates. These templates determine which infrastructure components deploy when a project is created.
  3. When users select a project profile within SageMaker Unified Studio, the system automatically triggers the relevant CloudFormation stack and deploys the necessary infrastructure resources in the form of environments. Environments are the actual data infrastructure behind a project.
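The relationship between blueprints, project profiles, and environments in the steps above can be sketched as a small data model. This is a conceptual illustration only, not the SageMaker Unified Studio API: the class names, the `create_project` function, and the template file names are hypothetical, and the real system deploys CloudFormation stacks rather than returning strings.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Blueprint:
    """Stands in for a blueprint that points at a CloudFormation template."""
    name: str
    template: str  # hypothetical template identifier


@dataclass
class ProjectProfile:
    """Bundles blueprint configurations into a reusable project template."""
    name: str
    blueprints: list[Blueprint] = field(default_factory=list)


def create_project(project_name: str, profile: ProjectProfile, enabled: set[str]) -> list[str]:
    """Return the environments that would be deployed for a new project.

    Only blueprints that an administrator has enabled can be deployed.
    """
    environments = []
    for blueprint in profile.blueprints:
        if blueprint.name not in enabled:
            raise ValueError(f"blueprint {blueprint.name!r} is not enabled")
        # In the real service this is where a CloudFormation stack is triggered.
        environments.append(f"{project_name}/{blueprint.name}")
    return environments


# Hypothetical blueprint names, chosen for illustration.
lake = Blueprint("DataLake", "datalake.yaml")
workflows = Blueprint("Workflows", "workflows.yaml")
profile = ProjectProfile("retail-analytics", [lake, workflows])

print(create_project("sales", profile, enabled={"DataLake", "Workflows"}))
```

The point of the sketch is the layering: a user only picks a profile, and the set of environments follows from the blueprints that the profile bundles.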

In a multi-account scenario, the associated account admin enables the blueprints. However, the project profile creation happens at the root domain account. The project profile template will include the associated account details and the linked blueprints from the associated account. Refer to the appendix at the end of this post for more details.

Now that we have covered the functional building blocks of SageMaker Unified Studio, let’s proceed with the deployment walkthrough. We will create a domain using the quick setup deployment for a single account. Refer to the appendix for multi-account deployment steps.

Prerequisites

You will need to complete the following prerequisites before you can follow the instructions in the next section:

  1. Sign up for an AWS account.
  2. Create a user with administrative access.
  3. Enable IAM Identity Center in the same AWS Region where you want to create your SageMaker Unified Studio domain. Confirm that SageMaker Unified Studio is currently available in that Region. Set up your IdP and synchronize identities and groups with IAM Identity Center. For more information, refer to IAM Identity Center Identity source tutorials.
  4. To use Amazon Bedrock FMs, grant access to base models.

Set up domain

Complete the following steps to create a new SageMaker Unified Studio domain:

  1. Sign in to the SageMaker console in the Region in which IAM Identity Center is enabled.
  2. Choose Create a Unified Studio domain.

create domain

  1. Select Quick setup (recommended for exploration).
  2. Choose Create VPC (you can also use your own VPC but to simplify the cleanup, we opted to use a new VPC).

create vpc

This will open a new tab to deploy the CloudFormation stack to create the VPC and the necessary private and public subnets.

  1. For Stack name, enter a unique name for the stack (if the default name already exists).
  2. Keep the parameter for useVpcEndpoints as false.
  3. Choose Create stack.

create stack

  1. After the stack is created, go to the domain creation page and refresh the page, as shown in the following screenshot.

refresh

  1. For Name, enter a unique name for the domain.
  2. Keep the default selections for Domain Execution role, Domain Service role, Provisioning role, and Manage Access role.
  3. The configuration automatically selects the VPC and private subnets.

domain roles

service roles

  1. Keep the default selection for Model provisioning role and Model consumption role.
  2. Choose Continue.

prov roles

  1. Provide the email address of the SSO user that exists in IAM Identity Center.

The SSO user selected here is used as the administrator in SageMaker Unified Studio. If the account doesn’t have IAM Identity Center set up, the setup creates an IAM Identity Center account instance, so long as the account is permitted to do so. An SSO or IAM user is required so that a user is able to log in to the studio after the domain is created.

  1. Choose Create domain.

create IdC

  1. After the domain is created, a dialog box pops up. You can close the dialog box to set up authorization policies and onboard users.

dialog box

On the domain detail page, the Amazon SageMaker Unified Studio URL is listed. You can authenticate with your IAM user credentials, with credentials from your IdP through IAM Identity Center, or with your SAML credentials. To authorize users to log in to the URL, the administrator must onboard the users to the domain, as described in the next section.

Unified Studio URL

Onboard users and associated accounts

Complete the following steps:

  1. To onboard users, go to the User management tab and choose Add.
  2. On the Add menu, choose either Add SSO users and groups or Add IAM users.

You can also add IAM roles for the purpose of managing the domain programmatically. However, you can’t use IAM roles to log in to the SageMaker Unified Studio URL. After you add the users, they will appear with the status Assigned. The status changes to Activated only when the user logs in to the SageMaker Unified Studio URL.

onboard users

  1. If you want to onboard multiple AWS accounts to your domain account, go to the Account associations tab and choose Request association.

This enables domain users to publish and consume data from these AWS accounts.

associate accounts

For a multi-account setup, by sending an association request to another AWS account, you share the root domain with the other AWS account through AWS Resource Access Manager (AWS RAM). The admin of the associated account accepts the invitation. To access the compute resources of the associated accounts from SageMaker Unified Studio, the admin of the associated account must enable the necessary blueprints. Refer to the appendix to understand the cross-account deployment steps.

Project profiles and authorizing users

For the quick setup deployment, when you navigate to the Blueprints tab, you will notice all the blueprints are automatically enabled. Also, on the Project profiles tab, you will find default project profiles are available to the user.

project profiles

Leave the rest of the tabs with the default options.

Create a custom project profile and authorize users (optional)

In the following example, we show the steps to create a custom project profile by bundling selected blueprints and to authorize only restricted users to use this project profile template. The profile enables the user to create a data lake environment with an AWS Glue database and an Athena workgroup to query the data. The user can also create an Amazon MWAA environment for orchestration. You can also change or override the configuration parameters of the blueprint by using the Tooling configurations option within the project profile.

Because SageMaker Unified Studio is in preview mode, the naming conventions of some visual elements might appear different in the current version.

When you create a project profile, you can add blueprint deployment settings in two modes: on create and on demand. On create mode deploys the blueprint deployment settings as soon as the project is created. On demand mode deploys the blueprint deployment settings only when users need them.
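To make the difference between the two modes concrete, here is a minimal sketch of how a profile's settings split into what is deployed immediately and what waits for a user request. The enum, the function names, and the blueprint labels are assumptions made for illustration; they are not identifiers from the service.

```python
from enum import Enum


class DeploymentMode(Enum):
    ON_CREATE = "on create"
    ON_DEMAND = "on demand"


def deployed_at_creation(settings: dict[str, DeploymentMode]) -> list[str]:
    """Blueprint settings deployed as soon as the project is created."""
    return [name for name, mode in settings.items() if mode is DeploymentMode.ON_CREATE]


def available_on_demand(settings: dict[str, DeploymentMode]) -> list[str]:
    """Blueprint settings that are deployed only when a user asks for them."""
    return [name for name, mode in settings.items() if mode is DeploymentMode.ON_DEMAND]


# Hypothetical profile: the data lake comes with the project,
# while the orchestration environment is created later if needed.
settings = {
    "DataLake": DeploymentMode.ON_CREATE,
    "Workflows": DeploymentMode.ON_DEMAND,
}

print(deployed_at_creation(settings))
print(available_on_demand(settings))
```

One reason to choose on demand, under this reading, is to avoid provisioning environments that a given project may never use.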

Create a project, create domain units, and delegate ownership (optional)

In the following example, the administrator logs in to SageMaker Unified Studio and creates the retail domain unit. The admin also delegates ownership to the retail business user. The retail business user logs in to SageMaker Unified Studio and creates a project with the authorized project profile template.

With these configurations in place, you have successfully completed the initial infrastructure plane deployment from an administrative perspective.

Authorization of blueprints (optional)

By default, all domain users have authorization to create projects with the enabled blueprints across domain units. If you want to restrict the usage of the blueprint within a specific domain unit (in this case, the retail domain unit, as shown in the following screenshot), you need to revoke the existing permissions and authorize the specific domain units. By limiting the use of blueprints to a particular domain unit, users can only create projects using the blueprint within that domain unit. To apply authorization settings to child domain units, enable the Cascade to all child domain units option.
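The effect of the Cascade to all child domain units option can be sketched as a walk up the domain unit hierarchy. This is a simplified model built on assumptions: the `DomainUnit` and `BlueprintGrant` classes and the `can_use_blueprint` function are hypothetical names, and the real authorization policies carry more detail than a single flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainUnit:
    name: str
    parent: Optional["DomainUnit"] = None


@dataclass
class BlueprintGrant:
    """Authorizes one domain unit to use a blueprint, optionally with cascade."""
    unit: DomainUnit
    cascade: bool = False


def can_use_blueprint(unit: DomainUnit, grants: list[BlueprintGrant]) -> bool:
    """True if the unit is granted directly or inherits a cascading grant."""
    for grant in grants:
        if grant.unit is unit:
            return True
        if grant.cascade:
            # Walk up the hierarchy looking for the granted ancestor.
            ancestor = unit.parent
            while ancestor is not None:
                if ancestor is grant.unit:
                    return True
                ancestor = ancestor.parent
    return False


# Hypothetical hierarchy: root -> retail -> retail_eu, and root -> finance.
root = DomainUnit("root")
retail = DomainUnit("retail", parent=root)
retail_eu = DomainUnit("retail-eu", parent=retail)
finance = DomainUnit("finance", parent=root)

grants = [BlueprintGrant(retail, cascade=True)]
for unit in (retail, retail_eu, finance):
    print(unit.name, can_use_blueprint(unit, grants))
```

With the cascade flag set to false in the same example, only the retail unit itself would pass the check.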

blueprints authorization

Clean up

Make sure you remove the SageMaker Unified Studio resources to mitigate any unexpected costs. This involves a few steps:

  1. If you had multiple projects and subscribed to assets, unsubscribe from all assets.
  2. Note the names of all AWS Glue databases and Athena workgroups created by your projects.
  3. Delete any connections you created in the data explorer that you don’t want to keep.
  4. Note the project IDs.
  5. Delete the projects. If you encounter any errors, check the AWS CloudFormation console and find the failed stack. Fix the error that failed the stack deletion and delete the projects.
  6. Note down the domain ID.
  7. Delete the domain.
  8. Delete the S3 bucket named amazon-datazone-AWSACCOUNTID-AWSREGION-DOMAINID.
  9. Delete the AWS Glue databases and Athena workgroups you noted earlier.
  10. Delete the CloudFormation stack for the VPC (if you followed that step in the setup).
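The bucket name in step 8 follows the pattern amazon-datazone-AWSACCOUNTID-AWSREGION-DOMAINID, so it can be derived from the values noted in the earlier steps. The helper below is a small sketch of that substitution; the function name is hypothetical, and the account ID, Region, and domain ID in the example are placeholders, not real values.

```python
def datazone_bucket_name(account_id: str, region: str, domain_id: str) -> str:
    """Build the bucket name following the pattern given in the cleanup steps."""
    # AWS account IDs are 12 digits; catching a typo here avoids a failed lookup later.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    if not region or not domain_id:
        raise ValueError("region and domain_id must be non-empty")
    return f"amazon-datazone-{account_id}-{region}-{domain_id}"


# Placeholder values for illustration only.
print(datazone_bucket_name("123456789012", "us-east-1", "exampledomainid"))
```

Confirm the resulting name against the Amazon S3 console before deleting anything, because the deletion is not reversible.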

If you have additional resources that haven’t been deleted, you can also use tags to identify and delete specific resources.

Conclusion

In this post, we discussed the foundational building blocks of SageMaker Unified Studio and how, by abstracting complex technical implementations behind user-friendly interfaces, organizations can maintain standardized governance while enabling efficient resource management across business units. This approach provides consistency in infrastructure deployment while providing the flexibility needed for diverse business requirements.

To learn more, refer to the Amazon SageMaker Unified Studio Administrator Guide.

Appendix: Multi-account administration

This section illustrates the cross-account association. After the account invitation is accepted by the associated account owner, follow the instructions as shown in the following example to understand how to enable the blueprints. After the blueprints are enabled in the associated accounts, the root domain account can create project profile templates with the parameters of the associated account, including its linked blueprints. The example then demonstrates how the retail domain unit user can deploy compute resources and create data using the resources from the associated account.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached via LinkedIn.

Fabrizio Napolitano is a Principal Specialist Solutions Architect for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite by surprise become a Hockey Dad after moving to Canada.

Exabyte Scale Hard Drive Investments

Post Syndicated from Chris Opat original https://www.backblaze.com/blog/exabyte-scale-hard-drive-investments/


Not many companies run exabyte scale data platforms, and not many companies open source their drive data—at Backblaze, we do both. From that perch, I’m sharing how I think about buying hard drives at exabyte scale, including the intentional design decisions and trade-offs I make as an expert in the field, and what you can apply to your own operations whether you’re running a couple hundred terabytes or petabytes on-premises. 

TL;DR: Bigger drives aren’t always better

You’d think, as a cloud platform managing massive amounts of data, we’d be delighted that drive density continues to grow. But it’s not as simple as that. While we do run cohorts of 20TB+ drives in our environment, there are a few reasons it doesn’t always make sense to fill our servers up with the densest drives we can buy.

Drive size and IOPS starvation

Drives have a finite amount of capacity to perform input/output operations per second (IOPS). The larger the drive, the more those IOPS become a contentious consumable—creating a triangle of tension between storage capacity, reading, and writing. You can store more data on a 20TB drive, but you can only read and write as fast as that one drive allows. Conversely, you can store the same amount of data on five 4TB drives and 5x your IOPS capacity through concurrency. 
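The arithmetic behind that comparison is easy to check. The sketch below assumes every drive delivers the same IOPS regardless of capacity, and the figure of 150 IOPS per drive is a round number picked purely for illustration, not a measured value from our fleet.

```python
def aggregate_iops(drive_count: int, iops_per_drive: float) -> float:
    """Total IOPS available when requests are spread across all drives."""
    return drive_count * iops_per_drive


def iops_per_tb(drive_count: int, drive_tb: float, iops_per_drive: float) -> float:
    """IOPS available per terabyte stored, a rough measure of IOPS starvation."""
    return aggregate_iops(drive_count, iops_per_drive) / (drive_count * drive_tb)


ASSUMED_IOPS_PER_DRIVE = 150.0  # illustrative assumption, not a benchmark

one_big = aggregate_iops(1, ASSUMED_IOPS_PER_DRIVE)     # one 20TB drive
five_small = aggregate_iops(5, ASSUMED_IOPS_PER_DRIVE)  # five 4TB drives

print(f"1 x 20TB: {one_big:.0f} IOPS, {iops_per_tb(1, 20, ASSUMED_IOPS_PER_DRIVE):.1f} IOPS/TB")
print(f"5 x 4TB:  {five_small:.0f} IOPS, {iops_per_tb(5, 4, ASSUMED_IOPS_PER_DRIVE):.1f} IOPS/TB")
print(f"ratio: {five_small / one_big:.0f}x")
```

Both layouts hold 20TB, but the IOPS available per terabyte differs by the same factor of five, which is the number to watch for read-heavy or write-heavy workloads.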

For high demand workloads with high concurrency requirements for reading and writing files—like AI inferencing, for example—you’ll want to carefully consider the balance point between the right drive size and the performance you need to get out of the system. The ability to read, write, or delete content has to peacefully coexist with the ability for your storage infrastructure to service any of those three needs. Now, you might be thinking: If that’s a constraint, what about SSDs? I’ll get to that down below.

Drive size and rebuilds

When managing large data at scale, we employ Reed-Solomon erasure coding to rebuild drives upon failure and maintain data durability. The larger the drive, the more painful and slow the rebuild when that drive eventually fails. The rebuild process can take hours or even days, depending on the size of the drive and the workload on the system. That can impact performance, especially if the storage system is already under heavy use, and increases the risk of another failure while the rebuild is in progress. While we mitigate that risk in a variety of ways, it may not be feasible for smaller shops to do so.
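A back-of-the-envelope estimate shows why capacity matters so much here. The sketch below assumes the rebuild is limited by a single sustained write rate to the replacement drive, and the 100 MB/s figure is an assumption for illustration; real rebuild rates depend on the erasure coding layout and on how much production traffic the system is serving.

```python
def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Estimate hours to rewrite a full drive at a sustained rebuild rate."""
    if rebuild_mb_per_s <= 0:
        raise ValueError("rebuild rate must be positive")
    total_bytes = capacity_tb * 1e12            # decimal terabytes, as drives are sold
    seconds = total_bytes / (rebuild_mb_per_s * 1e6)
    return seconds / 3600


ASSUMED_RATE_MB_PER_S = 100.0  # illustrative assumption

for capacity in (4, 20):
    hours = rebuild_hours(capacity, ASSUMED_RATE_MB_PER_S)
    print(f"{capacity}TB at {ASSUMED_RATE_MB_PER_S:.0f} MB/s: about {hours:.1f} hours")
```

Under these assumptions, the 20TB drive takes roughly 56 hours against roughly 11 hours for the 4TB drive, and the window of exposure to a second failure grows in the same proportion.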

If you’re in a business that relies on real-time data access—financial institutions, healthcare providers, e-commerce platforms, for example—you need drives that balance capacity and rebuild speed. Higher-capacity drives may offer better storage density but smaller or enterprise-grade drives with faster rebuild times and higher endurance may be a better choice for businesses where continuous uptime and/or durability is critical.

HDD vs. SSD: Unit economics

The moral of the story is that the way you invest in drives, and how much you take things like drive size, drive type, and the failure rates we publish into consideration, absolutely depends on your use case. It’s not as simple as looking at our Drive Stats and picking the drive with the lowest annualized failure rate.

In Backblaze’s early days, when we were focused on consumer backup, drive density and durability were the most important part of the equipment for us. We didn’t care about speed. As our customers increasingly bring us newer and more demanding use cases, our calculus for the kinds of drives we fill our data centers with will change with them. 

The post Exabyte Scale Hard Drive Investments appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

New leadership for Asahi Linux

Post Syndicated from corbet original https://lwn.net/Articles/1009528/

The Asahi Linux project, which is working to support Linux on Apple
silicon, has announced the
resignation of Hector “marcan” Martin as its lead, and his replacement by a
seven-person committee. “Today’s news is bittersweet. We are grateful
to marcan for kicking off this project and tirelessly working on it these
past years. Our community will miss him. Still, with your support, the
project has a bright future to come”. Martin has explained his reasons
for leaving at length in this blog post.

[$] Multi-size THP creation, two different ways

Post Syndicated from corbet original https://lwn.net/Articles/1009039/

Huge pages can increase the performance of many programs, but they can also
have unfortunate performance impacts of their own. Over the last few
years, multi-size transparent huge pages (mTHPs) have increasingly been
seen as a happy medium that brings the benefits of huge pages at a lower cost.
The system cannot benefit from mTHPs, though, if it does not create them;
two developers have independently posted patches to enable the creation of
mTHPs in the background.

CVE-2025-1094: PostgreSQL psql SQL injection (FIXED)

Post Syndicated from Stephen Fewer original https://blog.rapid7.com/2025/02/13/cve-2025-1094-postgresql-psql-sql-injection-fixed/


Rapid7 discovered a high-severity SQL injection vulnerability, CVE-2025-1094, affecting the PostgreSQL interactive tool psql. This discovery was made while Rapid7 was performing research into the recent exploitation of CVE-2024-12356 — an unauthenticated remote code execution (RCE) vulnerability that affects both BeyondTrust Privileged Remote Access (PRA) and BeyondTrust Remote Support (RS). Rapid7 discovered that in every scenario we tested, a successful exploit for CVE-2024-12356 had to include exploitation of CVE-2025-1094 in order to achieve remote code execution. While CVE-2024-12356 was patched by BeyondTrust in December 2024, and this patch successfully blocks exploitation of both CVE-2024-12356 and CVE-2025-1094, the patch did not address the root cause of CVE-2025-1094, which remained a zero-day until Rapid7 discovered and reported it to PostgreSQL.

All supported versions before PostgreSQL 17.3, 16.7, 15.11, 14.16, and 13.19 are affected. CVE-2025-1094 has a CVSS 3.1 base score of 8.1 (High). More information is available in the PostgreSQL advisory.

Impact

CVE-2025-1094 arises from an incorrect assumption that when attacker-controlled untrusted input has been safely escaped via PostgreSQL’s string escaping routines, it cannot be leveraged to generate a successful SQL injection attack. Rapid7 found that SQL injection is, in fact, still possible in a certain scenario when escaped untrusted input is included as part of a SQL statement executed by the interactive psql tool.

Because of how PostgreSQL string escaping routines handle invalid UTF-8 characters, in combination with how invalid byte sequences within the invalid UTF-8 characters are processed by psql, an attacker can leverage CVE-2025-1094 to generate a SQL injection.

An attacker who can generate a SQL injection via CVE-2025-1094 can then achieve arbitrary code execution (ACE) by leveraging the interactive tool’s ability to run meta-commands. Meta-commands extend the interactive tool’s functionality by providing a wide variety of additional operations that the interactive tool can perform. The \! meta-command, identified by the exclamation mark symbol, allows for an operating system shell command to be executed. An attacker can leverage CVE-2025-1094 to perform this meta-command, thus controlling the operating system shell command that is executed.

Alternatively, an attacker who can generate a SQL injection via CVE-2025-1094 can execute arbitrary attacker-controlled SQL statements.
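Because the issue hinges on invalid UTF-8 reaching the escaping routines and then psql, one precaution an application could take is to reject input that is not valid UTF-8 before building any SQL text. The sketch below is an assumption about a possible defense-in-depth check, not something the advisory prescribes, and it is not a substitute for upgrading. The function name is hypothetical, and the second sample is simply a byte sequence that no UTF-8 decoder accepts.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


samples = [
    b"caf\xc3\xa9",  # valid: "café" encoded as UTF-8
    b"\xc0'",        # invalid: 0xC0 can never start a valid UTF-8 sequence
]

for sample in samples:
    print(sample, is_valid_utf8(sample))
```

A check like this belongs at the boundary where untrusted bytes enter the application, before any escaping or string concatenation happens.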

Credit

This vulnerability was discovered by Stephen Fewer, Principal Security Researcher at Rapid7 and is being disclosed in accordance with Rapid7’s vulnerability disclosure policy.

Analysis

A technical analysis of CVE-2025-1094, as it relates to the exploitation of the BeyondTrust vulnerability CVE-2024-12356, is available in AttackerKB.

A Metasploit exploit module that exploits CVE-2025-1094 against a vulnerable BeyondTrust Privileged Remote Access (PRA) and Remote Support (RS) target is available here.

Vendor Statement

The PostgreSQL Global Development Group provides information on security vulnerability reporting, release processes, and known vulnerability fixes at https://www.postgresql.org/support/security/.

Remediation

To remediate CVE-2025-1094, PostgreSQL users should upgrade to PostgreSQL 17.3, 16.7, 15.11, 14.16, or 13.19. For additional details, please see the PostgreSQL advisory.
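The fixed versions listed above can be turned into a quick check for an inventory script. This is a minimal sketch: the function name is hypothetical, it assumes version strings in the major.minor form (optionally followed by packaging text), and it returns None for any major version outside the five branches named in the advisory rather than guessing.

```python
from typing import Optional

# First fixed minor release for each major branch, as listed in the advisory.
FIXED_MINOR = {17: 3, 16: 7, 15: 11, 14: 16, 13: 19}


def is_affected(version: str) -> Optional[bool]:
    """True if the version predates the fix, False if fixed, None if not covered."""
    number = version.split()[0]           # drop any trailing packaging text
    parts = number.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    fixed = FIXED_MINOR.get(major)
    if fixed is None:
        return None                       # branch not listed in the advisory
    return minor < fixed


for v in ("17.2", "17.3", "14.15 (Debian 14.15-1)", "12.22"):
    print(v, is_affected(v))
```

A result of None should be treated as a prompt to check the PostgreSQL advisory directly, not as a sign that the installation is safe.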

Rapid7 customers

InsightVM and Nexpose customers will be able to assess their exposure to CVE-2025-1094 with an authenticated vulnerability check expected to be available in today’s (February 13) content release.

For CVE-2024-12356 affecting BeyondTrust Privileged Remote Access (PRA) and Remote Support (RS) products, InsightVM and Nexpose customers have been able to assess exposure with authenticated checks for Windows systems (Scan Engine only checks) as of the February 10, 2025 content release.

Disclosure timeline

  • January 27, 2025: Rapid7 makes initial contact with the PostgreSQL security team and discloses vulnerability details.
  • January 29, 2025: The PostgreSQL development group confirms the finding; Rapid7 and PostgreSQL developers agree on a coordinated disclosure date.
  • February 11, 2025: The PostgreSQL development group provides a CVE ID and affected versions.
  • February 13, 2025: This disclosure.
