Automate public TLS certificate issuance with ACME support in AWS Certificate Manager

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/automate-public-tls-certificate-issuance-with-acme-support-in-aws-certificate-manager/

If you manage TLS certificates for your applications, you know the challenge: certificates expire, and when they do, your customers see errors or your service goes down. As certificate validity periods get shorter (the Certification Authority (CA)/Browser Forum has mandated a reduction in maximum validity to 100 days starting March 2027, and to 47 days by 2029), manual renewal processes become untenable. You need automation.

Automatic Certificate Management Environment (ACME) is an open protocol for requesting, renewing, and revoking TLS certificates without human intervention. It’s the same protocol behind Let’s Encrypt, and it’s supported by dozens of clients across every platform.

Today we’re announcing ACME support for public certificates in AWS Certificate Manager (ACM). ACM now provides a fully managed ACME server endpoint that works with any ACMEv2-compatible client, such as Certbot, cert-manager for Kubernetes, acme.sh, or any other client you already use. You can issue public TLS certificates from Amazon Trust Services through the standard ACME protocol.

Before today, if you wanted automated certificate management using the ACME protocol, you relied on external certificate authorities alongside ACM, which fragmented visibility: some certificates lived in ACM, while others were managed externally with no central dashboard. PKI administrators had limited ability to control who could request certificates or which domains were allowed.

With ACME support in ACM, you can now set up one or more managed ACME endpoints that allow you to centrally manage and monitor ACME certificate usage across your organization.

As a PKI administrator, you get centralized controls that go beyond basic certificate issuance. You can bind IAM roles to ACME accounts for fine-grained access control over which domains each client can request. You can define domain scopes at the endpoint level to enforce organization-wide policies. And you get centralized monitoring and visibility in the same place: AWS CloudTrail logs every certificate request for auditability, Amazon CloudWatch tracks operational metrics, and ACM sends expiry notifications when certificates are approaching renewal. Using ACM, your PKI team can search all certificates, whether issued through the ACM console, an API call, or ACME.

How it works
To get started, you first set up a dedicated ACME endpoint, configure authorization controls using External Account Binding (EAB), validate which domains the endpoint can issue certificates for, and point your existing ACME clients to the new endpoint.

The domain validation step is important: it separates who can set up certificate issuance from who can request certificates. The PKI administrator validates domains once at the endpoint level, using DNS credentials that stay with the admin. Application owners who need certificates never touch DNS. They register with an EAB credential, and the endpoint enforces which domains and scopes they’re allowed to request. This means you can distribute certificate automation broadly across your organization without distributing DNS keys along with it.

I start this demo from the ACME certificates page in the AWS Certificate Manager console.

ACME Console

I already have a few endpoints and certificates in this account, but I walk you through creating a new one from scratch. First, I select Create ACME endpoint.

ACME - Create endpoint 1

I give my endpoint a name. The Endpoint type is Public. ACME clients will connect over the public internet. The Certificate type is Public. The certificate will be issued by Amazon Trust Services and trusted by browsers and operating systems by default. For the certificate key type, I keep the default ECDSA P-256. RSA 2048 and ECDSA P-384 are also available if your clients require them.

ACME - Create endpoint 2

Scrolling down, I configure the domain. I enter my domain name and select the domain scope. The scope controls exactly what certificate patterns your ACME clients are allowed to request for this domain. If I check only Exact domain, clients can only request certificates for that specific domain name. Adding Subdomains allows certificates for any subdomain (for example, api.example.com or dev.example.com). Adding Wildcards allows wildcard certificates (*.example.com). By leaving a scope unchecked, you prevent any client using this endpoint from requesting that type of certificate, even if their ACME request is otherwise valid. For a production endpoint, you might enable only Exact domain and Subdomains while leaving Wildcards unchecked to enforce a stricter security posture.
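The scope rules described above can be sketched as a small check. This is only an illustration of the described behavior, not ACM's implementation: the `DomainScope` class and `is_allowed` function are hypothetical names, and the treatment of nested wildcards (such as `*.api.example.com`) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainScope:
    """Hypothetical model of one validated domain and its checked scopes."""
    domain: str
    exact: bool = True
    subdomains: bool = False
    wildcards: bool = False


def is_allowed(scope: DomainScope, requested: str) -> bool:
    """Return True if the requested identifier fits the configured scope."""
    base = scope.domain.lower().rstrip(".")
    name = requested.lower().rstrip(".")
    if name.startswith("*."):
        # Wildcard request: needs the Wildcards scope. Assumption: only a
        # wildcard directly under the base domain is accepted in this sketch.
        return scope.wildcards and name[2:] == base
    if name == base:
        return scope.exact
    # Any other name must end in ".<base>" and needs the Subdomains scope.
    return scope.subdomains and name.endswith("." + base)


# The "production" posture from the text: no wildcards.
production = DomainScope("example.com", exact=True, subdomains=True, wildcards=False)
print(is_allowed(production, "api.example.com"))  # True
print(is_allowed(production, "*.example.com"))    # False
print(is_allowed(production, "example.org"))      # False
```

The point of the sketch is that the decision depends only on the endpoint's configuration, so a client cannot widen its own permissions by crafting a different request.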

I also select my Amazon Route 53 hosted zone from the drop down menu. ACM then automatically creates the DNS CNAME records needed for domain validation, so I don’t have to do it manually. When my domain is hosted outside of Route 53, I manually create the provided CNAME record at my DNS provider instead. This is a meaningful difference from typical ACME setups where each client handles its own domain verification independently.

These centralized controls give PKI administrators a single place to authenticate domains, restrict which certificate types (ECDSA or RSA) clients can request, and further limit wildcard issuance. Having these governance capabilities built in means you don’t need to purchase a separate certificate lifecycle management product or invest in building a custom policy layer yourself, both of which come at significant cost and operational overhead.

I select Create ACME endpoint.

ACME - DNS configuration

After a few seconds, the endpoint is created. The console shows a Setup progress tracker with the next steps. My domain shows a “Validating” status. The validation method is DNS validation, where ACM verifies that you control the domain by checking for a specific CNAME record. Because I selected my Route 53 hosted zone during creation, I select Create records in Route 53 to let ACM handle the DNS validation automatically.

ACME - DNS success

The validation completes in a few seconds and the status changes to Success.

ACME - External Account Binding 1

Now I need to create External Account Binding (EAB) credentials. An EAB credential is a key identifier and HMAC key pair that lets your ACME client register an account with the ACME server. Once registered, the client generates its own asymmetric key pair, which is then used to authenticate all subsequent certificate requests. On the endpoint details page, I select the External account binding tab, then select Create EAB. I give the credential a name and optionally set an expiration time, ideally no longer than needed to complete client registration.
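To make the role of the two EAB values concrete, here is a minimal sketch of the binding object an ACME client builds at registration time, following RFC 8555, section 7.3.4: a JWS over the client's account public key, signed with HMAC-SHA256 using the EAB key. Your ACME client does this for you. The function name, the sample JWK, and the URL are hypothetical, and the sketch assumes the HMAC key is handed out base64url-encoded.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWS requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring any stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def build_eab(kid: str, hmac_key_b64url: str, account_jwk: dict, new_account_url: str) -> dict:
    """Build the externalAccountBinding JWS (flattened JSON serialization)."""
    protected = b64url(json.dumps(
        {"alg": "HS256", "kid": kid, "url": new_account_url}
    ).encode("utf-8"))
    # The payload is the public key of the account the client is registering.
    payload = b64url(json.dumps(account_jwk).encode("utf-8"))
    signing_input = f"{protected}.{payload}".encode("ascii")
    mac = hmac.new(b64url_decode(hmac_key_b64url), signing_input, hashlib.sha256).digest()
    return {"protected": protected, "payload": payload, "signature": b64url(mac)}


# Placeholder values for illustration only.
eab = build_eab(
    kid="example-kid",
    hmac_key_b64url=b64url(b"k" * 32),
    account_jwk={"kty": "EC", "crv": "P-256", "x": "placeholder", "y": "placeholder"},
    new_account_url="https://acme.example/new-account",
)
print(sorted(eab))  # ['payload', 'protected', 'signature']
```

Because the HMAC key is only needed for this one registration request, a short expiration on the credential limits how long a leaked key remains useful.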

ACME - External Account Binding 2

ACME - end of configuration - show key

After I select Create EAB credential, the console shows the Key ID and HMAC Key. I note these values because I need them to configure my ACME client. The setup progress now shows four green checkmarks.

ACME - end of configuration - success

I’m ready to request a certificate. On the endpoint details page, I expand the CLI reference section. The console provides ready-to-use command examples for both Certbot and acme.sh. I copy the Certbot command and run it inside a container using the certbot/certbot image.

certbot certonly --standalone --non-interactive --agree-tos \
    --email <EMAIL> \
    --server https://acm-acme-enroll.us-east-1.api.aws/<ENDPOINT_ID>/directory \
    --eab-kid <EAB_KID> \
    --eab-hmac-key <EAB_HMAC_KEY> \
    --issuance-timeout <ISSUANCE_TIMEOUT> \
    -d <DOMAIN>

I replace the placeholders with my endpoint URL, EAB credentials, and domain name. The --eab-kid and --eab-hmac-key arguments are how Certbot registers with your ACME endpoint using the External Account Binding credentials I generated earlier. Each ACME client has its own syntax for this step, so check your client’s documentation for the exact flags.

Certbot contacts the ACME endpoint and returns a valid certificate signed by Amazon Trust Services.

Certbot to obtain a certificate through ACME

I use openssl to view the certificate before installing it.

openssl to view the certificate

The certificate is now visible in the ACM console under the ACME certificates tab, alongside any certificates issued through the console or API.

Certificate view in the ACME console

Availability and pricing
ACME support in AWS Certificate Manager is available today in all commercial AWS Regions and will be available in AWS GovCloud (US), the China Regions, and the AWS European Sovereign Cloud partitions at a later date.

Pricing is per domain included in each certificate at the time of issuance, with a different price for fully qualified domain names and wildcards. Volume tiers are calculated based on total domain occurrences across all certificates issued per month in your AWS account. For details, see the ACM pricing page.
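As a rough illustration of the billing unit described above, the following sketch counts domain occurrences across a month of issued certificates, split into fully qualified names and wildcards. It contains no prices or tier boundaries (see the ACM pricing page for those). The function name and the sample certificates are hypothetical, and counting a name once per certificate it appears in is my reading of "domain occurrences".

```python
from collections import Counter


def count_domain_occurrences(certificates):
    """Count names across certificates, split into FQDN and wildcard buckets.

    `certificates` is an iterable of name lists, one list per issued certificate.
    """
    counts = Counter()
    for names in certificates:
        for name in names:
            kind = "wildcard" if name.startswith("*.") else "fqdn"
            counts[kind] += 1
    return counts


# Hypothetical month of issuance: three certificates.
issued = [
    ["example.com", "www.example.com"],
    ["*.example.com"],
    ["example.com"],
]
counts = count_domain_occurrences(issued)
print(counts["fqdn"], counts["wildcard"])  # 3 1
```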

To get started, visit the ACM section on the AWS console or read the documentation.

— seb

Creative Commons founders’ fireside chat (Creative Commons blog)

Post Syndicated from jzb original https://lwn.net/Articles/1080518/

Dee Harris has published a summary of the recent “fireside chat”
featuring Creative Commons founders Hal Abelson, Lawrence (Larry)
Lessig, Molly Van Houweling, and Glenn Otis Brown. The chat was to mark
the 25th anniversary of Creative Commons and included a look back at its
history as well as a look at the landscape today:

Twenty-five years ago, a small group of people made a bet. They
believed that if you gave creators a simple set of tools and licenses
in language that a lawyer, a machine, and a human could all read,
millions of people might choose to share their work with the world
instead of locking it down.

The video of the chat is available on YouTube.

AMD Pivots From HBM to LPDDR5X For New Versal Premium Gen 2 Memory on Package Chips

Post Syndicated from Ryan Smith original https://www.servethehome.com/amd-pivots-from-hbm-to-lpddr5x-for-new-versal-premium-gen-2-memory-on-package-chips/

With HBM in short supply, AMD’s next generation of adaptive SoCs will be switching from HBM to LPDDR5X memory. The Versal Premium Gen 2 Memory on Package chips target the same compact form factor, but with a 15+ year projected lifecycle.

The post AMD Pivots From HBM to LPDDR5X For New Versal Premium Gen 2 Memory on Package Chips appeared first on ServeTheHome.

[$] Flexible metaprogramming with Rhombus

Post Syndicated from daroc original https://lwn.net/Articles/1079001/

Lisp-like languages have historically led the world in metaprogramming and
flexibility. While many modern languages have adopted the idea of macros,
Lisp-like languages such as Racket have continued pushing the envelope,
attempting to make macros as easy as possible to incorporate into everyday
programs. On the other hand, Lisp’s minimal, parenthesis-based syntax can be
hard to adapt to, to the point that Lisp is sometimes said to stand
for “Lots of Irritating Silly Parentheses”.

Rhombus is a new programming language that aims to have the best of both
worlds, marrying Racket’s metaprogramming capabilities to a simple
Python-like syntax and reasonable standard-library defaults.

Security updates for Tuesday

Post Syndicated from jzb original https://lwn.net/Articles/1080439/

Security updates have been issued by AlmaLinux (git-lfs, perl-Archive-Tar, perl-IO-Compress, python3.12-urllib3, and runc), Debian (sogo), Fedora (perl-DBI and perl-Socket), Oracle (firefox, freerdp, git-lfs, libsoup, libxml2, mod_md, mysql, perl-Archive-Tar, perl-IO-Compress, python, python3.12-urllib3, rsync, thunderbird, tomcat, xorg-x11-server, and xorg-x11-server-Xwayland), SUSE (389-ds, 7zip, alsa, amazon-ecs-init, amazon-ssm-agent, ansible-core, apache2, atril, avahi, bind, bitcoin, capnproto, chromedriver, chromium, cosign, distribution, dnsdist, docker, dovecot24, dracut, firefox, firewalld, freeipmi, freerdp, giflib, gimp, gleam, glib-networking, glibc, glycin-loaders, golang-github-prometheus-alertmanager, google-cloud-sap-agent, google-guest-agent, graphite2, gsasl, hamlib, helm, himmelblau, ignition, imagemagick, istioctl, jackson-databind, jq, jupyter-jupyterlab-templates, keylime, krb5, ldns, libaom, libcaca, libgcrypt, libheif, libinput, libjxl, libnfs, libslirp-devel, libsolv, libzypp, zypper, libssh2_org, libvncserver, libyang, lldpd, logback, loupe, mbedtls, mbedtls-2, mcphost, mozjs128, mutt, nano, nginx, ocaml, ofono, openCryptoki, opencryptoki, opensc, openssh, openssl-3, papers, perl-compress-raw-zlib, perl-config-inifiles, perl-cpanel-json-xs, perl-crypt-passwdmd5, perl-DBI, perl-dbi, perl-html-parser, perl-http-daemon, perl-libwww-perl, perl-protocol-http2, postfix, postgresql14, postgresql15, postgresql16, python-aiohttp, python-biopython, python-click, python-ecdsa, python-idna, python-markdown, python-joblib, python-paramiko, python-pdm, python-pip, python-py7zr, python-pydata-sphinx-theme, python-pyjwt, python-python-multipart, python-starlette, python-tornado6, python311-jupyter-ydoc, rpcbind, sed, sg3_utils, sqlite3, strongswan, tar, thunderbird, tomcat, tomcat10, tomcat11, trivy, unbound, util-linux, warewulf4, webkit2gtk3, xar, xwayland, yt-dlp, and zypper, libzypp, libsolv), and Ubuntu (libheif, nss, qemu, roundcube, and sqlite3).

The Realities of AI Video Surveillance

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2026/06/the-realities-of-ai-video-surveillance.html

The Financial Times has a good article on how AI is changing the capabilities of video surveillance, with information from both Israel/Iran and Russia.

I wrote about this sort of thing a few years ago, how AI enables mass spying in the way that computers and networks enabled mass surveillance. The interesting development in the article is that AI allows people to ask natural language questions about video footage to AIs—and AIs can answer them.

In contrast with older tools restricted to a few dozen preset searches, these new tools allow an almost unlimited range of enquiries by enabling language-based searches on video.

That lets intelligence officers hunt through massive streams of videos using simple search terms, such as two men handing a bag to each other; a person who has changed their appearance, or has changed clothes multiple times in a day; or a vehicle that has recently been painted over, or has driven past the same spot several times in a short period.

“This is the holy grail of surveillance,” said a European official whose country uses the technology on its cities. “We are able to look for behaviour, not objects – it has created a world of new possibilities.”

The Island of the Banished. Traumas of the Past Surface on the Shores of Gökçeada (Part Two)

Post Syndicated from Георги Тотев original https://www.toest.bg/ostrovut-na-prokudenite-vtora-chast/

<< To part one


The night air is cold and biting. Mahmud walks with a group of Afghans, Iranians, and Pakistanis across rocky mountain terrain lit only by the moon. Somewhere ahead of them is the border between Iran and Turkey, high in the mountains above Van in Turkey’s Kurdish region. The smugglers give them brief instructions: when they reach the border fence, cut it and run.

“If the police come from the right, run left. If they come from the left, run right,” Mahmud recalls them saying. “Don’t stop running. If they catch you, büyük şamar, a big slap, and they send you straight back to Iran!”

A few weeks earlier in Kandahar, Mahmud had sold almost everything he owned, including his old but beloved moped. He managed to obtain a one-month student visa for Iran. He reached Tehran, where he searched in vain for someone to help him cross illegally into Turkey. Discouraged, he decided to go home. Fate, however, caught up with him in Mashhad, a city near the borders with Afghanistan and Turkmenistan. It was there that he finally met a people smuggler who, for a substantial sum, promised to take him to Turkey. Mahmud thought the hardest part of the journey was behind him. In reality, his ordeals were only beginning.

Bulgaria in the 1980s is a carefully stage-managed police state, a country of long queues for basic goods and subsidised summer holidays on the Black Sea coast. Kaniye and Raif enter the decade immersed in work and family life. “We had a big house in Dobrich, and the resort of Albena was nearby,” Kaniye recalls. “We spent the best years of our youth there.”

The Turkish minority never drops out of the regime’s sight.

Although the authorities officially claim to uphold the socialist principles of equality and brotherhood, they do not hesitate to use Bulgarian nationalism for their own ends. In the 1980s the regime revives old fears of the “Turkish threat” and of nearly five centuries of Ottoman rule. Books and films recall the collective trauma of life under Ottoman rule, a period presented in school as a time of oppression and mass enslavement. In the closed communist system, official propaganda goes largely unchallenged, and the Turkish minority is increasingly portrayed as a potential “fifth column”.

The yard of Kaniye and Raif’s house © Georgi Totev

Suspicions deepen after a series of bombings on public transport, a highly unusual and deeply unsettling phenomenon for a tightly controlled police state. In August 1984, attacks are carried out at the railway station in Plovdiv and at the airport in Varna. The bloodiest attack is on the Burgas–Sofia train in 1985, known as the Bunovo station bombing. Seven people die, among them two children. The attacks, carried out by Turkish nationalist extremists, cement the image of the Turkish minority as the embodiment of the “enemy within”.

It is in this atmosphere of fear and suspicion that the so-called Revival Process begins: a campaign of forced assimilation aimed at Bulgaria’s Muslim communities. The decision is taken and carried out almost at lightning speed at the end of December 1984. Within just a few weeks, close to one million people, more than a tenth of the country’s population at the time, are forced to replace their names with Slavic ones. Besides the Turkish minority, the measures also affect Muslims from the Roma community, as well as the Pomaks.

Raif © Georgi Totev

Raif remembers police officers and heavily armed soldiers arriving in the town to enforce the decree.

“Everyone who had a Turkish name was summoned to the town hall and forced to choose a new one from a list prepared in advance. In public places we had to speak Bulgarian and use our new names. My colleagues were ordered to call me ‘Rumen’. One man refused, though. His name was Petyo. He told the police: ‘You can beat me, you can even kill me, but I have known him as Raif all my life. I can’t call him Rumen.’”

The forced assimilation campaign is presented by the authorities as an attempt to “restore the Bulgarian roots” of the affected communities. Such measures are in fact not without precedent. Since the 1950s the communist regime has run campaigns aimed at “curing” the Pomaks of their supposed “Turkish identity”. Between 1982 and 1984, around 50,000 people from the Turkish, Roma, and Pomak communities are forced to change their names, a dress rehearsal of sorts for the events of December 1984 and the Revival Process that followed.

The scale of what happened, however, is incomparable with the earlier campaigns.

Mosques are closed, Muslim cemeteries are desecrated, and Turkish books, magazines, and music are banned. Speaking Turkish in public leads to a fine or a beating. Hundreds of members of the Turkish community are sent to the labour camp at Belene on the Danube island of Persin. To this day Raif wonders what the authorities could have been thinking at the time.

“Back in the 1970s they were already warning us that they could change our names, as they had done with the Pomaks. But we didn’t believe it would happen.”

On 23 May 1985 (the date is etched in his memory) Raif is dismissed from the electricity distribution company where he works. “Not a single Turk was left at work. They were afraid of sabotage.”

Mahmud is caught shortly after crossing the border into Turkey. He ends up in a migrant detention centre in the eastern city of Van, the first in a series of such centres he will pass through. There he begins to learn Turkish, without giving up his dream of one day reaching Western Europe. While housed in a migrant centre near Amasya, in Turkey’s Black Sea region, he starts working illegally in the surrounding fields. “I asked in the camp whether I could work legally. They told me: ‘No.’ I asked: ‘Then what am I supposed to eat?’ They answered: ‘That’s your problem. Why did you come at all?’”

Mahmud © Georgi Totev

From Amasya his route continues west through Istanbul and Bursa until he finally reaches Çanakkale, a bustling port city on the shore of the Dardanelles. When he first stands beside the narrow strait connecting the Aegean Sea and the Sea of Marmara, Mahmud looks at the opposite shore and imagines that Europe begins there.

It really is Europe, but not the one he had imagined.

Mahmud tries to reach Greece by sea. Once he is caught by the coast guard and returned. Another time the engine fails and the boat drifts out of control for seven hours before help arrives. He tries by land too. One night he manages to cross the border near the town of İpsala, but he gets lost and, without realising it, ends up back on Turkish territory. On another attempt he gets as far as Alexandroupolis (northeastern Greece) before being detained by the police. Unluckily for him, at the time he is wearing a chain with the Turkish flag and a T-shirt bearing the image of Mustafa Kemal Atatürk. “The police decided I was a Turkish smuggler,” he says. “They took everything from me: my phone, my SIM card.”

After yet another failed attempt, Mahmud returns to Çanakkale discouraged. There he receives a job offer from an elderly man named Hasan. The only condition is that he leave the city and move to a nearby island. “I thought I would work for a month, save a little money, and set off again. But then Hasan appeared, and after that, Gökçeada… Gökçeada…” he says with a smile and a certain resignation.

Christos Taliadouros works in a café with a beautiful view that opens out from the slopes of the island village of Zeytinli. He also does tattoos. Like Violeta and Dimitris, he settled on the island relatively recently, when restrictions on the cultural life of the Greek community gradually began to fall away. For him it is a homecoming of sorts. He was born in the same village where he works today, but in the early 1990s his parents sent him to Istanbul to study at a Greek school.

“I always dreamed of coming back for good one day,” says Christos.

His family is among the few Greek-speaking families that remained on the island despite the decades of restrictions and pressure during the 20th century.

Christos © Georgi Totev

After the island’s last Greek school is forced to close in 1964, the few remaining Greek families begin sending their children to study in Istanbul. There, church schools run by the Ecumenical Patriarch of Constantinople continue to teach in Greek and remain outside the broader restrictions on Greek-language education. “The Patriarch took us under his wing,” says Christos. By coincidence, the Patriarch is also a native of the same village, Zeytinli.

In the early 1920s Imbros, as the island is still called then, has ten Greek schools attended by nearly 1,500 children. After the island passes under Turkish control, the authorities gradually begin restricting instruction in Greek. A law passed in 1927 effectively bans teaching it in state schools and forces children of Greek origin to study their language outside school hours in private schools.

The restrictions are temporarily eased in the 1950s, but in the 1960s they are reintroduced in an even stricter form. Ultimately this leads to the closure of the island’s last Greek school, an institution that had become a symbol of its Greek heritage and identity.

The Island of the Banished. Traumas of the Past Surface on the Shores of Gökçeada (Part One)

Beneath the wings of the kitesurfers on a Turkish island, the fates of refugees, of the expelled, of returnees, and of people seeking a new home intersect. Georgi Totev tells us about Gökçeada through the personal stories of its inhabitants.

From the 1960s onward, the island’s remaining Greek-speaking community is subjected to constant pressure. Many choose to leave and start a new life in Greece, Western Europe, or Australia. They are often forced to sell their homes at prices far below market value. The Turkish state’s attitude toward the minority hardens further because of the prolonged crisis in Cyprus, where the Greek and Turkish communities drift ever further apart in pursuit of two incompatible goals: union of the island with Greece or with Turkey, respectively.

In the mid-1960s the Turkish authorities begin implementing a 27-point plan known as the “Eritme” (“Dissolution”) programme, whose aim is to gradually erase the remnants of Greek identity on the island.

The Orthodox community is barred from holding collective property other than churches. It is during this period that the Greek schools are closed, along with the buildings of local community institutions.

Other measures heighten the sense that the community is besieged and under constant pressure. Police barracks and an open prison are built on the island. Accounts from the time describe how prisoners, many of them convicted of serious crimes, move about freely and harass local residents.

“People remember, but they avoid talking about it. The wound has still not healed,” says Violeta.

Although the subject rarely finds a place in Turkish society, the creation of the open prison occupies “a central place in the memory of the forced departure from the island” among Gökçeada’s former Greek residents, notes Ümit Eser of Necmettin Erbakan University. According to him, the cumulative effect of the policies pursued during this period “profoundly changes everyday life and the sense of security on the island”.

In April 1989 Raif receives an unexpected summons to report to the local police station. Information is scarce: he is told to bring clothes, a little money, and food for a few days. But no one tells him where he will be sent. Some speak of France, others of Africa, still others of Romania. He hastily packs his bags. Kaniye sees him off. The police put Raif, together with a group of Bulgarian citizens of Turkish origin, on a train. “We got to Sofia. Only then did they tell us they were deporting us to Turkey.” Another train journey follows, this time to the border. The group gets off in Edirne, the first large city on the other side of the border.

“I had no idea which way to go. It was the first time I had set foot in Turkey. Our Turkish was one thing, theirs was something else entirely,” Raif recalls.

In the living room of his home he slowly stirs his tea. The metal spoon rings in the small tulip-shaped glass, filling the pauses in his story. His expulsion turns out to be the harbinger of a far larger exodus. At the beginning of June 1989, amid growing concern over the situation of the Turkish minority in Bulgaria, Turkey opens its border. Over the following three months, between 320,000 and 360,000 Bulgarian citizens of Turkish origin cross it in trains, buses, and cars loaded with furniture and household goods.

The authorities in Sofia call this migration “the big excursion”, a euphemism that presents ethnic cleansing as an ordinary tourist trip.

The Turkish authorities direct some of the new arrivals to Çorlu, a town east of Edirne. “They gave us tea and food. Then the organising began: who would be sent where.” Raif is alone in the new country. “I couldn’t call home. There was no way to find out how my wife and daughter were.” A whole year will pass before the two of them manage to join him in Turkey.

Kaniye © Georgi Totev

As Raif speaks, Kaniye opens a small drawer and takes out a martenitsa. She holds it carefully in her hands. “This is one of the few things I brought with me from Bulgaria. I don’t wear it, because it’s the only one I have and I’m afraid of losing it.” She smiles sadly and adds: “There are no martenitsas here.”

Raif looks at his glass of tea. “I used to speak Bulgarian so well that no one could tell I was a Turk.” Today the language is gradually slipping away from him, he says. “So many years have passed.” Kaniye’s voice softens as she returns to her memories. “The hardest part was yet to come. No one should be forced to leave their homeland and start their life over somewhere else.”


This article was produced as part of the Fellowship for Journalistic Excellence, with the support of the ERSTE Foundation and in cooperation with the Balkan Investigative Reporting Network (BIRN).

Editor of the original text: Neil Arun
Translation: Georgi Totev

Scale analytics with Amazon Redshift multi-warehouse enhancements

Post Syndicated from Raza Hafeez original https://aws.amazon.com/blogs/big-data/scale-analytics-with-amazon-redshift-multi-warehouse-enhancements/

Onboard analytics workloads at scale with Amazon Redshift’s improved remote table data definition language (DDL), materialized view improvements, and concurrency scaling enhancements for zero-ETL and auto-copy.

As organizations scale their analytics capabilities, they need the ability to add workloads without disrupting production operation or being constrained by the resources of a single data warehouse. In this post, we introduce new capabilities of Amazon Redshift that enhance our multi-warehouse and scaling capabilities: remote materialized view (MV) operations, remote table DDL support, and concurrency scaling enhancements for zero-ETL and S3 event integration. These features help you build more scalable, performant decentralized analytics architectures on Amazon Redshift.

Let us review how these new features enable you to run analytics at scale.

New remote materialized view operations

New remote table DDL operations

  • ALTER TABLE ALTER DISTSTYLE operations now work on remote warehouses through concurrency scaling and data sharing. You can dynamically optimize data distribution across distributed environments, improving query performance and resource utilization without requiring data migration. This is especially valuable for data engineers fine-tuning performance across multiple warehouses and administrators adapting to changing query patterns.
  • ALTER TABLE APPEND operations now extend to remote warehouses through concurrency scaling and data sharing. This consolidates data across distributed environments, so you can efficiently combine tables without complex data movement or extract, transform, and load (ETL) processes. Organizations managing dynamic table operations across multiple environments can maintain data consistency while reducing operational overhead.

Concurrency scaling improvements

With these new concurrency scaling capabilities, you can maintain consistent data freshness without compromising existing warehouse performance. This eliminates the traditional trade-off between analytics and data loading. Apart from turning on concurrency scaling, no additional changes are required to take advantage of these features.

Customer use cases

This section covers two industry use cases: the first for a financial services customer and the second for a gaming industry customer.

Financial services use case

The following is a sample architecture for a large financial services customer with global operations. This customer uses a multi-warehouse architecture built on Amazon Redshift.

Financial services multi-warehouse architecture using STG, DWH, ETL, and USR Amazon Redshift warehouses

The staging (STG) warehouse serves as a raw zone for data from various sources, like the bronze layer of a medallion architecture. This warehouse also cleanses and standardizes the raw data to the silver layer and makes it available for further processing. The STG warehouse uses MVs to process millions of nested JSON messages and extract attributes into scalar columnar Amazon Redshift tables.

CREATE MATERIALIZED VIEW rawdb.fsi.customer_orders_raw
distkey(c_custkey) sortkey(c_custkey) AS (
    SELECT c_custkey,
        o.o_orderstatus,
        o.o_totalprice,
        o_idx
    FROM customer_orders_lineitem c,
        c.c_orders o AT o_idx
);
REFRESH MATERIALIZED VIEW rawdb.fsi.customer_orders_raw;

The DWH warehouse serves as the primary Amazon Redshift instance and gold layer, providing data to consuming applications like Business Objects and Tableau. The zero-ETL concurrency scaling improvements provide consistent data freshness even when zero-ETL ingestion spikes occur alongside heavy DWH workloads. The DWH MVs provide fast access to aggregated data for Tableau extracts and Business Objects live reports. The DWH warehouse takes advantage of concurrency scaling when multiple MVs need to be refreshed on the DWH instance.

CREATE MATERIALIZED VIEW bodb.final.customer_churn_tbl
AS (
    SELECT state,
        account_length,
        area_code,
        total_charge/account_length AS average_daily_spend,
        cust_serv_calls/account_length AS average_daily_cases,
        churn
    FROM custdb.final.customer_activity_all
);
REFRESH MATERIALIZED VIEW bodb.final.customer_churn_tbl;

The ETL01/02 warehouses serve as dedicated compute environments for running project-specific ETL jobs, while the USR01/02 warehouses handle user workloads such as ad-hoc analysis or model building from dbt. When new objects are required by user workloads, they are created and maintained on the remote producer warehouse (DWH).

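-- Change the distribution key of the reporting table
ALTER TABLE salesdb.final.sales_report_all
ALTER DISTKEY sales_id;
-- Move rows from the staging table into the reporting table
ALTER TABLE salesdb.final.sales_report_all
APPEND FROM stagingdb.sales.sales_2026_02;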

Gaming industry use case

A leading gaming company has built their entire analytics infrastructure on AWS, with their analytics team managing data streaming from games, data warehousing, and business intelligence tools. They standardized Amazon Redshift across the organization, migrating off Vertica running on Amazon Elastic Compute Cloud (Amazon EC2). After overcoming early challenges with cluster resize operations, the team became strong advocates for Amazon Redshift and now runs their primary production cluster on 32 ra3.16xlarge nodes.

As their data ingestion pipeline grew, query workloads began competing with data ingestion processes, creating performance bottlenecks. Rather than scaling up their primary cluster, they implemented a workload isolation strategy using Amazon Redshift data sharing. The customer launched a second 16-node ra3.4xlarge cluster as a data share consumer, with the primary cluster serving as the producer. This architecture allowed them to migrate consumption workloads to the consumer cluster while the producer focused on data ingestion, effectively supporting growth without increasing the primary cluster size.

Gaming company architecture with a producer Amazon Redshift cluster sharing data to a consumer cluster

Recognizing the advantages of this distributed architecture, the gaming company expanded their approach by migrating workloads to Amazon Redshift Serverless, further using the data sharing model for workload isolation. Amazon Redshift’s remote materialized view capability allowed the gaming company to create materialized views directly on the data shared by the producer cluster. Each consumer cluster could now build materialized views optimized for its specific workload patterns. This created pre-aggregated datasets, custom join strategies, and workload-specific data distributions, without impacting the producer cluster’s performance or requiring data duplication. The producer warehouse maintains data distribution and sorting strategies designed for generic enterprise needs, providing consistent data quality across all consumers. Meanwhile, consumer warehouses used remote materialized views to fine-tune query performance for their distinct analytical requirements, whether supporting real-time player analytics, business intelligence dashboards, or ad-hoc data science workloads. This distributed approach to data consumption optimization proved essential for the gaming company. It delivered fast query performance across diverse analytical workloads while maintaining a single source of truth in the producer cluster and avoiding the operational overhead of managing redundant data copies.

Best practices

To get the most out of these new capabilities, consider the following best practices:

  • Enable concurrency scaling on your Amazon Redshift clusters and Serverless workgroups to allow ETLs and user queries to run even faster, providing consistent report and dashboard performance.
  • Set up usage limits for concurrency scaling on Amazon Redshift provisioned clusters, and configure an appropriate MaxRPU setting on Serverless workgroups. This helps you avoid unexpected additional costs. For more information, see the Amazon Redshift usage limits documentation.
  • Use remote MVs to offload resource-intensive MV creation and refresh operations from your primary warehouse to remote data share clusters.

Conclusion

In this post, we walked through the new MV refresh features, remote table DDL capabilities, and expanded concurrency scaling support for zero-ETL and S3 auto-copy. These features help you move beyond the constraints of a single warehouse. They are particularly valuable for organizations managing distributed data architectures that require dynamic table management across multiple environments while maintaining data consistency and adapting quickly to changing workloads. To get started, make sure you are running the latest Amazon Redshift version. Then visit the Amazon Redshift documentation to learn more about concurrency scaling, data sharing, and materialized views.


About the authors

Raza Hafeez

Raza is a Senior Product Manager, Technical at Amazon Redshift. He has 15+ years of experience building and optimizing enterprise data warehouses and is passionate about making cloud analytics accessible and cost-effective for customers of all sizes.

Ravi Animi

Ravi is a senior product leader in the Amazon Redshift team and manages several functional areas of the Amazon Redshift cloud data warehouse service, including spatial analytics, streaming analytics, query performance, Spark integration, and analytics business strategy. He has experience with relational databases, multidimensional databases, IoT technologies, storage and compute infrastructure services, and more recently, as a startup founder in the areas of artificial intelligence (AI) and deep learning, computer vision, and robotics.

Satesh Sonti

Satesh is a Principal Analytics Specialist Solutions Architect based in Atlanta, specializing in building enterprise data platforms, data warehousing, and analytics solutions. He has over 20 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Milind Oke

Milind is a senior Redshift specialist solutions architect who has worked at Amazon Web Services for three years. He is an AWS-certified SA Associate, Security Specialty and Analytics Specialty certification holder, based out of Queens, New York.

What the June 2026 Threat Technique Catalog update means for your AWS environment

Post Syndicated from Shannon Brazil original https://aws.amazon.com/blogs/security/what-the-june-2026-threat-technique-catalog-update-means-for-your-aws-environment/

The AWS Customer Incident Response Team (AWS CIRT) encounters patterns that repeat across engagements when helping customers respond to security incidents. We’re passionate about making sure that information is accessible so that everyone can improve their security posture and their organization’s resilience to disruption. The primary method we use to share this information is the Threat Technique Catalog for AWS (TTC). The latest update to the catalog for June 2026 focuses on container security, organization-level trust, and compute hijacking. Each new entry reflects something we’ve encountered in practice, and each provides straightforward mitigation. This post breaks down what changed, why it matters, and what you can do about it today.

What we’re seeing

We’ve added five new entries to the TTC.

EKS workload modification

Amazon Elastic Kubernetes Service (Amazon EKS) gives teams powerful orchestration capabilities. We’re seeing threat actors who have obtained Kubernetes credentials or an AWS Identity and Access Management (IAM) role with EKS permissions modify running workloads—altering container images, injecting sidecar containers, or changing pod specifications to introduce malicious code into a deployment.

Nothing new is created. The workload already exists, it might be running in production, and by modifying it in place the threat actor inherits the network access, service account permissions, and data access the legitimate workload already had. Without admission controllers or image verification, these changes can go unnoticed until the impact shows up downstream. Enforcing image signing through admission controllers, restricting workload changes with Kubernetes role-based access control (RBAC), and enabling Amazon GuardDuty EKS Protection to surface anomalous cluster activity all reduce this risk. For more information, see EKS Modification – Workload Integrity Degradation.

Exploit public-facing application – EKS

Publicly exposed Kubernetes API servers and misconfigured ingress controllers continue to be an entry point we see exploited. This technique captures threat actors targeting the customer-deployed workloads running on Amazon EKS—not EKS itself—and their exposure to the internet.

The pattern starts with an exposed service and an application-level weakness, then pivots from the compromised pod toward broader cluster access. Once inside a pod, a threat actor can query the instance metadata service, read mounted service account tokens, or move laterally across the cluster network. Limiting public exposure of the Kubernetes API server, applying network policies to restrict pod-to-pod communication, and running workloads with least-privilege service accounts reduce the risk of this technique succeeding. For more information about this technique, see Exploit Public-Facing Application.

Assume root into organization member account

AWS Organizations centralizes trust across member accounts, and that trust runs in one direction—from the management account downward. We’ve observed threat actors who compromise a management account—or gain sufficient privilege within one—use that position to assume root access into member accounts using sts:AssumeRoot. Because the trust is inherent to the organization structure, this can avoid the access controls a member account administrator has configured.

With root access to a member account, a threat actor can disable security controls, delete resources, change billing configurations, and establish persistence that survives remediation focused on IAM principals. We strongly encourage implementing service control policies (SCPs) that restrict which principals can call sts:AssumeRoot and under what conditions, and monitoring for sts:AssumeRoot calls in AWS CloudTrail. For more information, see Assume Root into Organization Member Account.
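As a rough illustration of the SCP idea above, the following Python sketch builds a policy document that denies sts:AssumeRoot to every principal except an approved list. The role ARN, the statement ID, and the choice of an ArnNotLike condition on aws:PrincipalArn are assumptions made for this example, not guidance taken from the catalog entry. Adapt the condition to your own organization before using anything like it.

```python
import json


def build_assume_root_scp(allowed_principal_arns):
    """Return an SCP-style policy that denies sts:AssumeRoot unless the caller
    matches one of the approved principal ARNs (illustrative sketch only)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAssumeRootExceptApproved",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": "sts:AssumeRoot",
                "Resource": "*",
                "Condition": {
                    "ArnNotLike": {
                        "aws:PrincipalArn": list(allowed_principal_arns)
                    }
                },
            }
        ],
    }


# Hypothetical break-glass role; replace with the principals you actually trust.
policy = build_assume_root_scp(
    ["arn:aws:iam::111122223333:role/SecurityBreakGlass"]
)
print(json.dumps(policy, indent=2))
```

Keep in mind that SCPs do not restrict principals in the management account itself, so a policy like this mainly constrains accounts that have been delegated to perform root actions. CloudTrail monitoring for sts:AssumeRoot remains necessary either way.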

Compute hijacking – EKS

Compute hijacking remains one of the most common motivations we see behind unauthorized access, and Amazon EKS clusters are increasingly the target. Threat actors deploy cryptocurrency mining or other compute-intensive workloads inside compromised clusters, consuming customer resources and generating unexpected cost.

What sets EKS-based hijacking apart is scale. In clusters without resource quotas, a single compromised service account can consume all available capacity across nodes. The workloads use legitimate-looking images pulled from public registries, which makes image scanning alone insufficient. Setting resource quotas and limit ranges, restricting which registries workloads can pull from, and enabling Amazon GuardDuty EKS Protection to flag mining behavior together limit the impact and improve detection. For more information, see Resource Hijacking: Compute Hijacking – EKS.

Invite accounts to unknown organization

A threat actor with access to a standalone account—or one they’ve removed from its legitimate organization—invites it into an organization they control. After the account joins, it falls under the threat actor’s governance. The threat actor’s organization can apply SCPs that restrict the legitimate owner’s actions, gain visibility into the account’s resources through organizational services, and access consolidated billing information. The legitimate owner finds themselves locked out of their own governance controls. Monitoring organizations:InviteAccountToOrganization and organizations:AcceptHandshake, and implementing SCPs that prevent accounts from leaving their legitimate organization are important preventive measures. For more information, see Modify Cloud Resource Hierarchy: Invite Accounts to Unknown Organization.

What’s updated

We’ve refreshed three existing entries. S3 Object Collection now captures additional API calls used for bulk data staging from Amazon Simple Storage Service (Amazon S3), with refined detection guidance and mitigations that use recent Amazon S3 security features. Compute Hijacking – ECS adds methods threat actors use to deploy unauthorized tasks in Amazon Elastic Container Service (Amazon ECS), including abuse of overly permissive task execution roles. Role Assumption and Federated Access has been expanded to cover new cross-account role assumption variations and identity provider manipulation, with sharper guidance for distinguishing legitimate federated access from unauthorized use.

The current trend

This June update reflects a clear trend: threat actors are increasingly targeting container orchestration platforms and using organizational trust relationships to their advantage. The container techniques show that as organizations adopt Kubernetes at scale, the attack surface grows with it. The organization-level techniques show that threat actors understand organizational trust relationships.

The common thread is that every one of these techniques operates within the boundaries of legitimate functionality. Modifying a workload, assuming cross-account trust, and joining an organization are all expected actions in healthy environments. Detection, then, depends entirely on context: the principal, the timing, and the sequence of events that follows.

The Threat Technique Catalog for AWS is designed to help with this. We encourage teams to review the relevant entries and assess whether their current monitoring would catch these patterns:

  • Unexpected modifications to EKS workload specifications
  • Pod deployments that use unsigned container images
  • sts:AssumeRoot calls into member accounts
  • Unbounded compute consumption in your EKS clusters that could be prevented by resource quotas
  • Unexpected organization invitations to your accounts

Each of the threats leaves traces in AWS CloudTrail and Kubernetes audit logs, and the TTC provides specific guidance on what to watch for and how to respond.
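To make the CloudTrail side of this checklist concrete, here is a minimal Python sketch that filters a batch of CloudTrail records for the API calls named in this post. The sample records are invented for illustration, and the watch list is an assumption based only on the event names mentioned above. A real detection would also weigh the principal, the timing, and the events that follow.

```python
# (eventSource, eventName) pairs worth a closer look, per the techniques above.
WATCHED_EVENTS = {
    ("sts.amazonaws.com", "AssumeRoot"),
    ("organizations.amazonaws.com", "InviteAccountToOrganization"),
    ("organizations.amazonaws.com", "AcceptHandshake"),
}


def flag_records(records):
    """Return the CloudTrail records whose source and name are on the watch list."""
    return [
        record
        for record in records
        if (record.get("eventSource"), record.get("eventName")) in WATCHED_EVENTS
    ]


# Hypothetical records, trimmed to the fields used here.
sample_records = [
    {"eventSource": "sts.amazonaws.com", "eventName": "AssumeRoot",
     "eventTime": "2026-06-01T02:14:00Z", "recipientAccountId": "111122223333"},
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "eventTime": "2026-06-01T02:15:00Z", "recipientAccountId": "111122223333"},
    {"eventSource": "organizations.amazonaws.com", "eventName": "AcceptHandshake",
     "eventTime": "2026-06-01T02:20:00Z", "recipientAccountId": "444455556666"},
]

flagged = flag_records(sample_records)
for record in flagged:
    print(record["eventTime"], record["eventName"], record["recipientAccountId"])
```

Matching on both the event source and the event name avoids false hits from unrelated services that happen to reuse an operation name.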

Looking ahead

The Threat Technique Catalog for AWS exists because we believe the patterns we observe during security engagements shouldn’t stay behind closed doors. When we see techniques repeating across customers, the most effective thing we can do is document them and make that knowledge available so you can act on it before you’re in the middle of an incident.

This June update adds five new entries and updates three existing ones, and the catalog will continue to evolve. Our team updates it based on what we’re seeing in the real world when helping customers respond to security events. We encourage security teams to review the catalog, incorporate its techniques into threat modeling exercises, and use it as a shared vocabulary for discussing cloud-specific threats.

Explore the full catalog: Threat Technique Catalog for AWS – Full Matrix

Additional resources

If you have feedback about this post, submit comments in the Comments section below.


Shannon Brazil

Shannon Brazil is a Sr. security engineer, managing a team on the AWS Customer Incident Response Team (CIRT), specializing in digital forensics and cloud security investigations. Known in the community as 4n6lady, she is passionate about security education and mentoring the next generation of defenders.

Cydney Stude

Cydney is a security engineer specializing in threat intelligence and incident response at AWS. Cydney works on the ground in incident response and is passionate about turning observables into security outcomes. Cydney is an author and maintainer of the Threat Technique Catalog for AWS.

Javier Teitelbaum

Javier is a security engineer on the AWS Customer Incident Response Team (CIRT), with a focus on building and threat intelligence.

Lessons learned from scaling to 1 million Lambda functions

Post Syndicated from Ben Freiberg original https://aws.amazon.com/blogs/architecture/lessons-learned-from-scaling-to-1-million-lambda-functions/

In this post, we share our journey and the lessons learned from building and running a fully serverless, multi-account software as a service (SaaS) platform at scale. We’ll explore why true scale-to-zero is critical, how we handle quota management, why engaging AWS service teams early saved us from outages, and which unexpected practices emerged once we scaled from thousands to over a million functions.

At ProGlove, we build smart wearable barcode scanning solutions that connect frontline workers to digital workflows. Our scanners integrate with Insight, our AWS-based SaaS platform, to provide real-time visibility into processes, helping customers in manufacturing, logistics and retail improve productivity, reduce errors and enhance ergonomics on the shop floor.

We chose a one-AWS-account-per-tenant architecture to achieve clearer security boundaries, streamlined ownership of services, and more transparent costs. With dedicated tenant resources, efficiency matters at scale, because resource wastage scales too. The ability to scale to zero removes this concern.

Phase 1: The “simple” origins (0 to 1,000 Lambda functions)

When you first build a serverless system, you think in single digits. A handful of AWS Lambda functions, maybe a few dozen at most. It’s hard to imagine what changes when your platform operates thousands of AWS accounts and deploys over one million Lambda functions into production, each isolated to a single customer’s account.

We followed standard playbooks, where “scale-to-zero” was merely a nice-to-have. We used serverless best practices like Amazon Simple Queue Service (Amazon SQS) for decoupling and long-polling to keep the application responsive and resilient. At this scale, a few idle functions or a handful of accounts were a negligible expense and the benefits of a high-level managed service like AWS Lambda really showed.

Microservice composition

Each microservice in our platform follows a consistent structure: 5 to 15 Lambda functions coordinated by AWS Step Functions, with Amazon EventBridge handling event routing and Amazon DynamoDB as the primary data store.

Architecture diagram showing a microservice composition with Lambda functions, Step Functions, EventBridge, and DynamoDB

These resources are bundled together into a dedicated AWS CloudFormation stack for deployment.

As we onboarded our first handful of tenants, it quickly became clear that deploying and updating AWS CloudFormation stacks individually per account wouldn’t scale. We adopted AWS CloudFormation StackSets, which let us push infrastructure updates to multiple accounts in parallel from a central management account. At this stage, StackSets felt like a superpower. One deployment operation and many accounts are updated simultaneously. We evaluated building a fully custom replacement later, but ultimately concluded that the maintenance overhead wasn’t worth the marginal control gains and stayed with StackSets as our core mechanism.

Phase 2: The first 50 accounts

Growing to 50 tenant accounts forced us to confront problems that weren’t visible at single-digit scale. Three areas in particular required deliberate architectural decisions: observability, account provisioning, and quota isolation.

Automating account creation

We knew manual provisioning would not scale. Instead, we built an automated account factory on top of AWS Organizations. An AWS Step Functions workflow in the management account handles the full provisioning lifecycle: creating the account, applying baseline service control policies (SCPs), bootstrapping cross-account IAM roles, and triggering the initial CloudFormation StackSet deployment, all through cross-account AWS Lambda invocations. New tenant accounts go from request to ready in under 15 minutes, at near-zero incremental cost per provisioning run.

Account provisioning workflow using AWS Organizations and Step Functions

The quota isolation benefit

One underappreciated advantage of the account-per-tenant model is quota separation. Each account gets its own Lambda concurrent execution limit, its own Amazon API Gateway throttle, and its own service quotas across the board. In a shared-account SaaS model at this scale, a single noisy tenant could exhaust shared concurrency and cause cascading failures across all other tenants. With account isolation, that class of problem simply doesn’t exist as each tenant’s activity is bound to their own account.

Phase 3: Scaling challenges (the self-DDoS)

As our fleet grew beyond a few hundred accounts, we began to experience the “Physics of Scale”. We discovered that when hundreds of backend service instances simultaneously access other services, the resulting request volume can resemble a coordinated attack, impacting not only our own infrastructure but also AWS.

On one occasion, we saw a massive metric spike in which our own functions effectively overwhelmed our internal APIs, much like a DDoS attack. The root cause was synchronized schedules: every Lambda function used the same rate(5 minutes) expression, so invocations aligned to the top of the minute across thousands of accounts.

The solution was request scattering. We now use a standardized internal library that enforces jitter, randomized batch offsets, and staggered updates across all scheduled functions.

Rule of Thumb: “Never do the same thing at the same time everywhere”.
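The internal library itself is not shown in this post, so the following Python sketch is only one plausible way to implement request scattering. It derives a stable per-account offset inside the schedule window from a hash of the account ID and adds a small random jitter on top. The function names, the 300-second window, and the 15-second jitter cap are all assumptions made for illustration.

```python
import hashlib
import random


def stable_offset_seconds(account_id, window_seconds=300):
    """Map an account ID to a fixed offset within the schedule window, so that
    accounts spread across the window instead of firing at the same instant."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


def jittered_delay(account_id, window_seconds=300, max_jitter_seconds=15, rng=random):
    """Stable per-account offset plus a small random jitter, in seconds."""
    offset = stable_offset_seconds(account_id, window_seconds)
    return offset + rng.uniform(0, max_jitter_seconds)


# Hypothetical account IDs, used only to show the spread.
for account in ["111122223333", "444455556666", "777788889999"]:
    print(account, stable_offset_seconds(account))
```

A scheduled function could wait for this delay before calling shared APIs, or the offset could be baked into each account's schedule at deployment time so that no compute is spent waiting. Either way, accounts stop acting in lockstep.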

Multi-account observability as a cost driver

With several dozen accounts, manual log access per account became unworkable. We adopted a third-party observability platform, forwarding Amazon CloudWatch logs and metrics cross-account to a centralized dashboard. At roughly $3 per account per month, the cost felt insignificant.

That assumption was soon replaced by a very real learning: at thousands of accounts, $3 per account per month becomes an impactful expense that demands active management. We learned to treat per-account observability costs with the same scrutiny you apply to compute costs.

What came as a surprise to us were the actual cost drivers: instead of Lambda compute or storage costs, we found that forwarding all observability data almost doubled our cloud bill. As a result, we had to learn how to differentiate between high and low priority observability data and only move around the priority data.

With all mitigations combined, we brought observability costs down to around $0.70 per account per month. Additionally, we were able to reduce the cost of an account to almost $0 after a period of inactivity by monitoring only a small set of very basic metrics.
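A quick back-of-the-envelope calculation shows why a small per-account fee matters at fleet scale. The per-account figures below come from this post, while the fleet size of 5,000 accounts is an assumed number chosen only to illustrate "thousands of accounts". Amounts are kept in cents to avoid floating-point rounding.

```python
def monthly_cost_cents(accounts, cents_per_account):
    """Total monthly observability cost for the fleet, in cents."""
    return accounts * cents_per_account


ACCOUNTS = 5000  # assumed fleet size for illustration, not a figure from the post

before = monthly_cost_cents(ACCOUNTS, 300)  # $3.00 per account per month
after = monthly_cost_cents(ACCOUNTS, 70)    # $0.70 per account per month

print(f"before: ${before / 100:,.2f} per month")
print(f"after:  ${after / 100:,.2f} per month")
print(f"saved:  ${(before - after) / 100:,.2f} per month")
```

Under that assumption, the same fee that looked negligible for a few dozen accounts becomes a five-figure monthly line item, which is why it deserves the same scrutiny as compute.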

Phase 4: Rethinking architectural patterns for scale-to-zero

One of the most painful lessons was realizing that traditional Amazon SQS “best practices” increased costs for our use case and at our scale.

Replacing SQS and the DLQ dilemma

After we scaled to over a thousand AWS accounts, we understood that “idle” doesn’t necessarily mean there are no costs, even when using serverless services. When Lambda functions consume events from EventBridge through an SQS queue to increase resilience, the queue is polled constantly even when there are no messages to process.
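To get a feel for how idle polling adds up, the sketch below counts receive requests against an empty queue. It assumes five concurrent pollers per queue and a 20-second long-poll wait, which are illustrative parameters rather than figures from this post. The fleet size and queues per account are likewise hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def idle_receive_requests(pollers=5, wait_seconds=20, days=30):
    """Receive requests made against an empty queue over the given period,
    assuming each poller issues one long-poll request per wait interval."""
    polls_per_day = SECONDS_PER_DAY // wait_seconds
    return pollers * polls_per_day * days


per_queue = idle_receive_requests()

ACCOUNTS = 1000          # assumed fleet size
QUEUES_PER_ACCOUNT = 10  # assumed number of queues per account
fleet_total = per_queue * ACCOUNTS * QUEUES_PER_ACCOUNT

print(f"{per_queue:,} requests per idle queue per month")
print(f"{fleet_total:,} requests across the fleet per month")
```

The per-queue number looks harmless in isolation. Multiplied across every queue in every tenant account, it becomes a steady cost for tenants that process no events at all.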

To eliminate the cost of continuous polling, we removed Amazon SQS from the path between Amazon EventBridge and AWS Lambda.

  • Metric-Driven Safety: Instead of relying on a queue to buffer requests, we monitor AsyncEventsDropped and ConcurrentExecutions to make sure we stay within our quotas without losing events.
  • The Centralized DLQ: Polling individual Dead Letter Queues (DLQs) in every account reintroduced the same polling cost issues. We solved this by routing failures to a centralized DLQ as shown in the following two diagrams.
  • The Isolation Trade-off: This approach requires extreme discipline to make sure we don’t break our data isolation patterns, as events from different tenants converge in a single location for recovery. Because of cost implications at scale, the use of SQS moved from a silo to a bridged model where the AWS account ID can be treated as a tenant ID.

Individual dead letter queue per queue architecture

Individual DLQ per queue

Centralized dead letter queue polling architecture

Centralized DLQ polling

Phase 5: Industrializing the deployment engine

Serverless architectures grow to large numbers of infrastructure components: where a monolith or Amazon Elastic Compute Cloud (Amazon EC2)-based service might be a handful of resources, a single microservice in our stack spans dozens of Lambda functions, EventBridge rules, DynamoDB tables, and Step Functions state machines. Multiplied across thousands of accounts, deployment complexity compounds quickly.

Initially, we used AWS CloudFormation StackSets to roll out updates in parallel. However, at the scale of 1 million Lambda functions, StackSets hit a performance ceiling and occasionally produced errors that added up significantly at our volume.

From custom engines to collaborative roadmaps

The bottlenecks became such a blocker that we began building our own internal serverless deployment system to replace StackSets. This caught the attention of the AWS CloudFormation service team, who committed to supporting our use case at the scale we required and partnered with us closely from that point on.

By engaging early and often, we were able to:

  • Influence the Roadmap: We provided the scale requirements that helped AWS prioritize StackSet stability and performance improvements.
  • Automate Resiliency: We built a deployment tracking service that aggregates StackSet events through Amazon EventBridge. A central AWS Step Functions state machine now acts as our “single-pane-of-glass,” acting on failures and triggering retries for occasional AWS internal errors.

Phase 6: Mature governance and FinOps

Scaling a serverless platform with a small team of engineers requires consistent and efficient governance practices. This applies to both cloud governance and engineering practices. Without them, it is next to impossible to keep software delivery performance and reliability at a high level over time.

Cost optimization also changes at a higher maturity level: once cost control is tightly monitored and automated, the discipline shifts from housekeeping tasks that collect easy savings to increasingly complex architectural changes. For example, if a new feature significantly increases the number of Lambda invocations and drives up cost, you need to rethink the architecture with cost as an explicit design goal.

The mono-repo strategy

We consolidated 20 microservices into a single mono-repo. This helped us to:

  • Enforce consistent tooling and security scanning across more than a million functions.
  • Coordinate runtime and library upgrades through a single source of truth for configuration.
  • Make sure every change passes through the same CI/CD chain with guaranteed compatibility.

The “Almost-Zero” Reality

Even with a scale-to-zero mandate, we learned that “zero” is often “almost-zero”.

  • The Monitoring Tax: We avoided services like NAT Gateways, but monitoring introduced additional costs such as CloudWatch Alarms. Aggregating metrics in external observability tools added up quickly.
  • The Optimization Payoff: By aggressively optimizing these costs, we reduced our idle cost for inactive accounts to less than $1 per month.

Think beyond the obvious services

One of the most valuable habits we built was resisting the urge to immediately default to a familiar pattern or write custom code. AWS offers a growing catalog of fully managed, event-driven services such as Amazon EventBridge Pipes, AWS AppSync, Amazon SQS FIFO, and others, that can remove entire categories of custom Lambda code. Before writing a function, ask whether a native service integration already solves the problem.

A deliberate research step of exploring native AWS capabilities before opening an editor consistently paid off. It reduces the surface area you own, eliminates maintenance burden, and builds the team’s instinct for choosing the right service over reinventing it. Serverless Land is an excellent starting point for discovering patterns and service combinations you may not have considered.

Conclusion: Scaling efficiency faster than growth

Scaling from 0 to 1M Lambda functions across thousands of AWS accounts is a question of efficiency, not of capacity. Every new account, every new customer, adds potential operational load. The only way to stay ahead is to make sure efficiency scales faster than growth. For us, that means true scale-to-zero, proactive and efficient quota management, tight collaboration with AWS service teams, disciplined developer education, and a mono-repo that enforces consistency.

We’ve learned that the difference between success and failure at this scale lies in unexpected aspects like the hard-learned fact that observability becomes an increasingly complex problem the more distributed your platform becomes.

The benefits are substantial. With the right automation and architectural rigor, a lean team can operate a large-scale infrastructure. Using a cloud-native approach based on serverless services is the most important operational advantage in this case.

To apply these lessons to your own workloads, discover event-driven patterns and service combinations on Serverless Land.


About the authors

Preventing data exfiltration in machine learning environments with Amazon SageMaker AI

Post Syndicated from Ajish Abraham original https://aws.amazon.com/blogs/architecture/preventing-data-exfiltration-in-machine-learning-environments-with-amazon-sagemaker-ai/

If you’re building machine learning solutions with sensitive data, you face a persistent challenge: preventing data exfiltration while enabling data scientists to work productively. iBusiness, an AI-driven fintech organization, needed its data scientists to work with sensitive data to fine-tune and improve machine learning models. As the data science team scaled, traditional air-gapped environments and monitored virtual desktops proved unsustainable, leading to high costs and operational complexity.

In this post, we demonstrate how iBusiness implemented a three-layered security architecture using Amazon SageMaker AI, virtual private cloud (VPC) endpoints, and Amazon WorkSpaces Secure Browser to prevent data exfiltration while maintaining data scientist productivity. You can adapt this approach to build secure machine learning environments that balance strict data protection with team scalability.

Historically, when access to sensitive data was required, iBusiness provided an isolated, air-gapped on-premises environment. However, with the shift to a remote workforce, this approach became impractical. The company locked down secure virtual desktops through device management policies and had them monitored by proctors to prevent inappropriate actions.

As the data science team scaled and expanded machine learning (ML) use cases, this approach proved unsustainable. Each user required a dedicated virtual desktop, even for temporary access, leading to increased costs. Additionally, maintaining ML tools, libraries, and patches in these locked-down environments was time-consuming and operationally complex.

To address these challenges, iBusiness adopted Amazon SageMaker Studio, a fully managed, web-based ML development environment. This removed the need to maintain in-house Jupyter environments while giving data scientists access to up-to-date tools. Furthermore, SageMaker AI’s integration with AWS services provided straightforward data sharing via AWS Lake Formation and Amazon Athena, reducing the need for manual data transfers.

Solution architecture

To prevent data exfiltration while keeping data scientists productive, iBusiness implemented a three-layered security strategy that you can adapt for your own secure ML environments.

Figure 1: Three-layered security architecture for data exfiltration prevention

Layer 1: Securing access through WorkSpaces Secure Browser

iBusiness used Amazon WorkSpaces Secure Browser, a managed, locked-down browser environment. This managed service provides a controlled Chromium-based browser, which cost the company less than dedicating a virtual desktop to each user.

The company configured the Secure Browser to run within a dedicated VPC and subnet in its IT infrastructure account, routing outbound traffic through a network address translation (NAT) gateway. In the secure data science account, iBusiness enforced AWS Identity and Access Management (IAM) policies that restrict access to requests originating only from AWS services or from the NAT gateway’s Elastic IP address. This configuration helps validate that access to the environment is only possible through the Secure Browser. It gives you confidence that data scientists cannot bypass security controls when you implement a similar approach.

Additionally, the Secure Browser was configured to disable file downloads and uploads, disable clipboard access, and disable printing. These controls help prevent data from being transferred to local machines.

Key Secure Browser controls configured:

  • Disable file downloads and uploads.
  • Disable clipboard access.
  • Disable printing.

Layer 2: Restricting browser activity and cross-account access

Building on this foundation, iBusiness restricted activity within the Secure Browser itself to address potential exfiltration through web-based channels.

Although the browser provides a temporary working directory, iBusiness prevented its misuse by implementing strict URL allowlisting. Users can only access *.aws.amazon.com and specific SageMaker AI domains. Other websites, including email and external storage platforms, are blocked, preventing users from uploading data to external services.

Permitted URL patterns:

  • *.aws.amazon.com
  • Specific SageMaker AI domains
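The allowlist check can be sketched as a host matcher. This is an illustration of the idea, not how WorkSpaces Secure Browser evaluates its policy: the function name allowed is hypothetical, and studio.example.sagemaker.aws stands in for the specific SageMaker AI domains, which the post does not list.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// allowed reports whether rawURL's host matches one of the patterns.
// A pattern is either an exact host or "*." followed by a parent domain,
// in which case only subdomains of that parent match.
func allowed(rawURL string, patterns []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	if host == "" {
		return false
	}
	for _, p := range patterns {
		p = strings.ToLower(p)
		if strings.HasPrefix(p, "*.") {
			suffix := p[1:] // ".aws.amazon.com"
			if strings.HasSuffix(host, suffix) && len(host) > len(suffix) {
				return true
			}
		} else if host == p {
			return true
		}
	}
	return false
}

func main() {
	patterns := []string{"*.aws.amazon.com", "studio.example.sagemaker.aws"}
	for _, u := range []string{
		"https://console.aws.amazon.com/sagemaker",
		"https://mail.example.com/",
	} {
		fmt.Println(u, allowed(u, patterns))
	}
}
```

Matching on the leading dot of the suffix is what keeps a lookalike host such as evilaws.amazon.com from passing the *.aws.amazon.com pattern.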

Preventing cross-account data exfiltration

To help verify users cannot move data to other AWS accounts, iBusiness implemented VPC endpoints for AWS Management Console and AWS IAM Identity Center services. These endpoints route traffic privately within the VPC with no internet exposure. They also enforce endpoint policies restricting access to iBusiness’s specific AWS account, giving you control over which accounts data scientists can access.

The company also configured a private Amazon Route 53 hosted zone to redirect console.aws.amazon.com, *.console.aws.amazon.com, and signon.aws.amazon.com to the company’s VPC endpoints instead of public endpoints. To further mitigate DNS-based exfiltration risks, iBusiness configured Amazon Route 53 Resolver DNS Firewall in the SageMaker AI VPC to block DNS queries to non-approved domains, ensuring that only resolution of required AWS service endpoints is permitted.

This configuration helps verify that users can only authenticate into iBusiness’s secured data science account and that access to other AWS accounts is blocked. To further enforce this, iBusiness applied an IAM policy that enhances the IAM policy from Layer 1. This policy helps confirm actions are sourced from an IAM principal originating from a VPC endpoint and denies actions when the target resource belongs to another AWS account, with minimal exceptions for privileged users.

Layer 3: Securing the SageMaker AI environment

As a final layer of defense, iBusiness secured the SageMaker AI environment itself to prevent data exfiltration through the development environment’s terminal and integrated development environment (IDE) access.

Because SageMaker AI provides terminal and IDE access, it could potentially be used to move data externally. To mitigate this risk, the company removed direct internet access from the SageMaker AI VPC with no NAT gateway or internet routes and configured VPC endpoints for the required AWS services.

This configuration confirms that SageMaker AI can access AWS services internally and function normally while simultaneously blocking direct outbound internet traffic. iBusiness further restricted VPC endpoint policies to allow access only to resources within the organization, providing an additional safeguard against cross-account data movement. VPC endpoint policies allow for granular access to specific AWS resources. For example, an endpoint policy can allow s3:PutObject API calls only to specific Amazon Simple Storage Service (Amazon S3) buckets, depending on the use case.

SageMaker AI network configuration:

  • No NAT gateway or internet routes in the SageMaker AI VPC.
  • VPC endpoints configured for all required AWS services.
  • Endpoint policies restricted to organization-owned resources only.

Conclusion

By implementing this three-layered security architecture, iBusiness achieved an 80% cost reduction, from $40+ per user monthly for individual VDI environments to $7 per user with Amazon WorkSpaces Secure Browser. The solution also transformed IT operations, reducing provisioning from a 2-day SLA to automatic setup within minutes while eliminating ongoing desktop maintenance overhead.

For data scientists, the approach improved both productivity and security by streamlining data access without compromising protection. This demonstrates how you can strengthen security controls while reducing costs and operational complexity.

Start by assessing your current data access controls, then progressively implement each security layer based on your organization’s specific compliance requirements and risk tolerance.



Dual-token authentication for Nakama game servers with Amazon Cognito on AWS

Post Syndicated from Madhusudan Athinarapu original https://aws.amazon.com/blogs/architecture/dual-token-authentication-for-nakama-game-servers-with-amazon-cognito-on-aws/

When your game server needs both a managed identity provider and its own session system, players face a broken experience if authentication forces a redirect or stalls gameplay. Dual-token authentication for Nakama game servers with Amazon Cognito solves this by connecting two independent session systems, each with its own token lifecycle, without interrupting the player. This post shows you how.

Amazon Cognito handles player identity and Nakama manages game sessions. Cognito issues a JWT; a server-side Go hook validates it and exchanges the verified identity for a Nakama session token. Each token is validated independently on every request. The pattern applies to game servers such as Nakama that support runtime authentication hooks.

The infrastructure wraps Nakama in a default-closed routing layer. Amazon CloudFront serves as the single HTTPS entry point, AWS WAF filters traffic at the edge, an Application Load Balancer (ALB) enforces an explicit route allow-list for HTTP, and a Network Load Balancer (NLB) handles WebSocket TCP passthrough. Nakama runs on Amazon Elastic Container Service (Amazon ECS) on AWS Fargate. In this post, we cover the Cognito configuration, the Go hook, the Terraform infrastructure, and the WebSocket lifecycle controls.

In this post, you learn how to:

  1. Configure an Amazon Cognito User Pool for SRP-based game client authentication with no client secret.
  2. Implement a Go runtime hook that validates Cognito JWTs and bridges player identity to Nakama sessions.
  3. Set up a default-closed routing layer using Amazon CloudFront, an ALB, and an NLB.
  4. Manage the WebSocket connection lifecycle under the NLB TCP idle timeout model.

Solution overview

The architecture has four layers for authenticating and routing traffic.

The following diagram shows the architecture. Amazon CloudFront is the single entry point, routing HTTP API traffic through an Application Load Balancer (ALB) to Nakama on Amazon ECS, and WebSocket traffic through a Network Load Balancer (NLB) via TCP passthrough.

Figure 1. Dual-token authentication architecture for Nakama on AWS.

Traffic flows through the system in six steps:

  1. Client → Amazon Cognito — The player authenticates using USER_SRP_AUTH. The password never leaves the client. Amazon Cognito returns a JWT access token.
  2. Client → Amazon CloudFront — Requests enter via Amazon CloudFront (HTTPS). AWS WAF inspects traffic at the edge before it reaches the origin.
  3. CloudFront → ALB (port 80) — /* HTTP API traffic. The ALB is security-group locked to the CloudFront managed prefix list only.
  4. CloudFront → NLB (port 7350) — /ws* WebSocket traffic. The NLB performs TCP passthrough with no HTTP inspection.
  5. ALB → Amazon ECS (Nakama) — For auth requests: the BeforeAuthenticateCustom Go hook validates the Cognito JWT and extracts the sub claim as the Nakama user ID. For other API calls: Nakama validates its own session token.
  6. NLB → Amazon ECS (Nakama) — Persistent WebSocket connection. Nakama validates the session token from the token query parameter at connect time.

Why two load balancers

The ALB and NLB serve different purposes and cannot be combined into one.

The ALB operates at the HTTP layer (Layer 7). It reads the path, applies listener rules, and returns 403 for unlisted routes.

The NLB operates at the TCP layer (Layer 4) and passes the raw stream to Nakama unchanged. Nakama receives the WebSocket upgrade directly from the client, validates the session token, and manages the connection lifecycle end-to-end.

Amazon CloudFront routes /ws* to the NLB and everything else to the ALB, so each connection type gets the appropriate handling behind a single HTTPS endpoint.

Prerequisites

Before you deploy this solution, make sure you have:

  1. Terraform >= 1.5.0.
  2. Go >= 1.21 (to build the Nakama plugin locally).
  3. Docker and the AWS Command Line Interface (AWS CLI) configured with appropriate credentials.

The repository includes a browser-based test app (/app) that demonstrates the full sign-up, sign-in, and Nakama token exchange flow.

Authenticate players with Amazon Cognito

Amazon Cognito provides a managed user directory that issues JWTs without requiring you to run your own identity server or store credentials. The game server validates the JWT independently on each request, with no callback to Cognito needed. This decouples identity from game sessions: Cognito owns the player’s identity, Nakama owns the game session, and neither system depends on the other at runtime.

Players self-register by calling the Cognito SignUp API from the game client. The User Pool verifies their email before the account becomes active.

Authentication uses the USER_SRP_AUTH flow. The password never leaves the client device. The User Pool App Client is configured as a public client with no client secret, since your game client runs in the browser or a native app where any embedded secret is extractable. With SRP, no secret is needed; security comes from the protocol itself.

After a successful sign-in, Amazon Cognito returns a JWT access token. This token carries the player’s identity claims and is signed with an RSA key pair unique to your User Pool. The sub claim — a UUID generated by Cognito — uniquely identifies the player and becomes the Nakama user ID in the next step.

The auth Terraform module configures the App Client with generate_secret=false and permits only ALLOW_USER_SRP_AUTH and ALLOW_REFRESH_TOKEN_AUTH flows. The resulting JWT access token is short-lived (1 hour by default) and carries the sub, iss, exp, and client_id claims that the Go hook validates in the next step.

Bridge Cognito identity to Nakama sessions

Nakama’s server-side runtime supports Go, Lua, and TypeScript modules. This post uses a Go plugin: the hook in this section is written in Go and registered through Nakama’s runtime.Initializer interface.

Once the client has a Cognito JWT, it needs a Nakama session token to make game API calls.

Validate the Cognito JWT in the Go hook

The game server cannot trust the identity claim sent by the client directly. Any client can forge a user ID. JWT validation cryptographically proves the identity was issued by Cognito, preventing player impersonation.

The hook performs five checks in order: token format, algorithm (RS256 only), signature against the JWKS, expiry, and issuer/audience matching your specific User Pool.

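import (
    "crypto"
    "crypto/rsa"
    "crypto/sha256"
    "encoding/base64"
    "encoding/json"
    "strings"
    "time"

    "github.com/heroiclabs/nakama-common/runtime"
)

// jwksCache and expectedIssuer are defined elsewhere in the plugin.
func validateCognitoJWT(token string, env map[string]string) (string, error) {
    parts := strings.Split(token, ".")
    if len(parts) != 3 {
        return "", runtime.NewError("invalid token format", 3) // 3 = INVALID_ARGUMENT
    }

    // Parse the header to get the key ID (kid) and algorithm
    var header struct {
        Kid string `json:"kid"`
        Alg string `json:"alg"`
    }
    headerBytes, err := base64.RawURLEncoding.DecodeString(parts[0])
    if err != nil {
        return "", runtime.NewError("invalid token format", 3)
    }
    if err := json.Unmarshal(headerBytes, &header); err != nil {
        return "", runtime.NewError("invalid token format", 3)
    }

    if header.Alg != "RS256" {
        return "", runtime.NewError("unsupported algorithm: "+header.Alg, 3)
    }

    // Fetch the public key from the JWKS cache
    pubKey, err := jwksCache.getKey(header.Kid)
    if err != nil {
        return "", runtime.NewError("token validation failed", 16) // 16 = UNAUTHENTICATED
    }

    // Verify the RSA signature over "<header>.<payload>"
    hash := sha256.Sum256([]byte(parts[0] + "." + parts[1]))
    signatureBytes, err := base64.RawURLEncoding.DecodeString(parts[2])
    if err != nil {
        return "", runtime.NewError("invalid token signature", 16)
    }
    if err := rsa.VerifyPKCS1v15(pubKey, crypto.SHA256, hash[:], signatureBytes); err != nil {
        return "", runtime.NewError("invalid token signature", 16)
    }

    // Decode the claims only after the signature has been verified
    var claims struct {
        Sub      string `json:"sub"`
        Iss      string `json:"iss"`
        Exp      int64  `json:"exp"`
        ClientID string `json:"client_id"`
    }
    payloadBytes, err := base64.RawURLEncoding.DecodeString(parts[1])
    if err != nil {
        return "", runtime.NewError("invalid token format", 3)
    }
    if err := json.Unmarshal(payloadBytes, &claims); err != nil {
        return "", runtime.NewError("invalid token format", 3)
    }

    // Validate claims: expiry, issuer, audience
    if time.Now().Unix() > claims.Exp {
        return "", runtime.NewError("token expired", 16)
    }
    if claims.Iss != expectedIssuer || claims.ClientID != env["COGNITO_CLIENT_ID"] {
        return "", runtime.NewError("invalid issuer or audience", 16)
    }

    return claims.Sub, nil // sub claim becomes the Nakama user ID
}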

Security note: The hook never trusts the identity string sent by the client. It discards it and overwrites the Nakama user ID with the sub claim from the validated JWT. A client that sends a forged sub cannot impersonate another player — the hook ignores the body value entirely.

Cache JWKS keys with thundering herd protection

Amazon Cognito rotates its signing keys periodically. The hook caches keys with a 1-hour TTL. A 30-second re-fetch guard prevents multiple goroutines from calling the JWKS endpoint simultaneously when the cache expires.

func (c *JWKSCache) refresh() error {
    c.mu.Lock()
    defer c.mu.Unlock()

    // Thundering herd protection: if another goroutine already
    // refreshed within the last 30s, use the updated cache
    if time.Since(c.fetched) < 30*time.Second {
        return nil
    }

    // ... fetch and parse JWKS from Cognito endpoint
}

Register the hook

The hook registers itself in InitModule, the entry point called by Nakama when the plugin loads:

func InitModule(ctx context.Context, logger runtime.Logger, db *sql.DB,
    nk runtime.NakamaModule, initializer runtime.Initializer) error {

    if err := initializer.RegisterBeforeAuthenticateCustom(beforeAuthenticateCustom); err != nil {
        return fmt.Errorf("failed to register hook: %w", err)
    }
    logger.Info("Cognito JWT validation hook registered")
    return nil
}

When the client calls POST /v2/account/authenticate/custom with the Cognito JWT as the id field, Nakama calls beforeAuthenticateCustom before processing the request. If the JWT is valid, the hook sets in.Account.Id = sub and returns. Nakama creates or links the account and returns a session token to the client.

If your server is not Nakama, for example, Colyseus, Photon, or a custom WebSocket server, implement the same five checks (algorithm, signature, expiry, issuer, audience) in your server’s middleware or plugin language. The JWKS endpoint and JWT structure follow the OIDC standard, so any OIDC-compliant identity provider (not only Amazon Cognito) works with this pattern.

Deploy the infrastructure

The infrastructure is organized into six Terraform modules: network (Amazon Virtual Private Cloud (Amazon VPC), subnets, security groups), compute (Amazon ECS cluster, ALB, NLB, Amazon Elastic Container Registry (Amazon ECR)), auth (Cognito User Pool), cdn (CloudFront distribution), waf-cloudfront (AWS WAF Web ACL), and ops (IAM, AWS Systems Manager access). A bootstrap module creates the S3 state backend and AWS Key Management Service (AWS KMS) key before the main deployment.

Deploy with:

# One-time: provision the Terraform state backend
cd terraform/bootstrap && terraform init && terraform apply

# Deploy everything
cd terraform && terraform init -backend-config=config/backend-dev.hcl
make deploy

make deploy builds and pushes the Nakama container image to Amazon ECR, then runs terraform apply. The image tag auto-increments from the latest tag in ECR.

ALB routing: explicit allow list

The ALB default listener action returns 403. Only the paths in the following table reach Nakama. Requests to unlisted paths are rejected before they reach the game server.

Priority | Path | Target | Purpose
1 | /healthcheck | Nakama port 7350 | Health monitoring
2 | /v2/account/authenticate/* | Nakama port 7350 | Session bridge: Go hook validates JWT
10 | /v2/* | Nakama port 7350 | Nakama REST API v2
11 | /v1/* | Nakama port 7350 | Nakama RPC (v1)
Default | * | 403 Forbidden | Request never reaches Nakama

The default-403 posture means a misconfigured client or a scanner probing arbitrary paths gets a 403 at the ALB, not an error from the game server. This limits the attack surface to the explicitly listed API surface.
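To make the evaluation order concrete, the following Go sketch models the table as prioritized rules with a default deny. It is a simplified model, not ALB behavior in full: the matcher handles only exact paths and a trailing wildcard, and the names rule, matches, and route are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// rule mirrors one row of the listener rule table.
type rule struct {
	priority int
	pattern  string
	target   string
}

// albRules reproduces the allow list from the table above.
var albRules = []rule{
	{1, "/healthcheck", "nakama:7350"},
	{2, "/v2/account/authenticate/*", "nakama:7350"},
	{10, "/v2/*", "nakama:7350"},
	{11, "/v1/*", "nakama:7350"},
}

// matches supports exact paths and a single trailing "*" only.
func matches(pattern, path string) bool {
	if strings.HasSuffix(pattern, "*") {
		return strings.HasPrefix(path, strings.TrimSuffix(pattern, "*"))
	}
	return path == pattern
}

// route returns the lowest-priority-number rule that matches path.
// ok is false when nothing matches, which corresponds to the default 403.
func route(rules []rule, path string) (matched rule, ok bool) {
	ordered := append([]rule(nil), rules...)
	sort.Slice(ordered, func(i, j int) bool { return ordered[i].priority < ordered[j].priority })
	for _, r := range ordered {
		if matches(r.pattern, path) {
			return r, true
		}
	}
	return rule{}, false
}

func main() {
	for _, p := range []string{"/v2/account/authenticate/custom", "/v2/rpc/ping", "/admin"} {
		if r, ok := route(albRules, p); ok {
			fmt.Printf("%s -> rule %d, forward to %s\n", p, r.priority, r.target)
		} else {
			fmt.Printf("%s -> 403 Forbidden\n", p)
		}
	}
}
```

Because the authenticate rule has a lower priority number than the general /v2/* rule, authentication requests always match it first even though both patterns apply.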

Security group chain

The network layer enforces three security group rules:

  1. The ALB security group allows inbound only from the CloudFront managed prefix list. As an additional application-layer check, CloudFront sends a shared secret in the X-CloudFront-Secret header on every request; ALB listener rules reject any request missing the correct value with a 403.
  2. The NLB security group allows inbound TCP 7350 only from the CloudFront managed prefix list, applying the same restriction at Layer 4.
  3. The ECS task security group allows inbound port 7350 only from the ALB security group (HTTP API) and from the NLB security group (WebSocket).

Together, the routing and security group chain means the only path to Nakama is: Internet → CloudFront → AWS WAF → ALB or NLB → ECS. No hop can be skipped.

Manage the WebSocket connection lifecycle

The NLB TCP passthrough model creates a lifecycle challenge: the NLB drops idle TCP flows after 350 seconds (the AWS default for TCP listeners). If a player’s connection sits idle, the NLB closes the underlying TCP connection while Nakama still holds an open socket.

The following table lists the NLB idle timeout and the four Nakama settings that handle it:

Control | Value | Purpose
NLB TCP idle timeout | 350s | NLB drops TCP flows that stay idle this long. This is the default value.
Nakama ping interval | 10s | Nakama sends a WebSocket ping every 10s, keeping the TCP flow active.
Nakama pong wait | 20s | If the client does not respond to a ping within 20s, Nakama closes the connection.
token_expiry_sec | 7200 | Nakama rejects session tokens older than 2 hours at connect time.
single_socket | true | A new connection from the same user kills the previous one, preventing stale sessions.

The ping/pong keepalive

The 10-second ping interval is the key control. Nakama sends a WebSocket ping frame every 10 seconds on each active connection. The client responds with a pong. This keeps the NLB TCP flow alive well within the 350-second idle timeout. If the client goes silent, Nakama detects the missing pong within 20 seconds and closes the socket cleanly.

Session expiry at connect time

The NLB performs TCP passthrough, so there is no opportunity to inspect HTTP headers or validate the session token at the network layer. Nakama validates the session token from the token query parameter when the WebSocket upgrade request arrives. A token older than token_expiry_sec is rejected and the connection is closed before any game messages are processed.

Single socket enforcement

single_socket: true ensures that when a player opens a second connection (after a network drop and reconnect, for example), the server closes the first connection. Without this, a player’s Nakama state can be split across two concurrent connections if the client does not cleanly close the first one.

The four-layer model (keepalive, timeout, session expiry at connect, one-connection-per-user enforcement) applies to any real-time server behind an NLB TCP passthrough: Colyseus, Photon, custom WebSocket backends, or any game server that manages persistent connections. If your server does not have built-in ping/pong, implement application-level heartbeat messages that serve the same role.

Security note: The session token travels as a query parameter (?token=...) in the WebSocket upgrade URL. Query parameters appear in server access logs, load balancer logs, Amazon CloudFront logs, and browser history. Mitigations: all connections use TLS (token encrypted in transit), session tokens are short-lived (2 hours), and single_socket invalidates old connections on reconnect. For production deployments, consider log redaction policies for the token parameter.

Clean up

To avoid ongoing AWS charges, destroy all resources when you no longer need them.

Destroy the main infrastructure first:

cd terraform && terraform destroy

Then destroy the Terraform state backend:

cd terraform/bootstrap && terraform destroy

Confirm resources are removed by running terraform state list (should return empty) or checking the AWS Management Console.

Conclusion

In this post, you implemented a dual-token authentication architecture for a Nakama game server on AWS. Amazon Cognito handles player identity through JWT validation; a Go runtime hook bridges verified identity into Nakama sessions; and the infrastructure enforces a routing layer where HTTP API traffic passes through an Application Load Balancer with an explicit allow list and WebSocket connections reach Nakama directly through a Network Load Balancer TCP passthrough.

The four-layer WebSocket lifecycle model applies to any real-time game server behind an NLB TCP passthrough, not only Nakama.

For production deployments, consider these next steps:

  1. Replace the PostgreSQL sidecar with Amazon Aurora PostgreSQL-Compatible Edition for persistent, managed player data storage.
  2. Add a custom domain with TLS re-encryption between Amazon CloudFront and the ALB.
  3. Add Amazon VPC endpoints for Amazon Cognito and AWS Secrets Manager to eliminate the NAT Gateway dependency.

The full Terraform modules and Go plugin are available in the GitHub repository.

For more on Cognito-based game authentication patterns, refer to Using Amazon Cognito to Authenticate Players for a Game Backend Service and Web application access control patterns using AWS services.

Share your questions and feedback in the comments.



Amazon Redshift delivers faster performance for BI dashboards and real-time analytics

Post Syndicated from Stefan Gromoll original https://aws.amazon.com/blogs/big-data/amazon-redshift-delivers-faster-performance-for-bi-dashboards-and-real-time-analytics/

Business intelligence (BI) dashboards and real-time analytics have become essential tools for making informed decisions quickly. Modern data warehouses must excel at complex, long-running analytical queries and also deliver sub-second response times for the short, ad hoc queries that power interactive and real-time experiences. This matters even more as agents explore and derive new insights from massive amounts of data. From executives monitoring key performance indicators on their morning dashboards to data analysts using agents to explore datasets interactively, the expectation is clear: queries should return results fast and predictably.

Amazon Redshift has long been optimized for these use cases. Over the years, we’ve introduced numerous features designed to improve query performance for BI and real-time analytics workloads, including result caching, materialized views, and automatic workload management (AutoWLM). These capabilities have helped thousands of customers build responsive dashboards and real-time applications on Amazon Redshift. However, we know that when it comes to interactive analytics, every millisecond matters. That’s why we keep focusing on making dashboards load faster and helping exploratory queries return results more quickly.

Today, we’re excited to announce a new performance optimization in Amazon Redshift that improves the response times of low-latency SQL queries, such as those used in real-time analytics applications or generated by BI dashboards. With this enhancement, you can experience improved query latencies because of a reduction in the time Amazon Redshift spends preparing SQL queries for execution. SQL queries start faster, so they return results quicker.

How the optimization works

To understand this improvement, let’s first examine one of Amazon Redshift’s existing core performance capabilities: code generation. Code generation is an optimization technique that analyzes each SQL query and generates query-specific C++ code internally. This code is then compiled and executed in parallel across the available Amazon Redshift compute nodes to deliver results back to you. Code generation has been fundamental to Amazon Redshift query performance, executing complex analytical queries with high efficiency.

While code generation results in performant query execution, new queries can experience a one-time compilation overhead the first time they run. Amazon Redshift already caches compiled code, and more than 99% of queries in the Amazon Redshift fleet execute using this cached generated code and experience no compilation overhead. For queries that haven’t been cached yet, the one-time compilation overhead is most noticeable for fast-running queries (for example, millisecond or single-digit second queries), where it can represent a significant portion of total execution time.

With the optimization we announced, Amazon Redshift reduces this compilation overhead. Here’s how it works: when Amazon Redshift receives a query, it first checks if optimized compiled C++ code already exists in the cache from previous executions of similar queries in the Amazon Redshift fleet. If so, it uses that code for best performance. If not, Amazon Redshift now applies a new query compilation optimization that processes new queries immediately using composition. Composition is a technique that generates a lightweight arrangement of pre-existing logic. At the same time, it creates query-specific optimized code that is compiled and executed across available compute resources to boost performance further. Composition removes compilation from the critical path of query execution and provides immediate execution while compilation proceeds in the background. With this optimization, new queries processed by Amazon Redshift start faster and deliver performance consistent with subsequent runs.

This approach ensures that first-time queries start much quicker, while repeated queries continue to benefit from the same leading price-performance that Amazon Redshift code generation delivers.

The best part? No action is necessary for your queries to start benefiting from this performance optimization. This enhancement is now the default for all SQL queries in Amazon Redshift, at no additional cost, for all users on provisioned clusters or serverless workgroups in all AWS Regions where Amazon Redshift is available.

Real-world performance results

We analyzed the impact of this new optimization on Amazon Redshift customer clusters. To do so, we measured the compilation time of the 1% of query segments that didn’t get a cache hit in our compilation cache and therefore required compilation. The following chart shows the results. The P50 compilation time before the optimization was 4.3 seconds. With this optimization, the compilation time dropped 25.7x to 170 ms.

Figure: P50 compilation time on Amazon Redshift before and after the FastCompile optimization, showing a reduction from 4.3 seconds to 170 milliseconds (a 25.7x improvement).

With this optimization, BI dashboards load faster, interactive exploration feels more responsive, and real-time analytics applications can deliver insights with lower latency.

What customers are saying

“Following the significant performance improvements that Amazon Redshift demonstrated for cold query execution on our cluster with the FastCompile query performance feature enabled, achieving 2.4x faster query performance with compilation time reduced from 12 seconds to 5 seconds, we have adopted Amazon Redshift as our analytics solution”

— Vijay Hiremath, Group Manager, Business Platforms, Intuit

“As a data platform leader at a leading Chinese liquor company, we rely heavily on Amazon Redshift as our enterprise data warehouse. With diverse analytical query patterns, we faced performance challenges during initial compilation. After testing Redshift’s new cold query compilation enhancement, cold queries now perform nearly as fast as warm queries, with significantly improved speed on diverse queries”

— Yujie Wang, Data Platform Leader, JNC

“In a mid size customer processing about 85 GB of data daily through complex ETL pipelines — multiple tables, mixed DML operations, all landing into our 1.7 TB Amazon Redshift data warehouse, fast compile enhancements accelerated our post-maintenance ETL pipelines by 25%. Now the customer data loads complete faster, data hits analysts sooner for quick decisions”

— Jagan Mohan, Product Engineering Head, Algonomy

Industry-leading price-performance for all of your workloads

To illustrate the impact of this optimization, we simulated a short-running BI-like low-latency workload using a benchmark derived from the industry-standard TPC-DS benchmark. We ran the workload at a relatively small scale of 100 GB on a 3-node RG.xlarge Amazon Redshift cluster. At this cluster size and scale, queries finish in milliseconds or single-digit seconds, representing the expected latencies of a typical BI dashboard. The derived TPC-DS benchmark includes 99 different queries that represent a mix of realistic business intelligence workloads, including reporting queries, ad hoc analysis, and data exploration patterns. For this test, we compared a single cold run of these queries on an Amazon Redshift RG cluster with the same run on comparable alternative cloud data warehouses. We launched the warehouses, loaded the data, executed a single run of the 99 queries, and measured the total runtime and geometric mean of the queries. No other cluster warm-up or setup was done. This query performance improvement is hardware agnostic. It works on all supported Amazon Redshift hardware instance types, on RA3 and RG on provisioned clusters, and on the hardware that supports serverless workgroups.

The results are shown in the table below and summarized in the subsequent chart. With this new optimization, Amazon Redshift delivers the fastest runtime and geomean for these short queries at the lowest cost, with up to 8.3x better price-performance than the leading alternative data warehouses for new queries.

| Warehouse | Cost / hr | Runtime (sec) | Geomean (sec) | Runtime comparison | Geomean comparison | Geomean price-performance |
| --- | --- | --- | --- | --- | --- | --- |
| Redshift 3-node RG.xlarge | $2.28 | 235 | 1.7 | baseline | baseline | baseline |
| Alternative Warehouse A | $3.00 | 327 | 2.3 | 1.4x slower | 1.3x slower | 1.7x more expensive |
| Alternative Warehouse B | $4.00 | 538 | 3.4 | 2.3x slower | 2x slower | 3.4x more expensive |
| Alternative Warehouse C | $6.00 | 907 | 5.5 | 3.9x slower | 3.2x slower | 8.3x more expensive |

[Figure: Bar chart comparing TPC-DS benchmark price-performance for the Amazon Redshift 3-node RG.xlarge baseline against three alternative cloud data warehouses, showing Amazon Redshift fastest at lowest cost and up to 8.3x better price-performance.]

Conclusion

The new query startup optimization in Amazon Redshift continues our commitment to fast performance across analytical workloads. By reducing compilation overhead, we’ve made BI dashboards and real-time analytics applications more responsive, while maintaining the query execution performance that Amazon Redshift is known for.

Because this optimization is automatically enabled for all Amazon Redshift customers, you can start experiencing these benefits immediately. No configuration changes or query rewrites are required. Your existing queries will run faster.

To learn more, visit Amazon Redshift. To get started, you can try Amazon Redshift Serverless and start querying data in minutes without setting up or managing data warehouse infrastructure. For more details on performance best practices, see the Amazon Redshift Database Developer Guide.

Find the best price performance for your workloads

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark, and has the following characteristics:

  • The schema and data come from TPC-DS unmodified.
  • The queries are used unmodified from TPC-DS. TPC-approved query variants are used for a warehouse if the warehouse does not support the SQL dialect of the default TPC-DS query.
  • The test includes only the 99 TPC-DS SELECT queries. It does not include maintenance and throughput steps.
  • A single power run was executed with query parameters generated using the default random seed of the TPC-DS kit. The total runtime and geomean of that single cold run were used for the results in this post.
  • Price performance is calculated as the geomean in seconds divided by 3,600 seconds per hour, multiplied by the cost of the warehouse per hour. The result is equivalent to the geomean cost per query. Published on-demand pricing is used for all data warehouses.
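The price-performance formula in the last bullet can be sketched in Python as follows. The function and variable names are assumptions chosen for illustration. The inputs are the rounded cost and geomean values from the results table in this post, so ratios recomputed this way can differ slightly from the published comparison figures, which were presumably derived from unrounded measurements.

```python
SECONDS_PER_HOUR = 3600


def geomean_cost_per_query(geomean_sec, cost_per_hour):
    """Geomean (seconds) divided by 3,600, multiplied by the hourly cost (USD)."""
    return geomean_sec / SECONDS_PER_HOUR * cost_per_hour


# (cost per hour in USD, geomean in seconds), as rounded in the results table.
warehouses = {
    "Redshift 3-node RG.xlarge": (2.28, 1.7),
    "Alternative Warehouse A": (3.00, 2.3),
    "Alternative Warehouse B": (4.00, 3.4),
    "Alternative Warehouse C": (6.00, 5.5),
}

baseline_cost, baseline_geomean = warehouses["Redshift 3-node RG.xlarge"]
baseline = geomean_cost_per_query(baseline_geomean, baseline_cost)

# Ratio of each warehouse's geomean cost per query to the Redshift baseline.
ratios = {}
for name, (cost_per_hour, geomean_sec) in warehouses.items():
    cost_per_query = geomean_cost_per_query(geomean_sec, cost_per_hour)
    ratios[name] = cost_per_query / baseline
    print(f"{name}: ${cost_per_query:.5f} per query, {ratios[name]:.2f}x baseline")
```

Because the result is a cost per query, a lower value means better price performance, and the ratio against the baseline corresponds to the "more expensive" column of the table.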

We call this benchmark the Cloud Data Warehouse Benchmark, and you can reproduce the preceding benchmark results using the scripts, queries, and data available on GitHub. It is derived from the TPC-DS benchmark and is not comparable to published TPC-DS results, because our test results do not comply with the specification.

Each workload has unique characteristics. If you’re starting out, a proof of concept is the best way to understand how Amazon Redshift performs for your requirements. When running your own proof of concept, focus on proper cluster sizing and the right metrics: query throughput (the number of queries per hour) and price performance. You can make a data-driven decision by requesting assistance with a proof of concept or by working with a system integration and consulting partner.

To stay current with the latest developments in Amazon Redshift, subscribe to the What’s New in Amazon Redshift RSS feed.


About the authors

Stefan Gromoll

Stefan is a Principal Engineer with Amazon Redshift where he is responsible for Redshift performance across the stack. In his spare time, he enjoys cooking, playing with his three boys, and chopping firewood.

Ravi Animi

Ravi is a Senior Product Management leader in the Redshift team and manages several functional areas of the Amazon Redshift cloud data warehouse service, including performance across the stack, query processing, materialized views, spatial analytics, streaming analytics, and migration strategies. He has deep experience with relational databases, multi-dimensional databases, IoT technologies, and storage and compute infrastructure services, and as a startup founder using AI/deep learning, computer vision, and robotics. He has dual bachelor's degrees in physics and electrical engineering from Washington University in St. Louis, a master's degree in engineering from Stanford, and an MBA from Chicago Booth.

Venkat Govindaraju

Venkat is a Principal Engineer in the Amazon Redshift engineering team. He has designed and developed several major features in Amazon Redshift, including the feature discussed in this post. He holds a Ph.D. in computer science from the University of Wisconsin, Madison.

Kiran Chinta

Kiran is a Senior Development Manager in the Amazon Redshift engineering team. He has led the delivery of several key features in Amazon Redshift. He has extensive experience leading software engineering teams at Amazon Web Services, IBM and other companies.
