Бездната се взира в политиците. ГЕРБ&co

2023-05-05 Емилия Милчева

Post Syndicated from Емилия Милчева original https://www.toest.bg/bezdnata-se-vzira-v-polititsite-gerb-co/

Нищо не е договорено, докато всичко не е договорено. Лидерът на ГЕРБ Бойко Борисов често употребява тази фраза и наскоро пак повтори европейския принцип. Преговорите за правителство, без да е връчен мандат от президента 33 дни след изборите, продължават. Те все повече заприличват на кампания по надлъгване, за да си прехвърлят Черeн Петър – метафората за политическото себесъхранение, добила популярност при предишната провалена въртележка с мандати.

Но макар ситуацията да изглежда отново безнадеждна, този път политиците ще положат повече усилия да сглобят някакво правителство с кратък срок на годност. Не са забравили съдбата, сполетяла „Има такъв народ“ (ИТН) – партията, която от победител на избори, след провалите с двата си проектокабинета, изпадна от стълбицата на медалистите.

Усещат, че бездната се взира в тях*, но трудно ще намерят желаещи за обречена власт, затова и кабинетът най-вероятно ще е слаб, като за отбиване на номер. След местните избори може да се мисли за друг, с нова конфигурация в зависимост от резултатите и от това доколко е пробит монополът на ГЕРБ в местната власт.

Президентът – от „няма да бави“ към „няма да бърза“

Кога ще връчите първия мандат за съставяне на кабинет (на ГЕРБ–СДС – б.а.), питат журналистите президента Румен Радев при всяка публична изява. За три седмици обаче неговата позиция се превъртя от „няма да бави“ до „няма да бърза“.

49-тото Народно събрание има нужда от време, за да се приемат най-важните закони и решения […] Парламентът не може да си позволи лукса да не приеме критично важни закони, дори и да няма правителство,

каза Радев преди десетина дни.

Има предвид приемането на бюджет за тази година и на законите, необходими за втория транш от 724 млн. евро по Плана за възстановяване и устойчивост (ПВУ). Така хем се проявява като загрижен държавник, хем се стреми да осигури комфорт за още едно служебно правителство и неговите разходи. Настоящото на Гълъб Донев разпределя и първия транш от близо 1,4 млрд. евро по ПВУ, получени в края на декември.

Но преди да обсъдят държавния бюджет и бюджетите на държавното обществено осигуряване, НЗОК и на съдебната власт, вече внесени в парламента, политическите сили трябва да определят състава на комисиите и председателите им. Как обаче да го направят, след като не са споделили и разпределили изпълнителната власт? Затова бюджетът чака – и президентът изчаква, а Борисов му пожелава да е жив и здрав, „защото дълго ще управлява и е последното спасение на българите в момента“.

В крайна сметка се намери поредното нестандартно решение – комисиите започват работа с председателите от предходния парламент. Депутатите от „Български възход“ ще бъдат заместени от „Има такъв народ“.

Вече няма как без бюджет

Всички са наясно, че в година на (местни) избори няма как да не бъде приет бюджет. Държавата не може да кара повече с миналогодишния. Ще се раздават 2 милиарда за проекти на общините, бизнесът си чака парите по договорите за строителство, също и болниците. Служебният кабинет е предвидил увеличение от 910 млн. за бюджета на НЗОК, от което над 500 млн. са за лечебните заведения – те ще разполагат с 47% от парите на Касата – 7,027 млрд. лв. за 2023 г.

Първоначално ГЕРБ–СДС и ПП–ДБ се разбраха да постигнат 3% дефицит в бюджета, но не и как – щом като не искат да вдигат данъци, нито пък да премахват данъчни преференции. Но след като няма да бъдат съюзници, остава неясно какво се случва с общите цели – трите процента и законодателната програма. „Джентълменското споразумение“ за ротация на председателското място на 49-тото НС е денонсирано. А точно то даде фалшивите надежди, че е възможен кабинет – въпреки обета на ПП–ДБ, че няма да участва или да подкрепи правителство на ГЕРБ. До обща управленска програма така и не се стигна, затова пък всяка от коалициите осмя и иронизира тази на конкурента – било заради екселските таблици, било заради плавателния канал Русе–Варна, проект от времето на социализма.

Манежен тръс

Макар да повтарят, че като победители в изборите ГЕРБ носят отговорността да предложат кабинет, ПП–ДБ все пак избързаха и представиха първо своя, след като Борисов го поиска с аргумента да преценят могат ли да го подкрепят. Няма как. От „Промяната“ отхвърлиха възможността съставът му да бъде предоговорен. В проектоправителството блестят знакови лица на „Промяната“ – Асен Василев (другият съпредседател Кирил Петков публично заяви, че се отказва), Николай Денков и Даниел Лорер, както и други от парламентарната група, и недотам познати лица от втория ешелон на „Демократична България“.

Партньорът на „Промяната“ демонстрира резигнация и към преговорите, и към селекцията за правителство, независимо от публичните уверения, че всичко в коалицията е наред. Разлом преди местните избори обаче не е възможен. На тях ПП–ДБ се надяват на пробив в София и в още няколко големи града, дълго управлявани от ГЕРБ, и да са в опозиция при евентуален кабинет на ГЕРБ–СДС ще им е от полза за кампанията.

Борисов им обърна гръб след провала на преговорите и показа вариации на план Б: правителство на малцинството или пък споделено от ГЕРБ с други партии в 49-тия парламент (ДПС, БСП, ИТН). Вариантът е да излъчат министри, но без да има коалиционно споразумение с тях, за да не сбъдне „мокрите сънища на ПП–ДБ“. От преговорите дотук ГЕРБ имат най-големи сходства във възгледите с ДПС. „Възраждане“, разбира се, отказаха, за да не последват печалната безславна участ на всички предишни националисти партньори на Бойко Борисов – „Атака“, РЗС, НФСБ, „Воля“, ВМРО. Обаче бяха констатирани сходства между прокремълската формация и евроатлантиците от ГЕРБ в енергетиката и… правосъдната реформа.

В края на деня, в който Борисов се срещна и разговаря с останалите политически сили, от „Продължаваме промяната“ поискаха разговор с ГЕРБ по управленската програма, което, разбира се, е само повод. Даниел Лорер даде най-дипломатичната формулировка:

Имаше голяма доза емоции при обявяването на състава на кабинета на акад. Денков. Това, че обявихме толкова рано, го отчитам като грешка, и то преди да бъде финализирана програмата… Има много голямо сходство между програмите на ПП–ДБ и ГЕРБ–СДС. Гласовете на ГЕРБ ще са ключови при приемане на състава на правителство с втория мандат.

Но на следващата сутрин никой отново не помръдна от позициите си – партията на Борисов настоява за кабинет с първия мандат, коалицията ПП–ДБ държи на правителство с втория.

В крайна сметка всички искат да се покажат конструктивни. Борисов е в новия си образ на диалогичен и разумен политик с опит, който прави всичко в името на стабилността. Кирил Петков и Асен Василев пък са политици, ненарушаващи принципите си и идентичността си на анти-ГЕРБ борци, честни към избирателите си.

Подготвя се нов кръг преговори.
Но бездната се взира във всички политици.

*Перифраза на ницшеанското „Ако твърде дълго се взираш в бездната, бездната започва да се взира в теб.“

Трансгранично партньорство CNN-Румъния сътрудничи с Биволъ за отразяването на атентата срещу Гешев

2023-05-05 Николай Марченко

Post Syndicated from Николай Марченко original https://bivol.bg/cnn-romania-geshev.html

петък 5 май 2023

Водещата румънска новинарска телевизия Antena 3 – CNN (Букурещ) и специалният й пратеник Кристи Попович се базираха на коментарите и разследванията на “Биволъ” при отразяването си от София на атентата…

[$] The end of the accounting search

2023-05-05

Post Syndicated from original https://lwn.net/Articles/925782/

Some things, it seems, just cannot be hurried. Back in 2007, your editor
first started considering alternatives to
the proprietary accounting system that had been used by LWN since the
beginning. That search became more urgent
in 2012, and returned in 2017 with a
focused effort to find something better. But another five years
passed before some sort of conclusion was reached. It has finally happened,
though; LWN is no longer using proprietary software for its accounting
needs.

Web Summit Rio 2023: Building an app in 18 minutes with GitHub Copilot X

2023-05-05 Thomas Dohmke

Post Syndicated from Thomas Dohmke original https://github.blog/2023-05-05-web-summit-rio-2023-building-an-app-in-18-minutes-with-github-copilot-x/

Missed GitHub CEO Thomas Dohmke’s Web Summit Rio 2023 talk? Read his remarks and get caught up on everything GitHub Copilot X.

Hello, Rio! I’m Thomas, and I’m a developer. I’ve been looking forward to this for some time, about a year. When I first planned this talk, I wanted to talk about the future of artificial intelligence and applications. But then I thought, why not show you the full power live on stage. Build something with code.

I’ve been a developer my entire adult life. I’ve been coding since 1989. But as CEO, I haven’t been able to light up my contribution graph in a while. Like all of you, like every developer, my energy, my creativity every day is finite. And all the distractions of life—from the personal to the professional—from morning to night, gradually zaps my creative energy.

When I wake up, I’m at my most creative and sometimes I have a big, light bulb idea. But as soon as the coffee is poured, the distractions start. Slack messages and emails flood in. I have to review a document before a customer presentation. And then I hear my kids yell as their football flew over the fence. And, of course, they want me to go over to our neighbors’ and get it.

Then…I sit back down, I am in wall-to-wall meetings. And by the time the sun comes down, that idea I had in the morning—it’s gone! It’s midnight and I can’t even keep my eyes open. To be honest, in all this, I usually don’t even want to get started because the perception of having to do all the mundane tasks, all the boilerplate work that comes with coding, stops me in my tracks. And I know this is true for so many developers.

The point is: boilerplate sucks! All busywork in life, but especially repetitive work is a barrier between us and what we want to achieve. Every day, building endless boilerplate prevents developers from creating a new idea that will change the world. But with AI, these barriers are about to be shattered.

@ashtom just created and deployed a snake game using GitHub Copilot in 15 minutes

“The point is: Boilerplate sucks. It is just the barrier between us and what we want to achieve.” pic.twitter.com/FDzAb4fPjA

— Art Yavorsky (@yavorsky_) May 3, 2023

You’ve likely seen the memes about the 10x developer. It’s a common internet joke—people have claimed the 10x developer many times. With GitHub Copilot and now Copilot X, it’s time for us to redefine this! It isn’t that developers need to strive to be 10x, it’s that every developer deserves to be made 10 times more productive. With AI at every step, we will truly create without captivity. With AI at every step, we will realize the 10x developer.

Imagine: 10 days of work, done in one day. 10 hours of work, done in one hour. 10 minutes of work, done with a single prompt command. This will allow us to amplify our truest self-expression. It will help a new generation of developers learn and build as fast as their minds. And because of this, we’ll emerge into a new spring of digital creativity, where every light bulb idea we have when we open our eyes for the day can be fully realized no matter what life throws at us.

During my presentation, I built a snake game in 15 minutes.

This morning I woke up, and I wanted to build something. I wanted to build my own snake game, originally created back in 1976. It’s a classic. So, 15 minutes are left on my timer. And I’m going to build this snake game. Let’s bring up Copilot X and get going!

See how you can 10x your developer superpowers. Discover GitHub Copilot X.

Congratulations to @ashtom CEO of @github , for his fantastic presentation at #WebSummitRio showcasing the amazing capabilities of #GitHubCopilot. It's rare to see a CEO with such technical prowess, building a functional app and even a snake game live in just 18 minutes.

— Kamila Paschoal (@PaschoalKamila) May 3, 2023

Comic for 2023.05.05 – Married

2023-05-05 Explosm.net

Post Syndicated from Explosm.net original https://explosm.net/comics/married

New Cyanide and Happiness Comic

Security updates for Friday

2023-05-05

Post Syndicated from original https://lwn.net/Articles/931050/

Security updates have been issued by Debian (chromium, evolution, and odoo), Fedora (java-11-openjdk), Oracle (samba), Red Hat (libreswan and samba), Slackware (libssh), SUSE (amazon-ssm-agent, apache2-mod_auth_openidc, cmark, containerd, editorconfig-core-c, ffmpeg, go1.20, harfbuzz, helm, java-11-openjdk, java-1_8_0-ibm, liblouis, podman, and vim), and Ubuntu (linux-aws, linux-aws-hwe, linux-intel-iotg, and linux-oem-6.1).

Why I’m Not So Alarmed About AI And Jobs

2023-05-05 Bozho

Post Syndicated from Bozho original https://techblog.bozho.net/why-im-not-so-alarmed-about-ai-and-jobs/

With the advances in large language models (e.g. ChatGPT), referred to as AI, concerns are rising about a sweeping loss of jobs because of the new tools. Some claim jobs will be completely replaced, others claim that jobs will be cut because of a significant increase in efficiency. Labour parties and unions are organizing conferences about the future of jobs, universal basic income, etc.

These concerns are valid and these debates should be held. I’m addressing this post to the more extreme alarmists and not trying to diminish the rapid changes that these technological advances are bringing. We have to think about regulations, ethical AI and safeguards. And the recent advances are pushing us in that direction, which is good.

But in technology we often live the phrase “when you have a hammer, everything looks like a nail”. The blockchain revolution didn’t happen, and so I think we are a bit more eager than warranted about the recent advances in AI. Let me address three aspects:

First – automation. The claim is, AI will swiftly automate a lot of jobs and many people with fall out of the labour market. The reality is that GPT/LLMs should be integrated in existing business processes. If regular automation hasn’t already killed those jobs, AI won’t do it so quickly. If an organization doesn’t use automation already for boilerplate tasks, it won’t overnight automate them with AI. Let me remind you that RPA (Robotic process automation) solutions have been advertised as AI. They really “kill” jobs in the enterprise. They’ve been around for nearly two decades and we haven’t heard a large alarmed choir about RPA. I’m aware there is a significant difference in LLMs and RPA, but the idea that a piece of technology will swiftly lead to staff reduction across industries is not something I agree with.

Second – efficiency. Especially in software development, where products like Copilot are already production-ready, it seems that with the increase of efficiency, there may be staff reduction. But if a piece of software used to be built for 6 months before AI, it will be built for, say, 3 months with AI. Note that code writing speed is not the only aspect of software development – other overhead and blockers will continue to exist – requirement clarifications, customer feedback, architecture decisions, operational and scalability issues, etc., so increase in efficiency is unlikely to be orders of magnitude. AT the same time, there is a shortage of software developers. With the advances of AI, there will be less of a shortage, meaning more software can be built within the same timeframe.

For outsourcing this means that the price per hour or per finished product may increase because of AI (speed is also a factor in pricing). A company will be able to service more customers for a given time. And there’s certainly a lot of demand for digital transformation. For product companies this increase in efficiency will mean faster time-to-market for the product and new features. Which will make product companies more competitive. In both cases, AI is unlikely to kills jobs in the near future.

Sure, ChatGPT can write a website. You can create a free website with site-builders even today. And this hasn’t killed web developers. It just makes the easiest websites cheaper. By the way, building software once and maintaining it are completely different things. Even if ChatGPT can build a website, maintenance is going to be tough through prompts.

At the same time, AI will put more intellectual pressure on junior developers, who are typically given the boilerplate work, which is going to be more automatable. But on the other hand AI will improve the training process of those junior developers. Companies may have to put more effort in training developers, and career paths may have to be adjusted, but it’s unlikely that the demand for software developers will drop.

Third, there is a claim that generative AI will kill jobs in the creative professions. Ever since I wrote an algorthmic music generator, I’ve been saying that it will not. Sure, composers of elevator music will be eventually gone. But poets, for example, won’t. ChatGPT is rather bad at poetry. It can’t actually write proper poetry. It seems to know just the AABB rhyme scheme, it ignores instructions on meter (“use dactylic tetrameter” doesn’t seem to mean anything to it). With image and video generation, the problem with unrealistic hands and fingers (and similar ones) doesn’t seem to be going away with larger models (even though the latest version of Midjourney is neatly going around it). It will certainly require post-editing. Will it make certain industries more efficient? Yes, which will allow them to produce more content for a given time. Will there be enough demand? I can’t say. The market will decide.

LLMs and AI will be change things. It will improve efficiency. It will disrupt some industries. And we have to debate this. But we still have time.

The post Why I’m Not So Alarmed About AI And Jobs appeared first on Bozho's tech blog.

Кадри от непрочетената Норвегия

2023-05-05 Стела Джелепова

Post Syndicated from Стела Джелепова original https://www.toest.bg/kadri-ot-neprochetenata-norvegiya/

Кадри от непрочетената Норвегия

В последното десетилетие често говорим за бум на норвежката и изобщо на скандинавската литература в България. Пропускаме обаче да отбележим, че бумът е в рамките на един-единствен жанр – т.нар. nordic noir. Схемата е ясна: циничен, но брилянтен детектив на средна възраст, обикновено депресиран, алкохолизиран, разведен и с отчуждена дъщеря (или две), разрешава поредица зловещи престъпления, зад които прозира уж истинското лице на скандинавското общество – всичко това на фона на мрачно време, сурови пейзажи и като цяло отчайващо съществуване.

Качеството варира, сюжетите – не особено, но пък заплетената фабула, семейните драми и простичкият език (нека го наречем „пестелив“) се оказват достатъчни да спечелят на тези книги толкова голяма популярност, че да изтикат на задните редове в книжарниците и на дъното на списъците на литературните агенти всяка друга скандинавска продукция.

А такава има. Романи, пиеси, поезия, детски книги –

скандинавските автори се изявяват навсякъде и са високоценени по цял свят.

В България също не липсват изпипани издания на немалко от сериозните произведения, но те не се радват на гласовита реклама и достигат до твърде ограничена публика – всъщност дори по-ограничена, отколкото е обичайно за високата литература от например англоезичните или испаноезичните страни.

За разлика от правените по калъп криминалета, при които трудно можеш да различиш шведските от норвежките, датските, финландските, исландските, че дори и фарьорските автори, по-съдържателната литература е едновременно универсална и силно представителна за собствената си култура. Ето защо вместо да поставя под общ знаменател няколко твърде специфични страни, тук предпочетох да се спра единствено на три творби от Норвегия, които заслужават най-после да стигнат и до българския читател.

Кнут Хамсун, „По обраслите с трева пътеки“

В края на 30-те години на миналия век Кнут Хамсун (ако трябва да сме точни, Кнют Хамсюн) е гордостта на норвежкия народ – баща на модерния роман, пионер в психологическата литература, носител на Нобелова награда, вдъхновение за писатели като Томас Ман, Франц Кафка, Стефан Цвайг, Ърнест Хемингуей.

Само няколко години по-късно, през 1945-а, същият този любимец на цяла Норвегия е арестуван, вкаран в лудница и изправен пред съда. Сънародниците му го заклеймяват и настояват за смъртна присъда, книгите му изчезват от книжарниците. Причината е, че по време на войната, вече много възрастен, усамотен в имението си и донякъде загубил връзка със света, той открито заявява подкрепата си за Германия. Предполага се, че основна роля за това играе съпругата му, яростна защитничка на нацизма, а той самият не е съвсем наясно какво се върши нито в Европа, нито в окупирана Норвегия.

И така, докато е под полицейски надзор, вече 90-годишен, Хамсун пише последната си книга – „По обраслите с трева пътеки“ (1949). Тя е отчасти роман, отчасти памфлет, дневник от тези тежки години, апология на един човек на преклонна възраст и до голяма степен протест срещу заключението на съда, че „умствените му способности са трайно намалени“.

За разлика от немалко други книги на Кнут Хамсун, „По обраслите с трева пътеки“ така и не излиза на български – нито в превод от норвежки, нито през друг език. Тя обаче е централна за неговото творчество: в нея, напълно неочаквано за този късен период, се пробужда духът от ранните му романи, изпълнени с

преклонение пред природата, абсурдистки хумор и смели полети на фантазията, при които границите между реалност и фикция се размиват.

Това може да не е идеалната книга, с която да се срещнеш за първи път с Хамсун, но българският читател има достъп до неговото творчество вече повече от век¹ и липсата ѝ в преводната ни литература е осезаема. Нещо повече, днес тази книга е особено актуална. Разкрива ни мислите на човек с огромен интелект и забележителен принос към световната култура, който по една или друга причина е избрал грешната страна в огромен цивилизационен конфликт и е платил за това.

Туре Ренберг, „Толак на Ингеборг“

Отново със старец, изгубил връзка с времето, в което живее, се срещаме в този кратък нов роман. Туре Ренберг е един от младите гласове в норвежката литература. Известен е с ярките си портрети на обикновени и необикновени хора и преди всичко на норвежки мъже.

Старият мелничар Толак е изпълнен с ярост към света, който вече е неразбираем за него, към децата си, които са избрали града пред семейството, към съседите, които тормозят храненика му – местния идиот Уду, и които отдавна са престанали да възприемат самия Толак като отделна личност, а гледат на него единствено като на някакъв придатък към жена му Ингеборг. Затова и за тях той е единствено „Толак на Ингеборг“. Но Ингеборг е изчезнала безследно преди години, Толак е съвсем сам с Уду, смъртта му наближава и той заставя сина си и дъщеря си да се върнат у дома за последен път, за да сподели с тях голяма тайна.

Романът „Толак на Ингеборг“ е продължител на една дълга норвежка традиция за герои аутсайдери,

живеещи незначителния си живот някъде накрай света, жертва на тесногръдието на околните, но и самите те недотам приятни и будещи съчувствие. На български имаме възможност да четем най-забележителния предшественик на Туре Ренберг, който също като него пише на нюноршк (втория официален писмен език в Норвегия) – Таряй Весос.

С наболялата тема за човека от отиващия си някогашен свят, който не може да намери мястото си в днешния и му остава единствено да се съпротивлява, яростно и до смърт (не задължително своята), романът на Ренберг определено би намерил място на рафтовете в българските книжарници и в домашните ни библиотеки.

Йойвин Римберай, „Соларис, коригиран“

При публикуването си през 2004 г. поемата „Соларис, коригиран“ предизвиква същински фурор и е обявена за

една от най-оригиналните творби на столетието, включена в класацията на 25-те най-велики норвежки произведения на всички времена.

Представлява драматичен монолог от 800 стиха, в който героят е механик в роботизиран град на дъното на морето край някогашните нефтени платформи. Годината е 2480-а и той живее във високотехнологично общество, белязано от класови различия и контролирано от частна корпорация. Земята, съсипана от екологични катастрофи, е станала необитаема. Необичайно за научната фантастика, героят не е нито бунтар, нито забележителен ум, а прост работник с ясни и конкретни мечти за спокоен живот.

Оригиналността на поемата обаче идва не толкова от историята, колкото от езика. Римберай конструира свой собствен, основан на диалекта на родния му град Ставангер („нефтената столица на Норвегия“), комбиниран с елементи от староскандинавски, датски, нидерландски, английски и шотландски германски – тоест езиците на всички страни с участие в нефтената индустрия на Северна Европа.

Всъщност Римберай създава един бъдещ език на постиндустриалното население по бреговете на Северно море. Препратките са както към хибридния говор, използван от работниците на първите норвежки нефтени платформи, така и към изменящия се под влиянието на интернет книжовен норвежки от началото на ХХI век, а и на т.нар. евроанглийски, говорен от работещите в институциите на Европейския съюз по същото време.

Поемата „Соларис, коригиран“ е преведена на няколко езика, сред тях английски и нидерландски. Интересно би било да я видим на български – както за да имаме достъп на роден език до едно от най-оригиналните произведения на този век, така и от любопитство какъв подход ще избере евентуалният преводач.

¹ Преводи през руски, немски и френски се появяват още в края на XIX в., а първият превод директно от норвежки идва през 1992 г., дело на Антония Бучуковска. Най-сериозният пропуск в представянето на Хамсуновото творчество в България, освен „По обраслите с трева пътеки“, остава липсата на превод от оригинала на епичния му роман „Благодатта на земята“, за който получава Нобелова награда.

Two Wrecks: USS Abner Read

2023-05-05 The History Guy: History Deserves to Be Remembered

Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=tJZNFPcdMNs

Статукво на ръба

2023-05-05 Александър Нуцов

Post Syndicated from Александър Нуцов original https://www.toest.bg/statukvo-na-ruba/

Статукво на ръба

Ескалацията на напрежението в Судан разруши крехкия политически баланс в страната и обезпокои международната общност. Раздираната от конфликти северноафриканска страна рискува да се превърне в арена на широкомащабна гражданска война. По различни данни, боевете между суданските въоръжени сили Sudanese Armed Forces (SAF) и паравоенната милиция Rapid Support Forces (RSF) до този момент са отнели живота на стотици души.

По оценка на ООН обаче реалният брой на жертвите вероятно е много по-висок. Хаосът е съпътстван от хуманитарна криза, като голяма част от населението в столицата Хартум и други градове няма достъп до електричество и вода. Има недостиг на хранителни продукти и пособия за нормалното функциониране на болници и други социални учреждения. Цените на храни и продукти от първа необходимост достигат екстремни стойности, а мародерството и плячкосванията допълват общата картина. Нараства броят на вътрешно разселените хора, както и на онези, които търсят сигурно убежище в съседни страни, като Чад, Централноафриканската република, Южен Судан, Египет и Етиопия.

Генерали, милиции и опозиции

Непосредствените актьори са редовната армия на Судан (SAF) начело с генерал Абдел Фатах ал-Бурхан и паравоенната милиция (RSF) под командването на генерал Мохамед Хамдан Дагало, познат като Хемедти. Макар и доскоро неизвестен за широката публика, Ал-Бурхан има сериозен военен опит и през годините се изкачва в йерархията на суданската армия до позицията на един от най-влиятелните генерали. Участва в гражданската война в Дарфур, преминава през различни военни постове и е повишен в генерал-инспектор на суданската армия. Надзирава и разгръщането на суданските сили в подкрепа на ръководената от Саудитска Арабия коалиция срещу бунтовниците в Йемен.

За разлика от Ал-Бурхан, който произхожда от влиятелните кръгове в Хартум, Хемедти е родом от Дарфур и има много по-голямо влияние в суданската периферия. В миналото ръководи паравоенната организация „Джанджауид“, известна с извършените зверства срещу цивилното население в Дарфур. Отцепници от нея формират ядрото и на създадената през 2013 г. организация RSF начело с Хемедти. Тя също се слави с лошо име след обвиненията за извършени военни престъпления и престъпления срещу човечеството в Дарфур и Йемен. Хемедти затвърждава позицията си на една от най-влиятелните фигури в Судан през 2017 г., когато използва RSF, за да поеме контрола върху търговията със злато – най-големия експортен бизнес в страната.

Друг ключов, макар и косвен актьор е разнородната по състав коалиция „Сили за свобода и промяна“ – Forces for Freedom and Change (FFC), съставена от опозиционни политически партии, граждански организации, активисти, профсъюзи и бивши бунтовнически групи. Тя се превръща в двигател на протестите срещу бившия президент Омар ал-Башир. FFC е широкоспектърна цивилна опозиция – контрапункт на военните и на влиянието им върху властовите структури, както и на опитите за налагане на ислямистка, недемократична управленска идеология. Участниците в коалицията са различни по произход, профил и идеология. Като главен инициатор за въвеждане на цивилно управление FFC влияе значително върху отношенията между Ал-Бурхан и Хемедти и върху динамиката на политическите отношения в Судан.

Крехка споделена власт

През 2019 г. дългогодишният президент на Судан Омар ал-Башир е свален от власт след най-масовите граждански протести в новата история на страната. Суданците излизат на улицата заради тежките условия на живот, окаяното състояние на икономиката и непосилната инфлация. Ал-Башир произлиза от военните кръгове и след успешен преврат завзема властта през 1989 г. Той е лидер на ислямистката, проарабска политическа формация National Congress Party, управлява по авторитарен модел и затвърждава политико-икономическото влияние на арабските елити. Международният наказателен съд го обвинява в престъпления срещу човечеството и във военни престъпления, извършени по време на войната в Дарфур, въпреки че Судан не е страна по Римския статут.

През април 2019 г. FFC засилва протестния натиск върху Ал-Башир и срещу него се обръщат службите за сигурност, генералите от армията и RSF. Президентът е отстранен на 11 април, с което започва дългото противоборство между цивилните и военните кръгове относно бъдещия модел на управление. Освен кой ще управлява, за враждуващите страни от изключителна важност е как ще се извърши преходът от диктатура към демократична, цивилна форма на управление, каква ще е структурата на армията и службите, каква ще е системата на съдебната власт, доколко военните ще бъде ограничени и подчинени на цивилното правителство и т.н. Военната хунта създава Преходен военен съвет с председател Ал-Бурхан и зам.-председател Хемедти и инициира преговори с FFC.

Предвижда се създаване на преходно правителство с тригодишен мандат до провеждане на избори, законодателно тяло и т.нар. Суверенен съвет с контролни функции върху преходния процес. Според анализ на International Crisis Group споразумението пропада заради разрив сред военните. Това води до поредица от граждански протести, брутално потушени от армията и паравоенните. След нови масови демонстрации и силен международен натиск военните и цивилната опозиция възобновяват преговорите и сключват политическо и конституционно споразумение. Ал-Бурхан и Хемедти са назначени за председател и зам.-председател на Суверенния съвет, а за министър-председател на преходното правителство е избран видният судански икономист Абдала Хамдок. Преходният период трябва да приключи с провеждането на свободни избори 39 месеца след споразумението, подписано на 17 август 2019 г.

На 25 октомври 2021 г. крехкият модел на споделена власт между военните и цивилната опозиция става жертва на нов преврат. Военните свалят Хамдок от власт и арестуват част от министрите и видни представители на цивилните кръгове. В отговор следват нови масови протести, които принуждават Ал-Бурхан да подпише споразумение и да реабилитира Хамдок на премиерския пост. Той обаче подава оставка в началото на 2022 г. заради огромното влияние на армията върху управленските структури, с което военните затвърждават надмощието си в управлението на страната. Статуквото трае до декември, когато Ал-Бурхан, Хемедти и FFC приемат рамково споразумение за създаване на преходно правителство, което да проведе избори и да положи основите на бъдещата цивилна власт.

Колониалното наследство и пътят на реформите

Кое доведе до избухването на въоръжени действия между двамата най-влиятелни генерали няколко месеца след подписването на споразумението? Отговорът се крие в съдържанието на споразумението, което засилва обществените разделения. Първо, преговорната рамка изключва от дебата важни обществени групи – такива например са т.нар. Комитети за съпротива (Resistance Committees), които са твърди застъпници за демокрация, играят ключова роля за организацията на протестите и се обявяват против споразумението и преговорите с военната хунта. Това води до разрив в цивилната опозиция и дистанциране на част от водещите ѝ елементи.

Вторият разрив е между цивилната опозиция и военните по отношение на достъпа до съдебната власт и състава и структурата на институциите за сигурност. Докато Комитетите за съпротива и протестиращите граждани настояват за строго правосъдие спрямо висшите генерали, отговорни за зверствата и престъпленията срещу цивилното население, Ал-Бурхан и Хемедти трудно биха предали контрола върху съдебните органи и службите на независимо правителство. След свалянето на Ал-Башир през 2019 г. страховете от подобен развой съпътстват действията им. Затова подчиняването на армията и службите за сигурност на едно демократично избрано правителство би било трудно осъществимо, без Ал-Бурхан и/или Хемедти да получат водещи постове в сектора и да запазят влиянието си.

Третият разрив е между оглавяваната от Ал-Бурхан редовна суданска армия (SAF) и ръководените от Хемедти паравоенни сили (RSF). Основният проблем се крие в заложената в споразумението реформа, според която RSF трябва да се интегрира в редовната суданска армия. Спорните въпроси тук са три: какъв да е преходният период за извършване на интеграцията, как да се осъществи и кой ще води командването на военните сили по време на интеграционния процес и след приключването му.

С други думи, от реформата зависят бъдещите отношения между Ал-Бурхан и Хемедти – тоест достъпът им до властови, политически, икономически и военни ресурси. Рамковото споразумение обаче не постановява конкретен времеви хоризонт и структура на бъдещото командване. Според последния анализ на International Crisis Group Хемедти и FFC настояват за десетгодишен период на интеграция, докато Ал-Бурхан се застъпва за двугодишна рамка, за да възпре нарастващата мощ на RSF в дългосрочен план. Борбата за надмощие между Ал-Бурхан и Хемедти е сред водещите причини за избухването на конфликта.

Друг важен фактор според доклада е вътрешният натиск върху Ал-Бурхан от армейски генерали, които са против всякакви отстъпки пред Хемедти. Това води до четвърти разрив – между хардлайнерите и умерените крила в армията. Тази вътрешна динамика не е маловажна, защото ограничава възможните ходове на Ал-Бурхан и влияе върху вземането на крайни решения.

Раздорът между Ал-Бурхан и Хемедти назрява в продължение на няколко години. Недоверието между двамата генерали и динамичната конюнктура карат Хемедти да се дистанцира в публични изяви, позиционирайки се като радетел за демокрация и цивилно управление. Така той противопоставя своя образ на този на Ал-Бурхан, който се свързва с авторитарното минало, както и с ислямистките кръгове около бившия президент Омар ал-Башир. От това следва да спечели поне частично доверието на цивилната опозиция и международните актьори. И Хемедти, и Ал-Бурхан обаче трудно ще получат индулгенция от гражданите заради извършените престъпления.

Дълбоките структурни предпоставки за конфликта са свързани с колониалното наследство на Судан – властови, икономически и социални неравенства, които маргинализират големи обществени групи от политическия и икономическия живот на страната. Разломът между столицата Хартум (и централната част на Судан) и суданската периферия по отношение на инфраструктурата и икономическото развитие, както и съсредоточаването на властови ресурси в ръцете на малобройните арабски елити създават нужните условия за нестихващите конфликти след обявяването на суданската независимост през 1956 г.

Опростенческите разделения на расов и религиозен принцип – например между арабския ислямски Север и африканския християнски Юг (преди отцепването на Южен Судан като самостоятелна държава) – не обясняват същността на конфликтите. Етническите, религиозни и племенни разделения не причиняват конфликтите сами по себе си, а ги катализират. Те се използват от различни актьори за мобилизация и консолидация на сили в борбата си за власт, ресурси и влияние.

В този смисъл бившият президент Омар ал-Башир утвърди практиката паравоенни проправителствени милиции като RSF и „Джанджауид“ да бъдат използвани за потушаване на многобройните бунтовнически въстания в периферията. Подходът му доведе до сегашните структурни предизвикателства, защото подобни военни образувания придобиват мощ и ресурси, които им позволяват да се конкурират със силите на редовната армия. Това създава пречки пред евентуалната интеграция на паравоенните в редиците на армията и води до сблъсък на интереси.

За да се върви към устойчив мир, рискът от нова пълномащабна и продължителна гражданска война трябва да премине. Необходимо е примирие, което да върне водещите актьори на масата за преговори, за да се подновят дебатите за преходно правителство, което да адресира наболелите политически въпроси. Международната общност не пести усилия в тази насока. Южносуданският президент Салва Киир успешно посредничи за едноседмично примирие между двете враждуващи страни. От ключово значение ще са и усилията на ООН, регионалните организации Африкански съюз и Междуправителствен орган за развитие (IGAD), както и на четворката посредници – Саудитска Арабия, Обединените арабски емирства, САЩ и Великобритания.

Commemorative Plaque

2023-05-05

Post Syndicated from original https://xkcd.com/2772/

[Below] On this site on May 12th, 2023, I finally learned how to use the masonry bit for my drill.

Get details on security finding changes with the new Finding History feature in Security Hub

2023-05-05 Nicholas Jaeger

Post Syndicated from Nicholas Jaeger original https://aws.amazon.com/blogs/security/get-details-on-security-finding-changes-with-the-new-finding-history-feature-in-security-hub/

In today’s evolving security threat landscape, security teams increasingly require tools to detect and track security findings to protect their organizations’ assets. One objective of cloud security posture management is to identify and address security findings in a timely and effective manner. AWS Security Hub aggregates, organizes, and prioritizes security alerts and findings from various AWS services and supported security solutions from the AWS Partner Network.

As the volume of findings increases, tracking the changes and actions that have been taken on each finding becomes more difficult, as well as more important to perform timely and effective investigations. In this post, we will show you how to use the new Finding History feature in Security Hub to track and understand the history of a security finding.

Updates to findings occur when finding providers update certain fields, such as resource details, by using the BatchImportFindings API. You, as a user, can update certain fields, such as workflow status, in the AWS Management Console or through the BatchUpdateFindings API. Ticketing, incident management, security information and event management (SIEM), and automatic remediation solutions can also use the BatchUpdateFindings API to update findings. This new capability highlights these various changes and when they occurred so that you don’t need to investigate this yourself.

Finding History

The new Finding History feature in Security Hub helps you understand the state of a finding by providing an immutable history of changes within the finding details. By using this feature, you can track the history of each finding, including the before and after values of the fields that were changed, who or what made the changes, and when the changes were made. This simplifies how you operate on a finding by giving you visibility into the changes made to a finding over time, alongside the rest of the finding details, which removes the need for separate tooling or additional processes. This feature is available at no additional cost in AWS Regions where Security Hub is available, and appears by default for new or updated findings. Finding History is also available through the Security Hub APIs.

To try out this new feature, open the Security Hub console, select a finding, and choose the History tab. There you will see a chronological list of changes that have been made to the finding. The transparency of the finding history helps you quickly assess the status of the finding, understand actions already taken, and take the necessary actions to mitigate risk. For example, upon resolving a finding, you can add a note to the finding to indicate why you resolved it. Both the resolved status and note will appear in the history.

In the following example, the finding was updated and then resolved with an explanatory note left by the person that reviewed the finding. With Finding History, you can see the previous updates and events in the finding’s History tab.

Figure 1: Finding History shows recent updates to the finding

In addition, you can still view the current state of the finding in its Details tab.

Figure 2: Finding Details shows the record of a security check or security-related detection

Conclusion

With the new Finding History feature in Security Hub, you have greater visibility into the activity and updates on each finding, allowing for more efficient investigation and response to potential security risks. Next time that you start work to investigate and respond to a security finding in Security Hub, begin by checking the finding history.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Twitter’s e2ee DMs are better than nothing

2023-05-05

Post Syndicated from original https://mjg59.dreamwidth.org/66791.html

Elon Musk appeared on an interview with Tucker Carlson last month, with one of the topics being the fact that Twitter could be legally compelled to hand over users’ direct messages to government agencies since they’re held on Twitter’s servers and aren’t encrypted. Elon talked about how they were in the process of implementing proper encryption for DMs that would prevent this – “You could put a gun to my head and I couldn’t tell you. That’s how it should be.”

tl;dr – in the current implementation, while Twitter could subvert the end-to-end nature of the encryption, it could not do so without users being notified. If any user involved in a conversation were to ignore that notification, all messages in that conversation (including ones sent in the past) could then be decrypted. This isn’t ideal, but it still seems like an improvement over having no encryption at all. More technical discussion follows.

For context: all information about Twitter’s implementation here has been derived from reverse engineering version 9.86.0 of the Android client and 9.56.1 of the iOS client (the current versions at time of writing), and the feature hasn’t yet launched. While it’s certainly possible that there could be major changes in the protocol between now launch, Elon has asserted that they plan to launch the feature this week so it’s plausible that this reflects what’ll ship.

For it to be impossible for Twitter to read DMs, they need to not only be encrypted, they need to be encrypted with a key that’s not available to Twitter. This is what’s referred to as “end-to-end encryption”, or e2ee – it means that the only components in the communication chain that have access to the unencrypted data are the endpoints. Even if the message passes through other systems (and even if it’s stored on other systems), those systems do not have access to the keys that would be needed to decrypt the data.

End-to-end encrypted messengers were initially popularised by Signal, but the Signal protocol has since been incorporated into WhatsApp and is probably much more widely used there. Millions of people per day are sending messages to each other that pass through servers controlled by third parties, but those third parties are completely unable to read the contents of those messages. This is the scenario that Elon described, where there’s no degree of compulsion that could cause the people relaying messages to and from people to decrypt those messages afterwards.

But for this to be possible, both ends of the communication need to be able to encrypt messages in a way the other end can decrypt. This is usually performed using AES, a well-studied encryption algorithm with no known significant weaknesses. AES is a form of what’s referred to as a symmetric encryption, one where encryption and decryption are performed with the same key. This means that both ends need access to that key, which presents us with a bootstrapping problem. Until a shared secret is obtained, there’s no way to communicate securely, so how do we generate that shared secret? A common mechanism for this is something called Diffie Hellman key exchange, which makes use of asymmetric encryption. In asymmetric encryption, an encryption key can be split into two components – a public key and a private key. Both devices involved in the communication combine their private key and the other party’s public key to generate a secret that can only be decoded with access to the private key. As long as you know the other party’s public key, you can now securely generate a shared secret with them. Even a third party with access to all the public keys won’t be able to identify this secret. Signal makes use of a variation of Diffie-Hellman called Extended Triple Diffie-Hellman that has some desirable properties, but it’s not strictly necessary for the implementation of something that’s end-to-end encrypted.

Although it was rumoured that Twitter would make use of the Signal protocol, and in fact there are vestiges of code in the Twitter client that still reference Signal, recent versions of the app have shipped with an entirely different approach that appears to have been written from scratch. It seems simple enough. Each device generates an asymmetric keypair using the NIST P-256 elliptic curve, along with a device identifier. The device identifier and the public half of the key are uploaded to Twitter using a new API endpoint called /1.1/keyregistry/register. When you want to send an encrypted DM to someone, the app calls /1.1/keyregistry/extract_public_keys with the IDs of the users you want to communicate with, and gets back a list of their public keys. It then looks up the conversation ID (a numeric identifier that corresponds to a given DM exchange – for a 1:1 conversation between two people it doesn’t appear that this ever changes, so if you DMed an account 5 years ago and then DM them again now from the same account, the conversation ID will be the same) in a local database to retrieve a conversation key. If that key doesn’t exist yet, the sender generates a random one. The message is then encrypted with the conversation key using AES in GCM mode, and the conversation key is then put through Diffie-Hellman with each of the recipients’ public device keys. The encrypted message is then sent to Twitter along with the list of encrypted conversation keys. When each of the recipients’ devices receives the message it checks whether it already has a copy of the conversation key, and if not performs its half of the Diffie-Hellman negotiation to decrypt the encrypted conversation key. One it has the conversation key it decrypts it and shows it to the user.

What would happen if Twitter changed the registered public key associated with a device to one where they held the private key, or added an entirely new device to a user’s account? If the app were to just happily send a message with the conversation key encrypted with that new key, Twitter would be able to decrypt that and obtain the conversation key. Since the conversation key is tied to the conversation, not any given pair of devices, obtaining the conversation key means you can then decrypt every message in that conversation, including ones sent before the key was obtained.

(An aside: Signal and WhatsApp make use of a protocol called Sesame which involves additional secret material that’s shared between every device a user owns, hence why you have to do that QR code dance whenever you add a new device to your account. I’m grossly over-simplifying how clever the Signal approach is here, largely because I don’t understand the details of it myself. The Signal protocol uses something called the Double Ratchet Algorithm to implement the actual message encryption keys in such a way that even if someone were able to successfully impersonate a device they’d only be able to decrypt messages sent after that point even if they had encrypted copies of every previous message in the conversation)

How’s this avoided? Based on the UI that exists in the iOS version of the app, in a fairly straightforward way – each user can only have a single device that supports encrypted messages. If the user (or, in our hypothetical, a malicious Twitter) replaces the device key, the client will generate a notification. If the user pays attention to that notification and verifies with the recipient through some out of band mechanism that the device has actually been replaced, then everything is fine. But, if any participant in the conversation ignores this warning, the holder of the subverted key can obtain the conversation key and decrypt the entire history of the conversation. That’s strictly worse than anything based on Signal, where such impersonation would simply not work, but even in the Twitter case it’s not possible for someone to silently subvert the security.

So when Elon says Twitter wouldn’t be able to decrypt these messages even if someone held a gun to his head, there’s a condition applied to that – it’s true as long as nobody fucks up. This is clearly better than the messages just not being encrypted at all in the first place, but overall it’s a weaker solution than Signal. If you’re currently using Twitter DMs, should you turn on encryption? As long as the limitations aren’t too limiting, definitely! Should you use this in preference to Signal or WhatsApp? Almost certainly not. This seems like a genuine incremental improvement, but it’d be easy to interpret what Elon says as providing stronger guarantees than actually exist.

comments

Migrating Critical Traffic At Scale with No Downtime — Part 1

2023-05-05 Netflix Technology Blog

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/migrating-critical-traffic-at-scale-with-no-downtime-part-1-ba1c7a1c7835

Migrating Critical Traffic At Scale with No Downtime — Part 1

Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, Devang Shah

Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. Behind the scenes, a myriad of systems and services are involved in orchestrating the product experience. These backend systems are consistently being evolved and optimized to meet and exceed customer and product expectations.

When undertaking system migrations, one of the main challenges is establishing confidence and seamlessly transitioning the traffic to the upgraded architecture without adversely impacting the customer experience. This blog series will examine the tools, techniques, and strategies we have utilized to achieve this goal.

The backend for the streaming product utilizes a highly distributed microservices architecture; hence these migrations also happen at different points of the service call graph. It can happen on an edge API system servicing customer devices, between the edge and mid-tier services, or from mid-tiers to data stores. Another relevant factor is that the migration could be happening on APIs that are stateless and idempotent, or it could be happening on stateful APIs.

We have categorized the tools and techniques we have used to facilitate these migrations in two high-level phases. The first phase involves validating functional correctness, scalability, and performance concerns and ensuring the new systems’ resilience before the migration. The second phase involves migrating the traffic over to the new systems in a manner that mitigates the risk of incidents while continually monitoring and confirming that we are meeting crucial metrics tracked at multiple levels. These include Quality-of-Experience(QoE) measurements at the customer device level, Service-Level-Agreements (SLAs), and business-level Key-Performance-Indicators(KPIs).

This blog post will provide a detailed analysis of replay traffic testing, a versatile technique we have applied in the preliminary validation phase for multiple migration initiatives. In a follow-up blog post, we will focus on the second phase and look deeper at some of the tactical steps that we use to migrate the traffic over in a controlled manner.

Replay Traffic Testing

Replay traffic refers to production traffic that is cloned and forked over to a different path in the service call graph, allowing us to exercise new/updated systems in a manner that simulates actual production conditions. In this testing strategy, we execute a copy (replay) of production traffic against a system’s existing and new versions to perform relevant validations. This approach has a handful of benefits.

Replay traffic testing enables sandboxed testing at scale without significantly impacting production traffic or user experience.
Utilizing cloned real traffic, we can exercise the diversity of inputs from a wide range of devices and device application software versions in production. This is particularly important for complex APIs that have many high cardinality inputs. Replay traffic provides the reach and coverage required to test the ability of the system to handle infrequently used input combinations and edge cases.
This technique facilitates validation on multiple fronts. It allows us to assert functional correctness and provides a mechanism to load test the system and tune the system and scaling parameters for optimal functioning.
By simulating a real production environment, we can characterize system performance over an extended period while considering the expected and unexpected traffic pattern shifts. It provides a good read on the availability and latency ranges under different production conditions.
Provides a platform to ensure that essential operational insights, metrics, logging, and alerting are in place before migration.

Replay Solution

The replay traffic testing solution comprises two essential components.

Traffic Duplication and Correlation: The initial step requires the implementation of a mechanism to clone and fork production traffic to the newly established pathway, along with a process to record and correlate responses from the original and alternative routes.
Comparative Analysis and Reporting: Following traffic duplication and correlation, we need a framework to compare and analyze the responses recorded from the two paths and get a comprehensive report for the analysis.

We have tried different approaches for the traffic duplication and recording step through various migrations, making improvements along the way. These include options where replay traffic generation is orchestrated on the device, on the server, and via a dedicated service. We will examine these alternatives in the upcoming sections.

Device Driven

In this option, the device makes a request on the production path and the replay path, then discards the response on the replay path. These requests are executed in parallel to minimize any potential delay on the production path. The selection of the replay path on the backend can be driven by the URL the device uses when making the request or by utilizing specific request parameters in routing logic at the appropriate layer of the service call graph. The device also includes a unique identifier with identical values on both paths, which is used to correlate the production and replay responses. The responses can be recorded at the most optimal location in the service call graph or by the device itself, depending on the particular migration.

The device-driven approach’s obvious downside is that we are wasting device resources. There is also a risk of impact on device QoE, especially on low-resource devices. Adding forking logic and complexity to the device code can create dependencies on device application release cycles that generally run at a slower cadence than service release cycles, leading to bottlenecks in the migration. Moreover, allowing the device to execute untested server-side code paths can inadvertently expose an attack surface area for potential misuse.

Server Driven

To address the concerns of the device-driven approach, the other option we have used is to handle the replay concerns entirely on the backend. The replay traffic is cloned and forked in the appropriate service upstream of the migrated service. The upstream service calls the existing and new replacement services concurrently to minimize any latency increase on the production path. The upstream service records the responses on the two paths along with an identifier with a common value that is used to correlate the responses. This recording operation is also done asynchronously to minimize any impact on the latency on the production path.

The server-driven approach’s benefit is that the entire complexity of replay logic is encapsulated on the backend, and there is no wastage of device resources. Also, since this logic resides on the server side, we can iterate on any required changes faster. However, we are still inserting the replay-related logic alongside the production code that is handling business logic, which can result in unnecessary coupling and complexity. There is also an increased risk that bugs in the replay logic have the potential to impact production code and metrics.

Dedicated Service

The latest approach that we have used is to completely isolate all components of replay traffic into a separate dedicated service. In this approach, we record the requests and responses for the service that needs to be updated or replaced to an offline event stream asynchronously. Quite often, this logging of requests and responses is already happening for operational insights. Subsequently, we use Mantis, a distributed stream processor, to capture these requests and responses and replay the requests against the new service or cluster while making any required adjustments to the requests. After replaying the requests, this dedicated service also records the responses from the production and replay paths for offline analysis.

This approach centralizes the replay logic in an isolated, dedicated code base. Apart from not consuming device resources and not impacting device QoE, this approach also reduces any coupling between production business logic and replay traffic logic on the backend. It also decouples any updates on the replay framework away from the device and service release cycles.

Analyzing Replay Traffic

Once we have run replay traffic and recorded a statistically significant volume of responses, we are ready for the comparative analysis and reporting component of replay traffic testing. Given the scale of the data being generated using replay traffic, we record the responses from the two sides to a cost-effective cold storage facility using technology like Apache Iceberg. We can then create offline distributed batch processing jobs to correlate & compare the responses across the production and replay paths and generate detailed reports on the analysis.

Normalization

Depending on the nature of the system being migrated, the responses might need some preprocessing before being compared. For example, if some fields in the responses are timestamps, those will differ. Similarly, if there are unsorted lists in the responses, it might be best to sort them before comparing. In certain migration scenarios, there may be intentional alterations to the response generated by the updated service or component. For instance, a field that was a list in the original path is represented as key-value pairs in the new path. In such cases, we can apply specific transformations to the response on the replay path to simulate the expected changes. Based on the system and the associated responses, there might be other specific normalizations that we might apply to the response before we compare the responses.

Comparison

After normalizing, we diff the responses on the two sides and check whether we have matching or mismatching responses. The batch job creates a high-level summary that captures some key comparison metrics. These include the total number of responses on both sides, the count of responses joined by the correlation identifier, matches and mismatches. The summary also records the number of passing/ failing responses on each path. This summary provides an excellent high-level view of the analysis and the overall match rate across the production and replay paths. Additionally, for mismatches, we record the normalized and unnormalized responses from both sides to another big data table along with other relevant parameters, such as the diff. We use this additional logging to debug and identify the root cause of issues driving the mismatches. Once we discover and address those issues, we can use the replay testing process iteratively to bring down the mismatch percentage to an acceptable number.

Lineage

When comparing responses, a common source of noise arises from the utilization of non-deterministic or non-idempotent dependency data for generating responses on the production and replay pathways. For instance, envision a response payload that delivers media streams for a playback session. The service responsible for generating this payload consults a metadata service that provides all available streams for the given title. Various factors can lead to the addition or removal of streams, such as identifying issues with a specific stream, incorporating support for a new language, or introducing a new encode. Consequently, there is a potential for discrepancies in the sets of streams used to determine payloads on the production and replay paths, resulting in divergent responses.

A comprehensive summary of data versions or checksums for all dependencies involved in generating a response, referred to as a lineage, is compiled to address this challenge. Discrepancies can be identified and discarded by comparing the lineage of both production and replay responses in the automated jobs analyzing the responses. This approach mitigates the impact of noise and ensures accurate and reliable comparisons between production and replay responses.

Comparing Live Traffic

An alternative method to recording responses and performing the comparison offline is to perform a live comparison. In this approach, we do the forking of the replay traffic on the upstream service as described in the `Server Driven` section. The service that forks and clones the replay traffic directly compares the responses on the production and replay path and records relevant metrics. This option is feasible if the response payload isn’t very complex, such that the comparison doesn’t significantly increase latencies or if the services being migrated are not on the critical path. Logging is selective to cases where the old and new responses do not match.

Load Testing

Besides functional testing, replay traffic allows us to stress test the updated system components. We can regulate the load on the replay path by controlling the amount of traffic being replayed and the new service’s horizontal and vertical scale factors. This approach allows us to evaluate the performance of the new services under different traffic conditions. We can see how the availability, latency, and other system performance metrics, such as CPU consumption, memory consumption, garbage collection rate, etc, change as the load factor changes. Load testing the system using this technique allows us to identify performance hotspots using actual production traffic profiles. It helps expose memory leaks, deadlocks, caching issues, and other system issues. It enables the tuning of thread pools, connection pools, connection timeouts, and other configuration parameters. Further, it helps in the determination of reasonable scaling policies and estimates for the associated cost and the broader cost/risk tradeoff.

Stateful Systems

We have extensively utilized replay testing to build confidence in migrations involving stateless and idempotent systems. Replay testing can also validate migrations involving stateful systems, although additional measures must be taken. The production and replay paths must have distinct and isolated data stores that are in identical states before enabling the replay of traffic. Additionally, all different request types that drive the state machine must be replayed. In the recording step, apart from the responses, we also want to capture the state associated with that specific response. Correspondingly in the analysis phase, we want to compare both the response and the related state in the state machine. Given the overall complexity of using replay testing with stateful systems, we have employed other techniques in such scenarios. We will look at one of them in the follow-up blog post in this series.

Summary

We have adopted replay traffic testing at Netflix for numerous migration projects. A recent example involved leveraging replay testing to validate an extensive re-architecture of the edge APIs that drive the playback component of our product. Another instance included migrating a mid-tier service from REST to gRPC. In both cases, replay testing facilitated comprehensive functional testing, load testing, and system tuning at scale using real production traffic. This approach enabled us to identify elusive issues and rapidly build confidence in these substantial redesigns.

Upon concluding replay testing, we are ready to start introducing these changes in production. In an upcoming blog post, we will look at some of the techniques we use to roll out significant changes to the system to production in a gradual risk-controlled way while building confidence via metrics at different levels.

Migrating Critical Traffic At Scale with No Downtime — Part 1 was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Single sign-on with Amazon Redshift Serverless with Okta using Amazon Redshift Query Editor v2 and third-party SQL clients

2023-05-05 Maneesh Sharma

Post Syndicated from Maneesh Sharma original https://aws.amazon.com/blogs/big-data/single-sign-on-with-amazon-redshift-serverless-with-okta-using-amazon-redshift-query-editor-v2-and-third-party-sql-clients/

Amazon Redshift Serverless makes it easy to run and scale analytics in seconds without the need to set up and manage data warehouse clusters. With Redshift Serverless, users such as data analysts, developers, business professionals, and data scientists can get insights from data by simply loading and querying data in the data warehouse.

Customers use their preferred SQL clients to analyze their data in Redshift Serverless. They want to use an identity provider (IdP) or single sign-on (SSO) credentials to connect to Redshift Serverless to reuse existing using credentials and avoid additional user setup and configuration. When you use AWS Identity and Access Management (IAM) or IdP-based credentials to connect to a serverless data warehouse, Amazon Redshift automatically creates a database user for the end-user. You can simplify managing user privileges by using role-based access control. Admins can use a database-role mapping for SSO with the IAM roles that users are assigned to get their database privileges automatically. With this integration, organizations can simplify user management because they no longer need to create users and map them to database roles manually. You can define the mapped database roles as a principal tag for the IdP groups or IAM role, so Amazon Redshift database roles and users who are members of those IdP groups are granted to the database roles automatically.

In this post, we focus on Okta as the IdP and provide step-by-step guidance to integrate Redshift Serverless with Okta using the Amazon Redshift Query Editor V2 and with SQL clients like SQL Workbench/J. You can use this mechanism with other IdP providers such as Azure Active Directory or Ping with any applications or tools using Amazon’s JDBC/ODBC/Python driver.

Solution overview

The following diagram illustrates the authentication flow of Okta with Redshift Serverless using federated IAM roles and automatic database-role mapping.

The workflow contains the following steps:

Either the user chooses an IdP app in their browser, or the SQL client initiates a user authentication request to the IdP (Okta).
Upon a successful authentication, Okta submits a request to the AWS federation endpoint with a SAML assertion containing the PrincipalTags.
The AWS federation endpoint validates the SAML assertion and invokes the AWS Security Token Service (AWS STS) API AssumeRoleWithSAML. The SAML assertion contains the IdP user and group information that is stored in the RedshiftDbUser and RedshiftDbRoles principal tags, respectively. Temporary IAM credentials are returned to the SQL client or, if using the Query Editor v2, the user’s browser is redirected to the Query Editor v2 console using the temporary IAM credentials.
The temporary IAM credentials are used by the SQL client or Query Editor v2 to call the Redshift Serverless GetCredentials API. The API uses the principal tags to determine the user and database roles that the user belongs to. An associated database user is created if the user is signing in for the first time and is granted the matching database roles automatically. A temporary password is returned to the SQL client.
Using the database user and temporary password, the SQL client or Query Editor v2 connects to Redshift Serverless. Upon login, the user is authorized based on the Amazon Redshift database roles that were assigned in Step 4.

To set up the solution, we complete the following steps:

Set up your Okta application:
- Create Okta users.
- Create groups and assign groups to users.
- Create the Okta SAML application.
- Collect Okta information.
Set up AWS configuration:
- Create the IAM IdP.
- Create the IAM role and policy.
Configure Redshift Serverless role-based access.
Federate to Redshift Serverless using the Query Editor V2.
Configure the SQL client (for this post, we use SQL Workbench/J).
Optionally, implement MFA with SQL Client and Query Editor V2.

Prerequisites

You need the following prerequisites to set up this solution:

An AWS account. If you don’t have one, you can sign up for one.
An Redshift Serverless data warehouse. For setup instructions, see Getting started with Amazon Redshift Serverless.
The latest Redshift Serverless JDBC SDK driver-dependent libraries (download the libraries and unzip the JDBC JAR zipped folder).
An Okta account that has an active subscription. You need an admin role to set up the application on Okta. If you’re new to Okta, you can sign up for a free trial or sign up for a developer account.
A SQL client such as SQL Workbench/J.

Set up Okta application

In this section, we provide the steps to configure your Okta application.

Create Okta users

To create your Okta users, complete the following steps:

Sign in to your Okta organization as a user with administrative privileges.
On the admin console, under Directory in the navigation pane, choose People.
Choose Add person.
For First Name, enter the user’s first name.
For Last Name, enter the user’s last name.
For Username, enter the user’s user name in email format.
Select I will set password and enter a password.
Optionally, deselect User must change password on first login if you don’t want the user to change their password when they first sign in. Choose Save.

Create groups and assign groups to users

To create your groups and assign them to users, complete the following steps:

Sign in to your Okta organization as a user with administrative privileges.
On the admin console, under Directory in the navigation pane, choose Groups.
Choose Add group.
Enter a group name and choose Save.
Choose the recently created group and then choose Assign people.
Choose the plus sign and then choose Done.
Repeat Steps 1–6 to add more groups.

In this post, we create two groups: sales and finance.

Create an Okta SAML application

To create your Okta SAML application, complete the following steps:

Sign in to your Okta organization as a user with administrative privileges.
On the admin console, under Applications in the navigation pane, choose Applications.
Choose Create App Integration.
Select SAML 2.0 as the sign-in method and choose Next.
Enter a name for your app integration (for example, redshift_app) and choose Next.
Enter following values in the app and leave the rest as is:
- For Single Sign On URL, enter https://signin.aws.amazon.com/saml.
- For Audience URI (SP Entity ID), enter urn:amazon:webservices.
- For Name ID format, enter EmailAddress.
Choose Next.
Choose I’m an Okta customer adding an internal app followed by This is an internal app that we have created.
Choose Finish.
Choose Assignments and then choose Assign.
Choose Assign to groups and then select Assign next to the groups that you want to add.
Choose Done.

Set up Okta advanced configuration

After you create the custom SAML app, complete the following steps:

On the admin console, navigate to General and choose Edit under SAML settings.
Choose Next.
Set Default Relay State to the Query Editor V2 URL, using the format https://<region>.console.aws.amazon.com/sqlworkbench/home. For this post, we use https://us-west-2.console.aws.amazon.com/sqlworkbench/home.
Under Attribute Statements (optional), add the following properties:
- Provide the IAM role and IdP in comma-separated format using the Role attribute. You’ll create this same IAM role and IdP in a later step when setting up AWS configuration.
- Set user.login for RoleSessionName. This is used as an identifier for the temporary credentials that are issued when the role is assumed.
- Set the DB roles using PrincipalTag:RedshiftDbRoles. This uses the Okta groups to fill the principal tags and map them automatically with the Amazon Redshift database roles. Its value must be a colon-separated list in the format role1:role2.
- Set user.login for PrincipalTag:RedshiftDbUser. This uses the user name in the directory. This is a required tag and defines the database user that is used by Query Editor V2.
- Set the transitive keys using TransitiveTagKeys. This prevents users from changing the session tags in case of role chaining.

The preceding tags are forwarded to the GetCredentials API to get temporary credentials for your Redshift Serverless instance and map automatically with Amazon Redshift database roles. The following table summarizes their attribute statements configuration.

Name	Name Format	Format	Example
https://aws.amazon.com/SAML/Attributes/Role	Unspecified	`arn:aws:iam::<yourAWSAccountID>:role/role-name,arn:aws:iam:: <yourAWSAccountID>:saml-provider/provider-name`	`arn:aws:iam::112034567890:role/oktarole,arn:aws:iam::112034567890:saml-provider/oktaidp`
https://aws.amazon.com/SAML/Attributes/RoleSessionName	Unspecified	`user.login`	`user.login`
https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbRoles	Unspecified	`String.join(":", isMemberOfGroupName("group1") ? 'group1' : '', isMemberOfGroupName("group2") ? 'group2' : '')`	`String.join(":", isMemberOfGroupName("sales") ? 'sales' : '', isMemberOfGroupName("finance") ? 'finance' : '')`
https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbUser	Unspecified	`user.login`	`user.login`
https://aws.amazon.com/SAML/Attributes/TransitiveTagKeys	Unspecified	`Arrays.flatten("RedshiftDbUser", "RedshiftDbRoles")`	`Arrays.flatten("RedshiftDbUser", "RedshiftDbRoles")`

After you add the attribute claims, choose Next followed by Finish.

Your attributes should be in similar format as shown in the following screenshot.

Collect Okta information

To gather your Okta information, complete the following steps:

On the Sign On tab, choose View SAML setup instructions.
For Identity Provider Single Sign-on URL, Use this URL when connecting with any third-party SQL client such as SQL Workbench/J.
Use the IdP metadata in block 4 and save the metadata file in .xml format (for example, metadata.xml).

Set up AWS configuration

In this section, we provide the steps to configure your IAM resources.

Create the IAM IdP

To create your IAM IdP, complete the following steps:

On the IAM console, under Access management in the navigation pane, choose Identity providers.
Choose Add provider.
For Provider type¸ select SAML.
For Provider name¸ enter a name.
Choose Choose file and upload the metadata file (.xml) you downloaded earlier.
Choose Add provider.

Create the IAM Amazon Redshift access policy

To create your IAM policy, complete the following steps:

On the IAM console, choose Policies.
Choose Create policy.
On the Create policy page, choose the JSON tab.

For the policy, enter the JSON in following format:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "redshift-serverless:GetCredentials",
            "Resource": "<Workgroup ARN>"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "redshift-serverless:ListWorkgroups",
            "Resource": "*"
        }
    ]
}

The workgroup ARN is available on the Redshift Serverless workgroup configuration page.

The following example policy includes only a single Redshift Serverless workgroup; you can modify the policy to include multiple workgroups in the Resource section:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "redshift-serverless:GetCredentials",
            "Resource": "arn:aws:redshift-serverless:us-west-2:123456789012:workgroup/4a4f12vc-123b-2d99-fd34-a12345a1e87f"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "redshift-serverless:ListWorkgroups",
            "Resource": "*"
        }
    ]
}

Choose Next: Tags.
Choose Next: Review.
In the Review policy section, for Name, enter the name of your policy; for example, OktaRedshiftPolicy.
For Description, you can optionally enter a brief description of what the policy does.
Choose Create policy.

Create the IAM role

To create your IAM role, complete the following steps:

On the IAM console, choose Roles in the navigation pane.
Choose Create role.
For Trusted entity type, select SAML 2.0 federation.
For SAML 2.0-based provider, choose the IdP you created earlier.
Select Allow programmatic and AWS Management Console access.
Choose Next.
Choose the policy you created earlier.
Also, add the policy AmazonRedshiftQueryEditorV2ReadSharing.
Choose Next.
In the Review section, for Role Name, enter the name of your role; for example, oktarole.
For Description, you can optionally enter a brief description of what the role does.
Choose Create role.
Navigate to the role that you just created and choose Trust Relationships.
Choose Edit trust policy and choose TagSession under Add actions for STS.

When using session tags, trust policies for all roles connected to the IdP passing tags must have the sts:TagSession permission. For roles without this permission in the trust policy, the AssumeRole operation fails.

Choose Update policy.

The following screenshot shows the role permissions.

The following screenshot shows the trust relationships.

Update the advanced Okta Role Attribute

Complete the following steps:

Switch back to Okta.com.
Navigate to the application which you created earlier.
Navigate to General and click Edit under SAML settings.
Under Attribute Statements (optional), update the value for the attribute – https://aws.amazon.com/SAML/Attributes/Role, using the actual role and identity provider arn values from the above step. For example, arn:aws:iam::123456789012:role/oktarole,arn:aws:iam::123456789012:saml-provider/oktaidp.

Configure Redshift Serverless role-based access

In this step, we create database roles in Amazon Redshift based on the groups that you created in Okta. Make sure the role name matches with the Okta Group name.

Amazon Redshift roles simplify managing privileges required for your end-users. In this post, we create two database roles, sales and finance, and grant them access to query tables with sales and finance data, respectively. You can download this sample SQL Notebook and import into Redshift Query Editor v2 to run all cells in the notebook used in this example. Alternatively, you can copy and enter the SQL into your SQL client.

The following is the syntax to create a role in Redshift Serverless:

create role <IdP groupname>;

For example:

create role sales;
create role finance;

Create the sales and finance database schema:

create schema sales_schema;
create schema finance_schema;

Create the tables:

CREATE TABLE IF NOT EXISTS finance_schema.revenue
(
account INTEGER   ENCODE az64
,customer VARCHAR(20)   ENCODE lzo
,salesamt NUMERIC(18,0)   ENCODE az64
)
DISTSTYLE AUTO
;

insert into finance_schema.revenue values (10001, 'ABC Company', 12000);
insert into finance_schema.revenue values (10002, 'Tech Logistics', 175400);
insert into finance_schema.revenue values (10003, 'XYZ Industry', 24355);
insert into finance_schema.revenue values (10004, 'The tax experts', 186577);

CREATE TABLE IF NOT EXISTS sales_schema.store_sales
(
ID INTEGER   ENCODE az64,
Product varchar(20),
Sales_Amount INTEGER   ENCODE az64
)
DISTSTYLE AUTO
;

Insert into sales_schema.store_sales values (1,'product1',1000);
Insert into sales_schema.store_sales values (2,'product2',2000);
Insert into sales_schema.store_sales values (3,'product3',3000);
Insert into sales_schema.store_sales values (4,'product4',4000);

The following is the syntax to grant permission to the Redshift Serverless role:

GRANT { { SELECT | INSERT | UPDATE | DELETE | DROP | REFERENCES } [,...]| ALL [ PRIVILEGES ] } ON { [ TABLE ] table_name [, ...] | ALL TABLES IN SCHEMA schema_name [, ...] } TO role <IdP groupname>;

Grant relevant permission to the role as per your requirements. In the following example, we grant full permission to the role sales on sales_schema and only select permission on finance_schema to the role finance:

grant usage on schema sales_schema to role sales;
grant select on all tables in schema sales_schema to role sales;

grant usage on schema finance_schema to role finance;
grant select on all tables in schema finance_schema to role finance;

Federate to Redshift Serverless using Query Editor V2

The RedshiftDbRoles principal tag and DBGroups are both mechanisms that can be used to integrate with an IdP. However, federating with the RedshiftDbRoles principal has some clear advantages when it comes to connecting with an IdP because it provides automatic mapping between IdP groups and Amazon Redshift database roles. Overall, RedshiftDbRoles is more flexible, easier to manage, and more secure, making it the better option for integrating Amazon Redshift with your IdP.

Now you’re ready to connect to Redshift Serverless using the Query Editor V2 and federated login:

Use the SSO URL you collected earlier and log in to your Okta account with your user credentials. For this demo, we log in with user Ethan.
In the Query Editor v2, choose your Redshift Serverless instance (right-click) and choose Create connection.
For Authentication, select Federated user.
For Database, enter the database name you want to connect to.
Choose Create Connection.

User Ethan will be able to access sales_schema tables. If Ethan tries to access the tables in finance_schema, he will get a permission denied error.

Configure the SQL client (SQL Workbench/J)

To set up SQL Workbench/J, complete the following steps:

Create a new connection in SQL Workbench/J and choose Redshift Serverless as the driver.
Choose Manage drivers and add all the files from the downloaded AWS JDBC driver pack .zip file (remember to unzip the .zip file).
For Username and Password, enter the values that you set in Okta.
Capture the values for app_id, app_name, and idp_host from the Okta app embed link, which can be found on the General tab of your application.
Set the following extended properties:
- For app_id, enter the value from app embed link (for example, 0oa8p1o1RptSabT9abd0/avc8k7abc32lL4izh3b8).
- For app_name, enter the value from app embed link (for example, dev-123456_redshift_app_2).
- For idp_host, enter the value from app embed link (for example, dev-123456.okta.com).
- For plugin_name, enter com.amazon.redshift.plugin.OktaCredentialsProvider. The following screenshot shows the SQL Workbench/J extended properties.
  1. Choose OK.
  2. Choose Test from SQL Workbench/J to test the connection.
  3. When the connection is successful, choose OK.
  4. Choose OK to sign in with the users created.

User Ethan will be able to access the sales_schema tables. If Ethan tries to access the tables in the finance_schema, he will get a permission denied error.

Congratulations! You have federated with Redshift Serverless and Okta with SQL Workbench/J using RedshiftDbRoles.

[Optional] Implement MFA with SQL Client and Query Editor V2

Implementing MFA poses an additional challenge because the nature of multi-factor authentication is an asynchronous process between initiating the login (the first factor) and completing the login (the second factor). The SAML response will be returned to the appropriate listener in each scenario; the SQL Client or the AWS console in the case of QEV2. Depending on which login options you will be giving your users, you may need an additional Okta application. See below for the different scenarios:

If you are ONLY using QEV2 and not using any other SQL client, then you can use MFA with Query Editor V2 with the above application. There are no changes required in the custom SAML application which we have created above.
If you are NOT using QEV2 and only using third party SQL client (SQL Workbench/J etc), then you need to modify the above custom SAML app as mentioned below.
If you want to use QEV2 and third-party SQL Client with MFA, then you need create an additional custom SAML app as mentioned below.

Prerequisites for MFA

Each identity provider (IdP) has step for enabling and managing MFA for your users. In the case of Okta, see the following guides on how to enable MFA using the Okta Verify application and by defining an authentication policy.

Steps to create/update SAML application which supports MFA for a SQL Client

If creating a second app, follow all the steps which are described under section 1 (Create Okta SAML application).
Open the custom SAML app and select General.
Select Edit under SAML settings
Click Next in General Settings
Under General, update the Single sign-on URL to http://localhost:7890/redshift/
Select Next followed by Finish.

Below is the screenshot from the MFA App after making above changes:

Configure SQL Client for MFA

To set up SQL Workbench/J, complete the following steps:

Follow all the steps which are described under (Configure the SQL client (SQL Workbench/J))
Modify your connection updating the extended properties:
- login_url – Get the Single Sign-on URL as shown in section -Collect Okta information. (For example, https://dev-123456.okta.com/app/dev-123456_redshiftapp_2/abc8p6o5psS6xUhBJ517/sso/saml)
- plugin_name – com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
Choose OK
Choose OK from SQL Workbench/J. You’re redirected to the browser to sign in with your Okta credentials.
After that, you will get prompt for MFA. Choose either Enter a code or Get a push notification.
Once authentication is successful, log in to be redirected to a page showing the connection as successful.
With this connection profile, run the following query to return federated user name.

Troubleshooting

If your connection didn’t work, consider the following:

Enable logging in the driver. For instructions, see Configure logging.
Make sure to use the latest Amazon Redshift JDBC driver version.
If you’re getting errors while setting up the application on Okta, make sure you have admin access.
If you can authenticate via the SQL client but get a permission issue or can’t see objects, grant the relevant permission to the role, as detailed earlier in this post.

Clean up

When you’re done testing the solution, clean up the resources to avoid incurring future charges:

Delete the Redshift Serverless instance by deleting both the workgroup and the namespace.
Delete the IAM roles, IAM IdPs, and IAM policies.

Conclusion

In this post, we provided step-by-step instructions to integrate Redshift Serverless with Okta using the Amazon Redshift Query Editor V2 and SQL Workbench/J with the help of federated IAM roles and automatic database-role mapping. You can use a similar setup with any other SQL client (such as DBeaver or DataGrip) or business intelligence tool (such as Tableau Desktop). We also showed how Okta group membership is mapped automatically with Redshift Serverless roles to use role-based authentication seamlessly.

For more information about Redshift Serverless single sign-on using database roles, see Defining database roles to grant to federated users in Amazon Redshift Serverless.

About the Authors

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Debu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Mohamed Shaaban is a Senior Software Engineer in Amazon Redshift and is based in Berlin, Germany. He has over 12 years of experience in the software engineering. He is passionate about cloud services and building solutions that delight customers. Outside of work, he is an amateur photographer who loves to explore and capture unique moments.

Rajiv Gupta is Sr. Manager of Analytics Specialist Solutions Architects based out of Irvine, CA. He has 20+ years of experience building and managing teams who build data warehouse and business intelligence solutions.

Amol Mhatre is a Database Engineer in Amazon Redshift and works on Customer & Partner engagements. Prior to Amazon, he has worked on multiple projects involving Database & ERP implementations.

Ning Di is a Software Development Engineer at Amazon Redshift, driven by a genuine passion for exploring all aspects of technology.

Harsha Kesapragada is a Software Development Engineer for Amazon Redshift with a passion to build scalable and secure systems. In the past few years, he has been working on Redshift Datasharing, Security and Redshift Serverless.

How Encored Technologies built serverless event-driven data pipelines with AWS

2023-05-04 Younggu Yun

Post Syndicated from Younggu Yun original https://aws.amazon.com/blogs/big-data/how-encored-technologies-built-serverless-event-driven-data-pipelines-with-aws/

This post is a guest post co-written with SeonJeong Lee, JaeRyun Yim, and HyeonSeok Yang from Encored Technologies.

Encored Technologies (Encored) is an energy IT company in Korea that helps their customers generate higher revenue and reduce operational costs in renewable energy industries by providing various AI-based solutions. Encored develops machine learning (ML) applications predicting and optimizing various energy-related processes, and their key initiative is to predict the amount of power generated at renewable energy power plants.

In this post, we share how Encored runs data engineering pipelines for containerized ML applications on AWS and how they use AWS Lambda to achieve performance improvement, cost reduction, and operational efficiency. We also demonstrate how to use AWS services to ingest and process GRIB (GRIdded Binary) format data, which is a file format commonly used in meteorology to store and exchange weather and climate data in a compressed binary form. It allows for efficient data storage and transmission, as well as easy manipulation of the data using specialized software.

Business and technical challenge

Encored is expanding their business into multiple countries to provide power trading services for end customers. The amount of data and the number of power plants they need to collect data are rapidly increasing over time. For example, the volume of data required for training one of the ML models is more than 200 TB. To meet the growing requirements of the business, the data science and platform team needed to speed up the process of delivering model outputs. As a solution, Encored aimed to migrate existing data and run ML applications in the AWS Cloud environment to efficiently process a scalable and robust end-to-end data and ML pipeline.

Solution overview

The primary objective of the solution is to develop an optimized data ingestion pipeline that addresses the scaling challenges related to data ingestion. During its previous deployment in an on-premises environment, the time taken to process data from ingestion to preparing the training dataset exceeded the required service level agreement (SLA). One of the input datasets required for ML models is weather data supplied by the Korea Meteorological Administration (KMA). In order to use the GRIB datasets for the ML models, Encored needed to prepare the raw data to make it suitable for building and training ML models. The first step was to convert GRIB to the Parquet file format.

Encored used Lambda to run an existing data ingestion pipeline built in a Linux-based container image. Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda runs your code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, and logging. AWS Lambda is triggered to ingest and process GRIB data files when they are uploaded to Amazon Simple Storage Service (Amazon S3). Once the files are processed, they are stored in Parquet format in the other S3 bucket. Encored receives GRIB files throughout the day, and whenever new files arrive, an AWS Lambda function runs a container image registered in Amazon Elastic Container Registry (ECR). This event-based pipeline triggers a customized data pipeline that is packaged in a container-based solution. Leveraging Amazon AWS Lambda, this solution is cost-effective, scalable, and high-performing.Encored uses Python as their preferred language.

The following diagram illustrates the solution architecture.

For data-intensive tasks such as extract, transform, and load (ETL) jobs and ML inference, Lambda is an ideal solution because it offers several key benefits, including rapid scaling to meet demand, automatic scaling to zero when not in use, and S3 event triggers that can initiate actions in response to object-created events. All this contributes to building a scalable and cost-effective data event-driven pipeline. In addition to these benefits, Lambda allows you to configure ephemeral storage (/tmp) between 512–10,240 MB. Encored used this storage for their data application when reading or writing data, enabling them to optimize performance and cost-effectiveness. Furthermore, Lambda’s pay-per-use pricing model means that users only pay for the compute time in use, making it a cost-effective solution for a wide range of use cases.

Prerequisites

For this walkthrough, you should have the following:

An AWS account
The AWS Command Line Interface (AWS CLI) installed
The Docker CLI
Your function codes

Build your application required for your Docker image

The first step is to develop an application that can ingest and process files. This application reads the bucket name and object key passed from a trigger added to Lambda function. The processing logic involves three parts: downloading the file from Amazon S3 into ephemeral storage (/tmp), parsing the GRIB formatted data, and saving the parsed data to Parquet format.

The customer has a Python script (for example, app.py) that performs these tasks as follows:

import os
import tempfile
import boto3
import numpy as np
import pandas as pd
import pygrib

s3_client = boto3.client('s3')
def handler(event, context):
    try:
        # Get trigger file name
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]

        # Handle temp files: all temp objects are deleted when the with-clause is closed
        with tempfile.NamedTemporaryFile(delete=True) as tmp_file:
            # Step1> Download file from s3 into temp area
            s3_file_basename = os.path.basename(s3_file_name)
            s3_file_dirname = os.path.dirname(s3_file_name)
            local_filename = tmp_file.name
            s3_client.download_file(
                Bucket=bucket_name,
                Key=f"{s3_file_dirname}/{s3_file_basename}",
                Filename=local_filename
            )

            # Step2> Parse – GRIB2 
            grbs = pygrib.open(local_filename)
            list_of_name = []
            list_of_values = []
            for grb in grbs:
                list_of_name.append(grb.name)
                list_of_values.append(grb.values)
            _, lat, lon = grb.data()
            list_of_name += ["lat", "lon"]
            list_of_values += [lat, lon]
            grbs.close()

            dat = pd.DataFrame(
                np.transpose(np.stack(list_of_values).reshape(len(list_of_values), -1)),
                columns=list_of_name,
            )

        # Step3> To Parquet
        s3_dest_uri = S3path
        dat.to_parquet(s3_dest_uri, compression="snappy")

    except Exception as err:
        print(err)

Prepare a Docker file

The second step is to create a Docker image using an AWS base image. To achieve this, you can create a new Dockerfile using a text editor on your local machine. This Dockerfile should contain two environment variables:

LAMBDA_TASK_ROOT=/var/task
LAMBDA_RUNTIME_DIR=/var/runtime

It’s important to install any dependencies under the ${LAMBDA_TASK_ROOT} directory alongside the function handler to ensure that the Lambda runtime can locate them when the function is invoked. Refer to the available Lambda base images for custom runtime for more information.

FROM public.ecr.aws/lambda/python:3.8

# Install the function's dependencies using file requirements.txt
# from your project folder.

COPY requirements.txt  .
RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

# Copy function code
COPY app.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "app.handler" ]

Build a Docker image

The third step is to build your Docker image using the docker build command. When running this command, make sure to enter a name for the image. For example:

docker build -t process-grib .

In this example, the name of the image is process-grib. You can choose any name you like for your Docker image.

Upload the image to the Amazon ECR repository

Your container image needs to reside in an Amazon Elastic Container Registry (Amazon ECR) repository. Amazon ECR is a fully managed container registry offering high-performance hosting, so you can reliably deploy application images and artifacts anywhere. For instructions on creating an ECR repository, refer to Creating a private repository.

The first step is to authenticate the Docker CLI to your ECR registry as follows:

aws ecr get-login-password --region ap-northeast-2 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com

The second step is to tag your image to match your repository name, and deploy the image to Amazon ECR using the docker push command:

docker tag  hello-world:latest 123456789012.dkr.ecr. ap-northeast-2.amazonaws.com/hello-world:latest
docker push 123456789012.dkr.ecr. ap-northeast-2.amazonaws.com/hello-world:latest

Deploy Lambda functions as container images

To create your Lambda function, complete the following steps:

On the Lambda console, choose Functions in the navigation pane.
Choose Create function.
Choose the Container image option.
For Function name, enter a name.
For Container image URI, provide a container image. You can enter the ECR image URI or browse for the ECR image.
Under Container image overrides, you can override configuration settings such as the entry point or working directory that are included in the Dockerfile.
Under Permissions, expand Change default execution role.
Choose to create a new role or use an existing role.
Choose Create function.

Key considerations

To handle a large amount of data concurrently and quickly, Encored needed to store GRIB formatted files in the ephemeral storage (/tmp) that comes with Lambda. To achieve this requirement, Encored used tempfile.NamedTemporaryFile, which allows users to create temporary files easily that are deleted when no longer needed. With Lambda, you can configure ephemeral storage between 512 MB–10,240 MB for reading or writing data, allowing you to run ETL jobs, ML inference, or other data-intensive workloads.

Business outcome

Hyoseop Lee (CTO at Encored Technologies) said, “Encored has experienced positive outcomes since migrating to AWS Cloud. Initially, there was a perception that running workloads on AWS would be more expensive than using an on-premises environment. However, we discovered that this was not the case once we started running our applications on AWS. One of the most fascinating aspects of AWS services is the flexible architecture options it provides for processing, storing, and accessing large volumes of data that are only required infrequently.”

Conclusion

In this post, we covered how Encored built serverless data pipelines with Lambda and Amazon ECR to achieve performance improvement, cost reduction, and operational efficiency.

Encored successfully built an architecture that will support their global expansion and enhance technical capabilities through AWS services and the AWS Data Lab program. Based on the architecture and various internal datasets Encored has consolidated and curated, Encored plans to provide renewable energy forecasting and energy trading services.

Thanks for reading this post and hopefully you found it useful. To accelerate your digital transformation with ML, AWS is available to support you by providing prescriptive architectural guidance on a particular use case, sharing best practices, and removing technical roadblocks. You’ll leave the engagement with an architecture or working prototype that is custom fit to your needs, a path to production, and deeper knowledge of AWS services. Please contact your AWS Account Manager or Solutions Architect to get started. If you don’t have an AWS Account Manager, please contact Sales.

To learn more about ML inference use cases with Lambda, check out the following blog posts:

These resources will provide you with valuable insights and practical examples of how to use Lambda for ML inference.

About the Authors

SeonJeong Lee is the Head of Algorithms at Encored. She is a data practitioner who finds peace of mind from beautiful codes and formulas.

JaeRyun Yim is a Senior Data Scientist at Encored. He is striving to improve both work and life by focusing on simplicity and essence in my work.

HyeonSeok Yang is the platform team lead at Encored. He always strives to work with passion and spirit to keep challenging like a junior developer, and become a role model for others.

Younggu Yun works at AWS Data Lab in Korea. His role involves helping customers across the APAC region meet their business objectives and overcome technical challenges by providing prescriptive architectural guidance, sharing best practices, and building innovative solutions together.

Build efficient, cross-Regional, I/O-intensive workloads with Dask on AWS

2023-05-04 Patrick O'Connor

Post Syndicated from Patrick O'Connor original https://aws.amazon.com/blogs/big-data/build-efficient-cross-regional-i-o-intensive-workloads-with-dask-on-aws/

Welcome to the era of data. The sheer volume of data captured daily continues to grow, calling for platforms and solutions to evolve. Services such as Amazon Simple Storage Service (Amazon S3) offer a scalable solution that adapts yet remains cost-effective for growing datasets. The Amazon Sustainability Data Initiative (ASDI) uses the capabilities of Amazon S3 to provide a no-cost solution for you to store and share climate science workloads across the globe. Amazon’s Open Data Sponsorship Program allows organizations to host free of charge on AWS.

Over the last decade, we’ve seen a surge in data science frameworks coming to fruition, along with mass adoption by the data science community. One such framework is Dask, which is powerful for its ability to provision an orchestration of worker compute nodes, thereby accelerating complex analysis on large datasets.

In this post, we show you how to deploy a custom AWS Cloud Development Kit (AWS CDK) solution that extends Dask’s functionality to work inter-Regionally across Amazon’s global network. The AWS CDK solution deploys a network of Dask workers across two AWS Regions, connecting into a client Region. For more information, refer to Guidance for Distributed Computing with Cross Regional Dask on AWS and the GitHub repo for open-source code.

After deployment, the user will have access to a Jupyter notebook, where they can interact with two datasets from ASDI on AWS: Coupled Model Intercomparison Project 6 (CMIP6) and ECMWF ERA5 Reanalysis. CMIP6 focuses on the sixth phase of global coupled ocean-atmosphere general circulation model ensemble; ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, and the first reanalysis produced as an operational service.

This solution was inspired by work with a key AWS customer, the UK Met Office. The Met Office was founded in 1854 and is the national meteorological service for the UK. They provide weather and climate predictions to help you make better decisions to stay safe and thrive. A collaboration between the Met Office and EUMETSAT, detailed in Data Proximate Computation on a Dask Cluster Distributed Between Data Centres, highlights the growing need to develop a sustainable, efficient, and scalable data science solution. This solution achieves this by bringing compute closer to the data, rather than forcing the data to come closer to compute resources, which adds cost, latency, and energy.

Solution overview

Each day, the UK Met Office produces up to 300 TB of weather and climate data, a portion of which is published to ASDI. These datasets are distributed across the world and hosted for public use. The Met Office would like to enable consumers to make the more of their data to help inform critical decisions on addressing issues such as better preparation for climate change-induced wildfires and floods, and reducing food insecurity through better crop yield analysis.

Traditional solutions in use today, particularly with climate data, are time consuming and unsustainable, replicating datasets cross Regions. Unnecessary data transfer on the petabyte scale is costly, slow, and consumes energy.

We estimated that if this practice were adopted by the Met Office users, the equivalent of 40 homes’ daily power consumption could be saved every day, and they could also reduce the transfer of data between regions.

The following diagram illustrates the solution architecture.

The solution can be broken into three major segments: client, workers, and network. Let’s dive into each and see how they come together.

Client

The client represents the source Region where data scientists connect. This Region (Region A in the diagram) contains an Amazon SageMaker notebook, an Amazon OpenSearch Service domain, and a Dask scheduler as key components. System administrators have access to the built-in Dask dashboard exposed via an Elastic Load Balancer.

Data scientists have access to the Jupyter notebook hosted on SageMaker. The notebook is able to connect and run workloads on the Dask scheduler. The OpenSearch Service domain stores metadata on the datasets connected at the Regions. Notebook users can query this service to retrieve details such as the correct Region of Dask workers without needing to know the data’s Regional location beforehand.

Worker

Each of the worker Regions (Regions B and C in the diagram) is comprised of an Amazon Elastic Container Service (Amazon ECS) cluster of Dask workers, an Amazon FSx for Lustre file system, and a standalone Amazon Elastic Compute Cloud (Amazon EC2) instance. FSx for Lustre allows Dask workers to access and process Amazon S3 data from a high-performance file system by linking your file systems to S3 buckets. It provides sub-millisecond latencies, up to hundreds of GBs/s of throughput, and millions of IOPS. A key feature of Lustre is that only the file system’s metadata is synced. Lustre manages the balance of files to be loaded in and kept warm, based on demand.

Worker clusters scale based on CPU usage, provision additional workers in extended periods of demand, and scale down as resources become idle.

Each night at 0:00 UTC, a data sync job prompts the Lustre file system to resync with the attached S3 bucket, and pulls an up-to-date metadata catalog of the bucket. Subsequently, the standalone EC2 instance pushes these updates into OpenSearch Service respective to that Region’s index. OpenSearch Service provides the necessary information to the client as to which pool of workers should be called upon for a particular dataset.

Network

Networking forms the crux of this solution, utilizing Amazon’s internal backbone network. By using AWS Transit Gateway, we’re able to connect each of the Regions to each other without needing to traverse the public internet. Each of the workers are able to connect dynamically into the Dask scheduler, allowing data scientists to run inter-regional queries through Dask.

Prerequisites

The AWS CDK package uses the TypeScript programming language. Follow the steps in Getting Started for AWS CDK to set up your local environment and bootstrap your development account (you’ll need to bootstrap all Regions specified in the GitHub repo).

For a successful deployment, you’ll need Docker installed and running on your local machine.

Deploy the AWS CDK package

Deploying an AWS CDK package is straightforward. After you install the prerequisites and bootstrap your account, you can proceed with downloading the code base.

Download the GitHub repository:

# Command to clone the repository
git clone https://github.com/aws-solutions-library-samples/distributed-compute-on-aws-with-cross-regional-dask.git
cd distributed-compute-on-aws-with-cross-regional-dask

Install node modules:
```
npm install
```
Deploy the AWS CDK:
```
npx cdk deploy --all
```

The stack can take over an hour and a half to deploy.

Code walkthrough

In this section, we inspect some of the key features of the code base. If you’d like to inspect the full code base, refer to the GitHub repository.

Configure and customize your stack

In the file bin/variables.ts, you’ll find two variable declarations: one for the client and one for workers. The client declaration is a dictionary with a reference to a Region and CIDR range. Customizing these variables will change both the Region and CIDR range of where client resources will deploy.

The worker variable copies this same functionality; however, it’s a list of dictionaries to accommodate adding or subtracting datasets the user wishes to include. Additionally, each dictionary contains the added fields of dataset and lustreFileSystemPath. Dataset is used to specify the connecting S3 URI for Lustre to connect to. The lustreFileSystemPath variable is used as a mapping for how the user wants that dataset to map locally on the worker file system. See the following code:

export const client: IClient = { region: "eu-west-2", cidr: "10.0.0.0/16" };

export const workers: IWorker[] = [
  {
    region: "us-east-1",
    cidr: "10.1.0.0/16",
    // The public s3 dataset on https://registry.opendata.aws/ you wish to connect to
    dataset: "s3://era5-pds",
    lustreFileSystemPath: "era5-pds",
  },
...]

Dynamically publish the scheduler IP

A challenge inherent to the cross-Regional nature of this project was maintaining a dynamic connection between the Dask workers and the scheduler. How could we publish an IP address, which is capable of changing, across AWS Regions? We were able to accomplish this through the use of AWS Cloud Map and associate-vpc-with-hosted-zone. The service abstracts allowing AWS to manage this DNS namespace privately. See the following code:

    /**
     * Below we initialise a private namespace which will keep track of the changing schedulers IP
     * The workers will need this IP to connect to, so instead of tracking it statically, they can
     * Simply reference the DNS which will resolve to the IP every time
     */
    const PrivateNP = new PrivateDnsNamespace(this, "local-dask", {
      name: "local-dask",
      vpc: this.vpc,
    });
    // Other regions will have to associate-vpc-with-hosted-zone to access this namespace
    new StringParameter(this, "PrivateNP Param", {
      parameterName: `privatenp-hostedid-param-${this.region}`,
      stringValue: PrivateNP.namespaceHostedZoneId,
    });
    this.schedulerDisovery = new Service(this, "Scheduler Discovery", {
      name: "Dask-Scheduler",
      namespace: PrivateNP,
    });

Jupyter notebook UI

The Jupyter notebook hosted on SageMaker provides scientists with a ready-made environment for deployment to easily connect and experiment on the loaded datasets. We used a lifecycle configuration script to provision the notebook with a preconfigured developer environment and example code base. See the following code:

  // The Sagemaker Notebook
  new CfnNotebookInstance(this, "Dask Notebook", {
    notebookInstanceName: "Dask-Notebook",
    rootAccess: "Disabled",
    directInternetAccess: "Disabled",
    defaultCodeRepository: repo.repositoryCloneUrlHttp,
    instanceType: "ml.t3.2xlarge",
    roleArn: role.roleArn,
    subnetId: this.vpc.privateSubnets[0].subnetId,
    securityGroupIds: [SagemakerSec.securityGroupId],
    lifecycleConfigName: lifecycle.notebookInstanceLifecycleConfigName,
    kmsKeyId: nbKey.keyId,
    platformIdentifier: "notebook-al2-v1",
    volumeSizeInGb: 50,
  });

Dask worker nodes

When it comes to the Dask workers, greater customizability is provided, more specifically on instance type, threads per container, and scaling alarms. By default, the workers provision on instance type m5d.4xlarge, mount to the Lustre file system on launch, and subdivide its workers and threads dynamically to ports. All this is optionally customizable. See the following code:

capacity: {
  instanceType: new InstanceType("m5d.4xlarge"),
  minCapacity: 0,
  maxCapacity: 12,
  vpcSubnets: {
    subnetType: SubnetType.PRIVATE_WITH_EGRESS,
  },
},

command: [
  "bin/sh",
  "-c",
  `pip3 install --upgrade xarray[complete] intake_esm s3fs eccodes git+https://github.com/gjoseph92/dask-worker-pools.git@main && dask worker Dask-Scheduler.local-dask:8786 --worker-port 9000:${
    9000 + NWORKERS - 1
  } --nanny-port ${9000 + NWORKERS}:${
    9000 + NWORKERS * 2 - 1
  } --resources pool-${
    this.region
  }=1 --nworkers ${NWORKERS} --nthreads ${THREADS} --no-dashboard`,
],

Performance

To assess performance, we use a sample computation and plotting of air temperature at 2 meters based on the difference between CMIP6 prediction for a month and ERA5 mean air temperature for 10 years. We set a benchmark of two workers in each Region and assess the difference in time reduction as additional workers were added. In theory, as the solution scales, there should be a productive material difference in reducing overall time.

The following table summarizes our dataset details.

Dataset Variables Disk Size Xarray Dataset Size Region

ERA5 2011–2020 (120 netcdf files) 53.5GB 364.1 GB us-east-1

CMIP6

variable_ids = ['tas'] # tas is air temperature at 2m above surface
table_id = 'Amon' # Monthly data from Atmosphere 
grid = 'gn' 
experiment_id = 'ssp245' 
activity_ids = ['ScenarioMIP', 'CMIP'] 
institution_id = 'MOHC'

1.13GB

0.11 GB

us-west-2

The following table shows the results collected, showcasing the time (in seconds) for each computation and prediction in three stages in computing CMIP6 prediction, ERA5, and difference.

.	.	Number of Workers
Compute	Region	2(CMIP) + 2(ERA)	2(CMIP) + 4(ERA)	2(CMIP) + 8(ERA)	2(CMIP) + 12(ERA)
CMIP6 (`predicted_tas_regridded`)	us-west-2	11.8	11.5	11.2	11.6
ERA5 (`historic_temp_regridded`)	us-east-1	1512	711	427	202
Difference (`propogated pool`)	us-west-2 and us-east-1	1527	906	469	251

The following graph visualizes the performance and scale.

From our experiment, we observed a linear improvement on computation for the ERA5 dataset as the number of workers increased. As the numbers of workers increased, computation times were at times halved.

Jupyter notebook

As part of the solution launch, we deploy a preconfigured Jupyter notebook to help test the cross-Regional Dask solution. The notebook demonstrates the removed worry of needing to know the Regional location of datasets, instead querying a catalog through a series of Jupyter notebooks running in the background.

To get started, follow the instructions in this section.

The code for the notebooks can be found in lib/SagemakerCode with the primary notebook being ux_notebook.ipynb. This notebook calls upon other notebooks, triggering helper scripts. ux_notebook is designed to be the entry point for scientists, without the need for going elsewhere.

To get started, open this notebook in SageMaker after you have deployed the AWS CDK. The AWS CDK creates a notebook instance with all of the files in the repository loaded and backed up to an AWS CodeCommit repository.

To run the application, open and run the first cell of ux_notebook. This cell runs the get_variables notebook in the background, which prompts you for an input for the data you would like to select. We include an example; however, note that questions will only appear after the previous option has been selected. This is intentional in limiting the drop-down choices and is optionally configurable by editing the get_variables notebook.

The preceding code stores variables globally so that other notebooks can retrieve and load your selection of choices. For demonstration, the next cell should output the save variables from before.

Next, a prompt for further data specifications appears. This cell refines the data you’re after by presenting the IDs of tables in human-readable format. Users select as if it were a form, but the titles map to tables in the background that help the system retrieve the appropriate datasets.

After you have stored all your choices and selection cells, load the data into the Regions by running the cell in the Getting the data set section. The %%capture command will suppress unnecessary outputs from the get_data notebook. Note you may remove this to inspect outputs from the other notebooks. Data is then retrieved in the backend.

While other notebooks are being run in the background, the only touchpoint for the user is the ux_notebook. This is to abstract the tedious process of importing data into a format any user is able to follow with ease.

With the data now loaded, we can start interacting with it. The following cells are examples of calculations you may run on weather data. Using xarrays, we import, calculate, and then plot those datasets.

Our sample illustrates a plot of predictive data retrieving data, running the computation, and plotting the results in under 7.5 seconds—orders of magnitude faster than a typical approach.

Under the hood

The notebooks get_catalog_input and get_variables use the library ipywidgets to display widgets such as drop-downs and multi-box selections. These options are saved globally using the %%store command so that they can be accessed from the ux_notebook. One of the options prompts you on whether you want historical data, predictive data, or both. This variable is passed to the get_data notebook to determine which subsequent notebooks to run.

The get_data notebook first retrieves the shared OpenSearch Service domain saved to AWS Systems Manager Parameter Store. This domain allows our notebook to run a query on collecting information that will indicate where the selected datasets are stored Regionally. With those datasets located Regionally, the notebook will make a connection attempt to the Dask scheduler, passing the information collected from OpenSearch Service. The Dask scheduler in turn will be able to call on workers in the correct Regions.

How to customize and continue development

These notebooks are meant to be an example of how you can create a way for users to interface and interact with the data. The notebook in this post serves as an illustration for what’s possible, and we invite you to continue building upon the solution to further improve user engagement. The core part of this solution is the backend technology, but without some mechanism to interact with that backend, users won’t realize the full potential of the solution.

Clean up

To avoid incurring future charges, delete the resources. Let’s destroy our deployed solution with the following command:

npx cdk destroy –all

Conclusion

This post showcases the extension of Dask inter-Regionally on AWS, and a possible integration with public datasets on AWS. The solution was built as a generic pattern, and further datasets can be loaded in to accelerate high I/O analyses on complex data.

Data is transforming every field and every business. However, with data growing faster than most companies can keep track of, collecting data and getting value out of that data is challenging. A modern data strategy can help you create better business outcomes with data. AWS provides the most complete set of services for the end-to-end data journey to help you unlock value from your data and turn it into insight.

To learn more about the various ways to use your data on the cloud, visit the AWS Big Data Blog. We further invite you to comment with your thoughts on this post, and whether this is a solution you plan on trying out.

About the Authors

Patrick O’Connor is a WWSO Prototyping Engineer based in London. He is a creative problem-solver, adaptable across a wide range of technologies, such as IoT, serverless tech, 3D spatial tech, and ML/AI, along with a relentless curiosity on how technology can continue to evolve everyday approaches.

Chakra Nagarajan is a Principal Machine Learning Prototyping SA with 21 years of experience in machine learning, big data, and high-performance computing. In his current role, he helps customers solve real-world complex business problems by building prototypes with end-to-end AI/ML solutions in cloud and edge devices. His ML specialization includes computer vision, natural language processing, time series forecasting, and personalization.

Val Cohen is a senior WWSO Prototyping Engineer based in London. A problem solver by nature, Val enjoys writing code to automate processes, build customer obsessed tools, and create infrastructure for various applications for her global customer base. Val has experience across a wide variety of technologies, such as front-end web development, backend work, and AI/ML.

Niall Robinson is Head of product futures at the UK Met Office. He and his team explore new ways the Met Office can provide value through product innovation and strategic partnerships. He’s had a varied career, leading a multidisciplinary informatics R&D team, academic research in data science, and field scientist along with climate modeler expertise.

Improve reliability and reduce costs of your Apache Spark workloads with vertical autoscaling on Amazon EMR on EKS

2023-05-04 Rajkishan Gunasekaran

Post Syndicated from Rajkishan Gunasekaran original https://aws.amazon.com/blogs/big-data/improve-reliability-and-reduce-costs-of-your-apache-spark-workloads-with-vertical-autoscaling-on-amazon-emr-on-eks/

Amazon EMR on Amazon EKS is a deployment option offered by Amazon EMR that enables you to run Apache Spark applications on Amazon Elastic Kubernetes Service (Amazon EKS) in a cost-effective manner. It uses the EMR runtime for Apache Spark to increase performance so that your jobs run faster and cost less.

Apache Spark allows you to configure the amount of Memory and vCPU cores that a job will utilize. However, tuning these values is a manual process that can be complex and ripe with pitfalls. For example, allocating too little memory can result in out-of-memory exceptions and poor job reliability. On the other hand, too much can result in over-spending on idle resources, poor cluster utilization and high costs. Moreover, it’s hard to right-size these settings for some use cases such as interactive analytics due to lack of visibility into future requirements. In the case of recurring jobs, keeping these settings up to date taking into account changing load patterns (due to external seasonal factors for instance) remains a challenge.

To address this, Amazon EMR on EKS has recently announced support for vertical autoscaling, a feature that uses the Kubernetes Vertical Pod Autoscaler (VPA) to automatically tune the memory and CPU resources of EMR Spark applications to adapt to the needs of the provided workload, simplifying the process of tuning resources and optimizing costs for these applications. You can use vertical autoscaling’s ability to tune resources based on historic data to keep memory and CPU settings up to date even when the profile of the workload varies over time. Additionally, you can use its ability to react to real-time signals to enable applications recover from out-of-memory (OOM) exceptions, helping improve job reliability.

Vertical autoscaling vs existing autoscaling solutions

Vertical autoscaling complements existing Spark autoscaling solutions such as Dynamic Resource Allocation (DRA) and Kubernetes autoscaling solutions such as Karpenter.

Features such as DRA typically work on the horizontal axis, where an increase in load results in an increase in the number of Kubernetes pods that will process the load. In the case of Spark, this results in data being processed across additional executors. When DRA is enabled, Spark starts with an initial number of executors and scales this up if it observes that there are tasks sitting and waiting for executors to run on. DRA works at the pod level and would need an underlying cluster-level auto-scaler such as Karpenter to bring in additional nodes or scale down unused nodes in response to these pods getting created and deleted.

However, for a given data profile and a query plan, sometimes the parallelism and the number of executors can’t be easily changed. As an example, if you’re attempting to join two tables that store data already sorted and bucketed by the join keys, Spark can efficiently join the data by using a fixed number of executors that equals the number of buckets in the source data. Since the number of executors cannot be changed, vertical autoscaling can help here by offering additional resources or scaling down unused resources at the executor level. This has a few advantages:

If a pod is optimally sized, the Kubernetes scheduler can efficiently pack in more pods in a single node, leading to better utilization of the underlying cluster.
The Amazon EMR on EKS uplift is charged based on the vCPU and memory resources consumed by a Kubernetes pod.This means an optimally sized pod is cheaper.

How vertical autoscaling works

Vertical autoscaling is a feature that you can opt into at the time of submitting an EMR on EKS job. When enabled, it uses VPA to track the resource utilization of your EMR Spark jobs and derive recommendations for resource assignments for Spark executor pods based on this data. The data, fetched from the Kubernetes Metric Server, feeds into statistical models that VPA constructs in order to build recommendations. When new executor pods spin up belonging to a job that has vertical autoscaling enabled, they’re autoscaled based on this recommendation, ignoring the usual sizing done via Spark’s executor memory configuration (controlled by the spark.executor.memory Spark setting).

Vertical autoscaling does not impact pods that are running, since in-place resizing of pods remains unsupported as of Kubernetes version 1.26, the latest supported version of Kubernetes on Amazon EKS as of this writing. However, it’s useful in the case of a recurring job where we can perform autoscaling based on historic data as well as scenarios when some pods go out-of-memory and get re-started by Spark, where vertical autoscaling can be used to selectively scale up the re-started pods and facilitate automatic recovery.

Data tracking and recommendations

To recap, vertical autoscaling uses VPA to track resource utilization for EMR jobs. For a deep-dive into the functionality, refer to the VPA Github repo. In short, vertical autoscaling sets up VPA to track the container_memory_working_set_bytes metric for the Spark executor pods that have vertical autoscaling enabled.

Real-time metric data is fetched from the Kubernetes Metric Server. By default, vertical autoscaling tracks the peak memory working set size for each pod and makes recommendations based on the p90 of the peak with a 40% safety margin added in. It also listens to pod events such as OOM events and reacts to these events. In the case of OOM events, VPA automatically bumps up the recommended resource assignment by 20%.

The statistical models, which also represent historic resource utilization data are stored as custom resource objects on your EKS cluster. This means that deleting these objects also purges old recommendations.

Customized recommendations through job signature

One of the major use-cases of vertical autoscaling is to aggregate usage data across different runs of EMR Spark jobs to derive resource recommendations. To do so, you need to provide a job signature. This can be a unique name or identifier that you configure at the time of submitting your job. If your job recurs at a fixed schedule (such as daily or weekly), it’s important that your job signature doesn’t change for each new instance of the job in order for VPA to aggregate and compute recommendations across different runs of the job.

A job signature can be the same even across different jobs if you believe they’ll have similar resource profiles. You can therefore use the signature to combine tracking and resource modeling across different jobs that you expect to behave similarly. Conversely, if a job’s behavior is changing at some point in time, such as due to a change in the upstream data or the query pattern, you can easily purge the old recommendations by either changing your signature or deleting the VPA custom resource for this signature (as explained later in this post).

Monitoring mode

You can use vertical autoscaling in a monitoring mode where no autoscaling is actually performed. Recommendations are reported to Prometheus if you have that setup on your cluster and you can monitor the recommendations through Grafana dashboards and use that to debug and make manual changes to the resource assignments. Monitoring mode is the default but you can override and use one of the supported autoscaling modes as well at the time of submitting a job. Refer to documentation for usage and a walkthrough on how to get started.

Monitoring vertical autoscaling through kubectl

You can use the Kubernetes command-line tool kubectl to list active recommendations on your cluster, view all the job signatures that are being tracked as well as purge resources associated with signatures that aren’t relevant anymore. In this section, we provide some example code to demonstrate listing, querying, and deleting recommendations.

List all vertical autoscaling recommendations on a cluster

You can use kubectl to get the verticalpodautoscaler resource in order to view the current status and recommendations. The following sample query lists all resources currently active on your EKS cluster:

kubectl get verticalpodautoscalers \
-o custom-columns='NAME:.metadata.name,'\
'SIGNATURE:.metadata.labels.emr-containers\.amazonaws\.com/dynamic\.sizing\.signature,'\
'MODE:.spec.updatePolicy.updateMode,'\
'MEM:.status.recommendation.containerRecommendations[0].target.memory' \
--all-namespaces

This produces output similar to the following

NAME               SIGNATURE           MODE      MEM
ds-<some-id>-vpa   <some-signature>    Off       930143865
ds-<some-id>-vpa   <some-signature>    Initial   14291063673

Query and delete a recommendation

You can also use kubectl to purge recommendation for a job based on the signature. Alternately, you can use the --all flag and skip specifying the signature to purge all the resources on your cluster. Note that in this case you’ll actually be deleting the EMR vertical autoscaling job-run resource. This is a custom resource managed by EMR, deleting it automatically deletes the associated VPA object that tracks and stores recommendations. See the following code:

kubectl delete jobrun -n emr \
-l='emr-containers.amazonaws.com/dynamic.sizing.signature=<some-signature>'
jobrun.dynamicsizing.emr.services.k8s.aws "ds-<some-id>" deleted

You can use the --all and --all-namespaces to delete all vertical autoscaling related resources

kubectl delete jobruns --all --all-namespaces
jobrun.dynamicsizing.emr.services.k8s.aws "ds-<some-id>" deleted

Monitor vertical autoscaling through Prometheus and Grafana

You can use Prometheus and Grafana to monitor the vertical autoscaling functionality on your EKS cluster. This includes viewing recommendations that evolve over time for different job signatures, monitoring the autoscaling functionality etc. For this setup, we assume Prometheus and Grafana are already installed on your EKS cluster using the official Helm charts. If not, refer to the Setting up Prometheus and Grafana for monitoring the cluster section of the Running batch workloads on Amazon EKS workshop to get them up and running on your cluster.

Modify Prometheus to collect vertical autoscaling metrics

Prometheus doesn’t track vertical autoscaling metrics by default. To enable this, you’ll need to start gathering metrics from the VPA custom resource objects on your cluster. This can be easily done by patching your Helm chart with the following configuration:

helm upgrade -f prometheus-helm-values.yaml prometheus prometheus-community/prometheus -n prometheus

Here, prometheus-helm-values.yaml is the vertical autoscaling specific customization that tells Prometheus to gather vertical autoscaling related recommendations from the VPA resource objects, along with the minimal required metadata such as the job’s signature.

You can verify if this setup is working by running the following Prometheus queries for the newly created custom metrics:

kube_customresource_vpa_spark_rec_memory_target
kube_customresource_vpa_spark_rec_memory_lower
kube_customresource_vpa_spark_rec_memory_upper

These represent the lower bound, upper bound and target memory for EMR Spark jobs that have vertical autoscaling enabled. The query can be grouped or filtered using the signature label similar to the following Prometheus query:

kube_customresource_vpa_spark_rec_memory_target{signature=”<some-signature>”}

Use Grafana to visualize recommendations and autoscaling functionality

You can use our sample Grafana dashboard by importing the EMR vertical autoscaling JSON model into your Grafana deployment. The dashboard visualizes vertical autoscaling recommendations alongside the memory provisioned and actually utilized by EMR Spark applications as shown in the following screenshot.

Results are presented categorized by your Kubernetes namespace and job signature. When you choose a certain namespace and signature combination, you’re presented with a pane. The pane represents a comparison of the vertical autoscaling recommendations for jobs belonging to the chosen signature, compared to the actual resource utilization of that job and the amount of Spark executor memory provisioned to the job. If autoscaling is enabled, the expectation is that the Spark executor memory would track the recommendation. If you’re in monitoring mode however, the two won’t match but you can still view the recommendations from this dashboard or use them to better understand the actual utilization and resource profile of your job.

Illustration of provisioned memory, utilization and recommendations

To better illustrate vertical autoscaling behavior and usage for different workloads, we executed query 2 of the TPC-DS benchmark for 5 iterations — The first two iterations in monitoring mode and the last 3 in autoscaling mode and visualized the results in the Grafana dashboard shared in the previous section.

Monitoring mode

This particular job was provisioned to run with 32GB of executor memory (the blue line in the image) but the actual utilization hovered at around the 10 GB mark (amber line). Vertical autoscaling computes a recommendation of approximately 14 GB based on this run (the green line). This recommendation is based on the actual utilization with a safety margin added in.

The second iteration of the job was also run in the monitoring mode and the utilization and the recommendations remained unchanged.

Autoscaling mode

Iterations 3 through 5 were run in autoscaling mode. In this case, the provisioned memory drops from 32GB to match the recommended value of 14GB (the blue line).

The utilization and recommendations remained unchanged for subsequent iterations in the case of this example. Furthermore, we observed that all the iterations of the job completed in around 5 minutes, both with and without autoscaling. This example illustrates the successful scaling down of the job’s executor memory allocation by about 56% (a drop from 32GB to approximately 14 GB) which also translates to an equivalent reduction in the EMR memory uplift costs of the job, with no impact to the job’s performance.

Automatic OOM recovery

In the earlier example, we didn’t observe any OOM events as a result of autoscaling. In the rare occasion where autoscaling results in OOM events, jobs should usually be scaled back up automatically. On the other hand, if a job that has autoscaling enabled is under-provisioned and as a result experiences OOM events, vertical autoscaling can scale up resources to facilitate automatic recovery.

In the following example, a job was provisioned with 2.5 GB of executor memory and experienced OOM exceptions during its execution. Vertical autoscaling responded to the OOM events by automatically scaling up failed executors when they were re-started. As seen in the following image, when the amber line representing memory utilization started approaching the blue line representing the provisioned memory, vertical autoscaling kicked in to increase the amount of provisioned memory for the re-started executors, allowing the automatic recovery and successful completion of the job without any intervention. The recommended memory converged to approximately 5 GB before the job completed.

All subsequent runs of jobs with the same signature will now start-up with the recommended settings computed earlier, preventing OOM events right from the start.

Cleanup

Refer to documentation for information on cleaning up vertical autoscaling related resources from your cluster. To cleanup your EMR on EKS cluster after trying out the vertical autoscaling feature, refer to the clean-up section of the EMR on EKS workshop.

Conclusion

You can use vertical autoscaling to easily monitor resource utilization for one or more EMR on EKS jobs without any impact to your production workloads. You can use standard Kubernetes tooling including Prometheus, Grafana and kubectl to interact with and monitor vertical autoscaling on your cluster. You can also autoscale your EMR Spark jobs using recommendations that are derived based on the needs of your job, allowing you to realize cost savings and optimize cluster utilization as well as build resiliency to out-of-memory errors. Additionally, you can use it in conjunction with existing autoscaling mechanisms such as Dynamic Resource Allocation and Karpenter to effortlessly achieve optimal vertical resource assignment. Looking ahead, when Kubernetes fully supports in-place resizing of pods, vertical autoscaling will be able to take advantage of it to seamlessly scale your EMR jobs up or down, further facilitating optimal costs and cluster utilization.

To learn more about EMR on EKS vertical autoscaling and getting started with it, refer to documentation. You can also use the EMR on EKS Workshop to try out the EMR on EKS deployment option for Amazon EMR.

About the author

Rajkishan Gunasekaran is a Principal Engineer for Amazon EMR on EKS at Amazon Web Services.

Process price transparency data using AWS Glue

2023-05-04 Hari Thatavarthy

Post Syndicated from Hari Thatavarthy original https://aws.amazon.com/blogs/big-data/process-price-transparency-data-using-aws-glue/

The Transparency in Coverage rule is a federal regulation in the United States that was finalized by the Center for Medicare and Medicaid Services (CMS) in October 2020. The rule requires health insurers to provide clear and concise information to consumers about their health plan benefits, including costs and coverage details. Under the rule, health insurers must make available to their members a list of negotiated rates for in-network providers, as well as an estimate of the member’s out-of-pocket costs for specific health care services. This information must be made available to members through an online tool that is accessible and easy to use. The Transparency in Coverage rule also requires insurers to make available data files that contain detailed information on the prices they negotiate with health care providers. This information can be used by employers, researchers, and others to compare prices across different insurers and health care providers. Phase 1 implementation of this regulation, which went into effect on July 1, 2022, requires that payors publish machine-readable files publicly for each plan that they offer. CMS (Center for Medicare and Medicaid Services) has published a technical implementation guide with file formats, file structure, and standards on producing these machine-readable files.

This post walks you through the preprocessing and processing steps required to prepare data published by health insurers in light of this federal regulation using AWS Glue. We also show how to query and derive insights using Amazon Athena.

AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data.

Challenges processing these machine-readable files

The machine-readable files published by these payors vary in size. A single file can range from a few megabytes to hundreds of gigabytes. These files contain large JSON objects that are deeply nested. Unlike NDJSON and JSONL formats, where each line in the file is a JSON object, these files contain a single large JSON object that can span across multiple lines. The following figure represents the schema of an in_network rate file published by a major health insurer on their website for public access. This file, when uncompressed, is about 20 GB in size, contains a single JSON object, and is deeply nested. The following figure represents the schema of this JSON object when printed using the Spark printSchema() function. Each highlighted box in red is a nested array structure.

JSON Schema

Loading a 20 GB deeply nested JSON object requires a machine with a large memory footprint. Data when loaded into memory is 4–10 times its size on disk. A 20 GB JSON object may need a machine with up to 200 GB memory. To process workloads larger than 20 GB, these machines need to be scaled vertically, thereby significantly increasing hardware costs. Vertical scaling has its limits, and it’s not possible to scale beyond a certain point. Analyzing this data requires unnesting and flattening of deeply nested array structures. These transformations explode the data at an exponential rate, thereby adding to the need for more memory and disk space.

You can use an in-memory distributed processing framework such as Apache Spark to process and analyze such large volumes of data. However, to load this single large JSON object as a Spark DataFrame and perform an action on it, a worker node needs enough memory to load this object in full. When a worker node tries to load this large deeply nested JSON object and there isn’t enough memory to load it in full, the processing job will fail with out-of-memory issues. This calls for splitting the large JSON object into smaller chunks using some form of preprocessing logic. Once preprocessed, these smaller files can then be further processed in parallel by worker nodes without running into out-of-memory issues.

Solution overview

The solution involves a two-step approach. The first is a preprocessing step, which takes the large JSON object as input and splits it to multiple manageable chunks. This is required to address the challenges we mentioned earlier. The second is a processing step, which prepares and publishes data for analysis.

The preprocessing step uses an AWS Glue Python shell job to split the large JSON object into smaller JSON files. The processing step unnests and flattens the array items from these smaller JSON files in parallel. It then partitions and writes the output as Parquet on Amazon Simple Storage Service (Amazon S3). The partitioned data is cataloged and analyzed using Athena. The following diagram illustrates this workflow.

Solution Overview

Prerequisites

To implement the solution in your own AWS account, you need to create or configure the following AWS resources in advance:

An S3 bucket to persist the source and processed data. Download the input file and upload it to the path s3://yourbucket/ptd/2023-03-01_United-HealthCare-Services—Inc-_Third-Party-Administrator_PS1-50_C2_in-network-rates.json.gz.
An AWS Identity and Access Management (IAM) role for your AWS Glue extract, transform, and load (ETL) job. For instructions, refer to Setting up IAM permissions for AWS Glue. Adjust the permissions to ensure AWS Glue has read/write access to Amazon S3 locations.
An IAM role for Athena with AWS Glue Data Catalog permissions to create and query tables.

Create an AWS Glue preprocessing job

The preprocessing step uses ijson, an open-source iterative JSON parser to extract items in the outermost array of top-level attributes. By streaming and iteratively parsing the large JSON file, the preprocessing step loads only a portion of the file into memory, thereby avoiding out-of-memory issues. It also uses s3pathlib, an open-source Python interface to Amazon S3. This makes it easy to work with S3 file systems.

To create and run the AWS Glue job for preprocessing, complete the following steps:

On the AWS Glue console, choose Jobs under Glue Studio in the navigation pane.
Create a new job.
Select Python shell script editor.
Select Create a new script with boilerplate code.
Enter the following code into the editor (adjust the S3 bucket names and paths to point to the input and output locations in Amazon S3):

import ijson
import json
import decimal
from s3pathlib import S3Path
from s3pathlib import context
import boto3
from io import StringIO

class JSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, decimal.Decimal):
            return float(obj)
        return json.JSONEncoder.default(self, obj)
        
def upload_to_s3(data, upload_path):
    data = bytes(StringIO(json.dumps(data,cls=JSONEncoder)).getvalue(),encoding='utf-8')
    s3_client.put_object(Body=data, Bucket=bucket, Key=upload_path)
        
s3_client = boto3.client('s3')

#Replace with your bucket and path to JSON object on your bucket
bucket = 'yourbucket'
largefile_key = 'ptd/2023-03-01_United-HealthCare-Services--Inc-_Third-Party-Administrator_PS1-50_C2_in-network-rates.json.gz'
p = S3Path(bucket, largefile_key)


#Replace the paths to suit your needs
upload_path_base = 'ptd/preprocessed/base/base.json'
upload_path_in_network = 'ptd/preprocessed/in_network/'
upload_path_provider_references = 'ptd/preprocessed/provider_references/'

#Extract top the values of the following top level attributes and persist them on your S3 bucket
# -- reporting_entity_name
# -- reporting_entity_type
# -- last_updated_on
# -- version

base ={
    'reporting_entity_name' : '',
    'reporting_entity_type' : '',
    'last_updated_on' :'',
    'version' : ''
}

with p.open("r") as f:
    obj = ijson.items(f, 'reporting_entity_name')
    for evt in obj:
        base['reporting_entity_name'] = evt
        break
    
with p.open("r") as f:
    obj = ijson.items(f, 'reporting_entity_type')
    for evt in obj:
        base['reporting_entity_type'] = evt
        break
    
with p.open("r") as f:
    obj = ijson.items(f, 'last_updated_on')
    for evt in obj:
        base['last_updated_on'] = evt
        break
    
with p.open("r") as f:
    obj = ijson.items(f,'version')
    for evt in obj:
        base['version'] = evt
        break
        
upload_to_s3(base,upload_path_base)

#Seek the position of JSON key provider_references 
#Iterate through items in provider_references array, and for every 1000 items create a JSON file on S3 bucket
with p.open("r") as f:
    provider_references = ijson.items(f, 'provider_references.item')
    fk = 0
    lst = []
    for rowcnt,row in enumerate(provider_references):
        if rowcnt % 1000 == 0:
            if fk > 0:
                dest = upload_path_provider_references + path
                upload_to_s3(lst,dest)
            lst = []
            path = 'provider_references_{0}.json'.format(fk)
            fk = fk + 1

        lst.append(row)

    path = 'provider_references_{0}.json'.format(fk)
    dest = upload_path_provider_references + path
    upload_to_s3(lst,dest)
    
#Seek the position of JSON key in_network
#Iterate through items in in_network array, and for every 25 items create a JSON file on S3 bucket
with p.open("r") as f:
    in_network = ijson.items(f, 'in_network.item')
    fk = 0
    lst = []
    for rowcnt,row in enumerate(in_network):
        if rowcnt % 25 == 0:
            if fk > 0:
                dest = upload_path_in_network + path
                upload_to_s3(lst,dest)
            lst = []
            path = 'in_network_{0}.json'.format(fk)
            fk = fk + 1

        lst.append(row)


    path = 'in_network_{0}.json'.format(fk)
    dest = upload_path_in_network + path
    upload_to_s3(lst,dest)

Update the properties of your job on the Job details tab:
1. For Type, choose Python Shell.
2. For Python version, choose Python 3.9.
3. For Data processing units, choose 1 DPU.

For Python shell jobs, you can allocate either 0.0625 or 1 DPU. The default is 0.0625 DPU. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.

python shell job config

The Python libraries ijson and s3pathlib are available in pip and can be installed using the AWS Glue job parameter --additional-python-modules. You can also choose to package these libraries, upload them to Amazon S3, and refer to them from your AWS Glue job. For instructions on packaging your library, refer to Providing your own Python library.

To install the Python libraries, set the following job parameters:
- Key – --additional-python-modules
- Value – ijson,s3pathlib
Run the job.

The preprocessing step creates three folders in the S3 bucket: base, in_network and provider_references.

s3_folder_1

Files in in_network and provider_references folders contains array of JSON objects. Each of these JSON objects represents an element in the outermost array of the original large JSON object.

s3_folder_2

Create an AWS Glue processing job

The processing job uses the output of the preprocessing step to create a denormalized view of data by extracting and flattening elements and attributes from nested arrays. The extent of unnesting depends on the attributes we need for analysis. For example, attributes such as negotiated_rate, npi, and billing_code are essential for analysis and extracting values associated with these attributes requires multiple levels of unnesting. The denormalized data is then partitioned by the billing_code column, persisted as Parquet on Amazon S3, and registered as a table on the AWS Glue Data Catalog for querying.

The following code sample guides you through the implementation using PySpark. The columns used to partition the data depends on query patterns used to analyze the data. Arriving at a partitioning strategy that is in line with the query patterns will improve overall query performance during analysis. This post assumes that the queries used for analyzing data will always use the column billing_code to filter and fetch data of interest. Data in each partition is bucketed by npi to improve query performance.

To create your AWS Glue job, complete the following steps:

On the AWS Glue console, choose Jobs under Glue Studio in the navigation pane.
Create a new job.
Select Spark script editor.
Select Create a new script with boilerplate code.
Enter the following code into the editor (adjust the S3 bucket names and paths to point to the input and output locations in Amazon S3):

import sys
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
from pyspark.sql.functions import explode

#create a dataframe of base objects - reporting_entity_name, reporting_entity_type, version, last_updated_on
#using the output of preprocessing step

base_df = spark.read.json('s3://yourbucket/ptd/preprocessed/base/')

#create a dataframe over provider_references objects using the output of preprocessing step
prvd_df = spark.read.json('s3://yourbucket/ptd/preprocessed/provider_references/')

#cross join dataframe of base objects with dataframe of provider_references 
prvd_df = prvd_df.crossJoin(base_df)

#create a dataframe over in_network objects using the output of preprocessing step
in_ntwrk_df = spark.read.json('s3://yourbucket/ptd/preprocessed/in_network/')

#unnest and flatten negotiated_rates and provider_references from in_network objects
in_ntwrk_df2 = in_ntwrk_df.select(
 in_ntwrk_df.billing_code, in_ntwrk_df.billing_code_type, in_ntwrk_df.billing_code_type_version,
 in_ntwrk_df.covered_services, in_ntwrk_df.description, in_ntwrk_df.name,
 explode(in_ntwrk_df.negotiated_rates).alias('exploded_negotiated_rates'),
 in_ntwrk_df.negotiation_arrangement)


in_ntwrk_df3 = in_ntwrk_df2.select(
 in_ntwrk_df2.billing_code, in_ntwrk_df2.billing_code_type, in_ntwrk_df2.billing_code_type_version,
 in_ntwrk_df2.covered_services, in_ntwrk_df2.description, in_ntwrk_df2.name,
 in_ntwrk_df2.exploded_negotiated_rates.negotiated_prices.alias(
 'exploded_negotiated_rates_negotiated_prices'),
 explode(in_ntwrk_df2.exploded_negotiated_rates.provider_references).alias(
 'exploded_negotiated_rates_provider_references'),
 in_ntwrk_df2.negotiation_arrangement)

#join the exploded in_network dataframe with provider_references dataframe
jdf = prvd_df.join(
 in_ntwrk_df3,
 prvd_df.provider_group_id == in_ntwrk_df3.exploded_negotiated_rates_provider_references,"fullouter")

#un-nest and flatten attributes from rest of the nested arrays.
jdf2 = jdf.select(
 jdf.reporting_entity_name,jdf.reporting_entity_type,jdf.last_updated_on,jdf.version,
 jdf.provider_group_id, jdf.provider_groups, jdf.billing_code,
 jdf.billing_code_type, jdf.billing_code_type_version, jdf.covered_services,
 jdf.description, jdf.name,
 explode(jdf.exploded_negotiated_rates_negotiated_prices).alias(
 'exploded_negotiated_rates_negotiated_prices'),
 jdf.exploded_negotiated_rates_provider_references,
 jdf.negotiation_arrangement)

jdf3 = jdf2.select(
 jdf2.reporting_entity_name,jdf2.reporting_entity_type,jdf2.last_updated_on,jdf2.version,
 jdf2.provider_group_id,
 explode(jdf2.provider_groups).alias('exploded_provider_groups'),
 jdf2.billing_code, jdf2.billing_code_type, jdf2.billing_code_type_version,
 jdf2.covered_services, jdf2.description, jdf2.name,
 jdf2.exploded_negotiated_rates_negotiated_prices.additional_information.
 alias('additional_information'),
 jdf2.exploded_negotiated_rates_negotiated_prices.billing_class.alias(
 'billing_class'),
 jdf2.exploded_negotiated_rates_negotiated_prices.billing_code_modifier.
 alias('billing_code_modifier'),
 jdf2.exploded_negotiated_rates_negotiated_prices.expiration_date.alias(
 'expiration_date'),
 jdf2.exploded_negotiated_rates_negotiated_prices.negotiated_rate.alias(
 'negotiated_rate'),
 jdf2.exploded_negotiated_rates_negotiated_prices.negotiated_type.alias(
 'negotiated_type'),
 jdf2.exploded_negotiated_rates_negotiated_prices.service_code.alias(
 'service_code'), jdf2.exploded_negotiated_rates_provider_references,
 jdf2.negotiation_arrangement)

jdf4 = jdf3.select(jdf3.reporting_entity_name,jdf3.reporting_entity_type,jdf3.last_updated_on,jdf3.version,
 jdf3.provider_group_id,
 explode(jdf3.exploded_provider_groups.npi).alias('npi'),
 jdf3.exploded_provider_groups.tin.type.alias('tin_type'),
 jdf3.exploded_provider_groups.tin.value.alias('tin'),
 jdf3.billing_code, jdf3.billing_code_type,
 jdf3.billing_code_type_version, jdf3.covered_services,
 jdf3.description, jdf3.name, jdf3.additional_information,
 jdf3.billing_class, jdf3.billing_code_modifier,
 jdf3.expiration_date, jdf3.negotiated_rate,
 jdf3.negotiated_type, jdf3.service_code,
 jdf3.negotiation_arrangement)

#repartition by billing_code. 
#Repartition changes the distribution of data on spark cluster. 
#By repartition data we will avoid writing too many small files.
jdf5=jdf4.repartition("billing_code")
 
datasink_path = "s3://yourbucket/ptd/processed/billing_code_npi/parquet/"

#persist dataframe as parquet on S3 and catalog it
#Partition the data by billing_code. This enables analytical queries to skip data and improve performance of queries
#Data is also bucketed and sorted npi to improve query performance during analysis

jdf5.write.format('parquet').mode("overwrite").partitionBy('billing_code').bucketBy(2, 'npi').sortBy('npi').saveAsTable('ptdtable', path = datasink_path)

Update the properties of your job on the Job details tab:
1. For Type, choose Spark.
2. For Glue version, choose Glue 4.0.
3. For Language, choose Python 3.
4. For Worker type, choose G 2X.
5. For Requested number of workers, enter 20.

Arriving at the number of workers and worker type to use for your processing job depends on factors such as the amount of data being processed, the speed at which it needs to be processed, and the partitioning strategy used. Repartitioning of data can result in out-of-memory issues, especially when data is heavily skewed on the column used to repartition. It’s possible to reach Amazon S3 service limits if too many workers are assigned to the job. This is because tasks running on these worker nodes may try to read/write from the same S3 prefix, causing Amazon S3 to throttle the incoming requests. For more details, refer to Best practices design patterns: optimizing Amazon S3 performance.

processing job config

Exploding array elements creates new rows and columns, thereby exponentially increasing the amount of data that needs to be processed. Apache Spark splits this data into multiple Spark partitions on different worker nodes so that it can process large amounts of data in parallel. In Apache Spark, shuffling happens when data needs to be redistributed across the cluster. Shuffle operations are commonly triggered by wide transformations such as join, reduceByKey, groupByKey, and repartition. In case of exceptions due to local storage limitations, it helps to supplement or replace local disk storage capacity with Amazon S3 for large shuffle operations. This is possible with the AWS Glue Spark shuffle plugin with Amazon S3. With the cloud shuffle storage plugin for Apache Spark, you can avoid disk space-related failures.

To use the Spark shuffle plugin, set the following job parameters:
- Key – --write-shuffle-files-to-s3
- Value – true

Query the data

You can query the cataloged data using Athena. For instructions on setting up Athena, refer to Setting up.

On the Athena console, choose Query editor in the navigation pane to run your query, and specify your data source and database.

sql query

To find the minimum, maximum, and average negotiated rates for procedure codes, run the following query:

SELECT
billing_code,
round(min(negotiated_rate),2) as min_price,
round(avg(negotiated_rate),2) as avg_price,
round(max(negotiated_rate),2) as max_price,
description
FROM "default"."ptdtable"
group by billing_code, description
limit 10;

The following screenshot shows the query results.

sql query results

Clean up

To avoid incurring future charges, delete the AWS resources you created:

Delete the S3 objects and bucket.
Delete the IAM policies and roles.
Delete the AWS Glue jobs for preprocessing and processing.

Conclusion

This post guided you through the necessary preprocessing and processing steps to query and analyze price transparency-related machine-readable files. Although it’s possible to use other AWS services to process such data, this post focused on preparing and publishing data using AWS Glue.

To learn more about the Transparency in Coverage rule, refer to Transparency in Coverage. For best practices for scaling Apache Spark jobs and partitioning data with AWS Glue, refer to Best practices to scale Apache Spark jobs and partition data with AWS Glue. To learn how to monitor AWS Glue jobs, refer to Monitoring AWS Glue Spark jobs.

We look forward to hearing any feedback or questions.

About the Authors

Hari Thatavarthy is a Senior Solutions Architect on the AWS Data Lab team. He helps customers design and build solutions in the data and analytics space. He believes in data democratization and loves to solve complex data processing-related problems. In his spare time, he loves to play table tennis.

Krishna Maddileti is a Senior Solutions Architect on the AWS Data Lab team. He partners with customers on their AWS journey and helps them with data engineering, data lakes, and analytics. In his spare time, he enjoys spending time with his family and playing video games with his 7-year-old.

Yadukishore Tatavarthi is a Senior Partner Solutions Architect at AWS. He works closely with global system integrator partners to enable and support customers moving their workloads to AWS.

Manish Kola is a Solutions Architect on the AWS Data Lab team. He partners with customers on their AWS journey.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Google “We Have No Moat, And Neither Does OpenAI” (SemiAnalysis)

2023-05-04

Post Syndicated from original https://lwn.net/Articles/930939/

The SemiAnalysis site has what is said to be a
leaked Google document on the state of open-source AI development.
Open source, it concludes, is winning.

At the beginning of March the open source community got their hands
on their first really capable foundation model, as Meta’s LLaMA was
leaked to the public. It had no instruction or conversation tuning,
and no RLHF. Nonetheless, the community immediately understood the
significance of what they had been given.

A tremendous outpouring of innovation followed, with just days
between major developments (see The Timeline for the full
breakdown). Here we are, barely a month later, and there are
variants with instruction tuning, quantization, quality
improvements, human evals, multimodality, RLHF, etc. etc. many of
which build on each other.

(Thanks to Dave Täht).