A 3D map of future construction in Sofia

Post Syndicated from Боян Юруков original https://yurukov.net/blog/2024/sgradite-sofia/

In December I built a pilot project in which I marked on a 3D map all the buildings planned for construction in the Izgrev district of Sofia. My goal was to extend my map of construction documents, which, useful as it is, does not give a good sense of the scale of the changes.

A few things have happened since then. First, I got in touch with the new administration of Sofia Municipality with a view to working together. The first result is that I built for them a variation of my documents map that shows only the planned changes to detailed development plans (ПУП, PUP). You will find a link to the map on the page with the register of the corresponding documents. This matters because these decisions are the first indication that something will be built somewhere, and therefore the earliest moment at which people who care about the urban environment can have an influence. The district mayors and architects who publish the documents play a key role here.

The municipality's map, marking the places where there are PUP changes

The map is evolving constantly and I will write about it in detail soon. As part of it I added the experimental 3D layer specifically for the expected building heights on the plots affected by the PUPs. I then decided to copy the same functionality to my main map.

The data was a problem, though: it cannot be extracted automatically, so I had to enter it by hand, as I did in December. This time, however, I built myself a tool that makes this much faster and more accurate. I can even do it on my phone on the side. Over several days I collected a few hundred buildings a day from across the city. So far I would say I have covered everything in the Lozenets, Izgrev, Serdika and Oborishte districts. I have also entered buildings elsewhere in the city that are large or caught my attention. I will continue with Ovcha Kupel and Lyulin before moving on to the others.

The caveats of the data

Before I show you the map itself, we have to go through what is for me the obligatory description of what is wrong with my map and the data.

First, I draw the building footprints by hand over the combined building layers from iSofMap by GIS Sofia, and then enter the height according to the stated meters or floors. Those layers, in turn, are entered by operators at the company on the basis of documents and sketches. Not all documents have sketches, what has been entered is not always current, and the layers change constantly. In this sense you will find buildings that have been approved or even already built with a different shape or height. In the first phase I will enter what is on the layer, and later I will extend the tool to load the latest sketches and position them correctly on the layer so that I can enter the correct shape of the buildings.
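As a rough illustration of the data entry described above (a footprint drawn by hand plus a height given in meters or floors), here is a minimal Python sketch that turns such an entry into a GeoJSON feature a 3D layer could extrude. The function name, the `height` property and the 3 m storey height are assumptions made for this example; the post does not describe the tool's actual data format.

```python
import json

# Assumed average storey height in metres; the post does not state one.
ASSUMED_METRES_PER_FLOOR = 3.0


def building_feature(footprint, height_m=None, floors=None):
    """Build a GeoJSON Feature for a hand-drawn footprint (hypothetical format).

    footprint: list of (lon, lat) vertices, first vertex not repeated.
    height_m / floors: whichever of the two the source document states;
    an explicit height in metres takes precedence over a floor count.
    """
    if height_m is None:
        if floors is None:
            raise ValueError("need either height_m or floors")
        height_m = floors * ASSUMED_METRES_PER_FLOOR
    # GeoJSON polygon rings are closed: the last point repeats the first.
    ring = [list(p) for p in footprint] + [list(footprint[0])]
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": {"height": float(height_m)},
    }


if __name__ == "__main__":
    # Made-up coordinates, only to show the call shape.
    feature = building_feature(
        [(23.340, 42.670), (23.341, 42.670), (23.341, 42.671), (23.340, 42.671)],
        floors=5,
    )
    print(json.dumps(feature))
```

Storing the height as a plain property keeps the footprint editable on its own, which matches the workflow of drawing first and entering the height afterwards.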

Because I draw by hand, the building outlines are not always accurate, and lower adjoining structures or stepped-down parts of a building are missing. Because I use Google's photorealistic 3D maps, it is easy to see that the imagery and buildings there are from a year ago. That is, some of the already completed buildings you will see on the street are not on this map. In such cases I mark them so that they are visible. When Google updates the imagery, I will remove them.

The fact that a building with a given height or area is marked at some location does not mean it will necessarily be built that way. These sketches show what is envisaged, requested or permitted at that location. The parameters may change at the design visa stage, following feedback from the public and the district mayors, or in the building permit. The footprint may shrink while the number of floors goes up. That something is envisaged, however, does not mean it will ever be built. This may be because of ownership problems, court cases, or simply because the owner wants to live in a nice two-storey house and has no intention of putting up a five-storey block. Last but not least, the stated dimensions are more of an outer envelope. Depending on the floor area ratio (Кинт) and other parameters, parts of the building may be smaller, others may be roofs and parking, and so on. Some of the marked buildings may also be industrial.

The map I am about to share is not affiliated with the municipality or with НАГ (NAG, the municipality's Architecture and Urban Planning directorate) and makes no claim to accuracy. Its purpose is to give an idea of where what will be built and how tall, since even with my first map this is extremely hard to see. I will add all the buildings in the city before I start on the specific documents. New buildings appear on the map immediately, so expect something new every time you open it. If you see something that is not right, leave a comment.

This data could be extracted automatically, and these are exactly the conversations we have been having with the municipality since the start of the year. There is enormous desire and support for such visualisations and transparency, which is evident from the PUP map. As with many other things there, however, the fight is with the internal system itself. Data quality is not a strength of the administration as a whole. When it does happen, though, everything you see here will flow automatically. I simply could not wait and, a little out of spite, did it myself.

View towards the upper part of Cherni Vrah Blvd.

Future construction in Sofia

While making the map I shared pictures on social media and wrote that our imagination falls short of how much is planned to be built. In recent months it has been said that buildings with around a million sq. m of usable area are started each year. Going through the documents I found several that are 100 thousand sq. m each on their own. Around the city I see far more.

I have been digging through permits, PUPs and visas for years, yet I had no idea about some of the buildings. For example, I did not know about the 200-metre towers opposite Paradise mall and further up Tsar Boris III Blvd. from Pirogov. I did not know about the 130-metre tower on Todor Aleksandrov Blvd., or in general what walls of buildings are being prepared on both sides of that boulevard. I had seen pictures of the plans around the Central Railway Station, but I had not realised the scale, or that there would be another 100-metre tower behind it. In the central part of the city there are dozens of houses, some listed as cultural monuments, that are slated to be demolished and replaced with four-, five- and even six-storey buildings. In the lower part of Lozenets the situation is better, but the upper part and the area around the southern arc in general are already overbuilt, and twice as many buildings are planned. The same goes for Bulgaria Blvd.

When I have published such maps and commented on them, I have often faced the criticism that I want to stop private interest. Land ownership is only one aspect of this problem, and is often in itself quite hard to establish, soaked in corruption and a conspicuous absence of prosecutors and courts. If a plot is private, the owner has the right to build there within the law. The problem arises when the regulators deliberately sleep and fail to notice critical infrastructure such as ravines, pipes and so on. The urban environment, height, schools and other infrastructure are not taken into account, and when they are, the administrative courts readily strike out case law and even statutes to help one connected party or another. Reports of property owned by relatives of judges in the complexes in question are abundant.

View towards Todor Aleksandrov Blvd.

At the same time, the main problem here lies with local and central government. It starts with the decisions of the Sofia Municipal Council (СОС, SOS) on the indiscriminate sale of municipal property, the data on which is the most fiercely guarded secret in Sofia Municipality. Then comes the postponement of acquiring key properties for gardens, schools and parks. SOS decisions can change regulation and plans, but this has been deliberately blocked for years. This applies to building density as well as to facades, strategy, accessibility and the vision for the city. The chief architect has a key role in this process, but in the absence of such a vision that role has so far been associated only with scandals.

An illustration of the shady deals in SOS that we still see is that quite a few of the plots slated for construction in the NAG layer are municipal property. One example is the space between the Japanese hotel and the notorious Zlaten Vek building. There, on four plots of private municipal property, a 22-storey building is already marked, metres from the windows of Zlaten Vek. Nobody has so far managed to explain to me how this happens, and I have been asking for years about several such cases in Dianabad. The assumption is that the plots were promised to someone years ago and a favourable political situation is being awaited, which was partly tripped up by the last elections.

It is up to the National Assembly, in turn, how the relevant legislation will shape the boundaries of these strategies and categorisation. The only significant thing we know to have happened is a set of amendments that allow towers anywhere, and that carry the name of a notorious construction investor and of a corruption scandal involving the politicians who secured those amendments. Incidentally, they and their practices are the very reason I started looking into the topic at all, so they may be glad to have unwittingly contributed to this and my other projects, as well as to everything else on the topic that I have not yet released.

You will find the 3D map of the buildings as a feature of the map of NAG documents, by pressing the button with the cube. You can also access it directly at http://govalert.eu/cityplan/buildings. On first load a guide to the controls will open. Broadly speaking, use two fingers on a touch screen, or mouse + Ctrl on a computer. Be warned that it uses a lot of traffic and CPU, so you may have problems on older phones.

I would welcome feedback. Besides adding the buildings in the rest of the city, I want to add a feature for directly sharing a link to the spot you are currently looking at. That way you will be able to show others exactly what you are viewing, down to a specific building. I also want clicking on a building to take you to the same place on the documents map. I need to improve the stability of the code as well. I hope that by then there will also be a solution for the automation, so that the quality of the data improves.

The post "A 3D map of future construction in Sofia" first appeared on Блогът на Юруков (Yurukov's blog).

[$] Divvi Up: privacy-respecting telemetry aggregation

Post Syndicated from daroc original https://lwn.net/Articles/983843/

There is ongoing discussion about the ethics and effectiveness of telemetry following some recent LWN articles that touched on Thunderbird’s use of opt-out telemetry and planned metrics in Fedora. The Internet Security Research Group (ISRG), the nonprofit behind Let’s Encrypt, has a potential solution to the problem of how to collect and aggregate telemetry without violating users’ privacy. The scheme is based on a draft protocol being standardized with the Internet Engineering Task Force (IETF), and has an open-source implementation available.
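The excerpt above does not describe how the scheme works. Purely as an illustration of one general idea behind privacy-respecting aggregation, the Python sketch below uses additive secret sharing: each client splits its measurement into random-looking shares, each server sums the shares it receives, and only the combined total is revealed. This is a toy model with hypothetical names and an arbitrary modulus, not the draft protocol or the Divvi Up implementation, and it omits the input validation a real system needs.

```python
import secrets

MODULUS = 2**61 - 1  # arbitrary large modulus chosen for this illustration


def split(value, n_shares=2):
    """Split value into n_shares shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_shares - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def aggregate(client_values, n_servers=2):
    """Simulate servers that each see only one share per client."""
    per_server = [0] * n_servers
    for value in client_values:
        for i, share in enumerate(split(value, n_servers)):
            per_server[i] = (per_server[i] + share) % MODULUS
    # Combining the servers' partial sums reveals only the overall total.
    return sum(per_server) % MODULUS


if __name__ == "__main__":
    # Five clients each report 0 or 1 (for example, "feature used or not").
    print(aggregate([1, 0, 1, 1, 0]))
```

No single server learns an individual report, because each share on its own is uniformly random; privacy in this toy model depends on the servers not pooling their shares.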

OSPAR 2024 report now available with 163 services in scope

Post Syndicated from Joseph Goh original https://aws.amazon.com/blogs/security/ospar-2024-report-available-with-163-services-in-scope/

Amazon Web Services (AWS) is pleased to announce the completion of our annual Outsourced Service Provider’s Audit Report (OSPAR) audit cycle on July 1, 2024. The 2024 OSPAR certification cycle includes the addition of 10 new services in scope, bringing the total number of services in scope to 163 in the AWS Asia Pacific (Singapore) Region.

Newly added services in scope include the following:

The Association of Banks in Singapore (ABS) has established the Guidelines on Control Objectives and Procedures for Outsourced Service Providers to provide baseline controls criteria that Outsourced Service Providers (“OSPs”) operating in Singapore should have in place. Successfully completing the OSPAR assessment demonstrates that AWS has implemented a robust system of controls that adhere to these guidelines. This underscores our commitment to fulfill the security expectations for cloud service providers set by the financial services industry in Singapore.

Customers can use OSPAR to streamline their due diligence processes, thereby reducing the effort and costs associated with compliance. OSPAR remains a core assurance program for our financial services customers, as it is closely aligned with local regulatory requirements from the Monetary Authority of Singapore (MAS).

You can download the latest OSPAR report from AWS Artifact, a self-service portal for on-demand access to AWS compliance reports. Sign in to AWS Artifact in the AWS Management Console, or learn more at Getting Started with AWS Artifact. The list of services in scope for OSPAR is available in the report, and is also available on the AWS Services in Scope by Compliance Program webpage.

As always, we’re committed to bringing new services into the scope of our OSPAR program based on your architectural and regulatory needs. If you have questions about the OSPAR report, contact your AWS account team.

Joseph Goh
Joseph is the APJ ASEAN Lead at AWS, based in Singapore. He leads security audits, certifications, and compliance programs across the Asia Pacific region. Joseph is passionate about delivering programs that build trust with customers and providing them assurance on cloud security.

“I, Delyan Peevski, will give the truth…”

Post Syndicated from Емилия Милчева original https://www.toest.bg/az-delyan-peevski-shte-dam-istinata/

There are files on everyone. On EVERYONE.

“I, Delyan Peevski, will give the truth…”

The public has heard these phrases many times. “Files”, of course, is shorthand for kompromat. After Delyan Peevski, co-chair of DPS (the Movement for Rights and Freedoms) and sanctioned for corruption by the US and the UK, publicly promised to unmask the political elite in the upcoming election campaign, many are understandably afraid. But will it happen?

One thing I promise: I, Delyan Peevski, will give the people the truth about everyone. My campaign will be the debunking of the entire political model. People must see the truth and get a new beginning.

Using kompromat and negative campaigning to influence elections is nothing new. The question is not how far a smear campaign can go when it is dominated by an oligarch and politician with influence in the judiciary and the security services, but whether a (yellow-brown) catharsis and a new beginning can follow it.

In a society like Bulgaria's, where nobody believes in justice or in politicians but rather in conspiracies and puppet masters, the dirty water will not drain easily, and the effects will be doubtful. What is certain is that polarisation in society will intensify, trust in political processes will be further undermined, and civic engagement will drop even lower.

In recent years what surfaced were not just scandals but investigations whose revelations about the buying and selling of justice in Bulgaria are repulsive: the “Eight Dwarfs”, Pepi the Euro and his wife Lyubena, who filmed orgies with magistrates and acted as go-between in arranging the deals behind the lawlessness, the “Notary” affair, the disclosures of judge Vladislava Tsarigradska, and others. Before and during these exposures, recordings of high-level political gatherings were released that unmasked some lofty party messaging.

And even before them came the vulgarity-laden recordings of phone conversations in Boyko Borisov's voice and assorted photos featuring him, yet they did not bring the Borisov decade to an end. Although the kompromat with the wads of euros and the gold bars will long be remembered, the GERB leader and his party continue to win first place in almost every election. Bulgarian citizens have seen more than enough of the arsenal of the yellow-brown wars and have no illusions about the morals of those they elect. But the state prosecution never found convincing evidence of crimes.

Insensitivity/numbness

Constant exposure to kompromat and negative information leads to the desensitisation of society. In other words, people get used to kompromat, and it stops provoking high levels of shock or outrage. In psychology this phenomenon is well studied and is part of the broader phenomenon of “emotional exhaustion” or “emotional adaptation”.

The tabloid sites are full of kompromat on Bulgarian politicians and businessmen, and since the war within DPS began, the channels close to Peevski have been flooded with smear pieces about MPs and activists from the camp in the Movement that has stayed loyal to honorary chairman Ahmed Dogan.

And the man sanctioned under “Magnitsky”, who flaps like a banner unfurled against corruption, has already released the first list of “all the cashiers in the energy sector of Mister Cash” (as he calls President Rumen Radev). These people were appointed to the large state energy companies during the president's caretaker governments and during regular ones, and have remained in their posts. That the state energy sector is parcelled into zones of influence is no secret. But having pointed at some appointments, will Peevski go on to a second topic, corruption in the prosecution service and among senior magistrates, where he would have to name “his own”?

Colleagues, the list of all the cashiers in the energy sector of Mister Cash, Rumen Radev, and of Nikolay Koprinkov, the cashier from the village of Trud, the tavern keeper, who right now, today, are robbing the energy sector. This is the first topic; I hope you will look into it.

“I, Delyan Peevski, will give the truth…”
Photo: Ivaylo Mirchev / Facebook

What would surprise Bulgarians?

Politicians are corrupt? Of course, why else would they be in politics!

They cheat on their spouses? Well, everybody does.

They hide their sexuality? Well… that's how it is here.

Cynicism and distrust

When society is constantly bombarded with kompromat, it raises cynicism and distrust toward both political figures and the media. People come to believe that all politicians are corrupt and that the media are manipulated, which undermines democratic processes and institutions.

Parties such as “Velichie” take full advantage of this. Its so-called ideologue, Ivelin Mihaylov, creator of the “Historical Park” attraction, declared himself a victim and even monetised his victimhood. In an unprecedented move he is asking for donations to raise 5 million leva, needed for the attraction in the village of Neofit Rilski, Vetrino municipality, to continue to exist. He will also take part in the next parliamentary elections in the autumn because, as he tells it, the state is perishing, the politicians are a mafia, and the reports by journalists and vloggers that “Historical Park” is a pyramid scheme, along with the investigations into money laundering and unpaid taxes, are meant to neutralise him because he opposes this corrupt system.

Polarisation

The constant use of kompromat increases polarisation in society. People with strong political convictions become even more convinced they are right and more aggressive in defending their positions. They radicalise and reject any information that contradicts their views as part of the opponent's “dirty game”.

The sharp division along liberal versus conservative lines is ever fiercer on social networks, and political instability reinforces it. But other factors also contribute: the war in Ukraine and the upcoming US presidential election, as well as Russia's massive hybrid campaign.

Apathy and withdrawal

Some people may react to the constant negative backdrop with apathy and withdrawal from political life. They feel helpless and disappointed, as they are increasingly convinced that their vote does not matter or that the political system is too corrupt to be changed (“they're all crooks”, “elections change nothing”, and so on).

In Bulgaria, citizens' withdrawal from politics began with the last few elections and the gradual decline in turnout, which hit a low of 34.41%. If politicians fail to mobilise people with new faces, new policies and honesty about their political alliances, the breakdown will strengthen ideas about changing the system of government.

Alternative sources of information

Prolonged “processing” with kompromat can change how people perceive the media. They start looking for alternative sources of information that they regard as more reliable, or give up following the news altogether. Often these new sources turn out to be dubious websites and influencers carrying out various commissions.

Gradually, digital platforms are becoming increasingly preferred over traditional media. In its report on trust in the media (2014–2019), the European Council for Radio and Television found “an unprecedentedly high net trust index (NTI) for social networks in Bulgaria (+20 against an overall EU NTI of -45), a trend entirely specific to the country that is not observed elsewhere at either regional or European level”. This complicates the work of politicians, who have to be ever more inventive in order to “access” their voters, and of the media, who have to do the same to promote their content.

Peevski is not the first to promise scandalous revelations that would shake the political system so that its rotten apples fall out. A former chief prosecutor, who failed to handle the KTB case as a rank-and-file prosecutor but nevertheless rose to Prosecutor General, loudly explained how he was not afraid and how he would make some fearsome revelations (after he was removed, not before).

But Ivan Geshev went no further than the half-claim that Borisov's money had been flown out in sacks by the government's “Air Detachment 28”. In his 17-year career as a prosecutor, the last three and a half years of it as Prosecutor General, he did not muster the courage even for such a “little question”. He has also already ended his media appearances after the humiliating result of 0.14%, or 3,003 votes, for Grazhdanski Blok (Civic Bloc), with which he ran in the elections.

Some time ago, in an interview, KTB's majority owner Tsvetan Vasilev described him as “a man oriented toward power and very greedy, but also a very weak man. I assume he is held with tremendous kompromat by Peevski and Borisov”.

If Peevski falls into political isolation, there is a chance the public will learn some truths about those who allocate resources and legality in the state; he knows the faces behind the masks. But if the new beginning in Bulgaria starts with Peevski's truth, then there is no new beginning.

Leaked GitHub Python Token

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/08/leaked-github-python-token.html

Here’s a disaster that didn’t happen:

Cybersecurity researchers from JFrog recently discovered a GitHub Personal Access Token in a public Docker container hosted on Docker Hub, which granted elevated access to the GitHub repositories of the Python language, Python Package Index (PyPI), and the Python Software Foundation (PSF).

JFrog discussed what could have happened:

The implications of someone finding this leaked token could be extremely severe. The holder of such a token would have had administrator access to all of Python’s, PyPI’s and Python Software Foundation’s repositories, supposedly making it possible to carry out an extremely large scale supply chain attack.

Various forms of supply chain attacks were possible in this scenario. One such possible attack would be hiding malicious code in CPython, which is a repository of some of the basic libraries which stand at the core of the Python programming language and are compiled from C code. Due to the popularity of Python, inserting malicious code that would eventually end up in Python’s distributables could mean spreading your backdoor to tens of millions of machines worldwide!
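A leak of this kind can sometimes be caught by scanning image contents for token-shaped strings before publishing. The Python sketch below is a minimal illustration, not the method JFrog used. It relies on the documented format of classic GitHub personal access tokens (`ghp_` followed by 36 alphanumeric characters) and searches raw bytes so that binary files are covered too. The function names are hypothetical, and other token types use different prefixes that this pattern does not match.

```python
import re
from pathlib import Path

# Classic GitHub personal access tokens: "ghp_" + 36 alphanumeric characters.
TOKEN_RE = re.compile(rb"ghp_[A-Za-z0-9]{36}")


def find_tokens(blob: bytes):
    """Return candidate tokens found in raw bytes (text or binary content)."""
    return [m.group().decode("ascii") for m in TOKEN_RE.finditer(blob)]


def scan_tree(root):
    """Scan every regular file under root; map file path -> candidate tokens."""
    hits = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            found = find_tokens(path.read_bytes())
            if found:
                hits[str(path)] = found
    return hits


if __name__ == "__main__":
    # Scans the current directory; point it at an unpacked image filesystem.
    for file_name, tokens in scan_tree(".").items():
        # Print only a prefix so the scan output does not leak the secret again.
        print(file_name, [t[:8] + "..." for t in tokens])
```

A match is only a candidate: whether it is a live credential has to be checked separately, and any real token found this way should be revoked rather than merely deleted from the image.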

Amazon Q Developer just reached a $260 million milestone

Post Syndicated from Aytul Arisoy Cholkar original https://aws.amazon.com/blogs/devops/amazon-q-developer-just-reached-a-260-million-dollar-milestone/

To help them be more productive, developers all over the world are turning to generative AI-powered assistants like Amazon Q Developer, the most capable assistant for accelerating software development. While Amazon Q Developer is great at providing code suggestions, writing new code is one of many things developers have to do on a day-to-day basis. Amazon Q goes well beyond writing code, helping developers with tasks like testing, debugging, understanding existing code, finding security vulnerabilities, implementing entire new features, and more. One of the most time consuming and frustrating tasks is upgrading applications to the latest version. Developers and IT teams need to modernize their existing applications to take advantage of the latest technologies that help them innovate faster and improve performance – but upgrade campaigns are costly, often taking months or years to complete.

Amazon Q Developer helps alleviate toil across a range of software development tasks by using agents, which act like a team that helps developers complete those tasks. Agents can reason and plan with minimal human intervention, and are capable of performing complex, multi-step tasks.

On August 1st, Andy Jassy shared an exciting finding regarding the real and quantifiable impact that the Amazon Q Developer agent for code transformation offers IT and developer teams of any size. Amazon has migrated tens of thousands of production applications from Java 8 or 11 to Java 17 with assistance from Amazon Q Developer. This represents a savings of over 4,000 years of development work for over a thousand developers (when compared to manual upgrades) and performance improvements worth $260 million in annual cost savings.

To determine the true business impact of Q Developer-assisted app upgrades, we estimated the time saved by looking at the number of Java dependencies we migrated. Typically, it can take a day or more of a developer’s time to migrate just one dependency, and many applications have dozens of dependencies that need migrating. With the agent for code transformation, many of these dependencies can be migrated in minutes, resulting in a significant time savings. To estimate cost savings, we looked at the number of hosts we were able to remove from the applications due to the performance improvements achieved by upgrading to Java 17. Both of these estimates are conservative and our actual cost and time saved is likely much greater.
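The time-saved estimate described above amounts to simple arithmetic over the number of migrated dependencies. The Python sketch below shows that shape under stated assumptions: one developer-day per dependency when done manually (the post says "a day or more"), a caller-supplied number of minutes per dependency with the agent, and an 8-hour working day. The function name and the inputs in the usage example are hypothetical, not figures from the post.

```python
def developer_days_saved(dependencies_migrated,
                         manual_days_per_dependency=1.0,
                         assisted_minutes_per_dependency=0.0,
                         hours_per_day=8.0):
    """Estimate developer-days saved versus migrating every dependency by hand."""
    manual_days = dependencies_migrated * manual_days_per_dependency
    # Convert the assisted effort from minutes to working days.
    assisted_days = (dependencies_migrated * assisted_minutes_per_dependency
                     / 60.0 / hours_per_day)
    return manual_days - assisted_days


if __name__ == "__main__":
    # Hypothetical application with 30 dependencies, 16 minutes each with the agent.
    print(developer_days_saved(30, assisted_minutes_per_dependency=16))
```

Keeping the manual effort at the low end of "a day or more" makes the result a lower bound, in line with the post's remark that its estimates are conservative.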

What is the Amazon Q Developer agent for code transformation?

The Amazon Q Developer agent for code transformation automates the complete end-to-end process of upgrading and modernizing applications, significantly reducing the time and costs associated with transformation projects, while enhancing application security and performance. Developers who want to learn how to get started with the agent can head over to community.aws for tutorials.

Accelerate complex, multi-step tasks to save hours of work every day

Amazon Q Developer has an agent for software development that can autonomously perform a range of tasks–everything from implementing features, to documenting and refactoring code. Developers can simply ask Q to implement an application feature (such as asking it to create an “add to favorites” feature in a social sharing app), and the agent will analyze their existing application code and generate a step-by-step implementation plan. Developers can collaborate with the agent to review and iterate on the plan before the agent implements it, connecting multiple steps together and applying updates across source files, code blocks, and test suites. Customers have reported efficiency improvements of 25% faster initial development and up to a 40% increase in developer productivity.

How can you get started with Amazon Q Developer?

Individual users can get started with Q Developer in the AWS Console, CLI, or in their IDE on the perpetual Free Tier. Try the Pro Tier subscription if you need to manage a team of users and policies via enterprise access controls, to customize the code suggestions Amazon Q Developer makes to include your internal code base, or to add higher limits on advanced features.

Introducing the Māori Data Lens for the Well-Architected Framework

Post Syndicated from Craig Hind original https://aws.amazon.com/blogs/architecture/introducing-the-maori-data-lens-for-the-well-architected-framework/

In Aotearoa New Zealand, we have been listening and learning to better understand Māori aspirations when using cloud technology. We have been learning from Māori customers, partners, and advisors who have helped us on this journey. A common theme was how to safeguard Māori data in a digital world. Together with a group of Māori advisors, we are excited to introduce the first iteration of a Māori Data Lens for the AWS Well-Architected Framework. This lens is the first of its kind for AWS globally that focuses on indigenous data, specifically Māori data considerations.

An AWS Well-Architected Framework lens is designed to provide a technology, industry, or domain specific perspective aligned with the AWS Well-Architected Framework. The Māori Data Lens allows customers to apply important Māori data considerations when designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. This lens is designed to be a living resource that can grow and adapt alongside the evolving questions and considerations Māori have about how to secure and protect their data as a taonga (treasure). We hope this lens will be valuable in empowering individuals and organisations to design, build, and operate applications and workloads in the AWS Cloud in ways that can align with Māori values and expectations across the six pillars of the Well-Architected Framework: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. These are not a rigid set of rules, but instead a set of guiding principles.

This lens is designed to complement invaluable te ao Māori knowledge and expertise. It’s important to consult, build trust, and reflect Māori voices in digital and technology choices. AWS customers can consult and partner with their Māori customers to build systems in a way that responsibly interacts with their Māori data. This lens is a framework of practical questions and considerations. When combined with Māori knowledge and expertise, AWS customers can begin to use cloud technology in a way that empowers adherence to important cultural and ethical dimensions for safeguarding Māori data. As a starting point, we have sought to align the insights shared with us to our AWS best practices for architecting secure, reliable, and cost-effective applications in the AWS Cloud. Together, these guidelines support the durability and protection of Māori data.

At AWS, we have always believed it is essential that our customers have control over their data. We strive to give customers choice in how to secure and manage their data in the cloud in accordance with their needs. This has been true from the very beginning, when we were the only major cloud provider to allow our customers to control the geographic location of their data, never moving customer data without explicit instruction from the customer.

The launch of an AWS Local Zone in Auckland in 2023 and the coming AWS Region in Aotearoa give all New Zealanders the choice to store their data onshore in Aotearoa New Zealand in the AWS Cloud without compromising on performance, innovation, scale, or security. We also know that some customers have needs that go beyond where their data is stored. We’re committed to expanding our understanding and our capability to help all customers meet their particular needs and best serve their own customers, to protect their data, and to meet legal and regulatory requirements.

Advice and feedback to date have been instrumental in helping to shape this resource, and we are deeply grateful for the ongoing partnership and insights. We recognise there are different perspectives, and that tikanga (protocol/practice) and experience among Māori on this topic continue to evolve. We welcome feedback on enhancing this resource to better serve the needs of our Māori customers and partners. To provide feedback, reach out to us using the feedback feature on the lens document or your local AWS account team.

We’d like to thank AWS partner HTK Group, as well as Māori technology and data experts who advised us on this work including Renata Hakiwai, Lee Timutimu, Nikora Ngaropo, Ngapera Riley, Wade Reweti, Atawhai Tibble and Eli Pohio. In their words:

“In this rapidly evolving digital landscape, the importance of understanding, organising, and harnessing data cannot be overstated. For our Māori communities, this holds even greater significance, as data can be a taonga or a treasure that represents the collective wisdom and knowledge passed down through generations.

Nā tō rourou, nā taku rourou, ka ora ai te iwi. With your food basket and my food basket, the people will thrive.

Mauri ora!”

Read the Māori Data Lens, or contact your AWS account team for more information.

Federated access to Amazon Athena using AWS IAM Identity Center

Post Syndicated from Ajay Rawat original https://aws.amazon.com/blogs/security/federated-access-to-amazon-athena-using-aws-iam-identity-center/

Managing Amazon Athena through identity federation allows you to manage authentication and authorization procedures centrally. Athena is a serverless, interactive analytics service that provides a simplified and flexible way to analyze petabytes of data.

In this blog post, we show you how you can use the Athena JDBC driver (which includes a browser Security Assertion Markup Language (SAML) plugin) to connect to Athena from third-party SQL client tools, which helps you quickly implement identity federation capabilities and multi-factor authentication (MFA). This enables automation and enforcement of data access policies across your organization.

You can use AWS IAM Identity Center to federate user access to AWS accounts. IAM Identity Center integrates with AWS Organizations to manage access to the AWS accounts under your organization. In this post, you will learn how to configure the Athena driver to use the AWS configuration profile credentials. This allows you to resolve credentials from IAM Identity Center and use the single sign-on (SSO) and MFA capabilities of your federation identity provider (IdP).

Prerequisites

To implement this solution, you must have the following prerequisites:

Note: Lake Formation only supports a single role in the SAML assertion. Multiple roles cannot be used.

Solution overview

Figure 1: Solution architecture

To implement the solution, complete the steps below as shown in Figure 1:

  1. An IAM Identity Center delegated administrator creates two custom permission sets within Identity Center.
  2. An IAM Identity Center delegated administrator assigns permission sets to AWS accounts, users, and groups. The user has permissions to single sign-on roles that are provisioned in the data lake account. The role created by Identity Center has a name that begins with AWSReservedSSO.
  3. A Lake Formation administrator grants single sign-on roles permissions to the corresponding database and tables.

The solution workflow consists of the following high-level steps as shown in Figure 1:

  1. The user configures IAM Identity Center authentication using the AWS CLI.
  2. The AWS CLI redirects the user to the AWS access portal URL. The user enters workforce identity credentials (username and password) and then chooses Sign in.
  3. The AWS access portal verifies the user’s identity. IAM Identity Center redirects the request to the Identity Center authentication service to validate the user’s credentials.
  4. If MFA is enabled for the user, then they are prompted to authenticate their MFA device.
  5. The user enters or approves the MFA details. The user’s MFA is successfully completed.
  6. The user selects the AWS account to use from the displayed list and then selects the IAM single sign-on role to use from the displayed list.
  7. The user tests the SQL client connection and then uses the client to run a SQL query.
  8. The client makes a call to Athena to retrieve the table and associated metadata from the Data Catalog.
  9. Athena requests access to the data from Lake Formation.
  10. Lake Formation invokes the AWS Security Token Service (AWS STS).
    1. Lake Formation obtains temporary AWS credentials with the permissions of the defined IAM role (sensitive or non-sensitive) associated with the data lake location.
    2. Lake Formation returns temporary credentials to Athena.
  11. Athena uses the temporary credentials to retrieve data objects from Amazon S3.
  12. The Athena engine successfully runs the query and returns the results to the client.
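Step 1 of the workflow relies on a named profile in the AWS CLI configuration file. As a rough illustration only, the following Python sketch generates such a profile. The profile name, start URL, account ID, and role name are hypothetical placeholders, and the sketch assumes the legacy single-profile SSO keys (sso_start_url, sso_region, sso_account_id, sso_role_name); your CLI version might instead use an sso-session section.

```python
import configparser
import io


def build_sso_profile(profile_name, start_url, sso_region, account_id, role_name, region):
    """Return the text of an AWS CLI config profile that resolves
    credentials through IAM Identity Center (legacy SSO keys assumed)."""
    config = configparser.ConfigParser()
    config[f"profile {profile_name}"] = {
        "sso_start_url": start_url,
        "sso_region": sso_region,
        "sso_account_id": account_id,
        "sso_role_name": role_name,
        "region": region,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# All values below are hypothetical placeholders.
profile_text = build_sso_profile(
    profile_name="athena-sensitive",
    start_url="https://example.awsapps.com/start",
    sso_region="us-east-1",
    account_id="111122223333",
    role_name="sensitive-iam-role",
    region="us-east-1",
)
print(profile_text)
```

The generated text could be appended to ~/.aws/config, after which aws sso login --profile athena-sensitive would start the browser sign-in described in steps 2 through 6.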

Solution walkthrough

The walkthrough includes five sections that will guide you through the process of creating permission sets, assigning permission sets to AWS Accounts, managing permission sets access using Lake Formation, and setting up third-party SQL clients such as SQL Workbench to connect to your data store and query your data through Athena.

Step 1: Federate onboarding

Federated onboarding is done within the IAM Identity Center account. As part of federated onboarding, you need to create IAM Identity Center users and groups. A group is a collection of people who have the same security rights and permissions. You can create groups and add users to the groups. Create one IAM Identity Center group for sensitive data and another for non-sensitive data to provide distinct access to different classes of data sets. You can assign access to IAM Identity Center permission sets to a user or group.

To federate onboarding:

  1. Open the AWS Management Console using the IAM Identity Center account and go to IAM Identity Center.
  2. Choose Groups.
  3. Choose Create group.
  4. Enter a Group name and Description.
  5. Choose Create group.

To add a user as a member of a group:

  1. Open the IAM Identity Center console.
  2. Choose Groups.
  3. Select the group name that you want to update.
  4. On the group details page, under Users in this group, choose Add users to group.
  5. On the Add users to group page, under Other users, locate the users you want to add as members and select the check box next to each of them.
  6. Choose Add users to group.

Figure 2: Assigning users to a group

Step 2: Create permission sets

For this step, create two permission sets (sensitive-iam-role and non-sensitive-iam-role). These permission sets can be assigned to users or groups in IAM Identity Center, granting them specific access to AWS account resources.

To create custom permission sets:

  1. In the IAM Identity Center administrator account, under Multi-Account permissions, choose Permission sets.
  2. Choose Create permission set.
  3. On the Select permission set type page, under Permission set type, choose Custom permission set.

    Figure 3: Selecting a permission set

  4. Choose Next.
  5. On the Specify policies and permission boundary page, expand Inline policy to add custom JSON-formatted policy text.
  6. Insert the following policy and update the S3 bucket name (<s3-bucket-name>), AWS Region (<region>), account ID (<account-id>), CloudWatch alarm name (<AlarmName>), Athena workgroup name (sensitive or non-sensitive) (<WorkGroupName>), KMS key alias name (<KMS-key-alias-name>), and organization ID (<aws-PrincipalOrgID>).
    {
      "Statement": [
        {
          "Action": [
            "lakeformation:SearchTablesByLFTags",
            "lakeformation:SearchDatabasesByLFTags",
            "lakeformation:ListLFTags",
            "lakeformation:GetResourceLFTags",
            "lakeformation:GetLFTag",
            "lakeformation:GetDataAccess",
            "glue:SearchTables",
            "glue:GetTables",
            "glue:GetTable",
            "glue:GetPartitions",
            "glue:GetDatabases",
            "glue:GetDatabase"
          ],
          "Effect": "Allow",
          "Resource": "*",
          "Sid": "LakeformationAccess"
        },
        {
          "Action": [
            "s3:PutObject",
            "s3:ListMultipartUploadParts",
            "s3:ListBucketMultipartUploads",
            "s3:ListBucket",
            "s3:GetObject",
            "s3:GetBucketLocation",
            "s3:CreateBucket",
            "s3:AbortMultipartUpload"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::<s3-bucket-name>/*",
            "arn:aws:s3:::<s3-bucket-name>"
          ],
          "Sid": "S3Access"
        },
        {
          "Action": "s3:ListAllMyBuckets",
          "Effect": "Allow",
          "Resource": "*",
          "Sid": "AthenaS3ListAllBucket"
        },
        {
          "Action": [
            "cloudwatch:PutMetricAlarm",
            "cloudwatch:DescribeAlarms"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:cloudwatch:<region>:<account-id>:alarm:<AlarmName>"
          ],
          "Sid": "CloudWatchLogs"
        },
        {
          "Action": [
            "athena:UpdatePreparedStatement",
            "athena:StopQueryExecution",
            "athena:StartQueryExecution",
            "athena:ListWorkGroups",
            "athena:ListTableMetadata",
            "athena:ListQueryExecutions",
            "athena:ListPreparedStatements",
            "athena:ListNamedQueries",
            "athena:ListEngineVersions",
            "athena:ListDatabases",
            "athena:ListDataCatalogs",
            "athena:GetWorkGroup",
            "athena:GetTableMetadata",
            "athena:GetQueryResultsStream",
            "athena:GetQueryResults",
            "athena:GetQueryExecution",
            "athena:GetPreparedStatement",
            "athena:GetNamedQuery",
            "athena:GetDatabase",
            "athena:GetDataCatalog",
            "athena:DeletePreparedStatement",
            "athena:DeleteNamedQuery",
            "athena:CreatePreparedStatement",
            "athena:CreateNamedQuery",
            "athena:BatchGetQueryExecution",
            "athena:BatchGetNamedQuery"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:athena:<region>:<account-id>:workgroup/<WorkGroupName>",
            "arn:aws:athena:<region>:<account-id>:datacatalog/<DataCatalogName>"
          ],
          "Sid": "AthenaAllow"
        },
        {
          "Action": [
            "kms:GenerateDataKey",
            "kms:DescribeKey",
            "kms:Decrypt"
          ],
          "Condition": {
            "ForAnyValue:StringLike": {
              "kms:ResourceAliases": "<KMS-key-alias-name>"
            }
          },
          "Effect": "Allow",
          "Resource": "*",
          "Sid": "kms"
        },
        {
          "Action": "*",
          "Condition": {
            "StringNotEquals": {
              "aws:PrincipalOrgID": "<aws-PrincipalOrgID>"
            }
          },
          "Effect": "Deny",
          "Resource": "*",
          "Sid": "denyRule"
        }
      ],
      "Version": "2012-10-17"
    }

  7. Update the custom policy to add the corresponding Athena workgroup ARN for the sensitive and non-sensitive IAM roles.

    Note: See the documentation for information about AWS global condition context keys.

  8. Choose Next.
  9. On the Specify permission set details page, enter a name to identify this permission set in IAM Identity Center. The name that you specify for this permission set appears in the AWS access portal as an available role. Users sign in to the AWS access portal, choose an AWS account, and then choose the role.
  10. Choose Next.
  11. On the Review and create page, review the selections that you made, and then choose Create.
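The placeholder substitution in step 6 can be scripted so that no placeholder is left unfilled by accident. The following is a minimal Python sketch under stated assumptions: it uses a shortened excerpt of the policy rather than the full document, and the bucket, Region, account, and workgroup values are hypothetical.

```python
import json
import re

# Shortened excerpt of the inline policy, using the same <placeholder> names.
POLICY_TEMPLATE = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3Access",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<s3-bucket-name>/*",
        "arn:aws:s3:::<s3-bucket-name>"
      ]
    },
    {
      "Sid": "AthenaAllow",
      "Effect": "Allow",
      "Action": ["athena:StartQueryExecution", "athena:GetQueryResults"],
      "Resource": ["arn:aws:athena:<region>:<account-id>:workgroup/<WorkGroupName>"]
    }
  ]
}
"""


def render_policy(template, values):
    """Replace <name> placeholders and fail loudly if any remain."""
    rendered = template
    for name, value in values.items():
        rendered = rendered.replace(f"<{name}>", value)
    leftover = re.findall(r"<[A-Za-z][A-Za-z0-9-]*>", rendered)
    if leftover:
        raise ValueError(f"unfilled placeholders: {sorted(set(leftover))}")
    # Parsing also confirms that the result is still valid JSON.
    return json.loads(rendered)


# Hypothetical values for the sensitive permission set.
policy = render_policy(
    POLICY_TEMPLATE,
    {
        "s3-bucket-name": "example-athena-results",
        "region": "us-east-1",
        "account-id": "111122223333",
        "WorkGroupName": "sensitive",
    },
)
print(json.dumps(policy, indent=2))
```

Running the same function with WorkGroupName set to the non-sensitive workgroup would produce the policy text for the second permission set.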

Step 3: Assign permission sets to AWS accounts

You can add and remove permission sets for an IAM Identity Center user or group by attaching and detaching permission sets. Permission sets define what actions an identity can perform on which AWS resources.

To assign permission sets to AWS accounts:

  1. In the IAM Identity Center administrator account, under Multi-account permissions, choose AWS accounts.
  2. On the AWS accounts page, select one or more AWS accounts that you want to assign single sign-on access to.
  3. Choose Assign users or groups.

    Figure 4: Selecting users and groups

  4. On the Assign users and groups to “<AWS account name>” page, for Selected users and groups, choose the users that you want to create the permission set for. Choose Next.
  5. On the Assign permission sets to “<AWS account name>” page, select one or more permission sets.
  6. On the Review and submit assignments to “<AWS account name>” page, for Review and submit, choose Submit.

Step 4: Grant permissions to IAM (single sign-on) roles

A data lake administrator has the broad ability to grant a principal (including themselves) permissions on Data Catalog resources. This includes the ability to manage access controls and permissions for the data lake. When you grant Lake Formation permissions on a specific Data Catalog table, you can also include data filtering specifications. This allows you to further restrict access to certain data within the table, limiting what users can see in their query results based on those filtering rules.

To grant permissions to IAM roles:

In the Lake Formation console, under Permissions in the navigation pane, select Data Lake permissions, and then choose Grant.

To grant Database permissions to IAM roles:

  1. Under Principals, select the IAM role name (for example, Sensitive-IAM-Role).
  2. Under Named Data Catalog resources, go to Databases and select a database (for example, demo).

    Figure 5: Select an IAM role and database

  3. Under Database permissions, select Describe and then choose Grant.

    Figure 6: Grant database permissions to an IAM role

To grant table permissions to IAM roles:

  1. Repeat steps 1 and 2 of the preceding procedure.
  2. Under Tables – optional, select a table name (for example, demo2).

    Figure 7: Select tables within a database to grant access

  3. Select the desired Table permissions (for example, Select and Describe), and then choose Grant.

    Figure 8: Grant access to tables within the database

  4. Repeat steps 1 through 4 to grant access for the respective database and tables for the non-sensitive IAM role.
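The console grants above can also be expressed as request payloads. The following Python sketch only builds the payloads; it assumes they would be passed as keyword arguments to the Lake Formation GrantPermissions API (for example, boto3's grant_permissions), which is an alternative to the console procedure described in this post. The role ARN is a hypothetical example; the database and table names reuse the demo and demo2 examples from the steps.

```python
def build_grant_request(role_arn, database, table=None, permissions=("DESCRIBE",)):
    """Build a GrantPermissions-style payload for a database or a table."""
    if table is None:
        resource = {"Database": {"Name": database}}
    else:
        resource = {"Table": {"DatabaseName": database, "Name": table}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": resource,
        "Permissions": list(permissions),
    }


# Hypothetical ARN of the single sign-on role provisioned by Identity Center.
SENSITIVE_ROLE_ARN = (
    "arn:aws:iam::111122223333:role/aws-reserved/sso.amazonaws.com/"
    "AWSReservedSSO_sensitive-iam-role_0123456789abcdef"
)

# Describe on the database, then Select and Describe on one table.
database_grant = build_grant_request(SENSITIVE_ROLE_ARN, "demo")
table_grant = build_grant_request(
    SENSITIVE_ROLE_ARN, "demo", table="demo2", permissions=("SELECT", "DESCRIBE")
)
print(database_grant)
print(table_grant)
```

Keeping the payloads as plain dictionaries makes it possible to review them before any grant is applied, and to repeat the same grants for the non-sensitive role by changing only the ARN.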

Step 5: Client-side setup using JDBC

You can use a JDBC connection to connect Athena and SQL client applications (for example, PyCharm or SQL Workbench) to enable analytics and reporting on the data that Athena returns from Amazon S3 databases. To use the Athena JDBC driver, you must specify the driver class from the JAR file. Additionally, you must pass in some parameters to change the authentication mechanism so the athena-sts-auth libraries are used:

  • S3 output location – Where in S3 the Athena service can write its output. For example, s3://path/to/query/bucket/.
  • The IAM Identity Center administrator can configure the session duration for the AWS access portal. The session duration can be set from a minimum of 15 minutes to a maximum of 90 days.

To set up PyCharm

  1. Download the Athena JDBC 3.x driver from the Athena JDBC 3.x driver page.
    1. In the left navigation pane, select JDBC 3.x and then Getting started. Select Uber jar to download a .jar file, which contains the driver and its dependencies.

      Figure 9: Download Athena JDBC jar

  2. Open PyCharm and create a new project.
    1. Enter a Name for your project
    2. Select the desired project Location
    3. Choose Create

    Figure 10: Create a new project in PyCharm

  3. Configure Data Source and drivers. Select Data Source, and then choose the plus sign or New to configure new data sources and drivers.

    Figure 11: Add database source properties

  4. Configure the Athena driver by selecting the Drivers tab, and then choose the plus sign to add a new driver.

    Figure 12: Add database drivers

  5. Under Driver Files, upload the custom JAR file that you downloaded in step 1. Select the Athena class dropdown. Enter the driver’s name (for example, Athena JDBC Driver). Then choose Apply.

    Figure 13: Add database driver files

  6. Configure a new data source. Choose the plus sign and select your driver’s name from the driver dropdown.
  7. Enter the data source name (for example, Athena Demo). For the authentication method, select User & Password. Then choose Apply.

    Figure 14: Create a project data source profile

  8. Select the SSH/SSL tab and select Use SSL. Verify that the Use truststore options for IDE, JAVA, and system are all selected. Then choose Apply.

    Figure 15: Enable data source profile SSL

  9. Select the Options tab and then select Single Session Mode. Then choose Apply.

    Figure 16: Configure single session mode in PyCharm

  10. Select the General tab and enter the JDBC and single sign-on URL. The following is a sample JDBC URL based on the SAML application:
    jdbc:athena://;CredentialsProvider=ProfileCredentials;ProfileName=<name-of-the-profile>;WorkGroup=<name-of-the-WorkGroup>;

    1. Choose Apply.
    2. Choose Test Connection. If the profile has expired, refresh the single sign-on session by running aws sso login --profile <profile-name> with the corresponding profile.

    Figure 17: Test the data source connection

  11. After the connection is successful, select the Schemas tab and select All databases and All schemas.

    Figure 18: Select data source databases and schemas

  12. Run a sample test query: SELECT <table-names> FROM <database-name> limit 10;
  13. Verify that the credentials and permissions are working as expected.

To set up SQL Workbench

  1. Open SQL Workbench.
  2. Configure an Athena driver by selecting File and then Manage Drivers.
  3. Enter Athena JDBC Driver as the name, and set the library by browsing to the location where you downloaded the driver. Enter com.amazon.athena.jdbc.AthenaDriver as the Classname.
  4. Enter the following URL, replacing <name-of-the-WorkGroup> with your workgroup name.
    jdbc:athena://;CredentialsProvider=ProfileCredentials;ProfileName=<name-of-the-profile>;WorkGroup=<name-of-the-WorkGroup>;

  5. Choose OK.
  6. Run a test query, replacing <table-names> and <database-name> with your table and database names:
    SELECT <table-names> FROM <database-name> limit 10;

  7. Verify that the credentials and permissions are working as expected.
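Both client setups use the same connection string format, so it can help to generate it rather than type it by hand. The following Python sketch assembles the URL from the three parameters shown above. The helper name and the example profile and workgroup values are hypothetical, and the optional extra argument is only a placeholder for any additional driver parameters you might need.

```python
def build_athena_jdbc_url(profile_name, workgroup, extra=None):
    """Assemble an Athena JDBC URL that uses profile-based credentials."""
    params = {
        "CredentialsProvider": "ProfileCredentials",
        "ProfileName": profile_name,
        "WorkGroup": workgroup,
    }
    params.update(extra or {})
    for key, value in params.items():
        # Stray spaces and semicolons are easy to introduce when copying URLs.
        if not value or ";" in value or any(ch.isspace() for ch in value):
            raise ValueError(f"invalid value for {key}: {value!r}")
    return "jdbc:athena://;" + "".join(f"{key}={value};" for key, value in params.items())


# Hypothetical profile and workgroup names.
url = build_athena_jdbc_url("athena-sensitive", "sensitive")
print(url)
# jdbc:athena://;CredentialsProvider=ProfileCredentials;ProfileName=athena-sensitive;WorkGroup=sensitive;
```

The validation step rejects values that contain whitespace or semicolons, which would otherwise produce a URL that the driver might parse incorrectly.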

Conclusion

In this post, we covered how to use JDBC drivers to connect to Athena from third-party SQL client tools. You were able to set this up without creating IAM users or any type of long-lived credentials that would need to be stored on your developers’ workstations. You learned how to configure IAM Identity Center users and groups, create permission sets, and assign permission sets to AWS accounts. You also learned how to grant permissions to single sign-on roles using Lake Formation to create distinct access to different classes of data sets and connect to Athena through an SQL client tool (such as PyCharm). This setup can also work with other supported identity sources such as IAM Identity Center, self-managed or on-premises Active Directory, or an external IdP.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Ajay Rawat

Ajay is a Senior Security Consultant, focusing on AWS Identity and Access Management (IAM), data protection, incident response, and operationalizing AWS security services to increase security effectiveness and reduce risk. Ajay is a technology enthusiast and enjoys working with customers to solve their technical challenges and to improve their security posture in the cloud.
Mihir Borkar

Mihir is an AWS Data Architect who excels at simplifying customer challenges with innovative cloud data solutions. Specializing in AWS Lake Formation and AWS Glue, he designs scalable data lakes and analytics platforms, demonstrating expertise in crafting efficient solutions within the AWS Cloud.

Create a customizable cross-company log lake for compliance, Part I: Business Background

Post Syndicated from Colin Carson original https://aws.amazon.com/blogs/big-data/create-a-customizable-cross-company-log-lake-for-compliance-part-i-business-background/

As described in a previous post, AWS Session Manager, a capability of AWS Systems Manager, can be used to manage access to Amazon Elastic Compute Cloud (Amazon EC2) instances by administrators who need elevated permissions for setup, troubleshooting, or emergency changes. While working for a large global organization with thousands of accounts, we were asked to answer a specific business question: “What did employees with privileged access do in Session Manager?”

This question had an initial answer: use logging and auditing capabilities of Session Manager and integration with other AWS services, including recording connections (StartSession API calls) with AWS CloudTrail, and recording commands (keystrokes) by streaming session data to Amazon CloudWatch Logs.

This was helpful, but only the beginning. We had more requirements and questions:

  • After session activity is logged to CloudWatch Logs, then what?
  • How can we provide useful data structures that minimize work to read out, delivering faster performance, using more data, with more convenience?
  • How do we support a variety of usage patterns, such as ongoing system-to-system bulk transfer, or an ad-hoc query by a human for a single session?
  • How should we share and implement governance?
  • Thinking bigger, what about the same question for a different service or across more than one use case? How do we add what other API activity happened before or after a connection—in other words, context?

We needed more comprehensive functionality, more customization, and more control than a single service or feature could offer. Our journey began where previous customer stories about using Session Manager for privileged access (similar to our situation), least privilege, and guardrails ended. We had to create something new that combined existing approaches and ideas:

  • Low-level primitives such as Amazon Simple Storage Service (Amazon S3).
  • Latest features and approaches of AWS, such as vertical and horizontal scaling in AWS Glue.
  • Our experience working with legal, audit, and compliance in large enterprise environments.
  • Customer feedback.

In this post, we introduce Log Lake, a do-it-yourself data lake based on logs from CloudWatch and AWS CloudTrail. We share our story in three parts:

  • Part 1: Business background – We share why we created Log Lake and AWS alternatives that might be faster or easier for you.
  • Part 2: Build – We describe the architecture and how to set it up using AWS CloudFormation templates.
  • Part 3: Add – We show you how to add invocation logs, model input, and model output from Amazon Bedrock to Log Lake.

Do you really want to do it yourself?

Before you build your own log lake, consider the latest, highest-level options already available in AWS. They can save you a lot of work. Whenever possible, choose AWS services and approaches that abstract away undifferentiated heavy lifting to AWS so you can spend time on adding new business value instead of managing overhead. Know the use cases services were designed for, so you have a sense of what they already can do today and where they’re going tomorrow.

If that doesn’t work, and you don’t see an option that delivers the customer experience you want, then you can mix and match primitives in AWS for more flexibility and freedom, as we did for Log Lake.

Session Manager activity logging

As we mentioned in our introduction, you can save logging data to Amazon S3, add a table on top, and query that table using Amazon Athena. This is what we recommend you consider first because it’s straightforward.

This would result in files with the sessionid in the name. If you want, you can process these files into a calendarday, sessionid, sessiondata format using an S3 event notification that invokes a function (and make sure to save it to a different bucket, in a different table, to avoid causing recursive loops). The function could derive the calendarday and sessionid from the S3 key metadata, and sessiondata would be the entire file contents.
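The per-file transformation described above can be sketched as a small Python function. This is an illustration under stated assumptions, not the function we deployed: it assumes the object key ends in a file name of the form sessionid.log and that calendarday is taken from the object's last-modified timestamp in UTC. The key and timestamp in the example are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def to_session_record(key, last_modified, body):
    """Turn one Session Manager log object into a calendarday, sessionid, sessiondata record."""
    # Assumption: the file name without its extension is the session ID.
    session_id = PurePosixPath(key).stem
    # Assumption: the calendar day comes from the object's last-modified time in UTC.
    calendar_day = last_modified.astimezone(timezone.utc).date().isoformat()
    return {
        "calendarday": calendar_day,
        "sessionid": session_id,
        "sessiondata": body,  # the entire file contents
    }


# Hypothetical object key, timestamp, and contents.
record = to_session_record(
    key="session-logs/alice-0abc123def4567890.log",
    last_modified=datetime(2024, 5, 1, 23, 30, tzinfo=timezone.utc),
    body="$ cat file\nhello\n",
)
print(record["calendarday"], record["sessionid"])
```

In an S3 event notification handler, the key would come from the event and the result would be written to a different bucket, which avoids the recursive loop mentioned above.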

Alternatively, you can log to one log group in CloudWatch Logs and have an Amazon Data Firehose subscription filter move that data to S3 (this file would have additional metadata in the JSON content and more customization potential from filters). This was used in our situation, but it wasn’t enough by itself.

AWS CloudTrail Lake

CloudTrail Lake is for running queries on events over years of history and with near real-time latency and offers a deeper and more customizable view of events than CloudTrail Event history. CloudTrail Lake enables you to federate an event data store, which lets you view the metadata in the AWS Glue catalog and run Athena queries. For needs involving one organization and ongoing ingesting from a trail (or point-in-time import from Amazon S3, or both), you can consider CloudTrail Lake.

We considered CloudTrail Lake, as either a managed lake option or source for CloudTrail only, but ended up creating our own AWS Glue job instead. This was because of a combination of reasons, including full control over schema and jobs, ability to ingest data from an S3 bucket of our choosing as an ongoing source, fine-grained filtering on account, AWS Region, and eventName (eventName filtering wasn’t supported for management events), and cost.

The cost of CloudTrail Lake, which is based on uncompressed data ingested (data size can be 10 times larger than in Amazon S3), was a factor for our use case. In one test, we found CloudTrail Lake to be 38 times faster than Log Lake at processing the same workload, but Log Lake was 10–100 times less costly depending on filters, timing, and account activity. Our test workload was 15.9 GB file size in S3, 199 million events, and 400 thousand files, spread across more than 150 accounts and 3 Regions. The filters Log Lake applied were eventname='StartSession', 'AssumeRole', 'AssumeRoleWithSAML', and five arbitrary allow-listed accounts. These tests might be different from your use case, so you should do your own testing, gather your own data, and decide for yourself.
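To make the filters in the test concrete, the following Python sketch applies the same two conditions to individual CloudTrail records. It is an illustration only, not the AWS Glue job itself. The account IDs are hypothetical, and the sketch assumes the account is read from the recipientAccountId field of each record.

```python
# Event names used as filters in the test workload.
EVENT_NAMES = {"StartSession", "AssumeRole", "AssumeRoleWithSAML"}

# Hypothetical allow list; the test used five arbitrary accounts.
ALLOWED_ACCOUNTS = {"111122223333", "444455556666"}


def keep_event(record, allowed_accounts=ALLOWED_ACCOUNTS):
    """Return True if a CloudTrail record passes both filters."""
    return (
        record.get("eventName") in EVENT_NAMES
        and record.get("recipientAccountId") in allowed_accounts
    )


# Hypothetical sample records.
events = [
    {"eventName": "StartSession", "recipientAccountId": "111122223333"},
    {"eventName": "DescribeInstances", "recipientAccountId": "111122223333"},
    {"eventName": "AssumeRole", "recipientAccountId": "999999999999"},
]
kept = [event for event in events if keep_event(event)]
print(len(kept))
```

Filtering this early is one reason the amount of data processed, and therefore cost, can vary so much with the chosen filters and account activity.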

Other services

The products mentioned previously are the most relevant to the outcomes we were trying to accomplish, but you should consider security, identity, and compliance products on AWS, too. These products and features can be used either as an alternative to Log Lake or to add functionality.

As an example, Amazon Bedrock can add functionality in three ways:

  • To skip the search and query Log Lake for you
  • To summarize across logs
  • As a source for logs (similar to Session Manager as a source for CloudWatch logs)

Querying means you can have an AI agent query your AWS Glue catalog (such as the Log Lake catalog) for data-based results. Summarizing means you can use generative artificial intelligence (AI) to summarize your text logs from a knowledge base as part of retrieval augmented generation (RAG), to ask questions like “How many log files are exactly the same? Who changed IAM roles last night?” Considerations and limitations apply.

Adding Amazon Bedrock as a source means using invocation logging to collect requests and responses.

Because we wanted to store very large amounts of data frugally (compressed and columnar format, not text) and produce non-generative (data-based) results that can be used for legal compliance and security, we didn’t use Amazon Bedrock in Log Lake—but we will revisit this topic in Part 3 when we detail how to use the approach we used for Session Manager for Amazon Bedrock.

Business background

When we began talking with our business partners, sponsors, and other stakeholders, important questions, problems, opportunities, and requirements emerged.

Why we needed to do this

Legal, security, identity, and compliance authorities of the large enterprise we were working for had created a customer-specific control. To comply with the control objective, use of elevated privileges required a manager to manually review all available data (including any Session Manager activity) to confirm or deny if use of elevated privileges was justified. This was a compliance use case that, when solved, could be applied to more use cases such as auditing and reporting.

Note on terms:

  • Here, the customer in customer-specific control means a control that is solely the responsibility of a customer, not AWS, as described in the AWS Shared Responsibility Model.
  • In this article, we define auditing broadly as testing information technology (IT) controls to mitigate risk, by anyone, at any cadence (ongoing as part of day-to-day operations, or one time only). We don’t refer to auditing that is financial, only conducted by an independent third-party, or only at certain times. We use self-review and auditing interchangeably.
  • We also define reporting broadly as presenting data for a specific purpose in a specific format to evaluate business performance and facilitate data-driven decisions—such as answering “how many employees had sessions last week?”

The use case

Our first and most important use case was a manager who needed to review activity, such as from an after-hours on-call page the previous night. If the manager needed to have additional discussions with their employee or needed additional time to consider activity, they had up to a week (7 calendar days) before they needed to confirm or deny elevated privileges were needed, based on their team’s procedures. A manager needed to review an entire set of events that all share the same session, regardless of known keywords or specific strings, as part of all available data in AWS. This was the workflow:

  1. Employee uses homegrown application and standardized workflow to access Amazon EC2 with elevated privileges using Session Manager.
  2. API activity is recorded in CloudTrail, and session data is continuously logged to CloudWatch Logs.
  3. The problem space – Data somehow gets procured, processed, and provided (this would become Log Lake later).
  4. Another homegrown system (different from step 1) presents session activity to managers and applies access controls (a manager should only review activity for their own employees, and not be able to peruse data outside their team). This data might be only one StartSession API call and no session details, or might be thousands of lines from cat file.
  5. The manager reviews all available activity, makes an informed decision, and confirms or denies if use was justified.

This was an ongoing day-to-day operation, with a narrow scope. First, this meant only data available in AWS; if something couldn’t be captured by AWS, it was out of scope. If something was possible, it should be made available. Second, this meant only certain workflows: using Session Manager with elevated privileges for a specific, documented standard operating procedure.

Avoiding review

The simplest solution would be to block sessions on Amazon EC2 with elevated privileges, and fully automate build and deployment. This was possible for some but not all workloads, because some workloads required initial setup, troubleshooting, or emergency changes of Marketplace AMIs.

Is accurate logging and auditing possible?

We won’t extensively detail ways to bypass controls here, but there are important limitations we had to consider, and we recommend you do too.

First, logging isn’t available for sessionType Port, which includes SSH. This could be mitigated by ensuring employees can only use a custom application layer to start sessions without SSH. Blocking direct SSH access to EC2 instances using security group policies is another option.

Second, there are many ways to intentionally or accidentally hide or obfuscate activity in a session, making review of a specific command difficult or impossible. This was acceptable for our use case for multiple reasons:

  • A manager would always know if a session started and needed review from CloudTrail (our source signal). We joined to CloudWatch to meet our all available data requirement.
  • Continuous streaming to CloudWatch Logs would log activity as it happened. Additionally, streaming to CloudWatch Logs supported interactive shell access, and our use case only used interactive shell access (sessionType Standard_Stream). Streaming isn’t supported for sessionType InteractiveCommands or NonInteractiveCommands.
  • The most important workflow to review involved an engineered application with one standard operating procedure (less variety than all the ways Session Manager could be used).
  • Most importantly, the manager was responsible for reviewing the reports and expected to apply their own judgement and interpret what happened. For example, a manager review could result in a follow up conversation with the employee that could improve business processes. A manager might ask their employee, “Can you help me understand why you ran this command? Do we need to update our runbook or automate something in deployment?”

To protect data against tampering, changes, or deletion, AWS provides tools and features such as AWS Identity and Access Management (IAM) policies and permissions and Amazon S3 Object Lock.

Security and compliance are a shared responsibility between AWS and the customer, and customers need to decide what AWS services and features to use for their use case. We recommend customers consider a comprehensive approach that considers overall system design and includes multiple layers of security controls (defense in depth). For more information, see the Security pillar of the AWS Well-Architected Framework.

Avoiding automation

Manual review can be a painful process, but we couldn’t automate review for two reasons: legal requirements, and the need to add friction to the feedback loop felt by a manager whenever an employee used elevated privileges, to discourage using elevated privileges.

Works with existing

We had to work with existing architecture, spanning thousands of accounts and multiple AWS Organizations. This meant sourcing data from buckets as an edge and point of ingress. Specifically, CloudTrail data was managed and consolidated outside of CloudTrail, across organizations and trails, into S3 buckets. CloudWatch data was also consolidated to S3 buckets, from Session Manager to CloudWatch Logs, with Amazon Data Firehose subscription filters on CloudWatch Logs pointing to S3. To avoid negative side effects on existing business processes, our business partners didn’t want to change settings in CloudTrail, CloudWatch, and Firehose. This meant Log Lake needed features and flexibility that enabled changes without impacting other workstreams using the same sources.

Event filtering is not a data lake

Before we were asked to help, there were attempts to do event filtering. One attempt tried to monitor session activity using Amazon EventBridge. This was limited to AWS API operations recorded by CloudTrail such as StartSession and didn’t include the information from inside the session, which was in CloudWatch Logs. Another attempt tried event filtering CloudWatch in the form of a subscription filter. Also, an attempt was made using EventBridge Event Bus with EventBridge rules, and storage in Amazon DynamoDB. These attempts didn’t deliver the expected results because of a combination of factors:

Size

The attempts couldn’t accept large session log payloads because of the EventBridge PutEvents limit of 256 KB entry size. Saving large entries to Amazon S3 and using the object URL in the PutEvents entry would avoid this limitation in EventBridge, but wouldn’t pass the most important information the manager needed to review (the event’s sessionData element). This meant managing files and physical dependencies, and losing the metastore benefit of working with data as logical sets and objects.
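To make the size constraint concrete, here is a minimal sketch that estimates the size of one PutEvents entry and compares it to the 256 KB figure quoted above. The byte accounting (a fixed 14 bytes for Time, UTF-8 lengths for the other fields) follows our reading of the EventBridge documentation; the entry contents and the source name example.sessions are hypothetical, and you should check current quotas before relying on the constant.

```python
import json

# Limit quoted in the text above; verify against current EventBridge quotas.
MAX_ENTRY_BYTES = 256 * 1024


def entry_size_bytes(entry: dict) -> int:
    """Estimate the size of one PutEvents entry in bytes."""
    size = 0
    if entry.get("Time") is not None:
        size += 14  # Time is counted as a fixed 14 bytes when present
    for field in ("Source", "DetailType", "Detail"):
        size += len(entry.get(field, "").encode("utf-8"))
    for resource in entry.get("Resources", []):
        size += len(resource.encode("utf-8"))
    return size


def fits_in_put_events(entry: dict) -> bool:
    """Return True if the entry is within the assumed size limit."""
    return entry_size_bytes(entry) <= MAX_ENTRY_BYTES


# Hypothetical entries: a short session and a long one (for example, cat on a big file).
small = {
    "Source": "example.sessions",
    "DetailType": "SessionLog",
    "Detail": json.dumps({"sessionData": ["ls -la"]}),
}
large = {
    "Source": "example.sessions",
    "DetailType": "SessionLog",
    "Detail": json.dumps({"sessionData": ["x" * 300_000]}),
}
```

A check like this only tells you that an entry won’t fit; it doesn’t solve the problem of getting the sessionData to the reviewer.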

Storage

Event filtering was a way to process data, not storage or a source of truth. We asked, how do we restore data lost in flight or destroyed after landing? If components are deleted or undergoing maintenance, can we still procure, process, and provide data—at all three layers independently? Without storage, no.

Data quality

No source of truth meant data quality checks weren’t possible. We couldn’t answer questions like: “Did the last job process more than 90 percent of events from CloudTrail in DynamoDB?” or “What percentage are we missing from source to target?”
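With a source of truth, those questions reduce to comparing identifiers between source and target. The following is a minimal sketch under the assumption that each event carries a unique identifier that survives processing; the function name and the sample identifiers are ours, for illustration only.

```python
from __future__ import annotations


def coverage(source_ids: set[str], target_ids: set[str]) -> dict:
    """Compare event identifiers in the source with those that reached the target."""
    missing = source_ids - target_ids
    total = len(source_ids)
    processed_pct = 100.0 * (total - len(missing)) / total if total else 100.0
    return {
        "processed_pct": processed_pct,
        "missing_pct": 100.0 - processed_pct,
        "missing_ids": sorted(missing),
    }


# Hypothetical identifiers: four events in the source, three in the target.
report = coverage({"e1", "e2", "e3", "e4"}, {"e1", "e2", "e3"})
meets_threshold = report["processed_pct"] > 90
```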

Anti-patterns

DynamoDB as long-term storage wasn’t the most appropriate data store for large analytical workloads, low I/O, and highly complex many-to-many joins.

Reading out

Deliveries were fast, but work (and time and cost) was needed after delivery. In other words, queries had to do extra work to transform raw data into the needed format at time of read, which had a significant, cumulative effect on performance and cost. Imagine users running a select * from table without any filters on years of data and paying for storage and compute of those queries.

Cost of ownership

Filtering by event contents (sessionData from CloudWatch) required knowledge of session behavior, which was business logic. This meant changes to business logic required changes to event filtering. Imagine being asked to change CloudWatch filters or EventBridge rules based on a business process change, and trying to remember where to make the change, or troubleshoot why expected events weren’t being passed. This meant a higher cost of ownership and slower cycle times at best, and inability to meet SLA and scale at worst.

Accidental coupling

Event filtering creates accidental coupling between downstream consumers and low-level events. Consumers who directly integrate against events might get different schemas at different times for the same events, or events they don’t need. There’s no way to manage data at a higher level than event, at the level of sets (like all events for one sessionid), or at the object level (a table designed for dependencies). In other words, there was no metastore layer that separated the schema from the files, like in a data lake.

More sources (data to load in)

There were other, less important use cases that we wanted to expand to later: inventory management and security.

Inventory management covers tasks such as identifying EC2 instances running a Systems Manager agent that’s missing a patch, finding IAM users with inline policies, or finding Redshift clusters with nodes that aren’t RA3. This data would come from AWS Config unless it isn’t a supported resource type. We cut inventory management from scope because AWS Config data could be added to an AWS Glue catalog later, and queried from Athena using an approach like the one described in How to query your AWS resource configuration states using AWS Config and Amazon Athena.

For security, Splunk and OpenSearch were already in use for serviceability and operational analysis, sourcing files from Amazon S3. Log Lake is a complementary approach sourcing from the same data, which adds metadata and simplified data structures at the cost of latency. For more information about having different tools analyze the same data, see Solving big data problems on AWS.

More use cases (reasons to read out)

We knew from the first meeting that this was a bigger opportunity than just building a dataset for sessions from Systems Manager for manual manager review. Once we had procured logs from CloudTrail and CloudWatch, set up Glue jobs to process logs into convenient tables, and were able to join across these tables, we could change filters and configuration settings to answer questions about additional services and use cases, too. Similar to how we process data for Session Manager, we could expand the filters on Log Lake’s Glue jobs, and add data for Amazon Bedrock model invocation logging. For other use cases, we could use Log Lake as a source for automation (rules-based or ML), deep forensic investigations, or string-match searches (such as IP addresses or user names).

Additional technical considerations

  • How did we define session? We would always know if a session started from the StartSession event in CloudTrail API activity. Regarding when a session ended, we did not use TerminateSession because this was not always present and we considered this domain-specific logic. Log Lake enabled downstream customers to decide how to interpret the data. For example, our most important workflow had a Systems Manager timeout of 15 minutes, and our SLA was 90 minutes. This meant managers knew a session with a start time more than 2 hours prior to the current time had already ended.

  • CloudWatch data required additional processing compared to CloudTrail, because CloudWatch logs from Firehose were saved in gzip format without a gz suffix and had multiple JSON documents in the same line that needed to be processed to be on separate lines. Firehose can transform and convert records, such as invoking a Lambda function to transform, convert JSON to ORC, and decompress data, but our business partners didn’t want to change existing settings.
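The extra CloudWatch processing described above can be sketched as follows. This is a minimal illustration, not Log Lake’s actual job code: it detects gzip by its magic bytes (because the file name has no gz suffix), then walks the text with the standard library’s JSONDecoder.raw_decode to separate JSON documents that were written back to back. The function names are ours.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def read_json_documents(raw: bytes) -> list:
    """Decompress if needed, then split concatenated JSON documents."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    text = raw.decode("utf-8")
    decoder = json.JSONDecoder()
    documents = []
    pos = 0
    while pos < len(text):
        # raw_decode rejects leading whitespace, so skip it explicitly
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        document, pos = decoder.raw_decode(text, pos)
        documents.append(document)
    return documents


def to_json_lines(documents: list) -> str:
    """Write one JSON document per line."""
    return "\n".join(json.dumps(d) for d in documents)
```

Detecting the format from content rather than from the file name keeps the existing Firehose settings untouched, which was the constraint in our case.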

How to get the data (a deep dive)

To support the dataset needed for a manager to review, we needed to identify API-specific metadata (time, event source, and event name), and then join it to session data. CloudTrail was necessary because it was the most authoritative source for AWS API activity, specifically StartSession and AssumeRole and AssumeRoleWithSAML events, and contained context that didn’t exist in CloudWatch Logs (such as the error code AccessDenied) which could be useful for compliance and investigation. CloudWatch was necessary because it contained the keystrokes in a session, in the CloudWatch log’s sessionData element. We needed to obtain the AWS source of record from CloudTrail, but we recommend you check with your authorities to confirm you really need to join to CloudTrail. We mention this in case you hear this question: “Why not derive some sort of earliest eventTime from CloudWatch logs, and skip joining to CloudTrail entirely? That would cut size and complexity by half.”

To join CloudTrail (eventTime, eventname, errorCode, errorMessage, and so on) with CloudWatch (sessionData), we had to do the following:

  1. Get the higher level API data from CloudTrail (time, event source, and event name), as the authoritative source for auditing Session Manager. To get this, we needed to look inside all CloudTrail logs and get only the rows with eventname=‘StartSession’ and eventsource=‘ssm.amazonaws.com’ (events from Systems Manager)—our business partners described this as looking for a needle in a haystack, because this could be only one session event across millions or billions of files. After we obtained this metadata, we needed to extract the sessionid to know what session to join it to, and we chose to extract sessionid from responseelements. Alternatively, we could use useridentity.sessioncontext.sourceidentity if a principal provided it while assuming a role (requires sts:SetSourceIdentity in the role trust policy).

Sample of a single record’s responseelements.sessionid value: "sessionid":"theuser-thefederation-0b7c1cc185ccf51a9"

The actual sessionid was the final element of the logstream: 0b7c1cc185ccf51a9.

  2. Next we needed to get all logs for a single session from CloudWatch. Similarly to CloudTrail, we needed to look inside all CloudWatch logs landing in Amazon S3 from Firehose to identify only the needles that contained "logGroup":"/aws/ssm/sessionlogs". Then, we could get sessionid from logstream or sessionId, and get session activity from the message.sessionData.

Sample of a single record’s logStream element: "sessionId": "theuser-thefederation-0b7c1cc185ccf51a9"

Note: Looking inside the log isn’t always necessary. We did it because we had to work with existing logs Firehose put to Amazon S3, which didn’t have the logstream (and sessionid) in the file name. For example, a file from Firehose might have a name like

cloudwatch-logs-otherlogs-3-2024-03-03-22-22-55-55239a3d-622e-40c0-9615-ad4f5d4381fa

If we had been able to use the ability of Session Manager to send logs to S3 directly, the file name in S3 would be the loggroup (theuser-thefederation-0b7c1cc185ccf51a9.dms) and could be used to derive sessionid without looking inside the file.

  3. Downstream of Log Lake, consumers could join on sessionid which was derived in the previous step.
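The steps above can be sketched in a few functions. This is an illustration under stated assumptions, not the Glue job logic: it works on already-parsed records, uses the field names found in raw CloudTrail JSON (eventName, eventSource, responseElements.sessionId) and in CloudWatch Logs subscription records (logGroup, logStream, logEvents), and keeps each log message as an opaque string rather than parsing sessionData out of it. All function names are ours.

```python
from collections import defaultdict


def short_session_id(name: str) -> str:
    """Return the final hyphen-separated element, such as 0b7c1cc185ccf51a9."""
    return name.rsplit("-", 1)[-1]


def sessions_from_cloudtrail(records: list) -> list:
    """Step 1: keep only StartSession events from Systems Manager."""
    rows = []
    for record in records:
        if record.get("eventName") != "StartSession":
            continue
        if record.get("eventSource") != "ssm.amazonaws.com":
            continue
        # responseElements can be null, for example when the call was denied
        response = record.get("responseElements") or {}
        full_id = response.get("sessionId")
        rows.append({
            "sessionid": short_session_id(full_id) if full_id else None,
            "eventtime": record.get("eventTime"),
            "errorcode": record.get("errorCode"),
        })
    return rows


def activity_from_cloudwatch(documents: list,
                             log_group: str = "/aws/ssm/sessionlogs") -> dict:
    """Step 2: collect log messages per session from the session log group."""
    activity = defaultdict(list)
    for document in documents:
        if document.get("logGroup") != log_group:
            continue
        sessionid = short_session_id(document["logStream"])
        for event in document.get("logEvents", []):
            activity[sessionid].append(event["message"])
    return activity


def join_sessions(trail_rows: list, activity: dict) -> list:
    """Step 3: left join, so a session without CloudWatch data still appears."""
    return [
        {**row, "messages": activity.get(row["sessionid"], [])}
        for row in trail_rows
    ]
```

The left join matters for the review workflow: a manager still needs to see a session that has a StartSession call and no session details.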

What’s different about Log Lake

If you remember one thing about Log Lake, remember this: Log Lake is a data lake for compliance-related use cases, uses CloudTrail and CloudWatch as data sources, has separate tables for writing (original raw) and reading (read-optimized or readready), and gives you control over all components so you can customize it for yourself.

Here are some of the signature qualities of Log Lake:

Legal, identity, or compliance use cases

This includes deep dive forensic investigation, meaning use cases that are large volume, historical, and analytical. Because Log Lake uses Amazon S3, it can meet regulatory requirements that require write-once-read-many (WORM) storage.

AWS Well-Architected Framework

Log Lake applies real-world, time-tested design principles from the AWS Well-Architected Framework. This includes, but is not limited to:

Operational Excellence also meant knowing service quotas, performing workload testing, and defining and documenting runbook processes. If we hadn’t tried to break something to see where the limit is, then we considered it untested and inappropriate for production use. To test, we would determine the highest single day volume we’d seen in the past year, and then run that same volume in an hour to see if (and how) it would break.

High-Performance, Portable Partition Adding (AddAPart)

Log Lake adds partitions to tables using Lambda functions with SQS, a pattern we call AddAPart. This uses Amazon Simple Queue Service (Amazon SQS) to decouple triggers (files landing in Amazon S3) from actions (associating that file with metastore partition). Think of this as having four F’s:

This means no AWS Glue crawlers and no alter table or msck repair table to add partitions in Athena, and the pattern can be reused across sources and buckets. Managing partitions this way also makes it possible to use partition-related features in AWS Glue, including AWS Glue partition indexes and workload partitioning and bounded execution.
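As a rough illustration of the decoupling, the sketch below shows what a Lambda handler behind an SQS queue might do with S3 event notifications: unwrap the SQS message bodies, read the bucket and key of each new object, and work out which partition the object belongs to. The key layout (source/yyyy/mm/dd/file) and the function names are assumptions for illustration, not Log Lake’s actual layout or code, which Part 2 describes.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_event(event: dict) -> list:
    """Unwrap S3 event notifications delivered through SQS into (bucket, key) pairs."""
    found = []
    for message in event.get("Records", []):
        body = json.loads(message["body"])
        # S3 test events have no "Records" element, so default to an empty list
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
            found.append((bucket, key))
    return found


def partition_for(bucket: str, key: str):
    """Derive partition values from a hypothetical source/yyyy/mm/dd/file layout."""
    parts = key.split("/")
    if len(parts) < 5:
        return None  # unknown layout: add nothing rather than guess
    prefix = "/".join(parts[:4])
    return {"Values": parts[1:4], "Location": f"s3://{bucket}/{prefix}/"}


def handler(event: dict, context=None) -> list:
    """Return the distinct partitions touched by this batch of messages."""
    partitions = []
    for bucket, key in s3_objects_from_sqs_event(event):
        partition = partition_for(bucket, key)
        if partition is not None and partition not in partitions:
            partitions.append(partition)
    # A real function would now register these partitions in the metastore,
    # for example through the AWS Glue BatchCreatePartition API.
    return partitions
```

Because the queue sits between the bucket and the function, the same handler can serve several sources and buckets, and a burst of files lands in the queue instead of overwhelming the metastore.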

File name filtering uses the same central controls for lower cost of ownership, faster changes, troubleshooting from one location, and emergency levers—this means that if you want to avoid log recursion happening from a specific account, or want to exclude a Region because of regulatory compliance, you can do it in one place, managed by your change control process, before you pay for processing in downstream jobs.
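A central file name filter of this kind can be as small as one function. The sketch below assumes the standard CloudTrail object key layout (AWSLogs/account/CloudTrail/region/..., with an optional organization ID segment); the excluded account and Region are placeholder values, and in practice they would come from configuration managed by your change control process.

```python
import re

# Standard CloudTrail key layout, with an optional organization ID segment.
KEY_PATTERN = re.compile(
    r"AWSLogs/(?:o-[a-z0-9]+/)?(?P<account>\d{12})/CloudTrail/(?P<region>[a-z0-9-]+)/"
)

# Placeholder exclusions for illustration only.
EXCLUDED_ACCOUNTS = frozenset({"111122223333"})
EXCLUDED_REGIONS = frozenset({"eu-central-1"})


def should_process(key: str,
                   excluded_accounts=EXCLUDED_ACCOUNTS,
                   excluded_regions=EXCLUDED_REGIONS) -> bool:
    """Decide from the object key alone whether downstream jobs should see the file."""
    match = KEY_PATTERN.search(key)
    if match is None:
        return False  # unrecognized layout: skip rather than guess
    return (match["account"] not in excluded_accounts
            and match["region"] not in excluded_regions)
```

Because the decision uses only the key, it runs before any file is opened, which is where the cost saving comes from.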

If you want to tell a team, “onboard your data source to our log lake, here are the steps you can use to self-serve,” you can use AddAPart to do that. We describe this in Part 2.

Readready Tables

In Log Lake, data structures offer differentiated value to users, and original raw data isn’t directly exposed to downstream users by default. For each source, Log Lake has a corresponding read-optimized readready table.

Instead of this:

from_cloudtrail_raw

from_cloudwatch_raw

Log Lake exposes only these to users:

from_cloudtrail_readready

from_cloudwatch_readready

In Part 2, we describe these tables in detail. Here are our answers to frequently asked questions about readready tables:

Q: Doesn’t this have an up-front cost to process raw into readready? Why not pass the work (and cost) to downstream users?

A: Yes, and for us the cost of processing partitions of raw into readready happened once and was fixed, and was offset by the variable costs of querying, which was from many company-wide callers (systemic and human), with high frequency, and large volume.

Q: How much better are readready tables in terms of performance, cost, and convenience? How do you achieve these gains? How do you measure “convenience”?

A: In most tests, readready tables are 5–10 times faster to query and more than 2 times smaller in Amazon S3. Log Lake applies more than one technique: omitting columns, partition design, AWS Glue partition indexes, data types (readready tables don’t allow any nested complex data types within a column, such as struct<struct>), columnar storage (ORC), and compression (ZLIB). We measure convenience as the number of operations required to join on a sessionid; using Log Lake’s readready tables this is 0 (zero).
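To illustrate the data type point, here is a minimal sketch of flattening a nested record into top-level columns so that no column holds a struct. It shows the general idea only; the naming scheme (joining path elements with an underscore) is our assumption, and the real readready schemas are described in Part 2.

```python
def flatten(record: dict, parent: str = "", sep: str = "_") -> dict:
    """Turn nested dictionaries into flat columns named by their path."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))  # recurse into the struct
        else:
            flat[name] = value
    return flat


# Hypothetical nested record, shaped like part of a CloudTrail event.
nested = {
    "eventname": "StartSession",
    "useridentity": {"sessioncontext": {"sourceidentity": "theuser"}},
}
columns = flatten(nested)
```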

Q: Do raw and readready use the same files or buckets?

A: No, files and buckets are not shared. This decouples writes from reads, improves both write and read performance, and adds resiliency.

This question is important when designing for large sizes and scaling, because a single job or downstream read alone can span millions of files in Amazon S3. S3 scaling doesn’t happen immediately, so queries against raw or original data involving many tiny JSON files can cause S3 503 errors when the request rate exceeds 5,500 GET/HEAD requests per second per prefix. More than one bucket helps avoid resource saturation. There is another option that we didn’t have when we created Log Lake: S3 Express One Zone. For reliability, we still recommend not putting all your files in one bucket. Also, don’t forget to filter your data.

Customization and control

You can customize and control all components (columns or schema, data types, compression, job logic, job schedule, and so on) because Log Lake is built using AWS primitives—such as Amazon SQS and Amazon S3—for the most comprehensive combination of features with the most freedom to customize. If you want to change something, you can.

From mono to many

Rather than one large, monolithic lake that is tightly coupled to other systems, Log Lake is just one node in a larger network of distributed data products across different data domains—this concept is data mesh. Just like the AWS APIs it is built on, Log Lake abstracts away heavy lifting and enables users to move faster, more efficiently, and not wait for centralized teams to make changes. Log Lake does not try to cover all use cases—instead, Log Lake’s data can be accessed and consumed by domain-specific teams, empowering business experts to self-serve.

When you need more flexibility and freedom

As builders, sometimes you want to dissect a customer experience, find problems, and figure out ways to make it better. That means going a layer down to mix and match primitives together to get more comprehensive features and more customization, flexibility, and freedom.

We built Log Lake for our long-term needs, but it would have been easier in the short-term to save Session Manager logs to Amazon S3 and query them with Athena. If you have considered what already exists in AWS, and you’re sure you need more comprehensive abilities or customization, read on to Part 2: Build, which explains Log Lake’s architecture and how you can set it up.

If you have feedback and questions, let us know in the comments section.


About the authors

Colin Carson is a Data Engineer at AWS ProServe. He has designed and built data infrastructure for multiple teams at Amazon, including Internal Audit, Risk & Compliance, HR Hiring Science, and Security.

Sean O’Sullivan is a Cloud Infrastructure Architect at AWS ProServe. He has over 8 years industry experience working with customers to drive digital transformation projects, helping architect, automate, and engineer solutions in AWS.

Thailand Under the Skin (Part Two)

Post Syndicated from Емине Садкъ original https://www.toest.bg/tayland-pod-kozhata-vtora-chast/

What Thailand left in me this year is the sensation of unbearable heat. It was so hot that even the mosquitoes stayed away. In the humid jungle this country sits in, an absence of mosquitoes means one thing: it is terribly hot. Cultural differences aside, which can make you feel like a foreigner in this part of Southeast Asia, heat like this can catch you unprepared, until your senses grasp the true meaning of the word “outlander.” Culture you can always explain to yourself somehow, observe it, imitate it. Not all of it, of course.

The Thai language and its 36 tonalities are something supernatural for the vocal and auditory apparatus of most of us.

Those who reach for pocket Thai phrasebooks and practice words and phrases on the plane often end up in awkward situations. The most trivial is ordering a penis (khuai) shake instead of a banana (kluai) one. I have this from a friend. Climate, however, is another matter: the body goes through complex processes of adaptation that cannot be imitated. It takes me several weeks to adjust to conditions more extreme than the ones I am used to. This year, over almost two months in Thailand, I found the golden mean of my own functioning only at the end of my stay. I woke up very early in the morning or went out very late in the evening. The rest of the time I lay in the shade in a hammock, with a fan pointed at my face. From there I watched the locals moving their fine, supple bodies under the scorching sun. They did it with enviable ease.

Over those same two months I came to realize what evolutionary processes are and how nature sculpts our bodies so that we survive in its different conditions. How great the diversity of Homo sapiens is. How harsh and aggressive nature is in most places around the world. How much toughness some people need just to survive. And above all,

how easy it is to talk from the point of view of someone who has never suffered from malaria, dengue, Japanese encephalitis, leptospirosis, rabies, and regular food poisoning.

I had read about all this in books and seen it in documentaries, but only in the helplessness I had fallen into did I understand it fully.

I also understood that it is easy to think of nature as something beautiful and enchanting to the senses when it is not your enemy. It is easy when the most venomous snake in your country lives on peaks and rocky slopes that (almost) nobody goes near anymore anyway. What I mean is that in some places in Thailand a king cobra can curl up next to your bed or simply crawl over you while you sleep.

Tourist destinations are, of course, becoming ever safer, but that is highly relative. Monkeys are everywhere. And they are predatory. Restaurants, bars, hotels, and private homes all suffer from monkey raids. From the boat we were touring the Andaman Sea in, they made off with sunglasses, a voice recorder, and several mango pastries. It all happened so fast, and with such terrifying, hissing, menacing sounds from the ringleader of the monkey gang, that none of us protested. We did feel foolish, though, for having fallen for fairy tales and cartoons with cute little monkeys.

Thais use that same naivety to take tourists through mangrove forests in the traditional Thai boats called ruea hang yao (long-tail boats). The boats stop at the part of the forest where the boatmen have their own monkeys, Chad told me (how we met will come up shortly), monkeys known not to attack anyone or steal anything from the tourists. Chad also told me about the elephants. Several times they had trampled his entire village. The elephants could pass straight through all the houses, just like that, several times a year. That is why people in Thailand build in a way that leaves nothing to mourn if nature rises up: roofs of woven palm leaves and thick bamboo columns that raise a bamboo structure with bamboo walls.

Most houses in Thailand are raised; this kind of construction is known as a stilt house. They are raised so that water from torrential rains or tsunamis does not reach the base of the house, and so that people are protected from snakes and other pests. Most buildings in Thailand, and everything in them, are made of renewable natural materials. This comes from another peculiarity of the local flora: with constant moisture and sun everything grows faster, and building materials are available to all. In other words, anyone with a machete can cut some bamboo, gather palm leaves, and put together a stilt house.

One of the few cities where I saw bricks, metal, glass, and concrete poured the way we know it back home was Surat Thani, where we arrived in the morning after sixteen hours on the overnight train from Bangkok. We traveled third class. There were no seats in first or second; everything had sold out two weeks earlier. A third-class ticket cost less than ten leva.

We traveled with sailors, soldiers, villagers, elderly women, and young couples in love who lay embracing on the floor, while the hawkers selling boiled eggs, fried shrimp, peanuts, banana and mango desserts, freshly cut fruit, water, soft drinks, and jungle juice (a lemonade-colored intoxicant made from kratom leaves combined with DEET or a kind of cough syrup) stepped over the lovers without paying them any attention.

But yes, the train has everything. At every station a few people climb aboard or pass goods through the windows, sell a little, and, before the train pulls out, jump off while it is moving or run alongside with their arms still stretched into the carriage. The most impressive were a husband and wife so well organized that during a stop of a few minutes they sold fifteen liters of soup, kept in a tall cooler bag that the woman dragged along the floor with a long terry towel tied at one end to the bag’s handle and at the other around her waist. The two front pockets of her apron held a ladle and plastic bowls. Ahead of her walked her husband, who drummed up customers, pointed out those who wanted soup, took their money, and shot forward like an arrow, while she ladled and handed out bowls at the same pace.

Surat Thani turned out to be a not especially interesting destination, almost ugly with its prefabricated concrete blocks. To be frank, although Surat Thani has no beach and no comparable terrain, it is somehow reminiscent of the famous island of Phuket, which in recent years, especially since the start of the war in Ukraine, has been bought up by wealthy Russians fleeing repression and has accordingly turned into something like Sunny Beach or Montana: tons of concrete on the sand and prefab ghettos.

Chad told me that what was happening in Phuket was like punishing nature, and that nature would get its own back; it was simply still being patient, but things were getting out of control. Nature will not take revenge, it will clean up its mistakes, I tell Chad, and he answers that if the Thai government is doing nothing, a lot of money must be involved. In any case, there is not enough money to buy the laws of Thailand. The country has protected its territory and secured its population by limiting foreigners’ ownership of land or real estate to 49%. This means that

if you want to buy property in Thailand, you need to find a Thai co-owner or company that will always hold 51% of the property, so the property almost always remains in Thai hands.

Another thing Thailand manages to preserve (Phuket aside) is its beaches. If you get a SIM card from a local operator, every time you set foot on any beach you will receive an automatic warning (SMS) that smoking is strictly prohibited and that there are guards who can fine you if you light a cigarette. In addition, beyond a certain level of visitor load, beaches in Thailand close to tourists so that their ecosystems can recover. That includes the popular Maya beach, made famous by the film “The Beach.”

After Surat Thani we arrived on the island of Koh Lanta, where many Scandinavians spend their holidays. They used to be the island’s main guests. Koh Lanta’s old restaurants can be recognized by a menu that includes a “Swedish breakfast”: coffee and cigarettes. I ordered it once, and they really did bring me a coffee and a cigarette.

Koh Lanta is a Muslim island with an old Chinese quarter, and sea gypsies live at the far end of the island. At the beginning of 2023 Koh Lanta became my home. That is why we returned in 2024. It was there that I met Chad, owner of Joker Bar, a Thai rocker who listens to reggae.

Half of Chad’s bar is taken up by a ping-pong table. That is how we became friends: we played table tennis every evening. I would beat him, he would get angry, but he always bought me a beer. The next evening he would be waiting for me, his small frame leaning out into the road, looking for my motorbike. And every time he would convince himself that he was going to beat me: “Today Chad win! Today Chad win!”

In the last evenings of my stay Chad always wins. He prevails by two or three points. He rips open his shirt and starts screaming with joy, then enthusiastically challenges all sorts of burly Scandinavian men to arm wrestling. They beat him, but Chad does not notice, because the table tennis victory is bigger than any defeat that follows.

Sometimes I am sure that Chad will tell his grandchildren he beat the world table tennis champion from Bulgaria. And that makes me proud of us both.

(To be continued.)

All photographs in the article are the property of the author.

On Second Reading: “The East”

Post Syndicated from Стефан Иванов original https://www.toest.bg/na-vtoro-chetene-iztokut/

“The East” by Andrzej Stasiuk

translated from the Polish by Milena Mileva, Sofia: Paradox, 2022

I first heard of Stasiuk around the release of “On the Road to Babadag” 14 years ago, from Silvia Choleva, who recommended him to me enthusiastically. Both that book and this one are worth reading. Again and again he surprises by talking about supposedly familiar things. “The East” is a journey to the heart of a metaphor: that of the East. Since childhood, even before I heard “Go West,” the Pet Shop Boys song, I have been hearing about the East and the West. So has Stasiuk.

For many years now he has been withdrawn, much like Boris Hristov, living not in Warsaw but in a Carpathian village. In a conversation in Sofia in 2013 he said:

I saw this landscape, the mountain silhouettes, and I told myself: this is what I need. The freedom, the wildness around me. That is my character; I am close to nature. I do not feel good in the city; it is not my environment. I love sleeping in a sleeping bag on the veranda, I love not washing for a few days, and most of all I love the silence. I cannot live without silence. I do not work in the house; I have a small shed where I write. I need complete silence. From there you can see the Carpathians, a magnificent landscape. And it is quiet; I can hear a little bird flit past the window. The city cannot give me that.

The context in which this and his other books were written matters. Stasiuk does not tour writers’ residencies; he is permanently in one. And you can feel that spaciousness in his writing. Even when his travelogue about the sources of the East, through Russia, China, and Mongolia, brings him face to face with traumas of the past and present,

his writing is liberating; there is no perverse and populist nostalgia in it, no vain sorrow or condescension.

There is only a desire to roam geography with eyes wide open and the sluices raised, so that memory can flow and water the garden, or the forest, that this book really is. A forest of memory, in which the eastern countries are in love with their national catastrophes rather than with freedom or justice, and a garden of today’s insecurity, in which the borders have fallen but the weed of nationalism is in bloom.

A book about the lessons learned from Platonov, whose novella “The Foundation Pit” is Stasiuk’s bedside reading, along with the writings of Bruno Schulz, as he moves through himself and through the world, whose home he widens. Europe is too narrow for him, but not so the wound that the fall of communism closed and opened at the same time. As I read him, passages from Kapka Kassabova’s books rise before my eyes: about borders, mistakes, and conscience.

Stasiuk blends memories of his childhood with the absence, back then, of any conversation about ideology, because it is hollow and no one believes in it, while that same “no one”, in fact entire nations, tries to adapt to it in human or inhuman ways. He passes through “the nothingness of this strange empire, which conquered the wasteland only to leave nothingness behind”, in order to imagine what it was like for his parents and their generation to leave the village, go to the city, and settle in emptied homes whose owners had died in the concentration camps. What it was like for his grandfather, as a village mayor, to also be a secret priest.

In his prose, which lacks a jeweler’s precision but is full of immediacy and spontaneity, one can learn more about oneself and one’s country, about Bulgaria (without a single word being said about it directly), than from several specialized academic studies. Stasiuk travels in depth, not by way of showy phrases and aphorisms.

At its core this is political writing, and on its horizon is not an abstract idea but a very concrete wish that people’s lives not be turned into a monotonous and dull prison.

A life free of both the horror of the Gulag and the horror of the shopping malls. A life that is simple and natural, however naive that may sound:

… I was always on the side of the people, even though I knew that the people always fall victim. So at least in my thoughts I allowed them to win. So that they could live on an island or a ship and the powerful of this world would have no access to them. So that they could live as they wished. So that no one would force them into either obedience or freedom. That was my utopia.

Stasiuk is not a man of the study, not a university lecturer; there is seriousness and weight in what he writes, because it genuinely needs to be written. A necessity, both ethical and aesthetic. He stands behind his words. He is a pacifist, and in the late 1980s he spent a year and a half in prison for desertion. His first book is about his time there.

On Second Reading: “The East”

More than two years ago, when Russia’s war with Ukraine began, Stasiuk said:

Ukraine begins here. He sits in his warm kitchen in a village in southeastern Poland and looks out the window. Beyond the snow and the forest, two and a half hours away by car, is the Polish-Ukrainian border.

And Stasiuk is traveling again. This time to the border. He carries medicine and food, and he comes back with people, with women and children.

The hardest thing is watching them say goodbye. Watching the men head back into the emptiness, along the potholed asphalt.

To him it was already clear ten years ago that this catastrophe was coming. Together with dozens of writers and intellectuals, he signed a letter of warning about the future. So that the West would not, yet again, look on indifferently at yet another atrocity in the East, as has happened before.

Indifference and prejudice are not the privilege of the West alone. Andrzej Stasiuk told Der Spiegel eight years ago:

Many Poles see Ukrainians only as laborers and cleaning women… Ukrainians used to be described negatively, with words like “peasants” and “primitive”; now those associations are disappearing. Ukrainians are seen as brothers, brave fighters, people with dignity.

Stasiuk’s great and remarkable achievement, at least for me, lies in the synthesis of a risky, bold literary sweep and scale with human and writerly honesty. In the refusal of an easy, light sentimentality that would hint at the dimensions of the tragedy and its consequences yet shy away from looking an extremely complex reality in the eye.

For Stasiuk it matters that the face of the world he creates with his words is not covered in makeup and has not been subjected to assorted procedures. In that face there is fear and courage, dignity and baseness, joy and grief, death and childhood, oblivion and the lightning flash of sudden, detailed remembering. This is not masterly or swot-like writing, it is not exemplary or perfect, but it is alive and shamelessly scoops from life with full hands. Before Tokarczuk’s Nobel Prize, Stasiuk was hailed as Poland’s most popular and most valued writer. In my view that remains entirely apt even after the award. I dream that the example of both authors will prove truly contagious for Bulgarian literature.


Active donors to Toest receive a standing discount of 20% off the cover price of all titles in the Paradox Publishing House catalog, as well as those of several other Bulgarian publishers, under the Toest Readers’ Club partner program. For more information, see toest.bg/club.

None of us reads only the newest books. So why is it only those that get written about? “On Second Reading” is a column in which we open up the lists of books published at least a year ago, read them, and recommend our favorites among them. The column is part of the Toest Readers’ Club partner program. The choice of titles, however, belongs solely to the authors, Stefan Ivanov and Antonia Apostolova, who would recommend these books to you just the same if they could take a walk through the bookstore with you once every two weeks.

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

Post Syndicated from Chanpreet Singh original https://aws.amazon.com/blogs/big-data/unlock-scalability-cost-efficiency-and-faster-insights-with-large-scale-data-migration-to-amazon-redshift/

Large-scale data warehouse migration to the cloud is a complex and challenging endeavor that many organizations undertake to modernize their data infrastructure, enhance data management capabilities, and unlock new business opportunities. As data volumes continue to grow exponentially, traditional data warehousing solutions may struggle to keep up with the increasing demands for scalability, performance, and advanced analytics.

Migrating to Amazon Redshift offers organizations the potential for improved price-performance, enhanced data processing, faster query response times, and better integration with technologies such as machine learning (ML) and artificial intelligence (AI). However, you might face significant challenges when planning for a large-scale data warehouse migration. These challenges can range from ensuring data quality and integrity during the migration process to addressing technical complexities related to data transformation, schema mapping, performance, and compatibility issues between the source and target data warehouses. Additionally, organizations must carefully consider factors such as cost implications, security and compliance requirements, change management processes, and the potential disruption to existing business operations during the migration. Effective planning, thorough risk assessment, and a well-designed migration strategy are crucial to mitigating these challenges and implementing a successful transition to the new data warehouse environment on Amazon Redshift.

In this post, we discuss best practices for assessing, planning, and implementing a large-scale data warehouse migration into Amazon Redshift.

Success criteria for large-scale migration

The following diagram illustrates a scalable migration pattern for an extract, load, and transform (ELT) scenario using Amazon Redshift data sharing patterns.

The following diagram illustrates a scalable migration pattern for an extract, transform, and load (ETL) scenario.

Migration pattern extract, transform, and load (ETL) scenarios

Alignment on success criteria by all stakeholders (producers, consumers, operators, auditors) is key for a successful transition to a new Amazon Redshift modern data architecture. The success criteria are the key performance indicators (KPIs) for each component of the data workflow. This includes the ETL processes that capture source data, the functional refinement and creation of data products, the aggregation for business metrics, and the consumption from analytics, business intelligence (BI), and ML.

KPIs make sure you can track and audit optimal implementation, achieve consumer satisfaction and trust, and minimize disruptions during the final transition. They measure workload trends, cost usage, data flow throughput, consumer data rendering, and real-life performance. This makes sure the new data platform can meet current and future business goals.

Migration from a large-scale mission-critical monolithic legacy data warehouse (such as Oracle, Netezza, Teradata, or Greenplum) is typically planned and implemented over 6–16 months, depending on the complexity of the existing implementation. The monolithic data warehouse environments that have been built over the last 30 years contain proprietary business logic and multiple data design patterns, including an operational data store, star or snowflake schema, dimensions and facts, data warehouses and data marts, online transaction processing (OLTP) real-time dashboards, and online analytic processing (OLAP) cubes with multi-dimensional analytics. The data warehouse is highly business critical with minimal allowable downtime. If your data warehouse platform has gone through multiple enhancements over the years, your operational service levels documentation may not be current with the latest operational metrics and desired SLAs for each tenant (such as business unit, data domain, or organization group).

As part of the success criteria for operational service levels, you need to document the expected service levels for the new Amazon Redshift data warehouse environment. This includes the expected response time limits for dashboard queries or analytical queries, elapsed runtime for daily ETL jobs, desired elapsed time for data sharing with consumers, total number of tenants with concurrency of loads and reports, and mission-critical reports for executives or factory operations.

As part of your modern data architecture transition strategy, the migration goal of a new Amazon Redshift based platform is to use the scalability, performance, cost-optimization, and additional lake house capabilities of Amazon Redshift, thereby improving the existing data consumption experience. Depending on your enterprise’s culture and goals, your migration pattern of a legacy multi-tenant data platform to Amazon Redshift could use one of the following strategies:

A majority of organizations opt for the organic strategy (lift and shift) when migrating their large data platforms to Amazon Redshift. This approach uses AWS migration tools such as the AWS Schema Conversion Tool (AWS SCT) or the managed service version DMS Schema Conversion to rapidly meet goals around data center exit, cloud adoption, reducing legacy licensing costs, and replacing legacy platforms.

By establishing clear success criteria and monitoring KPIs, you can implement a smooth migration to Amazon Redshift that meets performance and operational goals. Thoughtful planning and optimization are crucial, including optimizing your Amazon Redshift configuration and workload management, addressing concurrency needs, implementing scalability, tuning performance for large result sets, minimizing schema locking, and optimizing join strategies. This will enable right-sizing the Redshift data warehouse to meet workload demands cost-effectively. Thorough testing and performance optimization will facilitate a smooth transition with minimal disruption to end-users, fostering exceptional user experiences and satisfaction. A successful migration can be accomplished through proactive planning, continuous monitoring, and performance fine-tuning, thereby aligning with and delivering on business objectives.

Migration involves the following phases, which we delve into in the subsequent sections:

  • Assessment
    • Discovery of workload and integrations
    • Dependency analysis
    • Effort estimation
    • Team sizing
    • Strategic wave planning
  • Functional and performance
    • Code conversion
    • Data validation
  • Measure and benchmark KPIs
    • Platform-level KPIs
    • Tenant-level KPIs
    • Consumer-level KPIs
    • Sample SQL
  • Monitoring Amazon Redshift performance and continual optimization
    • Identify top offending queries
    • Optimization strategies

To achieve a successful Amazon Redshift migration, it’s important to address these infrastructure, security, and deployment considerations simultaneously, thereby implementing a smooth and secure transition.

Assessment

In this section, we discuss the steps you can take in the assessment phase.

Discovery of workload and integrations

Conducting discovery and assessment for migrating a large on-premises data warehouse to Amazon Redshift is a critical step in the migration process. This phase helps identify potential challenges, assess the complexity of the migration, and gather the necessary information to plan and implement the migration effectively. You can use the following steps:

  • Data profiling and assessment – This involves analyzing the schema, data types, table sizes, and dependencies. Special attention should be given to complex data types such as arrays, JSON, or custom data types and custom user-defined functions (UDFs), because they may require specific handling during the migration process. Additionally, it’s essential to assess the volume of data and daily incremental data to be migrated, and estimate the required storage capacity in Amazon Redshift. Furthermore, analyzing the existing workload patterns, queries, and performance characteristics provides valuable insights into the resource requirements needed to optimize the performance of the migrated data warehouse in Amazon Redshift.
  • Code and query assessment – It’s crucial to assess the compatibility of existing SQL code, including queries, stored procedures, and functions. The AWS SCT can help identify any unsupported features, syntax, or functions that need to be rewritten or replaced to achieve a seamless integration with Amazon Redshift. Additionally, it’s essential to evaluate the complexity of the existing processes and determine if they require redesigning or optimization to align with Amazon Redshift best practices.
  • Performance and scalability assessment – This includes identifying performance bottlenecks, concurrency issues, or resource constraints that may be hindering optimal performance. This analysis helps determine the need for performance tuning or workload management techniques that may be required to achieve optimal performance and scalability in the Amazon Redshift environment.
  • Application integrations and mapping – Embarking on a data warehouse migration to a new platform necessitates a comprehensive understanding of the existing technology stack and business processes intertwined with the legacy data warehouse. Consider the following:
    • Meticulously document all ETL processes, BI tools, and scheduling mechanisms employed in conjunction with the current data warehouse. This includes commercial tools, custom scripts, and any APIs or connectors interfacing with source systems.
    • Take note of any custom code, frameworks, or mechanisms utilized in the legacy data warehouse for tasks such as managing slowly changing dimensions (SCDs), generating surrogate keys, implementing business logic, and other specialized functionalities. These components may require redevelopment or adaptation to operate seamlessly on the new platform.
    • Identify all upstream and downstream applications, as well as business processes that rely on the data warehouse. Map out their specific dependencies on database objects, tables, views, and other components. Trace the flow of data from its origins in the source systems, through the data warehouse, and ultimately to its consumption by reporting, analytics, and other downstream processes.
  • Security and access control assessment – This includes reviewing the existing security model, including user roles, permissions, access controls, data retention policies, and any compliance requirements and industry regulations that need to be adhered to.

Dependency analysis

Understanding dependencies between objects is crucial for a successful migration. You can use system catalog views and custom queries on your on-premises data warehouses to create a comprehensive object dependency report. This report shows how tables, views, and stored procedures rely on each other. This also involves analyzing indirect dependencies (for example, a view built on top of another view, which in turn uses a set of tables), and having a complete understanding of data usage patterns.
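
As a rough illustration of the indirect-dependency analysis described above, the following Python sketch expands a list of direct dependencies into a full dependency report. The object names and the `(dependent, dependency)` input shape are assumptions made for the example; in practice the pairs would come from whatever catalog views your source data warehouse exposes.

```python
from collections import defaultdict


def transitive_dependencies(edges):
    """Map each object to everything it depends on, directly or indirectly.

    `edges` is an iterable of (dependent, dependency) pairs, for example
    rows exported from a catalog dependency query on the source system.
    """
    direct = defaultdict(set)
    for dependent, dependency in edges:
        direct[dependent].add(dependency)
        direct.setdefault(dependency, set())  # leaf objects get an empty entry
    direct = dict(direct)

    closure = {}
    for obj in direct:
        seen = set()
        stack = list(direct[obj])
        while stack:
            current = stack.pop()
            if current in seen:  # also protects against circular references
                continue
            seen.add(current)
            stack.extend(direct[current])
        closure[obj] = seen
    return closure


# Hypothetical sample: a view built on another view, which uses two tables.
edges = [
    ("v_sales_summary", "v_sales_clean"),
    ("v_sales_clean", "sales"),
    ("v_sales_clean", "customers"),
    ("sp_refresh_sales", "v_sales_summary"),
]
report = transitive_dependencies(edges)
for obj in sorted(report):
    print(obj, "->", sorted(report[obj]))
```

Here the stored procedure is reported as depending on both views and both base tables, even though it references only one view directly.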

Effort estimation

The discovery phase serves as your compass for estimating the migration effort. You can translate those insights into a clear roadmap as follows:

  • Object classification and complexity assessment – Based on the discovery findings, categorize objects (tables, views, stored procedures, and so on) based on their complexity. Simple tables with minimal dependencies will require less effort to migrate than intricate views or stored procedures with complex logic.
  • Migration tools – Use the AWS SCT to estimate the base migration effort per object type. The AWS SCT can automate schema conversion, data type mapping, and function conversion, reducing manual effort.
  • Additional considerations – Factor in additional tasks beyond schema conversion. This may include data cleansing, schema optimization for Amazon Redshift performance, unit testing of migrated objects, and migration script development for complex procedures. The discovery phase sheds light on potential schema complexities, allowing you to accurately estimate the effort required for these tasks.
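
The classification step above can be sketched in a few lines of Python. Everything numeric here is a placeholder: the thresholds in `classify` and the person-day weights in `EFFORT_DAYS` are hypothetical and would need to be calibrated against the AWS SCT assessment output and your own pilot conversions.

```python
from collections import Counter

# Hypothetical effort per object, in person-days. Not taken from AWS guidance.
EFFORT_DAYS = {"simple": 0.5, "medium": 2.0, "complex": 5.0}


def classify(obj):
    """Assign a complexity bucket using illustrative thresholds."""
    if obj["type"] == "table" and obj["dependencies"] <= 2:
        return "simple"
    if obj["type"] == "stored_procedure" and obj["lines_of_code"] > 200:
        return "complex"
    if obj["dependencies"] > 5:
        return "complex"
    return "medium"


def estimate_effort(objects):
    """Return the count of objects per bucket and the total person-days."""
    counts = Counter(classify(obj) for obj in objects)
    total_days = sum(EFFORT_DAYS[bucket] * n for bucket, n in counts.items())
    return counts, total_days


# Hypothetical inventory gathered during discovery.
inventory = [
    {"name": "customers", "type": "table", "dependencies": 0, "lines_of_code": 0},
    {"name": "v_sales_summary", "type": "view", "dependencies": 3, "lines_of_code": 40},
    {"name": "sp_refresh_sales", "type": "stored_procedure", "dependencies": 4, "lines_of_code": 650},
]
counts, total_days = estimate_effort(inventory)
print(dict(counts), total_days)
```

Tasks such as data cleansing and unit testing are not modeled here; one option is to add them as a percentage uplift on top of the conversion total.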

Team sizing

With a clear picture of the effort estimate, you can now size the team for the migration.

Person-months calculation

Divide the total estimated effort (in person-months) by the desired project duration (in months) to determine the number of people required. This provides a high-level understanding of the team size needed.

For example, for an ELT migration project from an on-premises data warehouse to Amazon Redshift to be completed within 6 months, we estimate the team requirements based on the number of schemas or tenants (for example, 30), number of database tables (for example, 5,000), average migration estimate for a schema (for example, 4 weeks based on complexity of stored procedures, tables and views, platform-specific routines, and materialized views), and number of business functions (for example, 2,000 segmented by simple, medium, and complex patterns). We can determine the following are needed:

  • Migration time period (80% migration/20% for validation and transition) = 0.8 × 6 months = 4.8 months, or roughly 5 months (22 weeks)
  • Dedicated teams = Number of tenants × (average migration period for a tenant) / (migration time period) = 30 × 1 month / 5 months = 6 teams
  • Migration team structure:
    • One to three data developers with stored procedure conversion expertise per team, performing over 25 conversions per week
    • One data validation engineer per team, testing over 50 objects per week
    • One to two data visualization experts per team, confirming consumer downstream applications are accurate and performant
  • A common shared DBA team with performance tuning expertise responding to standardization and challenges
  • A platform architecture team (3–5 individuals) focused on platform design, service levels, availability, operational standards, cost, observability, scalability, performance, and design pattern issue resolutions
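
The sizing arithmetic in this example can be reproduced with a short Python sketch. It uses the 0.8 migration share, 30 tenants, and 4 weeks per tenant from the example; the function name and the 4.33 weeks-per-month conversion are assumptions. Because it works with unrounded values, the intermediate figures differ slightly from the rounded ones in the text (4.8 months rather than 5).

```python
import math


def size_migration(project_months, migration_share, tenants,
                   weeks_per_tenant, weeks_per_month=4.33):
    """Estimate the migration window and the number of dedicated teams."""
    migration_months = project_months * migration_share
    migration_weeks = migration_months * weeks_per_month
    # Each team migrates tenants one after another, so within the window
    # it can complete this many tenants.
    tenants_per_team = migration_weeks / weeks_per_tenant
    teams = math.ceil(tenants / tenants_per_team)
    return migration_months, migration_weeks, teams


months, weeks, teams = size_migration(
    project_months=6, migration_share=0.8, tenants=30, weeks_per_tenant=4
)
print(f"{months:.1f} months, {weeks:.1f} weeks, {teams} teams")
```

Rounding the team count up rather than down keeps the plan inside the migration window when the division is not exact.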

Team composition expertise

Based on the skillsets required for various migration tasks, we assemble a team with the right expertise. Platform architects define a well-architected platform. Data engineers are crucial for schema conversion and data transformation, and DBAs can handle cluster configuration and workload monitoring. An engagement or project management team makes sure the project runs smoothly, on time, and within budget.

For example, for an ETL migration project from Informatica/Greenplum to a target Redshift lakehouse with an Amazon Simple Storage Service (Amazon S3) data lake to be completed within 12 months, we estimate the team requirements based on the number of schemas and tenants (for example, 50 schemas), number of database tables (for example, 10,000), average migration estimate for a schema (6 weeks based on complexity of database objects), and number of business functions (for example, 5,000 segmented by simple, medium, and complex patterns). We can determine the following are needed:

  • An open data format ingestion architecture processing the source dataset and refining the data in the S3 data lake. This requires a dedicated team of 3–7 members building a serverless data lake for all data sources. Ingestion migration implementation is segmented by tenants and type of ingestion patterns, such as internal database change data capture (CDC); data streaming, clickstream, and Internet of Things (IoT); public dataset capture; partner data transfer; and file ingestion patterns.
  • The migration team composition is tailored to the needs of a project wave. Depending on each migration wave and what is being done in the wave (development, testing, or performance tuning), the right people will be engaged. When the wave is complete, the people from that wave will move to another wave.
  • A loading team builds a producer-consumer architecture in Amazon Redshift to process concurrent near real-time publishing of data. This requires a dedicated team of 3–7 members building and publishing refined datasets in Amazon Redshift.
  • A shared DBA group of 3–5 individuals helping with schema standardization, migration challenges, and performance optimization outside the automated conversion.
  • Data transformation experts to convert database stored functions in the producer or consumer.
  • A migration sprint plan for 10 months with 2-week sprints and multiple waves to release tenants to the new architecture.
  • A validation team to confirm a reliable and complete migration.
  • One to two data visualization experts per team, confirming that consumer downstream applications are accurate and performant.
  • A platform architecture team (3–5 individuals) focused on platform design, service levels, availability, operational standards, cost, observability, scalability, performance, and design pattern issue resolutions.

Strategic wave planning

Migration waves can be determined as follows:

  • Dependency-based wave delineation – Objects can be grouped into migration waves based on their dependency relationships. Objects with no or minimal dependencies will be prioritized for earlier waves, whereas those with complex dependencies will be migrated in subsequent waves. This provides a smooth and sequential migration process.
  • Logical schema and business area alignment – You can further revise migration waves by considering logical schema and business areas. This allows you to migrate related data objects together, minimizing disruption to specific business functions.
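
Dependency-based wave delineation amounts to layering a dependency graph. The following sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the object names are hypothetical, and the grouping by business area from the second point is not modeled.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies):
    """Group objects into waves so that each wave depends only on earlier ones.

    `dependencies` maps each object to the set of objects it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical objects: tables first, then views, then a stored procedure.
dependencies = {
    "v_sales_clean": {"sales", "customers"},
    "v_sales_summary": {"v_sales_clean"},
    "sp_refresh_sales": {"v_sales_summary"},
    "products": set(),
}
waves = plan_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"Wave {number}: {wave}")
```

Objects with no dependencies land in the first wave, and a circular dependency surfaces as an error that has to be resolved manually before planning can continue.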

Functional and performance

In this section, we discuss the steps for refactoring the legacy SQL codebase to use Redshift SQL best practices, building validation routines to verify accuracy and completeness during the transition to Amazon Redshift, capturing KPIs to confirm similar or better service levels for consumption tools and downstream applications, and incorporating performance hooks and procedures for a scalable and performant Redshift platform.

Code conversion

We recommend using the AWS SCT as the first step in the code conversion journey. The AWS SCT is a powerful tool that can streamline the database schema and code migrations to Amazon Redshift. With its intuitive interface and automated conversion capabilities, the AWS SCT can significantly reduce the manual effort required during the migration process. Refer to Converting data warehouse schemas to Amazon Redshift using AWS SCT for instructions to convert your database schema, including tables, views, functions, and stored procedures, to Amazon Redshift format. For an Oracle source, you can also use the managed service version DMS Schema Conversion.

When the conversion is complete, the AWS SCT generates a detailed conversion report. This report highlights any potential issues, incompatibilities, or areas requiring manual intervention. Although the AWS SCT automates a significant portion of the conversion process, manual review and modifications are often necessary to address various complexities and optimizations.

Some common cases where manual review and modifications are typically required include:

  • Incompatible data types – The AWS SCT may not always handle custom or non-standard data types, requiring manual intervention to map them to compatible Amazon Redshift data types.
  • Database-specific SQL extensions or proprietary functions – If the source database uses SQL extensions or proprietary functions specific to the database vendor (for example, STRING_AGG() or ARRAY_UPPER functions, or custom UDFs for PostgreSQL), these may need to be manually rewritten or replaced with equivalent Amazon Redshift functions or UDFs. The AWS SCT extension pack is an add-on module that emulates functions present in a source database that are required when converting objects to the target database.
  • Performance optimization – Although the AWS SCT can convert the schema and code, manual optimization is often necessary to take advantage of the features and capabilities of Amazon Redshift. This may include adjusting distribution and sort keys, converting row-by-row operations to set-based operations, optimizing query plans, and other performance tuning techniques specific to Amazon Redshift.
  • Stored procedures and code conversion – The AWS SCT offers comprehensive capabilities to seamlessly migrate stored procedures and other code objects across platforms. Although its automated conversion process efficiently handles the majority of cases, certain intricate scenarios may necessitate manual intervention due to the complexity of the code and utilization of database-specific features or extensions. To achieve optimal compatibility and accuracy, it’s advisable to undertake testing and validation procedures during the migration process.

After you address the issues identified during the manual review process, it’s crucial to thoroughly test the converted stored procedures, as well as other database objects and code, such as views, functions, and SQL extensions, in a non-production Redshift cluster before deploying them in the production environment. This exercise is mostly undertaken by QA teams. This phase also involves conducting holistic performance testing (individual queries, batch loads, consumption reports and dashboards in BI tools, data mining applications, ML algorithms, and other relevant use cases) in addition to functional testing to make sure the converted code meets the required performance expectations. The performance tests should simulate production-like workloads and data volumes to validate the performance under realistic conditions.

Data validation

When migrating data from an on-premises data warehouse to a Redshift cluster on AWS, data validation is a crucial step to confirm the integrity and accuracy of the migrated data. There are several approaches you can consider:

  • Custom scripts – Use scripting languages like Python, SQL, or Bash to develop custom data validation scripts tailored to your specific data validation requirements. These scripts can connect to both the source and target databases, extract data, perform comparisons, and generate reports.
  • Open source tools – Use open source data validation tools like Amazon Deequ or Great Expectations. These tools provide frameworks and utilities for defining data quality rules, validating data, and generating reports.
  • AWS native or commercial tools – Use AWS native tools such as AWS Glue Data Quality or commercial data validation tools like Collibra Data Quality. These tools often provide comprehensive features, user-friendly interfaces, and dedicated support.

The following are different types of validation checks to consider:

  • Structural comparisons – Compare the list of columns and data types of columns between the source and target (Amazon Redshift). Any mismatches should be flagged.
  • Row count validation – Compare the row counts of each core table in the source data warehouse with the corresponding table in the target Redshift cluster. This is the most basic validation step to make sure no data has been lost or duplicated during the migration process.
  • Column-level validation – Validate individual columns by comparing column-level statistics (min, max, count, sum, average) for each column between the source and target databases. This can help identify any discrepancies in data values or data types.
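
The three checks above can be combined into one small custom script. The sketch below is illustrative only: it uses two in-memory SQLite databases as stand-ins for the source warehouse and the Redshift target so that it runs anywhere, and the `orders` table is hypothetical. The `PRAGMA table_info` lookup is SQLite-specific, so against real systems you would replace it with the catalog queries of each engine.

```python
import sqlite3


def table_columns(conn, table):
    """Return (column name, declared type) pairs. SQLite-specific lookup."""
    return [(row[1], row[2].upper()) for row in conn.execute(f"PRAGMA table_info({table})")]


def row_count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def column_stats(conn, table, column):
    """Return min, max, count, sum, and average for one column."""
    sql = (f"SELECT MIN({column}), MAX({column}), COUNT({column}), "
           f"SUM({column}), AVG({column}) FROM {table}")
    return conn.execute(sql).fetchone()


def validate(source, target, table, numeric_columns):
    """Run structural, row count, and column-level checks; return found issues."""
    issues = []
    if table_columns(source, table) != table_columns(target, table):
        issues.append("structure mismatch")
    if row_count(source, table) != row_count(target, table):
        issues.append("row count mismatch")
    for column in numeric_columns:
        if column_stats(source, table, column) != column_stats(target, table, column):
            issues.append(f"column stats mismatch: {column}")
    return issues


def make_db(rows):
    """Build a stand-in database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


source = make_db([(1, 10.0), (2, 20.5), (3, 30.0)])
good_target = make_db([(1, 10.0), (2, 20.5), (3, 30.0)])
bad_target = make_db([(1, 10.0), (2, 20.5)])  # one row lost in migration

print(validate(source, good_target, "orders", ["amount"]))
print(validate(source, bad_target, "orders", ["amount"]))
```

Table and column names are interpolated directly into the SQL, so they should come from a trusted inventory rather than user input. When comparing two different engines, exact equality on sums and averages of floating-point columns may be too strict, and a small tolerance is a reasonable substitute.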

You can also consider the following validation strategies:

  • Data profiling – Perform data profiling on the source and target databases to understand the data characteristics, identify outliers, and detect potential data quality issues. For example, you can use the data profiling capabilities of AWS Glue Data Quality or Amazon Deequ.
  • Reconciliation reports – Produce detailed validation reports that highlight errors, mismatches, and data quality issues. Consider generating reports in various formats (CSV, JSON, HTML) for straightforward consumption and integration with monitoring tools.
  • Automate the validation process – Integrate the validation logic into your data migration or ETL pipelines using scheduling tools or workflow orchestrators like Apache Airflow or AWS Step Functions.
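
As one possible shape for a reconciliation report, the following sketch marks each check as passed or failed and serializes the result to JSON and CSV using only the Python standard library. The field names and sample values are assumptions made for the example.

```python
import csv
import io
import json

FIELDS = ["table", "check", "source", "target", "status"]


def reconcile(checks):
    """Add a PASS/FAIL status to each check by comparing source and target."""
    return [
        dict(check, status="PASS" if check["source"] == check["target"] else "FAIL")
        for check in checks
    ]


def to_json(rows):
    return json.dumps(rows, indent=2)


def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical results collected by earlier validation steps.
checks = [
    {"table": "orders", "check": "row_count", "source": 3, "target": 3},
    {"table": "orders", "check": "sum(amount)", "source": 60.5, "target": 30.5},
]
rows = reconcile(checks)
failures = [row for row in rows if row["status"] == "FAIL"]

print(to_csv(rows))
print(f"{len(failures)} of {len(rows)} checks failed")
```

A scheduled task in an orchestrator could run these functions after each load and alert when `failures` is not empty.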

Lastly, keep in mind the following considerations for collaboration and communication:

  • Stakeholder involvement – Involve relevant stakeholders, such as business analysts, data owners, and subject matter experts, throughout the validation process to make sure business requirements and data quality expectations are met.
  • Reporting and sign-off – Establish a clear reporting and sign-off process for the validation results, involving all relevant stakeholders and decision-makers.

Measure and benchmark KPIs

For a multi-tenant Amazon Redshift implementation, KPIs are segmented at the platform level, tenant level, and consumption tools level. KPIs evaluate the operational metrics, cost metrics, and end-user response time metrics. In this section, we discuss the KPIs needed for achieving a successful transition.

Platform-level KPIs

As new tenants are gradually migrated to the platform, it’s imperative to monitor the current state of Amazon Redshift platform-level KPIs. The current state of these KPIs helps the platform team make the necessary scalability modifications (add nodes, add consumer clusters, add producer clusters, or increase concurrency scaling clusters). Amazon Redshift query monitoring rules (QMR) also help govern the overall state of the data platform, providing optimal performance for all tenants by managing outlier workloads.

The following table summarizes the relevant platform-level KPIs.

Component | KPI | Service Level and Success Criteria
ETL | Ingestion data volume | Daily or hourly peak volume in GBps, number of objects, number of threads.
ETL | Ingestion threads | Peak hourly ingestion threads (COPY or INSERT), number of dependencies, KPI segmented by tenants and domains.
ETL | Stored procedure volume | Peak hourly stored procedure invocations segmented by tenants and domains.
ETL | Concurrent load | Peak concurrent load supported by the producer cluster; distribution of ingestion pattern across multiple producer clusters using data sharing.
ETL | Data sharing dependency | Data sharing between producer clusters (objects refreshed, locks per hour, waits per hour).
Workload | Number of queries | Peak hour query volume supported by cluster segmented by short (less than 10 seconds), medium (less than 60 seconds), long (less than 5 minutes), very long (less than 30 minutes), and outlier (more than 30 minutes); segmented by tenant, domain, or sub-domain.
Workload | Number of queries per queue | Peak hour query volume supported by priority automatic WLM queue segmented by short (less than 10 seconds), medium (less than 60 seconds), long (less than 5 minutes), very long (less than 30 minutes), and outlier (more than 30 minutes); segmented by tenant, business group, domain, or sub-domain.
Workload | Runtime pattern | Total runtime per hour; max, median, and average run pattern; segmented by service class across clusters.
Workload | Wait time patterns | Total wait time per hour; max, median, and average wait pattern for queries waiting.
Performance | Leader node usage | Service level for leader node (recommended less than 80%).
Performance | Compute node CPU usage | Service level for compute node (recommended less than 90%).
Performance | Disk I/O usage per node | Service level for disk I/O per node.
Performance | QMR rules | Number of outlier queries stopped by QMR (large scan, large spilling disk, large runtime); logging thresholds for potential large queries running more than 5 minutes.
Performance | History of WLM queries | Historical trend of queries stored in historical archive table for all instances of queries in STL_WLM_QUERY; trend analysis over 30 days, 60 days, and 90 days to fine-tune the workload across clusters.
Cost | Total cost per month of Amazon Redshift platform | Service level for mix of instances (reserved, on-demand, serverless), cost of Concurrency Scaling, cost of Amazon Redshift Spectrum usage. Use AWS tools like AWS Cost Explorer or daily cost usage report to capture monthly costs for each component.
Cost | Daily Concurrency Scaling usage | Service limits to monitor cost for concurrency scaling; invoke for outlier activity on spikes.
Cost | Daily Amazon Redshift Spectrum usage | Service limits to monitor cost for using Amazon Redshift Spectrum; invoke for outlier activity.
Cost | Redshift Managed Storage usage cost | Track usage of Redshift Managed Storage, monitoring wastage on temporary, archival, and old data assets.
Localization | Remote or on-premises tools | Service level for rendering large datasets to remote destinations.
Localization | Data transfer to remote tools | Data transfer to BI tools or workstations outside the Redshift cluster VPC; separation of datasets to Amazon S3 using the unload feature, avoiding bottlenecks at leader node.
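The query-duration segments listed for the Workload KPIs can be sketched in a few lines of Python. This is only an illustration of the bucketing logic, not part of the post's sample SQL; the function names are hypothetical, and treating a query of exactly 30 minutes as an outlier is an assumption, because the table leaves that boundary unspecified.

```python
from collections import Counter

# Exclusive upper bounds in seconds for each segment named in the table.
SEGMENTS = [
    (10, "short"),
    (60, "medium"),
    (5 * 60, "long"),
    (30 * 60, "very long"),
]


def duration_segment(seconds):
    """Return the segment name for a query runtime given in seconds."""
    for upper_bound, name in SEGMENTS:
        if seconds < upper_bound:
            return name
    # Assumption: anything at or above 30 minutes counts as an outlier.
    return "outlier"


def segment_counts(runtimes_in_seconds):
    """Count how many queries fall into each segment."""
    return Counter(duration_segment(s) for s in runtimes_in_seconds)
```

Running `segment_counts` over the runtimes of one peak hour gives the per-segment query volume that the KPI describes; the same counts can then be grouped by tenant or domain.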

Tenant-level KPIs

Tenant-level KPIs help capture current performance levels from the legacy system and document expected service levels for the data flow from the source capture to end-user consumption. The captured legacy KPIs assist in providing the best target modern Amazon Redshift platform (a single Redshift data warehouse, a lake house with Amazon Redshift Spectrum, and data sharing with the producer and consumer clusters). Cost usage tracking at the tenant level helps you spread the cost of a shared platform across tenants.

The following table summarizes the relevant tenant-level KPIs.

Component | KPI | Service Level and Success Criteria
Cost | Compute usage by tenant | Track usage by tenant, business group, or domain; capture query volume by business unit associating Redshift user identity to internal business unit; data observability by consumer usage for data products helping with cost attribution.
ETL | Orchestration SLA | Service level for daily data availability.
ETL | Runtime | Service level for data loading and transformation.
ETL | Data ingestion volume | Peak expected volume for service level guarantee.
Query consumption | Response time | Response time SLA for query patterns (dashboards, SQL analytics, ML analytics, BI tool caching).
Query consumption | Concurrency | Peak query consumers for tenant.
Query consumption | Query volume | Peak hourly volume service levels and daily query volumes.
Query consumption | Individual query response for critical data consumption | Service level and success criteria for critical workloads.
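As a rough illustration of spreading the cost of a shared platform across tenants, the sketch below splits a monthly bill in proportion to each tenant's compute usage. Proportional allocation is an assumption made for this example (the post does not prescribe a formula), and the function name and inputs are hypothetical.

```python
def allocate_cost(total_cost, usage_by_tenant):
    """Split total_cost across tenants in proportion to their usage.

    usage_by_tenant maps a tenant name to any consistent usage measure,
    for example compute seconds attributed to that tenant's users.
    """
    total_usage = sum(usage_by_tenant.values())
    if total_usage == 0:
        # No recorded usage: nothing to attribute.
        return {tenant: 0.0 for tenant in usage_by_tenant}
    return {
        tenant: total_cost * usage / total_usage
        for tenant, usage in usage_by_tenant.items()
    }
```

For example, `allocate_cost(1000, {"finance": 30, "marketing": 10})` attributes 750.0 to finance and 250.0 to marketing.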

Consumer-level KPIs

A multi-tenant modern data platform can set service levels for a variety of consumer tools. The service levels provide guidance to end-users of the capability of the new deployment.

The following table summarizes the relevant consumer-level KPIs.

Consumer | KPI | Service Level and Success Criteria
BI tools | Large data extraction | Service level for unloading data for caching or query rendering a large result dataset.
Dashboards | Response time | Service level for data refresh.
SQL query tools | Response time | Service level for response time by query type.
SQL query tools | Concurrency | Service level for concurrent query access by all consumers.
One-time analytics | Response time | Service level for large data unloads or aggregation.
ML analytics | Response time | Service level for large data unloads or aggregation.

Sample SQL

The post includes sample SQL to capture daily KPI metrics. The following example KPI dashboard trends assist in capturing historic workload patterns, identifying deviations in workload, and providing guidance on the platform workload capacity to meet the current workload and anticipated growth patterns.

The following figure shows a daily query volume snapshot (queries per day and queued queries per day, which waited a minimum of 5 seconds).

Figure: daily query volume snapshot (queries per day and queued queries per day)

The following figure shows a daily usage KPI. It monitors percentage waits and median wait for waiting queries (identifies the minimal threshold for wait to compute waiting queries and median of all wait times to infer deviation patterns).

Figure: daily usage KPI (percentage waits and median wait for waiting queries)
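The daily usage KPI described above can be approximated with a short Python sketch. It assumes you already have one wait time per query, in seconds, and it reuses the 5-second minimum wait mentioned for queued queries as the default threshold; the function name is hypothetical and this is not the post's sample SQL.

```python
from statistics import median


def wait_kpis(wait_seconds, threshold=5.0):
    """Return (percentage of queries that waited, median wait of those queries).

    A query counts as "waiting" when its wait time is at least `threshold`.
    """
    if not wait_seconds:
        return 0.0, 0.0
    waiting = [w for w in wait_seconds if w >= threshold]
    percent_waiting = 100.0 * len(waiting) / len(wait_seconds)
    median_wait = median(waiting) if waiting else 0.0
    return percent_waiting, median_wait
```

Tracking both numbers per day makes deviations easier to spot: a rising percentage means more queries are queuing, while a rising median means the queries that do queue are waiting longer.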

The following figure illustrates concurrency usage (monitors concurrency compute usage for Concurrency Scaling clusters).

Figure: concurrency compute usage for Concurrency Scaling clusters

The following figure shows a 30-day pattern (computes volume in terms of total runtime and total wait time).

Figure: 30-day pattern of total runtime and total wait time

Monitoring Redshift performance and continual optimization

Amazon Redshift uses automatic table optimization (ATO) to choose the right distribution style, sort keys, and encoding when you create a table with AUTO options. Therefore, it’s a good practice to take advantage of the AUTO feature and create tables with DISTSTYLE AUTO, SORTKEY AUTO, and ENCODE AUTO. When tables are created with AUTO options, Amazon Redshift initially creates tables with optimal keys for the best first-time query performance possible using information such as the primary key and data types. In addition, Amazon Redshift analyzes the data volume and query usage patterns to evolve the distribution strategy and sort keys to optimize performance over time. Finally, Amazon Redshift performs table maintenance activities on your tables that reduce fragmentation and make sure statistics are up to date.

During a large, phased migration, it’s important to monitor and measure Amazon Redshift performance against target KPIs at each phase and implement continual optimization. As new workloads are onboarded at each phase of the migration, it’s recommended to perform regular Redshift cluster reviews and analyze query patterns and performance. Cluster reviews can be done by engaging the Amazon Redshift specialist team through AWS Enterprise support or your AWS account team. The goals of a cluster review include the following:

  • Use cases – Review the application use cases and determine if the design is suitable to solve for those use cases.
  • End-to-end architecture – Assess the current data pipeline architecture (ingestion, transformation, and consumption). For example, determine if too many small inserts are occurring and review the ETL pipeline. Determine if integration with other AWS services can be useful, such as AWS Lake Formation, Amazon Athena, Redshift Spectrum, or Amazon Redshift federation with PostgreSQL and MySQL.
  • Data model design – Review the data model and table design and provide recommendations for sort and distribution keys, keeping in mind best practices.
  • Performance – Review cluster performance metrics. Identify bottlenecks or irregularities and suggest recommendations. Dive deep into specific long-running queries to identify solutions specific to the customer’s workload.
  • Cost optimization – Provide recommendations to reduce costs where possible.
  • New features – Stay up to date with the new features in Amazon Redshift and identify where they can be used to meet these goals.

New workloads can introduce query patterns that could impact performance and miss target SLAs. A number of factors can affect query performance. In the following sections, we discuss aspects impacting query speed and optimizations for improving Redshift cluster performance.

Identify top offending queries

A compute node is partitioned into slices. More nodes means more processors and more slices, which enables you to redistribute the data as needed across the slices. However, more nodes also means greater expense, so you will need to find the balance of cost and performance that is appropriate for your system. For more information on Redshift cluster architecture, see Data warehouse system architecture. Each node type offers different sizes and limits to help you scale your cluster appropriately. The node size determines the storage capacity, memory, CPU, and price of each node in the cluster. For more information on node types, see Amazon Redshift pricing.

Redshift Test Drive is an open source tool that lets you evaluate which different data warehouse configuration options are best suited for your workload. We created Redshift Test Drive from Simple Replay and Amazon Redshift Node Configuration Comparison (see Compare different node types for your workload using Amazon Redshift for more details) to provide a single entry point for finding the best Amazon Redshift configuration for your workload. Redshift Test Drive also provides additional features such as a self-hosted analysis UI and the ability to replicate external objects that a Redshift workload may interact with. With Amazon Redshift Serverless, you can start with a base Redshift Processing Unit (RPU), and Redshift Serverless automatically scales based on your workload needs.

Optimization strategies

If you choose to fine-tune manually, the following are key concepts and considerations:

  • Data distribution – Amazon Redshift stores table data on the compute nodes according to a table’s distribution style. When you run a query, the query optimizer redistributes the data to the compute nodes as needed to perform any joins and aggregations. Choosing the right distribution style for a table helps minimize the impact of the redistribution step by locating the data where it needs to be before the joins are performed. For more information, see Working with data distribution styles.
  • Data sort order – Amazon Redshift stores table data on disk in sorted order according to a table’s sort keys. The query optimizer and query processor use the information about where the data is located to reduce the number of blocks that need to be scanned and thereby improve query speed. For more information, see Working with sort keys.
  • Dataset size – A higher volume of data in the cluster can slow query performance, because more rows need to be scanned and redistributed. You can mitigate this effect by regular vacuuming and archiving of data, and by using a predicate (a condition in the WHERE clause) to restrict the query dataset.
  • Concurrent operations – Amazon Redshift offers a powerful feature called automatic workload management (WLM) with query priorities, which enhances query throughput and overall system performance. By intelligently managing multiple concurrent operations and allocating resources dynamically, automatic WLM makes sure high-priority queries receive the necessary resources promptly, while lower-priority queries are processed efficiently without compromising system stability. This advanced queuing mechanism allows Amazon Redshift to optimize resource utilization, minimizing potential bottlenecks and maximizing query throughput, ultimately delivering a seamless and responsive experience for users running multiple operations simultaneously.
  • Query structure – How your query is written will affect its performance. As much as possible, write queries to process and return as little data as will meet your needs. For more information, see Amazon Redshift best practices for designing queries.
  • Queries with a long return time – Queries with a long return time can impact the processing of other queries and overall performance of the cluster. It’s critical to identify and optimize them. You can optimize these queries by either moving clients to the same network or using the UNLOAD feature of Amazon Redshift and then configuring the client to read the output from Amazon S3. To identify percentile and top running queries, you can download the sample SQL notebook system queries and import it in Query Editor V2.0.
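To make "percentile and top running queries" concrete, here is a small client-side sketch in Python. It is not the sample SQL notebook mentioned above; it assumes you have already exported query runtimes (for example, a mapping of query ID to runtime in seconds), and the function names are hypothetical. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of values at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def top_queries(runtime_by_query, n=3):
    """Return the n longest-running (query_id, runtime) pairs, longest first."""
    ranked = sorted(runtime_by_query.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]
```

Comparing a high percentile (such as the 90th or 99th) against the median shows how heavy the tail of long-running queries is, and `top_queries` points to the individual queries worth optimizing first.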

Conclusion

In this post, we discussed best practices for assessing, planning, and implementing a large-scale data warehouse migration into Amazon Redshift.

The assessment phase of a data migration project is critical for implementing a successful migration. It involves a comprehensive analysis of the existing workload, integrations, and dependencies to accurately estimate the effort required and determine the appropriate team size. Strategic wave planning is crucial for prioritizing and scheduling the migration tasks effectively. Establishing KPIs and benchmarking them helps measure progress and identify areas for improvement. Code conversion and data validation processes validate the integrity of the migrated data and applications. Monitoring Amazon Redshift performance, identifying and optimizing top offending queries, and conducting regular cluster reviews are essential for maintaining optimal performance and addressing any potential issues promptly.

By addressing these key aspects, organizations can seamlessly migrate their data workloads to Amazon Redshift while minimizing disruptions and maximizing the benefits of Amazon Redshift.

We hope this post provides you with valuable guidance. We welcome any thoughts or questions in the comments section.


About the authors

Chanpreet Singh is a Senior Lead Consultant at AWS, specializing in Data Analytics and AI/ML. He has over 17 years of industry experience and is passionate about helping customers build scalable data warehouses and big data solutions. In his spare time, Chanpreet loves to explore nature, read, and spend time with his family.

Harshida Patel is an Analytics Specialist Principal Solutions Architect with AWS.

Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Ram Bhandarkar is a Principal Data Architect at AWS based out of Northern Virginia. He helps customers with planning future Enterprise Data Strategy and assists them with transition to Modern Data Architecture platform on AWS. He has worked with building and migrating databases, data warehouses and data lake solutions for over 25 years.

Vijay Bagur is a Sr. Technical Account Manager. He works with enterprise customers to modernize and cost optimize workloads, improve their security posture, and build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, biking, and traveling.

Proper Address: IPv4 vs. IPv6

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/proper-address-ipv4-vs-ipv6/

A decorative image showing a cloud over performance graphs and charts.

Ah, the 1980s. It brought us such classics as Ghostbusters, The Princess Bride, Tina Turner’s triumphant comeback, Pac-Man, and the original Apple Macintosh. Also, it gave us the birth of the internet, in which we figured out how to make all our computers one giant, powerful network held together initially by internet protocols (IPs) and, eventually, by a mutual love of cat videos.

Now, each of our devices that connects to the internet requires a way to find and send information back and forth, which means it needs an IP address. Most folks don’t type IP addresses into their search bar, though; we use domain names (for example, www.backblaze.com). Which IP addresses correspond to which domain names is stored in a hierarchical and distributed database system known as the domain name system (DNS), which is also an internet protocol.

Today, let’s talk about IP addresses: What are IPv4 and IPv6, why is IPv6 necessary, and what impact will it have on networking?

Let’s set the scene

Any time you’re sending and receiving data, be it a letter in the mail, dialing a phone number, or loading a website, you’ve got to have an identifiable address to reach the proper person and/or device. What all of these types of addresses have in common is that as our population has exploded, we’ve had to re-work how addresses work in order to include more possible data locations. U.S. zip codes were established in 1963. Area codes were established in 1947, and a great expansion was necessary only three(ish) decades later, and that plan was implemented starting in the late 1980s and ending in the mid ’90s.

IP addresses, meanwhile, have been operating on the first and only protocol we introduced in the 1980s, called IPv4. Not only has the world population almost doubled since then, but there has also been a nonlinear explosion in internet-connected devices per person. When IP addresses were first invented, it was unfathomable that most folks would be walking around with a computer in their pocket, remotely checking who’s ringing their doorbells while adjusting their thermostat in anticipation of returning home. All of those internet-connected devices use an IP address, in one way or another. 

So, it’s no surprise that we’re now seeing an adoption of a new IP address standard. In keeping with tradition, the versions aren’t sequential: Right now we’re jumping from IPv4 to IPv6. (What happened to IPv5? It was skipped, sort of.)

What is IPv4?

IPv4 is an internet protocol that assigns addresses to devices. It uses a 32-bit address, represented by four numbers (octets), each between 0 and 255, separated by dots (e.g., 192.168.1.100), and uses decimal notation. 

Remember that each bit represents one of two possible values, a 0 or a 1. So, for a 32-bit value, there are 2^32 possible addresses, or 4,294,967,296 IP addresses total. Several IPv4 address blocks were also reserved for private networks and multicast addresses, about 286 million total. Between the two reserved blocks of addresses, that’s about 7% of the total addresses in existence.
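The arithmetic above is easy to check with Python’s standard ipaddress module. The sketch below assumes that the "private networks" are the three RFC 1918 ranges and that "multicast" is the 224.0.0.0/4 block; those specific blocks are not spelled out in the text, so treat them as an interpretation.

```python
import ipaddress

# Every combination of 32 bits is one possible address.
TOTAL_IPV4 = 2 ** 32

# Assumed reserved blocks: RFC 1918 private ranges plus the multicast range.
RESERVED_BLOCKS = [
    "10.0.0.0/8",
    "172.16.0.0/12",
    "192.168.0.0/16",
    "224.0.0.0/4",
]

reserved = sum(ipaddress.ip_network(block).num_addresses for block in RESERVED_BLOCKS)
reserved_share = 100 * reserved / TOTAL_IPV4

print(f"total addresses:    {TOTAL_IPV4:,}")
print(f"reserved addresses: {reserved:,}")
print(f"reserved share:     {reserved_share:.1f}%")
```

Under those assumptions the reserved count comes to 286,326,784, which lines up with the "about 286 million" and "about 7%" figures.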

What is IPv6?

IPv6 uses a 128-bit address, represented by a longer string of numbers and letters (e.g., 2001:0db8:85a3:0000:0000:8a2e:0370:7334) in hexadecimal code, aka hex code. If you’ve ever designed a MySpace page (hi, Tom!) or a webpage, you’re likely familiar with the hex codes used to identify precise colors.

Doing the math as we did above, there are 2^128 possible IPv6 addresses, which is about 340 undecillion, or roughly 3.4 × 10^38. (Undecillion is the 11th name in the sequence if you’re going million, billion, trillion, and so on.) And, just like IPv4, there are some reserved addresses, but they represent such a comparatively small share of the total available addresses that it’s not even worth calculating a percentage.
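For a sense of scale, the same standard-library module can parse the example IPv6 address and confirm the size of the address space. This is just a quick check of the numbers above; the variable names are arbitrary.

```python
import ipaddress

# Every combination of 128 bits is one possible address.
TOTAL_IPV6 = 2 ** 128

# The example address from the text, written with all leading zeros.
address = ipaddress.ip_address("2001:0db8:85a3:0000:0000:8a2e:0370:7334")

# IPv6 notation allows dropping leading zeros and collapsing one run of zero groups.
short_form = address.compressed

# How many whole undecillions (10^36) fit into the address space.
undecillions = TOTAL_IPV6 // 10 ** 36

# How many complete IPv4 address spaces fit inside IPv6.
ipv4_spaces = TOTAL_IPV6 // 2 ** 32

print(address.version, short_form, undecillions)
```

The compressed form shows why IPv6 addresses are usually written shorter than the fully expanded example.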

Woah, how have we been surviving in the meantime?

We mentioned above that we’ve known we’re running out of IP addresses for a while. But, important detail: There was evidence of the problem as early as 1981, and mitigation efforts were enacted by 1992. Before we get into what mitigation strategies have been used over the years, a bit of a refinement of the above information: IP addresses consist of two main parts, one that identifies the network (or, sometimes, the subnet) and one that identifies the host, or the destination on that network. (That’s true of both IPv4 and IPv6.)

Classful networking

In the original iteration of IPv4, the bits that identified the subnet were fixed, and that meant a lot of wasted space. In 1981, we implemented classful networking. Instead of keeping a fixed number of bits to identify a network, the three most significant bits identified the size of the network prefix, and that sent you to different classes. That meant that existing addresses didn’t have to change. Here’s a handy table:

Class | Most significant bits | Network prefix size (bits) | Host identifier size (bits) | Address range | Maximum number of networks | Maximum number of hosts per network
A | 0 | 8 | 24 | 0.0.0.0–127.255.255.255 | 128 | 16,777,216
B | 10 | 16 | 16 | 128.0.0.0–191.255.255.255 | 16,384 | 65,536
C | 110 | 24 | 8 | 192.0.0.0–223.255.255.255 | 2,097,152 | 256
D (multicast) | 1110 | n/a | n/a | 224.0.0.0–239.255.255.255 | n/a | n/a
E (reserved) | 1111 | n/a | n/a | 240.0.0.0–255.255.255.255 | n/a | n/a
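To see how the leading bits select a class, here is a minimal Python sketch that inspects the first octet of an address. The function name is made up for illustration; the bit patterns are the ones listed in the table.

```python
import ipaddress


def address_class(address):
    """Return the historical class (A to E) of an IPv4 address from its leading bits."""
    first_octet = int(ipaddress.IPv4Address(address)) >> 24
    if first_octet >> 7 == 0b0:
        return "A"  # leading bit 0
    if first_octet >> 6 == 0b10:
        return "B"  # leading bits 10
    if first_octet >> 5 == 0b110:
        return "C"  # leading bits 110
    if first_octet >> 4 == 0b1110:
        return "D"  # leading bits 1110 (multicast)
    return "E"      # leading bits 1111 (reserved)
```

For example, `address_class("192.168.1.100")` returns "C", because 192 is 11000000 in binary.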

All that sounds a bit like gobbledygook. An analogy: You live in a city that wants to improve mail delivery, so it’s introduced the option to choose from a small, medium, or large mailbox. The sizes are actually pretty disproportionate: the small is about the size of a toaster, whereas the medium is the size of a kitchen trash can. (And large is the size of your car. Who gets that much mail?) No matter which size mailbox you (or your neighbor) choose, your physical address doesn’t change when this system is implemented. You usually get more mail than the toaster would accommodate, but never even come close to filling your trash can-sized mailbox. So, that extra space just sits empty and unused, never fulfilling its mail volume potential.

Note that classful networking is now largely defunct, replaced by…  

Classless inter-domain routing (CIDR)

The biggest issue of the above system was its inflexibility. Adding classes gave us more flexibility than the original design, but you were still restricted to 8, 16, or 24 bits to identify the network. That means you can end up with a lot of unused IP addresses, as indicated by our above analogy. Here’s the math behind why: 

The number of host addresses available on a network shrinks as you use more bits to define the network: every bit spent on the network prefix is a bit you can’t spend on hosts. So, in a 32-bit address, if you use 24 bits to define the network, you have 8 bits left over to define the host. That’s our Class C network, which contained 2^8 (256) IP addresses, not enough for most use cases. And the next size up, Class B, represented 2^16 IP addresses (65,536 total), which most organizations could not use efficiently. After DNS became the norm, it became clear that classful networking wasn’t scalable, and thus CIDR rose to prominence.

CIDR is based on variable-length subnet masking (VLSM), which lets each network be divided into subnetworks of various power-of-two sizes. This method optimizes the allocation of IPv4 addresses by allowing for more flexible address blocks. 
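A short Python sketch shows both ideas: how the prefix length determines the number of addresses, and how one block can be carved into subnetworks of different power-of-two sizes. The 198.51.100.0/24 block is a documentation-only range chosen for illustration, and the helper name is hypothetical.

```python
import ipaddress


def addresses_for_prefix(prefix_length):
    """Addresses in an IPv4 block: 2 to the power of the bits left for hosts."""
    return 2 ** (32 - prefix_length)


# Start from one /24 block (256 addresses).
block = ipaddress.ip_network("198.51.100.0/24")

# Split it into two /25 halves, then split the second half into two /26 quarters.
first_half, second_half = block.subnets(new_prefix=25)
quarters = list(second_half.subnets(new_prefix=26))

print(first_half, first_half.num_addresses)
for quarter in quarters:
    print(quarter, quarter.num_addresses)
```

With CIDR, a prefix such as /22 (1,024 addresses) is just as valid as the old 8, 16, or 24-bit boundaries, so an allocation can be sized much closer to what a network actually needs.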

Using our analogy, instead of assigning mailbox size based on household size, you might just have a system in which folks walk up to the post office and find their name on a list associated with a mailbox. If someone has more or less mail that month, then they can be assigned the properly sized mailbox. 

Network address translation (NAT)

NAT allows multiple devices to share a single public IPv4 address by modifying the IP header when it’s in transit. This is super useful when you’re talking about private networks: multiple devices can sit behind a single public IP address. For example, if you have several internet of things (IoT) devices in your home, they can all appear to the public network as one IP address, and your local network can figure out what traffic goes where. It also makes it so that if a network moves, the host doesn’t necessarily have to be assigned a new IP address, such as if an internet provider like Cox decides to stop doing business in your region and Spectrum takes over their IP address allocation (though they’d likely just change your public IP address in that specific scenario).

In our mail analogy, NAT is like those group mailboxes you see in rural areas, apartment buildings, or in neighborhoods. Everyone in the same location gets their mail delivered to the same physical address, and your box number is used to further identify your house within the group mailbox. 
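The group-mailbox idea can be sketched as a toy translation table in Python. This example assumes the common port-based flavor of NAT, where the router tells devices apart by giving each connection its own public port; the text above doesn’t go into that detail, and the class, method names, and addresses are all made up for illustration.

```python
class PortNat:
    """Toy port-based NAT table: many private endpoints behind one public IP."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self._next_port = first_port
        self._outbound = {}  # (private_ip, private_port) -> public_port
        self._inbound = {}   # public_port -> (private_ip, private_port)

    def translate_outbound(self, private_ip, private_port):
        """Rewrite a private source endpoint to the shared public endpoint."""
        key = (private_ip, private_port)
        if key not in self._outbound:
            public_port = self._next_port
            self._next_port += 1
            self._outbound[key] = public_port
            self._inbound[public_port] = key
        return (self.public_ip, self._outbound[key])

    def translate_inbound(self, public_port):
        """Find which private endpoint a reply belongs to (None if unknown)."""
        return self._inbound.get(public_port)


nat = PortNat("203.0.113.7")
thermostat = nat.translate_outbound("192.168.1.20", 5000)
doorbell = nat.translate_outbound("192.168.1.21", 5000)
```

Both devices show up on the public side as 203.0.113.7, and the port plays the role of the box number that tells the router which device a reply belongs to.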

The secondary market of IP addresses

If we can learn anything from the above workarounds, it’s that flexibility and possibility are key. So, it’s unsurprising that a secondary market has cropped up, introducing things like address recycling, address trading, and address leasing. IPv6 will solve the scarcity issue, but what else can it do?

What are the benefits of IPv6?

So far we’ve talked about the primary benefit of IPv6: more IP addresses, which we clearly need. But, there are other benefits as well. Here’s a summary:

Improved Efficiency

  • Simpler header: The IPv6 header is simpler than IPv4’s, leading to faster packet processing and reduced overhead.
  • Efficient routing: IPv6’s design allows for more efficient routing, potentially reducing latency and improving network performance. Arguably, most folks won’t see a huge performance improvement unless they reconfigure their own network architecture, but the possibility is there. 
  • Autoconfiguration: IPv6 supports automatic configuration of network interfaces, simplifying setup and reducing administrative overhead.

Enhanced Security

  • Built-in security features: IPv6 offers built-in security mechanisms like IPsec, potentially providing better protection against attacks. In practice, IPsec often isn’t used, because most encryption is handled by Transport Layer Security (TLS), which runs above the IP layer.

Quality of Service (QoS)

  • Improved QoS: IPv6 provides better support for QoS, allowing for prioritization of different types of traffic, ensuring a better user experience for applications like video conferencing and online gaming.

Other Benefits

  • Reduced reliance on NAT: IPv6 reduces the need for NAT, simplifying network configurations and improving end-to-end connectivity.
  • Support for new services: IPv6 is better suited for emerging technologies and applications that require a large number of addresses and advanced features.

What’s next? Will we run out again?

Given the number of addresses for IPv4 vs. IPv6 (4.3 billion vs. 340 undecillion, respectively), you can understand how we might have needed to shore up our IPv4 addresses. Honestly, if you assume one device per person, we already outnumber IPv4 addresses. In fact, we outnumbered IP addresses in the 1970s, before IPv4 was even invented! You shouldn’t assume one device per person, by the way. Many countries with widespread broadband access have several devices per person: in the U.S., Consumer Affairs was reporting 21 per household in 2023, and the average U.S. household for that same year was 2.51 people. Globally, that same source reports 3.6 internet-connected devices per person.

Changes like this can certainly be disruptive, but the good news on that front is that most devices will be dual-stacked for quite a while. That means that you’ll have both versions of an IP address, and this change can roll out organically (so to speak). In the end, we’ll have a better-performing internet, ready to grow with us for the foreseeable future.

The post Proper Address: IPv4 vs. IPv6 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup