It's Time to Warm Up the Slaps

Post Syndicated from Yovko Lambrev original https://yovko.net/vreme-za-shamari/

It's Time to Warm Up the Slaps

2021. The year begins with at least three new Bulgarian editions (and two new translations) of Orwell's "1984" and "Animal Farm". So far. I know of two more editions of "1984" being prepared before the end of the year. Strange, isn't it? When I shared with a friend that this looks absurd against the backdrop of our laughably small book market, he reminded me that last year marked 70 years since the author's death, so the rights to the titles are now free. "Ugly cashing-in" was his comment.

On the other hand, 2021 may also be the year in which we overcome the uncontrolled spread of the virus that causes COVID-19, but it should certainly also mark some more tangible resistance against the big technology corporations. In that line of thought, reading dystopias helps.

The Internet is becoming excessively centralized and deformed. The content generated by people (the damn corporations call us users) has been sucked up by a handful of so-called platforms. But they collect it not to make it easier to find or more useful to everyone else, but solely for their own selfish and greedy ends. In turn, almost all platforms, services, and tools are sheltered in the embrace of the Amazon, Microsoft, or Google clouds. Even if you have rented a virtual server or something else from a smaller provider, it is very likely that it still relies on the infrastructure of the big three, merely repackaging it and perhaps adding some services or support of its own.

Thus the supposedly decentralized Internet is in the hands of a few greedy bastards, and it is high time they started taking some slaps. Because of them, fundamental pillars of our societies and democracies are cracking. The platforms refuse to take responsibility for the content they distribute and amplify, yet at the same time they suffocate and make dependent the media outlets that are still trying to fulfill their public duty. The new business models benefit the platforms, every single one of them, with no commitment whatsoever to compensate those who create the content. The social networks' attempts to label disinformation, political manipulation, and fake news are only recent, and the implementation is laughable; it is obvious that it was done under pressure, without the necessary awareness of the problem and the responsibility.

I sincerely hope that we will soon see decisive and firm measures against the tech oligopolies, including breaking the giants up into smaller companies. But I am not naive. It is not very likely, nor will it be easy. Politicians these days are a marketing product without an ideological foundation. They adjust themselves to whichever way the wind blows and follow the populist waves of the masses, instead of leading people in one direction or another with conviction.

That is why it is important that we mere mortals resist in every way we can.

  • Avoid the platforms and the mass-market providers, with particular caution toward those in Five Eyes jurisdictions. Use the services of smaller and, where possible, local or European hosting companies and micro-clouds.
  • Run and maintain alternative solutions for storing files and data.
  • Encrypt our communication and traffic (and watch out that the bureaucrats don't make this illegal).
  • Stop piling up content on other people's platforms and keep our own small sites, blogs, and servers instead.
  • Use tools that sit outside the control of information-curation algorithms, such as RSS or Mastodon (a minimal sketch follows this list).
  • Avoid technologies that hijack traffic or turn into gravitational centers, such as AMP.
  • Decentralize everything we can on the Internet.
  • Boycott the bastards.
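As a small illustration of the RSS route, a minimal sketch in Python: it reads a blog's feed directly using the third-party feedparser library, with no platform or curation algorithm in between. The feed URL is only a placeholder; point it at any blog you follow.

# Minimal sketch: read a blog's RSS feed directly, with no platform in between.
# Requires the third-party "feedparser" package (pip install feedparser).
import feedparser

FEED_URL = "https://example.com/feed/"  # placeholder; use any feed you follow

feed = feedparser.parse(FEED_URL)
print(feed.feed.get("title", FEED_URL))
for entry in feed.entries[:10]:
    # Each entry carries the post title and a direct link to the original site.
    print(f"- {entry.title}\n  {entry.link}")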

I am considering starting a series of posts suggesting good (in my view) ideas and solutions for protecting personal privacy on the Internet, for better security of our devices and computers, and for safeguarding our company and personal data with tools and services that are alternatives to the well-known ones. I will try, as far as possible, to write in an understandable and accessible way, although for some of the more technical topics that is quite difficult.

Starting this year, I will also begin offering implementation of some of the solutions I write about. As a service. Given my limited personal time and capacity, at least at first my services will be aimed mainly at small and medium-sized businesses and organizations that I get along with. In any case, I have no desire whatsoever to help the big ones 🙂

Cover photo: @ev

Latest on the SVR’s SolarWinds Hack

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/01/latest-on-the-svrs-solarwinds-hack.html

The New York Times has an in-depth article on the latest information about the SolarWinds hack (not a great name, since it’s much more far-reaching than that).

Interviews with key players investigating what intelligence agencies believe to be an operation by Russia’s S.V.R. intelligence service revealed these points:

  • The breach is far broader than first believed. Initial estimates were that Russia sent its probes only into a few dozen of the 18,000 government and private networks they gained access to when they inserted code into network management software made by a Texas company named SolarWinds. But as businesses like Amazon and Microsoft that provide cloud services dig deeper for evidence, it now appears Russia exploited multiple layers of the supply chain to gain access to as many as 250 networks.
  • The hackers managed their intrusion from servers inside the United States, exploiting legal prohibitions on the National Security Agency from engaging in domestic surveillance and eluding cyberdefenses deployed by the Department of Homeland Security.
  • “Early warning” sensors placed by Cyber Command and the National Security Agency deep inside foreign networks to detect brewing attacks clearly failed. There is also no indication yet that any human intelligence alerted the United States to the hacking.
  • The government’s emphasis on election defense, while critical in 2020, may have diverted resources and attention from long-brewing problems like protecting the “supply chain” of software. In the private sector, too, companies that were focused on election security, like FireEye and Microsoft, are now revealing that they were breached as part of the larger supply chain attack.
  • SolarWinds, the company that the hackers used as a conduit for their attacks, had a history of lackluster security for its products, making it an easy target, according to current and former employees and government investigators. Its chief executive, Kevin B. Thompson, who is leaving his job after 11 years, has sidestepped the question of whether his company should have detected the intrusion.
  • Some of the compromised SolarWinds software was engineered in Eastern Europe, and American investigators are now examining whether the incursion originated there, where Russian intelligence operatives are deeply rooted.

Separately, it seems that the SVR conducted a dry run of the attack five months before the actual attack:

The hackers distributed malicious files from the SolarWinds network in October 2019, five months before previously reported files were sent to victims through the company’s software update servers. The October files, distributed to customers on Oct. 10, did not have a backdoor embedded in them, however, in the way that subsequent malicious files that victims downloaded in the spring of 2020 did, and these files went undetected until this month.

[…]

“This tells us the actor had access to SolarWinds’ environment much earlier than this year. We know at minimum they had access Oct. 10, 2019. But they would certainly have had to have access longer than that,” says the source. “So that intrusion [into SolarWinds] has to originate probably at least a couple of months before that, probably at least mid-2019 [if not earlier].”

The files distributed to victims in October 2019 were signed with a legitimate SolarWinds certificate to make them appear to be authentic code for the company’s Orion Platform software, a tool used by system administrators to monitor and configure servers and other computer hardware on their network.

Meet the team behind the mini Raspberry Pi–powered ISS

Post Syndicated from Ashley Whittaker original https://www.raspberrypi.org/blog/meet-team-behind-the-mini-raspberry-pi-powered-iss/

Quite possibly the coolest thing we saw Raspberry Pi powering this year was ISS Mimic, a mini version of the International Space Station (ISS). We wanted to learn more about the brains that dreamt up ISS Mimic, which uses data from the ISS to mirror exactly what the real thing is doing in orbit.

The ISS Mimic team’s a diverse, fun-looking bunch of people and they all made their way to NASA via different paths. Maybe you could see yourself there in the future too?

Dallas Kidd

Dallas in a green t-shirt standing next to Estefannie in a black t-shirt on a blue background. Estefannie is wearing safety goggles
Dallas (in the green t-shirt) having a lark with teammate Estefannie. Safety first!

Dallas Kidd currently works at the startup Skylark Wireless, helping to advance the technology to provide affordable high speed internet to rural areas.

Previously, she worked on traffic controllers and sensors, in finance on a live trading platform, on RAID controllers for enterprise storage, and at a startup tackling the problem of alarm fatigue in hospitals.

Before getting her Master’s in computer science with a thesis on automatically classifying stars, she taught English as a second language, Algebra I, geometry, special education, reading, and more.

Her hobbies are scuba diving, learning about astronomy, creative writing, art, and gaming.

Tristan Moody

Tristan Moody holding his kid Team ISS NASA
That’s Tristan on the right. NASA does not currently hire small children.

Tristan Moody currently works as a spacecraft survivability engineer at Boeing, helping to keep the ISS and other satellites safe from the threat posed by meteoroids and orbital debris.

He has a PhD in mechanical engineering and currently spends much of his free time as playground equipment for his two young kids.

Estefannie

Estefannie dressed up as Rey from Star Wars for the 2021 princesses with powertools calendar
Estefannie as Rey from Star Wars for the 2021 Princesses with Powertools calendar

Estefannie is a software engineer, designer, and punk rocker who likes to over-engineer things and document her findings on her YouTube and Instagram channels as Estefannie Explains It All.

Estefannie spends her time inventing things before thinking, soldering for fun, writing, filming and producing content for her YouTube channel, and public speaking at universities, conferences, and hackathons.

She lives in Houston, Texas and likes tacos.

Douglas Kimble

A member of team ISS Mimic giving a thumbs up while working on the ISS Mimic
Where are the dogs, Douglas?!

Douglas Kimble currently works as an electrical/mechanical design engineer at Boeing. He has designed countless wire harness and installation drawings for the ISS.

He assumes the mentor role and interacts well with diverse personalities. He is also the world’s biggest Lakers fan living in Texas.

His favorite pastimes include hanging out with his two dogs, Boomer and Teddy.

Craig Stanton

A member of team ISS Mimic raising an eyebrow while working on the ISS Mimic hardware
Craig knows what’s up. Or knows a secret. We can’t tell. Maybe both?

Craig’s father worked for the Space Shuttle program, designing the ascent flight trajectories profiles for the early missions. He remembers being on site at Johnson Space Center one evening, in a freezing cold computer terminal room, punching cards for a program his dad wrote in the early 1980s.

Craig grew up with LEGO and majored in Architecture and Space Design at the University of Houston’s Sasakawa International Center for Space Architecture (SICSA).

His day job involves measuring ISS major assemblies on the ground to ensure they’ll fit together on-orbit. Traveling to many countries to measure hardware that will never see each other until on-orbit is really the coolest part of the job.

Sam Treadgold

A member of team ISS Mimic sitting at a laptop
Sam: not to be trusted with hardware you don’t want shot in the desert

Sam Treadgold is an aerospace engineer who also works on the Meteoroid and Orbital Debris team, helping to protect the ISS and Space Launch System from hypervelocity impacts. Occasionally they take spaceflight hardware out to the desert and shoot it with a giant gun to see what happens.

In a non-pandemic world he enjoys rock climbing, music festivals, and making sound-reactive LED sunglasses.

Chen Deng

A member of team ISS Mimic showing off a solar panel
Chen showing off the very shiniest part of the ISS Mimic (solar panels)

Chen Deng is a Systems Engineer working at Boeing with the International Space Station (ISS) program. Her job is to ensure readiness of Payloads, or science experiments, to launch in various spacecraft and operations to conduct research aboard the ISS.

The ISS provides a unique science laboratory environment, something we can’t get much of on Earth: microgravity! The term microgravity means a state of little or very weak gravity. The virtual absence of gravity allows scientists to conduct experiments that are impossible to perform on Earth, where gravity affects everything that we do.

In her free time, Chen enjoys hiking, board games, and creative projects alike.

Bryan Murphy

bryan murphy from team iss mimic at nasa
Bryan, adorned with an LED necklace, posing next to ISS Mimic’s rotating solar panel ‘wings’

Bryan Murphy is a dynamics and motion control engineer at Boeing, where he gets to create digital physics models of robotic space mechanisms to predict their performance.

His favorite projects include the ISS treadmill vibration isolation system and the shiny new docking system. He grew up on a small farm where his hands-on time with mechanical devices fueled his interest in engineering.

When not at work, he loves to brainstorm and create with his artist/engineer wife and their nerdy kids, or go on long family road trips, especially to hike and kayak or eat ice cream. He’s also vice president of a local makerspace, where he leads STEM outreach and includes excess LEDs in all his builds.

Susan

A member of team ISS Mimic
Here’s Susan rocking some of those LED glasses and getting a good grip on ISS Mimic

Susan is a mechanical engineer and a 30+-year veteran of manned spaceflight operations.  She has worked the Space Shuttle Program for Payloads (middeck experiments and payloads deployed with the shuttle arm) starting with STS-30 and was on the team that deployed the Hubble Space Telescope.

She then transitioned into life sciences experiments, which led to the NASA Mir Program where she was on continuous rotation for three years to Russian Mission Control, supporting the NASA astronaut and science experiments onboard the space station as a predecessor to the ISS.

She currently works on the ISS Program (for over 20 years now), where she used to write procedures for on-orbit assembly of the Space Station and now writes installation procedures for on-orbit modifications like the docking adapter. She is also an artist and makes crosses out of found objects, and even used to play professional women’s football.

Keep in touch

Team ISS posing in NASA t shirts in front of the ISS mimic

You can keep up with Team ISS Mimic on Facebook, Instagram, and Twitter. For more info or to join the team, check out their GitHub page and Discord.

Kids, run your code on the ISS!

Logo of the European Astro Pi Challenge

Did you know that there are Raspberry Pi computers aboard the real ISS that young people can run their own Python programs on? How cool is that?!

Find out how to participate at astro-pi.org.
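For a flavor of what such a program can look like, here is a tiny, hypothetical Python example using the Sense HAT add-on board that the Astro Pi units carry: it reads two sensors and scrolls a short message on the LED matrix, which is roughly the shape of a simple Astro Pi experiment.

# A tiny, hypothetical Astro Pi-style program using the Sense HAT library.
from sense_hat import SenseHat

sense = SenseHat()

temperature = sense.get_temperature()   # degrees Celsius
humidity = sense.get_humidity()         # percent relative humidity

# Scroll the readings across the LED matrix.
sense.show_message(f"T={temperature:.1f}C H={humidity:.1f}%", scroll_speed=0.05)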

The post Meet the team behind the mini Raspberry Pi–powered ISS appeared first on Raspberry Pi.

Germs as Serial Killers

Post Syndicated from original https://yurukov.net/blog/2021/seriini-ubiici/

Virologists: We have found the serial killer. We want to protect society.

Live attenuated vaccines: Here, we are showing you the serial killer, but he is weakened and tied up. We know there is at least minimal security here, so the chance of him escaping is negligible, and even then he would not get far.

Inactivated vaccines: Here, we are showing you his severed head so that you can learn to recognize him. It is well locked up and clearly cannot escape, but some people may still be startled by the sight.

Subunit (molecular) vaccines: We will not show you anything, but we will send out among you people who know him and will warn you. You will most likely get along with them.

mRNA vaccines: We will not show you anything of the killer himself. We will, however, give you a detailed blueprint of the lock pick he uses to break into his victims' homes. That way you can make one yourselves and study it. If you see someone carrying such a lock pick, you know what to do. The risks include paper cuts from the blueprint.

Anti-vaxxers: Oh no. We do not need any of that. We believe everyone needs to face such serial killers for themselves. Well, killers have always been around and we have been fine, and besides, it is important for a child's development to encounter killers and learn in a natural way how to defend against them. Not to mention that most children are not the type of victim this one prefers, so what is all the fuss about? What is more, we think the claims about how cruel he was are greatly exaggerated; he was just playing with them. The whole thing is one big conspiracy: some people just want to make money from printing, showing, and scaring the children. Can you imagine how the ugly sight of a photo of a criminal would affect a child's mind? It is far better for the child to meet him in person!

The post Germs as Serial Killers first appeared on Блогът на Юруков.

How FactSet automates thousands of AWS accounts at scale

Post Syndicated from Amit Borulkar original https://aws.amazon.com/blogs/devops/factset-automation-at-scale/

This post is by FactSet’s Cloud Infrastructure team, Gaurav Jain, Nathan Goodman, Geoff Wang, Daniel Cordes, Sunu Joseph, and AWS Solution Architects Amit Borulkar and Tarik Makota. In their own words, “FactSet creates flexible, open data and software solutions for tens of thousands of investment professionals around the world, which provides instant access to financial data and analytics that investors use to make crucial decisions. At FactSet, we are always working to improve the value that our products provide.”

At FactSet, our operational goal in using the AWS Cloud is to have high developer velocity alongside enterprise governance. Assigning AWS accounts per project enables the agility and isolation boundary needed by each of the project teams to innovate faster. As existing workloads are migrated and new workloads are developed in the cloud, we realized that we were operating close to thousands of AWS accounts. To provide a consistent and repeatable experience for diverse project teams, we automated the AWS account creation process, the various service control policies (SCPs) and AWS Identity and Access Management (IAM) policies and roles associated with the accounts, and the enforcement of policies for ongoing configuration across the accounts. This post covers our automation workflows to enable governance for thousands of AWS accounts.

AWS account creation workflow

To empower our project teams to operate in the AWS Cloud in an agile manner, we developed a platform that enables AWS account creation with the default configuration customized to meet FactSet’s governance policies. These AWS accounts are provisioned with defaults such as a virtual private cloud (VPC), subnets, routing tables, IAM roles, SCP policies, add-ons for monitoring and load-balancing, and FactSet-specific governance. Developers and project team members can request a micro account for their product via this platform’s website, or do so programmatically using an API or wrap-around custom Terraform modules. The following screenshot shows a portion of the web interface that allows developers to request an AWS account.
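FactSet's provisioning platform itself is not shown in this excerpt, but as a rough sketch of the programmatic path such a platform can sit on top of, the following Python snippet uses boto3 to request a new member account through AWS Organizations and then polls the request status. The email, account name, and billing setting are placeholders, and FactSet's real workflow layers its defaults (VPCs, SCPs, IAM roles, monitoring) on top of the bare account.

# Hypothetical sketch of programmatic account vending via AWS Organizations (boto3).
# The email and account name are placeholders.
import time

import boto3

org = boto3.client("organizations")

response = org.create_account(
    Email="root+team-alpha@example.com",   # placeholder root email
    AccountName="team-alpha-dev",          # placeholder account name
    IamUserAccessToBilling="DENY",
)
request_id = response["CreateAccountStatus"]["Id"]

# Poll until AWS finishes creating the account.
while True:
    status = org.describe_create_account_status(CreateAccountRequestId=request_id)
    state = status["CreateAccountStatus"]["State"]
    if state != "IN_PROGRESS":
        print("Account creation finished with state:", state)
        break
    time.sleep(10)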

FactSet service catalog
Continue reading How FactSet automates thousands of AWS accounts at scale

The best new features for data analysts in Amazon Redshift in 2020

Post Syndicated from Helen Anderson original https://aws.amazon.com/blogs/big-data/the-best-new-features-for-data-analysts-in-amazon-redshift-in-2020/

This is a guest post by Helen Anderson, data analyst and AWS Data Hero

Every year, the Amazon Redshift team launches new and exciting features, and 2020 was no exception. New features to improve the data warehouse service and add interoperability with other AWS services were rolling out all year.

I am part of a team that for the past 3 years has used Amazon Redshift to store source tables from systems around the organization and usage data from our software as a service (SaaS) product. Amazon Redshift is our one source of truth. We use it to prepare operational reports that support the business and for ad hoc queries when numbers are needed quickly.

When AWS re:Invent comes around, I look forward to the new features, enhancements, and functionality that make things easier for analysts. If you haven’t tried Amazon Redshift in a while, or even if you’re a longtime user, these new capabilities are designed with analysts in mind to make it easier to analyze data at scale.

Amazon Redshift ML

The newly launched preview of Amazon Redshift ML lets data analysts use Amazon SageMaker over datasets in Amazon Redshift to solve business problems without the need for a data scientist to create custom models.

As a data analyst myself, this is one of the most interesting announcements to come out in re:Invent 2020. Analysts generally use SQL to query data and present insights, but they don’t often do data science too. Now there is no need to wait for a data scientist or learn a new language to create predictive models.

For information about what you need to get started with Amazon Redshift ML, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

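The heart of Amazon Redshift ML is plain SQL: a CREATE MODEL statement that points at a query, a target column, an IAM role, and an S3 bucket. The following Python sketch submits such a statement through the Redshift Data API; the cluster name, table, columns, IAM role, and bucket are all placeholders, so treat it as an illustration of the shape of the statement rather than a ready-to-run example.

# Hypothetical sketch: submit a CREATE MODEL statement via the Redshift Data API.
# Cluster, database, table, columns, IAM role, and bucket are placeholders.
import boto3

rsd = boto3.client("redshift-data")

create_model_sql = """
CREATE MODEL customer_churn
FROM (SELECT age, tenure_months, monthly_charges, churned FROM customer_activity)
TARGET churned
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftMLRole'
SETTINGS (S3_BUCKET 'example-redshift-ml-bucket');
"""

resp = rsd.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="analyst",
    Sql=create_model_sql,
)
print("Statement id:", resp["Id"])

# Once training finishes, predictions are ordinary SQL, for example:
# SELECT customer_id, predict_customer_churn(age, tenure_months, monthly_charges)
# FROM new_customers;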

Federated queries

As analysts, we often have to join datasets that aren’t in the same format and sometimes aren’t ready for use in the same place. By using federated queries to access data in other databases or Amazon Simple Storage Service (Amazon S3), you don’t need to wait for a data engineer or ETL process to move data around.

re:Invent 2019 featured some interesting talks from Amazon Redshift customers who were tackling this problem. Now that federated queries over operational databases like Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL are generally available and querying Amazon RDS for MySQL and Amazon Aurora MySQL is in preview, I’m excited to hear more.

For a step-by-step example to help you get started, see Build a Simplified ETL and Live Data Query Solution Using Redshift Federated Query.

SUPER data type

Another problem we face as analysts is that the data we need isn’t always in rows and columns. The new SUPER data type makes JSON data easy to use natively in Amazon Redshift with PartiQL.

PartiQL is an extension that helps analysts get up and running quickly with structured and semistructured data so you can unnest and query using JOINs and aggregates. This is really exciting for those who deal with data coming from applications that store data in JSON or unstructured formats.

For use cases and a quickstart, see Ingesting and querying semistructured data in Amazon Redshift (preview).
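As a hypothetical illustration of what that unnesting looks like, the sketch below runs a PartiQL-style query over an invented orders table whose items column is assumed to be a SUPER array, again submitted through the Redshift Data API from Python.

# Hypothetical PartiQL query over a SUPER column, submitted via the Redshift Data API.
# Table and column names are invented; "items" is assumed to be a SUPER array.
import boto3

rsd = boto3.client("redshift-data")

partiql_sql = """
SELECT o.order_id, i.sku, i.qty
FROM orders AS o, o.items AS i          -- unnest the SUPER array "items"
WHERE o.customer.country = 'NZ';
"""

resp = rsd.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="analyst",
    Sql=partiql_sql,
)

# Poll describe_statement(Id=resp["Id"]) until it reports FINISHED, then fetch rows:
# rows = rsd.get_statement_result(Id=resp["Id"])["Records"]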

Partner console integration

The preview of the native console integration with partners announced at AWS re:Invent 2020 will also make data analysis quicker and easier. Although analysts might not be doing the ETL work themselves, this new release makes it easier to move data from platforms like Salesforce, Google Analytics, and Facebook Ads into Amazon Redshift.

Matillion, Sisense, Segment, Etleap, and Fivetran are launch partners, with other partners coming soon. If you’re an Amazon Redshift partner and would like to integrate into the console, contact [email protected].

RA3 nodes with managed storage

Previously, when you added Amazon Redshift nodes to a cluster, both storage and compute were scaled up. This all changed with the 2019 announcement of RA3 nodes, which let you scale storage and compute independently.

In 2020, the Amazon Redshift team introduced RA3.xlplus nodes, which offer even more compute sizing options to address a broader set of workload requirements.

AQUA for Amazon Redshift

As analysts, we want our queries to run quickly so we can spend more time empowering the users of our insights and less time watching data slowly return. AQUA, the Advanced Query Accelerator for Amazon Redshift, tackles this problem at the infrastructure level by bringing the stored data closer to the compute power.

This hardware-accelerated cache enables Amazon Redshift to run up to 10 times faster as it scales out and processes data in parallel across many nodes. Each node accelerates compression, encryption, and data processing tasks like scans, aggregates, and filtering. Analysts should still try their best to write efficient code, but the power of AQUA will speed up the return of results considerably.

AQUA is available on Amazon Redshift RA3 instances at no additional cost. To get started with AQUA, sign up for the preview.

The following diagram shows Amazon Redshift architecture with an AQUA layer.

Figure 1: Amazon Redshift architecture with AQUA layer

Automated performance tuning

For analysts who haven’t used sort and distribution keys, the learning curve can be steep. A table created with the wrong keys can mean results take much longer to return.

Automatic table optimization tackles this problem by using machine learning to select the best keys and tune the physical design of tables. Letting Amazon Redshift determine how to improve cluster performance reduces manual effort.

Summary

These are just some of the Amazon Redshift announcements made in 2020 to help analysts get query results faster. Some of these features help you get access to the data you need, whether it’s in Amazon Redshift or somewhere else. Others are under-the-hood enhancements that make things run smoothly with less manual effort.

For more information about these announcements and a complete list of new features, see What’s New in Amazon Redshift.


About the Author

Helen Anderson is a Data Analyst based in Wellington, New Zealand. She is well known in the data community for writing beginner-friendly blog posts, teaching, and mentoring those who are new to tech. As a woman in tech and a career switcher, Helen is particularly interested in inspiring those who are underrepresented in the industry.

Building a real-time notification system with Amazon Kinesis Data Streams for Amazon DynamoDB and Amazon Kinesis Data Analytics for Apache Flink

Post Syndicated from Saurabh Shrivastava original https://aws.amazon.com/blogs/big-data/building-a-real-time-notification-system-with-amazon-kinesis-data-streams-for-amazon-dynamodb-and-amazon-kinesis-data-analytics-for-apache-flink/

Amazon DynamoDB helps you capture high-velocity data such as clickstream data to form customized user profiles and Internet of Things (IoT) data so that you can develop insights on sensor activity across various industries, including smart spaces, connected factories, smart packaging, fitness monitoring, and more. It’s important to store these data points in a centralized data lake in real time, where they can be transformed, analyzed, and combined with diverse organizational datasets to derive meaningful insights and make predictions.

A popular use case in the wind energy sector is protecting wind turbines from damage at high wind speeds. As per National Wind Watch, every wind turbine has a range of wind speeds, typically 30–55 mph, in which it produces maximum capacity. When wind speed is greater than 70 mph, it’s important to start a shutdown to protect the turbine from a high wind storm. Customers often store high-velocity IoT data in DynamoDB and use Amazon Kinesis streaming to extract data and store it in a centralized data lake built on Amazon Simple Storage Service (Amazon S3). To facilitate this ingestion pipeline, you can deploy AWS Lambda functions or write custom code to build a bridge between DynamoDB Streams and Kinesis streaming.

Amazon Kinesis Data Streams for DynamoDB helps you publish item-level changes in any DynamoDB table to a Kinesis data stream of your choice. Additionally, you can take advantage of this feature for use cases that require longer data retention on the stream and fan out to multiple concurrent stream readers. You also can integrate with Amazon Kinesis Data Analytics or Amazon Kinesis Data Firehose to publish data to downstream destinations such as Amazon Elasticsearch Service, Amazon Redshift, or Amazon S3.

In this post, you use Kinesis Data Analytics for Apache Flink (Data Analytics for Flink) and Amazon Simple Notification Service (Amazon SNS) to send a real-time notification when wind speed is greater than 60 mph so that the operator can take action to protect the turbine. You use Kinesis Data Streams for DynamoDB and take advantage of managed streaming delivery of DynamoDB data to other AWS services without having to use Lambda or write and maintain complex code. To process DynamoDB events from Kinesis, you have multiple options: Amazon Kinesis Client Library (KCL) applications, Lambda, and Data Analytics for Flink. In this post, we showcase Data Analytics for Flink, but this is just one of many available options.

Architecture

The following architecture diagram illustrates the wind turbine protection system.

In this architecture, high-velocity wind speed data comes from the wind turbine and is stored in DynamoDB. To send an instant notification, you need to query the data in real time and send a notification when the wind speed is greater than the established maximum. To achieve this goal, you enable Kinesis Data Streams for DynamoDB, and then use Data Analytics for Flink to query real-time data in a 60-second tumbling window. This aggregated data is stored in another data stream, which triggers an email notification via Amazon SNS using Lambda when the wind speed is greater than 60 mph. You will build this entire data pipeline in a serverless manner.
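The Lambda function that turns those aggregated records into an email is created by the CloudFormation template later in this post and is not listed here; the following is only a minimal Python sketch of what such a handler could look like, assuming the Flink application writes JSON records of the form { "turbineID": ..., "avgSpeed": ... } to the output stream (as the code shown later does) and that the SNS topic ARN arrives via an environment variable.

# Hypothetical Lambda handler: reads aggregated records from the output Kinesis
# stream and publishes an SNS notification for each high wind speed event.
# SNS_TOPIC_ARN is an assumed environment variable name.
import base64
import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        message = (
            f"Turbine {payload['turbineID']} reported an average wind speed of "
            f"{payload['avgSpeed']} mph over the last minute. Consider a shutdown."
        )
        sns.publish(TopicArn=TOPIC_ARN, Message=message, Subject="High wind speed alert")
    return {"publishedRecords": len(event["Records"])}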

Deploying the wind turbine data simulator

To replicate a real-life scenario, you need a wind turbine data simulator. We use AWS Amplify in this post to deploy a user-friendly web application that can generate the required data and store it in DynamoDB. You must have a GitHub account, which is used to fork the Amplify app code and deploy it in your AWS account automatically.

Complete the following steps to deploy the data simulator web application:

  1. Choose the following AWS Amplify link to launch the wind turbine data simulator web app.

  2. Choose Connect to GitHub and provide credentials, if required.

  3. In the Deploy App section, under Select service role, choose Create new role.
  4. Follow the instructions to create the role amplifyconsole-backend-role.
  5. When the role is created, choose it from the drop-down menu.
  6. Choose Save and deploy.

On the next page, the dynamodb-streaming app is ready to deploy.

  7. Choose Continue.

On the next page, you can see the app build and deployment progress, which might take as many as 10 minutes to complete.

  8. When the process is complete, choose the URL on the left to access the data generator user interface (UI).
  9. Make sure to save this URL because you will use it in later steps.

You also get an email during the build process related to your SSH key. This email indicates that the build process created an SSH key on your behalf to connect to the Amplify application with GitHub.

  10. On the sign-in page, choose Create account.

  11. Provide a user name, password, and valid email to which the app can send you a one-time passcode to access the UI.
  12. After you sign in, choose Generate data to generate wind speed data.
  13. Choose the Refresh icon to show the data in the graph.

You can generate a variety of data by changing the range of minimum and maximum speeds and the number of values.

To see the data in DynamoDB, choose the DynamoDB icon, note the table name that starts with windspeed-, and navigate to the table in the DynamoDB console.

Now that the wind speed data simulator is ready, let’s deploy the rest of the data pipeline.

Deploying the automated data pipeline by using AWS CloudFormation

You use AWS CloudFormation templates to create all the necessary resources for the data pipeline. This removes opportunities for manual error, increases efficiency, and ensures consistent configurations over time. You can view the template and code in the GitHub repository.

  1. Choose Launch with CloudFormation Console:
  2. Choose the US West (Oregon) Region (us-west-2).
  3. For pEmail, enter a valid email to which the analytics pipeline can send notifications.
  4. Choose Next.

  5. Acknowledge that the template may create AWS Identity and Access Management (IAM) resources.
  6. Choose Create stack.

This CloudFormation template creates the following resources in your AWS account:

  • An IAM role to provide a trust relationship between Kinesis and DynamoDB to replicate data from DynamoDB to the data stream
  • Two data streams:
    • An input stream to replicate data from DynamoDB
    • An output stream to store aggregated data from the Data Analytics for Flink app
  • A Lambda function
  • An SNS topic to send email notifications about high wind speeds
  7. When the stack is ready, on the Outputs tab, note the values of both data streams.

Check your email and confirm your subscription to receive notifications. Make sure to check your junk folder if you don’t see the email in your inbox.

Now you can use Kinesis Data Streams for DynamoDB, which allows you to have your data in both DynamoDB and Kinesis without having to use Lambda or write custom code.

Enabling Kinesis streaming for DynamoDB

AWS recently launched Kinesis Data Streams for DynamoDB so that you can send data from DynamoDB to Kinesis Data Streams. You can use the AWS Command Line Interface (AWS CLI) or the AWS Management Console to enable this feature.
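If you prefer to script this step instead of using the console or the CLI, a minimal sketch with Python and boto3 could look like the following; the table name and stream ARN are placeholders that you would take from your own table and stack outputs.

# Hypothetical sketch: enable Kinesis Data Streams for DynamoDB with boto3.
# Table name and stream ARN are placeholders; use your own table and stack outputs.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.enable_kinesis_streaming_destination(
    TableName="windspeed-example",
    StreamArn="arn:aws:kinesis:us-west-2:123456789012:stream/example-input-stream",
)

# Confirm that the destination becomes ACTIVE.
resp = dynamodb.describe_kinesis_streaming_destination(TableName="windspeed-example")
for destination in resp["KinesisDataStreamDestinations"]:
    print(destination["StreamArn"], destination["DestinationStatus"])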

To enable this feature from the console, complete the following steps:

  1. In the DynamoDB console, choose the table that you created earlier (it begins with the prefix windspeed-).
  2. On the Overview tab, choose Manage streaming to Kinesis.

  3. Choose your input stream.

  4. Choose Enable.

  5. Choose Close.

Make sure that Stream enabled is set to Yes.

Building the Data Analytics for Flink app for real-time data queries

As part of the CloudFormation stack, a new Data Analytics for Flink application is deployed in the configured AWS Region. When the stack is up and running, you should be able to see the new application in that Region. Choose Run to start the app.

When your app is running, you should see the following application graph.

Review the Properties section of the app, which shows you the input and output streams that the app is using.

In the next section, let’s look at the important code snippets of the Flink Java application, which explain how the application reads data from a data stream, aggregates the data, and outputs it to another data stream.

Diving deep into the Flink Java application code

In the following code, createSourceFromStaticConfig provides all the wind turbine speed readings from the input stream in string format, which we pass to the WindTurbineInputMap map function. This function parses the string into the Tuple3 data type (e.g., Tuple3<>(turbineID, speed, 1)). All Tuple3 messages are grouped by turbineID to further apply a one-minute tumbling window. The AverageReducer reduce function provides two things: the sum of all the speeds for the specific turbineId in the one-minute window, and a count of the messages for the specific turbineId in the one-minute window. The AverageMap map function takes the output of the AverageReducer reduce function and transforms it into Tuple2 (e.g., Tuple2<>(turbineId, averageSpeed)). Then all turbineIds with an average speed greater than 60 are filtered and mapped to a JSON-formatted message, which we send to the output stream by using the createSinkFromStaticConfig sink function.

final StreamExecutionEnvironment env =
   StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> input = createSourceFromStaticConfig(env);

input.map(new WindTurbineInputMap())
   .filter(v -> v.f2 > 0)
   .keyBy(0)
      .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
   .reduce(new AverageReducer())
   .map(new AverageMap())
   .filter(v -> v.f1 > 60)
   .map(v -> "{ \"turbineID\": \"" + v.f0 + "\", \"avgSpeed\": "+ v.f1 +" }")
   .addSink(createSinkFromStaticConfig());

env.execute("Wind Turbine Data Aggregator");

The following code demonstrates how the createSourceFromStaticConfig and createSinkFromStaticConfig functions read the input and output stream names from the properties of the Data Analytics for Flink application and establish the source and sink of the streams.

private static DataStream<String> createSourceFromStaticConfig(
   StreamExecutionEnvironment env) throws IOException {
   Map<String, Properties> applicationProperties = KinesisAnalyticsRuntime.getApplicationProperties();
   Properties inputProperties = new Properties();
   inputProperties.setProperty(ConsumerConfigConstants.AWS_REGION, (String) applicationProperties.get("WindTurbineEnvironment").get("region"));
   inputProperties.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "TRIM_HORIZON");

   return env.addSource(new FlinkKinesisConsumer<>((String) applicationProperties.get("WindTurbineEnvironment").get("inputStreamName"),
      new SimpleStringSchema(), inputProperties));
}

private static FlinkKinesisProducer<String> createSinkFromStaticConfig() throws IOException {
   Map<String, Properties> applicationProperties = KinesisAnalyticsRuntime.getApplicationProperties();
   Properties outputProperties = new Properties();
   outputProperties.setProperty(ConsumerConfigConstants.AWS_REGION, (String) applicationProperties.get("WindTurbineEnvironment").get("region"));

   FlinkKinesisProducer<String> sink = new FlinkKinesisProducer<>(new
      SimpleStringSchema(), outputProperties);
   sink.setDefaultStream((String) applicationProperties.get("WindTurbineEnvironment").get("outputStreamName"));
   sink.setDefaultPartition("0");
   return sink;
}

In the following code, the WindTurbineInputMap map function parses Tuple3 out of the string message. Additionally, the AverageMap map and AverageReducer reduce functions process messages to accumulate and transform data.

public static class WindTurbineInputMap implements MapFunction<String, Tuple3<String, Integer, Integer>> {
   @Override
   public Tuple3<String, Integer, Integer> map(String value) throws Exception {
      String eventName = JsonPath.read(value, "$.eventName");
      if(eventName.equals("REMOVE")) {
         return new Tuple3<>("", 0, 0);
      }
      String turbineID = JsonPath.read(value, "$.dynamodb.NewImage.deviceID.S");
      Integer speed = Integer.parseInt(JsonPath.read(value, "$.dynamodb.NewImage.value.N"));
      return new Tuple3<>(turbineID, speed, 1);
   }
}

public static class AverageMap implements MapFunction<Tuple3<String, Integer, Integer>, Tuple2<String, Integer>> {
   @Override
   public Tuple2<String, Integer> map(Tuple3<String, Integer, Integer> value) throws Exception {
      return new Tuple2<>(value.f0, (value.f1 / value.f2));
   }
}

public static class AverageReducer implements ReduceFunction<Tuple3<String, Integer, Integer>> {
   @Override
   public Tuple3<String, Integer, Integer> reduce(Tuple3<String, Integer, Integer> value1, Tuple3<String, Integer, Integer> value2) {
      return new Tuple3<>(value1.f0, value1.f1 + value2.f1, value1.f2 + 1);
   }
}

Receiving email notifications of high wind speed

The following screenshot shows an example of the notification email you will receive about high wind speeds.


To test the feature, in this section you generate high wind speed data from the simulator, which is stored in DynamoDB, and get an email notification when the average wind speed is greater than 60 mph for a one-minute period. You’ll observe wind data flowing through the data stream and Data Analytics for Flink.

To test this feature:

  1. Generate wind speed data in the simulator and confirm that it’s stored in DynamoDB.
  2. In the Kinesis Data Streams console, choose the input data stream, kds-ddb-blog-InputKinesisStream.
  3. On the Monitoring tab of the stream, you can observe the Get records – sum (Count) metrics, which show multiple records captured by the data stream automatically.
  4. In the Kinesis Data Analytics console, choose the Data Analytics for Flink application, kds-ddb-blog-windTurbineAggregator.
  5. On the Monitoring tab, you can see the Last Checkpoint metrics, which show multiple records captured by the Data Analytics for Flink app automatically.
  6. In the Kinesis Data Streams console, choose the output stream, kds-ddb-blog-OutputKinesisStream.
  7. On the Monitoring tab, you can see the Get records – sum (Count) metrics, which show multiple records output by the app.
  8. Finally, check your email for a notification.

If you don’t see a notification, change the data simulator value range to between a minimum of 50 mph and a maximum of 90 mph, and wait a few minutes.
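If you want to inspect the output stream from the command line instead of the console, the following sketch reads a few records from the output stream; the shard ID is an assumption and may differ in your stream.

# Get a shard iterator for the output stream and read a few of the aggregated records
ITERATOR=$(aws kinesis get-shard-iterator \
    --stream-name kds-ddb-blog-OutputKinesisStream \
    --shard-id shardId-000000000000 \
    --shard-iterator-type TRIM_HORIZON \
    --query 'ShardIterator' --output text)

aws kinesis get-records --shard-iterator "$ITERATOR" --limit 10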

Conclusion

As you have learned in this post, you can build an end-to-end serverless analytics pipeline to get real-time insights from DynamoDB by using Kinesis Data Streams—all without writing any complex code. This allows your team to focus on solving business problems by getting useful insights immediately. IoT and application development have a variety of use cases for moving data quickly through an analytics pipeline, and you can make this happen by enabling Kinesis Data Streams for DynamoDB.

If this blog post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available in the GitHub repository for you to use and extend. Contributions are always welcome!


About the Authors

Saurabh Shrivastava is a solutions architect leader and analytics/machine learning specialist working with global systems integrators. He works with AWS partners and customers to provide them with architectural guidance for building scalable architecture in hybrid and AWS environments. He enjoys spending time with his family outdoors and traveling to new destinations to discover new cultures.


Sameer Goel is a solutions architect in Seattle who drives customers’ success by building prototypes on cutting-edge initiatives. Prior to joining AWS, Sameer graduated with a Master’s degree with a Data Science concentration from NEU Boston. He enjoys building and experimenting with creative projects and applications.


Pratik Patel is a senior technical account manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions by using best practices, and proactively helps keep customers’ AWS environments operationally healthy.

Military Cryptanalytics, Part III

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/01/military-cryptanalytics-part-iii.html

The NSA has just declassified and released a redacted version of Military Cryptanalytics, Part III, by Lambros D. Callimahos, October 1977.

Parts I and II, by Lambros D. Callimahos and William F. Friedman, were released decades ago — I believe repeatedly, in increasingly unredacted form — and published by the late Wayne Griswold Barker’s Aegean Park Press. I own them in hardcover.

Like Parts I and II, Part III is primarily concerned with pre-computer ciphers. At this point, the document only has historical interest. If there is any lesson for today, it’s that modern cryptanalysis is possible primarily because people make mistakes.

The monograph took a while to become public. The cover page says that the initial FOIA request was made in July 2012: eight and a half years ago.

And there are more books to come. Page 1 starts off:

This text constitutes the third of six basic texts on the science of cryptanalytics. The first two texts together have covered most of the necessary fundamentals of cryptanalytics; this and the remaining three texts will be devoted to more specialized and more advanced aspects of the science.

Presumably, volumes IV, V, and VI are still hidden inside the classified libraries of the NSA.

And from page ii:

Chapters IV-XI are revisions of seven of my monographs in the NSA Technical Literature Series, viz: Monograph No. 19, “The Cryptanalysis of Ciphertext and Plaintext Autokey Systems”; Monograph No. 20, “The Analysis of Systems Employing Long or Continuous Keys”; Monograph No. 21, “The Analysis of Cylindrical Cipher Devices and Strip Cipher Systems”; Monograph No. 22, “The Analysis of Systems Employing Geared Disk Cryptomechanisms”; Monograph No.23, “Fundamentals of Key Analysis”; Monograph No. 15, “An Introduction to Teleprinter Key Analysis”; and Monograph No. 18, “Ars Conjectandi: The Fundamentals of Cryptodiagnosis.”

This points to a whole series of still-classified monographs whose titles we do not even know.

EDITED TO ADD: I have been informed by a reliable source that Parts 4 through 6 were never completed. There may be fragments and notes, but no finished works.

Accessing and visualizing data from multiple data sources with Amazon Athena and Amazon QuickSight

Post Syndicated from Saurabh Bhutyani original https://aws.amazon.com/blogs/big-data/accessing-and-visualizing-data-from-multiple-data-sources-with-amazon-athena-and-amazon-quicksight/

Amazon Athena now supports federated query, a feature that allows you to query data in sources other than Amazon Simple Storage Service (Amazon S3). You can use federated queries in Athena to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources. Athena queries including federated queries can be run from the Athena console, a JDBC or ODBC connection, the Athena API, the Athena CLI, the AWS SDK, or AWS Tools for Windows PowerShell.

The goal of this post is to show how to use different connectors to run federated queries with complex joins across different data sources in Athena, and how to visualize the data with Amazon QuickSight.

Athena Federated Query

Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that translates between your target data source and Athena. You can think of a connector as an extension of the Athena query engine. Prebuilt Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), Amazon Elasticsearch Service (Amazon ES), Amazon ElastiCache for Redis, and JDBC-compliant relational data sources such as MySQL, PostgreSQL, and Amazon Redshift under the Apache 2.0 license. You can also use the Athena Query Federation SDK to write custom connectors. After you deploy data source connectors, the connector is associated with a catalog name that you can specify in your SQL queries. You can combine SQL statements from multiple catalogs and span multiple data sources with a single query.

When a query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements. Connectors use Apache Arrow as the format for returning data requested in a query, which enables connectors to be implemented in languages such as C, C++, Java, Python, and Rust. Because connectors run in Lambda, you can use them to access data from any data source in the cloud or on premises that is accessible from Lambda.
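As an illustration, a single federated query can join tables from different catalogs. The following is a sketch run from the AWS CLI; the catalog, database, and table names are assumptions and should be replaced with the ones you configure later in this post.

# Run a cross-catalog join: a DynamoDB-backed table joined with a MySQL-backed table
aws athena start-query-execution \
    --work-group AmazonAthenaPreviewFunctionality \
    --result-configuration OutputLocation=s3://<your-athena-results-bucket>/ \
    --query-string 'SELECT d.id, m.name FROM "dynamo"."default"."orders" d JOIN "mysql"."sales"."customers" m ON d.customer_id = m.id LIMIT 10'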

Prerequisites

Before creating your development environment, you must have the following prerequisites:

Configuring your data source connectors

After you deploy your CloudFormation stack, follow the instructions in the post Extracting and joining data from multiple data sources with Athena Federated Query to configure various Athena data source connectors for HBase on Amazon EMR, DynamoDB, ElastiCache for Redis, and Amazon Aurora MySQL.

You can run Athena federated queries in the AmazonAthenaPreviewFunctionality workgroup created as part of the CloudFormation stack or you could run them in the primary workgroup or other workgroups as long as you’re running with Athena engine version 2. As of this writing, Athena Federated Query is generally available in the Asia Pacific (Mumbai), Asia Pacific (Tokyo), Europe (Ireland), US East (N. Virginia), US East (Ohio), US West (N. California), and US West (Oregon) Regions. If you’re running in other Regions, use the AmazonAthenaPreviewFunctionality workgroup.

For information about changing your workgroup to Athena engine version 2, see Changing Athena Engine Versions.

Configuring QuickSight

The next step is to configure QuickSight to use these connectors to query data and visualize with QuickSight.

  1. On the AWS Management Console, navigate to QuickSight.
  2. If you’re not signed up for QuickSight, you’re prompted with the option to sign up. Follow the steps to sign up to use QuickSight.
  3. After you log in to QuickSight, choose Manage QuickSight under your account.

  4. In the navigation pane, choose Security & permissions.
  5. Under QuickSight access to AWS services, choose Add or remove.

A page appears for enabling QuickSight access to AWS services.

  6. Choose Athena.
  7. In the pop-up window, choose Next.

  8. On the S3 tab, select the necessary S3 buckets. For this post, I select the athena-federation-workshop-<account_id> bucket and another one that stores my Athena query results.
  9. For each bucket, also select Write permission for Athena Workgroup.
  10. On the Lambda tab, select the Lambda functions corresponding to the Athena federated connectors that Athena federated queries use. If you followed the post Extracting and joining data from multiple data sources with Athena Federated Query when configuring your Athena federated connectors, you can select dynamo, hbase, mysql, and redis.

For information about registering a data source in Athena, see the appendix in this post.

  11. Choose Finish.
  12. Choose Update.
  13. On the QuickSight console, choose New analysis.
  14. Choose New dataset.
  15. For Datasets, choose Athena.
  16. For Data source name, enter Athena-federation.
  17. For Athena workgroup, choose primary.
  18. Choose Create data source.

As stated earlier, you can use the AmazonAthenaPreviewFunctionality workgroup or another workgroup as long as you’re running Athena engine version 2 in a supported Region.

  19. For Catalog, choose the catalog that you created for your Athena federated connector.

For information about creating and registering a data source in Athena, see the appendix in this post.

  20. For this post, I choose the dynamo catalog, which does a federation to the Athena DynamoDB connector.

I can now see the database and tables listed in QuickSight.

  21. Choose Edit/Preview data to see the data.
  22. Choose Save & Visualize to start using this data for creating visualizations in QuickSight.
  23. To do a join with another Athena data source, choose Add data and select the catalog and table.
  24. Choose the join link between the two datasets and choose the appropriate join configuration.
  25. Choose Apply.

You should be able to see the joined data.

Running a query in QuickSight

Now we use the custom SQL option in QuickSight to run a complex query with multiple Athena federated data sources.

  1. On the QuickSight console, choose New analysis.
  2. Choose New dataset.
  3. For Datasets, choose Athena.
  4. For Data source name, enter Athena-federation.
  5. For the workgroup, choose primary.
  6. Choose Create data source.
  7. Choose Use custom SQL.
  8. Enter the query for ProfitBySupplierNation.
  9. Choose Edit/Preview data.


Under Query mode, you have the option to view your query in either SPICE or direct query. SPICE is the QuickSight Super-fast, Parallel, In-memory Calculation Engine. It’s engineered to rapidly perform advanced calculations and serve data. Using SPICE can save time and money because your analytical queries process faster, you don’t need to wait for a direct query to process, and you can reuse data stored in SPICE multiple times without incurring additional costs. You also can refresh data in SPICE on a recurring basis as needed or on demand. For more information about refresh options, see Refreshing Data.

With direct query, QuickSight doesn’t use SPICE data and sends the query every time to Athena to get the data.

  10. Select SPICE.
  11. Choose Apply.
  12. Choose Save & visualize.
  13. On the Visualize page, under Fields list, choose nation and sum_profit.

QuickSight automatically chooses the best visualization type based on the selected fields. You can change the visual type based on your requirement. The following screenshot shows a pie chart for Sum_profit grouped by Nation.


You can add more datasets using Athena federated queries and create dashboards. The following screenshot is an example of a visual analysis over various datasets that were added as part of this post.


When your analysis is ready, you can choose Share to create a dashboard and share it within your organization.

Summary

QuickSight is a powerful visualization tool, and with Athena federated queries, you can run analysis and build dashboards on various data sources like DynamoDB, HBase on Amazon EMR, and many more. You can also easily join relational, non-relational, and custom object stores in Athena queries and use them with QuickSight to create visualizations and dashboards.

For more information about Athena Federated Query, see Using Amazon Athena Federated Query and Query any data source with Amazon Athena’s new federated query.


Appendix

To register a data source in Athena, complete the following steps:

  1. On the Athena console, choose Data sources.

  2. Choose Connect data source.
  3. Select Query a data source.
  4. For Choose a data source, select a data source (for this post, I select Redis).
  5. Choose Next.
  6. For Lambda function, choose your function.

For this post, I use the redis Lambda function, which I configured as part of configuring the Athena federated connector in the post Extracting and joining data from multiple data sources with Athena Federated Query.

  7. For Catalog name, enter a name (for example, redis).

The catalog name you specify here is the one that is displayed in QuickSight when selecting Lambda functions for access.

  8. Choose Connect.

When the data source is registered, it’s available in the Data source drop-down list on the Athena console.
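The same registration can also be scripted with the AWS CLI. The following is a minimal sketch, assuming the redis Lambda function from the referenced post; the function ARN is a placeholder.

# Register the Lambda connector as an Athena data catalog named "redis"
aws athena create-data-catalog \
    --name redis \
    --type LAMBDA \
    --parameters function=arn:aws:lambda:<region>:<account-id>:function:redis

# Confirm the catalog is registered
aws athena list-data-catalogs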



About the Author

Saurabh Bhutyani is a Senior Big Data Specialist Solutions Architect at Amazon Web Services. He is an early adopter of open-source big data technologies. At AWS, he works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation. In his free time, he likes to watch movies and spend time with his family.


Multi-tenant processing pipelines with AWS DMS, AWS Step Functions, and Apache Hudi on Amazon EMR

Post Syndicated from Francisco Oliveira original https://aws.amazon.com/blogs/big-data/multi-tenant-processing-pipelines-with-aws-dms-aws-step-functions-and-apache-hudi-on-amazon-emr/

Large enterprises often provide software offerings to multiple customers by giving each customer a dedicated and isolated environment (a software offering composed of multiple single-tenant environments). Because the data is in various independent systems, large enterprises are looking for ways to simplify data processing pipelines. To address this, you can create data lakes to bring your data to a single place.

Typically, a replication tool such as AWS Database Migration Service (AWS DMS) can replicate the data from your source systems to Amazon Simple Storage Service (Amazon S3). When the data is in Amazon S3, you process it based on your requirements. A typical requirement is to sync the data in Amazon S3 with the updates on the source systems. Although it’s easy to apply updates on a relational database management system (RDBMS) that backs an online source application, it’s tough to apply this change data capture (CDC) process on your data lakes. Apache Hudi is a good way to solve this problem. You can use Hudi on Amazon EMR to create Hudi tables (for more information, see Hudi in the Amazon EMR Release Guide).

This post introduces a pipeline that loads data and its ongoing changes (change data capture) from multiple single-tenant tables from different databases to a single multi-tenant table in an Amazon S3-backed data lake, simplifying data processing activities by creating multi-tenant datasets.

Architecture overview

At a high level, this architecture consolidates multiple single-tenant environments into a single multi-tenant dataset so data processing pipelines can be centralized. For example, suppose that your software offering has two tenants, each with their dedicated and isolated environment, and you want to maintain a single multi-tenant table that includes data of both tenants. Moreover, you want any ongoing replication (CDC) in the sources for tenant 1 and tenant 2 to be synchronized (compacted or reconciled) when an insert, delete, or update occurs in the source systems of the respective tenant.

In the past, to support record-level updates or inserts (called upserts) and deletes on an Amazon S3-backed data lake, you relied on either having an Amazon Redshift cluster or an Apache Spark job that reconciled the update, deletes, and inserts with existing historical data.

The architecture for our solution uses Hudi to simplify incremental data processing and data pipeline development by providing record-level insert, update, upsert, and delete capabilities. For more information, see Apache Hudi on Amazon EMR.

Moreover, the architecture for our solution uses the following AWS services:

  • AWS DMS – AWS DMS is a cloud service that makes it easy to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores. For more information, see What is AWS Database Migration Service?
  • AWS Step Functions – AWS Step Functions is a web service that enables you to coordinate the components of distributed applications and microservices using visual workflows. For more information, see What Is AWS Step Functions?
  • Amazon EMR – Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. For more information, see Overview of Amazon EMR Architecture and Overview of Amazon EMR.
  • Amazon S3 – Data is stored in Amazon S3, an object storage service with scalable performance, ease-of-use features, and native encryption and access control capabilities. For more details on Amazon S3, see Amazon S3 as the Data Lake Storage Platform.

Architecture deep dive

The following diagram illustrates our architecture.

This architecture relies on AWS Database Migration Service (AWS DMS) to transfer data from specific tables into an Amazon S3 location organized by tenant-id.

As AWS DMS performs the migration and the ongoing replication, also known as change data capture (CDC), it applies a data transformation that adds a custom column named tenant-id and populates it with the tenant-id value defined in the AWS DMS migration task configuration. The AWS DMS data transformations allow you to modify a schema, table, or column or, in this case, add a column with the tenant-id so data transferred to Amazon S3 is grouped by tenant-id.

AWS DMS is also configured to add an additional column with timestamp information. For a full load, each row of this timestamp column contains a timestamp for when the data was transferred from the source to the target by AWS DMS. For ongoing replication, each row of the timestamp column contains the timestamp for the commit of that row in the source database.

We use an AWS Step Functions workflow to move the files AWS DMS wrote to Amazon S3 into an Amazon S3 location that is organized by table name and holds all the tenant’s data. Files in this location all have the new column tenant-id, and the respective tenant-id value is configured in the AWS DMS task configuration.

Next, the Hudi DeltaStreamer utility runs on Amazon EMR to process the multi-tenant source data and create or update the Hudi dataset on Amazon S3.

You can pass to the Hudi DeltaStreamer utility a field in the data that has each record’s timestamp. The Hudi DeltaStreamer utility uses this to ensure records are processed in the proper chronological order. You can also provide the Hudi DeltaStreamer utility one or more SQL transforms, which the utility applies in a sequence as records are read and before the datasets are persisted on Amazon S3 as an Hudi Parquet dataset. We highlight the SQL transform later in this post.

Depending on your downstream consumption patterns, you might require a partitioned dataset. We discuss the process to choose a partition within Hudi DeltaStreamer later in this post. 

For this post, we use the Hudi DeltaStreamer utility instead of the Hudi DataSource due to its operational simplicity. However, you can also use Hudi DataSource with this pattern.

When to use and not use this architecture

This architecture is ideal for the workloads that are processed in batches and can tolerate the latency associated with the time required to capture the changes in the sources, write those changes into objects in Amazon S3, and run the Step Functions workflow that aggregates the objects per tenant and creates the multi-tenant Hudi dataset.

This architecture uses, and applies only to, Hudi COPY_ON_WRITE tables. It is not recommended for latency-sensitive applications and does not support MERGE_ON_READ tables. For more information, see the section Analyzing the properties provided to the command to run the Hudi DeltaStreamer utility.

This architecture pattern is also recommended for workloads in which updates to the same record (identified by a primary key) arrive no more frequently than once per microsecond, that is, at microsecond precision at most. The reason behind this is that Hudi uses a field, usually a timestamp, to break ties between records with the same key. If the field used to break ties can capture each individual update, data stored in the Hudi dataset on Amazon S3 is exactly the same as in the source system.

The precision of timestamp fields in your table depends on the database engine at the source. We strongly recommend that as you evaluate adopting this pattern, you also evaluate the support that AWS DMS provides for your current source engines, and understand the rate of updates in your source and the respective timestamp precision requirements that the source needs to support or currently supports. For example, AWS DMS writes any timestamp column values that are written to Amazon S3 as part of an ongoing replication with second precision if the data source is MySQL, and with microsecond precision if the data source is PostgreSQL. See the section Timestamp precision considerations for additional details.

This architecture assumes that all tables in every source database have the same schema and that any changes to a table’s schema is performed to each data source at the same time. Moreover, this architecture pattern assumes that any schema changes are backward compatible—you only append new fields and don’t delete any existing fields. Hudi supports schema evolutions that are backward compatible.

If you’re expecting constant schema changes to the sources, it might be beneficial to consider performing full snapshots instead of ingesting and processing the CDC. If performing full snapshots isn’t practical and you are expecting constant schema changes that are not compatible with Hudi’s schema evolution support, you can use the Hudi DataWriter API with your Spark jobs and address schema changes within the code by casting and adding columns as required to keep backward compatibility.

See the Schema evolution section for more details on the process for schema evolution with AWS DMS and Hudi.

Although it’s out of scope to evaluate the consumption tools available downstream to the Hudi dataset, you can consume Hudi datasets stored on Amazon S3 from Apache Hive, Spark, and Presto on Amazon EMR. Moreover, you can consume Hudi datasets stored on Amazon S3 from Amazon Redshift Spectrum and Amazon Athena.

Solution overview

This solution uses an AWS CloudFormation template to create the necessary resources.

You trigger the Step Functions workflow via the AWS Management Console. The workflow uses AWS Lambda for processing tasks that are part of the logic. Moreover, the workflow submits jobs to an EMR cluster configured with Hudi.

To perform the database migration, the CloudFormation template deploys one AWS DMS replication instance and configures two AWS DMS replications tasks, one per tenant. The AWS DMS replication tasks connect to the source data stores, read the source data, apply any transformations, and load the data into the target data store.

You access an Amazon SageMaker notebook to generate changes (updates) to the sources. Moreover, you connect into the Amazon EMR master node via AWS Systems Manager Session Manager to run Hive or Spark queries in the Hudi dataset backed by Amazon S3. Session Manager provides secure and auditable instance management without the need to open inbound ports, maintain bastion hosts, or manage SSH keys.

The following diagram illustrates the solution architecture.

The orchestration in this demo code currently supports processing at most 25 sources (tables within a database or distributed across multiple databases) per run. It does not prevent concurrent runs for the same tenant-id, database-name, and table-name triplet, because it doesn't keep track of which triplets are being processed or have already been processed; preventing concurrent runs would avoid duplication of work. Moreover, the orchestration in this demo code doesn't prevent the Hudi DeltaStreamer job from running with the output of both an AWS DMS full load task and an AWS DMS CDC load task. For production environments, we recommend that you keep track of the existing tenant_id values in the multi-tenant Hudi dataset. This way, if an existing AWS DMS replication task is mistakenly restarted to perform a full load instead of continuing the ongoing replication, your solution can adequately prevent any downstream impact to the datasets. We also recommend that you keep track of the schema changes in the source and guarantee that the Hudi DeltaStreamer utility only processes files with the same schema.

For details on considerations related to the Step Functions workflows, see Best Practices for Step Functions. For more information about considerations when running AWS DMS at scale, see Best practices for AWS Database Migration Service. Finally, for details on how to tune Hudi, see Performance and Tuning Guide.

Next, we walk you through several key areas of the solution.

Prerequisites

Before getting started, you must create an S3 bucket, unzip and upload the blog artifacts to that bucket, and store the database passwords in AWS Systems Manager Parameter Store.

Creating and storing admin passwords in AWS Systems Manager Parameter Store

This solution uses AWS Systems Manager Parameter Store to store the passwords used in the configuration scripts. With Parameter Store, you can create secure string parameters, which are parameters that have a plaintext parameter name and an encrypted parameter value. Parameter Store uses AWS Key Management Service (AWS KMS) to encrypt and decrypt the parameter values of secure string parameters. With Parameter Store, you improve your security posture by separating your data from your code and by controlling and auditing access at granular levels. There is no charge from Parameter Store to create a secure string parameter, but charges for using AWS KMS do apply. For information, see AWS Key Management Service pricing.

Before deploying the CloudFormation templates, run the following AWS Command Line Interface (AWS CLI) commands. These commands create Parameter Store parameters to store the passwords for the RDS master user for each tenant.

aws ssm put-parameter --name "/HudiStack/RDS/Tenant1Settings" --type SecureString --value "ch4ng1ng-s3cr3t" --region Your-AWS-Region

aws ssm put-parameter --name "/HudiStack/RDS/Tenant2Settings" --type SecureString --value "ch4ng1ng-s3cr3t" --region Your-AWS-Region

AWS DMS isn’t integrated with Parameter Store, so you still need to set the same password as in the CloudFormation template parameter DatabasePassword (see the following section).

Creating an S3 bucket for the solution and uploading the solution artifacts to Amazon S3

This solution uses Amazon S3 to store all artifacts used in the solution. Before deploying the CloudFormation templates, create an Amazon S3 bucket and download the artifacts required by the solution.

Unzip the artifacts and upload all folders and files in the .zip file to the S3 bucket you just created.

The following screenshot uses the root location hudistackbucket.

Keep a record of the Amazon S3 root path because you add it as a parameter to the CloudFormation template later.
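A minimal AWS CLI sketch of these steps follows, assuming the bucket name hudistackbucket used in this post and a downloaded archive named artifacts.zip (the archive name is an assumption).

# Create the solution bucket (bucket names are global; pick your own unique name)
aws s3 mb s3://hudistackbucket --region <your-aws-region>

# Unzip the downloaded artifacts locally, then upload all folders and files to the bucket
unzip artifacts.zip -d artifacts/
aws s3 cp artifacts/ s3://hudistackbucket/ --recursive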

Creating the CloudFormation stack

To launch the entire solution, choose Launch Stack:

The template requires the following parameters. You can accept the default values for any parameters not in the table. For the full list of parameters, see the CloudFormation template.

  • S3HudiArtifacts – The bucket name that holds the solution artifacts (Lambda function Code, Amazon EMR step artifacts, Amazon SageMaker notebook, Hudi job configuration file template). You created this bucket in the previous step. For this post, we use hudistackbucket.
  • DatabasePassword – The database password. This value needs to be the same as the one configured via Parameter Store. The CloudFormation template uses this value to configure the AWS DMS endpoints.
  • emrLogUri – The Amazon S3 location to store Amazon EMR cluster logs. For example, s3://replace-with-your-bucket-name/emrlogs/.
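If you prefer to launch the stack from the AWS CLI instead of the Launch Stack button, the following is a sketch; the stack name and template location are assumptions, and the parameter values come from the table above.

# Launch the solution stack with the parameters described above
aws cloudformation create-stack \
    --stack-name hudi-multitenant-blog \
    --template-url https://<bucket-with-template>.s3.amazonaws.com/<template-file>.yaml \
    --capabilities CAPABILITY_IAM \
    --parameters \
        ParameterKey=S3HudiArtifacts,ParameterValue=hudistackbucket \
        ParameterKey=DatabasePassword,ParameterValue=ch4ng1ng-s3cr3t \
        ParameterKey=emrLogUri,ParameterValue=s3://replace-with-your-bucket-name/emrlogs/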

Testing database connectivity

To test connectivity, complete the following steps:

  1. On the Amazon SageMaker Console, choose Notebook instances.
  2. Locate the notebook instance you created and choose Open Jupyter.
  3. In the new window, choose Runmev5.ipynb.

This opens the notebook for this post. We use the notebook to generate changes and updates to the databases used in the post.

  4. Run all cells of the notebook until the section Triggering the AWS DMS full load tasks for tenant 1 and tenant 2.

Analyzing the AWS DMS configuration

In this section, we examine the data transformation configuration and other AWS DMS configurations.

Data transformation configuration

To support the conversion from single-tenant to multi-tenant pipelines, the CloudFormation template applied a data transformation to the AWS DMS replication task. Specifically, the data transformation adds a new column named tenant_id to the Amazon S3 AWS DMS target. Adding the tenant_id column helps downstream activities organize the datasets per tenant_id. For additional details on how to set up AWS DMS data transformation, see Transformation Rules and Actions. For reference, the following code is the data transformation we use for this post:

{
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "salesdb",
                "table-name": "sales_order_detail"
            },
            "rule-action": "include",
            "filters": []
        },
        {
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "2",
            "rule-action": "add-column",
            "rule-target": "column",
            "object-locator": {
                "schema-name": "salesdb",
                "table-name": "sales_order_detail"
            },
            "value": "tenant_id",
            "expression": "1502",
            "data-type": {
                "type": "string",
                "length": 50
            }
        }
    ]
}

Other AWS DMS configurations

When using Amazon S3 as a target, AWS DMS accepts several configuration settings that provide control on how the files are written to Amazon S3. Specifically, for this use case, AWS DMS uses Parquet as the value for the configuration property DataFormat. For additional details on the S3 settings used by AWS DMS, see S3Settings. For reference, we use the following code:

DataFormat=parquet;TimestampColumnName=timestamp;

Timestamp precision considerations

By default, AWS DMS writes timestamp columns in a Parquet format with a microsecond precision, should the source engine support that precision. If the rate of updates you’re expecting is high, it’s recommended that you use a source that has support for microsecond precision, such as PostgreSQL.

Moreover, if the rate of updates is high, you might want to use a data source with microsecond precision and the AR_H_TIMESTAMP internal header column, which captures the timestamp of when the changes were made instead of the timestamp indicating the time of the commit. See Replicating source table headers using expressions for more details, specifically the details on the AR_H_TIMESTAMP internal header column. When you set TimestampColumnName=timestamp as we mention earlier, the new timestamp column captures the time of the commit.

If you need to use the AR_H_TIMESTAMP internal header column with a data source that supports microsecond precision such as PostgreSQL, we recommend using the Hudi DataSource writer job instead of the Hudi DeltaStreamer utility. The reason for this is that although the AR_H_TIMESTAMP internal header column (in a source that supports microsecond precision) has microsecond precision, the actual value written by AWS DMS on Amazon S3 has a nanosecond format (microsecond precision with the nanosecond dimension set to 0). By using the Hudi  DataSource writer job, you can convert the AR_H_TIMESTAMP internal header column to a timestamp datatype in Spark with microsecond precision and use that new value as the PARTITIONPATH_FIELD_OPT_KEY. See Datasource Writer for more details.

Triggering the AWS DMS full load tasks for Tenant 1 and Tenant 2

In this step, we run a full load of data from both databases to Amazon S3 using AWS DMS. To accomplish this, perform the following steps:

  1. On the AWS DMS console, under Migration, choose Database migration tasks.
  2. Select the replication task for Tenant 1 (dmsreplicationtasksourcetenant1-xxxxxxxxxxxxxxx).
  3. From the Actions menu, choose Restart/Resume.
  4. Repeat these steps for the Tenant 2 replication task.

You can monitor the progress of this task by choosing the task link.
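The same tasks can be started from the AWS CLI; the following is a sketch with placeholder task ARNs.

# Start the full load task for each tenant
# (use start-replication-task-type start-replication for the first run, or reload-target to restart a task)
aws dms start-replication-task \
    --replication-task-arn arn:aws:dms:<region>:<account-id>:task:<tenant1-task-id> \
    --start-replication-task-type start-replication

aws dms start-replication-task \
    --replication-task-arn arn:aws:dms:<region>:<account-id>:task:<tenant2-task-id> \
    --start-replication-task-type start-replication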

Triggering the Step Functions workflow

Next, we start a Step Functions workflow that automates the end-to-end process of processing the files loaded by AWS DMS to Amazon S3 and creating a multi-tenant Amazon S3-backed table using Hudi.

To trigger the Step Functions workflow, perform the following steps:

  1. On the Step Functions console, choose State machines.
  2. Choose the MultiTenantProcessing workflow.
  3. In the new window, choose Start execution.
  4. Edit the following JSON code and replace the values as needed. You can find the emrClusterId on the Outputs tab of the CloudFormation stack.
{
  "hudiConfig": {
    "emrClusterId": "[REPLACE]",
    "targetBasePath": "s3://hudiblog-[REPLACE-WITH-YOUR-ACCOUNT-ID]/transformed/multitenant/huditables/sales_order_detail_mt",
    "targetTable": "sales_order_detail_mt",
    "sourceOrderingField": "timestamp",
    "blogArtifactBucket": "[REPLACE-WITH-BUCKETNAME-WITH-BLOG-ARTIFACTS]",
    "configScriptPath": "s3://[REPLACE-WITH-BUCKETNAME-WITH-BLOG-ARTIFACTS]/emr/copy_apache_hudi_deltastreamer_command_properties.sh",
    "propertiesFilename": "dfs-source.properties"
  },
  "copyConfig": {
    "srcBucketName": "hudiblog-[REPLACE-WITH-YOUR-ACCOUNT-ID]",
    "srcPrefix": "raw/singletenant/",
    "destBucketName": "hudiblog-[REPLACE-WITH-YOUR-ACCOUNT-ID]",
    "destPrefix": "raw/multitenant/salesdb/sales_order_detail_mt/"
  },
  "sourceConfig": {
    "databaseName": "salesdb",
    "tableName": "sales_order_detail"
  },
  "workflowConfig": {
    "ddbTableName": "WorkflowTimestampRegister",
    "ddbTimestampFieldName": "T"
  },
  "tenants": {
    "array": [
      {
        "tenantId": "1502"
      },
      {
        "tenantId": "1501"
      }
    ]
  }
}
  5. Submit the edited JSON as the input to the workflow.
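You can also start the workflow from the AWS CLI; the following sketch assumes the edited JSON is saved locally as input.json and uses a placeholder state machine ARN.

# Start the MultiTenantProcessing workflow with the edited input document
aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:<region>:<account-id>:stateMachine:MultiTenantProcessing \
    --input file://input.json

# Check the status of the most recent execution
aws stepfunctions list-executions \
    --state-machine-arn arn:aws:states:<region>:<account-id>:stateMachine:MultiTenantProcessing \
    --max-results 1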

If you scroll down, you should see an ExecutionSucceeded message in the last line of the event history (see the following screenshot).

  6. On the Amazon S3 console, search for the bucket name used in this post (hudiblog-[account-id]) and then for the prefix raw/multitenant/salesdb/sales_order_detail_mt/.

You should see two files.

  7. Navigate to the prefix transformed/multitenant/salesdb/sales_order_detail_mt/.

You should see the Hudi table Parquet format.

Analyzing the properties provided to the command to run the Hudi DeltaStreamer utility

If the MultitenantProcessing workflow was successful, the files that AWS DMS loaded into Amazon S3 are now available in a multi-tenant table on Amazon S3. This table is now ready to process changes to the databases for each tenant.

In this step, we go over the command the workflow triggers to create a table with Hudi.

The Step Functions workflow for this post runs all the steps except the tasks in the Amazon SageMaker notebook, which you trigger manually. The following section is just for your reference and discussion purposes.

On the Amazon EMR console, choose the cluster created by the CloudFormation and choose the Steps view of the cluster to obtain the full command used by the workflow.

The command has two sets of properties: one defined directly in the command, and another defined on a configuration file, dfs-source.properties, which is updated automatically by the Step Functions workflow.

The following are some of the properties defined directly in the command:

  • –table-type – The Hudi table type, for this use case, COPY_ON_WRITE. The reason COPY_ON_WRITE is preferred for this use case relates to the fact that the ingestion is done in batch mode, access to the changes in the data are not required in real time, and the downstream workloads are read-heavy. Moreover, with this storage type, you don’t need to handle compactions because updates create a new Parquet file with the impacted rows being updated. Given that the ingestion is done in batch mode, using the COPY_ON_WRITE table type efficiently keeps track of the latest record change, triggering a new write to update the Hudi dataset with the latest value of the record.
    • This post requires that you use the COPY_ON_WRITE table type.
    • For reference, if your requirement is to ingest write- or change-heavy workloads and make the changes available as fast as possible for downstream consumption, Hudi provides the MERGE_ON_READ table type. In this table type, data is stored using a combination of columnar (Parquet) and row-based (Avro) formats. Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files. For more details on the two table types provided by Hudi, see Understanding Dataset Storage Types: Copy on Write vs. Merge on Read and Considerations and Limitations for Using Hudi on Amazon EMR.
  • –source-ordering-field – The field in the source dataset that the utility uses to order the records. For this use case, we configured AWS DMS to add a timestamp column to the data. The utility uses that column to order the records and break ties between records with the same key. This field needs to exist in the data source and can’t be the result of a transformation.
  • –source-class – AWS DMS writes to the Amazon S3 destination in Apache Parquet. Use org.apache.hudi.utilities.sources.ParquetDFSSource as the value for this property.
  • –target-base-path – The destination base path that the utility writes.
  • –target-table – The table name of the table the utility writes to.
  • –transformer-class – This property indicates the transformer classes that the utility applies to the input records. For this use case, we use the AWSDmsTransformer plus the SqlQueryBasedTransformer. The transformers are applied in the order they are identified in this property.
  • –payload-class – Set to org.apache.hudi.payload.AWSDmsAvroPayload.
  • –props – The path of the file with additional configurations. For example, file:///home/hadoop/dfs-source.properties.
    • The file /home/hadoop/dfs-source.properties has additional configurations passed to Hudi DeltaStreamer. You can view that file by logging in to your Amazon EMR master node and running cat /home/hadoop/dfs-source.properties.
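Putting these command-line properties together, an EMR step that runs the utility might look like the following sketch under stated assumptions: the jar location and S3 path are assumptions, and the exact command your stack uses is visible in the Steps view of the cluster.

# Sketch of a Hudi DeltaStreamer step using the properties discussed above
spark-submit \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
    /usr/lib/hudi/hudi-utilities-bundle.jar \
    --table-type COPY_ON_WRITE \
    --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
    --source-ordering-field timestamp \
    --target-base-path s3://hudiblog-<your-account-id>/transformed/multitenant/huditables/sales_order_detail_mt \
    --target-table sales_order_detail_mt \
    --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer,org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
    --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
    --props file:///home/hadoop/dfs-source.properties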

The following code is the configuration file. By setting the properties in the file, we configure Hudi to write the dataset in a partitioned way and to apply a SQL transform before persisting the dataset into the Amazon S3 location.

===Start of Configuration File ===
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.datasource.write.partitionpath.field=tenant_id,year,month
hoodie.deltastreamer.transformer.sql=SELECT a.timestamp, a.line_id, a.line_number, a.order_id, a.product_id,  a.quantity, a.unit_price, a.discount, a.supply_cost, a.tax, string(year(to_date(a.order_date))) as year, string(month(to_date(a.order_date))) as month, a.Op, a.tenant_id FROM <SRC> a
hoodie.datasource.write.recordkey.field=tenant_id,line_id
hoodie.datasource.write.hive_style_partitioning=true
# DFS Source
hoodie.deltastreamer.source.dfs.root=s3://hudiblog-xxxxxxxxxxxxxx/raw/multitenant/salesdb/sales_order_detail_mt/xxxxxxxxxxxxxxxxxxxxx
===End of Configuration File ===

Some configurations in this file include the following:

  • hoodie.datasource.write.operation=upsert – This property defines if the write operation is an insert, upsert, or bulkinsert. If you have a large initial import, use bulkinsert to load new data into a table, and on the next loads use upsert or insert. The default value for this property is upsert. For this post, the default is accepted because the dataset is small. When you run the solution with larger datasets, you can perform the initial import with bulkinsert and then use upsert for the next loads. For more details on the three modes, see Write Operations.
  • hoodie.datasource.write.hive_style_partitioning=true – This property generates Hive style partitioning—partitions of the form partition_key=partition_values. See the property hoodie.datasource.write.partitionpath.field for more details.
  • hoodie.datasource.write.partitionpath.field=tenant_id,year,month – This property identifies the fields that Hudi uses to extract the partition fields.
  • hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator – This property allows you to combine multiple fields and use the combination of fields as the record key and partition key.
  • hoodie.datasource.write.recordkey.field=tenant_id,line_id – This property indicates the fields in the dataset that Hudi uses to identify a record key. The source table has line_id as the primary key. Given that the Hudi dataset is multi-tenant, tenant_id is also part of the record key.
  • hoodie.deltastreamer.source.dfs.root=s3://hudiblog-your-account-id/raw/multitenant/salesdb/sales_order_detail_mt/xxxxxxxxx – This property indicates the Amazon S3 location with the source files that the Hudi DeltaStreamer utility consumes. This is the location that the MultiTenantProcessing state machine created and includes files from both tenants.
  • hoodie.deltastreamer.transformer.sql=SELECT a.timestamp, a.line_id, a.line_number, a.order_id, a.product_id, a.quantity, a.unit_price, a.discount, a.supply_cost, a.tax, string(year(to_date(a.order_date))) as year, string(month(to_date(a.order_date))) as month, a.Op, a.tenant_id FROM <SRC> a – This property indicates that the Hudi DeltaStreamer applies a SQL transform before writing the records as a Hudi dataset in Amazon S3. For this use case, we create new fields from the RDBMS table’s field order_date.

A change to the schema of the source RDBMS tables requires the respective update to the SQL transformations. As mentioned before, this use case requires that schema changes to a source schema occur in every table for every tenant.

For additional details on Hudi, see Apply record level changes from relational databases to Amazon S3 data lake using Hudi on Amazon EMR and AWS DMS.

Although the Hudi multi-tenant table is partitioned, you should only have one job (Hudi DeltaStreamer utility or Spark data source) writing to the Hudi dataset. If you’re expecting specific tenants to produce more changes than others, you can consider prioritizing some tenants over others or use dedicated tables for the most active tenants to avoid any impact to tenants that produce a smaller amount of changes.

Schema evolution

Hudi supports schema evolutions that are backward compatible—you only append new fields and don’t delete any existing fields.

By default, Hudi handles schema changes of this type by appending the new fields to the end of the table.

New fields that you add have to either be nullable or have a default value. For example, as you add a new field to the source database, the records that generate a change have a value for that field in the Hudi dataset, but older records have a null value for that same field in the dataset.

If you require schema changes that are not supported by Hudi, you need to use either a SQL transform or the Hudi DataSource API to handle those changes before writing to the Hudi dataset. For example, if you need to delete a field from the source, you need to use a SQL transform before writing the Hudi dataset to ensure the deleted column is populated by a constant or dummy value, or use the Hudi DataSource API to do so.

Moreover, AWS DMS with Amazon S3 targets support only the following DDL commands: Truncate Table, Drop Table, and Create Table. See Limitations to using Amazon S3 as a target for more details.

This means that when you issue an Alter Table command, the AWS DMS replication tasks don’t capture those changes until you restart the task.

As you implement this architecture pattern, it’s recommended that you automate the schema evolution process and apply the schema changes in a maintenance window when the source isn’t serving requests and the AWS DMS replication CDC tasks aren’t running.

Simulating random updates to the source databases

In this step, you perform some random updates to the data. Navigate back to the Runmev5.ipynb Jupyter notebook and run the cells under the section Simulate random updates for tenant 1 and tenant 2.

Triggering the Step Functions workflow to process the ongoing replication

In this step, you rerun the MultitenantProcessing workflow to process the files generated during the ongoing replication.

  1. On the Step Functions console, choose State machines.
  2. Choose the MultiTenantProcessing workflow.
  3. In the new window, choose Start execution.
  4. Use the same JSON as the JSON used when you first ran the workflow.
  5. Submit the edited JSON as the input to the workflow.

Querying the Hudi multi-tenant table with Spark

To query your table with Spark, complete the following steps:

  1. On the Session Manager console, select the instance HudiBlog Spark EMR Cluster.
  2. Choose Start session.

Session Manager creates a secure session to the Amazon EMR master node.

  3. Switch to the Hadoop user and then to the home directory of the Hadoop user.

You’re now ready to run the commands described in the next sections.

  4. Run the following command in the command line:
    spark-shell  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar 


  5. When the spark-shell is available, enter the following code:
    import scala.collection.JavaConversions._
    import org.apache.spark.sql.SaveMode._
    import org.apache.hudi.DataSourceReadOptions._
    
    val tableName = "sales_order_detail_mt"
    
    val basePath = "s3://hudiblog-[REPLACE-WITH-YOUR-ACCOUNT-ID]/transformed/multitenant/huditables/"+tableName+"/"
    
    spark.read.format("org.apache.hudi").load(basePath + "/*/*/*/*").createOrReplaceTempView(tableName)
    
    val sqlDF = spark.sql("SELECT * FROM "+tableName)
    sqlDF.show(20,false)

Updating the source table schema

In this section, you change the schema of the source tables.

  1. On the AWS DMS management console, stop the AWS DMS replication tasks for Tenant 1 and Tenant 2.
  2. Make sure the Amazon SageMaker notebook isn’t writing to the database.
  3. Navigate back to the Jupyter notebook and run the cells under the section Updating the source table schema.
  4. On the AWS DMS console, resume (not restart) the AWS DMS replication tasks for both tenants.
  5. Navigate back to the Jupyter notebook and run the cells under the section Simulate random updates for tenant 1 after the schema change.
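The distinction between resuming and restarting in step 4 matters: from the AWS CLI, it corresponds to the start type you pass, as in the following sketch (the task ARN is a placeholder).

# Resume the ongoing replication (CDC) without reloading the target
aws dms start-replication-task \
    --replication-task-arn arn:aws:dms:<region>:<account-id>:task:<tenant1-task-id> \
    --start-replication-task-type resume-processing

# By contrast, reload-target restarts the task with a full load and should be avoided here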

Analyzing the changes to the configuration file for the Hudi DeltaStreamer utility

Because there was a change to the schema of the sources and a new field needs to be propagated to the Hudi dataset, the property hoodie.deltastreamer.transformer.sql is updated with the following value:

hoodie.deltastreamer.transformer.sql=SELECT a.timestamp, a.line_id, a.line_number, a.order_id, a.product_id,  a.quantity, a.unit_price, a.discount, a.supply_cost, a.tax, string(year(to_date(a.order_date))) as year, string(month(to_date(a.order_date))) as month, a.Op, a.tenant_id, a.critical FROM <SRC> a

The field a.critical is added to the end, after a.tenant_id.

Triggering the Step Functions workflow to process the ongoing replication after the schema change

In this step, you rerun the MultitenantProcessing workflow to process the files produced by AWS DMS during the ongoing replication.

  1. On the Step Functions console, choose State machines.
  2. Choose the MultiTenantProcessing workflow.
  3. In the new window, choose Start execution.
  4. Update the JSON document used when you first ran the workflow by replacing the field propertiesFilename with the following value:

    "propertiesFilename": "dfs-source-new_schema.properties"

  5. Submit the edited JSON as the input to the workflow.

Querying the Hudi multi-tenant table after the schema change

We can now query the multi-tenant table again.

Using Hive

If you’re using Hive, when the workflow completes, go back to the terminal window opened by Session Manager and run the following hive-sync command:

/usr/lib/hudi/bin/run_sync_tool.sh --table sales_order_detail_mt --partitioned-by tenant_id,year,month --database default --base-path s3://hudiblog-[REPLACE-WITH-YOUR-ACCOUNT-ID]/transformed/multitenant/huditables/sales_order_detail_mt --jdbc-url jdbc:hive2://localhost:10000 --user hive --partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor --pass passnotenforced

For this post, we run the hive-sync command manually. You can also add the flag --enable-hive-sync to the Hudi DeltaStreamer utility command shown in the section Analyzing the properties provided to the command to run the Hudi DeltaStreamer utility.

After the hive-sync updates the table schema, the new column is visible from Hive. Start Hive in the command line and run the following query:

select `timestamp`, quantity, critical,  order_id, tenant_id from sales_order_detail_mt where order_id=4668 and tenant_id=1502;

Using Spark

To read the Hudi datasets on Amazon S3 with Spark after the schema change, you can reuse the code from the section Querying the Hudi multi-tenant table with Spark and add the following line:

spark.conf.set("spark.sql.parquet.mergeSchema", "true")

Cleaning up

To avoid incurring future charges, stop the AWS DMS replication tasks, delete the contents in the S3 bucket for the solution, and delete the CloudFormation stack.

Conclusion

This post explained how to use AWS DMS, Step Functions, and Hudi on Amazon EMR to convert a single-tenant pipeline to a multi-tenant pipeline.

With this solution, software offerings that use dedicated sources for each tenant can offload common tasks across all tenants to a pipeline backed by Hudi datasets stored on Amazon S3. By using Hudi on Amazon EMR, you can apply inserts, updates, and deletes from the source databases to the datasets in Amazon S3, and you can support backward-compatible schema evolution.

Follow these steps and deploy the CloudFormation templates in your account to further explore the solution. If you have any questions or feedback, please leave a comment.

The author would like to thank Radhika Ravirala and Neil Mukerje for the dataset and the functions on the notebook.


About the Author

Francisco Oliveira is a senior big data solutions architect with AWS. He focuses on building big data solutions with open source technology and AWS. In his free time, he likes to try new sports, travel and explore national parks.

Discovering sensitive data in AWS CodeCommit with AWS Lambda

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/discovering-sensitive-data-in-aws-codecommit-with-aws-lambda-2/

This post is courtesy of Markus Ziller, Solutions Architect.

Today, git is a de facto standard for version control in modern software engineering. The workflows enabled by git's branching capabilities are a major reason for this. However, because git is distributed, it can be difficult to reliably remove committed changes from every copy of a repository. This is problematic when secrets such as API keys are accidentally committed into version control. The longer it takes to identify and remove secrets from git, the more likely it is that the secret has already been checked out by another user.

This post shows a solution that automatically identifies credentials pushed to AWS CodeCommit in near-real-time. I also show three remediation measures that you can use to reduce the impact of secrets pushed into CodeCommit:

  • Notify users about the leaked credentials.
  • Lock the repository for non-admins.
  • Hard reset the CodeCommit repository to a healthy state.

I use the AWS Cloud Development Kit (CDK). This is an open source software development framework to model and provision cloud application resources. Using the CDK can reduce the complexity and amount of code needed to automate the deployment of resources.

Overview of solution

The services in this solution are AWS Lambda, AWS CodeCommit, Amazon EventBridge, and Amazon SNS. These services are part of the AWS serverless platform. They help reduce undifferentiated work around managing servers, infrastructure, and the parts of the application that add less value to your customers. With serverless, the solution scales automatically, has built-in high availability, and you only pay for the resources you use.

Solution architecture

This diagram outlines the workflow implemented in this blog:

  1. After a developer pushes changes to CodeCommit, it emits an event to an event bus.
  2. A rule defined on the event bus routes this event to a Lambda function.
  3. The Lambda function uses the AWS SDK for JavaScript to get the changes introduced by commits pushed to the repository.
  4. It analyzes the changes for secrets. If secrets are found, it publishes another event to the event bus.
  5. Rules associated with this event type then trigger invocations of three Lambda functions A, B, and C with information about the problematic changes.
  6. Each of the Lambda functions runs a remediation measure:
    • Function A sends out a notification to an SNS topic that informs users about the situation (A1).
    • Function B locks the repository by setting a tag with the AWS SDK (B2). It sends out a notification about this action (B1).
    • Function C runs git commands that remove the problematic commit from the CodeCommit repository (C2). It also sends out a notification (C1).

Walkthrough

The following walkthrough explains the required components, their interactions, and how the provisioning can be automated with the CDK.

For this walkthrough, you need:

Check out and deploy the sample stack:

  1. After completing the prerequisites, clone the associated GitHub repository by running the following command in a local directory:
    git clone git@github.com:aws-samples/discover-sensitive-data-in-aws-codecommit-with-aws-lambda.git
  2. Open the repository in a local editor and review the contents of cdk/lib/resources.ts, src/handlers/commits.ts, and src/handlers/remediations.ts.
  3. Follow the instructions in the README.md to deploy the stack.

The CDK will deploy resources for the following services in your account.

Using CodeCommit to manage your git repositories

The CDK creates a new empty repository called TestRepository and adds a tag RepoState with an initial value of ok. You later use this tag in the LockRepo remediation strategy to restrict access.

It also creates two IAM groups with one user in each. Members of the CodeCommitSuperUsers group are always able to access the repository, while members of the CodeCommitUsers group can only access the repository when the value of the tag RepoState is not locked.

I also import the CodeCommitSystemUser into the CDK. Since the user requires git credentials in a downloaded CSV file, it cannot be created by the CDK. Instead it must be created as described in the README file.

The following CDK code sets up all the described resources:

const TAG_NAME = "RepoState";

const superUsers = new Group(this, "CodeCommitSuperUsers", { groupName: "CodeCommitSuperUsers" });
superUsers.addUser(new User(this, "CodeCommitSuperUserA", {
    password: new Secret(this, "CodeCommitSuperUserPassword").secretValue,
    userName: "CodeCommitSuperUserA"
}));

const users = new Group(this, "CodeCommitUsers", { groupName: "CodeCommitUsers" });
users.addUser(new User(this, "User", {
    password: new Secret(this, "CodeCommitUserPassword").secretValue,
    userName: "CodeCommitUserA"
}));

const systemUser = User.fromUserName(this, "CodeCommitSystemUser", props.codeCommitSystemUserName);

const repo = new Repository(this, "Repository", {
    repositoryName: "TestRepository",
    description: "The repository to test this project out",
});
Tags.of(repo).add(TAG_NAME, "ok");

users.addToPolicy(new PolicyStatement({
    effect: Effect.ALLOW,
    actions: ["*"],
    resources: [repo.repositoryArn],
    conditions: {
        StringNotEquals: {
            [`aws:ResourceTag/${TAG_NAME}`]: "locked"
        }
    }
}));

superUsers.addToPolicy(new PolicyStatement({
    effect: Effect.ALLOW,
    actions: ["*"],
    resources: [repo.repositoryArn]
}));

Using EventBridge to pass events between components

I use EventBridge, a serverless event bus, to connect the Lambda functions together. Many AWS services like CodeCommit are natively integrated into EventBridge and publish events about changes in their environment.

repo.onCommit is a higher-level CDK construct. It creates the required resources to invoke a Lambda function for every commit to a given repository. The created event rule looks like this:

EventBridge rule definition

Note that this event rule only matches commit events in TestRepository. To send commits of all repositories in that account to the inspecting Lambda function, remove the resources filter in the event pattern.
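For reference, a rule with an explicit event pattern that is roughly equivalent to the one repo.onCommit generates could look like the following sketch. The exact pattern produced by the construct may differ slightly; removing the resources entry widens the rule to all repositories in the account.

// Sketch only: an explicit rule on the default event bus that matches pushes
// to TestRepository. The pattern approximates what repo.onCommit creates.
new Rule(this, "ExplicitCommitRule", {
    ruleName: "CallLambdaOnTestRepositoryCommit",
    eventPattern: {
        source: ["aws.codecommit"],
        detailType: ["CodeCommit Repository State Change"],
        resources: [repo.repositoryArn], // remove this line to match all repositories
        detail: {
            event: ["referenceCreated", "referenceUpdated"]
        }
    },
    targets: [new targets.LambdaFunction(commitInspectLambda)]
});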

CodeCommit Repository State Change is a default event that is published by CodeCommit if changes are made to a repository. In addition, I define CodeCommit Security Event, a custom event, which Lambda publishes to the same event bus if secrets are discovered in the inspected code.

The sample below shows how you can set up Lambda functions as targets for both types of events.

const DETAIL_TYPE = "CodeCommit Security Event";
const eventBus = new EventBus(this, "CodeCommitEventBus", {
    eventBusName: "CodeCommitSecurityEvents"
});

repo.onCommit("AnyCommitEvent", {
    ruleName: "CallLambdaOnAnyCodeCommitEvent",
    target: new targets.LambdaFunction(commitInspectLambda)
});


new Rule(this, "CodeCommitSecurityEvent", {
    eventBus,
    enabled: true,
    ruleName: "CodeCommitSecurityEventRule",
    eventPattern: {
        detailType: [DETAIL_TYPE]
    },
    targets: [
        new targets.LambdaFunction(lockRepositoryLambda),
        new targets.LambdaFunction(raiseAlertLambda),
        new targets.LambdaFunction(forcefulRevertLambda)
    ]
});

Using Lambda functions to run remediation measures

AWS Lambda functions allow you to run code in response to events. The example defines four Lambda functions.

By comparing a commit with its predecessor, the commitInspectLambda function analyzes whether the commit introduces secrets. With the CDK, you can create a Lambda function with:

const myLambdaInCDK = new Function(this, "UniqueIdentifierRequiredByCDK", {
    runtime: Runtime.NODEJS_12_X,
    handler: "<handlerfile>.<function name>",
    code: Code.fromAsset(path.join(__dirname, "..", "..", "src", "handlers")),
    // See git repository for complete code
});

The code for this Lambda function uses the AWS SDK for JavaScript to fetch the details of the commit, the differences introduced, and the new content.

The code checks each modified file line by line with a regular expression that matches typical secret formats. In src/handlers/regex.json, I provide a few regular expressions that match common secrets. You can extend this with your own patterns.
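A minimal sketch of this check with the AWS SDK for JavaScript could look like the following. It is an illustration, not the repository's actual handler: the helper signature and the regular expressions below are assumptions made for this example.

// Sketch only: fetch the changes introduced by a pushed commit and scan them
// line by line for secret-like strings. The patterns below are illustrative;
// the sample project keeps its patterns in src/handlers/regex.json.
import { CodeCommit } from "aws-sdk";

const codecommit = new CodeCommit();

const PATTERNS: RegExp[] = [
    /AKIA[0-9A-Z]{16}/,                      // shape of an AWS access key ID
    /-----BEGIN (RSA )?PRIVATE KEY-----/,    // PEM private key header
    /(password|secret)\s*[:=]\s*\S+/i        // generic key/value credential
];

export async function scanCommit(repositoryName: string, commitId: string): Promise<string[]> {
    const findings: string[] = [];

    // Determine the parent commit so the diff covers only the pushed change.
    const { commit } = await codecommit.getCommit({ repositoryName, commitId }).promise();
    const parent = commit.parents && commit.parents[0];

    const diff = await codecommit.getDifferences({
        repositoryName,
        beforeCommitSpecifier: parent,
        afterCommitSpecifier: commitId
    }).promise();

    for (const difference of diff.differences || []) {
        if (!difference.afterBlob) {
            continue; // deleted file, nothing new to scan
        }
        const blob = await codecommit.getBlob({
            repositoryName,
            blobId: difference.afterBlob.blobId as string
        }).promise();

        const lines = (blob.content as Buffer).toString("utf-8").split("\n");
        lines.forEach((line, index) => {
            if (PATTERNS.some((pattern) => pattern.test(line))) {
                findings.push(`${difference.afterBlob!.path}:${index + 1}`);
            }
        });
    }
    return findings;
}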

If a secret is discovered, a CodeCommit Security Event is published to the event bus. EventBridge then invokes all Lambda functions that are registered as targets with this event. This demo triggers three remediation measures.
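How that custom event could be published is sketched below, again with the AWS SDK for JavaScript. The Source value and the Detail payload are assumptions for illustration; only the DetailType and the event bus name have to match the rule shown earlier.

// Sketch only: publish a custom "CodeCommit Security Event" to the custom bus.
// The rule above matches on detail-type, so Source and Detail are free-form here.
import { EventBridge } from "aws-sdk";

const eventBridge = new EventBridge();

export async function publishSecurityEvent(repositoryArn: string, findings: string[]): Promise<void> {
    await eventBridge.putEvents({
        Entries: [{
            EventBusName: "CodeCommitSecurityEvents",        // bus created by the CDK stack
            Source: "custom.codecommit.inspector",           // assumed source name
            DetailType: "CodeCommit Security Event",         // must match DETAIL_TYPE in the rule
            Detail: JSON.stringify({ repositoryArn, findings })
        }]
    }).promise();
}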

The raiseAlertLambda function uses the AWS SDK for JavaScript to send out a notification to all subscribers (that is, CodeCommit administrators) on an SNS topic. It takes no further action.

SNS.publish({
    TopicArn: <TOPIC_ARN>,
    Subject: `[ACTION REQUIRED] Secrets discovered in <repo>`,
    Message: `<Your message>`
})

Notification about secrets discovered in a commit in TestRepository

The lockRepositoryLambda function uses the AWS SDK for JavaScript to change the RepoState tag from ok to locked. This restricts access to members of the CodeCommitSuperUsers IAM group.

CodeCommit.tagResource({
    resourceArn: event.detail.repositoryArn,
    tags: {
        RepoState: "locked"
    }
})

In addition, the Lambda function uses SNS to send out a notification. The forcefulRevertLambda function runs the following git commands:

git clone <repository>
git checkout <branch>
git reset --hard <previousCommitId>
git push origin <branch> --force

These commands reset the repository to the last accepted commit, by forcefully removing the respective commit from the git history of your CodeCommit repo. I advise you to handle this with care and only activate it on a real project if you fully understand the consequences of rewriting git history.
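As an illustration only (not the actual handler in the repository), the commands could be run from within the function roughly as follows, assuming a git binary is available on the function's PATH (added via a layer, as described next) and that credentials for the CodeCommitSystemUser are configured:

// Sketch only: run the hard reset from inside a Lambda handler. Assumes a git
// binary on the PATH (for example, from a Lambda layer) and configured credentials.
import { execSync } from "child_process";

export function hardReset(cloneUrl: string, branch: string, previousCommitId: string): void {
    execSync(`git clone ${cloneUrl} repo`, { cwd: "/tmp", stdio: "inherit" });

    const repoOptions = { cwd: "/tmp/repo", stdio: "inherit" as const };
    execSync(`git checkout ${branch}`, repoOptions);
    execSync(`git reset --hard ${previousCommitId}`, repoOptions);
    execSync(`git push origin ${branch} --force`, repoOptions);
}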

The Node.js v12 runtime for Lambda does not include a git binary by default. You can add one by using the git-lambda2 Lambda layer, which allows you to run git commands from within the Lambda function.
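With the CDK, attaching such a layer to the revert function could look like the following sketch; the layer ARN is a placeholder that you replace with the git-lambda2 ARN for your Region.

// Sketch only: attach a layer that provides a git binary to the revert function.
// The ARN is a placeholder; use the published git-lambda2 ARN for your Region.
const gitLayer = LayerVersion.fromLayerVersionArn(
    this,
    "GitLayer",
    "arn:aws:lambda:<REGION>:<LAYER-ACCOUNT-ID>:layer:git-lambda2:<VERSION>"
);

forcefulRevertLambda.addLayers(gitLayer);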

Logs for the remediation measure Hard Reset

Finally, this Lambda function also sends out a notification. The complete code is available in the GitHub repo.

Using SNS to notify users

To notify users about secrets discovered and actions taken, you create an SNS topic and subscribe to it via email.

const topic = new Topic(this, "CodeCommitSecurityEventNotification", {
    displayName: "CodeCommitSecurityEventNotification",
});

topic.addSubscription(new subs.EmailSubscription(/* your email address */));

Testing the solution

You can test the deployed solution by running these two sets of commands. First, add a file with no credentials:

echo "Clean file - no credentials here" > clean_file.txt
git add clean_file.txt
git commit clean_file.txt -m "Adds clean_file.txt"
git push

Then add a file containing credentials:

SECRET_LIKE_STRING=$(cat /dev/urandom | env LC_CTYPE=C tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1)
echo "secret=$SECRET_LIKE_STRING" > problematic_file.txt
git add problematic_file.txt
git commit problematic_file.txt -m "Adds secret-like string to problematic_file.txt"
git push

The first set of commands creates, commits, and pushes an unproblematic file, clean_file.txt, which passes the checks of commitInspectLambda. The second set creates, commits, and pushes problematic_file.txt, which matches the regular expressions and triggers the remediation measures.

If you check your email, you soon receive notifications about actions taken by the Lambda functions.

Cleaning up

To avoid incurring charges, delete the resources by running cdk destroy and confirming the deletion.

Conclusion

This post demonstrates how you can implement a solution to discover secrets in commits to AWS CodeCommit repositories. It also defines different strategies to remediate this.

The CDK code to set up all components is minimal and can be extended for remediation measures. The template is portable between Regions and uses serverless technologies to minimize cost and complexity.

For more serverless learning resources, visit Serverless Land.

Security updates for Monday

Post Syndicated from original https://lwn.net/Articles/841653/rss

Security updates have been issued by Debian (chromium, dovecot, flac, influxdb, libhibernate3-java, and p11-kit), Fedora (ceph and guacamole-server), Mageia (audacity, gdm, libxml2, rawtherapee, and vlc), openSUSE (jetty-minimal and privoxy), Red Hat (kernel and kernel-rt), SUSE (gimp), and Ubuntu (libproxy).
