Post Syndicated from Home Assistant original https://www.youtube.com/watch?v=99lGuB4J-4o
Невидимата децентрализация в България
Post Syndicated from Александър Нуцов original https://www.toest.bg/nevidimata-detsentralizatsiya-v-bulgaria/

Кой движи света напред? Кой води войните? Къде е концентрирана властта? В индивида, държавата, наднационалните институции? В политическите науки и международните отношения това са философски въпроси, от чиито корени черпят своето начало различни научни течения и школи. Но едва ли има единствен, цялостен и правилен отговор на тях. Докато хората на власт изпълват управленските структури със своята идентичност, позиции и идеи, същите структури влияят на хората в тях и ги ограничават с установените си рамки, инструменти и норми на съществуване.
Безспорно обаче най-голям властови ресурс – административен, финансов и военен – е съсредоточен в националната държава. Държавата съществува със своите структури, модели на поведение и „инстинкти“. Подобно на човека, първичното стремление на всяка власт е самосъхранението и възпроизвеждането. Държавите и техните структури също като нас се адаптират към променящата се среда и трансформират елементи от себе си, за да оцелеят. Така историческите събития ги подтикват да делегират част от правомощията си нагоре, към наднационални институции като ООН и Европейския съюз, и надолу – към местната власт, гражданите и частния сектор.
Вълни на децентрализация
Процеси по децентрализация текат навсякъде по света още от времето на Студената война. Първоначално те се състоят преди всичко в предаване на правомощия и функции от централната към местната власт и различни локални структури с идеята да предоставят по-качествени и ефективни услуги на своите граждани. Тук можем да включим и случаите на федерализация и засилване на федералните власти в страни с такова управление.
С падането на желязната завеса – особено в страните от бившия Източен блок – децентрализацията поема към по-значително включване на частния сектор в публичните дела и икономиката заради масовата приватизация. Отварянето на пазарите и стимулирането на частната инициатива се осъществява в по-широкия контекст на демократизация на обществата.
Впоследствие „разпръскването“ на власт придобива все по-широк смисъл. Вече говорим за включване на все повече обществени кръгове във вземането на политически решения – граждани, неправителствени организации, бизнес. Тоест от тясното разбиране за делегиране на правомощия от централната към местната власт постепенно стигаме до по-широката тема за създаване на по-справедливо и ефективно държавно управление. Именно в този по-общ смисъл ще си поговорим за децентрализацията в България.
Предпоставките на средата
За разлика от много други страни по света, в България липсва държавна стратегия или целенасочена държавна политика за децентрализация на властта. У нас тя се развива по-скоро по естествен път, в съзвучие със сложните социално-икономически, политически и исторически обстоятелства на средата.
Парадоксално, политическата нестабилност през последните години, незначителната избирателна активност и ниската легитимност на властта от белег за немощ на българската демокрация могат да се превърнат в първопричината за нейното надграждане.
Фрагментираният парламент без единствена доминираща партия доведе до няколко ключови последици. Първата е липсата на единен център, който да създава и променя законодателството. Това, от една страна, създава поле за битка на повече и различни гледни точки. От друга обаче, означава невъзможност за водене на единна и стабилна политика, тъй като твърде многото интереси и компромиси водят до забавяне на законодателния процес или до осакатяване на реформите.
Вторият важен момент е, че немалка част от политическите фигури, като Бойко Борисов и Румен Радев, които дълго време концентрираха огромна власт в ръцете си, постепенно загубиха този свой прерогатив. И докато самият парламент се разпокъса отвътре, конфронтацията между парламента и Президентството доведе до конституционните промени, които орязват някои правомощия на президента.
При тази липса на единни самодостатъчни властови ядра законодателният процес и политиците се нуждаят от обединяващ фактор, който да надмогне кавгите и управленската разпокъсаност. И тук все по-съществена роля играят частният и гражданският сектор. Защото бизнесите и гражданските организации предлагат надпартийни решения отвъд институционалните борби, които нямат цвят, а решават проблеми на обществената среда.
На този фон администрацията остава изключително ретроградна, а политиците често нямат време, експертност и воля да реформират закостенялото законодателство. Тези дефицити все повече се запълват от активната роля на бизнес асоциациите, неправителствените организации и гражданите с експертни познания в различни области.
Освен това държавният апарат изостава драстично от технологичното развитие на глобално ниво, което забавя модернизацията на всички сектори от обществено значение – от образование и здравеопазване до електронно правителство и дигитализация на обществените услуги. Оттам идва и нарастващата зависимост на държавата от предприемачите, които създават, развиват и предоставят технологиите.
В условията на глобализация и ускорено технологично развитие частният и гражданският сектор придобиват все по-голяма тежест и лостове за влияние върху политиците.
Ефектът върху демокрацията
Нарастващата роля на бизнеса и гражданите трябва да се превърне в тенденция, така че те да участват все по-активно в политическите процеси и вземането на стратегически решения. Тук говорим за „разпръскване“ на власт от парламента и партиите, както и от правителството и министерствата надолу – към обществото.
По-силният глас на модерните бизнеси и гражданския сектор гарантира повече контрол върху дейността на политиците, повече и по-добри идеи за реформи от хора и организации с експертно знание и по-ефективни решения на конкретни проблеми на средата. А колкото по-силни и разпознаваеми сред обществото са добрите примери на хора и организации от неправителствения сектор, толкова по-голямо внимание получават те от медиите. И ако политиците откажат да се съобразяват с тях, именно медиите са тези, които могат да разобличат нередностите и да осъществяват натиск, което неизбежно води до загуба на доверие и електорат.
Къде бизнесът покрива дефицити?
По отношение на дигитализацията бизнесът предоставя технологии и ноу-хау в най-различни сектори, част от които вече споменахме. През миналата година например България най-накрая създаде своя национален електронен идентификатор благодарение на разработената от Evrotrust Technologies дигитална схема Evrotrust eID.
Най-големите пробойни на държавата бизнесът се опитва да попълни в образованието. За първи път 25 бизнес организации излязоха с обща позиция за нуждата от дълбока реформа в образованието. Все повече предприемачи споделят за развитието на собствени академии, стажове и семинари, с които да обучават собствени кадри поради липса на добра подготовка на кандидатите. Инвестициите в учебната инфраструктура, като например научни лаборатории, също говорят за големия принос на бизнеса. Социални предприемачи пък развиват собствени платформи за обучение и осигуряват финансов стимул за успелите ученици и студенти.
В здравеопазването имаме все повече дигитални платформи, които улесняват достъпа до специалисти в най-различни области, осигуряват ценна информация, дигитализират работата на институциите, помагат при подбора на персонал и т.н.
В земеделието български компании предоставят технологични решения за нуждите на земеделците. Например произвеждат високотехнологична роботика за най-различни дейности, като умно косене и плевене, разработват автоматизиране на капковото напояване, осъществяват контрол на климата в закрити и открити площи при отглеждането на култури, осигуряват софтуер и специалисти, които правят измервания и анализи и прилагат добри практики за устойчиво развитие на земеделието и оптимизиране на разходите.
Сигурността и отбраната също все по-силно зависят от модерните предприемачи, най-вече при надпреварата в иновациите. НАТО има собствен иновационен акселератор – DIANA, който предоставя ресурси и тестова база за иновативните стартъпи и ги свързва с учените и крайните потребители при създаването на „дълбоки технологии“ (deep tech) с т.нар. двойна употреба за страните от Съюза. Един от ключовите партньори на DIANA в България като тестов център е институтът „Големи данни в полза на интелигентно общество“ (GATE) към СУ „Св. Климент Охридски“, който осъществява и връзката между българските компании и натовската акселераторска програма.
Това са само няколко примера за нарастващата роля на бизнеса в традиционни за държавата сфери, което постепенно ще се превърне в тенденция в контекста на технологичното развитие.
Стабилна децентрализация в политическа нестабилност
На фона на политическата нестабилност и слабата държавност българските граждани и предприемачи ще придобиват все по-ключова позиция, която ще им позволи да държат политиците отговорни и да изсветляват управленските нередности и корупционни схеми.
Децентрализацията обаче не е лек сама по себе си. Тя трябва да улесни процеса на реформиране на законодателството и модернизиране на средата, като подпомогне комуникацията между политиците и експертите/практиците от частния и гражданския сектор, така че всички да участват дейно в политическите и публичните дела.
Необходимо е създаването на повече профилирани неправителствени и граждански организации, които обединяват широки обществени кръгове, застъпват се за каузи и черпят креативни идеи и решения от практиката. Те трябва да инициират и улесняват разговора между по-широк кръг лица от гражданския, частния и държавния сектор, за да се отговори на нуждите на обществото по най-смислен и ефективен начин. И най-важното – трябва да предоставят конкретни решения на конкретни проблеми, издигайки на пиедестал идеята всяка отделна сфера от живота ни да е по-добра за всички.
Децентрализацията в България ще продължи да се реализира по-скоро по естествен път, движен от нестабилната политическа обстановка и ускореното технологично развитие. Това е шанс за създаване на по-устойчива, макар и несъвършена демокрация, с повече съвестни управленци в по-благоприятна среда за правене на политика.
Shelly #CES2024 Interview – New Products
Post Syndicated from digiblurDIY original https://www.youtube.com/watch?v=aFs9x0rhO1Y
Uncovering 4 Must-see HACS Gems For 2024: Let’s Dive In!
Post Syndicated from BeardedTinker original https://www.youtube.com/watch?v=cPFcxPuzx-s
OpenSSH announces DSA-removal timeline
Post Syndicated from corbet original https://lwn.net/Articles/958048/
For those of you still using DSA keys with SSH: the project has announced
its plans to remove support for that algorithm around the beginning of
2025.
The only remaining use of DSA at this point should be deeply legacy
devices. As such, we no longer consider the costs of maintaining
DSA in OpenSSH to be justified. Moreover, we hope that OpenSSH’s
final removal of this insecure algorithm accelerates its
deprecation in other SSH implementations and allows maintainers of
cryptography libraries to remove it too.
[$] The kernel “closure” API
Post Syndicated from corbet original https://lwn.net/Articles/957187/
The data structure known as a “closure” first found its way into the
mainline kernel with the addition of bcache in the 3.10 development
cycle. With the advent of bcachefs in
6.7, though, it acquired a second user and was moved to the kernel’s
lib directory, making it available to other kernel users as well.
The documentation of closures in the source is better than that of many
things in the kernel, but there is still room for a gentler introduction.
4 Questions for CISOs to Reduce Threat Exposure Risk
Post Syndicated from Rapid7 original https://blog.rapid7.com/2024/01/11/4-questions-for-cisos-to-reduce-threat-exposure-risk/

In an ongoing effort to help security organizations gain greater visibility into threat exposure risk, we have determined four key questions every CISO should be considering based on our understanding of the recommendations of a new report from Gartner®. The report, 2024 Strategic Roadmap for Managing Threat Exposure, can help CISOs and other top executives steer away from risk by analyzing their attack surfaces for gaps.
Question #1: What Do You Already Know?
What are the business-driven events that have already been or are currently being scoped and planned for? In analyzing threat exposure for specific events along the course of the year, a security organization will have the power to better tailor their risk mitigation approaches.
“It’s crucial to scope risk in relation to threat exposure, as this is one of the key outputs that will benefit the wider business. To do so, senior leaders must understand the exposure facing the organization, in direct relation to the impact that an exploitation of said exposure would have. Together, with this information, executives can make informed decisions to either remediate, mitigate or accept the perceived risks. Without impact context, the exposures may be addressed in isolation, leading to uncoordinated fixes relegated to individual departments exacerbating the current problems associated with most vulnerability management programs.” says the Gartner report.
Post-risk scoping, it’s a good idea to then consider if there are any measures that can be taken to better protect certain business-driven events if they have been found to have a greater chance of threat-actor exploitability.
Question #2: How Visible Are Your Critical Systems?
It is also incredibly valuable to take inventory of the most critical and exposed systems in the network, along with each system’s level of visibility and its location. Having a thorough catalog of the points that are or could be the most vulnerable is a must. Just because an exploitable asset might not be considered a remediation priority, there is always the possibility it could be exploited down the line.
Within the context of the report, Gartner details a visibility framework that can aid with vulnerability prioritization:
“Coupled with accessibility is the visibility of the exploitable service, port, or asset. These technologies implement configuration to ensure that details of exploitable elements are not revealed to potential attackers, but not directly removing the possibility of their exploitation.”
Therefore, it becomes necessary to leverage technologies that can provide insights into the visibility of an asset so that – if there is currently a low likelihood of exploitability – remediation efforts can be focused elsewhere and efficiences can be gained within the security organization.
Question #3: Who “Owns” IT Systems?
Identifying who is responsible for the deployment and management of critical IT systems is key if the security organization is to get interdepartmental buy-in for an effective plan to manage threat exposure. Sometimes there isn’t just one person responsible for a certain aspect of network management, which is important to keep in mind as efforts to mitigate threat exposure are built out.
Security personnel, as with so many business operations in which they take part, also must keep in mind that there could be pushback or slow buy-in to a plan that is perceived to lack context. To this point, the research states:
“Without impact context, the exposures may be addressed in isolation, leading to uncoordinated fixes relegated to individual departments exacerbating the current problems associated with most vulnerability management programs.”
Question #4: Who is Responsible for Risk?
Potential friction could also lie in the effort to convince a system owner that there is real action required – and that it could upend that team’s workflow. Effective communication will be imperative here, as will the ability to provide the visibility needed to quickly convince stakeholders that action is, indeed, needed and worth the potential interruption. The report drives home the need for allying with those responsible for risk decisions:
“From the perspective of the organization’s business risk owner, it’s important to recognize that the security team’s role is to support risk management in such a way that the owner can make informed data-driven decisions.”
The CISO Says It All
It will ultimately be up to the CISO to manage and connect separate plans to both limit and eliminate threat exposure along attack surfaces. Through this effort, the CISO can demonstrate the benefits of implementing platforms to manage the growing risk of threat exposure. They’ll also be able to prove the worth of the security operations center (SOC) as both key partners in the effort to keep business secure.
We’re pleased to continually offer leading research to help you gain clarity into managing the risk of threat exposure. Read the Gartner report to better understand how a broad set of exposures can impact the workloads of a security organization – and how important it becomes to prioritize properly and communicate effectively.
Gartner, 2024 Strategic Roadmap for Managing Threat Exposure, Pete Shoard, 8 November 2023.
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
Security updates for Thursday
Post Syndicated from jake original https://lwn.net/Articles/958029/
Security updates have been issued by Debian (chromium), Fedora (chromium, python-paramiko, tigervnc, and xorg-x11-server), Oracle (ipa, libxml2, python-urllib3, python3, and squid), Red Hat (.NET 6.0, .NET 7.0, .NET 8.0, container-tools:4.0, fence-agents, frr, gnutls, idm:DL1, ipa, kernel, kernel-rt, libarchive, libxml2, nss, openssl, pixman, python-urllib3, python3, tigervnc, tomcat, and virt:rhel and virt-devel:rhel modules), SUSE (gstreamer-plugins-bad), and Ubuntu (firefox, Go, linux-aws, linux-gcp-5.15, linux-intel-iotg-5.15, linux-iot, linux-oem-6.1, and twisted).
Zero-Day Exploitation of Ivanti Connect Secure and Policy Secure Gateways
Post Syndicated from Caitlin Condon original https://blog.rapid7.com/2024/01/11/etr-zero-day-exploitation-of-ivanti-connect-secure-and-policy-secure-gateways/

On Wednesday, January 10, 2024, Ivanti disclosed two zero-day vulnerabilities affecting their Ivanti Connect Secure and Ivanti Policy Secure gateways. Security firm Volexity, who discovered the vulnerabilities, also published a blog with information on indicators of compromise and attacker behavior observed in the wild. In an attack Volexity investigated in December 2023, the two vulnerabilities were chained to gain initial access, deploy webshells, backdoor legitimate files, capture credentials and configuration data, and pivot further into the victim environment.
The two vulnerabilities in the advisory are:
- CVE-2023-46805, an authentication bypass vulnerability in the web component of Ivanti Connect Secure (9.x, 22.x) and Ivanti Policy Secure that allows a remote attacker to access restricted resources by bypassing control checks.
- CVE-2024-21887, a critical command injection vulnerability in web components of Ivanti Connect Secure (9.x, 22.x) and Ivanti Policy Secure that allows an authenticated administrator to send specially crafted requests and execute arbitrary commands on the appliance. This vulnerability can be exploited over the internet
Rapid7 urges customers who use Ivanti Connect Secure or Policy Secure in their environments to take immediate steps to apply the workaround and look for indicators of compromise. Volexity have released an extensive description of the attack and indicators of compromise — we strongly recommend reviewing their blog, which includes the information below:
“Volexity observed the attacker modifying legitimate ICS components and making changes to the system to evade the ICS Integrity Checker Tool. Notably, Volexity observed the attacker backdooring a legitimate CGI file (compcheck.cgi) on the ICS VPN appliance to allow command execution. Further, the attacker also modified a JavaScript file used by the Web SSL VPN component of the device in order to keylog and exfiltrate credentials for users logging into it. The information and credentials collected by the attacker allowed them to pivot to a handful of systems internally, and ultimately gain unfettered access to systems on the network.”
Ivanti Connect Secure, previously known as Pulse Connect Secure, is a security appliance that has been targeted in a range of threat campaigns in recent years. The U.S. Cybersecurity and Infrastructure Security Agency (CISA) also released a bulletin on January 10, 2024 urging Ivanti Connect Secure and Ivanti Policy Secure users to mitigate the two vulnerabilities immediately.
Counts of internet-exposed appliances vary widely depending on the query used. The following Shodan query identifies roughly 7K devices on the public internet, while looking for Ivanti’s welcome page alone more than doubles that number (but reduces accuracy): http.favicon.hash:-1439222863 html:"welcome.cgi?p=logo. Rapid7 Labs has observed scanning activity targeting our honeypots that emulate Ivanti Connect Secure appliances.
Mitigation guidance
All supported versions (9.x and 22.x) of Ivanti Connect Secure and Ivanti Policy Secure are vulnerable to CVE-2023-46805 and CVE-2024-21887. Ivanti’s advisory notes that a workaround is available for CVE-2023-46805 and CVE-2024-21887. Ivanti Connect Secure and Ivanti Policy Secure customers should apply the vendor-supplied workaround immediately and investigate their environments for signs of compromise. Ivanti advises customers using unsupported versions of the product to upgrade to a supported version before applying the workaround.
Ivanti has indicated that patches will be released in a staggered schedule between January 22 and February 19, 2024 — target patch timelines can be found here.
Per Ivanti’s advisory and KB article, “Ivanti Neurons for ZTA gateways cannot be exploited when in production. If a gateway for this solution is generated and left unconnected to a ZTA controller, then there is a risk of exploitation on the generated gateway. Ivanti Neurons for Secure Access is not vulnerable to these CVEs; however, the gateways being managed are independently vulnerable to these CVEs.”
Note: Volexity indicated that adversaries have been observed wiping logs and/or disabling logging on target devices. Administrators should ensure logging is enabled. Ivanti has a built-in integrity checker tool (ICT) that verifies the image on Ivanti Connect Secure and Ivanti Policy Secure appliances and looks for modified files. Ivanti is advising customers to use the external version of this tool to check the integrity of the ICS/IPS images, since Ivanti has seen adversaries “attempting to manipulate” the internal integrity checker tool.
Rapid7 customers
Our engineering team is investigating options for InsightVM and Nexpose coverage for these vulnerabilities. We will provide an update to this blog no later than 3 PM EST on Thursday, January 11, 2024.
Pharmacies Giving Patient Records to Police without Warrants
Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/01/pharmacies-giving-patient-records-to-police-without-warrants.html
Add pharmacies to the list of industries that are giving private data to the police without a warrant.
Teaching about AI explainability
Post Syndicated from Mac Bowley original https://www.raspberrypi.org/blog/teaching-ai-explainability/
In the rapidly evolving digital landscape, students are increasingly interacting with AI-powered applications when listening to music, writing assignments, and shopping online. As educators, it’s our responsibility to equip them with the skills to critically evaluate these technologies.

A key aspect of this is understanding ‘explainability’ in AI and machine learning (ML) systems. The explainability of a model is how easy it is to ‘explain’ how a particular output was generated. Imagine having a job application rejected by an AI model, or facial recognition technology failing to recognise you — you would want to know why.

Establishing standards for explainability is crucial. Otherwise we risk creating a world where decisions impacting our lives are made by opaque systems we don’t understand. Learning about explainability is key for students to develop digital literacy, enabling them to navigate the digital world with informed awareness and critical thinking.
Why AI explainability is important
AI models can have a significant impact on people’s lives in various ways. For instance, if a model determines a child’s exam results, parents and teachers would want to understand the reasoning behind it.

Artists might want to know if their creative works have been used to train a model and could be at risk of plagiarism. Likewise, coders will want to know if their code is being generated and used by others without their knowledge or consent. If you came across an AI-generated artwork that features a face resembling yours, it’s natural to want to understand how a photo of you was incorporated into the training data.
Explainability is about accountability, transparency, and fairness, which are vital lessons for children as they grow up in an increasingly digital world.
There will also be instances where a model seems to be working for some people but is inaccurate for a certain demographic of users. This happened with Twitter’s (now X’s) face detection model in photos; the model didn’t work as well for people with darker skin tones, who found that it could not detect their faces as effectively as their lighter-skinned friends and family. Explainability allows us not only to understand but also to challenge the outputs of a model if they are found to be unfair.
In essence, explainability is about accountability, transparency, and fairness, which are vital lessons for children as they grow up in an increasingly digital world.
Routes to AI explainability
Some models, like decision trees, regression curves, and clustering, have an in-built level of explainability. There is a visual way to represent these models, so we can pretty accurately follow the logic implemented by the model to arrive at a particular output.
By teaching students about AI explainability, we are not only educating them about the workings of these technologies, but also teaching them to expect transparency as they grow to be future consumers or even developers of AI technology.
A decision tree works like a flowchart, and you can follow the conditions used to arrive at a prediction. Regression curves can be shown on a graph to understand why a particular piece of data was treated the way it was, although this wouldn’t give us insight into exactly why the curve was placed at that point. Clustering is a way of collecting similar pieces of data together to create groups (or clusters) with which we can interrogate the model to determine which characteristics were used to create the groupings.

However, the more powerful the model, the less explainable it tends to be. Neural networks, for instance, are notoriously hard to understand — even for their developers. The networks used to generate images or text can contain millions of nodes spread across thousands of layers. Trying to work out what any individual node or layer is doing to the data is extremely difficult.

Regardless of the complexity, it is still vital that developers find a way of providing essential information to anyone looking to use their models in an application or to a consumer who might be negatively impacted by the use of their model.
Model cards for AI models
One suggested strategy to add transparency to these models is using model cards. When you buy an item of food in a supermarket, you can look at the packaging and find all sorts of nutritional information, such as the ingredients, macronutrients, allergens they may contain, and recommended serving sizes. This information is there to help inform consumers about the choices they are making.
Model cards attempt to do the same thing for ML models, providing essential information to developers and users of a model so they can make informed choices about whether or not they want to use it.

Model cards include details such as the developer of the model, the training data used, the accuracy across diverse groups of people, and any limitations the developers uncovered in testing.
Model cards should be accessible to as many people as possible.
A real-world example of a model card is Google’s Face Detection model card. This details the model’s purpose, architecture, performance across various demographics, and any known limitations of their model. This information helps developers who might want to use the model to assess whether it is fit for their purpose.
Transparency and accountability in AI
As the world settles into the new reality of having the amazing power of AI models at our disposal for almost any task, we must teach young people about the importance of transparency and responsibility.

As a society, we need to have hard discussions about where and when we are comfortable implementing models and the consequences they might have for different groups of people. By teaching students about explainability, we are not only educating them about the workings of these technologies, but also teaching them to expect transparency as they grow to be future consumers or even developers of AI technology.
Most importantly, model cards should be accessible to as many people as possible — taking this information and presenting it in a clear and understandable way. Model cards are a great way for you to show your students what information is important for people to know about an AI model and why they might want to know it. Model cards can help students understand the importance of transparency and accountability in AI.
This article also appears in issue 22 of Hello World, which is all about teaching and AI. Download your free PDF copy now.
If you’re an educator, you can use our free Experience AI Lessons to teach your learners the basics of how AI works, whatever your subject area.
The post Teaching about AI explainability appeared first on Raspberry Pi Foundation.
Изкуственият интелект на Аллах
Post Syndicated from Атанас Шиников original https://www.toest.bg/izkustvieniyat-intelekt-na-allah/

Шумът около новите технологии минавал през пет етапа, твърдят консултанти от ИТ сферата, където и аз заработвам насъщния. Описва ги популярната крива на очакванията, положени върху времевата ос. Първо, твърдят те, идвал пускът на иновацията: поява на нов продукт, медиен шум, всеки говори за него. Следвал пикът на свръхочакванията. Пропастта на разочарованията е третият етап, когато става ясно, че се сбъдва заглавието на Шекспировата пиеса „Много шум за нищо“. Сетне бавно покачващият се „склон на просвещението“ уталожвал развитието на продукта. Следват второ и трето поколение продукти, при което инвеститорите продължават да харчат само при условие че технологията се подобри. Най-накрая идвало платото на продуктивността, когато изобретението става част от масовата култура на крайния потребител и устойчиво доставя предсказуеми резултати.
Според последната прогноза развитието на изкуствения интелект (ИИ) в момента се позиционира някъде между пуска на технологията и пика на свръхочакванията.
Малко сухи цифри
Според проучване от 2023 г. 38% от инициативите, свързани с употребата на т.нар. генеративен изкуствен интелект, са насочени към подобряване на клиентското удовлетворение, тоест как да направим крайните ни потребители по-щастливи. 26% от тях се фокусират върху растежа на приходите, 17% – върху оптимизация на разходите. 7% от употребите са свързани с решения за непрекъснатост на дейностите – какво правим, за да не спира бизнесът ни при форсмажорни обстоятелства. И накрая, 12% от внедряването на решения с ИИ са в други области или не са приложими, доколкото запитването е отправено към инвеститор или към предлагащ решения за изкуствен интелект.
Само че мен не ме интересува това.
Интересува ме може ли ИИ да бъде мюсюлманският андроид, който сънува шериатски пасбища за електрически правоверни овце. Кой пази пазачите на вратата към изрядния помюсюлманскому начин на живот. Не просто дали мюсюлманите могат да намерят отговор на даден въпрос чрез интегриране на ИИ в търсачка, както направиха Microsoft с Bing. А би ли могъл изкуственият интелект да се превърне в легитимен източник на житейски напътствия за един последовател на исляма? Би ли могъл да влезе в почетната роба на един от авторитетите сред мюсюлманската общност? Възможно ли е някой ден притежателите на „религиозното знание“, които са „наследниците на пророците“, по известното предание от Пророка Мохамед, тези, които удържат целостта на общността от правоверни чак до настъпването на Съдния ден, да се окажат конструкти от нули и единици? Чука ли скриптът на киберходжата на вратата? Чакаме ли дигитален скрит имам като в една цифровизирана версия на шиизма?
Да, явно имам скрити помисли. Затова и започвам с прост въпрос към ChatGPT. Искам да чуя какво ще каже електронният всезнайко във връзка с мой казус в ролята ми на религиозен любопитко, който чете постоянно сборници с мюсюлмански въпроси и отговори.
Аз съм мюсюлманин, който работи с юдеи (йахуд). Позволено ли е според Свещения закон (шариа‘) да ги поздравявам сутрин?
С това нагазвам в многовековната традиция на мюсюлмански responsa – въпроси и отговори – между вярващите в Аллах и техните религиозни авторитети. Та хората, независимо от религията им, винаги са се сблъсквали с проблеми, за които няма пряко решение в свещените книги. В мюсюлманската традиция тези своеобразни диалози кристализират в жанра на т.нар. фетва (фатуа). Отговорите на авторитетни богослови се събират в сборници, четат се, разпространяват се и се използват като справочен материал. И тук терминът фетва не означава „смъртна присъда“, както понякога си мисли широката публика след скандала със „Сатанински строфи“ на Салман Рушди и смъртната присъда срещу него от аятолах Хомейни през 1989 г. или пък след нападението срещу писателя от 2022 г.
Тривиалната, общоприетата дефиниция на фетва е „необвързващо религиозно-законово мнение в отговор на даден въпрос“. Питащият не е непременно задължен да приложи решението, дадено му от признат от него авторитет. Но пък може да бъде мотивиран от силата на религиозния аргумент. И тези мнения-отговори, в зависимост от това колко е популярен отговарящият, може да имат широк обществен отзвук – както в случая със Салман Рушди или с фетвата, одобряваща обезглавяването на френския учител по история през 2020 г. Може да имат и политическо влияние – като историческите постановления на османския шейхюислям или пък мюфтията на ИДИЛ, заловен около Мосул през 2020 г.
Често целта на запитването е да положи отговора на въпроса някъде по оста на петте степени на позволеност по мюсюлманския свещен закон.
Според този закон нещата от тукашния свят могат да бъдат разпределени в пет категории, т.нар. „шариатски постановления“, „регулации“ или „степени на позволеност“ (ахкам). На първо място са религиозните задължения (дадена повеля следва да се спазва, ако се самоопределяш като мюсюлманин), после идват препоръчителните (добре е да се извърши, похвално е), неутралните, позволени неща (например да пия ли сок от ябълка или морков), порицаемите (няма да те вкара в ада, но не го прави, по-добре е за теб). А накрая идват откровено забранените, греховни дела (ще страдаш и в тукашния, и в отвъдния свят, ако го извършиш), за тях съществува и терминът харам. И мюсюлманите – и в миналото, и днес – питат за неща, които за съвременния човек може да са скандални. Или да стоят изцяло извън периметъра на модерното секуларно схващане за духовност и отношението между религиозно, обществено и политическо.
– Мюсюлманка съм. Може ли да си направя липосукция, ако съм дебела?
– Може ли да поставя украсена с розички снимка на братовчедка ми на бюрото?
– Приложенията за онлайн запознанства и срещи допустими ли са?
– Ядат и пият ли мъчениците, отишли в рая?
– Биткойн и другите дигитални валути позволени ли са според свещения закон на Аллах?
– Може ли да нарека дъщеря си Натали?
– Живея в България, където повечето хора са неверници, трябва ли да празнувам техните празници?
– Може ли да отворя хотел, където да използвам бизнеса си с мисионерска цел за разпространение на исляма?
– Позволява ли ми религията да си поставя имплант с хормони?
– Някой открадна орган на роднина по време на операция, какво е наказанието за това?
– Може ли да търгувам със свински продукти, при условие че не ги консумирам?
– Как стои въпросът с храни, в чийто състав има посочен минимален процент алкохол?
– Ако мъжът и жената са голи по време на секс, това прави ли брака невалиден?
– Какво да сторя в случаи на домашно насилие от страна на жена ми, която ме заплашва с нож при семеен скандал?
– Защо може да работим с жени колеги само ако преди това сме били кърмени от тях?
– Какви са религиозните основания за практикуване на футбола?
Примерите по-горе са реални. Въпросите са отправени от реално съществуващи хора към реално съществуващи религиозни авторитети, но посредством електронни инструменти. Тази популярна практика е довела в днешно време до изграждането на мащабни и добре финансирани бази данни от въпроси и отговори онлайн, поддържани от екипи около даден шейх или институция. Те са интегрална част от „киберислямските среди“ (Cyber-Islamic Environments, CIE), както ги нарича Гари Р. Бунт, професор от Уелс и мой добър познат, който от години разработва дигиталното присъствие на мюсюлманските общности по света.
Влиятелни сайтове като IslamQA.info на саудитския шейх Мухаммад Салих ал-Мунадджид или пък портала IslamWeb.net, финансиран от катарското Министерство на религиозните дела, съдържат стотици хиляди запитвания от мюсюлмани по света. Сходно е положението със сайта на починалия през 2022 г. Юсуф ал-Карадауи или друг свързан с него сайт – IslamOnline.net, пуснат през 1997 г. от обществото „Ал-Балаг“ в Катар по инициатива на тамошни ИТ специалисти. Дори и страницата на нашенското Главно мюфтийство съдържа скромна секция с няколко отговора на въпроси от мюсюлманската общност тук, както и онлайн формуляр.
И каналите за интеракция с крайния потребител варират. Някъде запитванията се отправят по телефон, другаде има контактен формуляр, понякога шейховете очакват запитвания по електронна поща или пък има разработени приложения за мобилни устройства. Но всички тези канали като цяло страдат от капацитет за обработване на входящите заявки вследствие на огромното търсене на услугата.
И да, на първо място мюсюлманите питат за общото положение в шариата относно изкуствения интелект
… защото проблематиката е сравнително нова. Тук отговорът е донейде предсказуем. Прибягването до технологично решение за целите на отговори на религиозни въпроси е разбираема. Дихотомията „религия или технология“, изразена в почти банализираната метафора на Даниъл Лернър за „Мека или механизация“ от 1958 г., вече се възприема като изкуствена. Та историята на религиозната общност познава и други моменти, при които използването на дадена технология първоначално е възприемано като съмнително, след което се превръща в интегрална част от живота на мюсюлманите. Например навлизането на печатната преса през XIX век, което лека-полека смъква преписвачите на Корана и друга религиозна литература от пиедестала им. Или радиопредаванията и аудиокасетите, впрегнати в услуга на шейховете. Или първите версии на уебстраници, после и навлизането на Web 2.0, което повишава скоростта на обмен на информация и интерактивността в социални мрежи, блогове, приложения, както и във вече споменатите портали за фетви.
Типичен отговор на това дали изкуственият интелект е религиозно забранен (харам), се дава на катарския сайт за фетви IslamWeb.net. Слава на Аллах, казва богословът, основният принцип за окачествяването на нещата от света е, че те са позволени, докато не се появи доказателство за тяхната възбраненост и греховност. Аргументът за това общо съждение, което прилича на юридическата презумпция за невиновност до доказване на противното, е зает от позоваване на богослова Ибн Таймия от XIV век. Същият Ибн Таймия е известен и като идеологически вдъхновител на радикални мюсюлмански течения, сред които ИДИЛ, „Боко Харам“ и „Ал-Кайда“.
Друг пък мюсюлманин има по-нюансирано питане:
Уча програмиране, покрай това учим за изкуствения интелект. Като цяло се говори, че той можел да имитира човешкия ум, това позволено ли е по шариата, или способността за разсъждение следва да се схваща единствено като вложена от Аллах в човешкия род?
Отговорът е, че проблемът е само терминологичен, защото изкуственият интелект представлява единствено средство; той не може да напусне предзададените му граници, както и категорично не притежава качествата на живи същества, камо ли душа, която Аллах дава на тварите от света. А до голяма степен дефиницията на разумни и живи същества идва от Корана, за което питащият е препратен към няколко други фетви. Оттук и – ако се удържат поставените от Аллах граници – няма проблем да се разработва тази технология. Тоест ако продължим тази нишка на разсъждение, няма проблем технологията да бъде впрегната от благочестиви мюсюлмани за общите цели на медицината, образованието, програмирането, информационните технологии и прочее, а всеки специфичен случай подлежи на отделно разглеждане.
На второ място обаче стои по-конкретният въпрос:
Ако ИИ е принципно позволен, възможно ли е той да се превърне в самостоятелен религиозен авторитет?
На пръв поглед такъв умствен експеримент не е невъзможен. Направо си е изкусителен. Мюсюлманското право традиционно се възприема като основано на четири принципа (усул). На първо място стои Писанието, тоест Корана. Следват преданията на Пророка – Сунната, съставена от отдавна дефинирани текстови корпуси. На трето място идва „аналогията“ (кийас) и накрая, като че ли най-размито откъм дефиниция, стои консенсусът (иджма‘) на религиозните учени. Тези четири извора на правото се съотнасят един спрямо друг по дефиниран от юриспруденцията йерархичен начин поне от времето на имам аш-Шафии (поч. 820 г.) и неговия „Трактат“ (Рисала). Оттук и задачата на всеки религиозен учен, който участва в издаването на фетва, е лесна. Той трябва да стъпи върху предзададеното „канонично“ съдържание на Корана и Сунната и посредством формалната логика и инструментите на аналогията и консенсуса да достигне до удържимо от религиозна гледна точка решение на дадено запитване.
Всеки от четирите източника на правото може да бъде обрисуван като база данни с ограничени на брой, ако и обемни, колекции от логически свързани набори от записи.
Като че ли най-ясна е ситуацията с Писанието. Коранът съдържа 114 глави (сури), 6236 „знамения“, тоест стихове, 77 797 думи и 330 709 букви. Сурите се делят на мекански и медински според мястото на тяхното „низпославане“ от Аллах към Пророка; съществува и хронологична подредба на отделните знамения.
Подобна е ситуацията и със Сунната, поне със съдържанието на признатия за достоверен корпус от суннитите, т.нар. Сихах. Най-популярните са двата сборника на Муслим и Ал-Бухари от IX век, а в тях преданията за казаното или направеното от Пророка са тематично подредени и номерирани. Всеки отделен разказ се състои от две части. Първата съдържа споменаване на веригата от съратници на Пророка, които са го предали, т.нар. „опора“ (иснад). След това идва реалното съдържание („корпус“, „основна част“, матн).
„Аналогията“ пък невинаги е всепризнат механизъм за извеждане на религиозно валидни съждения и тук зависи от способността за изграждане на автоматизация на алгоритмите за взаимовръзка на принципа „Ако А е забранено, то Б, доколкото е сходно на А, би следвало също да се забрани“. Консенсусът е още по-проблематичен от гледна точка на това кои точно правно-богословски мнения искаме да използваме. Както с повечето случаи на употреба на ИИ, тук особено интересно би било с какво го захранваме. От кой исторически период и географски ареали приоритизираме съдържание, отдаваме ли превес на някоя от четирите правно-богословски школи (мазхаби) повече от други.
Или пък, ако се върнем към Корана и твърдим, че традицията на коранични коментари (тафсир) ни дава ключ към историческото му разбиране, на кои точно ще наблегнем. Ще го захраним ли например с един от най-ранните – този на Мукатил ибн Сулайман от VIII век като доста близък до времето на самия Пророк? Само че тафсирът на Мукатил е твърде лапидарен и не ни осветлява за по-късното богатство на традицията на тълкуване на Писанието.
Хубаво. Да му подадем тогава класици, като Ат-Табари от началото на X век. Ат-Табари обаче пише един доста обширен коментар, в който наред с всичко включва и признат за „апокрифен“, съмнителен материал. Откъде мислите, че Салман Рушди е взел историята за това как Иблис, мюсюлманският сатана (Шайтан), вдъхновява Пророка? Точно оттам. Ами къде мислите, че четем историята за създаването на прасето от слона в Ноевия ковчег? Все при Ат-Табари.
Или пък да му подадем съвременни коментари като „В сенките на Корана“ на идеолога на „Мюсюлмански братя“ и баща на салафитския джихадизъм Сайид Кутб от средата на миналия век?
Дали пък в крайна сметка няма да свършим с електронна версия на Бен Ладен?
Но на общо основание е ясно, че ИИ може да заеме ролята на устройство за производство на легитимни религиозни силогизми на базата на необработени данни. Не подобие на делфийския оракул, панаирна машина за случайни късметчета или гадаене по китайската Идзин, а по-скоро универсален скрипт, отговарящ на дефиницията за генеративен изкуствен интелект, който, стъпвайки на базата на предзададени входни данни, да може да отговори на всяко възможно религиозно запитване чрез изграждане на ново съдържание с добавена стойност.
Опитите за представяне на религиозните извори на исляма като структурирани релационни бази данни не са отскоро. Само че идеята за тяхното креативно комбиниране тепърва прохожда. И все пак е отпреди шума около ChatGPT.
Очаквайте продължение следващата седмица.
В рубриката „Ориент кафе“ Атанас Шиников поднася любопитни теми, свързани не толкова с горещата политика, колкото с историята и културата на Близкия изток. А той, древен и днешен, е по-близко до нас и съвремието ни, отколкото си представяме.
Cheap NICGIGA 10Gbase-T Adapter Mini-Review Marvell AQC113C
Post Syndicated from Rohit Kumar original https://www.servethehome.com/cheap-nicgiga-10gbase-t-adapter-mini-review-marvell-aqc113c/
This might be the cheapest 10Gbase-T adapter out there. We found a NICGIGA 10GbE card based on a low-power Marvell AQC113C NIC
The post Cheap NICGIGA 10Gbase-T Adapter Mini-Review Marvell AQC113C appeared first on ServeTheHome.
[$] LWN.net Weekly Edition for January 11, 2024
Post Syndicated from corbet original https://lwn.net/Articles/956868/
The LWN.net Weekly Edition for January 11, 2024 is available.
[$] Notes on Emacs Org mode
Post Syndicated from jake original https://lwn.net/Articles/957316/
As part of my quest to master Emacs, which
is sort of a sub-quest on the way toward learning more about Lisp, I have
spent a fair amount of time discovering various corners of the Emacs
world. One of those is the famous “Org
mode” that is used for a wide variety of organizational tasks within
the editor—and not just Emacs, but for Vim and others too.
Org mode can be
used for to-do lists, notes with interconnections between them, literate
programming, web sites, and more. Now my quests are growing quests of
their own and digging into Org mode is one of those.
Rebuilding Netflix Video Processing Pipeline with Microservices
Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/rebuilding-netflix-video-processing-pipeline-with-microservices-4e5e6310e359
Liwei Guo, Anush Moorthy, Li-Heng Chen, Vinicius Carvalho, Aditya Mavlankar, Agata Opalach, Adithya Prakash, Kyle Swanson, Jessica Tweneboah, Subbu Venkatrav, Lishan Zhu
This is the first blog in a multi-part series on how Netflix rebuilt its video processing pipeline with microservices, so we can maintain our rapid pace of innovation and continuously improve the system for member streaming and studio operations. This introductory blog focuses on an overview of our journey. Future blogs will provide deeper dives into each service, sharing insights and lessons learned from this process.
The Netflix video processing pipeline went live with the launch of our streaming service in 2007. Since then, the video pipeline has undergone substantial improvements and broad expansions:
- Starting with Standard Dynamic Range (SDR) at Standard-Definitions, we expanded the encoding pipeline to 4K and High Dynamic Range (HDR) which enabled support for our premium offering.
- We moved from centralized linear encoding to distributed chunk-based encoding. This architecture shift greatly reduced the processing latency and increased system resiliency.
- Moving away from the use of dedicated instances that were constrained in quantity, we tapped into Netflix’s internal trough created due to autoscaling microservices, leading to significant improvements in computation elasticity as well as resource utilization efficiency.
- We rolled out encoding innovations such as per-title and per-shot optimizations, which provided significant quality-of-experience (QoE) improvement to Netflix members.
- By integrating with studio content systems, we enabled the pipeline to leverage rich metadata from the creative side and create more engaging member experiences like interactive storytelling.
- We expanded pipeline support to serve our studio/content-development use cases, which had different latency and resiliency requirements as compared to the traditional streaming use case.
Our experience of the last decade-and-a-half has reinforced our conviction that an efficient, flexible video processing pipeline that allows us to innovate and support our streaming service, as well as our studio partners, is critical to the continued success of Netflix. To that end, the Video and Image Encoding team in Encoding Technologies (ET) has spent the last few years rebuilding the video processing pipeline on our next-generation microservice-based computing platform Cosmos.
From Reloaded to Cosmos
Reloaded
Starting in 2014, we developed and operated the video processing pipeline on our third-generation platform Reloaded. Reloaded was well-architected, providing good stability, scalability, and a reasonable level of flexibility. It served as the foundation for numerous encoding innovations developed by our team.
When Reloaded was designed, we focused on a single use case: converting high-quality media files (also known as mezzanines) received from studios into compressed assets for Netflix streaming. Reloaded was created as a single monolithic system, where developers from various media teams in ET and our platform partner team Content Infrastructure and Solutions (CIS)¹ worked on the same codebase, building a single system that handled all media assets. Over the years, the system expanded to support various new use cases. This led to a significant increase in system complexity, and the limitations of Reloaded began to show:
- Coupled functionality: Reloaded was composed of a number of worker modules and an orchestration module. The setup of a new Reloaded module and its integration with the orchestration required a non-trivial amount of effort, which led to a bias towards augmentation rather than creation when developing new functionalities. For example, in Reloaded the video quality calculation was implemented inside the video encoder module. With this implementation, it was extremely difficult to recalculate video quality without re-encoding.
- Monolithic structure: Since Reloaded modules were often co-located in the same repository, it was easy to overlook code-isolation rules and there was quite a bit of unintended reuse of code across what should have been strong boundaries. Such reuse created tight coupling and reduced development velocity. The tight coupling among modules further forced us to deploy all modules together.
- Long release cycles: The joint deployment meant that there was increased fear of unintended production outages as debugging and rollback can be difficult for a deployment of this size. This drove the approach of the “release train”. Every two weeks, a “snapshot” of all modules was taken, and promoted to be a “release candidate”. This release candidate then went through exhaustive testing which attempted to cover as large a surface area as possible. This testing stage took about two weeks. Thus, depending on when the code change was merged, it could take anywhere between two and four weeks to reach production.
As time progressed and functionalities grew, the rate of new feature contributions in Reloaded dropped. Several promising ideas were abandoned owing to the outsized work needed to overcome architectural limitations. The platform that had once served us well was now becoming a drag on development.
Cosmos
As a response, in 2018 the CIS and ET teams started developing the next-generation platform, Cosmos. In addition to the scalability and the stability that the developers already enjoyed in Reloaded, Cosmos aimed to significantly increase system flexibility and feature development velocity. To achieve this, Cosmos was developed as a computing platform for workflow-driven, media-centric microservices.
The microservice architecture provides strong decoupling between services. Per-microservice workflow support eases the burden of implementing complex media workflow logic. Finally, relevant abstractions allow media algorithm developers to focus on the manipulation of video and audio signals rather than on infrastructural concerns. A comprehensive list of benefits offered by Cosmos can be found in the linked blog.
Building the Video Processing Pipeline in Cosmos
Service Boundaries
In the microservice architecture, a system is composed of a number of fine-grained services, with each service focusing on a single functionality. So the first (and arguably the most important) thing is to identify boundaries and define services.
In our pipeline, as media assets travel through creation to ingest to delivery, they go through a number of processing steps such as analyses and transformations. We analyzed these processing steps to identify “boundaries” and grouped them into different domains, which in turn became the building blocks of the microservices we engineered.
As an example, in Reloaded, the video encoding module bundles 5 steps:
1. divide the input video into small chunks
2. encode each chunk independently
3. calculate the quality score (VMAF) of each chunk
4. assemble all the encoded chunks into a single encoded video
5. aggregate quality scores from all chunks
From a system perspective, the assembled encoded video is of primary concern while the internal chunking and separate chunk encodings exist in order to fulfill certain latency and resiliency requirements. Further, as alluded to above, the video quality calculation provides a totally separate functionality as compared to the encoding service.
Thus, in Cosmos, we created two independent microservices: Video Encoding Service (VES) and Video Quality Service (VQS), each of which serves a clear, decoupled function. As implementation details, the chunked encoding and the assembling were abstracted away into the VES.
Video Services
The approach outlined above was applied to the rest of the video processing pipeline to identify functionalities and hence service boundaries, leading to the creation of the following video services².
- Video Inspection Service (VIS): This service takes a mezzanine as the input and performs various inspections. It extracts metadata from different layers of the mezzanine for downstream services. In addition, the inspection service flags issues if invalid or unexpected metadata is observed and provides actionable feedback to the upstream team.
- Complexity Analysis Service (CAS): The optimal encoding recipe is highly content-dependent. This service takes a mezzanine as the input and performs analysis to understand the content complexity. It calls Video Encoding Service for pre-encoding and Video Quality Service for quality evaluation. The results are saved to a database so they can be reused.
- Ladder Generation Service (LGS): This service creates an entire bitrate ladder for a given encoding family (H.264, AV1, etc.). It fetches the complexity data from CAS and runs the optimization algorithm to create encoding recipes. The CAS and LGS cover much of the innovations that we have previously presented in our tech blogs (per-title, mobile encodes, per-shot, optimized 4K encoding, etc.). By wrapping ladder generation into a separate microservice (LGS), we decouple the ladder optimization algorithms from the creation and management of complexity analysis data (which resides in CAS). We expect this to give us greater freedom for experimentation and a faster rate of innovation.
- Video Encoding Service (VES): This service takes a mezzanine and an encoding recipe and creates an encoded video. The recipe includes the desired encoding format and properties of the output, such as resolution, bitrate, etc. The service also provides options that allow fine-tuning latency, throughput, etc., depending on the use case.
- Video Validation Service (VVS): This service takes an encoded video and a list of expectations about the encode. These expectations include attributes specified in the encoding recipe as well as conformance requirements from the codec specification. VVS analyzes the encoded video and compares the results against the indicated expectations. Any discrepancy is flagged in the response to alert the caller.
- Video Quality Service (VQS): This service takes the mezzanine and the encoded video as input, and calculates the quality score (VMAF) of the encoded video.
Service Orchestration
Each video service provides a dedicated functionality and they work together to generate the needed video assets. Currently, the two main use cases of the Netflix video pipeline are producing assets for member streaming and for studio operations. For each use case, we created a dedicated workflow orchestrator so the service orchestration can be customized to best meet the corresponding business needs.
For the streaming use case, the generated videos are deployed to our content delivery network (CDN) for Netflix members to consume. These videos can easily be watched millions of times. The Streaming Workflow Orchestrator utilizes almost all video services to create streams for an impeccable member experience. It leverages VIS to detect and reject non-conformant or low-quality mezzanines, invokes LGS for encoding recipe optimization, encodes video using VES, and calls VQS for quality measurement where the quality data is further fed to Netflix’s data pipeline for analytics and monitoring purposes. In addition to video services, the Streaming Workflow Orchestrator uses audio and timed text services to generate audio and text assets, and packaging services to “containerize” assets for streaming.
For the studio use case, some example video assets are marketing clips and daily production editorial proxies. The requests from the studio side are generally latency-sensitive. For example, someone from the production team may be waiting for the video to review so they can decide the shooting plan for the next day. Because of this, the Studio Workflow Orchestrator optimizes for fast turnaround and focuses on core media processing services. At this time, the Studio Workflow Orchestrator calls VIS to extract metadata of the ingested assets and calls VES with predefined recipes. Compared to member streaming, studio operations have different and unique requirements for video processing. Therefore, the Studio Workflow Orchestrator is the exclusive user of some encoding features like forensic watermarking and timecode/text burn-in.

Where we are now
We have had the new video pipeline running alongside Reloaded in production for a few years now. During this time, we completed the migration of all necessary functionalities from Reloaded, began gradually shifting over traffic one use case at a time, and completed the switchover in September of 2023.
While it is still early days, we have already seen the benefits of the new platform, specifically the ease of feature delivery. Notably, Netflix launched the Advertising-supported plan in November 2022. Processing Ad creatives posed some new challenges: media formats of Ads are quite different from movie and TV mezzanines that the team was familiar with, and there was a new set of media processing requirements related to the business needs of Ads. With the modularity and developer productivity benefits of Cosmos, we were able to quickly iterate the pipeline to keep up with the changing requirements and support a successful product launch.
Summary
Rebuilding the video pipeline was a huge undertaking for the team. We are very proud of what we have achieved, and also eager to share our journey with the technical community. This blog has focused on providing an overview: a brief history of our pipeline and the platforms, why the rebuilding was necessary, what these new services look like, and how they are being used for Netflix businesses. In the next blog, we are going to delve into the details of the Video Encoding Service (VES), explaining step-by-step the service creation, and sharing lessons learned (we have A LOT!). We also plan to cover other video services in future tech blogs. Follow the Netflix Tech Blog to stay up to date.
Acknowledgments
A big shout out to the CIS team for their outstanding work in building the Cosmos platform and their receptiveness to feedback from service developers.
We want to express our appreciation to our users, the Streaming Encoding Pipeline team, and the Video Engineering team. Just like our feedback helps iron out the platform, the feedback from our users has been instrumental in building high-quality services.
We also want to thank Christos Bampis and Zhi Li for their significant contributions to video services, and our two former team members, Chao Chen and Megha Manohara for contributing to the early development of this project.
Footnotes
- Formerly known as Media Cloud Engineering/MCE team.
- The actual number of video services is more than listed here. Some of them are Netflix-specific and thus omitted from this blog.
Rebuilding Netflix Video Processing Pipeline with Microservices was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.
Do you have FOPO (fear of other people’s opinions)? Dr. Michael Gervais helps us overcome it.
Post Syndicated from Talks at Google original https://www.youtube.com/watch?v=gymHbsssITo
Strengthen the DevOps pipeline and protect data with AWS Secrets Manager, AWS KMS, and AWS Certificate Manager
Post Syndicated from Magesh Dhanasekaran original https://aws.amazon.com/blogs/security/strengthen-the-devops-pipeline-and-protect-data-with-aws-secrets-manager-aws-kms-and-aws-certificate-manager/
In this blog post, we delve into using Amazon Web Services (AWS) data protection services such as Amazon Secrets Manager, AWS Key Management Service (AWS KMS), and AWS Certificate Manager (ACM) to help fortify both the security of the pipeline and security in the pipeline. We explore how these services contribute to the overall security of the DevOps pipeline infrastructure while enabling seamless integration of data protection measures. We also provide practical insights by demonstrating the implementation of these services within a DevOps pipeline for a three-tier WordPress web application deployed using Amazon Elastic Kubernetes Service (Amazon EKS).
DevOps pipelines involve the continuous integration, delivery, and deployment of cloud infrastructure and applications, which can store and process sensitive data. The increasing adoption of DevOps pipelines for cloud infrastructure and application deployments has made the protection of sensitive data a critical priority for organizations.
Some examples of the types of sensitive data that must be protected in DevOps pipelines are:
- Credentials: Usernames and passwords used to access cloud resources, databases, and applications.
- Configuration files: Files that contain settings and configuration data for applications, databases, and other systems.
- Certificates: TLS certificates used to encrypt communication between systems.
- Secrets: Any other sensitive data used to access or authenticate with cloud resources, such as private keys, security tokens, or passwords for third-party services.
Unintended access or data disclosure can have serious consequences such as loss of productivity, legal liabilities, financial losses, and reputational damage. It’s crucial to prioritize data protection to help mitigate these risks effectively.
The concept of security of the pipeline encompasses implementing security measures to protect the entire DevOps pipeline—the infrastructure, tools, and processes—from potential security issues. While the concept of security in the pipeline focuses on incorporating security practices and controls directly into the development and deployment processes within the pipeline.
By using Secrets Manager, AWS KMS, and ACM, you can strengthen the security of your DevOps pipelines, safeguard sensitive data, and facilitate secure and compliant application deployments. Our goal is to equip you with the knowledge and tools to establish a secure DevOps environment, providing the integrity of your pipeline infrastructure and protecting your organization’s sensitive data throughout the software delivery process.
Sample application architecture overview
WordPress was chosen as the use case for this DevOps pipeline implementation due to its popularity, open source nature, containerization support, and integration with AWS services. The sample architecture for the WordPress application in the AWS cloud uses the following services:
- Amazon Route 53: A DNS web service that routes traffic to the correct AWS resource.
- Amazon CloudFront: A global content delivery network (CDN) service that securely delivers data and videos to users with low latency and high transfer speeds.
- AWS WAF: A web application firewall that protects web applications from common web exploits.
- AWS Certificate Manager (ACM): A service that provides SSL/TLS certificates to enable secure connections.
- Application Load Balancer (ALB): Routes traffic to the appropriate container in Amazon EKS.
- Amazon Elastic Kubernetes Service (Amazon EKS): A scalable and highly available Kubernetes cluster to deploy containerized applications.
- Amazon Relational Database Service (Amazon RDS): A managed relational database service that provides scalable and secure databases for applications.
- AWS Key Management Service (AWS KMS): A key management service that allows you to create and manage the encryption keys used to protect your data at rest.
- AWS Secrets Manager: A service that provides the ability to rotate, manage, and retrieve database credentials.
- AWS CodePipeline: A fully managed continuous delivery service that helps to automate release pipelines for fast and reliable application and infrastructure updates.
- AWS CodeBuild: A fully managed continuous integration service that compiles source code, runs tests, and produces ready-to-deploy software packages.
- AWS CodeCommit: A secure, highly scalable, fully managed source-control service that hosts private Git repositories.
Before we explore the specifics of the sample application architecture in Figure 1, it’s important to clarify a few aspects of the diagram. While it displays only a single Availability Zone (AZ), please note that the application and infrastructure can be developed to be highly available across multiple AZs to improve fault tolerance. This means that even if one AZ is unavailable, the application remains operational in other AZs, providing uninterrupted service to users.
Figure 1: Sample application architecture
The flow of the data protection services in the post and depicted in Figure 1 can be summarized as follows:
First, we discuss securing your pipeline. You can use Secrets Manager to securely store sensitive information such as Amazon RDS credentials. We show you how to retrieve these secrets from Secrets Manager in your DevOps pipeline to access the database. By using Secrets Manager, you can protect critical credentials and help prevent unauthorized access, strengthening the security of your pipeline.
Next, we cover data encryption. With AWS KMS, you can encrypt sensitive data at rest. We explain how to encrypt data stored in Amazon RDS using AWS KMS encryption, making sure that it remains secure and protected from unauthorized access. By implementing KMS encryption, you add an extra layer of protection to your data and bolster the overall security of your pipeline.
Lastly, we discuss securing connections (data in transit) in your WordPress application. ACM is used to manage SSL/TLS certificates. We show you how to provision and manage SSL/TLS certificates using ACM and configure your Amazon EKS cluster to use these certificates for secure communication between users and the WordPress application. By using ACM, you can establish secure communication channels, providing data privacy and enhancing the security of your pipeline.
Note: The code samples in this post are only to demonstrate the key concepts. The actual code can be found on GitHub.
Securing sensitive data with Secrets Manager
In this sample application architecture, Secrets Manager is used to store and manage sensitive data. The AWS CloudFormation template provided sets up an Amazon RDS for MySQL instance and securely sets the master user password by retrieving it from Secrets Manager using KMS encryption.
Here’s how Secrets Manager is implemented in this sample application architecture:
- Creating a Secrets Manager secret.
- Create a Secrets Manager secret that includes the Amazon RDS database credentials using CloudFormation.
- The secret is encrypted using an AWS KMS customer managed key.
- Sample code:
The ManageMasterUserPassword: true line in the CloudFormation template indicates that the stack will manage the master user password for the Amazon RDS instance. To securely retrieve the password for the master user, the CloudFormation template uses the MasterUserSecret parameter, which retrieves the password from Secrets Manager. The KmsKeyId: !Ref RDSMySqlSecretEncryption line specifies the KMS key ID that will be used to encrypt the secret in Secrets Manager.
By setting the MasterUserSecret parameter to retrieve the password from Secrets Manager, the CloudFormation stack can securely retrieve and set the master user password for the Amazon RDS MySQL instance without exposing it in plain text. Additionally, specifying the KMS key ID for encryption adds another layer of security to the secret stored in Secrets Manager.
- Retrieving secrets from Secrets Manager.
- The secrets store CSI driver is a Kubernetes-native driver that provides a common interface for Secrets Store integration with Amazon EKS. The secrets-store-csi-driver-provider-aws is a specific provider that provides integration with the Secrets Manager.
- To set up Amazon EKS, the first step is to create a SecretProviderClass, which specifies the secret ID of the Amazon RDS database. This SecretProviderClass is then used in the Kubernetes deployment object to deploy the WordPress application and dynamically retrieve the secrets from the secret manager during deployment. This process is entirely dynamic and verifies that no secrets are recorded anywhere. The SecretProviderClass is created on a specific app namespace, such as the wp namespace.
- Sample code:
When using Secrets manager, be aware of the following best practices for managing and securing Secrets Manager secrets:
- Use AWS Identity and Access Management (IAM) identity policies to define who can perform specific actions on Secrets Manager secrets, such as reading, writing, or deleting them.
- Secrets Manager resource policies can be used to manage access to secrets at a more granular level. This includes defining who has access to specific secrets based on attributes such as IP address, time of day, or authentication status.
- Encrypt the Secrets Manager secret using an AWS KMS key.
- Using CloudFormation templates to automate the creation and management of Secrets Manager secrets including rotation.
- Use AWS CloudTrail to monitor access and changes to Secrets Manager secrets.
- Use CloudFormation hooks to validate the Secrets Manager secret before and after deployment. If the secret fails validation, the deployment is rolled back.
Encrypting data with AWS KMS
Data encryption involves converting sensitive information into a coded form that can only be accessed with the appropriate decryption key. By implementing encryption measures throughout your pipeline, you make sure that even if unauthorized individuals gain access to the data, they won’t be able to understand its contents.
Here’s how data at rest encryption using AWS KMS is implemented in this sample application architecture:
- Amazon RDS secret encryption
- Encrypting secrets: An AWS KMS customer managed key is used to encrypt the secrets stored in Secrets Manager to ensure their confidentiality during the DevOps build process.
- Sample code:
- Amazon RDS data encryption
- Enable encryption for an Amazon RDS instance using CloudFormation. Specify the KMS key ARN in the CloudFormation stack and RDS will use the specified KMS key to encrypt data at rest.
- Sample code:
- Kubernetes Pods storage
- Use encrypted Amazon Elastic Block Store (Amazon EBS) volumes to store configuration data. Create a managed encrypted Amazon EBS volume using the following code snippet, and then deploy a Kubernetes pod with the persistent volume claim (PVC) mounted as a volume.
- Sample code:
- Amazon ECR
- To secure data at rest in Amazon Elastic Container Registry (Amazon ECR), enable encryption at rest for Amazon ECR repositories using the AWS Management Console or AWS Command Line Interface (AWS CLI). ECR uses AWS KMS to encrypt the data at rest.
- Create a KMS key for Amazon ECR and use that key to encrypt the data at rest.
- Automate the creation of encrypted ECR repositories and enable encryption at rest using a DevOps pipeline, use CodePipeline to automate the deployment of the CloudFormation stack.
- Define the creation of encrypted Amazon ECR repositories as part of the pipeline.
- Sample code:
AWS best practices for managing encryption keys in an AWS environment
To effectively manage encryption keys and verify the security of data at rest in an AWS environment, we recommend the following best practices:
- Use separate AWS KMS customer managed KMS keys for data classifications to provide better control and management of keys.
- Enforce separation of duties by assigning different roles and responsibilities for key management tasks, such as creating and rotating keys, setting key policies, or granting permissions. By segregating key management duties, you can reduce the risk of accidental or intentional key compromise and improve overall security.
- Use CloudTrail to monitor AWS KMS API activity and detect potential security incidents.
- Rotate KMS keys as required by your regulatory requirements.
- Use CloudFormation hooks to validate KMS key policies to verify that they align with organizational and regulatory requirements.
Following these best practices and implementing encryption at rest for different services such as Amazon RDS, Kubernetes Pods storage, and Amazon ECR, will help ensure that data is encrypted at rest.
Securing communication with ACM
Secure communication is a critical requirement for modern environments and implementing it in a DevOps pipeline is crucial for verifying that the infrastructure is secure, consistent, and repeatable across different environments. In this WordPress application running on Amazon EKS, ACM is used to secure communication end-to-end. Here’s how to achieve this:
- Provision TLS certificates with ACM using a DevOps pipeline
- To provision TLS certificates with ACM in a DevOps pipeline, automate the creation and deployment of TLS certificates using ACM. Use AWS CloudFormation templates to create the certificates and deploy them as part of infrastructure as code. This verifies that the certificates are created and deployed consistently and securely across multiple environments.
- Sample code:
- Provisioning of ALB and integration of TLS certificate using AWS ALB Ingress Controller for Kubernetes
- Use a DevOps pipeline to create and configure the TLS certificates and ALB. This verifies that the infrastructure is created consistently and securely across multiple environments.
- Sample code:
- CloudFront and ALB
- To secure communication between CloudFront and the ALB, verify that the traffic from the client to CloudFront and from CloudFront to the ALB is encrypted using the TLS certificate.
- Sample code:
- ALB to Kubernetes Pods
- To secure communication between the ALB and the Kubernetes Pods, use the Kubernetes ingress resource to terminate SSL/TLS connections at the ALB. The ALB sends the PROTO metadata http connection header to the WordPress web server. The web server checks the incoming traffic type (http or https) and enables the HTTPS connection only hereafter. This verifies that pod responses are sent back to ALB only over HTTPS.
- Additionally, using the X-Forwarded-Proto header can help pass the original protocol information and help avoid issues with the $_SERVER[‘HTTPS’] variable in WordPress.
- Sample code:
- Kubernetes Pods to Amazon RDS
- To secure communication between the Kubernetes Pods in Amazon EKS and the Amazon RDS database, use SSL/TLS encryption on the database connection.
- Configure an Amazon RDS MySQL instance with enhanced security settings to verify that only TLS-encrypted connections are allowed to the database. This is achieved by creating a DB parameter group with a parameter called require_secure_transport set to ‘1‘. The WordPress configuration file is also updated to enable SSL/TLS communication with the MySQL database. Then enable the TLS flag on the MySQL client and the Amazon RDS public certificate is passed to ensure that the connection is encrypted using the TLS_AES_256_GCM_SHA384 protocol. The sample code that follows focuses on enhancing the security of the RDS MySQL instance by enforcing encrypted connections and configuring WordPress to use SSL/TLS for communication with the database.
- Sample code:
In this architecture, AWS WAF is enabled at CloudFront to protect the WordPress application from common web exploits. AWS WAF for CloudFront is recommended and use AWS managed WAF rules to verify that web applications are protected from common and the latest threats.
Here are some AWS best practices for securing communication with ACM:
- Use SSL/TLS certificates: Encrypt data in transit between clients and servers. ACM makes it simple to create, manage, and deploy SSL/TLS certificates across your infrastructure.
- Use ACM-issued certificates: This verifies that your certificates are trusted by major browsers and that they are regularly renewed and replaced as needed.
- Implement certificate revocation: Implement certificate revocation for SSL/TLS certificates that have been compromised or are no longer in use.
- Implement strict transport security (HSTS): This helps protect against protocol downgrade attacks and verifies that SSL/TLS is used consistently across sessions.
- Configure proper cipher suites: Configure your SSL/TLS connections to use only the strongest and most secure cipher suites.
Monitoring and auditing with CloudTrail
In this section, we discuss the significance of monitoring and auditing actions in your AWS account using CloudTrail. CloudTrail is a logging and tracking service that records the API activity in your AWS account, which is crucial for troubleshooting, compliance, and security purposes. Enabling CloudTrail in your AWS account and securely storing the logs in a durable location such as Amazon Simple Storage Service (Amazon S3) with encryption is highly recommended to help prevent unauthorized access. Monitoring and analyzing CloudTrail logs in real-time using CloudWatch Logs can help you quickly detect and respond to security incidents.
In a DevOps pipeline, you can use infrastructure-as-code tools such as CloudFormation, CodePipeline, and CodeBuild to create and manage CloudTrail consistently across different environments. You can create a CloudFormation stack with the CloudTrail configuration and use CodePipeline and CodeBuild to build and deploy the stack to different environments. CloudFormation hooks can validate the CloudTrail configuration to verify it aligns with your security requirements and policies.
It’s worth noting that the aspects discussed in the preceding paragraph might not apply if you’re using AWS Organizations and the CloudTrail Organization Trail feature. When using those services, the management of CloudTrail configurations across multiple accounts and environments is streamlined. This centralized approach simplifies the process of enforcing security policies and standards uniformly throughout the organization.
By following these best practices, you can effectively audit actions in your AWS environment, troubleshoot issues, and detect and respond to security incidents proactively.
Complete code for sample architecture for deployment
The complete code repository for the sample WordPress application architecture demonstrates how to implement data protection in a DevOps pipeline using various AWS services. The repository includes both infrastructure code and application code that covers all aspects of the sample architecture and implementation steps.
The infrastructure code consists of a set of CloudFormation templates that define the resources required to deploy the WordPress application in an AWS environment. This includes the Amazon Virtual Private Cloud (Amazon VPC), subnets, security groups, Amazon EKS cluster, Amazon RDS instance, AWS KMS key, and Secrets Manager secret. It also defines the necessary security configurations such as encryption at rest for the RDS instance and encryption in transit for the EKS cluster.
The application code is a sample WordPress application that is containerized using Docker and deployed to the Amazon EKS cluster. It shows how to use the Application Load Balancer (ALB) to route traffic to the appropriate container in the EKS cluster, and how to use the Amazon RDS instance to store the application data. The code also demonstrates how to use AWS KMS to encrypt and decrypt data in the application, and how to use Secrets Manager to store and retrieve secrets. Additionally, the code showcases the use of ACM to provision SSL/TLS certificates for secure communication between the CloudFront and the ALB, thereby ensuring data in transit is encrypted, which is critical for data protection in a DevOps pipeline.
Conclusion
Strengthening the security and compliance of your application in the cloud environment requires automating data protection measures in your DevOps pipeline. This involves using AWS services such as Secrets Manager, AWS KMS, ACM, and AWS CloudFormation, along with following best practices.
By automating data protection mechanisms with AWS CloudFormation, you can efficiently create a secure pipeline that is reproducible, controlled, and audited. This helps maintain a consistent and reliable infrastructure.
Monitoring and auditing your DevOps pipeline with AWS CloudTrail is crucial for maintaining compliance and security. It allows you to track and analyze API activity, detect any potential security incidents, and respond promptly.
By implementing these best practices and using data protection mechanisms, you can establish a secure pipeline in the AWS cloud environment. This enhances the overall security and compliance of your application, providing a reliable and protected environment for your deployments.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
Enable metric-based and scheduled scaling for Amazon Managed Service for Apache Flink
Post Syndicated from Francisco Morillo original https://aws.amazon.com/blogs/big-data/enable-metric-based-and-scheduled-scaling-for-amazon-managed-service-for-apache-flink/
Thousands of developers use Apache Flink to build streaming applications to transform and analyze data in real time. Apache Flink is an open source framework and engine for processing data streams. It’s highly available and scalable, delivering high throughput and low latency for the most demanding stream-processing applications. Monitoring and scaling your applications is critical to keep your applications running successfully in a production environment.
Amazon Managed Service for Apache Flink is a fully managed service that reduces the complexity of building and managing Apache Flink applications. Amazon Managed Service for Apache Flink manages the underlying Apache Flink components that provide durable application state, metrics, logs, and more.
In this post, we show a simplified way to automatically scale up and down the number of KPUs (Kinesis Processing Units; 1 KPU is 1 vCPU and 4 GB of memory) of your Apache Flink applications with Amazon Managed Service for Apache Flink. We show you how to scale by using metrics such as CPU, memory, backpressure, or any custom metric of your choice. Additionally, we show how to perform scheduled scaling, allowing you to adjust your application’s capacity at specific times, particularly when dealing with predictable workloads. We also share an AWS CloudFormation utility to help you implement auto scaling quickly with your Amazon Managed Service for Apache Flink applications.
Metric-based scaling
This section describes how to implement a scaling solution for Amazon Managed Service for Apache Flink based on Amazon CloudWatch metrics. Amazon Managed Service for Apache Flink comes with an auto scaling option out of the box that scales out when container CPU utilization is above 75% for 15 minutes. This works well for many use cases; however, for some applications, you may need to scale based on a different metric, or trigger the scaling action at a certain point in time or by a different factor. You can customize your scaling policies and save costs by right-sizing your Amazon Managed Apache Flink applications the deploying this solution.
To perform metric-based scaling, we use CloudWatch alarms, Amazon EventBridge, AWS Step Functions, and AWS Lambda. You can choose from metrics coming from the source such as Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK), or metrics from the Amazon Managed Service for Apache Flink application. You can find these components in the CloudFormation template in the GitHub repo.
The following diagram shows how to scale an Amazon Managed Service for Apache Flink application in response to a CloudWatch alarm.

This solution uses the metric selected and creates two CloudWatch alarms that, depending on the threshold you use, trigger a rule in EventBridge to start running a Step Functions state machine. The following diagram illustrates the state machine workflow.
Note: Amazon Kinesis Data Analytics was renamed to Amazon Managed Service for Apache Flink August 2023
The Step Functions workflow consists of the following steps:
- The state machine describes the Amazon Managed Service for Apache Flink application, which will provide information related to the current number of KPUs in the application, as well if the application is being updated or is it running.
- The state machine invokes a Lambda function that, depending on which alarm was triggered, will scale the application up or down, following the parameters set in the CloudFormation template. When scaling the application, it will use the increase factor (either add/subtract or multiple/divide based on that factor) defined in the CloudFormation template. You can have different factors for scaling in or out. If you want to take a more cautious approach to scaling, you can use add/subtract and use an increase factor for scaling in/out of 1.
- If the application has reached the maximum or minimum number of KPUs set in the parameters of the CloudFormation template, the workflow stops. Keep in mind that Amazon Managed Service for Apache Flink applications have a default maximum of 64 KPUs (you can request to increase this limit). Do not specify a maximum value above 64 KPUs if you have not requested to increase the quota, because the scaling solution will get stuck by failing to update.
- If the workflow continues, because the allocated KPUs haven’t reached the maximum or minimum values, the workflow will wait for a period of time you specify, and then describe the application and see if it has finished updating.
- The workflow will continue to wait until the application has finished updating. When the application is updated, the workflow will wait for a period of time you specify in the CloudFormation template, to allow the metric to fall within the threshold and have the CloudWatch rule change from ALARM state to OK.
- If the metric is still in ALARM state, the workflow will start again and continue to scale the application either up or down. If the metric is in OK state, the workflow will stop.
For applications that read from a Kinesis Data Streams source, you can use the metric millisBehindLatest. If using a Kafka source, you can use records lag max for scaling events. These metrics capture how far behind your application is from the head of the stream. You can also use a custom metric that you have registered in your Apache Flink applications.
The sample CloudFormation template allows you to select one of the following metrics:
- Amazon Managed Service for Apache Flink application metrics – Requires an application name:
- ContainerCPUUtilization – Overall percentage of CPU utilization across task manager containers in the Flink application cluster.
- ContainerMemoryUtilization – Overall percentage of memory utilization across task manager containers in the Flink application cluster.
- BusyTimeMsPerSecond – Time in milliseconds the application is busy (neither idle nor back pressured) per second.
- BackPressuredTimeMsPerSecond – Time in milliseconds the application is back pressured per second.
- LastCheckpointDuration – Time in milliseconds it took to complete the last checkpoint.
- Kinesis Data Streams metrics – Requires the data stream name:
- MillisBehindLatest – The number of milliseconds the consumer is behind the head of the stream, indicating how far behind the current time the consumer is.
- IncomingRecords – The number of records successfully put to the Kinesis data stream over the specified time period. If no records are coming, this metric will be null and you won’t be able to scale down.
- Amazon MSK metrics – Requires the cluster name, topic name, and consumer group name):
- MaxOffsetLag – The maximum offset lag across all partitions in a topic.
- SumOffsetLag – The aggregated offset lag for all the partitions in a topic.
- EstimatedMaxTimeLag – The time estimate (in seconds) to drain MaxOffsetLag.
- Custom metrics – Metrics you can define as part of your Apache Flink applications. Most common metrics are counters (continuously increase) or gauges (can be updated with last value). For this solution, you need to add the kinesisAnalytics dimension to the metric group. You also need to provide the custom metric name as a parameter in the CloudFormation template. If you need to use more dimensions in your custom metric, you need to modify the CloudWatch alarm so it’s able to use your specific metric. For more information on custom metrics, see Using Custom Metrics with Amazon Managed Service for Apache Flink.
The CloudFormation template deploys the resources as well as the auto scaling code. You only need to specify the name of the Amazon Managed Service for Apache Flink application, the metric to which you want to scale your application in or out, and the thresholds for triggering an alarm. The solution by default will use the average aggregation for metrics and a period duration of 60 seconds for each data point. You can configure the evaluation periods and data points to alarm when defining the CloudFormation template.
Scheduled scaling
This section describes how to implement a scaling solution for Amazon Managed Service for Apache Flink based on a schedule. To perform scheduled scaling, we use EventBridge and Lambda, as illustrated in the following figure.

These components are available in the CloudFormation template in the GitHub repo.
The EventBridge scheduler is triggered based on the parameters set when deploying the CloudFormation template. You define the KPU of the applications when running at peak times, as well as the KPU for non-peak times. The application runs with those KPU parameters depending on the time of day.
As with the previous example for metric-based scaling, the CloudFormation template deploys the resources and scaling code required. You only need to specify the name of the Amazon Managed Service for Apache Flink application and the schedule for the scaler to modify the application to the set number of KPUs.
Considerations for scaling Flink applications using metric-based or scheduled scaling
Be aware of the following when considering these solutions:
- When scaling Amazon Managed Service for Apache Flink applications in or out, you can choose to either increase the overall application parallelism or modify the parallelism per KPU. The latter allows you to set the number of parallel tasks that can be scheduled per KPU. This sample only updates the overall parallelism, not the parallelism per KPU.
- If SnapshotsEnabled is set to true in ApplicationSnapshotConfiguration, Amazon Managed Service for Apache Flink will automatically pause the application, take a snapshot, and then restore the application with the updated configuration whenever it is updated or scaled. This process may result in downtime for the application, depending on the state size, but there will be no data loss. When using metric-based scaling, you have to choose a minimum and a maximum threshold of KPU the application can have. Depending on by how much you perform the scaling, if the new desired KPU is bigger or lower than your thresholds, the solution will update the KPUs to be equal to your thresholds.
- When using metric-based scaling, you also have to choose a cooling down period. This is the amount of time you want your application to wait after being updated, to see if the metric has gone from ALARM status to OK status. This value depends on how long are you willing to wait before another scaling event to occur.
- With the metric-based scaling solution, you are limited to choosing the metrics that are listed in the CloudFormation template. However, you can modify the alarms to use any available metric in CloudWatch.
- If your application is required to run without interruptions for periods of time, we recommend using scheduled scaling, to limit scaling to non-critical times.
Summary
In this post, we covered how you can enable custom scaling for Amazon Managed Service for Apache Flink applications using enhanced monitoring features from CloudWatch integrated with Step Functions and Lambda. We also showed how you can configure a schedule to scale an application using EventBridge. Both of these samples and many more can be found in the GitHub repo.
About the Authors
Deepthi Mohan is a Principal PMT on the Amazon Managed Service for Apache Flink team.
Francisco Morillo is a Streaming Solutions Architect at AWS. Francisco works with AWS customers, helping them design real-time analytics architectures using AWS services, supporting Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink.
Achieve high availability in Amazon OpenSearch Multi-AZ with Standby enabled domains: A deep dive into failovers
Post Syndicated from Anshu Agarwal original https://aws.amazon.com/blogs/big-data/achieve-high-availability-in-amazon-opensearch-multi-az-with-standby-enabled-domains-a-deep-dive-into-failovers/
Amazon OpenSearch Service recently introduced Multi-AZ with Standby, a deployment option designed to provide businesses with enhanced availability and consistent performance for critical workloads. With this feature, managed clusters can achieve 99.99% availability while remaining resilient to zonal infrastructure failures.
In this post, we explore how search and indexing works with Multi-AZ with Standby and delve into the underlying mechanisms that contribute to its reliability, simplicity, and fault tolerance.
Background
Multi-AZ with Standby deploys OpenSearch Service domain instances across three Availability Zones, with two zones designated as active and one as standby. This configuration ensures consistent performance, even in the event of zonal failures, by maintaining the same capacity across all zones. Importantly, this standby zone follows a statically stable design, eliminating the need for capacity provisioning or data movement during failures.
During regular operations, the active zone handles coordinator traffic for both read and write requests, as well as shard query traffic. The standby zone, on the other hand, only receives replication traffic. OpenSearch Service utilizes a synchronous replication protocol for write requests. This enables the service to promptly promote a standby zone to active status in the event of a failure (mean time to failover <= 1 minute), known as a zonal failover. The previously active zone is then demoted to standby mode, and recovery operations commence to restore its healthy state.
Search traffic routing and failover to guarantee high availability
In an OpenSearch Service domain, a coordinator is any node that handles HTTP(S) requests, especially indexing and search requests. In a Multi-AZ with Standby domain, the data nodes in the active zone act as coordinators for search requests.
During the query phase of a search request, the coordinator determines the shards to be queried and sends a request to the data node hosting the shard copy. The query is run locally on each shard and matched documents are returned to the coordinator node. The coordinator node, which is responsible for sending the request to nodes containing shard copies, runs the process in two steps. First, it creates an iterator that defines the order in which nodes need to be queried for a shard copy so that traffic is uniformly distributed across shard copies. Subsequently, the request is sent to the relevant nodes.
In order to create an ordered list of nodes to be queried for a shard copy, the coordinator node uses various algorithms. These algorithms include round-robin selection, adaptive replica selection, preference-based shard routing, and weighted round-robin.
For Multi-AZ with Standby, the weighted round-robin algorithm is used for shard copy selection. In this approach, active zones are assigned a weight of 1, and the standby zone is assigned a weight of 0. This ensures that no read traffic is sent to data nodes in the standby Availability Zone.
The weights are stored in cluster state metadata as a JSON object:
As shown in the following screenshot, the us-east-1b Region has its zone status as StandBy, indicating that the data nodes in this Availability Zone are in standby state and don’t receive search or indexing requests from the load balancer.

To maintain steady-state operations, the standby Availability Zone is rotated every 30 minutes, ensuring all network parts are covered across Availability Zones. This proactive approach verifies the availability of read paths, further enhancing the system’s resilience during potential failures. The following diagram illustrates this architecture.

In the preceding diagram, Zone-C has a weighted round-robin weight set to zero. This ensures that the data nodes in the standby zone don’t receive any indexing or search traffic. When the coordinator queries data nodes for shard copies, it uses a weighted round-robin weight to decide on the order in which nodes to be queried. Because the weight is zero for the standby Availability Zone, coordinator requests are not sent.
In an OpenSearch Service cluster, the active and standby zones can be checked at any time using Availability Zone rotation metrics, as shown in the following screenshot.

During zonal outages, the standby Availability Zone seamlessly switches to fail-open mode for search requests. This means that the shard query traffic is routed to all Availability Zones, even those in standby, when a healthy shard copy is unavailable in the active Availability Zone. This fail-open approach safeguards search requests from disruption during failures, ensuring continuous service. The following diagram illustrates this architecture.

In the preceding diagram, during the steady state, the shard query traffic is sent to the data node in the active Availability Zones (Zone-A and Zone-B). Due to node failures in Zone-A, the standby Availability Zone (Zone-C) fails open to take shard query traffic so that there isn’t any impact to the search requests. Eventually, Zone-A is detected as unhealthy and the read failover switches the standby to Zone-A.
How failover ensures high availability during write impairment
The OpenSearch Service replication model follows a primary backup model, characterized by its synchronous nature, where acknowledgement from all shard copies is necessary before a write request can be acknowledged to the user. One notable drawback of this replication model is its susceptibility to slowdowns in the event of any impairment in the write path. These systems rely on an active leader node to identify failures or delays and then broadcast this information to all nodes. The duration it takes to detect these issues (mean time to detect) and subsequently resolve them (mean time to repair) largely determines how long the system will operate in an impaired state. Additionally, any networking event that affects inter-zone communications can significantly impede write requests due to the synchronous nature of replication.
OpenSearch Service utilizes an internal node-to-node communication protocol for replicating write traffic and coordinating metadata updates through an elected leader. Consequently, putting the zone experiencing stress in standby wouldn’t effectively address the issue of write impairment.
Zonal write failover: Cutting off inter-zone replication traffic
For Multi-AZ with Standby, to mitigate potential performance issues caused during unforeseen events like zonal failures and networking events, zonal write failover is an effective approach. This approach involves graceful removal of nodes in the impacted zone from the cluster, effectively cutting off ingress and egress traffic between zones. By severing the inter-zone replication traffic, the impact of zonal failures can be contained within the affected zone. This provides a more predictable experience for customers and ensures that the system continues to operate reliably.
Graceful write failover
The orchestration of a write failover within OpenSearch Service is carried out by the elected leader node through a well-defined mechanism. This mechanism involves a consensus protocol for cluster state publication, ensuring unanimous agreement among all nodes to designate a single zone (at all times) for decommissioning. Importantly, metadata related to the affected zone is replicated across all nodes to ensure its persistence, even during a full restart in the event of an outage.
Furthermore, the leader node ensures a smooth and graceful transition by initially placing the nodes in the impacted zones on standby for a duration of 5 minutes before initiating I/O fencing. This deliberate approach prevents any new coordinator traffic or shard query traffic from being directed to the nodes within the impacted zone. This, in turn, allow these nodes to complete their ongoing tasks gracefully and gradually handle any inflight requests before being taken out of service. The following diagram illustrates this architecture.

In the process of implementing a write failover for a leader node, OpenSearch Service follows these key steps:
- Leader abdication – If the leader node happens to be located in a zone scheduled for write failover, the system ensures that the leader node voluntarily steps down from its leadership role. This abdication is carried out in a controlled manner, and the entire process is handed over to another eligible node, which then takes charge of the actions required.
- Prevent reelection of to-be-decommissioned leader – To prevent the reelection of a leader from a zone marked for write failover, when the eligible leader node initiates the write failover action, it takes measures to ensure that any to-be-decommissioned leader nodes do not participate in any further elections. This is achieved by excluding the to-be-decommissioned leader node from the voting configuration, effectively preventing it from voting during any critical phase of the cluster’s operation.
Metadata related to the write failover zone is stored within the cluster state, and this information is published to all nodes in the distributed OpenSearch Service cluster as follows:
The following screenshot depicts that during a networking slowdown in a zone, write failover helps recover availability.

Zonal recovery after write failover
The process of zonal recommissioning plays a crucial role in the recovery phase following a zonal write failover. After the impacted zone has been restored and is considered stable, the nodes that were previously decommissioned will rejoin the cluster. This recommissioning typically occurs within a time frame of 2 minutes after the zone has been recommissioned.
This enables them to synchronize with their peer nodes and initiates the recovery process for replica shards, effectively restoring the cluster to its desired state.
Conclusion
The introduction of OpenSearch Service Multi-AZ with Standby provides businesses with a powerful solution to achieve high availability and consistent performance for critical workloads. With this deployment option, businesses can enhance their infrastructure’s resilience, simplify cluster configuration and management, and enforce best practices. With features like weighted round-robin shard copy selection, proactive failover mechanisms, and fail-open standby Availability Zones, OpenSearch Service Multi-AZ with Standby ensures a reliable and efficient search experience for demanding enterprise environments.
For more information about Multi-AZ with Standby, refer to Amazon OpenSearch Service Under the Hood: Multi-AZ with Standby.
About the Author
Anshu Agarwal is a Senior Software Engineer working on AWS OpenSearch at Amazon Web Services. She is passionate about solving problems related to building scalable and highly reliable systems.
Rishab Nahata is a Software Engineer working on OpenSearch at Amazon Web Services. He is fascinated about solving problems in distributed systems. He is active contributor to OpenSearch.
Bukhtawar Khan is a Principal Engineer working on Amazon OpenSearch Service. He is interested in distributed and autonomous systems. He is an active contributor to OpenSearch.
Ranjith Ramachandra is an Engineering Manager working on Amazon OpenSearch Service at Amazon Web Services.