Thunderbird for Android now available

Post Syndicated from jzb original https://lwn.net/Articles/996326/

The first stable release of the Thunderbird mail client for Android is now available:

Just over two years ago, we announced
our plans
to bring Thunderbird to Android by taking K-9 Mail under
our wing. The journey took a little
longer than we had originally anticipated
and there was a lot to
learn along the way, but the wait is finally over! For all of you who
have ever asked “when is Thunderbird for Android coming out?”, the
answer is – today!

It is immediately available on the Google
Play Store
, via GitHub
Releases
, or from the Thunderbird web site, and
it will be “coming soon” to the F-Droid repository for FOSS Android
applications. See the release
notes
for detailed information about Thunderbird 8.0 for
Android.

Simson Garfinkel on Spooky Cryptographic Action at a Distance

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/10/simson-garfinkel-on-spooky-cryptographic-action-at-a-distance.html

Excellent read. One example:

Consider the case of basic public key cryptography, in which a person’s public and private key are created together in a single operation. These two keys are entangled, not with quantum physics, but with math.

When I create a virtual machine server in the Amazon cloud, I am prompted for an RSA public key that will be used to control access to the machine. Typically, I create the public and private keypair on my laptop and upload the public key to Amazon, which bakes my public key into the server’s administrator account. My laptop and that remove server are thus entangled, in that the only way to log into the server is using the key on my laptop. And because that administrator account can do anything to that server­—read the sensitivity data, hack the web server to install malware on people who visit its web pages, or anything else I might care to do­—the private key on my laptop represents a security risk for that server.

Here’s why it’s impossible to evaluate a server and know if it is secure: as long that private key exists on my laptop, that server has a vulnerability. But if I delete that private key, the vulnerability goes away. By deleting the data, I have removed a security risk from the server and its security has increased. This is true entanglement! And it is spooky: not a single bit has changed on the server, yet it is more secure.

Read it all.

Security updates for Wednesday

Post Syndicated from jzb original https://lwn.net/Articles/996310/

Security updates have been issued by AlmaLinux (buildah), Debian (python-git, texlive-bin, and xorg-server), Mageia (chromium-browser-stable), Red Hat (kernel), SUSE (Botan, go1.22-openssl, go1.23-openssl, grafana, libgsf, pcp, pgadmin4, python310-pytest-html, python313, xorg-x11-server, and xwayland), and Ubuntu (nano, python-urllib3, and xorg-server, xwayland).

The Importance of Asset Context in Attack Surface Management.

Post Syndicated from Jon Schipp original https://blog.rapid7.com/2024/10/30/the-importance-of-asset-context-in-attack-surface-management/

The Importance of Asset Context in Attack Surface Management.

This is the last of the four blogs (Help, I can’t see! A Primer for Attack Surface Management Blog Series, The Main Components of an Attack Surface Management (ASM) Strategy, and Understanding your Attack Surface: Different Approaches to Asset Discovery)  covering the foundational elements of Attack Surface Management (ASM), and this topic covers one of the main drivers for ASM and why companies are investing in it, the context it delivers to inform better security decision making.

ASM goes far beyond traditional IT asset visibility by bringing in the relevant security context that helps teams better prioritize and remediate. In general, the more context that you can make sense of, the more equipped your teams will be to make good decisions and drive toward action.

A clear example of this can be seen in an investigation of a machine under an active threat recorded by your SIEM or XDR solution. You likely have thousands of assets in your environment where the security team is unclear about the machine’s purpose. You now leverage context from your ASM solution to learn that the machine has access to several critical business networks and that it has a high-risk exposure on it related to an ongoing active threat. It’s just a matter of time before compromise and lateral movement. This augmented context during an investigation enables you to immediately make this the number one priority for your team.

Another key example involves identities. By inventorying all the identities across your environment, you can easily determine which ones have MFA disabled, and further filter based on those that have administrative access to a business application. To improve this identity context even further,  you can pull in additional context from tools like KnowBe4 to understand how likely the user is to click on a phishing email based on their phishing training success rate. The marriage of identity data with security controls and business context helps teams better prioritize their most at-risk users for remediation.

Let’s look further at the key types of asset context that we believe are critical for effective ASM.

Business Context-Aware

The first, and arguably most important, is the asset’s business context. This enables teams to understand the business function and risk, as well as the chain of command for contact or remediation. Visibility into the chain of command provides teams with the system owner, primary user, and which department and leader they fall under.

This business context is often pulled from CMDBs such as ServiceNow, Directory Services, HR tools like Workday, and by ingesting tags from CSP and security tool data sources. To effectively leverage business context, organizations need to develop and maintain an information architecture across the environment. Business context also helps identify which assets are a key dependency for business critical applications.

Exposures & Security Controls-Aware

Understanding an asset’s vulnerabilities and exposures along with security control, mitigations, and business context is key to giving vulnerability teams the necessary means to make the best prioritization decisions. If a group of 100 machines all contain a Known Exploitable Vulnerability (KEV) that is being used in the wild by a specific piece of malware that is targeting your industry, your team may need to be up all night trying to remediate or mitigate this critical risk. But what if the majority of those same machines also have a security control or configuration in place that effectively causes that piece of malware to fail? Instead, your team can focus on a much smaller number within that group that lacks the required controls and focus on remediating those instead. Being able to harness all the available security context for assets enables teams to prioritize much more effectively.

Threat-Aware

Finally, threat context derived from SIEM, Threat Intelligence Platforms (TIP), and endpoint security tools enables security operations teams to gain insight into active threats and investigations when looking at an asset. It also enables teams to  threat-hunt across all asset data, understand the blast radius from a compromised machine, and use threat insights to prioritize response. If you can identify all machines that have a specific vulnerability and are also seeing TTPs related to it, remediation activities for these  machines can be prioritized.

Data Confidence, Aggregation & Correlation

A key factor in having confidence in security data and the context derived from it is having belief in the accuracy and integrity of the data itself. There are a few ways in which technology can help deliver that confidence. Because ASM is all about having visibility across your data and tooling silos, the final thing to consider is technology features related to an organization’s ability to analyze, troubleshoot, and configure data so that it matches your view of the attack surface. We can break this section into 3 main areas:

Unified Data Ingestion & Correlation

According to research from 451 Group, most security teams rely on between 11 and 30 different security tools to manage and secure their environments. Each of these tools only provides a partial view of the environment, and only from a particular perspective. As an example, Active Directory typically only sees Windows machines that are joined to the Domain Controller, DHCP only sees networked devices that have broadcasted and been given a lease, and CSPM tools only see cloud resources for Cloud Service Providers that have been configured.

Due to these visibility gaps, a holistic ASM solution must be able to see across these data silos and tools by ingesting and correlating data from many different sources, deduplicating it to deliver an accurate, continuously updated view of an organization’s asset landscape.

Data Transparency

Data transparency is all about giving users the ability to understand where their data has come from, how well the data is being ingested, and how the data is populated within the data model. This also enables users to follow & configure correlation logic. It is critical that you trust the data of a solution that is intended to become the ‘single source of truth’ for security data in your organization, so we cannot emphasize enough the importance of having the right visibility into how data is used in an ASM solution.

For reference, I’m including several examples of how data transparency is a core capability of Rapid7’s Surface Command.

In the image below, we’re looking at the distribution of raw asset records to uniquely correlated assets in an organization. The system has received over 200,000 raw assets from many different data sources, and is able to narrow it down through its asset correlation algorithm to 63,179 unique assets.

The Importance of Asset Context in Attack Surface Management.

The next example shows correlation effectiveness and property fulfillment (data fields with actual values) for Azure AD’s Device type. This capability is available on a per-connector basis and can be used to see how well the data source in question is correlating with other data sources (i.e., are they seeing the same assets?), and also how much of the data is being fulfilled by the API which can help pinpoint configuration issues that are limiting your view of your attack surface.

The Importance of Asset Context in Attack Surface Management.

The final example is a table view of all the data sources coming into the system and key insights from them. This can be used to assess the quality of your data sources and to debug issues like when duplicate records occur. In that case, correlation rules can be updated to reduce those duplications so users get the best correlation, and thus the best and most accurate view of their attack surface.

The Importance of Asset Context in Attack Surface Management.

This transparency into data ingestion and correlation is also critical when working with other stakeholders in the business, ensuring that everyone is in alignment on the most accurate data.

Data Prioritization

The final key aspect to successful ASM is being able to customize data in the way that an organization wants to see it. Teams rely on some tools more than others, and the weighting of those tools should match the overall preferences of the business. If Active Directory is your source of truth for ‘business owner’ and ‘department’ information over ServiceNow CMDB, then the system should be able to re-correlate the data based on the way an organization sees and utilizes the data.

Below, we show an example of how we are able to configure data prioritization in Rapid7’s Surface Command. Weighting the data can be configured on a per-property basis, so any ingestible and correlatable field can be customized to prioritize which tool should be preferred in the event of a data conflict. This enables teams to select and leverage the tools that they trust the most for specific data and use cases, so the attack surface matches the way they see their environment.

The Importance of Asset Context in Attack Surface Management.
[Example: Where ServiceNow takes priority on the Business Owner of an asset, followed by Azure AD.]

Conclusion: The Value of Context in Attack Surface Management

Over the past four blogs, I have tried to cover some of the key benefits and use cases for ASM. Much of it comes down to the core value that you can only protect what you know about, but in reality, it’s more complex than that.

The context that ASM solutions can provide you about both the external threat, and internal cyber risks, help security teams focus on what is most critical to protecting their organization. With the ever-growing number of vulnerabilities and non-patchable exposures, it just isn’t practical to expect to address everything, so prioritization is key. This is where the real value of ASM lies.

Once we understand our overall security posture, which assets are the most critical to the business, which services are the most exposed to attacks, we have the context needed to drive an effective cybersecurity program. We can take these insights and make them actionable, working with colleagues in DevOps and IT to harden machines and patch the most high-risk vulnerabilities. If we are successful in finding the gaps before the attacker, then we should also reduce the burden downstream on our SOC and IR teams.

I hope you found this blog series valuable. I’d encourage you to explore more information on Rapid7’s market-leading attack surface and exposure management solutions at https://www.rapid7.com/products/command/attack-surface-management-asm/.

Backblaze Partners with Opti9 and Adds Canadian Data Region

Post Syndicated from Teresa Dodson original https://www.backblaze.com/blog/backblaze-partners-with-opti9-and-adds-canadian-data-region/

A decorative image showing the Backblaze and Opti9 logos.

Backblaze and Opti9 are partnering up to bring Backblaze B2 Cloud Storage to joint customers around the world as well as businesses in Canada who are required to keep their data within national borders.

The who and the why

Opti9 is the international leader in hybrid cloud solutions that delivers managed cloud services, application development and modernization, backup and disaster recovery, security, and compliance solutions to businesses around the world. By bringing Backblaze into their solution set, Opti9 is onboarding high performance, low cost cloud storage that works within all the solutions they provide.

Increasingly, companies seeking managed services support are demanding solutions made up of best-in-breed providers. While traditional cloud platforms work against this principle of interoperability, Backblaze and solution providers like Opti9 are committed to delivering cloud solutions without the limitations, complexity, and high pricing that are holding businesses back.

As Jim Stechyson, the President of Opti9 put it:

Backblaze and Opti9 focus on empowering businesses with the best cloud solutions available. Being able to integrate the high performance and low total cost of ownership of Backblaze’s object storage into our set of solutions will greatly enhance our ability to drive success for our customers.

How to get started

Interested resellers or customers who want to start working with Opti9 and Backblaze today can go to the Opti9 website. Check out our joint S3 compatible hot storage offering and book your demo to get started.

Book a Demo ➔ 

For customers based in Canada, Backblaze will be opening a new data region centered in Toronto in the first quarter of 2025. As part of the partnership, Opti9 will be the exclusive provider of server backup solutions in the Canadian channel for Backblaze B2 Reserve and the Powered by Backblaze program.

More about the new Backblaze data region in Canada

The new Canadian data region gives businesses the freedom to access Backblaze’s open, interoperable cloud solution, while still allowing customers to benefit from local storage and compliance. Located in Toronto, Ontario, the data center has been assessed and maintains a security program that addresses the requirements of SOC 1 Type 2, SOC 2 Type 2, ISO 27001, PCI DSS, and HIPAA. The region will be available to customers in the first quarter of 2025. 

If you’d like to receive notifications about the data region opening date and when you can start storing data in Canada, you can sign up for the waitlist today.

Notify Me ➔ 

The post Backblaze Partners with Opti9 and Adds Canadian Data Region appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Monitoring a Complex Infrastructure Environment with Zabbix

Post Syndicated from Nyein Chan Zaw original https://blog.zabbix.com/monitoring-a-complex-infrastructure-environment-with-zabbix/28954/

Inviting the members of our global community to share their Zabbix dashboards with us prompted a flood of fascinating responses, and we’re highlighting a few of the most interesting submissions here on our blog. This week’s entry comes to us from Nyein Chan Zaw, who is based in Bangkok, Thailand and works as an Infrastructure Specialist for Green Will Solution. Read on to see how he uses his Zabbix dashboard to monitor a highly intricate infrastructure in real time. 

I appreciate the chance to share my dashboard, and I would also like to share a use case that demonstrates the practical implementation of Zabbix for real-time infrastructure monitoring.

This Zabbix dashboard provides a comprehensive view of the network’s real-time health, server availability, traffic patterns, and key performance metrics of essential infrastructure components. It is designed for monitoring production, office, and virtual server zones, including network devices, physical servers, and virtual machines. The current view is the first page of a two-page dashboard, which focuses on general network monitoring:

The second page is dedicated solely to monitoring infrastructure nodes:

Key features monitored

Traffic Monitoring: The dashboard tracks real-time traffic from critical network uplinks, including AIS and TRUE, offering visibility into bandwidth usage (e.g., 64.50 Kbps and 13.05 Kbps). It also monitors internal traffic and key devices like the FortiGate firewall, helping ensure optimal network performance and security.

Host Health Monitoring: CPU and memory utilization for top hosts (e.g., GW-WINDOW11, GW-AD-DOMAIN) are displayed, enabling efficient resource management. Alerts are triggered for high resource usage, allowing for a proactive response to performance issues.

Disk Usage: Disk space on key hosts, such as the Zabbix virtual machine and other core servers, is monitored to avoid file system over-utilization, which can lead to potential service interruptions.

Availability Overview: The dashboard provides a summary of host availability, including how many are available, unavailable, or have unknown statuses. Monitoring methods like active agent and SNMP are also shown, giving an overall view of network health.

Visual Topology Map: A detailed network map shows the production, office, virtual, and test zones, along with devices and connections. This visualization aids in quickly identifying problem areas and understanding how systems are interlinked.

Severity and Problem Monitoring: The dashboard classifies issues by severity, from critical problems to warnings. Real-time issues (such as VM downtime or system failures) are highlighted, enabling the team to resolve issues quickly.

Performance Metrics: Graphs display performance metrics, such as bandwidth usage and CPU load, offering insights into system bottlenecks or overuse, particularly in critical devices like firewalls.

Impact

This Zabbix dashboard enables an infrastructure team to efficiently monitor network performance, manage resource usage, and ensure device availability. The clear visual interface helps quickly identify issues, reducing downtime and ensuring higher reliability of critical services.

Conclusion

The first page of the dashboard demonstrates Zabbix’s capabilities for centralized monitoring across large infrastructures. By integrating data from network devices, servers, and virtual machines, it empowers IT teams to make informed decisions and address issues before they escalate. The second page provides a detailed focus on the infrastructure nodes, ensuring that all critical systems are effectively monitored for optimal operation across the IT environment.

The post Monitoring a Complex Infrastructure Environment with Zabbix appeared first on Zabbix Blog.

Слаби амбиции, непознаване на избирателите – мижав резултат

Post Syndicated from Светла Енчева original https://www.toest.bg/slabi-ambitsii-nepoznavane-na-izbiratelite-mizhav-rezultat/

Слаби амбиции, непознаване на избирателите – мижав резултат

Когато парламентарните избори са толкова начесто, колкото през последните години, човек претръпва. И все по-малко се стряска от резултатите – както от прогресиращото раздробяване на парламента (в който за малко не влязоха 9 партии и коалиции), така и от поредната популистка партия, почти неочаквано преминала 4-процентовата бариера (в случая – партията на Радостин Василев МЕЧ).

Перспективите пред 51-вото Народно събрание

А не е да няма от какво да се стресне човек. Четири от деветте групи в новото Народно събрание („Възраждане“, БСП, ИТН и МЕЧ) са евроскептични и повече или по-малко проруски. Разделеното на две ДПС никога не се е славило с твърда геополитическа ориентация. Председателят на ГЕРБ нееднократно е показвал, че ако на някого не се харесват евроатлантическите му ценности, той има и други.

Така че алтернативите за съставяне на кабинет са няколко.

Първата е пореден нестабилен опит за проевропейско управление с участието на ГЕРБ и ПП–ДБ в коалиция било с ДПС – Доган, било с някоя от другите партии. След провала на кабинета на Николай Денков подобен опит от страна на ПП–ДБ трудно ще се прости от избирателите на коалицията. Още повече че няма основание да се очаква, че Пеевски няма и този път демонстративно да дърпа конците, независимо дали ДПС – Ново начало е формално в управлението, или не.

Втората алтернатива е ГЕРБ да се опита да състави правителство без ПП–ДБ. Това няма да е лесна задача, защото постигането ѝ включва още няколко партии и коалиции. На този етап Борисов изключва съвместно управление с „Възраждане“, а партията на Радостин Василев иска да участва в съставянето на мнозинство против ГЕРБ и ДПС.

Ако „Възраждане“ се опита да състави проруска коалиция без ГЕРБ и ПП–ДБ, в нея ще трябва да участва и ДПС – Ново начало, за да има тя мнозинство. А Костадин Костадинов едва ли би искал да получава публична подкрепа точно от Пеевски.

Вариант е и някаква форма на правителство на малцинството – със скрита подкрепа от страна на партии и коалиции, които уж са против, но когато се наложи, депутатите им услужливо не се регистрират за гласуване. Или гласуват предложенията на правителството, а после си намират оправдание.

И разбира се, отново може да не се състави редовно правителство, да имаме пореден служебен кабинет и поредни предсрочни избори.

Всъщност защо трябва да е така?

Ако има някаква изненада в резултатите от вота, тя е в избирателната активност, която е по-висока, отколкото на предходното гласуване на 7 юни 2024 г., макар да се очакваше тенденцията за намаляване на броя на отишлите до урните да се запази. И все пак едва малко под 39% от имащите право на глас са го упражнили, а останалите 61% не са.

За ниската избирателна активност има ред предпоставки: твърде честите предсрочни избори, липсата на перспектива някоя от политическите сили да управлява самостоятелно, съответно – безпредметността на голяма част от предизборните обещания, които поради липсата на подкрепящо мнозинство не могат да бъдат изпълнени.

Всичко това обаче е вярно само при наличие на изходната предпоставка, че гласуват има-няма една трета от избирателите. Ако някоя партия или коалиция успее да намери път към негласуващите, политическата картина в България може да бъде коренно различна. Но по нищо не личи това да се прави. Напротив, гласуването се е ограничило до твърдите ядра. С изключение на нововъзникващите популистки партии и ДПС – Ново начало. 

Когато не се поставят амбициозни цели и не се прави възможното те да се постигнат, а хоризонтът не стига много по-далече от носа, няма как да се очакват впечатляващи резултати.

Дори първата политическа сила – ГЕРБ – вече не си и поставя за цел да спечели повече от 50% от гласовете, за да може да управлява самостоятелно. Декларираната от партията цел, при чието постигане Бойко Борисов би станал отново премиер, беше едва 80 депутати – с 41 по-малко от необходимите за мнозинство. Друг е въпросът дали ГЕРБ изобщо иска да управлява самостоятелно.

ПП–ДБ от своя страна дори не дадоха заявка, че е възможно да спечелят изборите. Те се радват, че са на второ място, а не на трето или четвърто. Влязоха в кампанията с ясното убеждение, че няма да бъдат първи. И поставиха невъзможното за реализация условие да се избере равноотдалечен от всички партии премиер. А за тях не може да се каже, че не искат да управляват самостоятелно. Напротив – основният им проблем е, че няма други групи в парламента, които да споделят целите им.

Изобщо сещате ли се за послание от последната кампания, независимо на коя политическа сила, което да разбуни духовете? Да вдъхнови едни, да ядоса други, хората да говорят за него, да мотивира избиратели да отидат до урните? Сещате ли се за предложение – извън общото повишаване на доходите, – което да е насочено към представителите на определена социална група така, че да стигне до много от тях? Колко предложения за подобряване конкретно на вашия живот можете да изброите?

А междувременно в САЩ…

Докато в България се провеждат парламентарни избори, кандидатпрезидентската кампания в САЩ е в разгара си. Тамошният политическият пейзаж е различен от българския по ред причини.

Първо, политиката в Щатите е на практика двупартийна – не че не съществуват кандидати, които не са нито от републиканците, нито от демократите, но те нямат особени шансове. Второ, избирателната система е структурирана по коренно различен начин – от значение е не общият брой на гласувалите, а кой печели в отделните щати. Трето, самото общество и култура се отличават не само от българските, но и от европейските като цяло.

Като имаме предвид тези уточнения, политическите сили в България имат какво да научат от американските по отношение на провеждането на кампания. Особено предвид факта, че избирателната активност в САЩ през годините по-скоро нараства и е стигнала нива от около две трети от имащите право на глас.

На първо място, американските граждани не са тера инкогнита нито за политиците, нито за агенциите за изследване на общественото мнение (които никой в САЩ не нарича социологически). Избирателите са разделени в множество групи, всяка от които има определени типологични характеристики, ценности и интереси.

Няма просто жени например. Жените могат да бъдат млади или не, от предградията, от конкретен щат и населено място, с определена расова принадлежност, етнос, религия и пр. Или избирателите с латиноамерикански произход, които също не могат да се слагат под един знаменател. Имигрантите от Венесуела и Куба преобладаващо гласуват за републиканците, защото републиканците наричат демократите комунисти, а те точно от комунистите са избягали. На много чернокожи мъже пък им е трудно да гласуват за жена, била тя и чернокожа. И т.н., и т.н.

И демократите, и републиканците имат послания към определени групи и правят всичко по силите си тези послания да стигнат до своите адресати. В т.нар. колебаещи се щати буквално няколко гласа могат да бъдат решаващи за това кой ще стане следващият президент на САЩ. Затова кампаниите се опитват да стигнат до всеки.

Един от най-оспорваните щати е Пенсилвания. И макар евреите да са едва 2% от населението на страната, и републиканци, и демократи правят всичко по силите си да говорят с всеки евреин в щата и да спечелят гласа му.

И Доналд Тръмп, и Камала Харис се стараят да привличат максимален брой избиратели, макар и по различен начин. Тръмп избягва да излиза извън зоната си на комфорт и предпочита да се заобикаля с аудитория, която подкрепя – и усилва – посланията му. Харис от своя страна се опитва в максимална степен да компенсира факта, че избирателите не я познават добре, и не пропуска възможност да се представи – дори на територията на „вражеската“ телевизия Fox. Защото знае, че е в по-слаба позиция, и за да победи, трябва „да се вдигне за косата“.

Кандидат-президентката на демократите дори участва в телевизионна дискусия с избиратели, които още не са решили дали и за кого да гласуват. По време на дискусията ѝ се наложи да отговаря както на гласоподавателка, защитаваща палестинците, така и на еврейка, притеснена от антисемитизма в американските университети. И направи всичко по силите си отговорите ѝ да не отблъснат нито една от двете, колкото и трудно да изглежда това.

Какви са българските избиратели?

В България за избирателите обикновено се говори в много мъгляви категории, често натоварени с негативни конотации – „жълтопаветници“, „гербаджии“, „националисти“, „турци“, „роми“ и пр. Още по-мъгляво е възприемането на имащите право на глас като някакво единно цяло. То се изразява в твърдения, че българите са по дефиниция консервативни или пък че „не са готови“ за някоя по-либерална промяна.

Социологическите агенции правят по-детайлни разбивки на гласоподавателите – например по пол, възраст, населено място, образование, както и за кого са гласували на предишни избори. Но от техните анализи не следва особена промяна в поведението на политическите сили. Сякаш е природен факт, че едни гласуват така, други иначе, а повече от половината изобщо не гласуват.

Да вземем негласуващите. На последните избори най-малко са гласувалите в областите Сливен (28,78%), Кърджали (29,93%), Шумен (30,85%), Пловдив-окръг (31,71%). В София пък има значителна разлика между 23-ти МИР, където до урните са отишли най-голям дял избиратели спрямо цялата страна – 46,61%, и 24-ти МИР, в който бюлетина са пуснали едва 31,65%. Какви хора живеят на тези места? Какви са проблемите им? Защо около 70% от тях се чувстват непредставени?

Да се познават различни групи избиратели с техните ценности, интереси, страхове и надежди – без да се стига до типичната за САЩ детайлност – не е невъзможна задача. Не е нито твърде скъпо за джоба на парламентарно представените партии и коалиции, нито прекалено сложно да се направят количествени и качествени изследвания, чиито резултати да хвърлят светлина върху потенциалните гласоподаватели. И да се предложат решения за адекватни послания към тях.

Отказът да се опознаят избирателите стига до степен, в която те практически са все по-малко парламентарно представени. И макар по данни от 2023 г. над 60% от българските граждани да смятат, че демокрацията е най-добрата форма на управление, а към март 2024 г. около 60% категорично да подкрепят европейската ориентация на страната ни, рисковете за геополитически завой към Русия се увеличават.

Най-лесно е цялата отговорност да се прехвърля върху избирателите и те да бъдат обвинявани, че не гласуват, с което не само не си защитават интересите, а и увеличават тежестта на контролирания вот. По-трудно е да се намери път към тях, да се убеждават, да се предложат решения на проблемите им, да се дадат отговори на страховете им.

Трудно е, но е необходимо. Защото продължаването на поредицата от предсрочни избори не е най-лошото, което може да се случи. По-лошо ще е да се докараме до диктатура, в която изборите вече няма да имат съвсем никакво значение. 

How we reduced peak memory and CPU usage of the product configuration management SDK

Post Syndicated from Grab Tech original https://engineering.grab.com/reduced-memory-cpu-usage-grabx-sdk

Introduction

GrabX is Grab’s central platform for product configuration management. It has the capacity to control any component within Grab’s backend systems through configurations that are hosted directly on GrabX.

GrabX clients read these configurations through an SDK, which reads the configurations in a way that’s asynchronous and eventually consistent. As a result, it takes about a minute for any updates to the configurations to reach the client SDKs.

In this article, we discuss our analysis and the steps we took to reduce the peak memory and CPU usage of the SDK.

Observations on potential SDK improvements

Our GrabX clients noticed that the GrabX SDK tended to require high memory and CPU usage. From this, we saw opportunities for further improvements that could:

  • Optimise the tail latencies of client services.
  • Enable our clients to use their resources more effectively.
  • Reduce operation costs and improve the efficiency of using the GrabX SDK.
  • Accelerate the adoption of GrabX by Grab’s internal services.

SDK design

At a high-level, creating, updating, and serving configuration values via the GrabX SDK involved the following process:

Figure 1. Previous GrabX SDK design.
  1. The process begins when GrabX clients either create or update configurations. This is done through the GrabX web portal or by making an API call.
  2. Once the configurations are created or updated, the GrabX backend module takes over. It stores the new configuration into an SQL database table.
  3. The GrabX backend ensures that the latest configuration data is available to client SDKs.

    a. The GrabX backend checks every minute for any newly created or updated configurations.

    b. If there are new or updated configurations, GrabX backend creates a new JSON file. This file contains all existing and newly created configurations. It’s important to note that all configurations across all services are stored in a single JSON file.

    c. The backend module uploads this newly created JSON file to an AWS S3 bucket.

    d. The backend module assigns a version number to the new JSON file and updates a text file in the AWS S3 bucket. This text file stores the latest JSON file version number. The client SDK refers to this version file to check if a newer version of the configuration data is available.

  4. The client SDK performs a check on the version file every minute to determine if a newer version is available. This mechanism is crucial to maintain data consistency across all instances of a service. If any instance fell out of sync, it would be brought back in sync within a minute.
  5. If a new version of the configuration JSON file is available, the client SDK downloads this new file. Following the download, it loads the configuration data into memory. Storing the configuration data in memory reduces the read latency for the configurations.

Areas of improvement for existing SDK design

In this section we outline the areas of improvement we identified within the SDK design.

Service-based data partitioning

We saw an opportunity for service-based data partitioning. The configuration data for all services was consolidated into a single JSON file. Upon studying the data read patterns of client services, we observed that most services primarily needed to access configuration data specific to their own service. However, the present design required storing configuration data for all other services. This resulted in unnecessary memory consumption.

Retaining only new version of configuration in the same file

By using a single JSON file for storing old and new configuration data, we saw a significant increase in the size of the JSON file.

The SDK only needs the full data when it starts; the more common case is that it needs to stay updated with the latest configuration. Even in that scenario, the SDK needed to fetch a complete new JSON file every minute no matter the size of the updates. Consequently, the process of downloading, decoding, and loading high volumes of data at a high frequency (every minute) caused the client SDK to spike in memory and CPU usage.

More efficient JSON decoding

An additional factor which contributed to memory and CPU usage during the decoding phase was the inefficiency of the default JSON decode library to decode this large (>100MB) JSON file. Decoding this JSON file was heavy on available CPU resources, which tended to starve the service of its ability to handle incoming requests. This manifested as increasing the P99 latency of the service.

Figure 2. Graph illustrating the increased P99 latency due to CPU throttling for a service.

Implemented solution

We proposed modifications to the existing SDK design, which we discuss in this section.

Partition data by service

The proposed solution involved partitioning the data based on services. We chose this approach because a single configuration typically belonged to a single service, and most services primarily needed to read configurations that pertained to their own service.

Upon analysing the distribution of service-configuration, we discovered that 98% of client services required less than 1% of the total configuration data. Despite this, they were required to maintain and reload 100% of the configuration data. Furthermore, the service with the largest number of configurations only required 20% of the total configuration data.

Therefore, we proposed a shift towards service-based partitioning of configuration data. This allowed individual client services to access only the data they needed to read.

Figure 3. Graph showing the number of services with varying amounts of configurations.

Create separate JSON files for each configuration

Our proposal also included creating a separate JSON file for each configuration in a service. Previously, all data was stored in a single JSON file housed in an AWS S3 bucket, which supported a maximum of 3,500 write/update requests and 5,500 read requests per second.

By storing each configuration in a separate JSON file, we were able to create a different S3 prefix for each configuration file. These S3 prefixes helped us to maximise S3 throughput by enhancing the read/write performance for each configuration. AWS S3 can handle at least 3,500 PUT/COPY/POST/DELETE requests or 5,500 GET/HEAD requests per second for each partitioned Amazon S3 prefix.

Therefore, with each configuration’s data stored in a separate S3 file with a different prefix, the GrabX platform could achieve a throughput of 5,500 read requests and 3,500 write/update requests per second per configuration. This was beneficial for boosting read/write capacity when needed.

Implement a service-level changelog

We proposed to create a changelog file at the service level. In other words, a changelog file was created for each service. This file was used to keep track of the latest update version, as well as previous service configuration update versions. This file also recorded the configurations which were created or updated in each version. This enables the SDK to accurately identify the configurations that were created or updated in each update version. This was useful to update the specific configurations belonging to a service on the client side.

Implement service-based SDK

We proposed that SDK client services should be allowed to subscribe to a list of services for which they need to read configuration data. The SDK was initialised with data of the subscribed services and received updates only for configurations corresponding to the subscribed services.

Figure 4. This flowchart shows our proposed service-based SDK implementation.

The SDK only sought updates for the subscribed services. The client SDK needed to read the changelog file for each of the subscribed services, comparing the latest changelog version against the SDK version number. Whenever a newer changelog version was available, the SDK updated the variables with the latest version.

This approach significantly reduced the volume of data that the SDK needed to download, decode, and load into memory during both initialisation and each subsequent update.

Conclusion

In summary, we identified ways to optimise CPU and memory usage in the GrabX SDK. Our analysis revealed that frequent high resource consumption hindered the wider adoption of GrabX. We proposed a series of modifications, including partitioning data by service and creating separate JSON files for each configuration.

After benchmarking the proposed solution with a variety of configuration data sizes, we found that the solution has the potential to reduce memory utilisation by up to 70% and decrease the maximum CPU utilisation by more than 50%. These improvements significantly enhance the performance and scalability of the GrabX SDK.

Figure 5. Bar charts showcasing memory(MB) & CPU(%) utilisation for Service A before and after using the discussed solution.

Moving forward, we plan to continue optimising the GrabX SDK by exploring additional improvements, such as reducing its initialisation time. These efforts aim to make GrabX an even more robust and reliable solution for product configuration management within Grab’s ecosystem.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 700 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Cloudflare’s perspective of the October 30 OVHcloud outage

Post Syndicated from Bryton Herdes original https://blog.cloudflare.com/cloudflare-perspective-of-the-october-30-2024-ovhcloud-outage

On October 30, 2024, cloud hosting provider OVHcloud (AS16276) suffered a brief but significant outage. According to their incident report, the problem started at 13:23 UTC, and was described simply as “An incident is in progress on our backbone infrastructure.” OVHcloud noted that the incident ended 17 minutes later, at 13:40 UTC. As a major global cloud hosting provider, some customers use OVHcloud as an origin for sites delivered by Cloudflare — if a given content asset is not in our cache for a customer’s site, we retrieve the asset from OVHcloud.

We observed traffic starting to drop at 13:21 UTC, just ahead of the reported start time. By 13:28 UTC, it was approximately 95% lower than pre-incident levels. Recovery appeared to start at 13:31 UTC, and by 13:40 UTC, the reported end time of the incident, it had reached approximately 50% of pre-incident levels.


Traffic from OVHcloud (AS16276) to Cloudflare

Cloudflare generally exchanges most of our traffic with OVHcloud over peering links. However, as shown below, peered traffic volume during the incident fell significantly. It appears that some small amount of traffic briefly began to flow over transit links from Cloudflare to OVHcloud due to sudden changes in which Cloudflare data centers we were receiving OVHcloud requests. (Peering is a direct connection between two network providers for the purpose of exchanging traffic. Transit is when one network pays an intermediary network to carry traffic to the destination network.) 


Because we peer directly, we exchange most traffic over our private peering sessions with OVHcloud. Instead, we found OVHcloud routing to Cloudflare dropped entirely for a few minutes, then switched to just a single Internet Exchange port in Amsterdam, and finally normalized globally minutes later.

As the graphs below illustrate, we normally see the largest amount of traffic from OVHcloud in our Frankfurt and Paris data centers, as OVHcloud has large data center presences in these regions. However, in that shift to transit, and the shift to an Amsterdam Internet Exchange peering point, we saw a spike in traffic routed to our Amsterdam data center. We suspect the routing shifts are the earliest signs of either internal BGP reconvergence, or general network recovery within AS12676, starting with their presence nearest our Amsterdam peering point.


The postmortem published by OVHcloud noted that the incident was caused by “an issue in a network configuration mistakenly pushed by one of our peering partner[s]” and that “We immediately reconfigured our network routes to restore traffic.” One possible explanation for the backbone incident may be a BGP route leak by the mentioned peering partner, where OVHcloud could have accepted a full Internet table from the peer and therefore overwhelmed their network or the peering partner’s network with traffic, or caused unexpected internal BGP route updates within AS12676.

Upon investigating what route leak may have caused this incident impacting OVHcloud, we found evidence of a maximum prefix-limit threshold being breached on our peering with Worldstream (AS49981) in Amsterdam. 

Oct 30 13:16:53  edge02.ams01 rpd[9669]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 141.101.65.53 (External AS 49981) changed state from Established to Idle (event PrefixLimitExceeded) (instance master)

As the number of received prefixes exceeded the limits configured for our peering session with Worldstream, the BGP session automatically entered an idle state. This prevented the route leak from impacting Cloudflare’s network. In analyzing BGP Monitoring Protocol (BMP) data from AS49981 prior to the automatic session shutdown, we were able to confirm Worldstream was sending advertisements with AS paths that contained their upstream Tier 1 transit provider.

During this time, we also detected over 500,000 BGP announcements from AS49981, as Worldstream was announcing routes to many of their peers, visible on Cloudflare Radar.


Worldsteam later posted a notice on their status page, indicating that their network experienced a route leak, causing routes to be unintentionally advertised to all peers:

“Due to a configuration error on one of the core routers, all routes were briefly announced to all our peers. As a result, we pulled in more traffic than expected, leading to congestion on some paths. To address this, we temporarily shut down these BGP sessions to locate the issue and stabilize the network. We are sorry for the inconvenience.”

We believe Worldstream also leaked routes on an OVHcloud peering session in Amsterdam, which caused today’s impact.

Conclusion

Cloudflare has written about impactful route leaks before, and there are multiple methods available to prevent BGP route leaks from impacting your network. One is setting max prefix-limits for a peer, so the BGP session is automatically torn down when a peer sends more prefixes than they are expected to. Other forward-looking measures are Autonomous System Provider Authorization (ASPA) for BGP, where Resource Public Key Infrastructure (RPKI) helps protect a network from accepting BGP routes with an invalid AS path, or RFC9234, which prevents leaks by tying strict customer and provider relationships to BGP updates. For improved Internet resilience, we recommend that network operators follow recommendations defined within MANRS for Network Operators.

МВР и НСО блокираха цял квартал заради Пеевски на приема на турското посолство

Post Syndicated from Николай Марченко original https://bivol.bg/dps-peevski-fakti-bivol.html

вторник 29 октомври 2024


Лидерът на “ДПС Ново начало” Делян Славчев Пеевски е станал причината за мобилизиране на много на брой екипи, съответно и големи разходи за сметка на бюджета, от страна на Министерството…

Using Semantic Versioning to Simplify Release Management

Post Syndicated from Brandon Kindred original https://aws.amazon.com/blogs/devops/using-semantic-versioning-to-simplify-release-management/

Any organization that manages software libraries and applications needs a standardized way to catalog, reference, import, fix bugs and update the versions of those libraries and applications. Semantic Versioning enables developers, testers, and project managers to have a more standardized process for committing code and managing different versions. It’s benefits also extend beyond development teams to end users by using change logs and transparent feature documentation. This alleviates the operational burden of communicating changes for each release cycle.

In this post, we dive deep into how Semantic Versioning works and how to use the NodeJS package semantic-release/npm in a GitHub Actions continuous integration and continuous delivery (CI/CD) pipeline to automatically version an Amazon Web Services (AWS) Cloud Development Kit (AWS CDK) project. Once we’re done, you’ll be ready to deliver AWS CDK projects that are automatically versioned so your team can focus more on code instead of versioning for each release.

How does Semantic Versioning work?

With Semantic Versioning, version number changes convey a specific meaning about the underlying code change. This helps others to understand what has changed from one version to the next.

Semantic Versioning is defined in a Major.Minor.Patch format, which is a guideline to track the scope and potential impact of changes for feature development, major updates and bug fixes. The Major version number change signifies that this release will have breaking changes and any upgrade to this version should be validated for backward compatibility. The Minor version number change signifies a feature release in a backward-compatible manner. Finally, the Patch version number signifies a small insignificant change or a bug fix and users can upgrade to this version without introducing breaking changes or new features.

Each version number is always incremented by one digit and when a higher digit increments, the lower digits reset to zero. For example, version 1.5.2 means that the software is in its first major version, and has released 5 features with 2 patches implemented. When a new feature is added, it will be released as version 1.6.0. If there are any bugs reported and a fix is release for that bug, the next release version will be 1.6.1. For a version 2.0.0 release there will be breaking changes, and may not be backward compatible with previous 1.x.x versions.

What is semantic-release?

Now that we have covered how Semantic Versioning works, and why it’s useful in software projects, let’s review semantic-release. It is a Node.js tool that simplifies Semantic Versioning management. Semantic-release can do more than just apply Semantic Versioning to your projects. With a variety of plugins available, you can track what sort of changes are being made with commit analysis or by generating a changelog for each release. This automates what would otherwise have been a manual and time-consuming process to create and review roadmaps and documentation tracking the new features and fixes that comprise a release. Semantic-release is also straightforward to integrate with CI tools such as GitHub Actions, GitLab CI, Travis CI, and CircleCI 2.0 workflows.

Unfortunately, if your commit messages don’t match the format semantic-release requires, the automatic versioning won’t work correctly. To verify that commit messages are in the right format, you can integrate tools such as commitizen or commitlint, which can validate that commit messages are in a valid format.

Using semantic-release

Commit message formats

By default, semantic-release uses Angular Commit Message Conventions, however the commit message format can be altered by using config options. Let’s take a quick look at the three different types of commit messages, fixes, features, and breaking changes.

Patches and fixes

For patch or fix commits, you need to start your commit message with “fix(context):”. This tells semantic-release that this is a minor patch—meaning that if your version is 0.0.1, after this commit is merged the version will be incremented to 0.0.2. Following is an example commit message.

fix(printer): inform user when printer is out of paper

Features

Feature commit messages start with “feat(context):” which indicates to semantic-release that this is a feature release, also referred to as a minor release. If your version is 0.1.0, after a feature commit is merged, the version will be incremented to 0.2.0. Following is an example of a feature commit.

feat(printer): add option for printing in color

Breaking changes

Breaking change commits are intended for marking a major breaking change—meaning that the commit introduces changes that are not compatible with existing or previous versions. The way that commit messages are marked as breaking changes is slightly different than the previous commit types. For a breaking change, the commit footer needs to include “BREAKING CHANGE:”. For example, a breaking change commit, might look like the following.

perf(printer): remove color printing option

BREAKING CHANGE: The color printing option has been removed. The default black and white printing option is always used for performance reasons.

Release change log

To generate a release change log, we only need to add the semantic-release/changelog plugin to the package.json file. We’ll demonstrate how to do this in the next section.

Adding Semantic Release to an AWS CDK project

To demonstrate semantic-release in action we will build an example with AWS CDK and the NPM semantic-release library. All code provided in the remainder of this blog can be found in the semantic-versioning demo repository (aws-cdk-semantic-release) on GitHub.

To get started we’ll need to create a Node.js CDK application, add semantic-release and its plugins, commit changes, then build the project. If you don’t have AWS CDK installed yet, check out the Getting Stated with CDK guide.

To begin, navigate to the folder where the project will be saved. Next, we initialize a new AWS CDK project and add semantic-release. Then, we configure the branches that Semantic Versioning should be applied to and add a few NPM plugins to extend the functionality of semantic-release. While there are many plugins available, only a few are needed to get started. Below, we’ll walk through which plugins to install and how to install them.

Initiate a new AWS CDK Typescript project and add semantic-release

In a command terminal, navigate to the folder where you want to store your AWS CDK project and run the following three commands. The first line will create the project and the next two will install the semantic-release NPM and Git plugins in the AWS CDK project.

cd <<PROJECT_FOLDER>>
cdk init app --language typescript
npm install @semantic-release/npm -D
npm install @semantic-release/git -D

Define which branches to apply versioning to

We need to tell NPM which branches can trigger a new release. In the example we’ll use the main branch and the staging branches for release. We will add a new section to the package.json file along with the branches we want to use for releases.

"release": {
  "branches": ["main"]
}

Add supporting plugins to package.json

In order for semantic-release to pick up commits, generate release notes, and be able to perform a release with GitHub we need to add the commit-analyzer, release-notes-generator, changelog, NPM, Git, and GitHub plugins. Add the following plugins section to the package.json file after the “release” section we added in the previous step.

"plugins": [
    "@semantic-release/commit-analyzer",
    "@semantic-release/release-notes-generator",
    "@semantic-release/changelog",
    "@semantic-release/npm",
    [
        "@semantic-release/git",
        {
            "assets": [
                "package.json",
                "CHANGELOG.md"
            ],
            "message": "chore(release): ${nextRelease.version}[skip ci]\n\n${nextRelease.notes}"
        }
    ],
    "@semantic-release/github"
],

Putting it all together, our package.json file should look similar to the following.

{
    "name": "aws-cdk-semantic-release",
    "version": "0.1.0",
    "bin": {
        "aws-cdk-semantic-release": "bin/aws-cdk-semantic-release.js"
    },
    "scripts": {
        "build": "tsc",
        "watch": "tsc -w",
        "test": "jest",
        "cdk": "cdk"
    },
    "plugins": [
        "@semantic-release/commit-analyzer",
        "@semantic-release/release-notes-generator",
        "@semantic-release/changelog",
        "@semantic-release/npm",
        [
            "@semantic-release/git",
            {
                "assets": [
                    "package.json",
                    "CHANGELOG.md"
                ],
                "message": "chore(release): ${nextRelease.version} [skip ci]\n\n${nextRelease.notes}"
            }
        ],
        "@semantic-release/github"
    ],
    "release": {
        "branches": [
            "main"
        ]
    },
    "devDependencies": {
        "@semantic-release/changelog": "^6.0.3",
        "@semantic-release/git": "^10.0.1",
        "@semantic-release/npm": "^11.0.2",
        "@types/jest": "^29.5.12",
        "@types/node": "20.11.16",
        "aws-cdk": "2.127.0",
        "jest": "^29.7.0",
        "ts-jest": "^29.1.2",
        "ts-node": "^10.9.2",
        "typescript": "~5.3.3"
    },
    "dependencies": {
        "aws-cdk-lib": "2.127.0",
        "constructs": "^10.0.0",
        "source-map-support": "^0.5.21"
    }
}

With the necessary package.json changes in place, we can commit our code and build the project again.

git commit -m "feat(semver): added semantic release plugins and set main branch for versioning"
npm run build

Configure GitHub Actions

In the projects root folder, we are going to create folders for .github/workflows so that we can configure GitHub Actions release steps.

mkdir .github
cd .github
mkdir workflows
cd workflows

Create a file called release.yaml in the workflows folder that has the following.

name: Release
on:
  push:
    branches:
      - main
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
         uses: actions/checkout@v3
      - name: Setup Node.js
         uses: actions/setup-node@v3
         with:
           node-version: '20.8.1'
      - name: Install Dependencies
         run: npm install
      - name: Semantic Release
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
           NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
         run: npx semantic-release

The GITHUB_TOKEN is automatically generated and managed by GitHub Actions, so we don’t need to worry about retrieving or defining this value in our GitHub Actions steps. However, NPM_TOKEN is not automatically provided to us, so we need to define the NPM token. To do this we need to login to npmjs.com and create an account. Once we’re logged in, we need to create an access token. Once the NPM access token is created, then we need to add the token to GitHub secrets.

Verify Semantic-Release is setup correctly

Now that we have everything configured, new merges and commits to the main branch will kick off a CI/CD pipeline through GitHub Actions that builds the project and increments the version. To verify that everything is working correctly, we’ll alter the CDK code to include an AWS Lambda function. Afterward, we’ll commit changes to the main branch to verify that semantic-release is working correctly.

Make a change, commit, push and verify

In the AWS CDK project lib/aws-cdk-semantic-release-stack.ts file we’ll update the visibilityTimeout duration to 600, then commit the changes. We want to be sure to use the Semantic Versioning format for the commit message so that our changes are versioned correctly. For this we use the following commit message.

“fix(sqs-visibility): increased visibility timeout to 600”

After committing the changes, we push the changes to the remote GitHub repository to initiate the GitHub Actions build process. Once the build is complete, we see that the build number has been incremented from the previous version. You can review the releases for the sample code to see how it has been versioned. Following are examples of what it should look like before and after executing this. Please note that the version numbers depicted may not match yours.

GitHub releases view with information about release v1.1.0 including features and assets that were updated as part of v1.1.0

Screenshot of GitHub releases page showing release 1.0.0 and the associated features changelog

Figure 1 – Before the new commits

GitHub releases view with information about release v1.1.1 followed by the previous release v1.1.0.

Screenshot of GitHub releases page showing a previous release and the latest release along with their respective versions of 1.1.0 and 1.1.1

Figure 2 – After GitHub Actions generates a new build

Conclusion

Semantic Versioning offers a structured and reliable way to manage software changes. When paired with the significant benefits of using Node.js packages like semantic-release, version management can become effortless. All of this enables automated version management Node.js projects such as AWS CDK pipelines for automated code deployment. It enhances clarity, stability, and collaboration, while also supporting automation and fostering user confidence.

By adopting Semantic Versioning, development teams can achieve more predictable and efficient workflows, ultimately leading to higher quality software. It also builds confidence among users without going into much details with respect to software upgrades and backward compatibility. As a best practice, start incorporating Semantic Versioning for existing and future applications.

Contact an AWS Representative to know how we can help accelerate your business.

Reference

CDK setup and deployment of application with CDK V2 Construct:

Headshot of Brandon Kindred (CAA), a white man with short hair, a beard and wearing a black shirt

Brandon Kindred

Brandon is a Cloud Application Architect at AWS ProServe where he helps companies achieve their cloud goals through tailored strategies and cost-effective solutions. With a passion for teaching, he enjoys empowering others to harness the full potential of cloud technologies while optimizing their cloud spend. Brandon also enjoys providing guidance on development best practices, ensure teams build resilient and scalable applications in the cloud.

Sunita Dubey (Sr CAA)

Sunita Dubey

Sunita Dubey is a senior Cloud Application Architect based out of New Jersey. She focuses on designing, architecting and delivering end to end solutions for customers in financial domain. She has solved multiple customer challenges to move their workload to AWS and saved cost in the migration process. She insists on highest standard in all the deliverables.

Leverage Amazon Q Developer and AWS Chatbot within Slack

Post Syndicated from Jonathan Wong original https://aws.amazon.com/blogs/devops/leverage-amazon-q-developer-and-aws-chatbot-within-slack/

The release of Amazon Q Developer and its ability to be integrated into AWS Chatbot allows users who use Microsoft Teams or Slack to stay within their communication platform and interact with a conversational generative artificial intelligence (AI) AWS expert.

Amazon Q Developer is a conversational generative AI chatbot that provides AWS assistance in the form of best practices, documentation, and answers your AWS related questions. AWS Chatbot is a service that lets you interact with AWS services directly from your communications platform such as Microsoft Teams, Amazon Chime, or Slack. Users can ask Q about best practices, building solutions, troubleshooting issues, and more, creating a productive and collaborative environment. Users can also interface with Chatbot to run AWS CLI commands or open support cases all within Slack.

In this post, we show you how you can leverage Q Developer and Chatbot in your Slack workspace by highlighting a number of use cases along with solution screenshots that can enhance a company’s AWS productivity. We will also showcase an architecture diagram, detailing the flow of actions and the use of different services. To learn more about how to implement Q Developer and Chatbot in Slack, refer to this documentation.

Disclaimer: The information and solutions provided by Q Developer are based on patterns from AWS-related data and best practices. While we strive to offer accurate and helpful guidance, please note that the suggestions may not always be fully accurate or applicable to every situation. It is essential to conduct additional research and verify the information with official AWS documentation or consult with AWS support before implementing any recommendations. Always use your judgment and consider the specific requirements of your environment when making decisions based on AI-generated advice.

Leveraging Q Developer and Chatbot

Q Developer and Chatbot serve a wide range of personas across an organization, catering to both AWS-savvy users and those with limited cloud expertise. Software engineers, for instance, can leverage Q Developer to quickly locate documentation, troubleshoot issues, or find best practices, streamlining their workflow. Security engineers can interact with Chatbot to monitor incidents and receive real-time alerts. Even non-technical users, like project managers or operations staff, can benefit from these tools without needing deep cloud knowledge. Together, these tools enhance productivity and collaboration across the company, regardless of technical expertise.

Use Cases

The use cases section is split into two categories, one for Q Developer, and the other for Chatbot. Both services provide unique abilities to interact with AWS to get the response you are looking for and can be accessed by sending a message to @aws on Slack. Q Developer allows users to ask questions in natural language and responds back with a response and a list of sources. Chatbot allows users to open support cases and to run a number of AWS CLI commands for services such as S3, Lambda, and CloudWatch.

Q Developer Use Cases

Q Developer is a versatile tool designed to assist teams for a number of AWS related use cases. In this post, we will focus on training and onboarding, troubleshooting issues, and implementing AWS best practices.

Training and Onboarding

Benefit: Q Developer can act as a virtual learning assistant, providing personalized training and learning paths for users based on their role, skill level, and current projects. It helps team members stay updated with the latest AWS features and best practices, enhances their skills, and ensures that they can leverage AWS services effectively and efficiently. By offering targeted resources, Q Developer supports continuous learning and helps users prepare for AWS certifications or new roles.

Use Case: AWS Beginner Recommendations. When a new employee joins the team, Q Developer can help them get up to speed by suggesting beginner-level tutorials and essential AWS concepts based on the team’s current tech stack and projects.

The conversation covers recommendations for resources to learn more about AWS, including AWS Documentation, AWS Training and Certification, AWS Blogs and Community, and AWS re:Invent and other events.

Figure 1 – AWS Beginner Recommendations

Use Case: Certification Guidance. An employee aims to get another AWS certification. They can ask Q Developer to provide a structured learning path with recommended courses, study guides, whitepapers, and practice exams to prepare effectively.

The conversation discusses a structured learning path to prepare for the AWS Machine Learning Specialty Certification, covering topics like the AWS Certified Cloud Practitioner certification, the AWS Certified Machine Learning - Specialty certification, and recommended study materials and practices.

Figure 2 – Certification Guidance

Troubleshooting Issues

Benefit: Q Developer provides targeted troubleshooting guidance, helping users to diagnose and resolve issues efficiently. By leveraging AWS service documentation, best practices, and community discussions, Q Developer reduces the time spent on searching for solutions and allows users to focus on resolving issues faster. This improves operational efficiency and minimizes downtime or disruptions.

Use Case: Optimization Recommendations. A developer is facing an issue with running their application on EC2 during peak hours and is looking for recommendations to diagnose the issue.

The conversation provides recommendations to address performance issues with an EC2 instance, including EBS volume configuration, network optimization, system optimization, and cost-effective solutions.

Figure 3 – Optimizations Recommendations

Use Case: Service Troubleshooting. An engineer is working on configuring API Gateway with their application but receives a 504 Gateway Timeout error. Q Developer can look up HTTP response codes for specific services and recommend a plan to tackle the issue.

The conversation discusses troubleshooting a 504 Gateway Timeout error with an API Gateway, providing steps to check CloudWatch logs, review the Lambda function, optimize the Lambda function's performance, and implement client-side retry logic.

Figure 4 – Service Troubleshooting

Best Practices

Benefit: Q Developer provides access to AWS best practices, ensuring that users can build, manage, and maintain their cloud infrastructure effectively. By adhering to best practices, users can optimize their applications for performance, security, scalability, and cost-efficiency. Q Developer helps users stay informed about evolving best practices for using AWS services, ensuring their deployments are up-to-date and compliant with industry standards.

Use Case: Designing Resilient Architectures. A solutions architect is designing a new application on AWS and wants to ensure it’s highly available and fault-tolerant. By asking Q Developer for best practices, they can receive guidance on a number topics including region selection, software, architecture, and deployment strategies to maximize uptime and reliability.

The conversation covers best practices for designing a highly available and fault-tolerant application on AWS, including region selection, alignment to demand, software and architecture, data management, hardware and services, process and culture, deployment strategies, and monitoring and logging.

Figure 5 – Designing Resilient Architectures

Use Case: Deploying Applications for Operational Excellence. An engineer is looking for best practices to deploy an application onto AWS Elastic Beanstalk. Q Developer can assist with providing specific tips for the job that conforms with AWS’ operational excellence pillar found in the AWS Well-Architected Framework.

Recommends several best practices such as choosing the right deployment policy, using rolling updates, implementing auto scaling, and optimizing for content delivery.

Figure 6 – Operational Excellence

Chatbot Use Cases

Chatbot can be used to run AWS CLI commands, open support cases, and more within Slack. To learn more about how to get started with these commands, please visit Chatbot’s documentation and refer to this AWS Blog for additional information.

Using Chatbot and Q Developer Together

We can use Chatbot and Q Developer together to provide clarity in situations where an organization receives alerts on their Slack channel. For example, you can configure Chatbot to receive notifications using Amazon Simple Notification Service based off of rules set up within Amazon EventBridge and it will be delivered directly into your Slack channel. Given that an organization can have many types of notifications enabled for their AWS services, there may be times where the message that is being sent to Slack can be confusing and not well understood. You can take the message provided to you from the notification and provide that as context to Q Developer to help you dive deep into the situation and help figure out next steps. To learn more about setting up notifications and having them be sent to your Slack, please refer to this documentation.

Notification from Chatbot on Slack indicating to the user that there is an issue.

Figure 7 – Chatbot Error Notification

Q to address the issue, such as verifying the instance's health, ensuring the Auto Scaling group's configuration is correct, and reviewing the instance's configuration.

Figure 8 – Q Developer Deep Dive into Chatbot Notification

Architecture Diagram

Diagram illustrating the flow of information between a user, Slack Workspace, AWS Chatbot, and an Amazon Q Developer.

Figure 9 – Solution Overview 

  1. A user logs into Slack and can either ask a question, run AWS command(s), or open a support case.
  2. Slack sends the request to Chatbot which then validates that it can be processed from the channel role and associated guardrail policies, both of which are setup through AWS Identity and Access Management. If the request follows the Chatbot use case(s), we can disregard step 3 and move to step 4.
  3. The request is forwarded to Q Developer where it is processed and formulates a response which is then sent back to Chatbot. Chatbot will then relay the response back to Slack which is displayed to the user.
  4. Logs are captured from the original message and the response and can be located within Amazon CloudWatch

 

Next Steps

Refer to these AWS documentation links that cover how to get started with setting up Q Developer and Chatbot in Slack. It is important to follow the order of the listed documents and to adhere to each of the steps listed to be able to get started with using the solution.

Integration Steps

  • Setting up AWS Chatbot
    1. AWS Chatbot Getting Started documentation outlines the steps to set up AWS Chatbot for interacting with AWS infrastructure. It covers steps such as setting up an AWS account, configuring IAM permissions, and setting up Amazon SNS topics for notifications.
  • Configuring Slack with Chatbot
    1. This documentation shows how to integrate AWS Chatbot with Slack, enabling AWS notifications and interactions in Slack channels. It covers Slack client and channel configuration and testing notifications from AWS services to Slack. Once completed with setting up Slack with Chatbot, refer back to the main Chatbot documentation where you can additional links on monitoring AWS services, customizing Chatbot and performing CLI commands on the lefthand side.
  • Setting up Q Developer with Chatbot
    1. After following the previous documentation steps,you can now integrate Amazon Q Developer with AWS Chatbot in Slack, allowing users to ask questions about AWS services directly in chat. It includes IAM role setup with managed policies and necessary configuration steps. Once completed, this will allow you to use Q Developer through Chatbot’s interface on Slack.

Conclusion

This post highlights how using Q Developer and Chatbot within Slack can boost productivity for a number of use cases. Individuals, teams, and organizations can use these two services’ capabilities to navigate the intricacies of AWS, troubleshoot ongoing issues, and provide real-time guidance all without leaving the familiarity of Slack.

Jonathan Wong

Jonathan Wong is a Solutions Architect at AWS assisting with initiatives within Strategic Accounts. He is passionate about solving customer challenges and has been exploring emerging technologies to accelerate innovation.

Improve OpenSearch Service cluster resiliency and performance with dedicated coordinator nodes

Post Syndicated from Akshay Zade original https://aws.amazon.com/blogs/big-data/improve-opensearch-service-cluster-resiliency-and-performance-with-dedicated-coordinator-nodes/

Today, we are announcing dedicated coordinator nodes for Amazon OpenSearch Service domains deployed on managed clusters. When you use Amazon OpenSearch Service to create OpenSearch domains, the data nodes serve dual roles of coordinating data-related requests like indexing requests, and search requests, and of doing the work of processing the requests – indexing documents and responding to search queries. Additionally, data nodes also serve the OpenSearch Dashboards. Because of these multiple responsibilities, data nodes can become a hot spot in the OpenSearch Service domain, leading to resource scarcity, and ultimately node failures. Dedicated coordinator nodes help you mitigate this problem by limiting the request coordination and Dashboards to the coordinator nodes, and request processing to the data nodes. This leads to more resilient, scalable domains.

Amazon OpenSearch Service is a managed service that you can use to secure, deploy, and operate OpenSearch clusters at scale in the AWS Cloud. The service allows you to configure clusters with different types of nodes such as data nodes, dedicated cluster manager nodes, and UltraWarm nodes. When you send requests to your OpenSearch Service domain, the request is broadcast to the nodes with shards that will process that request. By assigning roles through deploying dedicated nodes, like dedicated cluster manager nodes, you concentrate the processing of those kinds of requests and remove that processing from nodes in other roles.

OpenSearch Service has recently expanded its node type options to include dedicated coordinator nodes, alongside data nodes, dedicated cluster manager nodes, and UltraWarm nodes. These dedicated coordinator nodes offload coordination tasks and dashboard hosting from data nodes, freeing up CPU and memory resources. By provisioning dedicated coordinator nodes, you can improve a cluster’s overall performance and resiliency. Dedicated coordinator nodes also let you scale the coordination capacity of your cluster independently of the data storage capacity. Dedicated coordinator nodes are available in Amazon OpenSearch Service for all OpenSearch engine versions. See the documentation for engine and version support.

A brief introduction to coordination

OpenSearch operates as a distributed system, where data is stored in multiple shards across various nodes. Consequently, a node handling a request must coordinate with several other nodes to store or retrieve data.

Here are a few examples of coordination operations performed to successfully serve different user requests:

  • A bulk indexing request might contain data that belongs to multiple shards. The coordination process splits such a request into multiple shard-specific subrequests and routes them to the corresponding shards for indexing.
  • A search request might require querying various shards that are present in different nodes. The coordination process splits the request into multiple shard level search requests and sends those requests to the corresponding data nodes holding the data. Each of those data nodes processes the data locally and returns a shard-level response. The coordination process gathers these responses and builds the final response.
  • For queries with aggregations, the coordination process performs the additional computation of re-aggregating the aggregation responses from data nodes.

In OpenSearch Service, each data node is implicitly capable of coordination. In the absence of dedicated coordinator nodes, the data node receiving the request will perform the coordinating duties, though it might not have the relevant shards for the request. By adding dedicated coordinator nodes to a cluster, you can reduce the burden on data nodes. The following sections walk through some of the improvements.

Higher indexing and search throughput

In an OpenSearch cluster, each indexing request goes through three broad phases: coordination, primary, and replica. With coordination responsibilities offloaded to dedicated coordinators, the data nodes have more resources at their disposal for the primary and replica phases. By adding coordinator nodes, we observed as much as 15% higher indexing throughput in workloads such as Stack Overflow and Big5.

A search request in OpenSearch can involve something as trivial as looking up a single document by ID or something complex, such as bucketing a large amount of data and performing aggregations on each of the buckets. The impact of adding dedicated coordinator nodes can vary widely depending on the query. In a query workload containing date histograms with multiple aggregations such as average, p50, p99, and so on, we were able to achieve about 20% higher throughput. The term and multi-term aggregations also benefit from the addition of coordinator nodes. Depending on the key composition throughput improvement of 15% to 20% was observed.

More resilient clusters

Dedicated coordinator nodes provide a separation of responsibilities that prevents data nodes from being overwhelmed by complex queries or sudden spikes in request volume. In the case of complex aggregations, the coordinator nodes absorb the CPU impact ensuring that the data nodes focus on filtering, matching, scoring, sorting, and returning the search response, and maintaining the integrity of the data. In addition to coordination responsibilities, coordinator nodes also serve the OpenSearch Dashboards frontend. This ensures that the dashboards stay responsive even during high loads, ensuring a smooth user experience.

Complex aggregations consume a lot of memory. Memory intensive operations can lead to out of memory (OOM) errors causing node crashes and data loss. By adding dedicated coordinator nodes in a cluster, you can isolate the impact away from the data nodes. Coordinator nodes can greatly improve performance by significantly reducing or even completely eliminating query-induced OOM errors on data nodes. Because coordinator nodes don’t hold any data, the cluster still remains functional even if one of the coordinator nodes fails.

Efficient scaling

Dedicated coordinator nodes separate a cluster’s coordination capacity from data storage capacity. This allows you to choose the amount of memory and CPU required for your workload without impacting the stored data. For example, a cluster with high throughput might require a lot of lightweight nodes while a cluster with complex aggregations should have fewer but larger nodes.

Having a dedicated coordinator node allows you to adjust the number of nodes according to anticipated traffic patterns. For example, you can scale up the number of coordinators in high traffic hours and scale them down during low traffic hours.

Smaller IP reservations for VPC domains

With dedicated coordinator nodes, you can achieve up to 90% reduction in the number of IP addresses reserved by the service in your VPC. This reduction allows deployments of larger clusters that might otherwise face resource constraints.

When you create a virtual private cloud (VPC) domain without dedicated coordinator nodes, OpenSearch Service places an elastic network interface (ENI) in the VPC for each data node. Each ENI is assigned an IP address. At the time of domain creation, the service reserves three IP addresses for each data node. See Architecture for more information. When dedicated coordinator nodes are used, the ENIs are attached to the coordinator nodes instead of the data nodes. Because there are typically fewer coordinator nodes than data nodes fewer IP addresses are reserved. The following diagram shows the domain architecture of a VPC domain with dedicated coordinator nodes.

Picking the right configuration

OpenSearch Service offers two key parameters for managing dedicated coordinator nodes:

  1. Instance type, which determines the memory and compute capacity of each coordinator node.
  2. Instance count, which specifies the number of coordinator nodes.

Identify your use case

To get the most benefits out of coordinator nodes, you must pick the right type as well as the right count. As a general rule, we recommend that you set the count to 10% of the number of data nodes and choose a size that’s similar to the size of the data nodes. See the documentation to find out the supported instance types for dedicated coordinator nodes. The following guidelines should help tailor the configuration further to specific workloads:

  • Indexing: Indexing requires compute power to split the bulk upload request payload into shard-specific chunks. We recommend using CPU optimized instances of a size similar to that of the data nodes. While the count is dependent on the indexing throughput that you want to achieve, 10% of the number of data nodes is a good starting point.
  • High search throughput: Achieving high search throughput requires a lot of network capacity. Increasing the number of coordinator nodes will sustain the traffic load while providing high availability. We recommend setting the coordinator node count at from 10% to 15% of the number of data nodes.
  • Complex aggregations: Aggregations are memory intensive. For example, to calculate a p50 value, a coordinator node must first gather the entire dataset in memory. Moreover, crunching these numbers requires CPU cycles. We recommend that you use general purpose coordinator nodes that are one size larger than the data nodes. While the node count can be tuned by the use case, 8% to 10% of the number of data nodes is a good start.

Coordinator metrics

While the guidelines above are a good start, every use case is unique. To arrive at an optimal configuration, you must experiment with your own workload, observe the performance, and identify the bottlenecks. OpenSearch Service provides some key metrics and APIs to observe how coordinator nodes are doing.

  • CoordinatorCPUUtilization metric: This metric provides information about how much CPU is being consumed on the coordinator nodes. This metric is available at both the node and the cluster levels. If you see CPU consistently breaching the 80% mark, it might be a time to use larger coordinator nodes.
  • CoordinatorJVMMemoryPressure, CoordinatorJVMGCOldCollectionCount and CoordinatorJVMGCOldCollectionTime metrics: The CoordinatorJVMMemoryPressure metric indicates the percentage of JVM memory used by the OpenSearch process. This metric is available at both the cluster and node levels. Consistently high JVM memory pressure suggests that coordination tasks are using memory efficiently. It’s important to assess this metric alongside the JVM garbage collection (GC) metrics, which show how many old generation GC runs have been triggered and how long they lasted. In a properly scaled cluster, GC runs should be infrequent and short. If GC runs occur too often, they might also negatively impact CPU performance.
  • CoordinatingWriteRejected metric: This metric should be evaluated alongside other metrics, such as PrimaryWriteRejected and ReplicaWriteRejected. An increase in primary or replica write rejections suggests that the data nodes are underscaled and unable to process requests quickly enough. However, if the CoordinatingWriteRejected metric rises independently of the other two, it indicates that the coordinating node is struggling to handle the indexing coordination process, preventing it from processing queued requests. Indexing requires many resources, any of which could be a bottleneck. You can alleviate indexing pressure where the CPU is the bottleneck with more or larger instances that have more vCPUs.
  • Circuit breaker statistics API: Circuit breakers prevent OpenSearch from causing a Java OutOfMemoryError. The circuit breaker statistics for coordinator nodes can be retrieved with following API:
    _nodes/coordinating_only:true/stats/breaker
    Every time a circuit breaker trips for a request the client receives a 429 error with the circuit_breaking_exception message. These indicate that the result size of the request was too big to fit on a coordinator node. To avoid these errors, it’s recommended to use an instance with more memory.

Provision a dedicated coordinator node

You can add one or more dedicated coordinator nodes by updating the domain configuration with the appropriate options for coordinator nodes. This will trigger a blue/green deployment, and the domain will have dedicated coordinator nodes once the deployment is complete. Alternatively, you can create a new domain with dedicated coordinator nodes.

In either scenario, you can expand or reduce the number of coordinator nodes without requiring a blue/green deployment, giving you the flexibility to experiment.

Conclusion

In real-world production environments, dedicated coordinator nodes in Amazon OpenSearch Service provide an effective way to separate coordination tasks from data processing. This shift enhances resource efficiency, often delivering up to a 15% increase in indexing throughput and a 20% improvement in query performance, depending on workload demands. By offloading coordination tasks, you reduce the risk of node overloads, improve system stability, and gain better cost control by scaling coordination and data tasks independently.

For workloads with complex queries and high traffic, dedicated coordinator nodes help ensure that your cluster maintains optimal performance and is prepared to handle future growth with greater resilience. Start experimenting with dedicated coordinator nodes today to unlock more efficient resource management and enhanced performance in your OpenSearch clusters.


About the author

Akshay Zade is a Senior SDE working for Amazon OpenSearch Service, passionate about solving real-world problems with the power of large-scale distributed systems. Outside of work, he enjoys drawing, painting, and diving into fantasy books.