[$] The things nobody wants to pay for

Post Syndicated from corbet original https://lwn.net/Articles/959069/

The free-software community has managed to build a body of software that is
worth, by most estimates, many billions of dollars; all of this code is
freely available to anybody who wants to use or modify it. It is an
unparalleled example of independent actors working cooperatively on a
common resource. Free software is certainly a success story, but all is
not perfect. One of the community’s greatest strengths — convincing
companies to contribute to this common resource — is also part of one of
its biggest weaknesses.

Exclusive in “Bivol”: Video reveals the sinister murder of Peyo Peev (censored, 18+)

Post Syndicated from the Bivol team original https://bivol.bg/peyo-peev-video.html

Thursday, 25 January 2024

Bivol has obtained the key piece of evidence that led to the solving of the brutal murder of 44-year-old Peyo Peev. The existence of a security-camera video recording was widely publicized, discussed, and…

GCC security features from AdaCore

Post Syndicated from corbet original https://lwn.net/Articles/959461/

The AdaCore blog describes some hardening features contributed to GCC for the GCC 14 release.

With -fharden-control-flow-redundancy, the compiler now verifies,
at the end of functions, whether the traversed basic blocks align
with a legitimate execution path. The purpose of this protective
measure is to detect and thwart attacks attempting to infiltrate
the middle of functions, thereby enhancing the overall security
posture of the compiled code.
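
To make the idea in the quote concrete, here is a minimal Python sketch of a control-flow redundancy check. It is an illustration only, not GCC's actual implementation: the block names, the `CFG` table, and the `visited_set_is_plausible` helper are all hypothetical. The sketch records which basic blocks ran and, at function exit, checks that the recorded set could have come from a legitimate entry-to-exit path.

```python
# Hypothetical control-flow graph: each basic block maps to its successors.
CFG = {
    "entry": ["check"],
    "check": ["grant", "deny"],
    "grant": ["exit"],
    "deny": ["exit"],
    "exit": [],
}
EXITS = {"exit"}


def build_predecessors(cfg):
    """Invert the successor table so each block lists its predecessors."""
    preds = {block: set() for block in cfg}
    for src, dsts in cfg.items():
        for dst in dsts:
            preds.setdefault(dst, set()).add(src)
    return preds


def visited_set_is_plausible(cfg, entry, exits, visited):
    """Return True if the visited blocks are compatible with the graph.

    The check requires that the entry and an exit block ran, that every
    other visited block has a visited predecessor, and that every visited
    non-exit block has a visited successor.
    """
    if entry not in visited or not (visited & exits):
        return False
    preds = build_predecessors(cfg)
    for block in visited:
        if block != entry and not (preds.get(block, set()) & visited):
            return False
        if block not in exits and not (set(cfg.get(block, ())) & visited):
            return False
    return True


# A normal run through the "grant" branch passes the check.
normal_run = {"entry", "check", "grant", "exit"}
# A jump into the middle of the function skips "entry" and "check".
hijacked_run = {"grant", "exit"}
```

With these inputs, `visited_set_is_plausible(CFG, "entry", EXITS, normal_run)` accepts the path, while the same call with `hijacked_run` rejects it because the function was never entered from the top.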

Building the Best SOC Takes Strategic Thinking

Post Syndicated from Rapid7 original https://blog.rapid7.com/2024/01/25/building-the-best-soc-takes-strategic-thinking/

So your security team is ready to scale up its security operations center, or SOC, to better meet the security needs of your organization. That’s great news. But there are some very important strategic questions that need to be answered if you want to build the most effective SOC you can and avoid some of the most common pitfalls teams of any size can encounter.

The Gartner® report SOC Model Guide is an excellent resource for understanding how to ask the right questions about your security needs and what to do once those questions are answered.

Question 1: Which Model is Right for You?

There are several different ways to build an effective SOC. And while some are more complicated (perhaps even prohibitively so) than others, knowing what your needs and resources are at the outset will help you make this crucial initial decision.

Gartner puts it this way:

“A SOC model defines a strategy for variation in the use of internal teams and external service providers when running a SOC. It ensures all roles required to operate a SOC are allocated to those best suited to discharge the associated responsibilities. An effective SOC model lets SRM leaders allocate resources based on business priorities, available skill sets and budget…”

There are effectively three ways to build a SOC: internal, external, and hybrid. The report has this to say:

“Opting for a hybrid SOC is one way to help grow capabilities, while managing scale and cost. A hybrid SOC is one in which more than one team, both insourced and outsourced, plays a role in the activities required for proper SOC operation. The question of which teams, roles, jobs and activities are best kept in-house or outsourced is complex. Building a SOC model helps you answer it and ensure a hybrid SOC is well-balanced.”

Question 2: Who Does What?

Let’s assume your organization is opting for a hybrid approach. The next question to ask yourself is which roles to outsource and which to keep in-house. Deciding whether internal staff or external partners are the better fit for each business need can take some serious soul-searching on your part.

Luckily, Gartner has some recommendations. From the report:

“Some SOC tasks are strategic, such as those performed by the roles of senior investigator, incident response manager and red team tester. They are often best performed by in-house staff who understand the business’s needs and the security issues.

“Other SOC tasks are tactical, such as building detection content for common attacks. They are generally best performed by a larger external team, which can do them more efficiently, on a bigger scale, and for longer periods.”

Question 3: How Do We Keep Everything Humming Along?

Once you’ve chosen your SOC model and built your team, it is important to monitor and respond to the ways in which the internal and external partners work together. Let’s assume you’ve followed Gartner’s recommendations, outsourced your tactical needs and some highly specific skill sets, and kept your strategic thinkers in-house. You then need a way for the teams to work together that is as dynamic as the environment they are seeking to protect.

Gartner offers this advice:

“Have clear demarcations between objective handlers, but ensure there is shared awareness. A challenge with hybrid models that use different providers or teams to handle objectives is that it can be hard to instill a results-oriented mindset. An external provider or internal team often gets “tunnel vision” — focusing only on its own individual objective — and loses sight of the big picture of SOC performance. You must ensure each provider or team is aware of its impact on adjacent objectives, not just its own.”

Just because different teams are going to have relatively different goals does not mean they should operate in silos. Ensuring that internal and external team members are able to see the big picture and understand the capabilities and limitations of others on the team is a critical component of building a SOC that works well today and grows well together.

Building a SOC from scratch is no easy feat and it is made harder without some serious strategic thinking and soul searching before building the team. Understand your unique needs, the general needs of a SOC team, what your resources are, and the expectations of your organization before building your own A-team of crack security professionals.

To read more about SOC models, check out the Gartner SOC Model Guide here.

Gartner, SOC Model Guide, Eric Ahlm, Mitchell Schneider, Pete Shoard, 18 October 2023

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

Security updates for Thursday

Post Syndicated from jake original https://lwn.net/Articles/959455/

Security updates have been issued by Debian (chromium, firefox-esr, php-phpseclib, phpseclib, thunderbird, and zabbix), Fedora (dotnet7.0, firefox, fonttools, and python-jinja2), Mageia (avahi and chromium-browser-stable), Oracle (java-1.8.0-openjdk, java-11-openjdk, LibRaw, openssl, and python-pillow), Red Hat (gnutls, kpatch-patch, php:8.1, and squid:4), SUSE (apache-parent, apache-sshd, bluez, cacti, cacti-spine, erlang, firefox, java-11-openjdk, opera, python-Pillow, tomcat, tomcat10, and xwayland), and Ubuntu (paramiko and puma).

Code Club at Number Ten Downing Street

Post Syndicated from Philip Colligan original https://www.raspberrypi.org/blog/code-club-number-ten-downing-street/

With the rapid advances in digital technologies like artificial intelligence, it’s more important than ever that every young person has the opportunity to learn how computers are being used to change the world and to develop the skills and confidence to get creative with technology. 

Learners at a Code Club taking place at Number Ten Downing Street.
Crown copyright. Licensed under the Open Government Licence.

There’s no better way to develop those abilities (super powers even) than getting hands-on experience of programming, whether that’s coding an animation, designing a game, creating a website, building a robot buggy, or training an AI classification model. That’s what tens of thousands of young people do every day in Code Clubs all over the world. 

Lessons at 10 

We were absolutely thrilled to organise a Code Club at Number Ten Downing Street last week, hosted by the UK Prime Minister’s wife Akshata Murty as part of Lessons at 10.

A Code Club session taking place at Number Ten Downing Street.
Crown copyright. Licensed under the Open Government Licence.

Lessons at 10 is an initiative to bring school children from all over the UK into Number Ten Downing Street, the official residence of the Prime Minister. Every week different schools visit to attend lessons led by education partners covering all kinds of subjects. 

We ran a Code Club for 20 Year 7 students (ages 11 to 12) from schools in Coventry and Middlesex. The young people had a great time with the Silly eyes and Ghostbusters projects from our collections of Scratch projects. Both stone-cold classics in my opinion, and a great place to start if you’re new to programming.

You may have spotted in the photos that the young people were programming on Raspberry Pi computers (the incredible Raspberry Pi 400 made in Wales). We also managed to get our hands on some cool new monitors. 

Mrs Murty’s father was one of the founders of Infosys, which ranks among the world’s most successful technology companies, founded in India and now operating all over the world. So it is perhaps no surprise that she spoke eloquently to the students about the importance of every young person learning about technology and seeing themselves as digital creators not consumers.

Akshata Murty talks to Philip Colligan, CEO of the Raspberry Pi Foundation.
Crown copyright. Licensed under the Open Government Licence.

We were lucky enough to be in one of the rather fancy rooms in Number Ten, featuring a portrait of Ada Lovelace, the world’s first computer programmer. Mrs Murty reminded us that one of the lessons we learn from Ada Lovelace is that computer programming combines both the logical and artistic aspects of human intelligence. So true.

A global movement 

Since Code Club’s launch in April 2012, it has grown to be the world’s largest movement of free computing clubs and has supported over 2 million young people to get creative with technology.

Learners from a Code Club in front of Number Ten Downing Street.
Crown copyright. Licensed under the Open Government Licence.

Code Clubs provide a free, fun, and safe environment for young people from all backgrounds to develop their digital skills. Run by teachers and volunteers, most Code Clubs take place in schools, and there are also lots in libraries and other community venues. 

The Raspberry Pi Foundation provides a broad range of projects that young people use to build their confidence and skills with lots of different hardware and software. The ultimate goal is that they are empowered to combine their logical and artistic skills to create something original. Just like Ada Lovelace did all those years ago.

All of our projects are designed to be self-directed, so young people can learn independently or in groups. That means that you don’t need to be a tech expert to set up or run a Code Club. We provide you with all the support that you need to get started.

If you want to find out more about how to set up a Code Club, visit the website here.

The post Code Club at Number Ten Downing Street appeared first on Raspberry Pi Foundation.

Quantum Computing Skeptics

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/01/quantum-computing-skeptics.html

Interesting article. I am also skeptical that we are going to see useful quantum computers anytime soon. Since at least 2019, I have been saying that this is hard. And that we don’t know if it’s “land a person on the surface of the moon” hard, or “land a person on the surface of the sun” hard. They’re both hard, but very different.

2023 in Review: A Bigger, Bolder, and Better Zabbix

Post Syndicated from Michael Kammer original https://blog.zabbix.com/2023-in-review-a-bigger-bolder-and-better-zabbix/27272/

It hardly seems possible, but somehow 2023 is already in the rearview mirror. It’s been quite a ride, full of dynamic growth, popular events, new releases, and exciting additions to our global community. Without further ado, let’s take a look at the highlights!

Spreading the word

We radically expanded our slate of events this year in an attempt to spread the good word about the world’s finest open-source monitoring solution and meet our vibrant community. Our efforts took the form of:

• 31 meetings (in locations ranging from Kuala Lumpur to Seoul to Paris)
• 3 forums (in Shenzhen, Shanghai, and Mexico City)
• 16 meetups (online and in multiple locations around the globe)
• 5 conferences (in Germany, Benelux, China, Japan, and Latin America)
• Countless exhibitions, trade fairs, and expos from Las Vegas to Tokyo and all points in between

Oh, and one blowout Zabbix Summit in Riga in October!

Building a better product

This year we released Zabbix 6.4, which included many important new features:

• Just-in-time (JIT) user provisioning
• Cause and symptom events
• Instant propagation of configuration changes
• Zero-downtime upgrades
• SNMP discovery/bulk data collection speed and performance improvements
• A new menu layout
• The ability to stream metrics and events from Zabbix to external systems over HTTP
• Template versioning
• A development framework for widget creation
• Optional interfaces for server-originated checks
• Streamlined media type configuration for multiple email service providers

Zabbix 6.4 also comes with many new templates for the most popular vendors and cloud providers, including:

• Microsoft Azure MySQL servers
• Microsoft Azure PostgreSQL servers
• Microsoft Azure virtual machines
• Low-level discovery improvements in AWS by HTTP template
• Veeam Backup Enterprise Manager
• Veeam Backup and Replication
• Cisco Nexus 9000 Series
• BMC Control-M
• Cisco Meraki dashboard
• OS processes by Zabbix agent
• Improvements to filesystem discovery in official Zabbix OS templates

Speaking of templates, since the release of Zabbix 6.0, we have developed 38 new integrations, including:

• 16 application templates
• 4 cloud templates
• 2 database templates
• 6 webhooks
• 2 net templates
• 3 SAN templates
• 5 server templates

Maintaining security

In January, we received an ISO/IEC 27001:2013 certificate for information security. The certification stands as proof positive that Zabbix protects all our information within the highest internationally acknowledged security standards and reaffirms our commitment to prioritize information security best practices everywhere within our organization.

February saw us launch a public bug bounty program in partnership with HackerOne, the world’s number one ethical hacker-powered platform. The program’s purpose is to discover potential security vulnerabilities by letting hackers proactively search for and report Zabbix security vulnerabilities and get rewarded for found and validated issues. The program has been a massive success, with 15 reports resolved and $17,800 in bounties being paid out so far.

The power of growth

In 2023 we managed to grow our headcount across every location we operate in, while adding to a growing roster of remote workers from around the world. On March 29, we officially opened a new office in Mexico, joining our offices in Brazil (opened in 2020), the United States (2016), Japan (2012), and Latvia (2005).

To celebrate this momentous occasion, we invited our community of users, partners, and customers to participate in a free and exclusive event dedicated entirely to Zabbix. They were able to learn a little more about the company, ask questions about the plans for the new office, and share knowledge with our team of experts.

Our Integration team also saw significant growth in 2023, which has resulted in a faster rollout of popular templates and integrations as well as higher levels of quality than ever before. The Partners team had a busy year as well, adding 19 new certified partners around the globe and upgrading several others to Premium and Certified Reseller status.

Lending a helping hand

As an open-source company, we champion knowledge sharing and a more open world. It’s why we took part in the career day at the Transport and Telecommunication Institute in Riga, supported the “Youth Has Talent” contest in Latvia organized by the Laiks Jauniešiem association, and sent our Head of Training Kristine Lamberte as a guest speaker to Rezekne Technical School.

Our team in Latin America got in on the action by working with the DEDICATE Foundation to develop the Zabbix Innova Challenge. It’s a free activity that’s designed to promote the development of technological projects that involve young people in Mexico, while boosting the technology community and stimulating the development of creative solutions.

Our goal in showing up at all these events is to encourage young talent, support and invest in local social projects that empower and inspire future generations, share our skills and experience, and showcase some of the amazing career opportunities that Zabbix can offer.

We aim to create a world without interruption, and just as we strive to make the world a better place by building the best monitoring tool possible, we also do what we can to help those around us whose lives have been interrupted by circumstances beyond their control.

In 2023, that involved donating a total of €378,000 to organizations like the Children’s Hospital Foundation, Samaritan International Latvia, The Oncological Patient Support Association “Tree of Life”, the Children’s Foundation of Latvia, the Autism Support Point in Rēzekne, and ziedot.lv.

Getting noticed

The world continued to sit up and take notice of what we’ve been doing in 2023. Brazilian tech journal iMasters started off the year by noting Zabbix LATAM’s incredible 300% growth rate, while another Brazilian journal, Baguete, published an outstanding piece on the opening of the Zabbix office in Mexico.

In May, we were recognized as the top monitoring solution on Peerspot, and July saw us spotlighted in Labs of Latvia, a media platform for tech and innovation, which reported on our global expansion.

October brought with it a wave of favorable press coverage – Zabbix Summit 2023 speaker Dr. Hiroshi Abe had great things to say about us when profiled in El Español, and the same publication also published a well-researched company profile after the Summit.

In addition, Guaratã Almeida, a Zabbix partner and the technology director of the Brazilian city of Maceió, was an enthusiastic participant in the Summit, as noted by the city’s website.

Meanwhile, ThinkIT in Japan published an insightful interview with Zabbix Engineers Elina Pulke and Eliza Sekace, plus an inside look at the Summit proceedings.

Belgian website ITdaily followed that up with a post-Summit look at our business model and future plans, while Techzine published a glowing profile of their own as November drew to a close.

The icing on the cake of 2023 was Zabbix being named to the list of the “Top 101 Latvia’s Most Valuable Enterprises in 2023.” It’s a good measure of our significant contribution to Latvia’s economy and a reminder of our increasingly global impact.

Carrying our momentum into 2024

It was a year full of growth and accomplishments, and it was all possible because of our incredible community of customers and contributors! As 2024 gets underway, you can look forward to a long list of new upgrades, events, and inspiration. Keep following us on social media, reading our blog, and checking our forum to stay on top of all the latest Zabbix news and events!

The post 2023 in Review: A Bigger, Bolder, and Better Zabbix appeared first on Zabbix Blog.

Words of the Year 2023: Artificial Intelligence in the Time Shelter of the Assemblage

Post Syndicated from original https://www.toest.bg/dumi-na-godinata-2023/

The campaign of the educational platform Kak se pishe? (“How Is It Spelled?”) has ended; the winners are known and public. And from here on? Knowing the words of 2023 is interesting in itself, even intriguing, but let us go beyond the entertaining side and ask: what do they tell us?

Why we run the campaign

For the third time now, together with Doroteya Nikolova, a teacher of Bulgarian language and literature and a journalist, we have organized this distinctive survey of public attitudes by asking people directly. One of our goals is a quick “review” of the year through the emblematic, key words that mark past events, examining what has left a more lasting imprint on Bulgarians’ minds. It matters for every society to know what excited and worried it, what it gave weight to, and what it neglected. The words of the year say a great deal in condensed form, both those present in the ranking and those absent, swept under the rug, provided we can interpret them.

Stages of the selection

The procedure the poll went through started with the people and ended with them. In the first stage, anyone could submit suggestions on the Kak se pishe? pages on Facebook and Threads, as well as on my personal profile on X. We then summarized the results, compiled a list of the 30 words and phrases that received the most nominations, and sent it to the jury members. Besides me and Doroteya Nikolova as organizers, the jury included Assoc. Prof. Georgi Lozanov, philosopher and media expert; Veselina Sedlarska, writer and journalist; and Dr. Ivan Landzhev, poet and essayist. After discussing the suggestions, we all agreed on 10 of them that, in our view, most adequately reflect the spirit of the past year and cover as many topics as possible. We put the final words and phrases into a poll on Kak se pishe?, in which anyone could again take part. We announced the results publicly and sent them to the media.

A separate and independent part of the Words of the Year 2023 study is the analysis of media language over the past year. The specialists at Sensika help us with it, and it is based on the huge body of text they processed from online sources in Bulgarian.

Winners are not judged

But they can be analyzed. Expectedly or not, artificial intelligence, sglobka (“assemblage”), and vremeubezhishte (“time shelter”) topped the ranking.

The undisputed news is that natural intelligence raised artificial intelligence onto the pedestal of 2023. Is it because it creates unsuspected opportunities in every sphere of life, or because it threatens existing models, again in every sphere of life? Or because it awakens in us both the curiosity to see what it can do and the fear of becoming unnecessary? The technological phenomenon certainly excites us, and we must watch what rights we delegate to it. This phrase also brings us closer to the world: in December the Collins dictionary announced the abbreviation AI as word of the year in the United Kingdom, while according to the Cambridge Dictionary the emblematic word was the verb hallucinate, again linked to artificial intelligence, this time to its production of false information. Observations by the Collins specialists show that use of artificial intelligence quadrupled compared with the previous year. Here, too, we can compare ourselves with the world: in Bulgarian media publications the phrase appeared almost five times more often in 2023 than in the preceding year¹.

Before the campaign began, my forecast was that sglobka would be hard to beat. The word became a convenient name for a form of government that its own participants lacked the courage, or the candor, to call a coalition. I think it was precisely their refusal to use an existing word that breathed new life into sglobka, which until recently meant only a joint or connection that is simple in nature. The name took on a new, political life, but it still carries the stigma of mechanical joining, devoid of the complexity and meaningfulness of connections of a much higher order, such as those of government are (or ought to be). That is why sglobka has proved such a favored word among the government’s critics, who never miss a chance to mock it with the term.

One neologism made its way into the ranking and is already doing so in the language itself. Vremeubezhishte is above all the title of a novel, but there is ample evidence that it is used as a common noun². The jury members and I took note of this, and so we decided to include the word in the poll in lowercase and without quotation marks. I will be interested to see whether vremeubezhishte establishes itself in the language. First, because it was created by a writer, and it is not every day that a Bulgarian writer coins a new word; second, because it is a compound with a specific meaning that leans toward the abstract. Trying to clarify it for myself, I ask: is a vremeubezhishte (outside the novel) actually a shelter in time or from time? Such are the questions compound words raise, because when their two parts fuse into a whole, a host of relations arises between them. As for the many votes the word received, I explain them by the fact that we Bulgarians are, after all, capable of appreciating a compatriot’s success and rejoicing in it.

The remaining seven words in the ranking

The familiar political topics are here to remind us that we find it hard to think about what is important and crucial for the country without mentioning war, dismantling, rotation, Schengen, and Euro-Atlanticist. It is no accident that non-coalition stayed outside the ten words in the poll, although quite a few people suggested it in the first stage of the campaign. The jury decided that three words for the form of government (together with sglobka and rotation) would weigh too heavily, not to mention that we also included Euro-Atlanticist; Euro-Atlanticism was one of last year’s key concepts relating to the glue that holds power together.

Without underestimating the result for dismantling or the importance of the event, timing may well have played no small part. The Monument to the Soviet Army was a topical subject very recently, in December, and we still remember well what happened. But would we not have half-forgotten the dismantling had it taken place in February, for example? We forget with terrible force and speed. See the examples in the next part.

The name of an international prize, the Booker, is in the ranking to tell us that we too have given something to world literature, but at the opposite pole, as a sobering reminder, is PISA. It reminds us, once every three years, how far our education system is from international standards and achievements. Incidentally, we will conveniently forget that too.

The missing topics

As Georgi Lozanov wrote for Deutsche Welle, Words of the Year rankings are “the quickest way to understand what people feel like thinking about and what they do not.” We do not feel like thinking about substantial topics and significant events: domestic violence, the war in the Middle East, the amendments to the Constitution… At the start of the campaign we received only isolated suggestions for “domestic violence,” “utility knife,” “the Gaza Strip,” and “Constitution.” All three topics drew public attention last year; the first two also built up energy on social networks, and the media covered the events, yet the human mind deemed them insignificant. Perhaps because violence in the family is too painful or does not concern us specifically, the war in Gaza is far away, and the constitutional amendments have already become banal. We will find reasons and explanations if we look for them. But will we look for solutions, at least for what we are able to change?

Still, there is good news too. Let us look again at the winners, this time alongside the words of previous years: pretsenyam (“I judge”), antivaxxer, and izchegartvane (“scraping out”) for 2021, and war, inflation, elections, and Ukraine for 2022. By choosing artificial intelligence, from the technological sphere, and vremeubezhishte, from the cultural one, we seem to be trying to break away from oppressive political talk and turn toward new horizons. However decisive politics may be for life in a society, we must not let public conversation be conducted under its dictate. Let us hope that little by little we begin to realize it.

1 The data are from Sensika’s analysis.

2 Skeptics can type, for example, “времеубежището” (“the time shelter”), “моето времеубежище” (“my time shelter”), or “нашето времеубежище” (“our time shelter”), in quotation marks, into a search engine, which will obligingly offer them results, including in texts that have nothing to do with Georgi Gospodinov’s work.


Language can in fact be tasty off the plate as well: the Bulgarian language we have spoken since childhood and to which we swear our love around 24 May. In its essence it is a means of communication, and to serve us well it changes constantly. Let us look at it in its dynamics and try to understand what is actually happening and why, what the driving mechanisms are, and how they are connected with social processes. And since the task is not easy, we will do it gradually, in portions.

How GitHub’s Developer Experience team improved innerloop development

Post Syndicated from belaltaher8 original https://github.blog/2024-01-24-how-githubs-developer-experience-team-improved-innerloop-development/


Building confidence in new code before deploying is a crucial part of any good development loop. This is especially challenging when working in a distributed or microservice system with multiple teams operating on different services. This modular team structure gives rise to an important question: how can we provide teams with fast and reliable development cycles when testing and shipping requires them to test inside an ecosystem of other services? Optimizing the solution to this problem greatly improves engineering efficiency and can contribute to more successful outcomes for the organization as a whole.

This problem is one the Developer Experience (DX) team at GitHub grappled with again and again, ultimately delivering a solution we call “Hubber Codespace” (HCS). HCS is a tool that Hubbers (people who work at GitHub) can use to locally stand up the entire distributed GitHub ecosystem in any environment by simply querying an endpoint or adding a couple lines of configuration to their development containers.

In this post, we’ll tell you how we landed on the HCS solution to this common problem over some possible alternatives, and you’ll get a first-hand look at how GitHub’s developer-first mindset helped us deliver the best tool for Hubbers to ship code quickly and safely in our own distributed environment.

One big (un)-happy environment

To understand the problem we were trying to solve, we have to go back in time. There was a point at which GitHub was just a couple of teams and a much simpler product. Back then, having a monorepo in which everyone iterated and built confidence in their changes made sense. Splitting responsibilities up across repositories would have added overhead that bogged down early Hubbers. Fast forward to today, and GitHub has grown into a big organization with hundreds of different teams. Now, the trade-off between velocity and complexity can look very different.

Let’s consider these complexities a bit further. Different services can have entirely different sets of dependencies and even have dependencies on different versions of the same software (for example, one service requires Ruby 2.2 while another requires Ruby 2.4). In smaller collaborative settings, the engineers can easily reconcile these needs. But this complexity grows exponentially as more teams are introduced. Trying to provide a single environment in which these kinds of disparate services can run and interact in development becomes difficult to do. It can result in ad-hoc “hacks” in development loops like deleting a .ruby-version file depending on which service’s development loop you’re working through. These are the kinds of problems that you encounter when trying to work with a monorepo that contains the codebases for a set of disparate services.
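
As a small illustration of the kind of conflict described above, the following Python sketch flags tools that different services pin to different versions. The service names, the `node` pin, and the `find_version_conflicts` helper are hypothetical; only the Ruby 2.2 versus 2.4 example comes from the text.

```python
from collections import defaultdict

# Hypothetical version pins per service; the Ruby values mirror the example above.
SERVICE_PINS = {
    "service_a": {"ruby": "2.2", "node": "18"},
    "service_b": {"ruby": "2.4", "node": "18"},
}


def find_version_conflicts(service_pins):
    """Return the tools that are pinned to more than one version.

    The result maps each conflicting tool to a dictionary of
    version -> list of services that require that version.
    """
    versions_by_tool = defaultdict(dict)
    for service, pins in service_pins.items():
        for tool, version in pins.items():
            versions_by_tool[tool].setdefault(version, []).append(service)
    return {
        tool: versions
        for tool, versions in versions_by_tool.items()
        if len(versions) > 1
    }


conflicts = find_version_conflicts(SERVICE_PINS)
```

Here `conflicts` contains only `ruby`, since both services agree on `node`. In a single shared environment, every entry in that result is something a developer would have to work around by hand.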

So, we decided to design a new solution. Instead of bringing the developers to the ecosystem, what if we brought the ecosystem to the developers?

Enter HCS

This line of thinking led us to build HCS, a Docker-Compose project that does exactly that. In the post “How we build containerized services at GitHub using GitHub,” we detailed how we build containerized services that power microservices on the GitHub.com platform and many internal tools. Our task now was to take these containers and wire them up such that partner teams could spin up a full GitHub ecosystem on demand. This would allow them to test their changes in an integrated environment. Developers could see how their code behaves inside GitHub’s distributed system before deploying it, rather than only observing it in the isolated environment of the application being developed. In this way, developers could gain confidence that the services they were changing behaved correctly when interacting with their upstream and downstream dependencies.

When considering how to orchestrate all the required containers, a few solutions came to mind: Docker-Compose, an internal tool called Codespace-Compose that allows us to SSH tunnel between multiple codespaces, and Minikube. Any of these three solutions could solve the ecosystem problem and would have unique tradeoffs. Let’s look at some of those tradeoffs now.

Minikube offers a robust Kubernetes architecture, but we had concerns about the overall user experience. We ultimately decided against it as the issues we identified, such as networking complexity and long cycle times, could bog down development speed.

Codespace-Compose allows us to easily connect teams’ everyday development environments, but we reasoned that, since Codespace-Compose is an internal experiment without any SLA, we’d incur a maintenance cost on our own team by adopting this.

Docker-Compose seemed to fit our needs the best. It didn’t incur any additional maintenance burden since it’s publicly available and actively managed. It offers all the same benefits of Minikube without the long cycle time. Most importantly, using Docker in Docker in a codespace, which allows us to create Docker containers on a host that is itself a Docker container, is a well-paved path that has lots of prior art. Given all these considerations, we decided on orchestrating our containers using Docker-Compose.

After deciding on Docker-Compose as our orchestrator, the next steps were to figure out the interface. Docker-Compose already supplies end users with commands, but we wanted to optimize the UX around HCS. To do this, we built a user-friendly CLI in Golang with parallel versioning to HCS. This abstracted away all the complexity of using the two together. Simply download a specific release version for HCS, get the same version of the CLI binary, and you’re good to go!

CLI and release automation

Ensuring HCS is useful means ensuring a couple of things. One important goal is ease of use. Docker-Compose already offers an interface for end users, but considering some of the built-in commands are long and use predictable options, we decided to wrap it in a custom Golang CLI. This abstracted many of the underlying details away, such as static file locations, formatting options, and entrypoint commands, to improve the end-user experience. The code below shows this by juxtaposing the Docker-Compose commands with their equivalent HCS CLI commands.

The following example compares the commands to start up the integrated environment provided by HCS.

# Start using Docker-Compose

docker compose --project-name hcs \
--file /workspaces/hubber-codespace-dist/docker-compose-hcs-actions.yml \
--file /workspaces/hubber-codespace-dist/docker-compose-hcs-base.yml \
--file /workspaces/hubber-codespace-dist/docker-compose-hcs-bg.yml \
--file /workspaces/hubber-codespace-dist/docker-compose-hcs-core.yml \
--file /workspaces/hubber-codespace-dist/docker-compose-hcs-volume.yml \
--file /workspaces/hubber-codespace-dist/docker-compose-hcs-test.yml \
--file /workspaces/hubber-codespace-dist/docker-compose-hcs-vendor.yml \
--profile full up -d --remove-orphans

# Start using CLI

hcs start

This next example compares how to get a shell to run commands from inside the various containers in GitHub’s distributed ecosystem. This allows developers to modularly interact with and make ephemeral changes to the system.

# Run command from inside a container in the system using Docker-Compose

docker compose --project-name hcs exec <service> bash

# Run from inside a container using CLI

hcs shell

This example compares how to check the status of the containers in the project so end-users can easily see the health of the entire system.

# Status using Docker-Compose

docker compose --project-name hcs ps --format json

# Status using CLI

hcs status
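The wrapping idea behind these three comparisons can be sketched in a few lines. This is only an illustration in Python, not the HCS CLI itself (which is written in Golang and is internal): the function names `compose_base` and `build_command` are hypothetical, and the file names and flags are copied from the examples above.

```python
import shlex

# Directory and compose file suffixes taken from the "start" example above.
DIST_DIR = "/workspaces/hubber-codespace-dist"
COMPOSE_FILES = ["actions", "base", "bg", "core", "volume", "test", "vendor"]


def compose_base(project="hcs"):
    """Common prefix shared by every wrapped Docker-Compose call."""
    return ["docker", "compose", "--project-name", project]


def build_command(subcommand):
    """Translate a short wrapper subcommand into the full docker compose argv."""
    if subcommand == "start":
        argv = compose_base()
        for name in COMPOSE_FILES:
            argv += ["--file", f"{DIST_DIR}/docker-compose-hcs-{name}.yml"]
        return argv + ["--profile", "full", "up", "-d", "--remove-orphans"]
    if subcommand == "status":
        return compose_base() + ["ps", "--format", "json"]
    raise ValueError(f"unknown subcommand: {subcommand}")


if __name__ == "__main__":
    # Print the expanded command instead of running it.
    print(shlex.join(build_command("status")))
```

Keeping the static file locations and formatting options in one place is what lets the user-facing command stay as short as hcs start or hcs status.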

In addition to this easy-to-use and ergonomic CLI, we had to ensure that HCS runs an up-to-date version of the GitHub ecosystem. GitHub is made up of so many different moving pieces that testing new changes on code that’s even a couple days old would not be sufficient to build confidence. When iterating directly on the monorepo, this was a non-issue since folks just fetched the main branch. For HCS, this required us to build automation that cuts releases on a frequent cron schedule. A release of HCS is a software artifact containing the compiled Golang binary for HCS and its CLI that can be pulled using the gh CLI.

The diagram below illustrates how this process works.

This diagram shows the nightly release cycle of HCS. HCS's repository gets SHAs from the monorepo and other service repositories. Then it publishes a release with all the SHAs, the Docker-Compose configs, and the CLI binary.
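As a rough sketch of what such a release bundles together, the snippet below assembles the three ingredients named above (SHAs, Docker-Compose configs, and the CLI binary) into one record. Everything here is an assumption made for illustration: the function name, the field names, the date-based version string, and the repository names are hypothetical, and the SHAs are left as placeholders.

```python
from datetime import date


def build_release_manifest(shas, compose_files, cli_binary, release_date):
    """Collect what one nightly release bundles together, as a plain dict."""
    return {
        # Hypothetical date-based version; the real versioning scheme is not described.
        "version": release_date.strftime("%Y.%m.%d"),
        "shas": dict(sorted(shas.items())),
        "compose_files": sorted(compose_files),
        "cli_binary": cli_binary,
    }


manifest = build_release_manifest(
    shas={"service-b": "<sha>", "monorepo": "<sha>"},  # placeholder repos and SHAs
    compose_files=["docker-compose-hcs-core.yml", "docker-compose-hcs-base.yml"],
    cli_binary="hcs",
    release_date=date(2024, 1, 25),
)
```

Pinning every service to a SHA in the same record is what makes a given release reproducible: two developers who pull the same version get the same ecosystem.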

End-user experience

Using HCS directly in your codespace

We’ve recently made efforts to push all development at GitHub onto GitHub Codespaces. A codespace is a custom development container, or devcontainer, based on a configuration file in a repository. A repository can have multiple codespaces associated with it as long as each has a unique configuration file. On top of the obvious benefits of having a reproducible environment on demand to develop and iterate in, devcontainers offer features, an abstraction that lets developers easily add software to their environments. HCS is also consumable this way. The code block below shows the couple lines needed to bring this entire ecosystem to a partner team’s preferred environment (that is, their codespace).

{
…
  "features": {
    …
    "ghcr.io/devcontainers/features/github-cli:1": {
      "version": "latest"
    },
    //docker-in-docker required for hcs
    "ghcr.io/devcontainers/features/docker-in-docker:2": {},
    // Include the hubber-codespace feature
    "ghcr.io/github/hubber-codespace/hcs:1": {},
    "ghcr.io/devcontainers/features/go:1": {}
    …
  }
}

Now, teams can perform integration testing against the many other services in GitHub’s ecosystem from directly in the codespace where they were doing local development.

Release binary

Even with the push towards codespaces, not every context that requires an ecosystem will be a devcontainer. In light of this, we also gave end users the option to download the release directly from the GitHub API. The commands to do so can be seen below. With a couple simple commands, Hubbers now have everything they need to bring the entire GitHub ecosystem to whatever environment they want.

gh release download --repo github/hubber-codespace -p hcs -D /tmp/

chmod +x /tmp/hcs

sudo mv /tmp/hcs /usr/local/bin

hcs init

hcs pull

hcs start

Testimonials

But don’t just take my word for it. Check out what our partner teams have had to say about HCS improving their development loop:

“HCS has improved our dev loop for [our service] by making it simple to test [it] against [the rest of GitHub’s ecosystem]. It’s turned what used to be a number of manual steps to clone our repository into the [monorepo environment] into two simple commands in our own codespace. This has made it much easier to validate our changes without having to deploy to a staging environment.”

“Given that we are a service operating outside GitHub but with a heavy reliance on the services running within GitHub, we’ve had to go through a lot of bells and whistles to ensure we can have a smooth development experience. In my four years working on [our service], HCS has been the most seamless experience in going from a blank devbox to breakpointing live running code for our service.”

Conclusion

Solving the ecosystem problem is always a balancing act. Luckily, thanks to GitHub’s push towards containerization, and tooling such as repository automation and publishing/consuming releases through the GitHub CLI, we were adequately equipped to develop a solution with HCS. Hubbers can now leverage a development loop that allows them to deploy with confidence, having tested their changes within GitHub’s complex multi-service system.

The post How GitHub’s Developer Experience team improved innerloop development appeared first on The GitHub Blog.

[$] Python, packaging, and pip—again

Post Syndicated from jake original https://lwn.net/Articles/959236/

Python packaging discussions seem like they often just go around and
around, ending up where they started and recapitulating many of the points that
have come up before. A recent discussion revolves around the pip package installer, as they
often do. The central role that is occupied by pip has both
good points and bad. There is a clear need for something that
can install from the Python Package Index
(PyPI) immediately after Python itself is installed. Whether there
should be additional features, including project management, that come
“inside the box”, as well,
is much less clear—not unlike the question of which project management
“style” should be chosen.

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Post Syndicated from Pathik Shah original https://aws.amazon.com/blogs/big-data/use-amazon-athena-with-spark-sql-for-your-open-source-transactional-table-formats/

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. As data lakes have grown in size and matured in usage, a significant amount of effort can be spent keeping the data consistent with business events. To ensure files are updated in a transactionally consistent manner, a growing number of customers are using open-source transactional table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake that help you store data with high compression rates, natively interface with your applications and frameworks, and simplify incremental data processing in data lakes built on Amazon S3. These formats enable ACID (atomicity, consistency, isolation, durability) transactions, upserts, and deletes, and advanced features such as time travel and snapshots that were previously only available in data warehouses. Each storage format implements this functionality in slightly different ways; for a comparison, refer to Choosing an open table format for your transactional data lake on AWS.

In 2023, AWS announced general availability for Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake in Amazon Athena for Apache Spark, which removes the need to install a separate connector or associated dependencies and manage versions, and simplifies the configuration steps required to use these frameworks.

In this post, we show you how to use Spark SQL in Amazon Athena notebooks and work with Iceberg, Hudi, and Delta Lake table formats. We demonstrate common operations such as creating databases and tables, inserting data into the tables, querying data, and looking at snapshots of the tables in Amazon S3 using Spark SQL in Athena.

Prerequisites

Complete the following prerequisites:

Download and import example notebooks from Amazon S3

To follow along, download the notebooks discussed in this post from the following locations:

After you download the notebooks, import them into your Athena Spark environment by following the To import a notebook section in Managing notebook files.

Navigate to specific Open Table Format section

If you are interested in the Iceberg table format, navigate to the Working with Apache Iceberg tables section.

If you are interested in the Hudi table format, navigate to the Working with Apache Hudi tables section.

If you are interested in the Delta Lake table format, navigate to the Working with Linux Foundation Delta Lake tables section.

Working with Apache Iceberg tables

When using Spark notebooks in Athena, you can run SQL queries directly without having to use PySpark. We do this by using cell magics, which are special headers in a notebook cell that change the cell’s behavior. For SQL, we can add the %%sql magic, which will interpret the entire cell contents as a SQL statement to be run on Athena.
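The same statements can also be issued from a regular Python cell. The sketch below assumes the notebook exposes a SparkSession under the name spark (an assumption here, not something stated above) and relies only on SparkSession.sql. The helper names split_statements and run_statements are hypothetical, and the semicolon splitting is deliberately naive.

```python
def split_statements(script):
    """Split a SQL script on semicolons, dropping blank fragments.

    Naive on purpose: it does not handle semicolons inside string literals.
    """
    return [part.strip() for part in script.split(";") if part.strip()]


def run_statements(spark, script):
    """Run each statement with spark.sql and return the resulting DataFrames."""
    return [spark.sql(statement) for statement in split_statements(script)]


# In a notebook cell (assuming `spark` is the session object):
# run_statements(spark, "CREATE DATABASE icebergdb; SHOW DATABASES")
```

The %%sql magic remains the shorter option for a single statement; a helper like this is mainly useful when several statements should run in sequence from one cell.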

In this section, we show how you can use SQL on Apache Spark for Athena to create, analyze, and manage Apache Iceberg tables.

Set up a notebook session

In order to use Apache Iceberg in Athena, while creating or editing a session, select the Apache Iceberg option by expanding the Apache Spark properties section. It will pre-populate the properties as shown in the following screenshot.

This image shows the Apache Iceberg properties set while creating a Spark session in Athena.

For steps, see Editing session details or Creating your own notebook.

The code used in this section is available in the SparkSQL_iceberg.ipynb file to follow along.

Create a database and Iceberg table

First, we create a database in the AWS Glue Data Catalog. With the following SQL, we can create a database called icebergdb:

%%sql
CREATE DATABASE icebergdb

Next, in the database icebergdb, we create an Iceberg table called noaa_iceberg pointing to a location in Amazon S3 where we will load the data. Run the following statement and replace the location s3://<your-S3-bucket>/<prefix>/ with your S3 bucket and prefix:

%%sql
CREATE TABLE icebergdb.noaa_iceberg(
station string,
date string,
latitude string,
longitude string,
elevation string,
name string,
temp string,
temp_attributes string,
dewp string,
dewp_attributes string,
slp string,
slp_attributes string,
stp string,
stp_attributes string,
visib string,
visib_attributes string,
wdsp string,
wdsp_attributes string,
mxspd string,
gust string,
max string,
max_attributes string,
min string,
min_attributes string,
prcp string,
prcp_attributes string,
sndp string,
frshtt string)
USING iceberg
PARTITIONED BY (year string)
LOCATION 's3://<your-S3-bucket>/<prefix>/noaaiceberg/'
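Because every column in this table is a string, the statement can also be generated rather than typed out. The following is a minimal sketch, assuming you want to build the DDL text in a Python cell; the helper name create_table_ddl is hypothetical, and the bucket and prefix remain placeholders for your own values.

```python
# Column names copied from the CREATE TABLE statement above.
NOAA_COLUMNS = [
    "station", "date", "latitude", "longitude", "elevation", "name",
    "temp", "temp_attributes", "dewp", "dewp_attributes",
    "slp", "slp_attributes", "stp", "stp_attributes",
    "visib", "visib_attributes", "wdsp", "wdsp_attributes",
    "mxspd", "gust", "max", "max_attributes", "min", "min_attributes",
    "prcp", "prcp_attributes", "sndp", "frshtt",
]


def create_table_ddl(table, table_format, location):
    """Build a CREATE TABLE statement with all-string columns, partitioned by year."""
    columns = ",\n".join(f"{name} string" for name in NOAA_COLUMNS)
    return (
        f"CREATE TABLE {table}(\n{columns})\n"
        f"USING {table_format}\n"
        "PARTITIONED BY (year string)\n"
        f"LOCATION '{location}'"
    )


ddl = create_table_ddl(
    "icebergdb.noaa_iceberg",
    "iceberg",
    "s3://<your-S3-bucket>/<prefix>/noaaiceberg/",  # replace with your bucket and prefix
)
```

The resulting string can be pasted into a %%sql cell. The same helper also matches the shape of the Delta Lake DDL later in this post when called with delta as the format; the Hudi table uses extra TBLPROPERTIES and would need its own variant.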

Insert data into the table

To populate the noaa_iceberg Iceberg table, we insert data from the Parquet table sparkblogdb.noaa_pq that was created as part of the prerequisites. You can do this using an INSERT INTO statement in Spark:

%%sql
INSERT INTO icebergdb.noaa_iceberg select * from sparkblogdb.noaa_pq

Alternatively, you can use CREATE TABLE AS SELECT with the USING iceberg clause to create an Iceberg table and insert data from a source table in one step:

%%sql
CREATE TABLE icebergdb.noaa_iceberg
USING iceberg
PARTITIONED BY (year)
AS SELECT * FROM sparkblogdb.noaa_pq

Query the Iceberg table

Now that the data is inserted in the Iceberg table, we can start analyzing it. Let’s run a Spark SQL query to find the minimum recorded temperature by year for the 'SEATTLE TACOMA AIRPORT, WA US' location:

%%sql
select name, year, min(MIN) as minimum_temperature
from icebergdb.noaa_iceberg
where name = 'SEATTLE TACOMA AIRPORT, WA US'
group by 1,2

We get the following output.

Image shows output of first select query

Update data in the Iceberg table

Let’s look at how to update data in our table. We want to update the station name 'SEATTLE TACOMA AIRPORT, WA US' to 'Sea-Tac'. Using Spark SQL, we can run an UPDATE statement against the Iceberg table:

%%sql
UPDATE icebergdb.noaa_iceberg
SET name = 'Sea-Tac'
WHERE name = 'SEATTLE TACOMA AIRPORT, WA US'

We can then run the previous SELECT query to find the minimum recorded temperature for the 'Sea-Tac' location:

%%sql
select name, year, min(MIN) as minimum_temperature
from icebergdb.noaa_iceberg
where name = 'Sea-Tac'
group by 1,2

We get the following output.

Image shows output of second select query

Compact data files

Open table formats like Iceberg work by creating delta changes in file storage, and tracking the versions of rows through manifest files. More data files lead to more metadata stored in manifest files, and small data files often cause an unnecessary amount of metadata, resulting in less efficient queries and higher Amazon S3 access costs. Running Iceberg’s rewrite_data_files procedure in Spark for Athena will compact data files, combining many small delta change files into a smaller set of read-optimized Parquet files. Compacting files speeds up read operations. To run compaction on our table, run the following Spark SQL:

%%sql
CALL spark_catalog.system.rewrite_data_files
(table => 'icebergdb.noaa_iceberg', strategy=>'sort', sort_order => 'zorder(name)')

rewrite_data_files offers options to specify your sort strategy, which can help reorganize and compact data.

List table snapshots

Each write, update, delete, upsert, and compaction operation on an Iceberg table creates a new snapshot of a table while keeping the old data and metadata around for snapshot isolation and time travel. To list the snapshots of an Iceberg table, run the following Spark SQL statement:

%%sql
SELECT *
FROM spark_catalog.icebergdb.noaa_iceberg.snapshots

Expire old snapshots

Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small. Expiring snapshots never removes files that are still required by a non-expired snapshot. In Spark for Athena, run the following SQL to expire snapshots for the table icebergdb.noaa_iceberg that are older than a specific timestamp:

%%sql
CALL spark_catalog.system.expire_snapshots
('icebergdb.noaa_iceberg', TIMESTAMP '2023-11-30 00:00:00.000')

Note that the timestamp value is specified as a string in format yyyy-MM-dd HH:mm:ss.fff. The output will give a count of the number of data and metadata files deleted.
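If you prefer a rolling cutoff to a hardcoded one, the timestamp string can be computed. This is a small sketch with hypothetical helper names (format_timestamp, expire_snapshots_sql); it only builds the statement text in the format described above and does not run anything. Note that Python's %f yields microseconds, so the milliseconds are formatted separately.

```python
from datetime import datetime, timedelta


def format_timestamp(moment):
    """Format as yyyy-MM-dd HH:mm:ss.fff (milliseconds, not microseconds)."""
    return moment.strftime("%Y-%m-%d %H:%M:%S.") + f"{moment.microsecond // 1000:03d}"


def expire_snapshots_sql(table, now, keep_days):
    """Build the CALL statement that expires snapshots older than keep_days."""
    cutoff = now - timedelta(days=keep_days)
    return (
        "CALL spark_catalog.system.expire_snapshots\n"
        f"('{table}', TIMESTAMP '{format_timestamp(cutoff)}')"
    )


# Seven days before 2023-12-07 reproduces the cutoff used in the example above.
statement = expire_snapshots_sql("icebergdb.noaa_iceberg", datetime(2023, 12, 7), 7)
```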

Drop the table and database

You can run the following Spark SQL to clean up the Iceberg tables and associated data in Amazon S3 from this exercise:

%%sql
DROP TABLE icebergdb.noaa_iceberg PURGE

Run the following Spark SQL to remove the database icebergdb:

%%sql
DROP DATABASE icebergdb

To learn more about all the operations you can perform on Iceberg tables using Spark for Athena, refer to Spark Queries and Spark Procedures in the Iceberg documentation.

Working with Apache Hudi tables

Next, we show how you can use SQL on Spark for Athena to create, analyze, and manage Apache Hudi tables.

Set up a notebook session

In order to use Apache Hudi in Athena, while creating or editing a session, select the Apache Hudi option by expanding the Apache Spark properties section.

This image shows the Apache Hudi properties set while creating a Spark session in Athena.

For steps, see Editing session details or Creating your own notebook.

The code used in this section should be available in the SparkSQL_hudi.ipynb file to follow along.

Create a database and Hudi table

First, we create a database called hudidb that will be stored in the AWS Glue Data Catalog followed by Hudi table creation:

%%sql
CREATE DATABASE hudidb

We create a Hudi table pointing to a location in Amazon S3 where we will load the data. Note that the table is of copy-on-write type, which is defined by type = 'cow' in the table DDL. We have defined station and date as a composite primary key and year as the preCombineField. Also, the table is partitioned on year. Run the following statement and replace the location s3://<your-S3-bucket>/<prefix>/ with your S3 bucket and prefix:

%%sql
CREATE TABLE hudidb.noaa_hudi(
station string,
date string,
latitude string,
longitude string,
elevation string,
name string,
temp string,
temp_attributes string,
dewp string,
dewp_attributes string,
slp string,
slp_attributes string,
stp string,
stp_attributes string,
visib string,
visib_attributes string,
wdsp string,
wdsp_attributes string,
mxspd string,
gust string,
max string,
max_attributes string,
min string,
min_attributes string,
prcp string,
prcp_attributes string,
sndp string,
frshtt string,
year string)
USING HUDI
PARTITIONED BY (year)
TBLPROPERTIES(
primaryKey = 'station, date',
preCombineField = 'year',
type = 'cow'
)
LOCATION 's3://<your-S3-bucket>/<prefix>/noaahudi/'
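To make the roles of primaryKey and preCombineField concrete: when several incoming records share the same key, the one with the largest preCombineField value is kept. The sketch below models that rule in plain Python. It is a simplified illustration, not Hudi's implementation; the function name precombine is hypothetical, the sample records are contrived purely to show the ordering rule, and letting the later record win a tie is a choice made in this sketch.

```python
def precombine(records, key_fields, precombine_field):
    """Keep one record per key: the one with the largest precombine value."""
    winners = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        current = winners.get(key)
        # On a tie, the record seen later replaces the earlier one.
        if current is None or record[precombine_field] >= current[precombine_field]:
            winners[key] = record
    return list(winners.values())


# Made-up records: the first two share the key (station, date).
records = [
    {"station": "A", "date": "01-01", "year": "2022", "temp": "40"},
    {"station": "A", "date": "01-01", "year": "2023", "temp": "41"},
    {"station": "B", "date": "01-01", "year": "2023", "temp": "50"},
]
deduplicated = precombine(records, ["station", "date"], "year")
```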

Insert data into the table

Like with Iceberg, we use the INSERT INTO statement to populate the table by reading data from the sparkblogdb.noaa_pq table created in the previous post:

%%sql
INSERT INTO hudidb.noaa_hudi select * from sparkblogdb.noaa_pq

Query the Hudi table

Now that the table is created, let’s run a query to find the maximum recorded temperature for the 'SEATTLE TACOMA AIRPORT, WA US' location:

%%sql
select name, year, max(MAX) as maximum_temperature
from hudidb.noaa_hudi
where name = 'SEATTLE TACOMA AIRPORT, WA US'
group by 1,2

Update data in the Hudi table

Let’s change the station name 'SEATTLE TACOMA AIRPORT, WA US' to 'Sea-Tac'. We can run an UPDATE statement on Spark for Athena to update the records of the noaa_hudi table:

%%sql
UPDATE hudidb.noaa_hudi
SET name = 'Sea-Tac'
WHERE name = 'SEATTLE TACOMA AIRPORT, WA US'

We run the previous SELECT query to find the maximum recorded temperature for the 'Sea-Tac' location:

%%sql
select name, year, max(MAX) as maximum_temperature
from hudidb.noaa_hudi
where name = 'Sea-Tac'
group by 1,2

Run time travel queries

We can use time travel queries in SQL on Athena to analyze past data snapshots. For example:

%%sql
select name, year, max(MAX) as maximum_temperature
from hudidb.noaa_hudi timestamp as of '2023-12-01 23:53:43.100'
where name = 'SEATTLE TACOMA AIRPORT, WA US'
group by 1,2

This query checks the Seattle Airport temperature data as of a specific time in the past. The timestamp clause lets us travel back without altering current data. Note that the timestamp value is specified as a string in format yyyy-MM-dd HH:mm:ss.fff.

Optimize query speed with clustering

To improve query performance, you can perform clustering on Hudi tables using SQL in Spark for Athena:

%%sql
CALL run_clustering(table => 'hudidb.noaa_hudi', order => 'name')

Compact tables

Compaction is a table service that Hudi uses specifically in Merge On Read (MOR) tables: it periodically merges updates from row-based log files into the corresponding columnar base file to produce a new version of that base file. It does not apply to Copy On Write (COW) tables, such as the noaa_hudi table created earlier with type = 'cow'. You can run the following query in Spark for Athena to perform compaction on a MOR table (hudi_table_mor here stands for a MOR table of your own):

%%sql
CALL run_compaction(op => 'run', table => 'hudi_table_mor');

Drop the table and database

Run the following Spark SQL to remove the Hudi table you created and associated data from the Amazon S3 location:

%%sql
DROP TABLE hudidb.noaa_hudi PURGE

Run the following Spark SQL to remove the database hudidb:

%%sql
DROP DATABASE hudidb

To learn about all the operations you can perform on Hudi tables using Spark for Athena, refer to SQL DDL and Procedures in the Hudi documentation.

Working with Linux Foundation Delta Lake tables

Next, we show how you can use SQL on Spark for Athena to create, analyze, and manage Delta Lake tables.

Set up a notebook session

In order to use Delta Lake in Spark for Athena, while creating or editing a session, select Linux Foundation Delta Lake by expanding the Apache Spark properties section.

This image shows the Delta Lake properties set while creating a Spark session in Athena.

For steps, see Editing session details or Creating your own notebook.

The code used in this section should be available in the SparkSQL_delta.ipynb file to follow along.

Create a database and Delta Lake table

In this section, we create a database in the AWS Glue Data Catalog. Using the following SQL, we can create a database called deltalakedb:

%%sql
CREATE DATABASE deltalakedb

Next, in the database deltalakedb, we create a Delta Lake table called noaa_delta pointing to a location in Amazon S3 where we will load the data. Run the following statement and replace the location s3://<your-S3-bucket>/<prefix>/ with your S3 bucket and prefix:

%%sql
CREATE TABLE deltalakedb.noaa_delta(
station string,
date string,
latitude string,
longitude string,
elevation string,
name string,
temp string,
temp_attributes string,
dewp string,
dewp_attributes string,
slp string,
slp_attributes string,
stp string,
stp_attributes string,
visib string,
visib_attributes string,
wdsp string,
wdsp_attributes string,
mxspd string,
gust string,
max string,
max_attributes string,
min string,
min_attributes string,
prcp string,
prcp_attributes string,
sndp string,
frshtt string)
USING delta
PARTITIONED BY (year string)
LOCATION 's3://<your-S3-bucket>/<prefix>/noaadelta/'

Insert data into the table

We use an INSERT INTO statement to populate the table by reading data from the sparkblogdb.noaa_pq table created in the previous post:

%%sql
INSERT INTO deltalakedb.noaa_delta select * from sparkblogdb.noaa_pq

You can also use CREATE TABLE AS SELECT to create a Delta Lake table and insert data from a source table in one query.

Query the Delta Lake table

Now that the data is inserted in the Delta Lake table, we can start analyzing it. Let’s run a Spark SQL query to find the minimum recorded temperature for the 'SEATTLE TACOMA AIRPORT, WA US' location:

%%sql
select name, year, min(MIN) as minimum_temperature
from deltalakedb.noaa_delta
where name = 'SEATTLE TACOMA AIRPORT, WA US'
group by 1,2

Update data in the Delta Lake table

Let’s change the station name 'SEATTLE TACOMA AIRPORT, WA US' to 'Sea-Tac'. We can run an UPDATE statement on Spark for Athena to update the records of the noaa_delta table:

%%sql
UPDATE deltalakedb.noaa_delta
SET name = 'Sea-Tac'
WHERE name = 'SEATTLE TACOMA AIRPORT, WA US'

We can run the previous SELECT query to find the minimum recorded temperature for the 'Sea-Tac' location, and the result should be the same as earlier:

%%sql
select name, year, min(MIN) as minimum_temperature
from deltalakedb.noaa_delta
where name = 'Sea-Tac'
group by 1,2

Compact data files

In Spark for Athena, you can run OPTIMIZE on the Delta Lake table, which will compact the small files into larger files, so the queries are not burdened by the small file overhead. To perform the compaction operation, run the following query:

%%sql
OPTIMIZE deltalakedb.noaa_delta

Refer to Optimizations in the Delta Lake documentation for different options available while running OPTIMIZE.

Remove files no longer referenced by a Delta Lake table

You can remove files stored in Amazon S3 that are no longer referenced by a Delta Lake table and are older than the retention threshold by running the VACUUM command on the table using Spark for Athena:

%%sql
VACUUM deltalakedb.noaa_delta

Refer to Remove files no longer referenced by a Delta table in the Delta Lake documentation for options available with VACUUM.
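The two conditions VACUUM applies (no longer referenced, and older than the retention threshold) can be modeled in a few lines. This is a simplified sketch, not Delta Lake's implementation: the function name vacuum_candidates and the file names are hypothetical, and the age of a file is approximated here by a last-modified time. The 168-hour default mirrors Delta Lake's documented default retention of 7 days.

```python
from datetime import datetime, timedelta


def vacuum_candidates(files, referenced, now, retention_hours=168):
    """Return paths that are unreferenced AND older than the retention threshold.

    files: mapping of path -> last-modified datetime
    referenced: set of paths the current table version still uses
    """
    threshold = now - timedelta(hours=retention_hours)
    return sorted(
        path
        for path, modified in files.items()
        if path not in referenced and modified < threshold
    )


now = datetime(2023, 12, 10)
files = {
    "part-0001.parquet": datetime(2023, 11, 1),  # old, still referenced
    "part-0002.parquet": datetime(2023, 11, 1),  # old, unreferenced
    "part-0003.parquet": datetime(2023, 12, 9),  # unreferenced, but too recent
}
to_delete = vacuum_candidates(files, {"part-0001.parquet"}, now)
```

Only the file that fails both checks is selected, which is why data rewritten by a recent OPTIMIZE or UPDATE stays on Amazon S3 until the retention period has passed.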

Drop the table and database

Run the following Spark SQL to remove the Delta Lake table you created:

%%sql
DROP TABLE deltalakedb.noaa_delta

Run the following Spark SQL to remove the database deltalakedb:

%%sql
DROP DATABASE deltalakedb

Running DROP TABLE DDL on the Delta Lake table and database deletes the metadata for these objects, but doesn’t automatically delete the data files in Amazon S3. You can run the following Python code in the notebook’s cell to delete the data from the S3 location:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('<your-S3-bucket>')
bucket.objects.filter(Prefix="<prefix>/noaadelta/").delete()

To learn more about the SQL statements that you can run on a Delta Lake table using Spark for Athena, refer to the quickstart in the Delta Lake documentation.

Conclusion

This post demonstrated how to use Spark SQL in Athena notebooks to create databases and tables, insert and query data, and perform common operations like updates, compactions, and time travel on Hudi, Delta Lake, and Iceberg tables. Open table formats add ACID transactions, upserts, and deletes to data lakes, overcoming limitations of raw object storage. By removing the need to install separate connectors, Spark on Athena’s built-in integration reduces configuration steps and management overhead when using these popular frameworks for building reliable data lakes on Amazon S3. To learn more about selecting an open table format for your data lake workloads, refer to Choosing an open table format for your transactional data lake on AWS.


About the Authors

Pathik Shah is a Sr. Analytics Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.

Raj Devnath is a Product Manager at AWS on Amazon Athena. He is passionate about building products customers love and helping customers extract value from their data. His background is in delivering solutions for multiple end markets, such as finance, retail, smart buildings, home automation, and data communication systems.

Security updates for Wednesday

Post Syndicated from corbet original https://lwn.net/Articles/959325/

Security updates have been issued by Debian (jinja2, openjdk-11, ruby-httparty, and xorg-server), Fedora (ansible-core and mingw-jasper), Gentoo (GOCR, Ruby, and sudo), Oracle (gstreamer-plugins-bad-free, java-17-openjdk, java-21-openjdk, python-cryptography, and xorg-x11-server), Red Hat (kernel, kernel-rt, kpatch-patch, LibRaw, python-pillow, and python-pip), Slackware (mozilla), SUSE (python-Pillow, rear118a, and redis7), and Ubuntu (libapache-session-ldap-perl and pycryptodome).

Introducing Foundations – our open source Rust service foundation library

Post Syndicated from Ivan Nikulin http://blog.cloudflare.com/author/ivan-nikulin/ original https://blog.cloudflare.com/introducing-foundations-our-open-source-rust-service-foundation-library


In this blog post, we’re excited to present Foundations, our foundational library for Rust services, now released as open source on GitHub. Foundations is designed to help scale programs for distributed, production-grade systems. It enables engineers to concentrate on the core business logic of their services, rather than the intricacies of production operation setups.

Originally developed as part of our Oxy proxy framework, Foundations has evolved to serve a wider range of applications. For those interested in exploring its technical capabilities, we recommend consulting the library’s API documentation. Additionally, this post will cover the motivations behind Foundations’ creation and provide a concise summary of its key features. Stay with us to learn more about how Foundations can support your Rust projects.

What is Foundations?

In software development, seemingly minor tasks can become complex when scaled up. This complexity is particularly evident when comparing the deployment of services on server hardware globally to running a program on a personal laptop.

The key question is: what fundamentally changes when transitioning from a simple laptop-based prototype to a full-fledged service in a production environment? Through our experience in developing numerous services, we’ve identified several critical differences:

  • Observability: locally, developers have access to various tools for monitoring and debugging. However, these tools are not as accessible or practical when dealing with thousands of software instances running on remote servers.
  • Configuration: local prototypes often use basic, sometimes hardcoded, configurations. This approach is impractical in production, where changes require a more flexible and dynamic configuration system. Hardcoded settings are cumbersome, and command-line options, while common, don’t always suit complex hierarchical configurations or align with the “Configuration as Code” paradigm.
  • Security: services in production face a myriad of security challenges, exposed to diverse threats from external sources. Basic security hardening becomes a necessity.

Addressing these distinctions, Foundations emerges as a comprehensive library, offering solutions to these challenges. Derived from our Oxy proxy framework, Foundations brings the tried-and-tested functionality of Oxy to a broader range of Rust-based applications at Cloudflare.

Foundations was developed with these guiding principles:

  • High modularity: recognizing that many services predate Foundations, we designed it to be modular. Teams can adopt individual components at their own pace, facilitating a smooth transition.
  • API ergonomics: a top priority for us is user-friendly library interaction. Foundations leverages Rust’s procedural macros to offer an intuitive, well-documented API, aiming for minimal friction in usage.
  • Simplified setup and configuration: our goal is for engineers to spend minimal time on setup. Foundations is designed to be ‘plug and play’, with essential functions working immediately and adjustable settings for fine-tuning. We understand that this focus on ease of setup over extreme flexibility might be debatable, as it implies a trade-off. Unlike other libraries that cater to a wide range of environments with potentially verbose setup requirements, Foundations is tailored for specific, production-tested environments and workflows. This doesn’t restrict Foundations’ adaptability to other settings, but we approach this with compile-time features to manage setup workflows, rather than a complex setup API.

Next, let’s delve into the components Foundations offers. To better illustrate the functionality that Foundations provides, we will refer to the example web server from Foundations’ source code repository.

Telemetry

In any production system, observability, which we refer to as telemetry, plays an essential role. Generally, three primary types of telemetry are adequate for most service needs:

  • Logging: this involves recording arbitrary textual information, which can be enhanced with tags or structured fields. It’s particularly useful for documenting operational errors that aren’t critical to the service.
  • Tracing: this method offers a detailed timing breakdown of various service components. It’s invaluable for identifying performance bottlenecks and investigating issues related to timing.
  • Metrics: these are quantitative data points about the service, crucial for monitoring the overall health and performance of the system.

Foundations integrates an API that encompasses all these telemetry aspects, consolidating them into a unified package for ease of use.

Tracing

Foundations’ tracing API shares similarities with tokio/tracing, employing a comparable approach with implicit context propagation, instrumentation macros, and futures wrapping:

#[tracing::span_fn("respond to request")]
async fn respond(
    endpoint_name: Arc<String>,
    req: Request<Body>,
    routes: Arc<Map<String, ResponseSettings>>,
) -> Result<Response<Body>, Infallible> {
    …
}

Refer to the example web server and documentation for more comprehensive examples.

However, Foundations distinguishes itself in a few key ways:

  • Simplified API: we’ve streamlined the setup process for tracing, aiming for a more minimalistic approach compared to tokio/tracing.
  • Enhanced trace sampling flexibility: Foundations allows for selective override of the sampling ratio in specific code branches. This feature is particularly useful for detailed performance bug investigations, enabling a balance between global trace sampling for overall performance monitoring and targeted sampling for specific accounts, connections, or requests.
  • Distributed trace stitching: our API supports the integration of trace data from multiple services, contributing to a comprehensive view of the entire pipeline. This functionality includes fine-tuned control over sampling ratios, allowing upstream services to dictate the sampling of specific traffic flows in downstream services.
  • Trace forking capability: addressing the challenge of long-lasting connections with numerous multiplexed requests, Foundations introduces trace forking. This feature enables each request within a connection to have its own trace, linked to the parent connection trace. This method significantly simplifies the analysis and improves performance, particularly for connections handling thousands of requests.
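
One way to picture how a global sampling ratio and a per-branch override interact is the sketch below. It is not Foundations’ API: `Sampler`, `should_sample`, and the `override_ratio` parameter are hypothetical names, and hashing the trace ID is an assumed technique, chosen here because it makes the decision deterministic for a given ID. The post does not say how Foundations implements sampling.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical sampler used only for illustration; not part of Foundations.
#[derive(Clone, Copy, Debug)]
pub struct Sampler {
    /// Global sampling ratio in `0.0..=1.0`.
    global_ratio: f64,
}

impl Sampler {
    pub fn new(global_ratio: f64) -> Self {
        Self {
            global_ratio: global_ratio.clamp(0.0, 1.0),
        }
    }

    /// Decides whether the trace identified by `trace_id` is sampled.
    ///
    /// `override_ratio` stands in for a per-branch override: when it is
    /// `Some`, it replaces the global ratio for this one decision.
    pub fn should_sample(&self, trace_id: u64, override_ratio: Option<f64>) -> bool {
        let ratio = override_ratio.unwrap_or(self.global_ratio).clamp(0.0, 1.0);

        // Hash the ID and map the top 53 bits to a value in [0, 1).
        let mut hasher = DefaultHasher::new();
        trace_id.hash(&mut hasher);
        let unit = (hasher.finish() >> 11) as f64 / (1u64 << 53) as f64;

        unit < ratio
    }
}
```

With such a sampler, `Sampler::new(0.01).should_sample(id, Some(1.0))` would always sample the code branch under investigation, while decisions made without an override stay at the global 1% ratio.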

We regard telemetry as a vital component of our software, not merely an optional add-on. As such, we believe in rigorous testing of this feature, considering it our primary tool for monitoring software operations. Consequently, Foundations includes an API and user-friendly macros to facilitate the collection and analysis of tracing data within tests, presenting it in a format conducive to assertions.

Logging

Foundations’ logging API shares its foundation with tokio/tracing and slog, but introduces several notable enhancements.

During our work on various services, we recognized the hierarchical nature of logging contextual information. For instance, in a scenario involving a connection, we might want to tag each log record with the connection ID and HTTP protocol version. Additionally, for requests served over this connection, it would be useful to attach the request URL to each log record, while still including connection-specific information.

Typically, achieving this would involve creating a new logger for each request, copying tags from the connection’s logger, and then manually passing this new logger throughout the relevant code. This method, however, is cumbersome, requiring explicit handling and storage of the logger object.
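
The manual approach described above might look like the following sketch. `Logger`, `with_tag`, `handle_request`, and `serve` are hypothetical names built on the standard library only; the point is that the per-request logger has to be created by copying the connection’s tags and then handed explicitly to every function that logs.

```rust
/// Hypothetical minimal logger, used only to illustrate the manual approach.
#[derive(Clone, Debug, Default)]
pub struct Logger {
    tags: Vec<(String, String)>,
}

impl Logger {
    /// Returns a child logger that copies all existing tags and adds one more.
    pub fn with_tag(&self, key: &str, value: &str) -> Logger {
        let mut tags = self.tags.clone();
        tags.push((key.to_owned(), value.to_owned()));
        Logger { tags }
    }

    /// Formats a record as `key=value ... message`.
    pub fn format(&self, message: &str) -> String {
        let mut out = String::new();
        for (key, value) in &self.tags {
            out.push_str(key);
            out.push('=');
            out.push_str(value);
            out.push(' ');
        }
        out.push_str(message);
        out
    }
}

/// Creates the per-request logger and passes it down by hand.
pub fn handle_request(conn_log: &Logger, url: &str) -> String {
    let req_log = conn_log.with_tag("url", url);
    serve(&req_log)
}

/// Every function that logs needs the logger as an explicit argument.
fn serve(log: &Logger) -> String {
    log.format("request served")
}
```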

To streamline this process and prevent telemetry from obstructing business logic, we adopted a technique similar to tokio/tracing’s approach for tracing, applying it to logging. This method relies on future instrumentation machinery (tracing-rs documentation has a good explanation of the concept), allowing for implicit passing of the current logger. This enables us to “fork” logs for each request and use this forked log seamlessly within the current code scope, automatically propagating it down the call stack, including through asynchronous function calls:

let conn_tele_ctx = TelemetryContext::current();

let on_request = service_fn({
    let endpoint_name = Arc::clone(&endpoint_name);

    move |req| {
        let routes = Arc::clone(&routes);
        let endpoint_name = Arc::clone(&endpoint_name);

        // Each request gets an independent log inherited from the connection log and a
        // separate trace linked to the connection trace.
        conn_tele_ctx
            .with_forked_log()
            .with_forked_trace("request")
            .apply(async move { respond(endpoint_name, req, routes).await })
    }
});

Refer to the example web server and documentation for more comprehensive examples.

In an effort to simplify the user experience, we merged all APIs related to context management into a single TelemetryContext object that is implicitly available in each code scope. This integration not only simplifies the process but also lays the groundwork for future advanced features, which could blend tracing and logging information into a cohesive narrative by cross-referencing one another.
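
To give a feel for what an implicitly available context means, here is a deliberately simplified, synchronous sketch that keeps the current log tags in a thread-local. This is an assumption made for illustration and not how Foundations works: as described above, Foundations relies on future instrumentation so that the context also follows asynchronous calls. `CURRENT_TAGS`, `with_tag`, and `log_line` are hypothetical names.

```rust
use std::cell::RefCell;

thread_local! {
    /// Tags of the log context that is current on this thread (hypothetical).
    static CURRENT_TAGS: RefCell<Vec<(String, String)>> = RefCell::new(Vec::new());
}

/// Removes the most recently added tag when the scope ends, even on panic.
struct PopOnDrop;

impl Drop for PopOnDrop {
    fn drop(&mut self) {
        CURRENT_TAGS.with(|tags| {
            tags.borrow_mut().pop();
        });
    }
}

/// Runs `f` with one extra tag in scope, then restores the previous tags.
pub fn with_tag<R>(key: &str, value: &str, f: impl FnOnce() -> R) -> R {
    CURRENT_TAGS.with(|tags| {
        tags.borrow_mut().push((key.to_owned(), value.to_owned()));
    });
    let _guard = PopOnDrop;
    f()
}

/// Formats a record with whatever tags are current; no logger argument needed.
pub fn log_line(message: &str) -> String {
    CURRENT_TAGS.with(|tags| {
        let mut out = String::new();
        for (key, value) in tags.borrow().iter() {
            out.push_str(&format!("{}={} ", key, value));
        }
        out.push_str(message);
        out
    })
}
```

Code nested inside `with_tag` can call `log_line` directly, and the tags added by outer scopes are picked up without being passed as arguments.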

As with tracing, Foundations also offers a user-friendly API for testing a service’s logging.

Metrics

Foundations incorporates the official Prometheus Rust client library for its metrics functionality, with a few enhancements for ease of use. One key addition is a procedural macro provided by Foundations, which simplifies the definition of new metrics with typed labels, reducing boilerplate code:

use foundations::telemetry::metrics::{metrics, Counter, Gauge};
use std::sync::Arc;

#[metrics]
pub(crate) mod http_server {
    /// Number of active client connections.
    pub fn active_connections(endpoint_name: &Arc<String>) -> Gauge;

    /// Number of failed client connections.
    pub fn failed_connections_total(endpoint_name: &Arc<String>) -> Counter;

    /// Number of HTTP requests.
    pub fn requests_total(endpoint_name: &Arc<String>) -> Counter;

    /// Number of failed requests.
    pub fn requests_failed_total(endpoint_name: &Arc<String>, status_code: u16) -> Counter;
}

Refer to the example web server and documentation for more information on how metrics can be defined and used.
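
As a rough mental model of what typed labels buy you, a labeled metric can be thought of as a family of counters keyed by a label struct, so a mistyped or missing label becomes a compile-time error instead of a malformed time series. The sketch below is neither the expansion of the `#[metrics]` macro nor the Prometheus client API; `RequestsFailedLabels` and `CounterFamily` are hypothetical names that only mirror the arguments of `requests_failed_total` above.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Typed label set for one metric (hypothetical).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct RequestsFailedLabels {
    pub endpoint_name: String,
    pub status_code: u16,
}

/// A family of counters, one per distinct label set.
#[derive(Default)]
pub struct CounterFamily {
    counters: Mutex<HashMap<RequestsFailedLabels, u64>>,
}

impl CounterFamily {
    /// Increments the counter for `labels`, creating it at zero on first use.
    pub fn inc(&self, labels: RequestsFailedLabels) {
        let mut counters = self.counters.lock().unwrap();
        *counters.entry(labels).or_insert(0) += 1;
    }

    /// Returns the current value for `labels`, or zero if it was never used.
    pub fn get(&self, labels: &RequestsFailedLabels) -> u64 {
        let counters = self.counters.lock().unwrap();
        counters.get(labels).copied().unwrap_or(0)
    }
}
```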

In addition to this, we have refined the approach to metrics collection and structuring. Foundations offers a streamlined, user-friendly API for both these tasks, focusing on simplicity and minimalism.

Memory profiling

Recognizing the efficiency of jemalloc for long-lived services, Foundations includes a feature for enabling jemalloc memory allocation. A notable aspect of jemalloc is its memory profiling capability. Foundations packages this functionality into a straightforward and safe Rust API, making it accessible and easy to integrate.

Telemetry server

Foundations comes equipped with a built-in, customizable telemetry server endpoint. This server automatically handles a range of functions including health checks, metric collection, and memory profiling requests.

Security

A vital component of Foundations is its robust and ergonomic API for seccomp, a Linux kernel feature for syscall sandboxing. Seccomp makes it possible to set up hooks for the syscalls an application uses and to trigger actions such as blocking or logging. It acts as a formidable line of defense, offering an additional layer of security against threats like arbitrary code execution.

Foundations provides a simple way to define lists of allowed syscalls and to compose multiple lists into one. It also ships predefined lists for common use cases:

use foundations::security::common_syscall_allow_lists::{ASYNC, NET_SOCKET_API, SERVICE_BASICS};
use foundations::security::{allow_list, enable_syscall_sandboxing, ViolationAction};

allow_list! {
    static ALLOWED = [
        ..SERVICE_BASICS,
        ..ASYNC,
        ..NET_SOCKET_API
    ]
}

enable_syscall_sandboxing(ViolationAction::KillProcess, &ALLOWED)

Refer to the web server example and documentation for more comprehensive examples of this functionality.
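
The idea of composing allow lists can be illustrated without touching the kernel. In the standard-library sketch below, each list is a slice of syscall names and composition is a deduplicating union. The list names `BASICS_EXAMPLE` and `NET_EXAMPLE`, their contents, and the `compose` function are all made up for illustration; they are not the predefined lists shipped with Foundations, and nothing here installs a seccomp filter.

```rust
use std::collections::BTreeSet;

// Illustrative lists only; the contents are assumptions, not Foundations' lists.
pub const BASICS_EXAMPLE: &[&str] = &["read", "write", "close", "exit_group"];
pub const NET_EXAMPLE: &[&str] = &["socket", "bind", "listen", "accept4", "close"];

/// Merges several allow lists into one deduplicated set of syscall names.
pub fn compose(lists: &[&'static [&'static str]]) -> BTreeSet<&'static str> {
    lists
        .iter()
        .flat_map(|list| list.iter().copied())
        .collect()
}

/// Returns `true` if `syscall` appears in the composed allow list.
pub fn is_allowed(allowed: &BTreeSet<&'static str>, syscall: &str) -> bool {
    allowed.contains(syscall)
}
```

Anything absent from the composed set, such as `ptrace` in this example, is what a sandbox configured with `ViolationAction::KillProcess` would act upon.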

Settings and CLI

Foundations simplifies the management of service settings and command-line argument parsing. Services built on Foundations typically use YAML files for configuration. We advocate for a design where every service comes with a default configuration that’s functional right off the bat. This philosophy is embedded in Foundations’ settings functionality.

In practice, applications define their settings and defaults using Rust structures and enums. Foundations then transforms Rust documentation comments into configuration annotations. This integration allows the CLI to generate a default, fully annotated YAML configuration file. As a result, service users can quickly and easily understand the service settings:

use foundations::settings::collections::Map;
use foundations::settings::net::SocketAddr;
use foundations::settings::settings;
use foundations::telemetry::settings::TelemetrySettings;

#[settings]
pub(crate) struct HttpServerSettings {
    /// Telemetry settings.
    pub(crate) telemetry: TelemetrySettings,
    /// HTTP endpoints configuration.
    #[serde(default = "HttpServerSettings::default_endpoints")]
    pub(crate) endpoints: Map<String, EndpointSettings>,
}

impl HttpServerSettings {
    fn default_endpoints() -> Map<String, EndpointSettings> {
        let mut endpoint = EndpointSettings::default();

        endpoint.routes.insert(
            "/hello".into(),
            ResponseSettings {
                status_code: 200,
                response: "World".into(),
            },
        );

        endpoint.routes.insert(
            "/foo".into(),
            ResponseSettings {
                status_code: 403,
                response: "bar".into(),
            },
        );

        [("Example endpoint".into(), endpoint)]
            .into_iter()
            .collect()
    }
}

#[settings]
pub(crate) struct EndpointSettings {
    /// Address of the endpoint.
    pub(crate) addr: SocketAddr,
    /// Endpoint's URL path routes.
    pub(crate) routes: Map<String, ResponseSettings>,
}

#[settings]
pub(crate) struct ResponseSettings {
    /// Status code of the route's response.
    pub(crate) status_code: u16,
    /// Content of the route's response.
    pub(crate) response: String,
}

The settings definition above automatically generates the following default configuration YAML file:

---
# Telemetry settings.
telemetry:
  # Distributed tracing settings
  tracing:
    # Enables tracing.
    enabled: true
    # The address of the Jaeger Thrift (UDP) agent.
    jaeger_tracing_server_addr: "127.0.0.1:6831"
    # Overrides the bind address for the reporter API.
    # By default, the reporter API is only exposed on the loopback
    # interface. This won't work in environments where the
    # Jaeger agent is on another host (for example, Docker).
    # Must have the same address family as `jaeger_tracing_server_addr`.
    jaeger_reporter_bind_addr: ~
    # Sampling ratio.
    #
    # This can be any fractional value between `0.0` and `1.0`.
    # Where `1.0` means "sample everything", and `0.0` means "don't sample anything".
    sampling_ratio: 1.0
  # Logging settings.
  logging:
    # Specifies log output.
    output: terminal
    # The format to use for log messages.
    format: text
    # Set the logging verbosity level.
    verbosity: INFO
    # A list of field keys to redact when emitting logs.
    #
    # This might be useful to hide certain fields in production logs as they may
    # contain sensitive information, but allow them in testing environment.
    redact_keys: []
  # Metrics settings.
  metrics:
    # How the metrics service identifier defined in `ServiceInfo` is used
    # for this service.
    service_name_format: metric_prefix
    # Whether to report optional metrics in the telemetry server.
    report_optional: false
  # Server settings.
  server:
    # Enables telemetry server
    enabled: true
    # Telemetry server address.
    addr: "127.0.0.1:0"
# HTTP endpoints configuration.
endpoints:
  Example endpoint:
    # Address of the endpoint.
    addr: "127.0.0.1:0"
    # Endpoint's URL path routes.
    routes:
      /hello:
        # Status code of the route's response.
        status_code: 200
        # Content of the route's response.
        response: World
      /foo:
        # Status code of the route's response.
        status_code: 403
        # Content of the route's response.
        response: bar

Refer to the example web server and the documentation for the settings and CLI APIs for more comprehensive examples of how settings can be defined and used with the CLI API that Foundations provides.
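
To make the link between documentation comments and the annotated YAML more concrete, here is a hand-written sketch of the kind of output such a transformation produces. It does not use or reproduce the `#[settings]` macro: the `FIELD_DOCS` table and the `to_annotated_yaml` method are hypothetical and written by hand, whereas in Foundations the equivalent information comes from the `///` comments. The sketch also skips YAML escaping, so it only suits simple values.

```rust
/// Stand-in for a settings struct; mirrors `ResponseSettings` above.
pub struct ResponseSettings {
    pub status_code: u16,
    pub response: String,
}

impl ResponseSettings {
    /// Field names paired with the text of their documentation comments.
    /// Maintained by hand here; a macro could collect this automatically.
    const FIELD_DOCS: [(&'static str, &'static str); 2] = [
        ("status_code", "Status code of the route's response."),
        ("response", "Content of the route's response."),
    ];

    /// Renders the settings as YAML, with each field preceded by its doc text.
    pub fn to_annotated_yaml(&self) -> String {
        let values = [self.status_code.to_string(), self.response.clone()];
        let mut out = String::new();
        for ((name, doc), value) in Self::FIELD_DOCS.iter().zip(values.iter()) {
            out.push_str(&format!("# {}\n{}: {}\n", doc, name, value));
        }
        out
    }
}
```

Rendering the `/foo` route from the defaults above this way yields the same four lines that appear under `/foo:` in the generated file, minus the indentation.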

Wrapping Up

At Cloudflare, we greatly value the contributions of the open source community and are eager to reciprocate by sharing our work. Foundations has been instrumental in reducing our development friction, and we hope it can do the same for others. We welcome external contributions to Foundations, aiming to integrate diverse experiences into the project for the benefit of all.

If you’re interested in working on projects like Foundations, consider joining our team — we’re hiring!