Amazon Aurora PostgreSQL Limitless Database is now generally available

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/amazon-aurora-postgresql-limitless-database-is-now-generally-available/

Today, we are announcing the general availability of Amazon Aurora PostgreSQL Limitless Database, a new serverless horizontal scaling (sharding) capability of Amazon Aurora. With Aurora PostgreSQL Limitless Database, you can scale beyond the existing Aurora limits for write throughput and storage by distributing a database workload over multiple Aurora writer instances while maintaining the ability to use it as a single database.

When we previewed Aurora PostgreSQL Limitless Database at AWS re:Invent 2023, I explained that it uses a two-layer architecture consisting of multiple database nodes in a DB shard group – either routers or shards to scale based on the workload.

  • Routers – Nodes that accept SQL connections from clients, send SQL commands to shards, maintain system-wide consistency, and return results to clients.
  • Shards – Nodes that store a subset of tables and full copies of data, which accept queries from routers.

There will be three types of tables that contain your data: sharded, reference, and standard.

  • Sharded tables – These tables are distributed across multiple shards. Data is split among the shards based on the values of designated columns in the table, called shard keys. They are useful for scaling the largest, most I/O-intensive tables in your application.
  • Reference tables – These tables copy data in full on every shard so that join queries can work faster by eliminating unnecessary data movement. They are commonly used for infrequently modified reference data, such as product catalogs and zip codes.
  • Standard tables – These tables are like regular Aurora PostgreSQL tables. Standard tables are all placed together on a single shard so join queries can work faster by eliminating unnecessary data movement. You can create sharded and reference tables from standard tables.
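The placement rules for the three table types can be pictured with a toy model. The following is a minimal sketch, not Aurora's actual implementation: the shard count, the use of SHA-256 as the hash, and the names `shard_for` and `placement` are all assumptions made for illustration.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for this illustration


def shard_for(shard_key_values, num_shards=NUM_SHARDS):
    """Map a tuple of shard-key values to a shard index with a stable hash."""
    encoded = "|".join(str(v) for v in shard_key_values).encode("utf-8")
    digest = hashlib.sha256(encoded).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def placement(table_type, shard_key_values=None, num_shards=NUM_SHARDS):
    """Return the set of shard indexes that hold a given row."""
    if table_type == "sharded":
        # One shard, chosen from the shard-key values of the row.
        return {shard_for(shard_key_values, num_shards)}
    if table_type == "reference":
        # A full copy of the data lives on every shard.
        return set(range(num_shards))
    if table_type == "standard":
        # All standard tables are placed together on a single shard.
        return {0}
    raise ValueError(f"unknown table type: {table_type}")
```

In this model, a join between a sharded row and a reference table never needs data from another shard, because every shard already holds the full reference table.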

Once you have created the DB shard group and your sharded and reference tables, you can load massive amounts of data into Aurora PostgreSQL Limitless Database and query data in those tables using standard PostgreSQL queries. To learn more, visit Limitless Database architecture in the Amazon Aurora User Guide.

Getting started with Aurora PostgreSQL Limitless Database
You can get started in the AWS Management Console and AWS Command Line Interface (AWS CLI) to create a new DB cluster that uses Aurora PostgreSQL Limitless Database, add a DB shard group to the cluster, and query your data.

1. Create an Aurora PostgreSQL Limitless Database Cluster
Open the Amazon Relational Database Service (Amazon RDS) console and choose Create database. For Engine options, choose Aurora (PostgreSQL Compatible) and Aurora PostgreSQL with Limitless Database (Compatible with PostgreSQL 16.4).

For Aurora Limitless Database, enter a name for your DB shard group and values for minimum and maximum capacity measured by Aurora capacity units (ACUs) across all routers and shards. The initial number of routers and shards in a DB shard group is determined by this maximum capacity. Aurora PostgreSQL Limitless Database scales a node up to a higher capacity when its current capacity is too low to handle the load. It scales the node down to a lower capacity when its current capacity is higher than needed.

For DB shard group deployment, choose whether to create standbys for the DB shard group: no compute redundancy, one compute standby in a different Availability Zone, or two compute standbys in two different Availability Zones.
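The console choices described above can be collected into a small settings record before you create the shard group. This is a hedged sketch only: the function name `shard_group_settings`, the dictionary keys, and the deployment labels are hypothetical and do not correspond to any AWS API.

```python
# Hypothetical labels for the three deployment options in the console.
STANDBYS_BY_DEPLOYMENT = {
    "none": 0,          # no compute redundancy
    "one-standby": 1,   # one compute standby in a different Availability Zone
    "two-standbys": 2,  # two compute standbys in two different Availability Zones
}


def shard_group_settings(name, min_acu, max_acu, deployment="none"):
    """Collect the DB shard group choices into one dictionary (illustrative only)."""
    if deployment not in STANDBYS_BY_DEPLOYMENT:
        raise ValueError(f"unknown deployment option: {deployment}")
    if min_acu > max_acu:
        raise ValueError("minimum capacity cannot exceed maximum capacity")
    return {
        "name": name,
        "min_acu": min_acu,
        "max_acu": max_acu,
        "compute_standbys": STANDBYS_BY_DEPLOYMENT[deployment],
    }
```

Keeping the three deployment options in one lookup table makes it easy to reject a mistyped option before anything is created.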

You can set the remaining DB settings to what you prefer and choose Create database. After the DB shard group is created, it's displayed on the Databases page.

You can connect, reboot, or delete a DB shard group, or you can change the capacity, split a shard, or add a router in the DB shard group. To learn more, visit Working with DB shard groups in the Amazon Aurora User Guide.

2. Create Aurora PostgreSQL Limitless Database tables
As shared previously, Aurora PostgreSQL Limitless Database has three table types: sharded, reference, and standard. You can convert standard tables to sharded or reference tables to distribute or replicate existing standard tables or create new sharded and reference tables.

You can use variables to create sharded and reference tables by setting the table creation mode. The tables that you create will use this mode until you set a different mode. The following examples show how to use these variables to create sharded and reference tables.

For example, create a sharded table named items with a shard key composed of the item_id and item_cat columns.

SET rds_aurora.limitless_create_table_mode='sharded';
SET rds_aurora.limitless_create_table_shard_key='{"item_id", "item_cat"}';
CREATE TABLE items(item_id int, item_cat varchar, val int, item text);

Now, create a sharded table named item_description with a shard key composed of the item_id and item_cat columns and collocate it with the items table.

SET rds_aurora.limitless_create_table_collocate_with='items';
CREATE TABLE item_description(item_id int, item_cat varchar, color_id int, ...);

You can also create a reference table named colors.

SET rds_aurora.limitless_create_table_mode='reference';
CREATE TABLE colors(color_id int primary key, color varchar);

You can find information about Limitless Database tables by using the rds_aurora.limitless_tables view, which contains information about tables and their types.

postgres_limitless=> SELECT * FROM rds_aurora.limitless_tables;

 table_gid | local_oid | schema_name | table_name  | table_status | table_type  | distribution_key
-----------+-----------+-------------+-------------+--------------+-------------+------------------
         1 |     18797 | public      | items       | active       | sharded     | HASH (item_id, item_cat)
         2 |     18641 | public      | colors      | active       | reference   | 

(2 rows)

You can convert standard tables into sharded or reference tables. During the conversion, data is moved from the standard table to the distributed table, then the source standard table is deleted. To learn more, visit Converting standard tables to limitless tables in the Amazon Aurora User Guide.

3. Query Aurora PostgreSQL Limitless Database tables
Aurora PostgreSQL Limitless Database is compatible with PostgreSQL syntax for queries. You can query your Limitless Database using psql or any other connection utility that works with PostgreSQL. Before querying tables, you can load data into Aurora Limitless Database tables by using the COPY command or by using the data loading utility.

To run queries, connect to the cluster endpoint, as shown in Connecting to your Aurora Limitless Database DB cluster. All PostgreSQL SELECT queries are performed on the router to which the client sends the query and shards where the data is located.

To achieve a high degree of parallel processing, Aurora PostgreSQL Limitless Database uses two querying methods: single-shard queries and distributed queries. It determines whether your query is single-shard or distributed and processes the query accordingly.

  • Single-shard query – A query where all the data needed for the query is on one shard. The entire operation can be performed on one shard, including any result set generated. When the query planner on the router encounters a query like this, the planner sends the entire SQL query to the corresponding shard.
  • Distributed query – A query run on a router and more than one shard. The query is received by one of the routers. The router creates and manages the distributed transaction, which is sent to the participating shards. The shards create a local transaction with the context provided by the router, and the query is run.
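The distinction between the two query types can be sketched as a routing decision. This is a simplified illustration under stated assumptions, not the router's real planner: the CRC32-based `shard_of` function, the default of four shards, and the `plan_query` name are all hypothetical.

```python
import zlib


def shard_of(key, num_shards):
    """Hypothetical stable mapping from a shard-key value to a shard index."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_shards


def plan_query(shard_keys, num_shards=4):
    """Classify a query by the shards it must touch.

    shard_keys lists the shard-key values the query is pinned to.
    None or an empty list means the query has no shard-key filter,
    so every shard may hold matching rows.
    """
    if not shard_keys:
        targets = set(range(num_shards))
    else:
        targets = {shard_of(key, num_shards) for key in shard_keys}
    kind = "single-shard" if len(targets) == 1 else "distributed"
    return kind, sorted(targets)
```

A query filtered on one shard-key value, such as `plan_query([25])`, always resolves to a single shard in this model, while a query without a shard-key filter fans out to all of them.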

For examples of single-shard queries, you use the following parameters to configure the output from the EXPLAIN command.

postgres_limitless=> SET rds_aurora.limitless_explain_options = shard_plans, single_shard_optimization;
SET

postgres_limitless=> EXPLAIN SELECT * FROM items WHERE item_id = 25;

                     QUERY PLAN
--------------------------------------------------------------
 Foreign Scan  (cost=100.00..101.00 rows=100 width=0)
   Remote Plans from Shard postgres_s4:
         Index Scan using items_ts00287_id_idx on items_ts00287 items_fs00003  (cost=0.14..8.16 rows=1 width=15)
           Index Cond: (id = 25)
 Single Shard Optimized
(5 rows) 

To learn more about the EXPLAIN command, see EXPLAIN in the PostgreSQL documentation.

For examples of distributed queries, you can insert new items named Book and Pen into the items table.

postgres_limitless=> INSERT INTO items(item_name) VALUES ('Book'), ('Pen');

This makes a distributed transaction on two shards. When the query runs, the router sets a snapshot time and passes the statement to the shards that own Book and Pen. The router coordinates an atomic commit across both shards, and returns the result to the client.
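The idea of an all-or-nothing commit across shards can be illustrated with a generic prepare-then-commit pattern. The post does not describe the protocol Aurora uses, so treat this purely as a conceptual sketch: the `Shard` class, its `healthy` flag, and the `atomic_insert` function are hypothetical.

```python
class Shard:
    """Toy shard that stages rows before making them visible."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.rows = []     # committed rows
        self.pending = []  # rows staged by an open transaction

    def prepare(self, rows):
        """Stage rows and report whether this shard is able to commit."""
        if not self.healthy:
            return False
        self.pending = list(rows)
        return True

    def commit(self):
        self.rows.extend(self.pending)
        self.pending = []

    def rollback(self):
        self.pending = []


def atomic_insert(rows_by_shard):
    """Commit on every participating shard, or on none of them.

    rows_by_shard maps each Shard to the list of rows it should receive.
    """
    prepared = []
    for shard, rows in rows_by_shard.items():
        if not shard.prepare(rows):
            # One participant cannot commit, so undo the others.
            for done in prepared:
                done.rollback()
            return False
        prepared.append(shard)
    for shard in prepared:
        shard.commit()
    return True
```

In this sketch the coordinator plays the role the router has in the text: if any participating shard cannot prepare, no shard ends up with the new rows.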

You can use distributed query tracing, a tool to trace and correlate queries in PostgreSQL logs across Aurora PostgreSQL Limitless Database. To learn more, visit Querying Limitless Database in the Amazon Aurora User Guide.

Some SQL commands aren’t supported. For more information, see Aurora Limitless Database reference in the Amazon Aurora User Guide.

Things to know
Here are a few things that you should know about this feature:

  • Compute – You can only have one DB shard group per DB cluster and set the maximum capacity of a DB shard group to 16–6144 ACUs. Contact us if you need more than 6144 ACUs. The initial number of routers and shards is determined by the maximum capacity that you set when you create a DB shard group. The number of routers and shards doesn’t change when you modify the maximum capacity of a DB shard group. To learn more, see the table of the number of routers and shards in the Amazon Aurora User Guide.
  • Storage – Aurora PostgreSQL Limitless Database only supports the Amazon Aurora I/O-Optimized DB cluster storage configuration. Each shard has a maximum capacity of 128 TiB. Reference tables have a size limit of 32 TiB for the entire DB shard group. To reclaim storage space by cleaning up your data, you can use the vacuuming utility in PostgreSQL.
  • Monitoring – You can use Amazon CloudWatch, Amazon CloudWatch Logs, or Performance Insights to monitor Aurora PostgreSQL Limitless Database. There are also new statistics functions and views and wait events for Aurora PostgreSQL Limitless Database that you can use for monitoring and diagnostics.
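The numeric limits above lend themselves to a quick pre-flight check. The following sketch only encodes the figures stated in this post (16 to 6144 ACUs, 128 TiB per shard, 32 TiB for reference tables); the function and constant names are hypothetical, and the limits may change, so check the user guide for current values.

```python
MIN_MAX_ACU = 16               # smallest allowed maximum capacity, per this post
MAX_MAX_ACU = 6144             # largest allowed maximum capacity, per this post
SHARD_LIMIT_TIB = 128          # maximum storage capacity of one shard
REFERENCE_TABLE_LIMIT_TIB = 32 # reference table limit for the whole DB shard group


def validate_max_acu(max_acu):
    """Return max_acu unchanged if it falls inside the documented range."""
    if not MIN_MAX_ACU <= max_acu <= MAX_MAX_ACU:
        raise ValueError(
            f"maximum capacity must be between {MIN_MAX_ACU} and {MAX_MAX_ACU} ACUs"
        )
    return max_acu


def fits_storage_limits(largest_shard_tib, reference_tables_tib):
    """Check planned data sizes against the storage limits listed above."""
    return (
        largest_shard_tib <= SHARD_LIMIT_TIB
        and reference_tables_tib <= REFERENCE_TABLE_LIMIT_TIB
    )
```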

Now available
Amazon Aurora PostgreSQL Limitless Database is available today with PostgreSQL 16.4 compatibility in the AWS US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Hong Kong), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) Regions.

Give Aurora PostgreSQL Limitless Database a try in the Amazon Aurora console. For more information, visit the Amazon Aurora User Guide and send feedback to AWS re:Post for Amazon Aurora or through your usual AWS support contacts.

Channy

[$] The Overture open-mapping project

Post Syndicated from corbet original https://lwn.net/Articles/995992/

OpenStreetMap tends to dominate the space for open mapping data, but it is not the only project working in this area. At the 2024 Open Source Summit Japan, Marc Prioleau presented the Overture Maps Foundation, which is building and distributing a set of worldwide maps under open licenses. Overture may have a similar goal to OpenStreetMap, but its approach and intended uses are significantly different.

Inclusive education: mission (almost) possible

Post Syndicated from Надежда Цекулова original https://www.toest.bg/priobshtavashtoto-obrazovanie-misiyata-pochti-vuzmozhna/


Sara Kraus is Director of School Partnerships at the Sherman Foundation at the University of Maryland in Baltimore. She began her career as a teacher in Baltimore's public school system, where she taught first and second grade for ten years. She then spent five years as an assistant principal, and for the past three years she has worked at the university, developing programs for teacher training and for improving inclusive policies in public schools.

Sara Kraus and her colleagues recently visited Bulgaria at the invitation of the Trust for Social Achievement Foundation to take part in the international conference on early childhood development "The Well-Being of All Children Requires an Entire Early Childhood Development Ecosystem".

Nadezhda Tsekulova talked with her about the similarities and differences between inclusive education in Bulgaria and in the United States, and about the shared challenges we all face as human beings.

How do you include students with different educational needs or disabilities in the system you have experience with?

Let me start by saying that education policies vary from state to state, and what I am describing applies to the region where I work. In Baltimore we have a philosophy of inclusive education in which students with different needs or disabilities are part of the traditional classroom. Instead of having separate special education classes, we apply a more inclusive model: specialized teachers often come to work with students with disabilities within the general class. Sometimes students with different needs are pulled out for a short period during the day to receive the services they need. The idea is that they still spend most of their time in the general education environment.

How long did the transition from separate special education classes to inclusive classrooms take?

It is a constantly evolving system. In that sense, we cannot say that a change took place and is now finished. Delivering inclusive education requires specialized training for the teachers who work with students with disabilities, and general education teachers sometimes do not have the same level of preparation. Finding the right balance in providing the necessary support is an ongoing task.

Can you describe what an inclusive classroom looks like? For example, how many students and teachers are in it, and is there a dedicated teacher for the children with special needs?

The setup can differ from class to class, and even from lesson to lesson within the same class, but ideally an inclusive classroom should look such that you cannot tell the children who receive special services from the rest. Sometimes a co-teaching model is used, especially if a significant number of students require special support. In that case a special education teacher may join the general education teacher.

What do you mean by a "significant number"?

Classes can have from 20 to more than 30 students, and around 25% may need additional support. As I mentioned, students with special needs sometimes also have special activities, rehabilitation, or another kind of service depending on their needs, and then they return to class.

Does this mean that these additional services are provided in the school building and are organized institutionally, so to speak? The parent does not have to come, pick the child up from school, take them to the special service somewhere else, and then possibly bring them back?

No, they do not. The children are in the same school, in the same building; they see each other constantly and have lunch together. Sometimes even the arts classes are adapted so that they are accessible to everyone and the children can take part in them together.

How do you handle students with sensory impairments, for example impaired vision or hearing?

Some schools are specifically designed for children with sensory impairments. In public schools, these students can have an assistant who helps with communication in the inclusive classroom. I personally have not worked with such students. In theory, it is possible in general classes when the appropriate support is available.

In Bulgaria the law requires inclusive education, but in practice many teachers and school leaders do not feel prepared, which affects both the children with disabilities and their classmates. Do you have similar challenges?

Yes, we have similar gaps and encounter similar problems. Schools are legally obliged to admit all students, but if they lack the resources, challenges can arise. Schools cannot refuse to admit students on the basis of disability, but finding adequate support can be difficult, and the children usually suffer as a result. I have to admit that, for this reason, children with sensory impairments, for example, rarely end up in the general class. In public education, however, the special school for such children is located on the grounds of the general education school, so they are still not isolated. There are, of course, also private schools that offer an entirely different arrangement, and families usually choose according to their preferences and means.

What training and support do teachers need to work effectively with diverse classes?

Teachers need training in how to adapt the teaching material to students' different needs. On the one hand, the teacher has a curriculum to cover. On the other, they have to manage to adapt it to the varied needs the classroom confronts them with when they walk in. For example, some students may need spoken instructions or accessible materials they can work with.

Balancing these needs is a challenge for teachers, especially in large classes, and they certainly need both special training and continuing support in the course of their work. In fact, I think this is the key issue, because I cannot imagine that there is a teacher who has ever walked into a classroom and said to themselves, "I don't want to deal with this child." But the teacher has to feel prepared to find an individual approach to a particular child with dyslexia, for example, and then to do the same another 25 times. And last but not least, when teachers run into a problem they do not know how to handle, they need to know where they can get support.

How does your university initially prepare teachers to meet these diverse needs?

In the United States, teachers typically complete a four-year program and gain classroom experience. In Maryland they spend one semester in a class as observers, during which they can assist the lead teacher. After that they spend another full semester teaching full-time under the guidance of a mentor. They also have to pass exams in specific subjects to obtain certification. A few years ago, however, in response to the drastic teacher shortage in the country, an accelerated training program was developed. It allowed people from various professional backgrounds to enter the classroom after a short training and to obtain certification while actively teaching. This was critically needed, but it created a number of difficulties. I entered the education system this way myself. My education was in social services. I trained as a teacher for barely a month before I was given a class of my own.

That sounds scary.

It was scary for me. I was learning on the go while attending theory classes in the evenings. I don't regret it, but it was hard, very hard.

Is there specific training in inclusive education in teacher preparation programs? I mean the four-year programs.

Yes, there are courses in inclusive education that focus on understanding diversity and on self-awareness. Teachers examine their own biases and points of view in order to better understand the needs of all students.

That sounds like psychoanalysis for teachers.

(Laughs.) Yes, to some extent it is. It is extremely necessary and useful, so that when you walk into the classroom and see children with different skin color, different language and culture, some with attention deficit, autism, or dyslexia... When you see all that diversity, you know your own biases and possible limitations, so that you can be as useful as possible to all children in their difference and diversity. There are also courses for teachers outside the university that address this.

Another expert I spoke with recently said that no inclusive policy matters unless we teach people to accept differences. Do you encounter this challenge, and how do you help teachers and students accept differences in class?

It is a matter of awareness and of encouraging "courageous conversations." It is also a process we cannot say we have completed, but we are committed to creating an inclusive culture in class. With us, for example, there is a lot of talk about different skin color and about different language and culture, since a large share of our students are Hispanic and Black. But conversations about accepting people with a homosexual orientation, for example, are hard to have.

In Bulgaria it sometimes happens that teachers are willing to look for solutions for a child, but the parents of the other children are not. We have seen petitions to remove children from a class because of disciplinary disruptions or cultural differences. How can schools deal with such situations?

We face similar situations too. To avoid coming under pressure, schools often focus on eliminating the unacceptable behavior instead of looking for its cause. Our philosophy is that we cannot exclude a child because of a single incident or even a series of incidents. It is important always to look for the root cause of the problem rather than to reject the child, because the child has the same right to education as everyone else and often needs our support and protection.


This interview is part of a series of conversations about access to education for children from vulnerable groups. The project is made possible by Lidl Bulgaria's largest socially responsible initiative, "You and Lidl for Our Tomorrow", in partnership with the Workshop for Civic Initiatives Foundation, the Bulgarian Donors Forum, and the Association of European Journalists.


Quoth the Drive Stats, Nevermore: An Elegy for Our Seagate 4TB Drives

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/quoth-the-drive-stats-nevermore-an-elegy-for-our-seagate-4tb-drives/


Once upon a midnight dreary, as I typed another query

Seeking many a quaint and curious fact of hidden Drive Stats lore—

While I waited, time advancing, suddenly the stats came dancing

Lines of empty datasets; the database had nothing more

“Is that right?” I muttered, “The database had nothing more—

So those drives, I must explore.”

Ah, distinctly I remember, it was just past this September

I requested failure rates of Seagate drives with terabytes of four

Eagerly I typed the query, even though my eyes were bleary

The count of Seagate fours was eerie, eerie; there was nothing more.

The sad and certain count screamed like it never had before;

No Seagate drives with terabytes of four.

There are missing rows, I’m certain, and files waiting to explore.

The reality I kept dismissing, the Seagate data must be missing

With hours gone to data fishing, the facts shook me to the core;

The spinning life is over for our Seagate drives with terabytes of four—

Those Seagate drives are nevermore.

(My apologies to Edgar Allan Poe.)

Shortly, we will publish the Q3 2024 Backblaze Drive Stats report, and an old faithful will be missing from the tables, the 4TB Seagate drive model ST4000DM000. This drive model has graced our Drive Stats charts and tables since the very first Drive Stats report, and it would be a ghastly mistake if we let the drive slip into the afterlife unnoticed. So on this All Hallows’ Eve, it’s only fitting we say nevermore to these Seagate drives.

The first 45 of these Seagate 4TB drives were installed in a 45-drive Backblaze Storage Pod in May 2013. That was before 60-drive Storage Pods, Backblaze Vaults, and even Backblaze B2. Over the next two years, thousands of new Seagate 4TB drives were added each quarter, and by Q3 2016, there were 34,744 spinning souls in service. That represented more than 50% of all the drives in service at the time—a howling success that has not been duplicated by any other drive model.

Alas, that didn’t last as the first wave of 8TB drives arrived in mid-2016 and with that, no additional 4TB Seagate drives were procured. Over time, as 4TB Seagate drives met their maker, the count decreased, and when Storage Pods containing these drives started being phased out in 2018, the count dropped faster. The final nail in the coffin came when, in 2023, our CVT drive migration system became fixated on the replacement of all the remaining 4TB Seagate drives, and here we are.

As for those intrepid 45 original drives installed in May 2013, they were not around at the end. They were unceremoniously replaced in a Storage Pod upgrade back in 2017. A few were resurrected as drive replacements, but today they only exist in the spirit world, having died or been replaced by 2020. Still, many other 4TB Seagate drives have lived long, happy lives, with nearly 100 exceeding 100 months of service (8.4 years) before being sent to their final resting place by the CVT reaper.

And so it is time; we shall gather in a circle, cross our arms and hold hands and chant “our Seagate drives…with terabytes of four…are nevermore!”


How Volkswagen Autoeuropa built a data mesh to accelerate digital transformation using Amazon DataZone

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/how-volkswagen-autoeuropa-built-a-data-mesh-to-accelerate-digital-transformation-using-amazon-datazone/

This is a joint blog post co-authored with Martin Mikoleizig from Volkswagen Autoeuropa.

Volkswagen Autoeuropa is a Volkswagen Group plant that produces the T-Roc. The plant is located near Lisbon, Portugal, and produces about 934 cars per day. In 2023, Volkswagen Autoeuropa accounted for 1.3% of Portugal's national GDP and 4% of national goods exports, with a sales volume of 3.3511 billion euros. Volkswagen Autoeuropa aims to become a data-driven factory and has been using cutting-edge technologies to enhance digitalization efforts.

In this post, we discuss how Volkswagen Autoeuropa used Amazon DataZone to build a data marketplace based on data mesh architecture to accelerate their digital transformation. The data mesh, built on Amazon DataZone, simplified data access, improved data quality, and established governance at scale to power analytics, reporting, AI, and machine learning (ML) use cases. As a result, the data solution offers benefits such as faster access to data, expeditious decision making, accelerated time to value for use cases, and enhanced data governance.

Understanding Volkswagen Autoeuropa’s challenges

At the time of writing this post, Volkswagen Autoeuropa has already implemented more than 15 successful digital use cases in the context of real-time visualization, business intelligence, industrial computer vision, and AI.

Before the AWS partnership, Volkswagen Autoeuropa faced the following challenges.

  • Long lead time to access data – The digital use cases launched by Volkswagen Autoeuropa spent most of their project time getting access to the data that was relevant to their use cases. After the right data for the use case was found, the IT team provided access to the data through manual configuration. The lead time to access data was often from several days to weeks.
  • Insufficient data governance and auditing – Data was shared directly with use cases by copying it, so the IT team manually connected the same sources to the desired destinations multiple times. This process wasn't centrally tracked, so there was no record of the data sharing process: for example, whether the data had been copied in the past, how many use cases had access to it, when access was granted, and who granted it.
  • Redundant effort to process the same information – Because the IT team copied the data sources based on the exact use case requirements, they shared specific columns of the tables from the data. As additional use cases requested access to the same data with different column requirements, even more copies of the data were created.
  • Repeated process to establish security and governance guardrails – Each time the IT and the security team provided a connection to a new data source, they had to set up the security and governance guardrails. This required repeated manual effort.
  • Data quality issues – Because the data was processed redundantly and shared multiple times, there was no guarantee of or control over the quality of the data. This led to reduced trust in the data.
  • Absence of data catalog and metadata management – Data didn’t have any metadata associated with it, and so use cases couldn’t consume the data without further explanation from the data source owners and specialists. Furthermore, no process to discover new data existed. Similar to the consumption process, use cases would consult specialists to understand the context of the data and if it could provide value.

Envisioning a data solution for Volkswagen Autoeuropa

To address these challenges, Volkswagen Autoeuropa embarked on a bold vision. They envisioned a seamless data consumption process, similar to an online shopping experience. They envisioned a data marketplace where data users could browse and access high-quality, secure data with clear specifications, business context, and relevant attributes. This vision materialized into a project aimed at transforming data accessibility and governance as the foundation for the digital ecosystem. The vision to be realized: Data as seamless as online shopping.

In collaboration with Amazon Web Services (AWS), Volkswagen Autoeuropa joined the Enhanced Plant Onboarding Program of the Global Volkswagen Group’s Digital Production Platform (DPP EPO) strategy. Through this partnership, AWS and Volkswagen Autoeuropa created a data marketplace that significantly improved data availability.

In the discovery phase of the project, Volkswagen Autoeuropa and AWS evaluated several options to build the data solution. In the end, Volkswagen Autoeuropa chose a solution based on data mesh architecture using Amazon DataZone. Being a managed service, Amazon DataZone provided the necessary speed and agility to build the solution. At the same time, it led to higher operational efficiencies and lower operational overhead. The team adopted a data mesh architecture because the principles of the data mesh aligned with Volkswagen Autoeuropa's vision of being a data-driven factory.

Solution overview

This section describes the key features and architecture of the Volkswagen Autoeuropa data solution. The solution is based on a data mesh architecture.

Data solution features

The following figure shows the key capabilities of the Volkswagen Autoeuropa data solution.

The key capabilities of the solution are:

  • Data quality – In the solution, we’ve built a data quality framework to streamline the process of data quality checks and publishing quality scores. It uses AWS Glue Data Quality to generate recommendation rulesets, run orchestrated jobs, store results, and send notifications to users. This framework can be seamlessly integrated into AWS Glue jobs, providing a quality score for data pipeline jobs. In addition, the quality score is published in the Amazon DataZone data portal, allowing consumers to subscribe to the data based on its quality score. Assigning a quality score to the data helps build trust in the data, and shifts the responsibility of maintaining data quality to the data owner. As a result, the quality of the results delivered by these use cases improves.
  • Data registration – The producers sign in to the Amazon DataZone data portal using their AWS Identity and Access Management (IAM) credentials or single sign-on with integration through AWS IAM Identity Center. They register their data assets, which are stored in Amazon Simple Storage Service (Amazon S3), in the Amazon DataZone data catalog. The metadata of the data assets is stored in an AWS Glue catalog and made available in the business data catalog of Amazon DataZone and in the Amazon DataZone data source. The producers add business context such as business unit name, data owner contact information, and data refresh frequency using Amazon DataZone glossaries and metadata forms. In addition, they use generative AI capabilities to generate business metadata. After the business metadata is generated, they review the changes and modify the metadata if needed. Because all data products in Volkswagen Autoeuropa are now registered in the same location, the likelihood of data duplication is significantly reduced. Moreover, the data producers are improving the quality of the data by adding business context to it.
  • Data discovery – The consumers sign in to the Amazon DataZone data portal using their IAM credentials or single sign-on with integration through IAM Identity Center and search the data using keywords in the search bar. After the results are returned, they can further filter the results using glossary terms and project names. Finally, they review the business metadata of the data assets to evaluate if the data is relevant to their business use cases. They can check the quality score of the data assets and the refresh schedule for their use cases. With a data discovery capability in place, consumers can gain information about the data without the need to consult the source system owners or specialists.
  • Data access management – When the consumers find a data asset that’s relevant to their use case, they request access to it using the subscription feature of Amazon DataZone. Data is classified as public, internal, and confidential. For public and internal data assets, the access request is automatically approved. For confidential data assets, the data producer team reviews the access request and either accepts or rejects the subscription request. With a central place to manage data access, data owners can view which use cases have access to their data and when the access request was granted. The fine-grained access control feature of Amazon DataZone gives data owners granular control of their data at the row and column levels.
  • Data consumption – Upon approval of the subscription request, Amazon DataZone provisions the backend infrastructure to make the data accessible to the corresponding consumers. After this process is complete, the consumers can access the data through Amazon Athena using the deep link feature of Amazon DataZone. The data consumption pattern in Volkswagen Autoeuropa supports two use cases:
    • Cloud-to-cloud consumption – Both data assets and consumer teams or applications are hosted in the cloud.
    • Cloud-to-on-premises consumption – Data assets are hosted in the cloud and consumer use cases or applications are hosted on-premises.

Each use case requires access to the relevant data assets, and sharing data with use cases through Amazon DataZone doesn’t require creating multiple copies. As a result, duplication and processing of data are reduced. Furthermore, by reducing the number of copies of the data, the overall quality of the data products improves. In addition, the backend automation of Amazon DataZone to make data available to use cases reduces the manual effort and improves the lead time to access data.

  • Single collaborative environment – The Amazon DataZone data portal provides a single collaborative environment to the users in Volkswagen Autoeuropa. Data consumers such as use case owners, data engineers, data scientists, and ML engineers can browse and request access to data assets. At the same time, data producers, such as use case owners and source system owners, can publish and curate their data in the Amazon DataZone data portal. This collaborative experience promotes teamwork and accelerates the realization of business value. Furthermore, the security and governance guardrails scale across the organization as the number of use cases increases.
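The approval rule described under data access management (public and internal assets are approved automatically, confidential assets go to the data producer for review) can be sketched in a few lines of Python. This is an illustrative model only, not the Amazon DataZone API: the class names, field names, and the `route_request` function are hypothetical and exist just to make the rule concrete.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"


class Decision(Enum):
    APPROVED = "approved"              # granted without human review
    PENDING_REVIEW = "pending_review"  # waits for the data producer team


@dataclass(frozen=True)
class SubscriptionRequest:
    asset_name: str          # hypothetical asset identifier
    consumer_project: str    # hypothetical consumer project name
    classification: Classification


def route_request(request: SubscriptionRequest) -> Decision:
    """Auto-approve public and internal assets; route confidential ones to review."""
    if request.classification in (Classification.PUBLIC, Classification.INTERNAL):
        return Decision.APPROVED
    return Decision.PENDING_REVIEW
```

In this sketch, a request for a confidential asset is never approved by the rule itself. It only reaches `PENDING_REVIEW`, which mirrors the post's statement that the data producer team accepts or rejects such requests.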

Data solution architecture

The following figure displays the reference architecture of the data solution at Volkswagen Autoeuropa. In the next part of the post, we discuss how we arrived at the solution.

The architecture includes:

  1. The data from SAP applications, manufacturing execution systems (MES), and supervisory control and data acquisition (SCADA) systems is ingested into the producer accounts of Volkswagen Autoeuropa.
  2. In the producer account, raw data is transformed using AWS Glue. The technical metadata of the data is stored in AWS Glue catalog. The data quality is measured using the data quality framework. The data stored in Amazon Simple Storage Service (Amazon S3) is registered as an asset in the Amazon DataZone data catalog hosted in the central governance account.
  3. The central governance account hosts the Amazon DataZone domain and the related Amazon DataZone data portal. The AWS accounts of the data producers and consumers are associated with the Amazon DataZone domain. Amazon DataZone projects belonging to the data producers and consumers are created under the related Amazon DataZone domain units.
  4. Consumers of the data products sign in to the Amazon DataZone data portal hosted in the central governance account using their IAM credentials or single sign-on with integration through IAM Identity Center. They search, filter, and view asset information (for example, data quality, business, and technical metadata).
  5. After the consumer finds the asset they need, they request access to the asset using the subscription feature of Amazon DataZone. Based on the validity of the request, the asset owner approves or rejects the request.
  6. After the subscription request is granted and fulfilled, the asset is accessed in the consumer account for a one-time query using Athena and Microsoft Power BI applications hosted on premises. This consumption pattern can be extended for AI and machine learning (AI/ML) model development using Amazon SageMaker and reporting purposes using Amazon QuickSight.

User journey

After discussing the desired system with the use case teams and stakeholders and analyzing the current workflow, Volkswagen Autoeuropa grouped the user personas of the data solution into three main categories: data producer, data consumer, and data solution administrator. This sets the foundation for the desired user experience and what’s needed to achieve the solution goals.

Data producer

Data producers create the data products in the data solution. There are two types of data producers.

  • Data source owners – Data source owners publish the raw data in the Amazon DataZone data portal. These data products are attributed as source-based data.
  • Use case owners – Use case owners publish data that’s fit for consumption by other use cases. These data products are called consumer-based data.

The following figure shows the user journey of a data producer:

 

A data producer’s journey includes:

  1. Identify data of interest
    1. Identify data (Volkswagen Autoeuropa network).
    2. Perform data quality checks (Volkswagen Autoeuropa network).
  2. Connect data to the data solution
    1. Ingest data into the data solution (Amazon DataZone portal).
    2. Start process to connect data using AWS Glue.
  3. Locate the data source in the data solution
    1. Register data (Amazon DataZone portal).
    2. Add data to the inventory in Amazon DataZone.
  4. Add or edit metadata
    1. Add or edit metadata (Amazon DataZone portal).
    2. Publish data assets (Amazon DataZone portal).
  5. Approve or reject subscription request
    1. Review subscription requests.
  6. Maintain data assets
    1. Manage data assets (Amazon DataZone portal).

Data consumer

Data consumers use data for business analytics, machine learning, AI, and business reporting. Data consumers are data engineers, data scientists, ML engineers, and business users. The following diagram shows the journey of a data consumer.

A data consumer’s journey includes:

  1. Access Amazon DataZone portal
    1. Amazon DataZone portal – Access is granted based on the user’s assigned domain and projects.
  2. Search for data assets
    1. Data assets in Amazon DataZone portal – Search for data and browse the results by glossary terms or the project name. Use additional filters to refine the results.
  3. View business metadata
    1. Select a data asset to see additional information – Review the description, data quality score and metadata.
  4. Request access to data (subscribe)
    1. Subscribe to request access.
    2. After the subscription request is approved, review the data products that you have access to.
    3. Query the data to view and consume the data.
  5. Retrieve additional data
    1. Repeat the steps as needed to access and retrieve additional data.

Data solution administrator

Data solution administrators are responsible for performing administrative tasks on the data solution. The following figure shows the common tasks performed by the data solution administrator.

A data administrator’s journey includes:

  1. Manage projects
    1. Manage Amazon DataZone domain.
    2. Manage Amazon DataZone projects within the domain.
  2. Manage environment
    1. Set up the environment to manage the infrastructure.
  3. Manage business metadata glossary
    1. Manage and enable Amazon DataZone glossaries and metadata forms.
  4. Manage data assets
    1. Manage assets.
    2. Query the data to view and consume the data.
  5. Manage access to data solution
    1. Monitor and revoke access when appropriate.

Conclusion

In this post, you learned how Volkswagen Autoeuropa embarked on a bold vision to become a data driven factory. It shows how this vision was put into action by building a data solution based on data mesh architecture using Amazon DataZone. It highlights the key features and architecture of the data solutions and presents the user journey. As of writing this post, Volkswagen Autoeuropa reduced the data discovery time from days to minutes using the data solution. The time to access data took several weeks before the Volkswagen Autoeuropa and AWS collaboration. Now, with the help of the data solution, the data access time has been reduced to several minutes.

In May 2024, the team achieved a major milestone by successfully offering data on the data solution and transporting it instantly to Power BI, a process that previously took several weeks.

“After one year of work, we did the full roundtrip from offering data on our new data marketplace built using Amazon DataZone to transporting it instantly to third-party tools, a process that previously took several weeks. This was a big achievement for our team.”

– Jorge Paulino, Product owner of the data solution. Volkswagen Autoeuropa.

The next post of the two-part series discusses how we built the solution, its technical details, and the business value created.

If you want to harness the agility and scalability of a data mesh architecture and Amazon DataZone to accelerate innovation and drive business value for your organization, we have the resources to get you started. Be sure to check out the AWS Prescriptive Guidance: Strategies for building a data mesh-based enterprise solution on AWS. This comprehensive guide covers the key considerations and best practices for establishing a robust, well-governed data mesh on AWS. From aligning your data mesh with overall business strategy to scaling the data mesh across your organization, this Prescriptive Guidance provides a clear roadmap to help you succeed.

If you’re curious to get hands-on, see the GitHub repository: Building an enterprise Data Mesh with Amazon DataZone, AWS CDK, and AWS CloudFormation. This open source project delivers a step-by-step guide to build a data mesh architecture using Amazon DataZone, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.


About the Authors

Dhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data analytics, and data governance at Amazon Web Services (AWS). He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure AWS solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. An active contributor to the AWS community, Dhrubajyoti authors AWS Prescriptive Guidance publications, blog posts, and open-source artifacts, sharing his insights and best practices with the broader community. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

Ravi Kumar is a Data Architect and Analytics expert at Amazon Web Services; he finds immense fulfillment in working with data. His days are dedicated to designing and analyzing complex data systems, uncovering valuable insights that drive business decisions. Outside of work, he unwinds by listening to music and watching movies, activities that allow him to recharge after a long day of data wrangling.

Martin Mikoleizig studied mechanical engineering and production technology at RWTH Aachen University before starting work at Dr. h.c. Ing. F. Porsche AG in 2015 as a production planner for the engine assembly. Over several years as a Project Manager on Testing Technology for new engine models, he also introduced several innovations such as human-machine collaboration and intelligent assistance systems. From 2017, he was responsible for the Shopfloor IT team of the module lines in Zuffenhausen before he became responsible for the Planning of the E-Drive assembly at Porsche. Besides this, he was responsible for the Digitalisation Strategy of the Production Ressort at Porsche. Since October 2022, he has been assigned to Volkswagen Autoeuropa in Portugal in the role of Digital Transformation Manager for the plant, driving the Digital Transformation towards a Data Driven Factory.

Weizhou Sun is a Lead Architect at Amazon Web Services, specializing in digital manufacturing solutions and IoT. With extensive experience in Europe, she has enhanced operational efficiencies, reducing latency and increasing throughput. Weizhou’s expertise includes Industrial Computer Vision, predictive maintenance, and predictive quality, consistently delivering top performance and client satisfaction. A recognized thought leader in IoT and remote driving, she has contributed to business growth through innovations and open-source work. Committed to knowledge sharing, Weizhou mentors colleagues and contributes to practice development. Known for her problem-solving skills and customer focus, she delivers solutions that exceed expectations. In her free time, Weizhou explores new technologies and fosters a collaborative culture.

Shameka Almond is an Advisory Consultant at Amazon Web Services. She works closely with enterprise customers to help them better understand the business impact and value of implementing data solutions, including data governance best practices. Shameka has over a decade of wide-ranging IT experience in the manufacturing and aerospace industries, and the nonprofit sector. She has supported several data governance initiatives, helping both public and private organizations identify opportunities for improvement and increased efficiency. Outside of the office she enjoys hosting large family gatherings, and supporting community outreach events dedicated to introducing students in K-12 to STEM.

Adjoa Taylor has over 20 years of experience in industrial manufacturing, providing industry and technology consulting services, digital transformation, and solution delivery. Currently Adjoa leads Product Centric Digital Transformation, enabling customers to solve complex manufacturing problems by leveraging Smart Factory and Industry leading transformation mechanisms. Most recently, she has been driving value with AI/ML and generative AI use cases for the plant floor. Adjoa is an experienced leader who has spent over 20 years of her career delivering projects in countries throughout North America, Latin America, Europe, and Asia. Through prior roles, Adjoa brings deep experience across multiple business segments with a focus on business outcome driven solutions. Adjoa is passionate about helping customers solve problems while realizing the art of the possible via the right impacting value-based solution.

Roger Grimes on Prioritizing Cybersecurity Advice

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/10/roger-grimes-on-prioritizing-cybersecurity-advice.html

This is a good point:

Part of the problem is that we are constantly handed lists…list of required controls…list of things we are being asked to fix or improve…lists of new projects…lists of threats, and so on, that are not ranked for risks. For example, we are often given a cybersecurity guideline (e.g., PCI-DSS, HIPAA, SOX, NIST, etc.) with hundreds of recommendations. They are all great recommendations, which if followed, will reduce risk in your environment.

What they do not tell you is which of the recommended things will have the most impact on best reducing risk in your environment. They do not tell you that one, two or three of these things…among the hundreds that have been given to you, will reduce more risk than all the others.

[…]

The solution?

Here is one big one: Do not use or rely on un-risk-ranked lists. Require any list of controls, threats, defenses, solutions to be risk-ranked according to how much actual risk they will reduce in the current environment if implemented.

[…]

This specific CISA document has at least 21 main recommendations, many of which lead to two or more other more specific recommendations. Overall, it has several dozen recommendations, each of which individually will likely take weeks to months to fulfill in any environment if not already accomplished. Any person following this document is…rightly…going to be expected to evaluate and implement all those recommendations. And doing so will absolutely reduce risk.

The catch is: There are two recommendations that WILL DO MORE THAN ALL THE REST ADDED TOGETHER TO REDUCE CYBERSECURITY RISK most efficiently: patching and using multifactor authentication (MFA). Patching is listed third. MFA is listed eighth. And there is nothing to indicate their ability to significantly reduce cybersecurity risk as compared to the other recommendations. Two of these things are not like the other, but how is anyone reading the document supposed to know that patching and using MFA really matter more than all the rest?

Tracking World Leaders Using Strava

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/10/tracking-world-leaders-using-strava.html

Way back in 2018, people noticed that you could find secret military bases using data published by the Strava fitness app. Soldiers and other military personnel were using it to track their runs, and you could look at the public data and find places where there should be no people running.

Six years later, the problem remains. Le Monde has reported that the same Strava data can be used to track the movements of world leaders. They don’t wear the tracking device, but many of their bodyguards do.

Security updates for Thursday

Post Syndicated from jake original https://lwn.net/Articles/996526/

Security updates have been issued by Debian (firefox-esr and openssl), Fedora (firefox, libarchive, micropython, NetworkManager-libreswan, and xorg-x11-server-Xwayland), Red Hat (nano), Slackware (mozilla-firefox, mozilla-thunderbird, tigervnc, and xorg), SUSE (389-ds, Botan, go1.21-openssl, govulncheck-vulndb, java-11-openjdk, lxc, python-Werkzeug, and uwsgi), and Ubuntu (firefox, libarchive, linux-azure-fde, linux-azure-fde-5.15, python-pip, and xorg-server, xorg-server-hwe-16.04, xorg-server-hwe-18.04).

Celebrating 15 years of MariaDB

Post Syndicated from Michael "Monty" Widenius original http://monty-says.blogspot.com/2024/10/celebrating-15-years-of-mariadb.html

It is 15 years since the first MariaDB server release of MariaDB 5.1.38 on the 29th of October 2009.

MariaDB got its name from my youngest daughter Maria, following the tradition of MySQL, which got its name from my oldest daughter My.

The MariaDB project started on April 20, 2009, the same day that Oracle announced it would buy Sun Microsystems, which owned MySQL. The initial MariaDB engineering team consisted of some 20 engineers from the MySQL server team at Sun, and me. It has now grown to 45-50 engineers in MariaDB Corporation & MariaDB Foundation + a lot of external contributors.

The reason for creating MariaDB is that we all believed that Oracle would not be a good steward of MySQL and we wanted to ensure that MySQL source and spirit would continue living, outside of Oracle.  My belief is also that without MariaDB, MySQL would not exist today.

MariaDB is actively developed. There have been 27 major releases of MariaDB, of which 18 have been long term (LTS) releases.

All MariaDB changes are tested on 4 different compilers, 6 different architectures, 4 different operating systems and 7 OS distributions. In addition we compile with many different compiler options and code checkers to find issues earlier.

MariaDB is available on all major OS distributions and on all public clouds.

MariaDB Corporation/plc has in addition done 6 major releases of the MariaDB Enterprise server. The Enterprise server is a targeted database for users who want longer release and maintenance cycles, higher stability, more performance and enterprise level support with direct contact to the engineers that wrote the code. This includes fewer reasons to upgrade, thanks to backported features from newer MariaDB releases, and easier upgrades, thanks to tools like MaxScale.

Until MariaDB 10.0, MariaDB was a true fork of MySQL with a lot of enhancements and performance improvements. Starting from MariaDB 10.0 we stopped doing merges of code from MySQL as this enabled us to add more features and bigger improvements to the code, without having to be constrained by the MySQL code base. We still kept up with most MySQL features and syntax to ensure that it should be trivial to migrate from MySQL to MariaDB. One can still move from MySQL 5.7 and earlier MySQL versions to any newer MariaDB version with almost no changes. MySQL 8.0 changed how things are stored on disk, which means that to move from MySQL 8.0 and above to MariaDB one has to use mysqldump/mariadb-dump and restore. Apart from that, moving from MySQL to MariaDB is still in many cases easier than moving between MySQL versions.

Here comes a list of some of the most notable features created in MariaDB.  Note that many of these features were later copied by MySQL, usually with a different syntax. These are marked by (*) in the list below. I am very happy to see that MariaDB has forced MySQL to innovate! There are of course a few cases where MySQL adds a feature before MariaDB. These are also noted in the following list.

New storage engines

  • Aria storage engine (MyISAM replacement, initially called Maria. Used for temporary results)
  • ColumnStore (Columnar storage, for analytical queries)
  • Connect (Allows one to connect to external databases through JDBC/ODBC and also read a lot of legacy database formats)
  • Mroonga (fulltext search)
  • MyRocks (Compressed storage, used by Facebook)
  • Sequence (Allows the creation of ascending or descending sequences. Great to quickly generate test data)
  • Spider (Sharding over multiple MariaDB servers)
  • S3 Storage engine

Performance

  • Pool of threads (MySQL had a similar capability in 5.4 community but later removed it from the community version and added  it to MySQL Enterprise)

Optimizer

  • Table elimination
  • Better optimizer (First stage in MariaDB 5.3-5.5 and second in MariaDB 11.0)
    • Starting from 11.0 almost all aspects of the optimizer are cost based and costs are tunable.
    • Optimised for modern hardware
  • Subquery optimizations in 5.3 (*)
  • Index Condition pushdown (*)
  • Semi-join (*)
  • Batched key access (*)
  • Materialization (*)
  • Index_merge / Sort_intersection
  • Cost-based choice of range vs. index_merge
  • Use extended (hidden) primary keys for InnoDB
  • Subquery cache
  • Block hash join
  • Null-rejecting conditions tested early for NULLs
  • Optimizer trace (MySQL had this in 5.6. MariaDB later did a different implementation).
  • ANALYZE … SELECT|UPDATE|DELETE (*)
  • See https://mariadb.com/kb/en/optimizer-feature-comparison-matrix/ for a more complete list for the older optimizer features.
  • Histogram based statistics (*)
  • Split Grouping Optimization (?)
  • Descending indexes (MySQL had this first)
  • Sargable date and year
  • Vector search (*) (MySQL only offers a vector datatype, but no indexing possibilities)

Security

  • Pluggable authentication (*)
  • Unix socket authentication (*)
  • Roles (*)
  • Table level encryption (*) ; Patch from Google
  • Password validate plugin (MySQL had this first, but we could not use it as it had too many limitations and gotchas so we had to implement one from scratch)
  • Password expiration and account locking (MySQL had this first)
  • ED25519, PARSEC authentication plugins
  • Password reuse plugin
  • Hashicorp Key Management Plugin
  • SSL enabled by default. No configuration necessary. ; MySQL does not have zero-config SSL

Replication

  • Group commit with binary log (*)
  • Multi-source replication (*)
  • Parallel replication (*)
  • Enhanced semisync replication (*)
  • Multi-master with Galera (*) MySQL later implemented group replication, which provides a similar feature
  • Global transaction id (MySQL had this one first)
  • Annotated row based events
  • Checksums for binlog events (MySQL backport)
  • Binary log checksums calculated during event creation and not during commit. This gives a great performance boost to replication when using checksums.
  • Delay slave (MySQL backport)
  • Semi-sync plugin moved inside server, which gives notably better performance.
  • Lag free ALTER TABLE in replication

Logging

  • EXPLAIN in slow query log
  • Engine statistics in slow query log

DDL enhancements

  • Progress reports for ALTER TABLE, CHECK TABLE etc.
  • RETURNING for INSERT, UPDATE and DELETE
  • OR REPLACE for CREATE table and other DDL
  • ALTER ONLINE TABLE (MySQL backport ; Released at the same time)
  • INSTANT ADD COLUMN (MySQL had this one first, code from Tencent Games)
  • INSTANT DROP COLUMN, MODIFY COLUMN (*)
  • CHECK CONSTRAINT (*)
  • DECIMAL decimals increased from 30 to 38 (banking requirement)
  • CREATE SEQUENCE
  • Multiple triggers for same state (MySQL was first)
  • Invisible columns (*)
  • Atomic DDL (MySQL was first, but Oracle changed the storage format which makes it impossible to downgrade back. MariaDB did the same feature without changing storage format).
  • One can update the table even if ALTER TABLE is running.

DML & DQL enhancements

  • SELECT … OFFSET … FETCH
  • SELECT … SKIP LOCKED ; MySQL had this first
  • Natural sorting

Other Features

  • Microsecond support for time data types (*)
  • Virtual columns (*)
  • Non-blocking client API Library
  • Shutdown statement (*)
  • Improved spatial functions (MariaDB has more Spatial functions than MySQL)
  • Improved GET_LOCK() with timeout in microseconds.
  • Window functions (*)
  • PERCENTILE_CONT, PERCENTILE_DISC, and MEDIAN window functions
  • Common table expressions (* ; Released about the same time in MySQL and MariaDB)
  • Oracle compatibility (LOTS of functions, PL-SQL, packages, null handling etc). This allows one to move many type of Oracle applications unchanged to MariaDB.
  • FLASHBACK  ; Use binary log to roll back data to a previous state. (Contribution by Alibaba)
  • JSON functions (MySQL had initially better JSON support but MariaDB has caught up)
  • JSON Table (MySQL had this first)
  • System versioned tables (known as AS OF or Temporal Tables)
  • Table value constructors (*)
  • ROW data type
  • INet4 and INet6 data type
  • UUID data type (MySQL had this first)
  • INTERSECT & EXCEPT (*)
  • Storage engine independent column compression (Percona server had this first. Not in MySQL)
  • Support for Persistent Memory
  • mariadb-backup and backup locks  (Only in MySQL Enterprise. However MariaDB can also do backup while ALTER TABLE is running)
  • sys schema (MySQL had this first)
  • SFORMAT for arbitrary text formatting
  • Connection redirection (*)

MariaDB Corporation also provides LGPL connectors that work with MariaDB and MySQL for the following languages:

  • C
  • C++
  • Java 8+
  • ODBC
  • Python
  • Node.js
  • R2DBC

MariaDB Corporation is also ensuring that the PHP and Perl connectors work with MariaDB.

Last, I want to thank all the MariaDB developers, testers, MariaDB employees, MariaDB contributors, investors, sponsors, customers and users, all of whom have contributed to make MariaDB a successful project.

These have been an amazing first 15 years and there are many more to come!

May your database always keep running!

Michael “Monty” Widenius

Moving Baselime from AWS to Cloudflare: simpler architecture, improved performance, over 80% lower cloud costs

Post Syndicated from Boris Tane original https://blog.cloudflare.com/80-percent-lower-cloud-cost-how-baselime-moved-from-aws-to-cloudflare

Introduction

When Baselime joined Cloudflare in April 2024, our architecture had evolved to hundreds of AWS Lambda functions, dozens of databases, and just as many queues. We were drowning in complexity and our cloud costs were growing fast. We are now building Baselime and Workers Observability on Cloudflare and will save over 80% on our cloud compute bill. The estimated potential Cloudflare costs are for Baselime, which remains a stand-alone offering, and the estimate is based on the Workers Paid plan. Not only did we achieve huge cost savings, we also simplified our architecture and improved overall latency, scalability, and reliability.

Daily Cost | Before (AWS) | After (Cloudflare)
Compute | $650 – AWS Lambda | $25 – Cloudflare Workers
CDN | $140 – Cloudfront | $0 – Free
Data Stream + Analytics database | $1,150 – Kinesis Data Stream + EC2 | $300 – Workers Analytics Engine
Total | $1,940 | $325 (83% cost reduction)

Table 1: Daily Costs Comparison ($USD)
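The totals and the 83% figure in Table 1 can be checked with a few lines of arithmetic. The snippet below is only a sanity check on the published numbers. The dictionary keys are shorthand labels chosen here, not names from the post.

```python
# Daily costs in USD, copied from Table 1.
before_aws = {"compute": 650, "cdn": 140, "data_stream_and_analytics": 1150}
after_cloudflare = {"compute": 25, "cdn": 0, "data_stream_and_analytics": 300}

total_before = sum(before_aws.values())       # 650 + 140 + 1150 = 1940
total_after = sum(after_cloudflare.values())  # 25 + 0 + 300 = 325

# Relative reduction: (1940 - 325) / 1940, roughly 0.832
reduction = (total_before - total_after) / total_before
print(f"{total_before} -> {total_after}: {reduction:.0%} reduction")
# prints "1940 -> 325: 83% reduction"
```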

When we joined Cloudflare, we immediately saw a surge in usage, and within the first week following the announcement, we were processing over a billion events daily and our weekly active users tripled.

As the platform grew, so did the challenges of managing real-time observability with new scalability, reliability, and cost considerations. This drove us to rebuild Baselime on the Cloudflare Developer Platform, where we could innovate quickly while reducing operational overhead.

Initial architecture — all on AWS

Our initial architecture was all on Amazon Web Services (AWS). We’ll focus here on the data pipeline, which covers ingestion, processing, and storage of tens of billions of events daily.

This pipeline was built on top of AWS Lambda, Cloudfront, Kinesis, EC2, DynamoDB, ECS, and ElastiCache.


Figure 1: Initial data pipeline architecture

The key elements are:

  • Data receptors: Responsible for receiving telemetry data from multiple sources, including OpenTelemetry, Cloudflare Logpush, CloudWatch, Vercel, etc. They cover validation, authentication, and transforming data from each source into a common internal format. The data receptors were deployed either on AWS Lambda (using function URLs and Cloudfront) or ECS Fargate depending on the data source.

  • Kinesis Data Stream: Responsible for transporting the data from the receptors to the next step: data processing.

  • Processor: A single AWS Lambda function responsible for enriching and transforming the data for storage. It also performed real-time error tracking and detected patterns in logs.

  • ClickHouse cluster: All the telemetry data was ultimately indexed and stored in a self-hosted ClickHouse cluster on EC2.

In addition to these key elements, the existing stack also included orchestration with Firehose, S3 buckets, SQS, DynamoDB and RDS for error handling, retries, and storing metadata.

While this architecture served us well in the early days, it started to show major cracks as we scaled our solution to more and larger customers.

Handling retries at the interface between the data receptors and the Kinesis Data Stream was complex: it required introducing and orchestrating Firehose, S3 buckets, SQS, and another Lambda function.

Self-hosting ClickHouse also introduced major challenges at scale, as we continuously had to plan our capacity and update our setup to keep pace with our growing user base whilst attempting to maintain control over costs.

Costs began scaling unpredictably with our growing workloads, especially in AWS Lambda, Kinesis, and EC2, but also in less obvious ways, such as in Cloudfront (required for a custom domain in front of Lambda function URLs) and DynamoDB. Specifically, the time spent on I/O operations in AWS Lambda was a particularly costly piece. At every step, from the data receptors to the ClickHouse cluster, moving data to the next stage required waiting for a network request to complete, accounting for over 70% of wall time in the Lambda function.

In a nutshell, we were continuously paged by our alerts, innovating at a slower pace, and our costs were out of control.

Additionally, the entire solution was deployed in a single AWS region: eu-west-1. As a result, all developers located outside continental Europe were experiencing high latency when emitting logs and traces to Baselime. 

Modern architecture — transitioning to Cloudflare

The shift to the Cloudflare Developer Platform enabled us to rethink our architecture to be exceptionally fast, globally distributed, and highly scalable, without compromising on cost, complexity, or agility. This new architecture is built on top of Cloudflare primitives.


Figure 2: Modern data pipeline architecture

Cloudflare Workers: the core of Baselime

Cloudflare Workers are now at the core of everything we do. All the data receptors and the processor run in Workers. Workers minimize cold-start times and are deployed globally by default. As such, developers always experience lower latency when emitting events to Baselime.

Additionally, we heavily use JavaScript-native RPC for data transfer between steps of the pipeline. It’s low-latency, lightweight, and simplifies communication between components. This further simplifies our architecture, as separate components behave more as functions within the same process, rather than completely separate applications.

export default {
  async fetch(request: Request, env: Bindings, ctx: ExecutionContext): Promise<Response> {
      try {
        const { err, apiKey } = auth(request);
        if (err) return err;

        const data = {
          workspaceId: apiKey.workspaceId,
          environmentId: apiKey.environmentId,
          events: request.body
        };
        await env.PROCESSOR.ingest(data);

        return success({ message: "Request Accepted" }, 202);
      } catch (error) {
        return failure({ message: "Internal Error" });
      }
  },
};

Code Block 1: Simplified data receptor using JavaScript-native RPC to execute the processor.

Workers also expose a Rate Limiting binding that enables us to automatically add rate limiting to our services, which we previously had to build ourselves using a combination of DynamoDB and ElastiCache.

Moreover, we heavily use ctx.waitUntil within our Worker invocations, to offload data transformation outside the request / response path. This further reduces the latency of calls developers make to our data receptors.
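
A minimal sketch of this pattern follows. It is not the Baselime receptor itself: `transformAndForward` is a hypothetical stand-in for the data transformation, and `WaitUntilContext` is a local type that mirrors only the `waitUntil` method of the Workers `ExecutionContext`. The handler responds immediately and hands the slower work to `waitUntil`, which keeps the invocation alive until the promise settles.

```typescript
// Local type mirroring the one ExecutionContext method this sketch uses.
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}

// Hypothetical stand-in for the transformation work done off the request path.
async function transformAndForward(events: string[], sink: string[]): Promise<void> {
  for (const event of events) {
    sink.push(event.trim().toLowerCase());
  }
}

// Returns the HTTP status right away; the transformation is tracked by waitUntil
// instead of being awaited before the response is sent.
function handleIngest(events: string[], sink: string[], ctx: WaitUntilContext): number {
  ctx.waitUntil(transformAndForward(events, sink));
  return 202; // Accepted: the caller does not wait for the transformation
}
```

In a real Worker, the same call would sit inside `fetch(request, env, ctx)`, with `ctx` supplied by the runtime.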

Durable Objects: stateful data processing

Durable Objects is a unique service within the Cloudflare Developer Platform, as it enables building stateful applications in a serverless environment. We use Durable Objects in the data pipelines for both real-time error tracking and detecting log patterns.

For instance, to track errors in real time, we create a durable object for each new type of error. This durable object is responsible for keeping track of the frequency of the error, when to notify customers, and the notification channels for the error. This single building block removes the need for ElastiCache, Kinesis, and multiple Lambda functions, which we previously coordinated to protect the RDS database from being overwhelmed by a high-frequency error.
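
To make the idea concrete, here is a small sketch of the state such an object might hold. It is an illustration under stated assumptions, not Baselime's code: the `ErrorTracker` class, the cooldown rule, and the channel names are all hypothetical, and a real Durable Object would persist this state through its storage API rather than in plain fields.

```typescript
// Hypothetical per-error state; with Durable Objects, one instance would exist per error type.
class ErrorTracker {
  private count = 0;
  private lastNotifiedAt: number | null = null;
  private readonly channels: string[];
  private readonly cooldownMs: number;

  constructor(channels: string[], cooldownMs: number) {
    this.channels = channels; // e.g. ["slack", "email"] (illustrative)
    this.cooldownMs = cooldownMs; // minimum gap between two notifications
  }

  // Records one occurrence and returns the channels to notify, if any.
  record(nowMs: number): string[] {
    this.count += 1;
    const coolingDown =
      this.lastNotifiedAt !== null && nowMs - this.lastNotifiedAt < this.cooldownMs;
    if (coolingDown) return []; // a burst of the same error does not fan out downstream
    this.lastNotifiedAt = nowMs;
    return this.channels;
  }

  get occurrences(): number {
    return this.count;
  }
}
```

Because every occurrence of a given error is handled by the same instance, the counter and the cooldown check need no external lock or cache.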


Figure 3: Real-time error detection architecture comparison

Durable Objects gives us precise control over consistency and concurrency when managing state in the data pipeline.

In addition to the data pipeline, we use Durable Objects for alerting. Our previous architecture required orchestrating EventBridge Scheduler, SQS, DynamoDB and multiple AWS Lambda functions, whereas with Durable Objects, everything is handled within the alarm handler. 

Workers Analytics Engine: high-cardinality analytics at scale

Though managing our own ClickHouse cluster was technically interesting and challenging, it took us away from building the best observability developer experience. With this migration, more of our time is spent enhancing our product and none is spent managing server instances.

Workers Analytics Engine lets us synchronously write events to a scalable high-cardinality analytics database. We built on top of the same technology that powers Workers Analytics Engine. We also made internal changes to Workers Analytics Engine to natively enable high dimensionality in addition to high cardinality.

Moreover, Workers Analytics Engine and our solution leverage Cloudflare’s ABR analytics. ABR stands for Adaptive Bit Rate, and enables us to store telemetry data in multiple tables with varying resolutions, from 100% to 0.0001% of the data. Querying the table with 0.0001% of the data will be several orders of magnitude faster than querying the table with all the data, with a corresponding trade-off in accuracy. As such, when a query is sent to our systems, Workers Analytics Engine dynamically selects the most appropriate table to run the query, optimizing both query time and accuracy. Users always get the most accurate result with optimal query time, regardless of the size of their dataset or the timeframe of the query. Compared to our previous system, which was always running queries on the full dataset, the new system now delivers faster queries across our entire user base and use cases.
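
The table-selection idea can be illustrated with a short sketch. This is not how Workers Analytics Engine is implemented; the `SampledTable` type, the table names, the row budget, and the `pickTable` function are assumptions made here to show the trade-off. The function picks the highest-resolution table whose estimated scan stays within a budget, and a count read from a sampled table is scaled back up by the inverse of its sample rate.

```typescript
interface SampledTable {
  name: string;
  sampleRate: number; // fraction of events kept: 1 = 100%, 0.000001 = 0.0001%
}

// Tables ordered from full resolution to the most heavily sampled (illustrative rates).
const tables: SampledTable[] = [
  { name: "events_100", sampleRate: 1 },
  { name: "events_1", sampleRate: 0.01 },
  { name: "events_0_0001", sampleRate: 0.000001 },
];

// Pick the most accurate table whose estimated scanned rows fit the budget.
function pickTable(totalRowsInRange: number, rowBudget: number): SampledTable {
  for (const table of tables) {
    if (totalRowsInRange * table.sampleRate <= rowBudget) return table;
  }
  return tables[tables.length - 1]; // fall back to the coarsest table
}

// Scale a count observed in a sampled table back to an estimate of the true count.
function estimateCount(sampledCount: number, table: SampledTable): number {
  return sampledCount / table.sampleRate;
}
```

A small dataset or a short time range fits the budget at full resolution, so it is answered exactly; only large scans fall through to the sampled tables.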

In addition to these core services (Workers, Durable Objects, Workers Analytics Engine), the new architecture leverages other building blocks from the Cloudflare Developer Platform: Queues for asynchronous messaging, decoupling services and enabling an event-driven architecture; D1 as our main database for transactional data (queries, alerts, dashboards, configurations, etc.); Workers KV for fast distributed storage; Hono for all our APIs, etc.

How did we migrate?

Baselime is built on an event-driven architecture: every user action, whether it’s creating a user, editing a dashboard, or anything else, is recorded as an event and emitted to the rest of the system. Migrating to Cloudflare involved transitioning this architecture without compromising uptime or data consistency. Previously, it was powered by AWS EventBridge and SQS; we moved entirely to Cloudflare Queues.

We followed the strangler fig pattern to incrementally migrate the solution from AWS to Cloudflare. The pattern consists of gradually replacing specific parts of a system with newer services, with minimal disruption. Early in the process, we created a central Cloudflare Queue, which acted as the backbone for all transactional event processing during the migration. Every event, whether a new user signup or a dashboard edit, was funneled into this Queue and then dynamically routed to the relevant part of the application. User actions were synced into D1 and KV, ensuring that they were mirrored across both AWS and Cloudflare during the transition.

This syncing mechanism enabled us to maintain consistency and ensure that no data was lost as users continued to interact with Baselime.

Here’s an example of how events are processed:

export default {
  async queue(batch, env) {
    for (const message of batch.messages) {
      try {
        const event = message.body;
        switch (event.type) {
          case "WORKSPACE_CREATED":
            await workspaceHandler.create(env, event.data);
            break;
          case "QUERY_CREATED":
            await queryHandler.create(env, event.data);
            break;
          case "QUERY_DELETED":
            await queryHandler.remove(env, event.data);
            break;
          case "DASHBOARD_CREATED":
            await dashboardHandler.create(env, event.data);
            break;
          //
          // Many more events...
          //
          default:
            logger.info("Matched no events", { type: event.type });
        }
        message.ack();
      } catch (e) {
        if (message.attempts < 3) {
          message.retry({ delaySeconds: Math.ceil(30 ** message.attempts / 10), });
        } else {
          logger.error("Failed handling event - No more retries", { event: message.body, attempts: message.attempts }, e);
        }
      }
    }
  },
} satisfies ExportedHandler<Env, InternalEvent>;

Code Block 2: Simplified internal events processing during migration.
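
The retry delay in Code Block 2 grows quickly with each attempt. The sketch below evaluates the same expression on its own, assuming `message.attempts` starts at 1 on the first delivery, as documented for Cloudflare Queues; `retryDelaySeconds` is a name introduced here.

```typescript
// Same expression as in Code Block 2: ceil(30^attempts / 10).
function retryDelaySeconds(attempts: number): number {
  return Math.ceil(30 ** attempts / 10);
}

// Attempts 1 and 2 are retried; from the third attempt on, the handler logs and gives up.
for (const attempts of [1, 2]) {
  console.log(`attempt ${attempts}: retry in ${retryDelaySeconds(attempts)}s`);
}
// attempt 1: retry in 3s
// attempt 2: retry in 90s
```

A failed message is therefore retried after 3 seconds, then after 90 seconds, before the error is logged.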

We migrated the data pipeline from AWS to Cloudflare with an outside-in method: we started with the data receptors and incrementally moved the data processor and the ClickHouse cluster to the new architecture. We began writing telemetry data (logs, metrics, traces, wide-events, etc.) to both ClickHouse (in AWS) and to Workers Analytics Engine simultaneously for the duration of the retention period (30 days).

The final step was rewriting all of our endpoints, previously hosted on AWS Lambda and ECS containers, into Cloudflare Workers. Once those Workers were ready, we simply switched the DNS records to point to the Workers instead of the existing Lambda functions.

Despite the complexity, the entire migration process, from moving the data pipeline to rewriting all API endpoints, took our then team of 3 engineers less than three months.

We ended up saving over 80% on our cloud bill

Savings on the data receptors

After switching the data receptors from AWS to Cloudflare in early June 2024, our AWS Lambda cost was reduced by over 85%. These costs were primarily driven by I/O time the receptors spent sending data to a Kinesis Data Stream in the same region.


Figure 4: Baselime daily AWS Lambda cost [note: the gap in data is the result of AWS Cost Explorer losing data when the parent organization of the cloud accounts was changed.]

Moreover, we used Cloudfront to enable custom domains pointing to the data receptors. When we migrated the data receptors to Cloudflare, there was no need for Cloudfront anymore. As such, our Cloudfront cost was reduced to $0.


Figure 5: Baselime daily Cloudfront cost [note: the gap in data is the result of AWS Cost Explorer losing data when the parent organization of the cloud accounts was changed.]

If we were a regular Cloudflare customer, we estimate that our daily Cloudflare Workers bill would be around $25 after the switch, against $790 on AWS: over 95% cost reduction. These savings are primarily driven by the Workers pricing model, since Workers charge for CPU time, and the receptors are primarily just moving data, and as such, are mostly I/O bound.
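
To see why an I/O-bound receptor benefits from CPU-time billing, consider a rough sketch built on the 70% figure quoted earlier for time spent waiting on network requests. The 100 ms request duration is a made-up number for illustration, actual prices are left out, and the helper name is not part of any API.

```typescript
// Portion of a request's wall-clock time that is actually spent on the CPU.
function cpuTimeMs(wallTimeMs: number, ioFraction: number): number {
  return wallTimeMs * (1 - ioFraction);
}

const wallTimeMs = 100; // hypothetical request duration
const ioFraction = 0.7; // share of wall time spent waiting on the network

const billedOnWallTime = wallTimeMs; // duration-based billing counts the waiting too
const billedOnCpuTime = cpuTimeMs(wallTimeMs, ioFraction); // CPU-time billing does not

console.log(billedOnWallTime, Math.round(billedOnCpuTime)); // 100 30
```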

Savings on the ClickHouse cluster

To evaluate the cost impact of switching from self-hosting ClickHouse to using Workers Analytics Engine, we need to take into account not only the EC2 instances, but also the disk space, networking, and the Kinesis Data Stream cost.

We completed this switch in late August, achieving over 95% cost reduction in both the Kinesis Data Stream and all EC2 related costs.


Figure 6: Baselime daily Kinesis Data Stream cost [note: the gap in data is the result of AWS Cost Explorer losing data when the parent organization of the cloud accounts was changed.]


Figure 7: Baselime daily EC2 cost [note: the gap in data is the result of AWS Cost Explorer losing data when the parent organization of the cloud accounts was changed.]

If we were a regular Cloudflare customer, we estimate that our daily Workers Analytics Engine cost would be around $300 after the switch, compared to $1,150 on AWS, a cost reduction of over 70%.

Not only did we significantly reduce costs by migrating to Cloudflare, but we also improved performance across the board. Responses to users are now faster, with real-time event ingestion happening across Cloudflare’s network, closer to our users. Responses to users querying their data are also much faster, thanks to Cloudflare’s deep expertise in operating ClickHouse at scale.

Most importantly, we’re no longer bound by limitations in throughput or scale. We launched Workers Logs on September 26, 2024, and our system now handles a much higher volume of events than before, with no sacrifices in speed or reliability.

These cost savings are outstanding as is, and do not include the total cost of ownership of those systems. We significantly simplified our systems and our codebase, as the platform is taking care of more for us. We’re paged less, we spend less time monitoring infrastructure, and we can focus on delivering product improvements.

Conclusion

Migrating Baselime to Cloudflare has transformed how we build and scale our platform. With Workers, Durable Objects, Workers Analytics Engine, and other services, we now run a fully serverless, globally distributed system that’s more cost-efficient and agile. This shift has significantly reduced our operational overhead and enabled us to iterate faster, delivering better observability tooling to our users.

You can start observing your Cloudflare Workers today with Workers Logs. Looking ahead, we’re excited about the features we will deliver directly in the Cloudflare Dashboard, including real-time error tracking, alerting, and a query builder for high-cardinality and dimensionality events. All coming by early 2025.

Workers Builds: integrated CI/CD built on the Workers platform

Post Syndicated from Serena Shah-Simpson original https://blog.cloudflare.com/workers-builds-integrated-ci-cd-built-on-the-workers-platform

During 2024’s Birthday Week, we launched Workers Builds in open beta — an integrated Continuous Integration and Delivery (CI/CD) workflow you can use to build and deploy everything from full-stack applications built with the most popular frameworks to simple static websites onto the Workers platform. With Workers Builds, you can connect a GitHub or GitLab repository to a Worker, and Cloudflare will automatically build and deploy your changes each time you push a commit.

Workers Builds is intended to bridge the gap between the developer experiences for Workers and Pages, the latter of which launched with an integrated CI/CD system in 2020. As we continue to merge the experiences of Pages and Workers, we wanted to bring one of the best features of Pages to Workers: the ability to tie deployments to existing development workflows in GitHub and GitLab with minimal developer overhead. 

In this post, we’re going to share how we built the Workers Builds system on Cloudflare’s Developer Platform, using Workers, Durable Objects, Hyperdrive, Workers Logs, and Smart Placement.

The design problem

The core problem for Workers Builds is how to pick up a commit from GitHub or GitLab and start a containerized job that can clone the repo, build the project, and deploy a Worker. 


Pages solves a similar problem, and we were initially inclined to expand our existing architecture and tech stack, which includes a centralized configuration plane built on Go in Kubernetes. We also considered the ways in which the Workers ecosystem has evolved in the four years since Pages launched — we have since launched so many more tools built for use cases just like this! 

The distributed nature of Workers offers some advantages over a centralized stack — we can spend less time configuring Kubernetes because Workers automatically handles failover and scaling. Ultimately, we decided to keep using what required no additional work to re-use from Pages (namely, the system for connecting GitHub/GitLab accounts to Cloudflare, and ingesting push events from them), and for the rest build out a new architecture on the Workers platform, with reliability and minimal latency in mind.

The Workers Builds system

We didn’t need to make any changes to the system that handles connections from GitHub/GitLab to Cloudflare and ingesting push events from them. That left us with two systems to build: the configuration plane for users to connect a Worker to a repo, and a build management system to run and monitor builds.

Client Worker 

We can begin with our configuration plane, which consists of a simple Client Worker that implements a RESTful API (using Hono) and connects to a PostgreSQL database. It’s in this database that we store build configurations for our users, and through this Worker that users can view and manage their builds. 

We use a Hyperdrive binding, which also manages connection pooling and query caching, to connect to our database securely over Cloudflare Access.

We considered a more distributed data model (like D1, sharded by account), but ultimately decided that keeping our database in a datacenter more easily fit our use-case. The Workers Builds data model is relational — Workers belong to Cloudflare Accounts, and Builds belong to Workers — and build metadata must be consistent in order to properly manage build queues. We chose to keep our failover-ready database in a centralized datacenter and take advantage of two other Workers products, Smart Placement and Hyperdrive, in order to keep the benefits of a distributed control plane. 


Everything that you see in the Cloudflare Dashboard related to Workers Builds is served by this Worker. 

Build Management Worker

The more challenging problem we faced was how to run and manage user builds effectively. We wanted to support the same experience that we had achieved with Pages, which led to these key requirements:

  1. Builds should be initiated with minimal latency.

  2. The status of a build should be tracked and displayed through its entire lifecycle, starting when a user pushes a commit.

  3. Customer build logs should be stored in a secure, private, and long-lived way.

To solve these problems, we leaned heavily into the technology of Durable Objects (DO). 

We created a Build Management Worker with two DO classes: a Scheduler class to manage the scheduling of builds, and a class called BuildBuddy to manage individual builds. We designed the system this way for efficiency and scalability. Since each build is assigned its own build manager DO, its operation won’t ever block other builds or the scheduler, meaning we can start up builds with minimal latency. Below, we dive into each of these Durable Objects classes.


Scheduler DO

The Scheduler DO class is relatively simple. Using Durable Objects Alarms, it is triggered every second to pull up a list of user build configurations that are ready to be started. For each of those builds, the Scheduler creates an instance of our other DO Class, the Build Buddy. 

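import { DurableObject } from 'cloudflare:workers'
// PQueue is assumed to come from the p-queue package
import PQueue from 'p-queue'


// ctx and env are inherited from DurableObject, so they are not redeclared here
export class BuildScheduler extends DurableObject<Bindings> {
   constructor(ctx: DurableObjectState, env: Bindings) {
       super(ctx, env)
   }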
   
   // The DO alarm handler will be called every second to fetch builds
   async alarm(): Promise<void> {
       // set alarm to run again in 1 second
       await this.updateAlarm()


       const builds = await this.getBuildsToSchedule()
       await this.scheduleBuilds(builds)
   }


   async scheduleBuilds(builds: Builds[]): Promise<void> {
       // Don't schedule builds, if no builds to schedule
       if (builds.length === 0) return


       const queue = new PQueue({ concurrency: 6 })
       // Begin running builds
       builds.forEach((build) =>
           queue.add(async () => {
               // The BuildBuddy is another DO described more in the next section!
               const bb = getBuildBuddy(this.env, build.build_id)
               await bb.startBuild(build)
           })
       )


       await queue.onIdle()
   }


   async getBuildsToSchedule(): Promise<Builds[]> {
       // returns list of builds to schedule
   }


   async updateAlarm(): Promise<void> {
       // We want to ensure we aren't running multiple alarms at once, so we only
       // set the next alarm if there isn’t already one set.
       const existingAlarm = await this.ctx.storage.getAlarm()
       if (existingAlarm === null) {
           this.ctx.storage.setAlarm(Date.now() + 1000)
       }
   }
}

Build Buddy DO

The Build Buddy DO class is what we use to manage each individual build from the time it begins initializing to when it is stopped. Every build has a buddy for life!

Upon creation of a Build Buddy DO instance, the Scheduler immediately calls startBuild() on the instance. The startBuild() method is responsible for fetching all metadata and secrets needed to run a build, and then kicking off a build on Cloudflare’s container platform (not public yet, but coming soon!). 

As the containerized build runs, it reports back to the Build Buddy, sending status updates and logs for the Build Buddy to deal with. 

Build status

As a build progresses, it reports its own status back to Build Buddy, sending updates when it has finished initializing, has completed successfully, or has been terminated by the user. The Build Buddy is responsible for handling this incoming information from the containerized build, writing status updates to the database (via a Hyperdrive binding) so that users can see the status of their build in the Cloudflare dashboard.

Build logs

A running build generates output logs that are important to store and surface to the user. The containerized build flushes these logs to the Build Buddy every second, which, in turn, stores those logs in DO storage.

The decision to use Durable Object storage here makes it easy to multicast logs to multiple clients efficiently, and allows us to use the same API for both streaming logs and viewing historical logs. 
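
One way to get both behaviors from a single API is to store each flush under a key that sorts in arrival order and to read with a cursor. The sketch below uses an in-memory `Map` as a stand-in for Durable Object storage; the key format, the `LogStore` class, and the cursor semantics are assumptions for illustration, since the post does not show how log keys are generated or read back.

```typescript
// In-memory stand-in for Durable Object storage (illustrative only).
class LogStore {
  private entries = new Map<string, string[]>();
  private nextSeq = 0;

  // Zero-padded sequence numbers make lexicographic key order match arrival order.
  private nextLogsKey(): string {
    const key = `logs:${String(this.nextSeq).padStart(10, "0")}`;
    this.nextSeq += 1;
    return key;
  }

  addLogLines(lines: string[]): string {
    const key = this.nextLogsKey();
    this.entries.set(key, lines);
    return key;
  }

  // Returns the lines stored after `cursor`, plus a cursor to resume from.
  getLogs(cursor?: string): { lines: string[]; cursor: string | undefined } {
    const keys = Array.from(this.entries.keys())
      .sort()
      .filter((key) => cursor === undefined || key > cursor);
    const lines: string[] = [];
    for (const key of keys) {
      lines.push(...(this.entries.get(key) ?? []));
    }
    return { lines, cursor: keys.length > 0 ? keys[keys.length - 1] : cursor };
  }
}
```

With this shape, a client viewing historical logs calls `getLogs()` with no cursor, while a client streaming a live build keeps calling it with the last cursor it received.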

// build-management-app.ts

import { Hono } from 'hono'

// We created a Hono app for use by our Client Worker API
const app = new Hono<HonoContext>()
   .post(
       '/api/builds/:build_uuid/status',
       async (c) => {
           const buildStatus = await c.req.json()


           // fetch build metadata
           const build = ...


           const bb = getBuildBuddy(c.env, build.build_id)
           return await bb.handleStatusUpdate(build, buildStatus)
       }
   )
   .post(
       '/api/builds/:build_uuid/logs',
       async (c) => {
           const logs = await c.req.json()
           // fetch build metadata
           const build = ...


           const bb = getBuildBuddy(c.env, build.build_id)
           return await bb.addLogLines(logs.lines)
       }
   )


export default {
   fetch: app.fetch
}

// build-buddy.ts

import { DurableObject } from 'cloudflare:workers'


export class BuildBuddy extends DurableObject {
   compute: WorkersBuildsCompute


   constructor(ctx: DurableObjectState, env: Bindings) {
       super(ctx, env)
       this.compute = new ComputeClient({
           // ...
       })
   }


   // The Scheduler DO calls startBuild upon creating a BuildBuddy instance
   startBuild(build: Build): void {
       this.startBuildAsync(build)         
   }


   async startBuildAsync(build: Build): Promise<void> {
       // fetch all necessary build metadata, including
       // environment variables, secrets, build tokens, repo credentials,
       // build image URI, etc
       // ...


       // start a containerized build
       const computeBuild = await this.compute.createBuild({
           // ...
       })
   }


   // The Build Management worker calls handleStatusUpdate when it receives an update
   // from the containerized build
   async handleStatusUpdate(
       build: Build,
       buildStatusUpdatePayload: Payload
   ): Promise<void> {
       // Write status updates to the database
   }


   // The Build Management worker calls addLogLines when it receives flushed logs
   // from the containerized build
   async addLogLines(logs: LogLines): Promise<void> {
       // Generate nextLogsKey to store logs under      
       this.ctx.storage.put(nextLogsKey, logs)
   }


   // The Client Worker can call methods on a Build Buddy via RPC, using a service binding to the Build Management Worker.
   // The getLogs method retrieves logs for the user, and the cancelBuild method forwards a request from the user to terminate a build.
   async getLogs(cursor?: string) {
       const decodedCursor = cursor !== undefined ? decodeLogsCursor(cursor) : undefined
       // readLogs is a hypothetical private helper (not shown) that reads stored log
       // lines from DO storage; calling this.getLogs here would recurse forever
       return await this.readLogs(decodedCursor)
   }


   async cancelBuild(compute_id: string, build_id: number): Promise<void> {
      await this.terminateBuild(build_id, compute_id)
   }


   async terminateBuild(build_id: number, compute_id: string): Promise<void> {
       await this.compute.stopBuild(compute_id)
   }
}


export function getBuildBuddy(
   env: Pick<Bindings, 'BUILD_BUDDY'>,
   build_id: number
): DurableObjectStub<BuildBuddy> {
   const id = env.BUILD_BUDDY.idFromName(build_id.toString())
   return env.BUILD_BUDDY.get(id)
}

Alarms

We utilize alarms in the Build Buddy to check that a build has a healthy startup and to terminate any builds that run longer than 20 minutes. 
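
A sketch of the decision such an alarm handler might make is shown below. Only the 20-minute limit comes from the text; the `TrackedBuild` shape, the 60-second startup deadline, and the `checkBuild` function are hypothetical.

```typescript
type BuildState = "initializing" | "running" | "stopped";

interface TrackedBuild {
  state: BuildState;
  createdAtMs: number; // when the build was started
}

const MAX_BUILD_MS = 20 * 60 * 1000; // from the text: builds are stopped after 20 minutes
const STARTUP_GRACE_MS = 60 * 1000; // hypothetical deadline for a healthy startup

// Decide what an alarm handler should do with the build at time `nowMs`.
function checkBuild(build: TrackedBuild, nowMs: number): "ok" | "terminate" {
  if (build.state === "stopped") return "ok";
  const ageMs = nowMs - build.createdAtMs;
  if (build.state === "initializing" && ageMs > STARTUP_GRACE_MS) return "terminate";
  if (ageMs > MAX_BUILD_MS) return "terminate";
  return "ok";
}
```

Keeping the decision in a pure function like this makes it straightforward to unit test separately from the alarm scheduling itself.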

How else have we leveraged the Developer Platform?

Now that we’ve gone over the core behavior of the Workers Builds control plane, we’d like to detail a few other features of the Workers platform that we use to improve performance, monitor system health, and troubleshoot customer issues.

Smart Placement and location hints

While our control plane is distributed in the sense that it can be run across multiple datacenters, to reduce latency costs, we want most requests to be served from locations close to our primary database in the western US.

While a build is running, Build Buddy, a Durable Object, is continuously writing status updates to our database. For the Client and the Build Management API Workers, we enabled Smart Placement with location hints to ensure requests run close to the database.


This graph shows the reduction in round trip time (RTT) observed for our Worker with Smart Placement turned on. 

Workers Logs

We needed a logging tool that allows us to aggregate and search across persistent operational logs from our Workers to assist with identifying and troubleshooting issues. We worked with the Workers Observability team to become early adopters of Workers Logs.

Workers Logs worked out of the box, giving us fast, easy-to-use logs directly within the Cloudflare dashboard. To improve our ability to search logs, we created a tagging library that lets us add metadata, such as the git tag of the deployed Worker that the log comes from, so that we can filter logs by release.

See a shortened example below for how we handle and log errors on the Client Worker. 

// client-worker-app.ts

// The Client Worker is a RESTful API built with Hono
const app = new Hono<HonoContext>()
   // This is from the workers-tagged-logger library - first we register the logger
   .use(useWorkersLogger('client-worker-app'))
   // If any error happens during execution, this middleware will ensure we log the error
   .onError(useOnError)
   // routes
   .get(
       '/apiv4/builds',
       async (c) => {
           const { ids } = c.req.query()
           return await getBuildsByIds(c, ids)
       }
   )


function useOnError(e: Error, c: Context<HonoContext>): Response {
   // Tag the log with the release (git tag) of the deployed Worker
   logger.setTags({ release: c.env.GIT_TAG })
 
   // Write a log at level 'error'. Can also log 'info', 'log', 'warn', and 'debug'
   logger.error(e)
   return c.json(internal_error.toJSON(), internal_error.statusCode)
}

This setup can lead to the following sample log message from our Workers Logs dashboard. You can see the release tag is set on the log.


We can get a better sense of the impact of the error by adding filters to the Workers Logs view, as shown below. We are able to filter on any of the fields since we’re logging with structured JSON.  
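
Because each log is a JSON object, filtering by a tag amounts to a field comparison. The sketch below is neither the workers-tagged-logger API nor the Workers Logs query engine; `TaggedLog`, `makeLog`, `filterByTag`, and the release values are hypothetical names and data used only to show the shape of the records.

```typescript
interface TaggedLog {
  level: "debug" | "info" | "log" | "warn" | "error";
  message: string;
  tags: Record<string, string>;
}

// Build one structured log entry with a copy of the given tags.
function makeLog(
  level: TaggedLog["level"],
  message: string,
  tags: Record<string, string>,
): TaggedLog {
  return { level, message, tags: { ...tags } };
}

// Keep only the entries whose tag `key` equals `value`.
function filterByTag(entries: TaggedLog[], key: string, value: string): TaggedLog[] {
  return entries.filter((entry) => entry.tags[key] === value);
}

const sampleLogs = [
  makeLog("error", "failed to fetch builds", { release: "v1.4.0" }),
  makeLog("info", "build started", { release: "v1.3.9" }),
];

console.log(JSON.stringify(filterByTag(sampleLogs, "release", "v1.4.0")));
```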


R2

Coming soon to Workers Builds is build caching, used to store artifacts of a build for subsequent builds to reuse, such as package dependencies and build outputs. Build caching can speed up customer builds by avoiding the need to redownload dependencies from NPM or to rebuild projects from scratch. The cache itself will be backed by R2 storage. 

Testing

We were able to build up a great testing story using Vitest and workerd: unit tests, cross-worker integration tests, the works. In the example below, we make use of the runInDurableObject helper from cloudflare:test to test instance methods on the Scheduler DO directly.

// scheduler.spec.ts

import { env, runInDurableObject } from 'cloudflare:test'
import { expect, test } from 'vitest'
import { BuildScheduler } from './scheduler'


test('getBuildsToSchedule() runs a queued build', async () => {
   // Our test harness creates a single build for our scheduler to pick up
   const { build } = await harness.createBuild()


   // We create a scheduler DO instance
   const id = env.BUILD_SCHEDULER.idFromName(crypto.randomUUID())
   const stub = env.BUILD_SCHEDULER.get(id)
   await runInDurableObject(stub, async (instance: BuildScheduler) => {
       expect(instance).toBeInstanceOf(BuildScheduler)


       // We check that the scheduler picks up 1 build
       const builds = await instance.getBuildsToSchedule()
       expect(builds.length).toBe(1)

       // We start the build, which should mark it as running
       await instance.scheduleBuilds(builds)
   })


   // Check that there are no more builds to schedule
   const queuedBuilds = ...
   expect(queuedBuilds.length).toBe(0)
})

We use SELF.fetch() from cloudflare:test to run integration tests on our Client Worker, as shown below. This integration test covers our Hono endpoint and database queries made by the Client Worker in retrieving the metadata of a build.

// builds_api.test.ts

import { env, SELF } from 'cloudflare:test'
import { expect, it } from 'vitest'

it('correctly selects a single build', async () => {
   // Our test harness creates a randomized build to test with
   const { build } = await harness.createBuild()


   // We send a request to the Client Worker itself to fetch the build metadata
   const getBuild = await SELF.fetch(
       `https://example.com/builds/${build.build_uuid}`,
       {
           method: 'GET',
           headers: new Headers({
               Authorization: `Bearer JWT`,
               'content-type': 'application/json',
           }),
       }
   )


   // We expect to receive a 200 response from our request and for the 
   // build metadata returned to match that of the random build that we created
   expect(getBuild.status).toBe(200)
   const getBuildV4Resp = await getBuild.json()
   const buildResp = getBuildV4Resp.result
   expect(buildResp).toBeTruthy()
   expect(buildResp).toEqual(build)
})

These tests run on the same runtime that Workers run on in production, meaning we have greater confidence that any code changes will behave as expected when they go live. 

Analytics

We use the technology underlying the Workers Analytics Engine to collect all of the metrics for our system. We set up Grafana dashboards to display these metrics. 
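
The post does not show how metrics are written, so the following is only a rough sketch of what recording a build metric could look like. It assumes the public Workers Analytics Engine `writeDataPoint()` shape (`indexes`, `blobs`, `doubles`). The binding interface is re-declared locally so the snippet runs on its own. `BuildMetric`, `recordBuildMetric`, and `InMemoryDataset` are hypothetical names, not the Workers Builds schema.

```typescript
// Local stand-in for the Analytics Engine binding, so the sketch is self-contained.
interface DataPoint {
    indexes?: string[]
    blobs?: string[]
    doubles?: number[]
}

interface Dataset {
    writeDataPoint(point: DataPoint): void
}

// Hypothetical metric shape (assumption): the real schema is not described in the post.
interface BuildMetric {
    buildId: string
    status: 'success' | 'failure'
    durationMs: number
}

// Maps one build metric onto a single data point:
// the build ID is the index, the status a blob, the duration a double.
function recordBuildMetric(dataset: Dataset, metric: BuildMetric): void {
    dataset.writeDataPoint({
        indexes: [metric.buildId],
        blobs: [metric.status],
        doubles: [metric.durationMs],
    })
}

// In-memory dataset for local experimentation; it just collects points.
class InMemoryDataset implements Dataset {
    readonly points: DataPoint[] = []

    writeDataPoint(point: DataPoint): void {
        this.points.push(point)
    }
}
```

In a Worker, `dataset` would be the Analytics Engine binding from `env`. A dashboard tool such as Grafana would then query the stored points.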

JavaScript-native RPC

JavaScript-native RPC was added to Workers in April of 2024, and it’s pretty magical. In the scheduler code example above, we call startBuild() on the BuildBuddy DO from the Scheduler DO. Without RPC, we would need to stand up routes on the BuildBuddy fetch() handler for the Scheduler to trigger with a fetch request. With RPC, there is almost no boilerplate — all we need to do is call a method on a class. 

const bb = getBuildBuddy(this.env, build.build_id)


// Starting a build without RPC 😢
await bb.fetch('http://do/api/start_build', {
    method: 'POST',
    body: JSON.stringify(build),
})


// Starting a build with RPC 😸
await bb.startBuild(build)
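
To make the boilerplate difference concrete from the callee's side, here is a small self-contained sketch that models both styles with plain classes. It does not use the Workers runtime. `FakeRequest`, `FetchStyleBuildBuddy`, `RpcStyleBuildBuddy`, and the `Build` shape are hypothetical stand-ins, and the calls are synchronous only to keep the model simple (real Durable Object stubs return promises).

```typescript
// Hypothetical build shape (assumption), reduced to the one field the sketch needs.
interface Build {
    build_id: string
}

// Simplified stand-in for an HTTP request; this is not the Workers Request type.
interface FakeRequest {
    method: string
    path: string
    body: string
}

// Without RPC: the object exposes one fetch() entry point and must
// match the route and parse the body by hand before doing any work.
class FetchStyleBuildBuddy {
    readonly started: string[] = []

    fetch(req: FakeRequest): number {
        if (req.method === 'POST' && req.path === '/api/start_build') {
            const build = JSON.parse(req.body) as Build
            this.started.push(build.build_id)
            return 200
        }
        return 404
    }
}

// With RPC: the caller invokes a typed method directly, so there is
// no routing or serialization code to write on either side.
class RpcStyleBuildBuddy {
    readonly started: string[] = []

    startBuild(build: Build): void {
        this.started.push(build.build_id)
    }
}
```

The fetch-style class needs a route string and a JSON contract that caller and callee must keep in sync. The RPC-style class relies on the method signature alone.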

Conclusion

By using Workers and Durable Objects, we were able to build a complex and distributed system that is easy to understand and is easily scalable. 

It’s been a blast for our team to build on top of the very platform that we work on, something that would have been much harder to achieve on Workers just a few years ago. We believe in being Customer Zero for our own products — to identify pain points firsthand and to continuously improve the developer experience by applying them to our own use cases. It was fulfilling to have our needs as developers met by other teams and then see those tools quickly become available to the rest of the world — we were collaborators and internal testers for Workers Logs and private network support for Hyperdrive (both released on Birthday Week), and the soon to be released container platform.

Opportunities to build complex applications on the Developer Platform have increased in recent years as the platform has matured and expanded product offerings for more use cases. We hope that Workers Builds will be yet another tool in the Workers toolbox that enables developers to spend less time thinking about configuration and more time writing code. 

Want to try it out? Check out the docs to learn more about how to deploy your first project with Workers Builds.

Open Source AI Definition Erodes the Meaning of “Open Source”

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2024/10/31/open-source-ai-osaid-osi.html

[ This is a crosspost from my professional blog at Software Freedom Conservancy (SFC).
I encourage you to use that copy of the post as the canonical linkage for this essay; I
crossposted here merely for posterity and to reach a wider
audience. ]

This week, the Open Source Initiative (OSI) made their new Open
Source Artificial Intelligence Definition (OSAID) official with its 1.0 release. With this
announcement, we have reached the moment that software freedom advocates have
feared for decades: the definition of “open source” —
with which OSI was entrusted — now differs in significant
ways from the views of most software freedom advocates.

There has been substantial acrimony during the drafting process of OSAID, and this blog post does not summarize all the
community complaints about the OSAID and its drafting
process. Other bloggers and the press have covered those. The TLDR here, IMO, is simply stated: the OSAID fails to
require reproducibility by the
public of the scientific process of building these systems, because the OSAID fails to place sufficient
requirements on the licensing and public disclosure of training sets for so-called “Open Source” systems. The
OSI refused to add this requirement because of a fundamental flaw in their process; they decided that “there
was no point in publishing a definition that no existing AI system could
currently meet”. This fundamental compromise undermined the community process, and amplified the role of stakeholders who would financially benefit from OSI’s retroactive declaration that their systems are “open source”. The OSI should have held off on publishing a definition, and instead
labeled this document as “recommendations” for now.

As the publication date of the OSAID approached, I could not help but
remember a fascinating statement that Donald E. Knuth, one of the founders
of the field of computer science, once said: “[M]y role is to be on the bottom of things. … I try to
digest … knowledge into a form that is accessible to people who don’t
have time for such study.” If we wish to engage in the
highly philosophical (and easily politically corruptible) task
of defining what terms like “software freedom” and
“open source” mean, we must learn to be on the “bottom of
things”. OSI made an unforced error in this regard. While they could
have humbly announced this as “recommendations” or “guidelines”,
they instead formalized it as a “definition” — with equivalent authority to their
OSD.

Yet OSI itself turned its attention to AI only recently, when they
announced their “deep dive” — for which Microsoft’s GitHub was OSI’s “Thought Leader”.
OSI has responded too rapidly to this industry ballyhoo. Their celerity of response made OSI
an easy target for regulatory capture.

By comparison, the original OSD was first published in February 1999.
That was at least twelve years after the widespread industry adoption of
various FOSS programs (such as the GNU C Compiler and BSD). The concept was explored and discussed publicly (under the moniker “Free Software”)
for decades before it was officially “defined”.
The OSI announced itself as the “marketing department for Free Software” and
based the OSD in large part on the independently
developed Debian Free Software Guidelines (DFSG). The OSD was thus the
culmination of decades of thought and consideration, and primarily developed
by a third-party (Debian) — which provided a balance on OSI’s authority.
(Interestingly, some folks from Debian are attempting to check OSI’s authority again due to the premature publication of the OSAID.)

OSI claims that they must move quickly so that they can
prevent software companies from coopting
the term “open source” for their own aims. But
OSI failed to pursue trademark protection for “open source” in the early days, so the OSI can’t stop Mark Zuckerberg and his
cronies in any event from using the “open source”
moniker for his Facebook and Instagram products — let alone his
new Llama product.
Furthermore, OSI’s insistence
that the definition was urgently needed and that the definition
be engineered as a retrofit to apply to an existing, available system has yielded troublesome results.
Simply put, OSI has a tiny sample set to examine, in 2024,
of what LLM-backed generative AI systems look like. Making a final decision
about the software freedom and rights implications of such a nascent field led to
an automatic bias to accept the actions of first movers as legitimate.
By making this definition official too soon, OSI has endorsed demonstrably bad LLM-backed generative AI systems
as “open source” by definition!

OSI also disenfranchised the users and content creators in this process.
FOSS activists should be engaging in the larger discussions with
impacted communities of content creators about what “open
source” means to them, and how they feel about incorporation of
their data into the training sets of these third-party systems. The line between data and code is so easily crossed with
these systems that we cannot rely on old, rote conclusions that the
“data is separate and can be proprietary (or even unavailable), and yet the system remains ‘open
source’”. That adage fails us when analyzing this technology,
and we must take careful steps — free from the for-profit corporate
interest of AI fervor — as we decide how our well-established
philosophies apply to these changes.

FOSS activists err when we unilaterally dictate and define what is
ethical, moral, open and Free in areas outside of software. Software rights
theorists can (and should) make meaningful contributions in these
other areas, but not without substantial collaboration with those creative
individuals who produce the source material. Where were the painters, the
novelists, the actors, the playwrights, the musicians, and the poets in the
OSAID drafting process? The OSD was (of course) easier because our
community is mostly programmers and developers (or folks adjacent
to those fields); software creators knew best how to consider philosophical implications of pure software products.
The OSI, and the folks in its leadership, definitely
know software well, but I wouldn’t name any of them (or myself) as great
thinkers in these many areas outside software that are noticeably impacted by the promulgation of
LLMs that are trained on those creative works. The Open Source community remains
consistently in danger of excessive insularity, and the OSAID is an
unfortunate example of how insular we can be.

Meanwhile, I have spent literally months of time over the last 30 years trying to make sure the
coalition of software freedom & rights activists remained in basic
congruence (at least publicly) with those (like OSI) who are oriented towards a more
for-profit and corporate open source approach. Until today, I was always able to say:
“I believe that anything the OSI calls ‘open source’
gives you all the rights and freedoms that you deserve”. I now cannot
say that again unless/until the OSI revokes the OSAID. Unfortunately, that
Rubicon may have now been permanently crossed! OSI
has purposely made it politically unviable for them to
revoke the OSAID. Instead, they plan only incremental updates to the OSAID. Once
entities begin to rely on this definition as written, OSI will find it nearly impossible to
later declare systems that were “open source” under 1.0 as no longer so (under later versions). So, we are likely stuck
with OSAID’s key problems forever. OSI undermines its position as a philosophical leader in Open Source as long as OSAID 1.0 stands as a formal definition.

I truly don’t know for sure (yet) if the only way to respect user rights in an LLM-backed
generative AI system is to only use training sets that are publicly
available and licensed under Free Software licenses. I do believe
that’s the ideal and preferred form for modification of those systems. Nevertheless,
a generally useful technical system that is built by collapsing data (in essence, via highly lossy compression) into a table of floating point numbers
is philosophically much more complicated than binary software and its Corresponding Source. So, having
studied the issue myself, I believe the Socratic Epiphany currently applies. Perhaps there is an acceptable
spot for compromise
regarding the issues of training set licensing, availability and similar reproducibility issues.
My instincts, after 25
years as a software rights philosopher, lead me to believe that it will
take at least a decade for our best minds to find a reasonable answer on where the bright line is of
acceptable behavior with regard to these AI systems. While OSI claims their OSAID is humble, I beg
to differ. The humble act now is to admit that it was just too soon to publish a “definition” and
rebrand the OSAID 1.0 as “current recommendations”. That might not grab as many
headlines or raise as much money as the OSAID did, but it’s the moral and ethical way out of this bad situation.

Finally, rather than merely be a pundit on this matter, I am instead today putting myself forward
to try to be part of the solution. I plan to run for the OSI Board of Directors at the next elections on a single-issue
platform: I will work arduously for my entire term to see the OSAID repealed, and republished
not as a definition, but merely recommendations, and to also issue a statement
that OSI published the definition sooner than was appropriate. I’ll write further about the matter as the
next OSI Board election approaches. I also call on other software rights activists to run with me on a similar platform; the OSI has myriad seats that are elected by different constituents, so there is opportunity to run as a ticket on this issue. (Please contact me privately if you’d like to be involved with this ticket at the next OSI Board election. Note, though, that election results
are not actually binding, as OSI’s by-laws allow the current Board to reject results of the elections.)

Open Source AI Definition Erodes the Meaning of “Open Source”

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2024/10/31/open-source-ai-osaid-osi.html

[ This is
a crosspost
from my professional blog at Software Freedom Conservancy
(SFC)
. I encourage you
to use
that copy of the post as the canonical linkage for this essay — I
crossposted here merely for posterity and to reach a wider
audience. ]

This week, the Open Source Initiative (OSI) made their new Open
Source Artificial Intelligence Definition (OSAID) official with its 1.0 release
. With this
announcement, we have reached the moment that software freedom advocates have
feared for decades: the definition of “open source” —
with which OSI was entrusted — now differs in significant
ways from the views of most software freedom advocates.

There has been substantial acrimony during the drafting process of OSAID, and this blog post does not summarize all the
community complaints about the OSAID and its drafting
process. Other
bloggers

and the
press
have covered those. The
TLDR here,
IMO is simply stated: the OSAID fails to
require reproducibility by the
public of the scientific process of building these systems, because the OSAID fails to place sufficient
requirements on the licensing and public disclosure of training sets for so-called “Open Source” systems. The
OSI refused to add this requirement because of a fundamental flaw in their process; they decided that “there
was no point in publishing a definition that no existing AI system could
currently meet”. This fundamental compromise undermined the community process, and amplified the role of stakeholders who would financially benefit from OSI’s retroactive declaration that their systems are “open source”. The OSI should have refrained from publishing a definition yet, and instead
labeled this document as ”recommendations” for now.

As the publication date of the OSAID approached, I could not help but
remember a fascinating statement that Donald E. Knuth, one of the founders
of the field of computer
science, once
said
: [M]y role is to be on the bottom of things. … I try to
digest … knowledge into a form that is accessible to people who don’t
have time for such study
. If we wish to engage in the
highly philosophical (and easily politically corruptible) task
of defining what terms like “software freedom” and
“open source” mean, we must learn to be on the “bottom of
things”. OSI made an unforced error in this regard. While they could
have humbly announced this as “recommendations” or “guidelines”,
they instead formalized it as a “definition” — with equivalent authority to their
OSD.

Yet, OSI itself only turned its attention to AI only recently, when they
announced their “deep dive” — for which Microsoft’s GitHub was OSI’s “Thought Leader”.
OSI has responded too rapidly to this industry ballyhoo. Their celerity of response made OSI
an easy target for regulatory capture.

By comparison, the original OSD was first published in February 1999.
That was at least twelve years after the widespread industry adoption of
various FOSS programs (such as the GNU C Compiler and BSD). The concept was explored and discussed publicly (under the moniker “Free Software”)
for decades before it was officially “defined”.
The OSI announced itself as the “marketing department for Free Software” and
based the OSD in large part on the independently
developed Debian Free Software Guidelines (DFSG). The OSD was thus the
culmination of decades of thought and consideration, and primarily developed
by a third-party (Debian) — which provided a balance on OSI’s authority.
(Interestingly, some folks from Debian are attempting to check OSI’s authority again due to the premature publication of the OSAID.)

OSI claims that they must move quickly so that they can
counter the software companies from coopting
the term “open source” for their own aims. But
OSI failed to pursue trademark protection for “open source” in the early days, so the OSI can’t stop Mark Zuckerberg and his
cronies in any event from using the “open source”
moniker for his Facebook and Instagram products — let alone his
new Llama product.
Furthermore, OSI’s insistence
that the definition was urgently needed and that the definition
be engineered as a retrofit to apply to an existing, available system has yielded troublesome results.
Simply put, OSI has a tiny sample set to examine, in 2024,
of what LLM-backed generative AI systems look like. To make a final decision
about the software freedom and rights implications of such a nascent field led to
an automatic bias to accept the actions of first movers as legitimate.
By making this definition official too soon, OSI has endorsed demonstrably bad LLM-backed generative AI systems
as “open source” by definition!

OSI also disenfranchised the users and content creators in this process.
FOSS activists should
be engaging with
the larger discussions with
impacted communities of content creators about what “open
source” means to them, and how they feel about incorporation of
their data in the training sets into these third-party systems. The line between data and code is so easily crossed with
these systems that we cannot rely on old, rote conclusions that the
“data is separate and can be proprietary (or even unavailable), and yet the system remains ‘open
source’”. That adage fails us when analyzing this technology,
and we must take careful steps — free from the for-profit corporate
interest of AI fervor — as we decide how our well-established
philosophies apply to these changes.

FOSS activists err when we unilaterally dictate and define what is
ethical, moral, open and Free in areas outside of software. Software rights
theorists can (and should) make meaningful contributions in these
other areas, but not without substantial collaboration with those creative
individuals who produce the source material. Where were the painters, the
novelists, the actors, the playwrights, the musicians, and the poets in the
OSAID drafting process? The OSD was (of course) easier because our
community is mostly programmers and developers (or folks adjacent
to those fields); software creators knew best how to consider philosophical implications of pure software products.
The OSI, and the folks in its leadership, definitely
know software well, but I wouldn’t name any of them (or myself) as great
thinkers in these many areas outside software that are noticeably impacted by the promulgation of
LLMs that are trained on those creative works. The Open Source community remains
consistently in danger of excessive insularity, and the OSAID is an
unfortunate example of how insular we can be.

Meanwhile, I have spent literally months of time over the last 30 years trying to make sure the
coalition of software freedom & rights activists remained in basic
congruence (at least publicly) with those (like OSI) who are oriented towards a more
for-profit and corporate open source approach. Until today, I was always able to say:
“I believe that anything the OSI calls ‘open source’
gives you all the rights and freedoms that you deserve”. I now cannot
say that again unless/until the OSI revokes the OSAID. Unfortunately, that
Rubicon may have now been permanently crossed! OSI
has purposely made it politically unviable for them to
revoke the OSAID. Instead, they plan only incremental updates to the OSAID. Once
entities begin to rely on this definition as written, OSI will find it nearly impossible to
later declare systems that were “open source” under 1.0 as no longer so (under later versions). So, we are likely stuck
with OSAID’s key problems forever. OSI undermines its position as a philosophical leader in Open Source as long as OSAID 1.0 stands as a formal defintion.

I truly don’t know for sure (yet) if the only way to respect user rights in an LLM-backed
generative AI system is to only use training sets that are publicly
available and licensed under Free Software licenses. I do believe
that’s the ideal and preferred form for modification of those systems
. Nevertheless,
a generally useful technical system that is built by collapsing data (in essence, via highly lossy compression) into a table of floating point numbers
is philosophically much more complicated than binary software and its Corresponding Source. So, having
studied the issue myself, I believe the Socratic Epiphany currently applies. Perhaps there is an acceptable
spot for compromise
regarding the issues of training set licensing, availability and similar reproducibility issues.
My instincts, after 25
years as a software rights philosopher, lead me to believe that it will
take at least a decade for our best minds to find a reasonable answer on where the bright line is of
acceptable behavior with regard to these AI systems. While OSI claims their OSAID is humble, I beg
to differ. The humble act now is to admit that it was just too soon to publish a “definition” and
rebrand these the OSAID 1.0 as “current recommendations”. That might not grab as many
headlines or raise as much money as the OSAID did, but it’s the moral and ethical way out of this bad situation.

Finally, rather than merely be a pundit on this matter, I am instead today putting myself forward
to try to be part of the solution. I plan to run for the OSI Board of Directors at the next elections on a single-issue
platform: I will work arduously for my entire term to see the OSAID repealed, and republished
not as a definition, but merely recommendations, and to also issue a statement
that OSI published the definition sooner than was appropriate. I’ll write further about the matter as the
next OSI Board election approaches. I also call on other software rights activists to run with me on a similar platform; the OSI has myriad seats that are elected by different constituents, so there is opportunity to run as a ticket on this issue. (Please contact me privately if you’d like to be involved with this ticket at the next OSI Board election. Note, though, that election results
are not actually binding, as OSI’s by-laws allow the current Board to reject results of the elections
.)

Open Source AI Definition Erodes the Meaning of “Open Source”

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2024/10/31/open-source-ai-osaid-osi.html

[ This is
a crosspost
from my professional blog at Software Freedom Conservancy
(SFC)
. I encourage you
to use
that copy of the post as the canonical linkage for this essay — I
crossposted here merely for posterity and to reach a wider
audience. ]

This week, the Open Source Initiative (OSI) made their new Open
Source Artificial Intelligence Definition (OSAID) official with its 1.0 release
. With this
announcement, we have reached the moment that software freedom advocates have
feared for decades: the definition of “open source” —
with which OSI was entrusted — now differs in significant
ways from the views of most software freedom advocates.

There has been substantial acrimony during the drafting process of OSAID, and this blog post does not summarize all the
community complaints about the OSAID and its drafting
process. Other
bloggers

and the
press
have covered those. The
TLDR here,
IMO is simply stated: the OSAID fails to
require reproducibility by the
public of the scientific process of building these systems, because the OSAID fails to place sufficient
requirements on the licensing and public disclosure of training sets for so-called “Open Source” systems. The
OSI refused to add this requirement because of a fundamental flaw in their process; they decided that “there
was no point in publishing a definition that no existing AI system could
currently meet”. This fundamental compromise undermined the community process, and amplified the role of stakeholders who would financially benefit from OSI’s retroactive declaration that their systems are “open source”. The OSI should have refrained from publishing a definition yet, and instead
labeled this document as ”recommendations” for now.

As the publication date of the OSAID approached, I could not help but
remember a fascinating statement that Donald E. Knuth, one of the founders
of the field of computer
science, once
said
: [M]y role is to be on the bottom of things. … I try to
digest … knowledge into a form that is accessible to people who don’t
have time for such study
. If we wish to engage in the
highly philosophical (and easily politically corruptible) task
of defining what terms like “software freedom” and
“open source” mean, we must learn to be on the “bottom of
things”. OSI made an unforced error in this regard. While they could
have humbly announced this as “recommendations” or “guidelines”,
they instead formalized it as a “definition” — with equivalent authority to their
OSD.

Yet, OSI itself only turned its attention to AI only recently, when they
announced their “deep dive” — for which Microsoft’s GitHub was OSI’s “Thought Leader”.
OSI has responded too rapidly to this industry ballyhoo. Their celerity of response made OSI
an easy target for regulatory capture.

By comparison, the original OSD was first published in February 1999.
That was at least twelve years after the widespread industry adoption of
various FOSS programs (such as the GNU C Compiler and BSD). The concept was explored and discussed publicly (under the moniker “Free Software”)
for decades before it was officially “defined”.
The OSI announced itself as the “marketing department for Free Software” and
based the OSD in large part on the independently
developed Debian Free Software Guidelines (DFSG). The OSD was thus the
culmination of decades of thought and consideration, and primarily developed
by a third-party (Debian) — which provided a balance on OSI’s authority.
(Interestingly, some folks from Debian are attempting to check OSI’s authority again due to the premature publication of the OSAID.)

OSI claims that they must move quickly so that they can
counter the software companies from coopting
the term “open source” for their own aims. But
OSI failed to pursue trademark protection for “open source” in the early days, so the OSI can’t stop Mark Zuckerberg and his
cronies in any event from using the “open source”
moniker for his Facebook and Instagram products — let alone his
new Llama product.
Furthermore, OSI’s insistence
that the definition was urgently needed and that the definition
be engineered as a retrofit to apply to an existing, available system has yielded troublesome results.
Simply put, OSI has a tiny sample set to examine, in 2024,
of what LLM-backed generative AI systems look like. To make a final decision
about the software freedom and rights implications of such a nascent field led to
an automatic bias to accept the actions of first movers as legitimate.
By making this definition official too soon, OSI has endorsed demonstrably bad LLM-backed generative AI systems
as “open source” by definition!

OSI also disenfranchised the users and content creators in this process.
FOSS activists should
be engaging with
the larger discussions with
impacted communities of content creators about what “open
source” means to them, and how they feel about incorporation of
their data in the training sets into these third-party systems. The line between data and code is so easily crossed with
these systems that we cannot rely on old, rote conclusions that the
“data is separate and can be proprietary (or even unavailable), and yet the system remains ‘open
source’”. That adage fails us when analyzing this technology,
and we must take careful steps — free from the for-profit corporate
interest of AI fervor — as we decide how our well-established
philosophies apply to these changes.

FOSS activists err when we unilaterally dictate and define what is
ethical, moral, open and Free in areas outside of software. Software rights
theorists can (and should) make meaningful contributions in these
other areas, but not without substantial collaboration with those creative
individuals who produce the source material. Where were the painters, the
novelists, the actors, the playwrights, the musicians, and the poets in the
OSAID drafting process? The OSD was (of course) easier because our
community is mostly programmers and developers (or folks adjacent
to those fields); software creators knew best how to consider philosophical implications of pure software products.
The OSI, and the folks in its leadership, definitely
know software well, but I wouldn’t name any of them (or myself) as great
thinkers in these many areas outside software that are noticeably impacted by the promulgation of
LLMs that are trained on those creative works. The Open Source community remains
consistently in danger of excessive insularity, and the OSAID is an
unfortunate example of how insular we can be.

Meanwhile, I have spent literally months of time over the last 30 years trying to make sure the
coalition of software freedom & rights activists remained in basic
congruence (at least publicly) with those (like OSI) who are oriented towards a more
for-profit and corporate open source approach. Until today, I was always able to say:
“I believe that anything the OSI calls ‘open source’
gives you all the rights and freedoms that you deserve”. I now cannot
say that again unless/until the OSI revokes the OSAID. Unfortunately, that
Rubicon may have now been permanently crossed! OSI
has purposely made it politically unviable for them to
revoke the OSAID. Instead, they plan only incremental updates to the OSAID. Once
entities begin to rely on this definition as written, OSI will find it nearly impossible to
later declare systems that were “open source” under 1.0 as no longer so (under later versions). So, we are likely stuck
with OSAID’s key problems forever. OSI undermines its position as a philosophical leader in Open Source as long as OSAID 1.0 stands as a formal defintion.

I truly don’t know for sure (yet) if the only way to respect user rights in an LLM-backed
generative AI system is to only use training sets that are publicly
available and licensed under Free Software licenses. I do believe
that’s the ideal and preferred form for modification of those systems
. Nevertheless,
a generally useful technical system that is built by collapsing data (in essence, via highly lossy compression) into a table of floating point numbers
is philosophically much more complicated than binary software and its Corresponding Source. So, having
studied the issue myself, I believe the Socratic Epiphany currently applies. Perhaps there is an acceptable
spot for compromise
regarding the issues of training set licensing, availability and similar reproducibility issues.
My instincts, after 25
years as a software rights philosopher, lead me to believe that it will
take at least a decade for our best minds to find a reasonable answer on where the bright line is of
acceptable behavior with regard to these AI systems. While OSI claims their OSAID is humble, I beg
to differ. The humble act now is to admit that it was just too soon to publish a “definition” and
rebrand these the OSAID 1.0 as “current recommendations”. That might not grab as many
headlines or raise as much money as the OSAID did, but it’s the moral and ethical way out of this bad situation.

Finally, rather than merely be a pundit on this matter, I am instead today putting myself forward
to try to be part of the solution. I plan to run for the OSI Board of Directors at the next elections on a single-issue
platform: I will work arduously for my entire term to see the OSAID repealed and republished
not as a definition, but merely as recommendations, and to have OSI issue a statement
that OSI published the definition sooner than was appropriate. I’ll write further about the matter as the
next OSI Board election approaches. I also call on other software rights activists to run with me on a similar platform; the OSI has myriad seats that are elected by different constituents, so there is opportunity to run as a ticket on this issue. (Please contact me privately if you’d like to be involved with this ticket at the next OSI Board election. Note, though, that election results
are not actually binding, as OSI’s by-laws allow the current Board to reject results of the elections.)

Open Source AI Definition Erodes the Meaning of “Open Source”

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2024/10/31/open-source-ai-osaid-osi.html

[ This is a crosspost from my professional blog at Software Freedom Conservancy (SFC).
I encourage you to use that copy of the post as the canonical linkage for this essay; I
crossposted here merely for posterity and to reach a wider audience. ]

This week, the Open Source Initiative (OSI) made their new Open
Source Artificial Intelligence Definition (OSAID) official with its 1.0 release. With this
announcement, we have reached the moment that software freedom advocates have
feared for decades: the definition of “open source” —
with which OSI was entrusted — now differs in significant
ways from the views of most software freedom advocates.

There has been substantial acrimony during the drafting process of OSAID, and this blog post does not summarize all the
community complaints about the OSAID and its drafting
process. Other bloggers and the press have covered those. The TLDR here, IMO, is simply
stated: the OSAID fails to require reproducibility by the
public of the scientific process of building these systems, because the OSAID fails to place sufficient
requirements on the licensing and public disclosure of training sets for so-called “Open Source” systems. The
OSI refused to add this requirement because of a fundamental flaw in their process; they decided that “there
was no point in publishing a definition that no existing AI system could
currently meet”. This fundamental compromise undermined the community process, and amplified the role of stakeholders who would financially benefit from OSI’s retroactive declaration that their systems are “open source”. The OSI should have refrained from publishing a definition at this stage, and instead
labeled this document as “recommendations” for now.

As the publication date of the OSAID approached, I could not help but
remember a fascinating statement that Donald E. Knuth, one of the founders
of the field of computer science, once said: “[M]y role is to be on the bottom of things. … I try to
digest … knowledge into a form that is accessible to people who don’t
have time for such study”. If we wish to engage in the
highly philosophical (and easily politically corruptible) task
of defining what terms like “software freedom” and
“open source” mean, we must learn to be on the “bottom of
things”. OSI made an unforced error in this regard. While they could
have humbly announced this as “recommendations” or “guidelines”,
they instead formalized it as a “definition” — with equivalent authority to their
OSD.

Yet, OSI itself turned its attention to AI only recently, when they
announced their “deep dive” — for which Microsoft’s GitHub was OSI’s “Thought Leader”.
OSI has responded too rapidly to this industry ballyhoo. Their celerity of response made OSI
an easy target for regulatory capture.

By comparison, the original OSD was first published in February 1999.
That was at least twelve years after the widespread industry adoption of
various FOSS programs (such as the GNU C Compiler and BSD). The concept was explored and discussed publicly (under the moniker “Free Software”)
for decades before it was officially “defined”.
The OSI announced itself as the “marketing department for Free Software” and
based the OSD in large part on the independently
developed Debian Free Software Guidelines (DFSG). The OSD was thus the
culmination of decades of thought and consideration, and primarily developed
by a third party (Debian), which provided a balance on OSI’s authority.
(Interestingly, some folks from Debian are attempting to check OSI’s authority again due to the premature publication of the OSAID.)

OSI claims that they must move quickly so that they can
keep software companies from coopting
the term “open source” for their own aims. But
OSI failed to pursue trademark protection for “open source” in the early days, so the OSI can’t stop Mark Zuckerberg and his
cronies in any event from using the “open source”
moniker for his Facebook and Instagram products — let alone his
new Llama product.
Furthermore, OSI’s insistence
that the definition was urgently needed and that the definition
be engineered as a retrofit to apply to an existing, available system has yielded troublesome results.
Simply put, OSI has a tiny sample set to examine, in 2024,
of what LLM-backed generative AI systems look like. Making a final decision
about the software freedom and rights implications of such a nascent field led to
an automatic bias to accept the actions of first movers as legitimate.
By making this definition official too soon, OSI has endorsed demonstrably bad LLM-backed generative AI systems
as “open source” by definition!

OSI also disenfranchised the users and content creators in this process.
FOSS activists should be engaging in the larger discussions with
impacted communities of content creators about what “open
source” means to them, and how they feel about incorporation of
their data into the training sets of these third-party systems. The line between data and code is so easily crossed with
these systems that we cannot rely on old, rote conclusions that the
“data is separate and can be proprietary (or even unavailable), and yet the system remains ‘open
source’”. That adage fails us when analyzing this technology,
and we must take careful steps — free from the for-profit corporate
interest of AI fervor — as we decide how our well-established
philosophies apply to these changes.

FOSS activists err when we unilaterally dictate and define what is
ethical, moral, open and Free in areas outside of software. Software rights
theorists can (and should) make meaningful contributions in these
other areas, but not without substantial collaboration with those creative
individuals who produce the source material. Where were the painters, the
novelists, the actors, the playwrights, the musicians, and the poets in the
OSAID drafting process? The OSD was (of course) easier because our
community is mostly programmers and developers (or folks adjacent
to those fields); software creators knew best how to consider philosophical implications of pure software products.
The OSI, and the folks in its leadership, definitely
know software well, but I wouldn’t name any of them (or myself) as great
thinkers in these many areas outside software that are noticeably impacted by the promulgation of
LLMs that are trained on those creative works. The Open Source community remains
consistently in danger of excessive insularity, and the OSAID is an
unfortunate example of how insular we can be.

Meanwhile, I have spent literally months of time over the last 30 years trying to make sure the
coalition of software freedom & rights activists remained in basic
congruence (at least publicly) with those (like OSI) who are oriented towards a more
for-profit and corporate open source approach. Until today, I was always able to say:
“I believe that anything the OSI calls ‘open source’
gives you all the rights and freedoms that you deserve”. I now cannot
say that again unless/until the OSI revokes the OSAID. Unfortunately, that
Rubicon may have now been permanently crossed! OSI
has purposely made it politically unviable for them to
revoke the OSAID. Instead, they plan only incremental updates to the OSAID. Once
entities begin to rely on this definition as written, OSI will find it nearly impossible to
later declare systems that were “open source” under 1.0 as no longer so (under later versions). So, we are likely stuck
with OSAID’s key problems forever. OSI undermines its position as a philosophical leader in Open Source as long as OSAID 1.0 stands as a formal definition.

I truly don’t know for sure (yet) if the only way to respect user rights in an LLM-backed
generative AI system is to only use training sets that are publicly
available and licensed under Free Software licenses. I do believe
that’s the ideal and preferred form for modification of those systems. Nevertheless,
a generally useful technical system that is built by collapsing data (in essence, via highly lossy compression) into a table of floating point numbers
is philosophically much more complicated than binary software and its Corresponding Source. So, having
studied the issue myself, I believe the Socratic Epiphany currently applies. Perhaps there is an acceptable
spot for compromise
regarding the issues of training set licensing, availability and similar reproducibility issues.
My instincts, after 25
years as a software rights philosopher, lead me to believe that it will
take at least a decade for our best minds to find a reasonable answer on where the bright line of
acceptable behavior lies with regard to these AI systems. While OSI claims their OSAID is humble, I beg
to differ. The humble act now is to admit that it was just too soon to publish a “definition” and
rebrand the OSAID 1.0 as “current recommendations”. That might not grab as many
headlines or raise as much money as the OSAID did, but it’s the moral and ethical way out of this bad situation.

Finally, rather than merely be a pundit on this matter, I am instead today putting myself forward
to try to be part of the solution. I plan to run for the OSI Board of Directors at the next elections on a single-issue
platform: I will work arduously for my entire term to see the OSAID repealed and republished
not as a definition, but merely as recommendations, and to have OSI issue a statement
that OSI published the definition sooner than was appropriate. I’ll write further about the matter as the
next OSI Board election approaches. I also call on other software rights activists to run with me on a similar platform; the OSI has myriad seats that are elected by different constituents, so there is opportunity to run as a ticket on this issue. (Please contact me privately if you’d like to be involved with this ticket at the next OSI Board election. Note, though, that election results
are not actually binding, as OSI’s by-laws allow the current Board to reject results of the elections.)
