Open Source AI Definition Erodes the Meaning of “Open Source”

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2024/10/31/open-source-ai-osaid-osi.html

[ This is a crosspost from my professional blog at Software Freedom Conservancy
(SFC). I encourage you to use that copy of the post as the canonical linkage
for this essay; I crossposted here merely for posterity and to reach a wider
audience. ]

This week, the Open Source Initiative (OSI) made their new Open
Source Artificial Intelligence Definition (OSAID) official with its 1.0 release. With this
announcement, we have reached the moment that software freedom advocates have
feared for decades: the definition of “open source” —
with which OSI was entrusted — now differs in significant
ways from the views of most software freedom advocates.

There has been substantial acrimony during the drafting process of OSAID, and this blog post does not summarize all the
community complaints about the OSAID and its drafting
process. Other bloggers and the press have covered those. The
TLDR here, IMO, is simply stated: the OSAID fails to
require reproducibility by the
public of the scientific process of building these systems, because the OSAID fails to place sufficient
requirements on the licensing and public disclosure of training sets for so-called “Open Source” systems. The
OSI refused to add this requirement because of a fundamental flaw in their process; they decided that “there
was no point in publishing a definition that no existing AI system could
currently meet”. This fundamental compromise undermined the community process, and amplified the role of stakeholders who would financially benefit from OSI’s retroactive declaration that their systems are “open source”. The OSI should have refrained from publishing a definition for now, and instead
labeled this document as “recommendations”.

As the publication date of the OSAID approached, I could not help but
remember a fascinating statement that Donald E. Knuth, one of the founders
of the field of computer science, once made: “[M]y role is to be on the bottom of things. … I try to
digest … knowledge into a form that is accessible to people who don’t
have time for such study.” If we wish to engage in the
highly philosophical (and easily politically corruptible) task
of defining what terms like “software freedom” and
“open source” mean, we must learn to be on the “bottom of
things”. OSI made an unforced error in this regard. While they could
have humbly announced this as “recommendations” or “guidelines”,
they instead formalized it as a “definition” — with equivalent authority to their
OSD.

Yet, OSI itself turned its attention to AI only recently, when they
announced their “deep dive” — for which Microsoft’s GitHub was OSI’s “Thought Leader”.
OSI has responded too rapidly to this industry ballyhoo. Their celerity of response made OSI
an easy target for regulatory capture.

By comparison, the original OSD was first published in February 1999.
That was at least twelve years after the widespread industry adoption of
various FOSS programs (such as the GNU C Compiler and BSD). The concept was explored and discussed publicly (under the moniker “Free Software”)
for decades before it was officially “defined”.
The OSI announced itself as the “marketing department for Free Software” and
based the OSD in large part on the independently
developed Debian Free Software Guidelines (DFSG). The OSD was thus the
culmination of decades of thought and consideration, and primarily developed
by a third-party (Debian) — which provided a balance on OSI’s authority.
(Interestingly, some folks from Debian are attempting to check OSI’s authority again due to the premature publication of the OSAID.)

OSI claims that they must move quickly so that they can
keep software companies from coopting
the term “open source” for their own aims. But
OSI failed to pursue trademark protection for “open source” in the early days, so the OSI can’t stop Mark Zuckerberg and his
cronies in any event from using the “open source”
moniker for his Facebook and Instagram products — let alone his
new Llama product.
Furthermore, OSI’s insistence
that the definition was urgently needed and that the definition
be engineered as a retrofit to apply to an existing, available system has yielded troublesome results.
Simply put, OSI has a tiny sample set to examine, in 2024,
of what LLM-backed generative AI systems look like. Making a final decision
about the software freedom and rights implications of such a nascent field led to
an automatic bias to accept the actions of first movers as legitimate.
By making this definition official too soon, OSI has endorsed demonstrably bad LLM-backed generative AI systems
as “open source” by definition!

OSI also disenfranchised the users and content creators in this process.
FOSS activists should be engaging in the larger discussions with
impacted communities of content creators about what “open
source” means to them, and how they feel about the incorporation of
their data into the training sets of these third-party systems. The line between data and code is so easily crossed with
these systems that we cannot rely on old, rote conclusions that the
“data is separate and can be proprietary (or even unavailable), and yet the system remains ‘open
source’”. That adage fails us when analyzing this technology,
and we must take careful steps — free from the for-profit corporate
interest of AI fervor — as we decide how our well-established
philosophies apply to these changes.

FOSS activists err when we unilaterally dictate and define what is
ethical, moral, open and Free in areas outside of software. Software rights
theorists can (and should) make meaningful contributions in these
other areas, but not without substantial collaboration with those creative
individuals who produce the source material. Where were the painters, the
novelists, the actors, the playwrights, the musicians, and the poets in the
OSAID drafting process? The OSD was (of course) easier because our
community is mostly programmers and developers (or folks adjacent
to those fields); software creators knew best how to consider philosophical implications of pure software products.
The OSI, and the folks in its leadership, definitely
know software well, but I wouldn’t name any of them (or myself) as great
thinkers in these many areas outside software that are noticeably impacted by the promulgation of
LLMs that are trained on those creative works. The Open Source community remains
consistently in danger of excessive insularity, and the OSAID is an
unfortunate example of how insular we can be.

Meanwhile, I have spent literally months of time over the last 30 years trying to make sure the
coalition of software freedom & rights activists remained in basic
congruence (at least publicly) with those (like OSI) who are oriented towards a more
for-profit and corporate open source approach. Until today, I was always able to say:
“I believe that anything the OSI calls ‘open source’
gives you all the rights and freedoms that you deserve”. I now cannot
say that again unless/until the OSI revokes the OSAID. Unfortunately, that
Rubicon may have now been permanently crossed! OSI
has purposely made it politically unviable for them to
revoke the OSAID. Instead, they plan only incremental updates to the OSAID. Once
entities begin to rely on this definition as written, OSI will find it nearly impossible to
later declare systems that were “open source” under 1.0 as no longer so (under later versions). So, we are likely stuck
with OSAID’s key problems forever. OSI undermines its position as a philosophical leader in Open Source as long as OSAID 1.0 stands as a formal definition.

I truly don’t know for sure (yet) if the only way to respect user rights in an LLM-backed
generative AI system is to only use training sets that are publicly
available and licensed under Free Software licenses. I do believe
that’s the ideal and preferred form for modification of those systems. Nevertheless,
a generally useful technical system that is built by collapsing data (in essence, via highly lossy compression) into a table of floating point numbers
is philosophically much more complicated than binary software and its Corresponding Source. So, having
studied the issue myself, I believe the Socratic Epiphany currently applies. Perhaps there is an acceptable
spot for compromise regarding training set licensing, availability, and similar reproducibility issues.
My instincts, after 25
years as a software rights philosopher, lead me to believe that it will
take at least a decade for our best minds to find a reasonable answer on where the bright line is of
acceptable behavior with regard to these AI systems. While OSI claims their OSAID is humble, I beg
to differ. The humble act now is to admit that it was just too soon to publish a “definition” and
rebrand the OSAID 1.0 as “current recommendations”. That might not grab as many
headlines or raise as much money as the OSAID did, but it’s the moral and ethical way out of this bad situation.

Finally, rather than merely be a pundit on this matter, I am instead today putting myself forward
to try to be part of the solution. I plan to run for the OSI Board of Directors at the next elections on a single-issue
platform: I will work arduously for my entire term to see the OSAID repealed, and republished
not as a definition, but merely as recommendations, and to also issue a statement
that OSI published the definition sooner than was appropriate. I’ll write further about the matter as the
next OSI Board election approaches. I also call on other software rights activists to run with me on a similar platform; the OSI has myriad seats that are elected by different constituents, so there is opportunity to run as a ticket on this issue. (Please contact me privately if you’d like to be involved with this ticket at the next OSI Board election. Note, though, that election results
are not actually binding, as OSI’s by-laws allow the current Board to reject results of the elections.)
