Sometimes I’m a helpless victim of my urges.
A while back – very late in 2016 – I started work on a program called loccount. This project originally had two purposes.
One is that I wanted a better, faster replacement for David Wheeler’s sloccount tool, which I was using to collect statistics on the amount of virtuous code shrinkage in NTPsec. David is good people and sloccount is a good idea, but internally it’s a slow and messy pile of kludges – so much so that it seems to have exceeded his capacity to maintain; at time of writing in 2019 it hadn’t been updated since 2004. I’d been thinking about writing a better replacement, in Python, for a while.
Then I realized this problem was about perfectly sized to be my learn-Go project. Small enough to be tractable, large enough to not be entirely trivial. And there was the interesting prospect of using channels/goroutines to parallelize the data collection. I got it working well enough for NTP statistics pretty quickly, though I didn’t issue a first public release until a little over a month later (mainly because I wanted to have a good test corpus in place to demonstrate correctness). And the parallelized code was both satisfyingly fast and really pretty. I was quite pleased.
The only problem was that the implementation, having been a deliberately straight translation of sloccount’s algorithms in order to preserve comparability of the reports, was a bit of a grubby pile internally. Less so than sloccount’s, because it was all in one language, but still. It’s difficult for me to leave that kind of thing alone; the urge to clean it up becomes like a maddening itch.
The rest of this post is about what happened when I succumbed. I got two lessons from this experience: one reinforcement of a thing I already knew, and one who-would-have-thought-it-could-go-this-far surprise. I also learned some interesting things about the landscape of programming languages.
For those of you who haven’t used sloccount, it counts source lines of code (SLOC) in a code tree. That is, lines that are non-blank and not comments. SLOC counts are useful for complexity estimation and predicting defect incidence. SLOC is especially good for tracking how the complexity (and hence, for example, the potential attack surface) of a codebase has changed over time. There are more elaborate ways of making numbers that aim at similar usefulness; the best known is called “cyclomatic complexity” and there’s another metric called LLOC (logical lines of code) that loccount can also tally for many languages. But it turns out empirically that it’s quite difficult to do better than SLOC for forecasting defect incidence, at least at the present state of our knowledge. More complex measures don’t seem to gain much; there is some evidence that LLOC does but even that is disputable.
I wanted to use loccount to post hard numbers about the continuing SLOC shrinkage of NTPsec. We’ve shrunk it by a little better than 4:1 since we forked it, from 231 KLOC to 55 KLOC, and that’s a big deal – a large reduction in attack surface and an equally large improvement in auditability. People have trouble believing a change that dramatic unless you can show them numbers and how to reproduce those numbers; I wanted to be able to say “Here’s loccount and its test suite. Here’s our repository. See for yourself.”
But the code was still grubby, with multiple different rather ad-hoc parsers in it inherited from sloccount for different categories of languages; sloccount supports 27 in total. I didn’t like that, and over the next two years I would occasionally return to the code, gradually refactoring and simplifying.
Why parsers? Think about what you have to do to count SLOC. You need to run through a source file recognizing the following events:
- Start of comment.
- End of comment.
- Newline (so you can bump the SLOC counter when you’re outside a comment).
- Start of string literal (so you start ignoring what would otherwise look like a comment start).
- End of string literal (so you stop ignoring comment starts).
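The event loop above can be sketched as a small state machine. This is a minimal illustration hardwired for C-style block comments and double-quoted strings, not loccount’s actual code; winged comments and the other syntactic variations the post goes on to discuss are omitted.

```go
// countSLOC sketches the SLOC event loop: scan the file, track whether
// we are in normal code, inside a block comment, or inside a string,
// and bump the counter at each newline that saw code outside a comment.
func countSLOC(src string) int {
	const (
		normal = iota
		inComment
		inString
	)
	state := normal
	sloc := 0
	sawCode := false // true if this line has code outside comments
	for i := 0; i < len(src); i++ {
		c := src[i]
		switch state {
		case normal:
			if c == '/' && i+1 < len(src) && src[i+1] == '*' {
				state = inComment // start of comment
				i++
			} else if c == '"' {
				state = inString // start of string literal
				sawCode = true
			} else if c == '\n' {
				if sawCode {
					sloc++
				}
				sawCode = false
			} else if c != ' ' && c != '\t' {
				sawCode = true
			}
		case inComment:
			if c == '*' && i+1 < len(src) && src[i+1] == '/' {
				state = normal // end of comment
				i++
			} else if c == '\n' {
				if sawCode {
					sloc++
				}
				sawCode = false
			}
		case inString:
			if c == '\\' {
				i++ // skip escaped character
			} else if c == '"' {
				state = normal // end of string
			} else if c == '\n' {
				if sawCode {
					sloc++
				}
				sawCode = false
				state = normal // tolerate unterminated string
			}
		}
	}
	if sawCode {
		sloc++ // file may not end with a newline
	}
	return sloc
}
```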
This is complicated by the fact that in any given language there may be multiple syntaxes for start or end comment and start or end string literal. C and most other languages have two styles of comment; /* the block comment, which can contain newlines and has an explicit end delimiter */, versus // the winged comment, which can’t contain newlines and ends at the next one.
The most naive way to handle this simple grammar would be to write a custom parser for each language. Sometimes, when the language syntax is especially grotty, you have to do that. There’s still a bespoke parser for Perl inside loccount. But most languages fall into fairly large syntactic families in which all the members can be parsed alike. Sloccount, supporting 27 languages, had at least four of these: C-like, Pascal-like, scripting-language-like, and Fortran-like.
Now, if you’re me, learning that most languages fall into families like that is interesting in itself. But the number of outliers that don’t fit, like Lisp and assembler (which sloccount can tally) and really odd ones like APL (which it can’t) is annoying. I can’t look at code like that without wanting to clean it up, simplify it, fold special cases into general ones, and learn about the problem space as I do so. Victim of my urges, I am.
I’ve written more parsers for messy data than I can remember, so I know the strategies for this. Let’s call the bespoke-parser-for-every-language level 0, and recognizing language families level 1. Sloccount was already at level 1 and loccount inherited that. Level 2 would be reducing the number of distinct parsers, ideally to one. How do we do that?
The key observation is that the families differ mainly in which literal tokens mark the same events. So instead of two parsers with two functions “is there a /* next?” and “is there a (* next?”, you write one function “is there a block comment start next?” which dispatches to looking for /* or (*. Your “is there a winged comment start?” function looks for // in a source file with a .c extension and always returns false in a file with a .pas extension.
That’s level 2. You’ve replaced “look for this family-specific syntax” with “look for the start-block-comment terminal”, with the token-getter figuring out what to match based on which family the file is in. But that doesn’t handle idiosyncratic languages like – say – D, which uses /+ +/. Well, not unless you’re willing to let your start-block-comment matcher be a huge ugly switch statement with an arm not just for each family but for each outlier file extension.
And you’d need parallel big ugly switches in every other terminal – the one for end of block comment, the one for winged comment, the one for string delimiter, et tiresome cetera. Don’t do that; it’s ugly and hard to maintain.
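To make the antipattern concrete, here is what one of those level-2 dispatch functions looks like; the extensions and delimiters are illustrative, not loccount’s real code. Every new language adds an arm here, and a matching arm in every sibling function for the other terminals.

```go
// startBlockComment is the level-2 approach the text warns against:
// one dispatch function per terminal, each switching on file extension.
func startBlockComment(ext string) string {
	switch ext {
	case ".c", ".h", ".java": // C family
		return "/*"
	case ".pas": // Pascal family
		return "(*"
	case ".d": // outlier: D's nestable comments
		return "/+"
	default:
		return "" // no block comments in this language
	}
}
```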
Level 3 is where you change the big switch/case into a traits table. At start of processing you use the file extension to figure out which traits entry to use. You read your start and end block comment from that entry. It may have flags in it which declare, for example “this language uses single-quoted strings”, or “block comments nest in this language”.
This move from executable code to a traits table is really important, because it drastically reduces the complexity cost of adding support for new languages. Usually, now, each new one is just a table entry that can be easily composed from a read of the syntax part of the language specification. Declarative is greater than imperative!
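A level-3 traits table might look like the sketch below. The field names and entries are invented for illustration; loccount’s actual table differs in detail, but the shape is the point: each language is one declarative entry rather than arms scattered across many switch statements.

```go
// languageTrait describes one language's comment and string syntax.
type languageTrait struct {
	name          string
	blockStart    string // e.g. "/*" or "(*"
	blockEnd      string // e.g. "*/" or "*)"
	wingedComment string // e.g. "//" or "#"
	singleQuoted  bool   // language uses single-quoted strings
	nestComments  bool   // block comments nest
}

// traitsByExtension maps a file extension to its traits entry.
// Adding a new language is now just another line in this table.
var traitsByExtension = map[string]languageTrait{
	".c":   {"C", "/*", "*/", "//", true, false},
	".pas": {"Pascal", "(*", "*)", "//", true, false},
	".d":   {"D", "/+", "+/", "//", true, true},
}
```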
The project NEWS file and repo history tell us how dramatic this effect is. In the first two years of the project’s lifetime I added about 14 languages to the original set of 27, less than one a month. Late last month I got the ontology of the traits table about right after staring long enough at the languages I’d been supporting. I immediately added five languages on one day.
Then I hit a bump because I needed more trait flags to handle the next couple of languages. But as I added the next half-dozen or so (Algol 60, F#, Dart, Cobra, VRML, Modula-2, and Kotlin) I noticed something interesting; the frequency with which I needed to add trait flags was dropping as I went. I was beginning to mine out the space of relevant syntactic variations.
Now I want to draw a contrast with the project I used to define the term Zeno tarpit. On doclifter I can always make more progress towards lifting the last few percent of increasingly weird malformed legacy documents out of the nroff muck if I’m willing to push hard enough, but the cost is that the pain never ends. Each increment of progress normally requires more effort than the last.
In loccount I’m clearly heading in the opposite direction. In the last week (and remember most of my attention is on NTPsec; this is a side project) I’ve added no fewer than 34 new languages and markup formats to loccount. My time to put in a new one is down to minutes – skim the spec, fill in a new table entry, drop a sample into the test directory, eyeball the results to make sure they’re sane, and update the news file.
The contrast with the effort required to get another 0.1% down the long tail of the man-page corpus is really extreme. What makes the difference?
OK, yes, the nroff-lifting problem is more complicated. You’re parsing for many more different kinds of terminals and relations among them. That’s true but it’s not a “true” that generates much insight. No; the real difference is that in that domain, I still have to write a parser extension – executable code – in order to make progress. In loccount, on the other hand, I’m just composing new table entries. Declarative is greater than imperative!
Actually it has become a game; I’ve made a couple of passes through Wikipedia’s List of programming languages looking for plausible candidates. I’ve put in historical relics like SNOBOL4 and B simply because I could. When you’re on a stamp-collecting streak like this, it’s natural to start wondering just how far you can push it. A victim of my urges sometimes, am I.
I do have some limits, though. I haven’t gone after esoteric languages, because those are often deliberately constructed to be so syntactically bizarre that upgrading my parser would be more like work than fun. And I really don’t care about the 17th dialect of Business BASIC or yet another proprietary database language, thank you. Also I have in general not added languages that I judged to be academic toys intended more to generate research papers than production code, though I gave in on a few I thought to be of particular historical interest.
The lesson here is one I know I’ve written about before, but it deserves reinforcing. Declarative is greater than imperative. Simple code plus smart data is better than enough smart code to do the same job. The move from four parsers to one parser and a big trait table was a win on every level – the result is easier to understand and much easier to extend. This is still an underutilized design strategy.
There is a level 4. I may yet do something like what we did with Open Adventure and move all that smart data from a table initializer to a YAML file compiled back into the trait table at build time. Then adding new languages would usually just be an edit to the YAML, not touching the Go code at all.
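A level-4 entry might look something like this; the schema is hypothetical, sketched by analogy with the Go table above, not an actual loccount format.

```yaml
# Hypothetical YAML form of one traits entry, compiled into the
# Go trait table at build time. Field names are invented.
- name: D
  extensions: [".d"]
  block_comment: ["/+", "+/"]
  winged_comment: "//"
  single_quoted_strings: true
  nest_comments: true
```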
I think at this point I’ve come pretty close to entirely excavating the space of production languages that might reasonably appear on a Unix system today. Now prove me wrong. Throw a language loccount -s doesn’t already list at me and let’s see if my generic parser and trait flags can cope.
Some restrictions: No esolangs, please; nothing with only closed-source implementations; nothing outright extinct; no fad-of-the-week webdev crap; no academic toys only ever touched by the designer’s research group. Exceptions to these restrictions may be made in cases of exceptional historical interest – it amused me to add B and SNOBOL4 and I’d do it again.
Oh, and the current count of languages? 117. The real surprise is that supporting 117 languages was this easy. There is less variation out there than one might suppose.