Object models

Post Syndicated from Eevee original https://eev.ee/blog/2017/11/28/object-models/

Anonymous asks, with dollars:

More about programming languages!

Well then!

I’ve written before about what I think objects are: state and behavior, which in practice mostly means method calls.

I suspect that the popular impression of what objects are, and also how they should work, comes from whatever C++ and Java happen to do. From that point of view, the whole post above is probably nonsense. If the baseline notion of “object” is a rigid definition woven tightly into the design of two massively popular languages, then it doesn’t even make sense to talk about what “object” should mean — it does mean the features of those languages, and cannot possibly mean anything else.

I think that’s a shame! It piles a lot of baggage onto a fairly simple idea. Polymorphism, for example, has nothing to do with objects — it’s an escape hatch for static type systems. Inheritance isn’t the only way to reuse code between objects, but it’s the easiest and fastest one, so it’s what we get. Frankly, it’s much closer to a speed tradeoff than a fundamental part of the concept.

We could do with more experimentation around how objects work, but that’s impossible in the languages most commonly thought of as object-oriented.

Here, then, is a (very) brief run through the inner workings of objects in four very dynamic languages. I don’t think I really appreciated objects until I’d spent some time with Python, and I hope this can help someone else whet their own appetite.

Python 3

Of the four languages I’m going to touch on, Python will look the most familiar to the Java and C++ crowd. For starters, it actually has a class construct.

class Vector:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __neg__(self):
        return Vector(-self.x, -self.y)

    def __truediv__(self, denom):
        return Vector(self.x / denom, self.y / denom)

    @property
    def magnitude(self):
        return (self.x ** 2 + self.y ** 2) ** 0.5

    def normalized(self):
        return self / self.magnitude

The __init__ method is an initializer, which is like a constructor but named differently (because the object already exists in a usable form by the time the initializer is called). Operator overloading is done by implementing methods with other special __dunder__ names. Properties can be created with @property, where the @ is syntax for applying a wrapper function to a function as it’s defined. You can do inheritance, even multiply:

class Foo(A, B, C):
    def bar(self, x, y, z):
        # do some stuff
        super().bar(x, y, z)

Cool, a very traditional object model.

Except… for some details.

Some details

For one, Python objects don’t have a fixed layout. Code both inside and outside the class can add or remove whatever attributes they want from whatever object they want. The underlying storage is just a dict, Python’s mapping type. (Or, rather, something like one. Also, it’s possible to change this, which will probably be the case for everything I say here.)

If you create some attributes at the class level, you’ll start to get a peek behind the curtains:

class Foo:
    values = []

    def add_value(self, value):
        self.values.append(value)

a = Foo()
b = Foo()
a.add_value('a')
print(a.values)  # ['a']
b.add_value('b')
print(b.values)  # ['a', 'b']

The [] assigned to values isn’t a default assigned to each object. In fact, the individual objects don’t know about it at all! You can use vars(a) to get at the underlying storage dict, and you won’t see a values entry in there anywhere.

Instead, values lives on the class, which is a value (and thus an object) in its own right. When Python is asked for self.values, it checks to see if self has a values attribute; in this case, it doesn’t, so Python keeps going and asks the class for one.

Python’s object model is secretly prototypical — a class acts as a prototype, as a shared set of fallback values, for its objects.
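
You can watch the fallback in action with a trivial example of my own:

class Foo:
    pass

f = Foo()
Foo.value = 1    # add an attribute to the class after the fact
print(f.value)   # 1 -- not found on f, so the class is consulted
f.value = 2      # this assignment goes into f's own storage
print(f.value)   # 2 -- found on f directly, shadowing the class
del f.value
print(f.value)   # 1 -- back to the class fallback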

In fact, this is also how method calls work! They aren’t syntactically special at all, which you can see by separating the attribute lookup from the call.

print("abc".startswith("a"))  # True
meth = "abc".startswith
print(meth("a"))  # True

Reading obj.method looks for a method attribute; if there isn’t one on obj, Python checks the class. Here, it finds one: it’s a function from the class body.

Ah, but wait! In the code I just showed, meth seems to “know” the object it came from, so it can’t just be a plain function. If you inspect the resulting value, it claims to be a “bound method” or “built-in method” rather than a function, too. Something funny is going on here, and that funny something is the descriptor protocol.

Descriptors

Python allows attributes to implement their own custom behavior when read from or written to. Such an attribute is called a descriptor. I’ve written about them before, but here’s a quick overview.

If Python looks up an attribute, finds it in a class, and the value it gets has a __get__ method… then instead of using that value, Python will use the return value of its __get__ method.

The @property decorator works this way. The magnitude property in my original example was shorthand for doing this:

class MagnitudeDescriptor:
    def __get__(self, instance, owner):
        if instance is None:
            return self
        return (instance.x ** 2 + instance.y ** 2) ** 0.5

class Vector:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    magnitude = MagnitudeDescriptor()

When you ask for somevec.magnitude, Python checks somevec but doesn’t find magnitude, so it consults the class instead. The class does have a magnitude, and it’s a value with a __get__ method, so Python calls that method and somevec.magnitude evaluates to its return value. (The instance is None check is because __get__ is called even if you get the descriptor directly from the class via Vector.magnitude. A descriptor intended to work on instances can’t do anything useful in that case, so the convention is to return the descriptor itself.)

You can also intercept attempts to write to or delete an attribute, and do absolutely whatever you want instead. But note that, similar to operator overloading in Python, the descriptor must be on a class; you can’t just slap one on an arbitrary object and have it work.
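
Here’s a sketch of the write side: a little descriptor that validates assignment. (Positive and Order are throwaway names; __set_name__ is a hook Python calls with the attribute’s name at class creation time, available since 3.6.)

class Positive:
    def __set_name__(self, owner, name):
        # Python calls this at class creation time with the attribute name
        self.name = name

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        if value <= 0:
            raise ValueError(self.name + " must be positive")
        # stash the value in the instance's own storage
        instance.__dict__[self.name] = value

class Order:
    amount = Positive()

order = Order()
order.amount = 5     # fine
order.amount = -1    # raises ValueError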

This brings me right around to how “bound methods” actually work. Functions are descriptors! The function type implements __get__, and when a function is retrieved from a class via an instance, that __get__ bundles the function and the instance together into a tiny bound method object. It’s essentially:

class FunctionType:
    def __get__(self, instance, owner):
        if instance is None:
            return self
        return functools.partial(self, instance)

The self passed as the first argument to methods is not special or magical in any way. It’s built out of a few simple pieces that are also readily accessible to Python code.

Note also that because obj.method() is just an attribute lookup and a call, Python doesn’t actually care whether method is a method on the class or just some callable thing on the object. You won’t get the auto-self behavior if it’s on the object, but otherwise there’s no difference.
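
A trivial demonstration:

class Foo:
    def method(self):
        return "defined on the class"

f = Foo()
print(f.method())    # defined on the class -- self was passed automatically

# stick a plain callable directly on the object
f.method = lambda: "defined on the object"
print(f.method())    # defined on the object -- no self this time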

More attribute access, and the interesting part

Descriptors are one of several ways to customize attribute access. Classes can implement __getattr__ to intervene when an attribute isn’t found on an object; __setattr__ and __delattr__ to intervene when any attribute is set or deleted; and __getattribute__ to implement unconditional attribute access. (That last one is a fantastic way to create accidental recursion, since any attribute access you do within __getattribute__ will of course call __getattribute__ again.)
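
The standard way out of that trap is to delegate to object’s implementation instead of recursing:

class Logged:
    def __getattribute__(self, name):
        print("accessing:", name)
        # delegate the real lookup to object's implementation,
        # rather than touching self.anything and recursing forever
        return object.__getattribute__(self, name)

obj = Logged()
obj.x = 3
print(obj.x)    # prints "accessing: x", then 3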

Here’s what I really love about Python. It might seem like a magical special case that descriptors only work on classes, but it really isn’t. You could implement exactly the same behavior yourself, in pure Python, using only the things I’ve just told you about. Classes are themselves objects, remember, and they are instances of type, so the reason descriptors only work on classes is that type effectively does this:

class type:
    def __getattribute__(self, name):
        value = super().__getattribute__(name)
        # like all op overloads, __get__ must be on the type, not the instance
        ty = type(value)
        if hasattr(ty, '__get__'):
            # it's a descriptor!  this is a class access so there is no instance
            return ty.__get__(value, None, self)
        else:
            return value

You can even trivially prove to yourself that this is what’s going on by skipping over type’s behavior:

class Descriptor:
    def __get__(self, instance, owner):
        print('called!')

class Foo:
    bar = Descriptor()

Foo.bar  # called!
type.__getattribute__(Foo, 'bar')  # called!
object.__getattribute__(Foo, 'bar')  # ...

And that’s not all! The mysterious super function, used to exhaustively traverse superclass method calls even in the face of diamond inheritance, can also be expressed in pure Python using these primitives. You could write your own superclass calling convention and use it exactly the same way as super.
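
I won’t reproduce real super here, but as a rough proof of concept, here’s a toy version built on nothing but the MRO and the descriptor protocol. (It ignores plenty of edge cases; manual_super is just a throwaway name.)

def manual_super(cls, instance):
    # find everything after cls in the instance's MRO
    mro = type(instance).__mro__
    rest = mro[mro.index(cls) + 1:]

    class Proxy:
        def __getattr__(self, name):
            for klass in rest:
                if name in vars(klass):
                    value = vars(klass)[name]
                    # apply the descriptor protocol by hand, so
                    # functions come back as bound methods
                    if hasattr(type(value), '__get__'):
                        return type(value).__get__(value, instance, type(instance))
                    return value
            raise AttributeError(name)

    return Proxy()

class A:
    def hello(self):
        return "A"

class B(A):
    def hello(self):
        return "B, then " + manual_super(B, self).hello()

print(B().hello())   # B, then A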

This is one of the things I really like about Python. Very little of it is truly magical; virtually everything about the object model exists in the types rather than the language, which means virtually everything can be customized in pure Python.

Class creation and metaclasses

A very brief word on all of this stuff, since I could talk forever about Python and I have three other languages to get to.

The class block itself is fairly interesting. It looks like this:

class Name(*bases, **kwargs):
    # code

I’ve said several times that classes are objects, and in fact the class block is one big pile of syntactic sugar for calling type(...) with some arguments to create a new type object.

The Python documentation has a remarkably detailed description of this process, but the gist is:

  • Python determines the type of the new class — the metaclass — by looking for a metaclass keyword argument. If there isn’t one, Python uses the “lowest” type among the provided base classes. (If you’re not doing anything special, that’ll just be type, since every class inherits from object and object is an instance of type.)

  • Python executes the class body. It gets its own local scope, and any assignments or method definitions go into that scope.

  • Python now calls type(name, bases, attrs, **kwargs). The name is whatever was right after class; the bases are positional arguments; and attrs is the class body’s local scope. (This is how methods and other class attributes end up on the class.) The brand new type is then assigned to Name, as the example just below demonstrates.
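
You can see the equivalence for yourself; these two produce essentially the same class (a throwaway example):

class Foo:
    x = 1
    def double(self):
        return self.x * 2

Foo2 = type('Foo2', (), {'x': 1, 'double': lambda self: self.x * 2})

print(Foo().double(), Foo2().double())   # 2 2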

Of course, you can mess with most of this. You can implement __prepare__ on a metaclass, for example, to use a custom mapping as storage for the local scope — including any reads, which allows for some interesting shenanigans. The only part you can’t really implement in pure Python is the scoping bit, which has a couple extra rules that make sense for classes. (In particular, functions defined within a class block don’t close over the class body; that would be nonsense.)
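
Here’s a tiny sketch of the write side of __prepare__, with throwaway names:

class LoggingDict(dict):
    def __setitem__(self, key, value):
        print("class body assigned:", key)
        super().__setitem__(key, value)

class LoggingMeta(type):
    @classmethod
    def __prepare__(mcls, name, bases, **kwargs):
        # whatever this returns becomes the class body's local scope
        return LoggingDict()

class Foo(metaclass=LoggingMeta):
    # implicit entries like __module__ get logged too
    x = 1               # class body assigned: x
    def bar(self):      # class body assigned: bar
        pass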

Object creation

Finally, there’s what actually happens when you create an object — including a class, which remember is just an invocation of type(...).

Calling Foo(...) is implemented as, well, a call. Any type can implement calls with the __call__ special method, and you’ll find that type itself does so. It looks something like this:

# oh, a fun wrinkle that's hard to express in pure python: type is a class, so
# it's an instance of itself
class type:
    def __call__(self, *args, **kwargs):
        # remember, here 'self' is a CLASS, an instance of type.
        # __new__ is a true constructor: object.__new__ allocates storage
        # for a new blank object
        instance = self.__new__(self, *args, **kwargs)
        # you can return whatever you want from __new__ (!), and __init__
        # is only called on it if it's of the right type
        if isinstance(instance, self):
            instance.__init__(*args, **kwargs)
        return instance

Again, you can trivially confirm this by asking any type for its __call__ method. Assuming that type doesn’t implement __call__ itself, you’ll get back a bound version of type’s implementation.

>>> list.__call__
<method-wrapper '__call__' of type object at 0x7fafb831a400>

You can thus implement __call__ in your own metaclass to completely change how subclasses are created — including skipping the creation altogether, if you like.
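
The classic demonstration is a singleton (Singleton and Config here are throwaway names):

class Singleton(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        # only ever construct one instance per class
        if cls not in Singleton._instances:
            Singleton._instances[cls] = super().__call__(*args, **kwargs)
        return Singleton._instances[cls]

class Config(metaclass=Singleton):
    pass

print(Config() is Config())    # True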

And… there’s a bunch of stuff I haven’t even touched on.

The Python philosophy

Python offers something that, on the surface, looks like a “traditional” class/object model. Under the hood, it acts more like a prototypical system, where failed attribute lookups simply defer to a superclass or metaclass.

The language also goes to almost superhuman lengths to expose all of its moving parts. Even the prototypical behavior is an implementation of __getattribute__ somewhere, which you are free to completely replace in your own types. Proxying and delegation are easy.

Also very nice is that these features “bundle” well, by which I mean a library author can do all manner of convoluted hijinks, and a consumer of that library doesn’t have to see any of it or understand how it works. You only need to inherit from a particular class (which has a metaclass) or use some descriptor as a decorator; you don’t even have to learn any new syntax.

This meshes well with Python culture, which is pretty big on the principle of least surprise. These super-advanced features tend to be tightly confined to single simple features (like “makes a weak attribute”) or cordoned off into DSLs (e.g., defining a form/struct/database table with a class body). In particular, I’ve never seen a metaclass in the wild implement its own __call__.

I have mixed feelings about that. It’s probably a good thing overall that the Python world shows such restraint, but I wonder if there are some very interesting possibilities we’re missing out on. I implemented a metaclass __call__ myself, just once, in an entity/component system that strove to minimize fuss when communicating between components. It never saw the light of day, but I enjoyed seeing some new things Python could do with the same relatively simple syntax. I wouldn’t mind seeing, say, an object model based on composition (with no inheritance) built atop Python’s primitives.

Lua

Lua doesn’t have an object model. Instead, it gives you a handful of very small primitives for building your own object model. This is pretty typical of Lua — it’s a very powerful language, but has been carefully constructed to be very small at the same time. I’ve never encountered anything else quite like it, and “but it starts indexing at 1!” really doesn’t do it justice.

The best way to demonstrate how objects work in Lua is to build some from scratch. We need two key features. The first is metatables, which bear a passing resemblance to Python’s metaclasses.

Tables and metatables

The table is Lua’s mapping type and its primary data structure. Keys can be any value other than nil. Lists are implemented as tables whose keys are consecutive integers starting from 1. Nothing terribly surprising. The dot operator is sugar for indexing with a string key.

local t = { a = 1, b = 2 }
print(t['a'])  -- 1
print(t.b)  -- 2
t.c = 3
print(t['c'])  -- 3

A metatable is a table that can be associated with another value (usually another table) to change its behavior. For example, operator overloading is implemented by assigning a function to a special key in a metatable.

local t = { a = 1, b = 2 }
--print(t + 0)  -- error: attempt to perform arithmetic on a table value

local mt = {
    __add = function(left, right)
        return 12
    end,
}
setmetatable(t, mt)
print(t + 0)  -- 12

Now, the interesting part: one of the special keys is __index, which is consulted when the base table is indexed by a key it doesn’t contain. Here’s a table that claims every key maps to itself.

local t = {}
local mt = {
    __index = function(table, key)
        return key
    end,
}
setmetatable(t, mt)
print(t.foo)  -- foo
print(t.bar)  -- bar
print(t[3])  -- 3

__index doesn’t have to be a function, either. It can be yet another table, in which case that table is simply indexed with the key. If the key still doesn’t exist and that table has a metatable with an __index, the process repeats.

With this, it’s easy to have several unrelated tables that act as a single table. Call the base table an object, fill the __index table with functions and call it a class, and you have half of an object system. You can even get prototypical inheritance by chaining __indexes together.

At this point things are a little confusing, since we have at least three tables going on, so here’s a diagram. Keep in mind that Lua doesn’t actually have anything called an “object”, “class”, or “method” — those are just convenient nicknames for a particular structure we might build with Lua’s primitives.

                    ╔═══════════╗        ...
                    ║ metatable ║         ║
                    ╟───────────╢   ┌─────╨───────────────────────┐
                    ║ __index   ╫───┤ lookup table ("superclass") │
                    ╚═══╦═══════╝   ├─────────────────────────────┤
  ╔═══════════╗         ║           │ some other method           ┼─── function() ... end
  ║ metatable ║         ║           └─────────────────────────────┘
  ╟───────────╢   ┌─────╨──────────────────┐
  ║ __index   ╫───┤ lookup table ("class") │
  ╚═══╦═══════╝   ├────────────────────────┤
      ║           │ some method            ┼─── function() ... end
      ║           └────────────────────────┘
┌─────╨─────────────────┐
│ base table ("object") │
└───────────────────────┘

Note that a metatable is not the same as a class; it defines behavior, not methods. Conversely, if you try to use a class directly as a metatable, it will probably not do much. (This is pretty different from e.g. Python, where operator overloads are just methods with funny names. One nice thing about the Lua approach is that you can keep interface-like functionality separate from methods, and avoid clogging up arbitrary objects’ namespaces. You could even use a dummy table as a key and completely avoid name collisions.)

Anyway, code!

local class = {
    foo = function(a)
        print("foo got", a)
    end,
}
local mt = { __index = class }
-- setmetatable returns its first argument, so this is nice shorthand
local obj1 = setmetatable({}, mt)
local obj2 = setmetatable({}, mt)
obj1.foo(7)  -- foo got 7
obj2.foo(9)  -- foo got 9

Wait, wait, hang on. Didn’t I call these methods? How do they get at the object? Maybe Lua has a magical this variable?

Methods, sort of

Not quite, but this is where the other key feature comes in: method-call syntax. It’s the lightest touch of sugar, just enough to have method invocation.

-- note the colon!
a:b(c, d, ...)

-- exactly equivalent to this
-- (except that `a` is only evaluated once)
a.b(a, c, d, ...)

-- which of course is really this
a["b"](a, c, d, ...)

Now we can write methods that actually do something.

local class = {
    bar = function(self)
        print("our score is", self.score)
    end,
}
local mt = { __index = class }
local obj1 = setmetatable({ score = 13 }, mt)
local obj2 = setmetatable({ score = 25 }, mt)
obj1:bar()  -- our score is 13
obj2:bar()  -- our score is 25

And that’s all you need. Much like Python, methods and data live in the same namespace, and Lua doesn’t care whether obj:method() finds a function on obj or gets one from the metatable’s __index. Unlike Python, the function will be passed self either way, because self comes from the use of : rather than from the lookup behavior.

(Aside: strictly speaking, any Lua value can have a metatable — and if you try to index a non-table, Lua will always consult the metatable’s __index. Strings all have the string library as a metatable, so you can call methods on them: try ("%s %s"):format(1, 2). I don’t think Lua lets user code set the metatable for non-tables, so this isn’t that interesting, but if you’re writing Lua bindings from C then you can wrap your pointers in metatables to give them methods implemented in C.)

Bringing it all together

Of course, writing all this stuff every time is a little tedious and error-prone, so instead you might want to wrap it all up inside a little function. No problem.

local function make_object(body)
    -- create a metatable
    local mt = { __index = body }
    -- create a base table to serve as the object itself
    local obj = setmetatable({}, mt)
    -- and, done
    return obj
end

-- you can leave off the parens if you're only passing in a table or string literal
local Dog = {
    -- this acts as a "default" value; if obj.barks is missing, __index will
    -- kick in and find this value on the class.  but if obj.barks is assigned
    -- to, it'll go in the object and shadow the value here.
    barks = 0,

    bark = function(self)
        self.barks = self.barks + 1
        print("woof!")
    end,
}

local mydog = make_object(Dog)
mydog:bark()  -- woof!
mydog:bark()  -- woof!
mydog:bark()  -- woof!
print(mydog.barks)  -- 3
print(Dog.barks)  -- 0

It works, but it’s fairly barebones. The nice thing is that you can extend it pretty much however you want. I won’t reproduce an entire serious object system here — lord knows there are enough of them floating around — but the implementation I have for my LÖVE games lets me do this:

local Animal = Object:extend{
    cries = 0,
}

-- called automatically by Object
function Animal:init()
    print("whoops i couldn't think of anything interesting to put here")
end

-- this is just nice syntax for adding a first argument called 'self', then
-- assigning this function to Animal.cry
function Animal:cry()
    self.cries = self.cries + 1
end

local Cat = Animal:extend{}

function Cat:cry()
    print("meow!")
    Cat.__super.cry(self)
end

local cat = Cat()
cat:cry()  -- meow!
cat:cry()  -- meow!
print(cat.cries)  -- 2
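
For the curious, here’s a minimal sketch of how such an extend could be built. This isn’t my actual implementation, just my best guess at enough machinery to support what the example above uses (init, __super, and calling the class to construct an instance):

local Object = {}
Object.__index = Object

function Object:extend(proto)
    proto = proto or {}
    -- instances of the new class look up missing keys in the class
    proto.__index = proto
    proto.__super = self
    return setmetatable(proto, {
        -- the class itself falls back to its superclass
        __index = self,
        -- calling the class creates an instance and runs init
        __call = function(class, ...)
            local obj = setmetatable({}, class)
            if obj.init then obj:init(...) end
            return obj
        end,
    })
end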

When I say you can extend it however you want, I mean that. I could’ve implemented Python (2)-style super(Cat, self):cry() syntax; I just never got around to it. I could even make it work with multiple inheritance if I really wanted to — or I could go the complete opposite direction and only implement composition. I could implement descriptors, customizing the behavior of individual table keys. I could add pretty decent syntax for composition/proxying. I am trying very hard to end this section now.

The Lua philosophy

Lua’s philosophy is to… not have a philosophy? It gives you the bare minimum to make objects work, and you can do absolutely whatever you want from there. Lua does have something resembling prototypical inheritance, but it’s not so much a first-class feature as an emergent property of some very simple tools. And since you can make __index be a function, you could avoid the prototypical behavior and do something different entirely.

The very severe downside, of course, is that you have to find or build your own object system — which can get pretty confusing very quickly, what with the multiple small moving parts. Third-party code may also have its own object system with subtly different behavior. (Though, in my experience, third-party code tries very hard to avoid needing an object system at all.)

It’s hard to say what the Lua “culture” is like, since Lua is an embedded language that’s often a little different in each environment. I imagine it has a thousand millicultures, instead. I can say that the tedium of building my own object model has led me into something very “traditional”, with prototypical inheritance and whatnot. It’s partly what I’m used to, but it’s also just really dang easy to get working.

Likewise, while I love properties in Python and use them all the dang time, I’ve yet to use a single one in Lua. They wouldn’t be particularly hard to add to my object model, but having to add them myself (or shop around for an object model with them and also port all my code to use it) adds a huge amount of friction. I’ve thought about designing an interesting ECS with custom object behavior, too, but… is it really worth the effort? For all the power and flexibility Lua offers, the cost is that by the time I have something working at all, I’m too exhausted to actually use any of it.

JavaScript

JavaScript is notable for being preposterously heavily used, yet not having a class block.

Well. Okay. Yes. It has one now. It didn’t for a very long time, and even the one it has now is sugar.

Here’s a vector class again:

class Vector {
    constructor(x, y) {
        this.x = x;
        this.y = y;
    }

    get magnitude() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    }

    dot(other) {
        return this.x * other.x + this.y * other.y;
    }
}

In “classic” JavaScript, this would be written as:

function Vector(x, y) {
    this.x = x;
    this.y = y;
}

Object.defineProperty(Vector.prototype, 'magnitude', {
    configurable: true,
    enumerable: true,
    get: function() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    },
});


Vector.prototype.dot = function(other) {
    return this.x * other.x + this.y * other.y;
};

Hm, yes. I can see why they added class.

The JavaScript model

In JavaScript, a new type is defined in terms of a function, which is its constructor.

Right away we get into trouble here. There is a very big difference between these two invocations, which I actually completely forgot about just now after spending four hours writing about Python and Lua:

let vec = Vector(3, 4);
let vec = new Vector(3, 4);

The first calls the function Vector. It assigns some properties to this, which here is going to be window, so now you have a global x and y. It then returns nothing, so vec is undefined.

The second calls Vector with this set to a new empty object, then evaluates to that object. The result is what you’d actually expect.

(You can detect this situation with the strange new.target expression, but I have never once remembered to do so.)

From here, we have true, honest-to-god, first-class prototypical inheritance. The word “prototype” is even right there. When you write this:

vec.dot(vec2)

JavaScript will look for dot on vec and (presumably) not find it. It then consults vec’s prototype, an object you can see for yourself by using Object.getPrototypeOf(). Since vec is a Vector, its prototype is Vector.prototype.

I stress that Vector.prototype is not the prototype for Vector. It’s the prototype for instances of Vector.

(I say “instance”, but the true type of vec here is still just object. If you want to find Vector, it’s automatically assigned to the constructor property of its own prototype, so it’s available as vec.constructor.)

Of course, Vector.prototype can itself have a prototype, in which case the process would continue if dot were not found. A common (and, arguably, very bad) way to simulate single inheritance is to set Class.prototype to an instance of a superclass to get the prototype right, then tack on the methods for Class. Nowadays we can do Object.create(Superclass.prototype).
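
As a sketch, with throwaway types:

function Animal(name) {
    this.name = name;
}
Animal.prototype.speak = function() {
    return this.name + " makes a noise";
};

function Cat(name) {
    Animal.call(this, name);  // run the parent constructor on this object
}
// chain the prototypes without ever instantiating Animal
Cat.prototype = Object.create(Animal.prototype);
Cat.prototype.constructor = Cat;  // restore the constructor property

let cat = new Cat("Mog");
console.log(cat.speak());  // Mog makes a noise
console.log(Object.getPrototypeOf(Cat.prototype) === Animal.prototype);  // true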

Now that I’ve been through Python and Lua, though, this isn’t particularly surprising. I kinda spoiled it.

I suppose one difference in JavaScript is that you can tack arbitrary attributes directly onto Vector all you like, and they will remain invisible to instances since they aren’t in the prototype chain. This is kind of backwards from Lua, where you can squirrel stuff away in the metatable.

Another difference is that every single object in JavaScript has a bunch of properties already tacked on — the ones in Object.prototype. Every object (and by “object” I mean any mapping) has a prototype, and that prototype defaults to Object.prototype, and it has a bunch of ancient junk like isPrototypeOf.

(Nit: it’s possible to explicitly create an object with no prototype via Object.create(null).)

Like Lua, and unlike Python, JavaScript doesn’t distinguish between keys found on an object and keys found via a prototype. Properties can be defined on prototypes with Object.defineProperty(), but that works just as well directly on an object, too. JavaScript doesn’t have a lot of operator overloading, but some things like Symbol.iterator also work on both objects and prototypes.

About this

You may, at this point, be wondering what this is. Unlike Lua and Python (and the last language below), this is a special built-in value — a context value, invisibly passed for every function call.

It’s determined by how the function is called. If the function is called as the result of an attribute lookup, then this is set to the object the attribute was looked up on (not necessarily the object the attribute actually lives on). Otherwise, this is set to the global object, window. (You can also set this to whatever you want via the call method on functions.)

This decision is made lexically, i.e. from the literal source code as written. There are no Python-style bound methods. In other words:

// this = obj
obj.method()
// this = window
let meth = obj.method
meth()

Also, because this is reassigned on every function call, it cannot be meaningfully closed over, which makes using closures within methods incredibly annoying. The old approach was to assign this to some other regular name like self (which got syntax highlighting since it’s also a built-in name in browsers); then we got Function.bind, which produced a callable thing with a fixed context value, which was kind of nice; and now finally we have arrow functions, which explicitly close over the current this when they’re defined and don’t change it when called. Phew.
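
A quick illustration of the difference, with throwaway names:

let counter = {
    count: 0,
    startOld: function() {
        setTimeout(function() {
            // `this` here is window (or undefined in strict mode),
            // because the timer calls a bare function
            this.count++;
        }, 100);
    },
    startNew: function() {
        setTimeout(() => {
            // the arrow function keeps the `this` from startNew
            this.count++;
        }, 100);
    },
};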

Class syntax

I already showed class syntax, and it’s really just one big macro for doing all the prototype stuff The Right Way. It even prevents you from calling the type without new. The underlying model is exactly the same, and you can inspect all the parts.

class Vector { ... }

console.log(Vector.prototype);  // { dot: ..., magnitude: ..., ... }
let vec = new Vector(3, 4);
console.log(Object.getPrototypeOf(vec));  // same as Vector.prototype

// i don't know why you would subclass vector but let's roll with it
class Vectest extends Vector { ... }

console.log(Vectest.prototype);  // { ... }
console.log(Object.getPrototypeOf(Vectest.prototype))  // same as Vector.prototype

Alas, class syntax has a couple shortcomings. You can’t use the class block to assign arbitrary data to either the type object or the prototype — apparently it was deemed too confusing that mutations would be shared among instances. Which… is… how prototypes work. How Python works. How JavaScript itself, one of the most popular languages of all time, has worked for twenty-two years. Argh.

You can still do whatever assignment you want outside of the class block, of course. It’s just a little ugly, and not something I’d think to look for with a sugary class.

A more subtle result of this behavior is that a class block isn’t quite the same syntax as an object literal. The check for data isn’t a runtime thing; class Foo { x: 3 } fails to parse. So JavaScript now has two largely but not entirely identical styles of key/value block.

Attribute access

Here’s where things start to come apart at the seams, just a little bit.

JavaScript doesn’t really have an attribute protocol. Instead, it has two… extension points, I suppose.

One is Object.defineProperty, seen above. For common cases, there’s also the get syntax inside an object literal, which does the same thing. But unlike Python’s @property, these aren’t wrappers around some simple primitives; they are the primitives. JavaScript is the only language of these four to have “property that runs code on access” as a completely separate first-class concept.

If you want to intercept arbitrary attribute access (and some kinds of operators), there’s a completely different primitive: the Proxy type. It doesn’t add interception to your existing object; instead, it produces a wrapper object that supports interception and defers to the wrapped object by default.
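
A quick sketch, with a made-up fallback behavior:

let withDefaults = new Proxy({ a: 1 }, {
    get(target, key) {
        // defer to the wrapped object, with a fallback for misses
        return key in target ? target[key] : "(missing)";
    },
});
console.log(withDefaults.a);  // 1
console.log(withDefaults.b);  // (missing)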

It’s cool to see composition used in this way, but also, extremely weird. If you want to make your own type that overloads in or calling, you have to return a Proxy that wraps your own type, rather than actually returning your own type. And (unlike the other three languages in this post) you can’t return a different type from a constructor, so you have to throw that away and produce objects only from a factory. And instanceof would be broken, but you can at least fix that with Symbol.hasInstance — which is really operator overloading, implemented yet another completely different way.

I know the design here is a result of legacy and speed — if any object could intercept all attribute access, then all attribute access would be slowed down everywhere. Fair enough. It still leaves the surface area of the language a bit… bumpy?

The JavaScript philosophy

It’s a little hard to tell. The original idea of prototypes was interesting, but it was hidden behind some very awkward syntax. Since then, we’ve gotten a bunch of extra features awkwardly bolted on to reflect the wildly varied things the built-in types and DOM API were already doing. We have class syntax, but it’s been explicitly designed to avoid exposing the prototype parts of the model.

I admit I don’t do a lot of heavy JavaScript, so I might just be overlooking it, but I’ve seen virtually no code that makes use of any of the recent advances in object capabilities. Forget about custom iterators or overloading call; I can’t remember seeing any JavaScript in the wild that even uses properties yet. I don’t know if everyone’s waiting for sufficient browser support, nobody knows about them, or nobody cares.

The model has advanced recently, but I suspect JavaScript is still shackled to its legacy of “something about prototypes, I don’t really get it, just copy the other code that’s there” as an object model. Alas! Prototypes are so good. Hopefully class syntax will make it a bit more accessible, as it has in Python.

Perl 5

Perl 5 also doesn’t have an object system and expects you to build your own. But where Lua gives you two simple, powerful tools for building one, Perl 5 feels more like a puzzle with half the pieces missing. Clearly they were going for something, but they only gave you half of it.

In brief, a Perl object is a reference that has been blessed with a package.

I need to explain a few things. Honestly, one of the biggest problems with the original Perl object setup was how many strange corners and unique jargon you had to understand just to get off the ground.

(If you want to try running any of this code, you should stick a use v5.26; as the first line. Perl is very big on backwards compatibility, so you need to opt into breaking changes, and even the mundane say builtin is behind a feature gate.)

References

A reference in Perl is sort of like a pointer, but its main use is very different. See, Perl has the strange property that its data structures try very hard to spill their contents all over the place. Despite having dedicated syntax for arrays — @foo is an array variable, distinct from the single scalar variable $foo — it’s actually impossible to nest arrays.

my @foo = (1, 2, 3, 4);
my @bar = (@foo, @foo);
# @bar is now a flat list of eight items: 1, 2, 3, 4, 1, 2, 3, 4

The idea, I guess, is that an array is not one thing. It’s not a container, which happens to hold multiple things; it is multiple things. Anywhere that expects a single value, such as an array element, cannot contain an array, because an array fundamentally is not a single value.

And so we have “references”, which are a form of indirection, but also have the nice property that they’re single values. They add containment around arrays, and in general they make working with most of Perl’s primitive types much more sensible. A reference to a variable can be taken with the \ operator, or you can use [ ... ] and { ... } to directly create references to anonymous arrays or hashes.

my @foo = (1, 2, 3, 4);
my @bar = (\@foo, \@foo);
# @bar is now a nested list of two items: [1, 2, 3, 4], [1, 2, 3, 4]

(Incidentally, this is the sole reason I initially abandoned Perl for Python. Non-trivial software kinda requires nesting a lot of data structures, so you end up with references everywhere, and the syntax for going back and forth between a reference and its contents is tedious and ugly.)

A Perl object must be a reference. Perl doesn’t care what kind of reference — it’s usually a hash reference, since hashes are a convenient place to store arbitrary properties, but it could just as well be a reference to an array, a scalar, or even a sub (i.e. function) or filehandle.

I’m getting a little ahead of myself. First, the other half: blessing and packages.

Packages and blessing

Perl packages are just namespaces. A package looks like this:

package Foo::Bar;

sub quux {
    say "hi from quux!";
}

# now Foo::Bar::quux() can be called from anywhere

Nothing shocking, right? It’s just a named container. A lot of the details are kind of weird, like how a package exists in some liminal quasi-value space, but the basic idea is a Bag Of Stuff.

The final piece is “blessing,” which is Perl’s funny name for binding a package to a reference. A very basic class might look like this:

package Vector;

# the name 'new' is convention, not special
sub new {
    # perl argument passing is weird, don't ask
    my ($class, $x, $y) = @_;

    # create the object itself -- here, unusually, an array reference makes sense
    my $self = [ $x, $y ];

    # associate the package with that reference
    # note that $class here is just the regular string, 'Vector'
    bless $self, $class;

    return $self;
}

sub x {
    my ($self) = @_;
    return $self->[0];
}

sub y {
    my ($self) = @_;
    return $self->[1];
}

sub magnitude {
    my ($self) = @_;
    return sqrt($self->x ** 2 + $self->y ** 2);
}

# switch back to the "default" package
package main;

# -> is method call syntax, which passes the invocant as the first argument;
# for a package, that's just the package name
my $vec = Vector->new(3, 4);
say $vec->magnitude;  # 5

A few things of note here. First, $self->[0] has nothing to do with objects; it’s normal syntax for getting the value at index 0 out of an array reference called $self. (Most classes are based on hashrefs and would use $self->{value} instead.) A blessed reference is still a reference and can be treated like one.

In general, -> is Perl’s dereferencey operator, but its exact behavior depends on what follows. If it’s followed by brackets, then it’ll apply the brackets to the thing in the reference: ->{} to index a hash reference, ->[] to index an array reference, and ->() to call a function reference.
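
A few plain examples of that, nothing object-related yet:

my $href = { count => 1 };
my $aref = [ 'a', 'b', 'c' ];
my $cref = sub { say "called!" };

say $href->{count};    # 1
say $aref->[1];        # b
$cref->();             # called!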

But if -> is followed by an identifier, then it’s a method call. For packages, that means calling a function in the package and passing the package name as the first argument. For objects — blessed references — that means calling a function in the associated package and passing the object as the first argument.

This is a little weird! A blessed reference is a superposition of two things: its normal reference behavior, and some completely orthogonal object behavior. Also, object behavior has no notion of methods vs data; it only knows about methods. Perl lets you omit parentheses in a lot of places, including when calling a method with no arguments, so $vec->magnitude is really $vec->magnitude().

Perl’s blessing bears some similarities to Lua’s metatables, but ultimately Perl is much closer to Ruby’s “message passing” approach than the above three languages’ approaches of “get me something and maybe it’ll be callable”. (But this is no surprise — Ruby is a spiritual successor to Perl 5.)

All of this leads to one little wrinkle: how do you actually expose data? Above, I had to write x and y methods. Am I supposed to do that for every single attribute on my type?

Yes! But don’t worry, there are third-party modules to help with this incredibly fundamental task. Take Class::Accessor::Fast, so named because it’s faster than Class::Accessor:

package Foo;
use base qw(Class::Accessor::Fast);
__PACKAGE__->mk_accessors(qw(fred wilma barney));

(__PACKAGE__ is the lexical name of the current package; qw(...) is a list literal that splits its contents on whitespace.)

This assumes you’re using a hashref with keys of the same names as the attributes. $obj->fred will return the fred key from your hashref, and $obj->fred(4) will change it to 4.

You also, somewhat bizarrely, have to inherit from Class::Accessor::Fast. Speaking of which,

Inheritance

Inheritance is done by populating the package-global @ISA array with some number of (string) names of parent packages. Most code instead opts to write use base ...;, which does the same thing. Or, more commonly, use parent ...;, which… also… does the same thing.
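
Concretely, a tiny sketch with throwaway packages (the -norequire flag tells use parent not to try loading Animal from a file, since it’s defined right here):

package Animal;
sub new   { my ($class) = @_; return bless {}, $class; }
sub speak { say "..." }

package Dog;
use parent -norequire, 'Animal';    # same effect as: our @ISA = ('Animal');

package main;
my $dog = Dog->new;    # new is found in Animal via @ISA
$dog->speak;           # ...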

Every package implicitly inherits from UNIVERSAL, which can be freely modified by Perl code.

A method can call its superclass method with the SUPER:: pseudo-package:

sub foo {
    my ($self) = @_;
    $self->SUPER::foo;
}

However, this does a depth-first search, which means it almost certainly does the wrong thing when faced with multiple inheritance. For a while the accepted solution involved a third-party module, but Perl eventually grew an alternative you have to opt into: C3, which may be more familiar to you as the order Python uses.

use mro 'c3';

sub foo {
    my ($self) = @_;
    $self->next::method;
}

Offhand, I’m not actually sure how next::method works, seeing as it was originally implemented in pure Perl code. I suspect it involves peeking at the caller’s stack frame. If so, then this is a very different style of customizability from e.g. Python — the MRO was never intended to be pluggable, and the use of a special pseudo-package means it isn’t really, but someone was determined enough to make it happen anyway.

Operator overloading and whatnot

Operator overloading looks a little weird, though really it’s pretty standard Perl.

package MyClass;

use overload '+' => \&_add;

sub _add {
    my ($self, $other, $swap) = @_;
    ...
}

use overload here is a pragma, where “pragma” means “regular-ass module that does some wizardry when imported”.

\&_add is how you get a reference to the _add sub so you can pass it to the overload module. If you just said &_add or _add, that would call it.

And that’s it; you just pass a map of operators to functions to this built-in module. No worry about name clashes or pollution, which is pretty nice. You don’t even have to give references to functions that live in the package, if you don’t want them to clog your namespace; you could put them in another package, or even inline them anonymously.

One especially interesting thing is that Perl lets you overload every operator. Perl has a lot of operators. It considers some math builtins like sqrt and trig functions to be operators, or at least operator-y enough that you can overload them. You can also overload the “file test” operators, such as -e $path to test whether a file exists. You can overload conversions, including implicit conversion to a regex. And most fascinating to me, you can overload dereferencing — that is, the thing Perl does when you say $hashref->{key} to get at the underlying hash. So a single object could pretend to be references of multiple different types, including a subref to implement callability. Neat.

Somewhat related: you can overload basic operators (indexing, etc.) on basic types (not references!) with the tie function, which is designed completely differently and looks for methods with fixed names. Go figure.

You can intercept calls to nonexistent methods by implementing a function called AUTOLOAD, within which the $AUTOLOAD global will contain the name of the method being called. Originally this feature was, I think, intended for loading binary components or large libraries on-the-fly only when needed, hence the name. Offhand I’m not sure I ever saw it used the way __getattr__ is used in Python.
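
A tiny sketch of the mechanism, with a throwaway package:

package Dynamic;

our $AUTOLOAD;
sub AUTOLOAD {
    my ($self, @args) = @_;
    my $name = $AUTOLOAD;
    $name =~ s/.*:://;               # strip the package prefix
    return if $name eq 'DESTROY';    # don't treat destruction as a method call
    say "someone called '$name'";
}

package main;
my $obj = bless {}, 'Dynamic';
$obj->frobnicate;    # someone called 'frobnicate'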

Is there a way to intercept all method calls? I don’t think so, but it is Perl, so I must be forgetting something.

Actually no one does this any more

Like a decade ago, a council of elder sages sat down and put together a whole whizbang system that covers all of it: Moose.

package Vector;
use Moose;

has x => (is => 'rw', isa => 'Int');
has y => (is => 'rw', isa => 'Int');

sub magnitude {
    my ($self) = @_;
    return sqrt($self->x ** 2 + $self->y ** 2);
}

Moose has its own way to do pretty much everything, and it’s all built on the same primitives. Moose also adds metaclasses, somehow, despite that the underlying model doesn’t actually support them? I’m not entirely sure how they managed that, but I do remember doing some class introspection with Moose and it was much nicer than the built-in way.

(If you’re wondering, the built-in way begins with looking at the hash called %Vector::. No, that’s not a typo.)

I really cannot stress enough just how much stuff Moose does, but I don’t want to delve into it here since Moose itself is not actually the language model.

The Perl philosophy

I hope you can see what I meant with what I first said about Perl, now. It has multiple inheritance with an MRO, but uses the wrong one by default. It has extensive operator overloading, which looks nothing like how inheritance works, and also some of it uses a totally different mechanism with special method names instead. It only understands methods, not data, leaving you to figure out accessors by hand.

There’s 70% of an object system here with a clear general design it was gunning for, but none of the pieces really look anything like each other. It’s weird, in a distinctly Perl way.

The result is certainly flexible, at least! It’s especially cool that you can use whatever kind of reference you want for storage, though even as I say that, I acknowledge it’s no different from simply subclassing list or something in Python. It feels different in Perl, but maybe only because it looks so different.

I haven’t written much Perl in a long time, so I don’t know what the community is like any more. Moose was already ubiquitous when I left, which you’d think would let me say “the community mostly focuses on the stuff Moose can do” — but even a decade ago, Moose could already do far more than I had ever seen done by hand in Perl. It’s always made a big deal out of roles (read: interfaces), for instance, despite that I’d never seen anyone care about them in Perl before Moose came along. Maybe their presence in Moose has made them more popular? Who knows.

Also, I wrote Perl seriously, but in the intervening years I’ve only encountered people who only ever used Perl for one-offs. Maybe it’ll come as a surprise to a lot of readers that Perl has an object model at all.

End

Well, that was fun! I hope any of that made sense.

Special mention goes to Rust, which doesn’t have an object model you can fiddle with at runtime, but does do things a little differently.

It’s been really interesting thinking about how tiny differences make a huge impact on what people do in practice. Take the choice of storage in Perl versus Python. Perl’s massively common URI class uses a string as the storage, nothing else; I haven’t seen anything like that in Python aside from markupsafe, which is specifically designed as a string type. I would guess this is partly because Perl makes you choose — using a hashref is an obvious default, but you have to make that choice one way or the other. In Python (especially 3), inheriting from object and getting dict-based storage is the obvious thing to do; the ability to use another type isn’t quite so obvious, and doing it “right” involves a tiny bit of extra work.

Or, consider that Lua could have descriptors, but the extra bit of work (especially design work) has been enough of an impediment that I’ve never implemented them. I don’t think the object implementations I’ve looked at have included them, either. Super weird!

In that light, it’s only natural that objects would be so strongly associated with the features Java and C++ attach to them. I think that makes it all the more important to play around! Look at what Moose has done. No, really, you should bear in mind my description of how Perl does stuff and flip through the Moose documentation. It’s amazing what they’ve built.

Let’s stop copying C

Post Syndicated from Eevee original https://eev.ee/blog/2016/12/01/lets-stop-copying-c/

Ah, C. The best lingua franca language we have… because we have no other lingua franca languages.

C is fairly old — 44 years, now! — and comes from a time when there were possibly more architectures than programming languages. It works well for what it is, and what it is is a relatively simple layer of indirection atop assembly.

Alas, the popularity of C has led to a number of programming languages’ taking significant cues from its design, and parts of its design are… slightly questionable. I’ve gone through some common features that probably should’ve stayed in C and my justification for saying so. The features are listed in rough order from (I hope) least to most controversial. The idea is that C fans will give up when I complain about argument order and not even get to the part where I rag on braces. Wait, crap, I gave it away.

I’ve listed some languages that do or don’t take the same approach as C. Plenty of the listed languages have no relation to C, and some even predate it — this is meant as a cross-reference of the landscape (and perhaps a list of prior art), not a genealogy. The language selections are arbitrary and based on what I could cobble together from documentation, experiments, Wikipedia, and attempts to make sense of Rosetta Code. I don’t know everything about all of them, so I might be missing some interesting quirks. Things are especially complicated for very old languages like COBOL or Fortran, which by now have numerous different versions and variants and de facto standard extensions.

“Bash” generally means zsh and ksh and other derivatives as well, and when referring to expressions, means the $(( ... )) syntax; “Unix shells” means Bourne and thus almost certainly everything else as well. I didn’t look too closely into, say, fish. Unqualified “Python” means both 2 and 3; likewise, unqualified “Perl” means both 5 and 6. Also some of the puns are perhaps a little too obtuse, but the first group listed is always C-like.

Textual inclusion

#include is not a great basis for a module system. It’s not even a module system. You can’t ever quite tell what symbols came from which files, or indeed whether particular files are necessary at all. And in languages with C-like header files, most headers include other headers include more headers, so who knows how any particular declaration is actually ending up in your code? Oh, and there’s the whole include guards thing.

It’s a little tricky to pick on individual languages here, because ultimately even the greatest module system in the world boils down to “execute this other file, and maybe do some other stuff”. I think the true differentiating feature is whether including/importing/whatevering a file creates a new namespace. If a file gets dumped into the caller’s namespace, that looks an awful lot like textual inclusion; if a file gets its own namespace, that’s a good sign of something more structured happening behind the scenes.

This tends to go hand-in-hand with how much the language relies on a global namespace. One surprising exception is Lua, which can compartmentalize required files quite well, but dumps everything into a single global namespace by default.

Quick test: if you create a new namespace and import another file within that namespace, do its contents end up in that namespace?

Included: ACS, awk, COBOL, Erlang, Forth, Fortran, most older Lisps, Perl 5 (despite that required files must return true), PHP, Ruby, Unix shells.

Excluded: Ada, Clojure, D, Haskell, Julia, Lua (the file’s return value is returned from require), Nim, Node (similar to Lua), Perl 6, Python, Rust.

Special mention: ALGOL appears to have been designed with the assumption that you could include other code by adding its punch cards to your stack. C#, Java, OCaml, and Swift all have some concept of “all possible code that will be in this program”, sort of like C with inferred headers, so imports are largely unnecessary; Java’s import really just does aliasing. Inform 7 has no namespacing, but does have a first-class concept of external libraries, but doesn’t have a way to split a single project up between multiple files.

Optional block delimiters

Old and busted and responsible for gotofail:

if (condition)
    thing;

New hotness, which reduces the amount of punctuation overall and eliminates this easy kind of error:

if condition {
    thing;
}

To be fair, and unlike most of these complaints, the original idea was a sort of clever consistency: the actual syntax was merely if (expr) stmt, and also, a single statement could always be replaced by a block of statements. Unfortunately, the cuteness doesn’t make up for the ease with which errors sneak in. If you’re stuck with a language like this, I advise you always use braces, possibly excepting the most trivial cases like immediately returning if some argument is NULL. Definitely do not do this nonsense, which I saw in actual code not 24 hours ago.

for (x = ...)
    for (y = ...) {
        ...
    }

    // more code

    for (x = ...)
        for (y = ...)
            buffer[y][x] = ...

The only real argument for omitting the braces is that the braces take up a lot of vertical space, but that’s mostly a problem if you put each { on its own line, and you could just not do that.

Some languages use keywords instead of braces, and in such cases it’s vanishingly rare to make the keywords optional.

Blockheads: ACS, awk, C#, D, Erlang (kinda?), Java, JavaScript.

New kids on the block: Go, Perl 6, Rust, Swift.

Had their braces removed: Ada, ALGOL, BASIC, COBOL, CoffeeScript, Forth, Fortran (but still requires parens), Haskell, Lua, Ruby.

Special mention: Inform 7 has several ways to delimit blocks, none of them vulnerable to this problem. Perl 5 requires both the parentheses and the braces… but it lets you leave off the semicolon on the last statement. Python just uses indentation to delimit blocks in the first place, so you can’t have a block look wrong. Lisps exist on a higher plane of existence where the very question makes no sense.

Bitwise operator precedence

For ease of transition from B, in C, the bitwise operators & | ^ have lower precedence than the comparison operators == and friends. That means they happen later. For binary math operators, this is nonsense.

1 + 2 == 3  // (1 + 2) == 3
1 * 2 == 3  // (1 * 2) == 3
1 | 2 == 3  // 1 | (2 == 3)

Many other languages have copied C’s entire set of operators and their precedence, including this. Because a new language is easier to learn if its rules are familiar, you see. Which is why we still, today, have extremely popular languages maintaining compatibility with a language from 1969 — so old that it probably couldn’t get a programming job.

Honestly, if your language is any higher-level than C, I’m not sure bit operators deserve to be operators at all. Free those characters up to do something else. Consider having a first-class bitfield type; then 99% of the use of bit operations would go away.

Quick test: 1 & 2 == 2 evaluates to 1 with C precedence, false otherwise. Or just look at a precedence table: if equality appears between bitwise ops and other math ops, that’s C style.
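
The quick test, as a runnable C program:

#include <stdio.h>

int main(void) {
    /* == binds tighter than &, so this parses as 1 & (2 == 2), i.e. 1 & 1. */
    printf("%d\n", 1 & 2 == 2);    /* 1 */
    /* The reading most people intend: (1 & 2) is 0, and 0 == 2 is false. */
    printf("%d\n", (1 & 2) == 2);  /* 0 */
    return 0;
}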

A bit wrong: D, expr, JavaScript, Perl 5, PHP.

Wisened up: F# (ops are &&& ||| ^^^), Go, Julia, Lua (bitwise ops are new in 5.3), Perl 6 (ops are +& +| +^), Python, Ruby, Rust, SQL, Swift, Unix shells.

Special mention: C# and Java have C’s precedence, but forbid using bitwise operators on booleans, so the quick test is a compile-time error. Lisp-likes have no operator precedence.

Negative modulo

The modulo operator, %, finds the remainder after division. Thus you might think that this always holds:

0 <= a % b < abs(b)

But no — if a is negative, C will produce a negative value. (Well, since C99; before that it was unspecified, which is probably worse.) This is so a / b * b + a % b is always equal to a. Truncating integer division rounds towards zero, so the sign of a % b always needs to be away from zero.

I’ve never found this behavior (or the above equivalence) useful. An easy example is that checking for odd numbers with x % 2 == 1 will fail for negative numbers, which produce -1. But the opposite behavior can be pretty handy.

Consider the problem of having n items that you want to arrange into rows with c columns. A calendar, say; you want to include enough empty cells to fill out the last row. n % c gives you the number of items on the last row, so c - n % c seems like it will give you the number of empty spaces. But if the last row is already full, then n % c is zero, and c - n % c equals c! You’ll have either a double-width row or a spare row of empty cells. Fixing this requires treating n % c == 0 as a special case, which is unsatisfying.

Ah, but if we have positive %, the answer is simply… -n % c! Consider this number line for n = 5 and c = 3:

-6      -3       0       3       6
 | - x x | x x x | x x x | x x - |

a % b tells you how far to count down to find a multiple of b. For positive a, that means “backtracking” over a itself and finding a smaller number. For negative a, that means continuing further away from zero. If you look at negative numbers as the mirror image of positive numbers, then % on a positive number tells you how much to file off to get a multiple, whereas % on a negative number tells you how much further to go to get a multiple. 5 % 3 is 2, but -5 % 3 is 1. And of course, -6 % 3 is still zero, so that’s not a special case.

Positive % effectively lets you choose whether to round up or down. It doesn’t come up often, but when it’s handy, it’s really handy.

(I have no strong opinion on what 5 % -3 should be; I don’t think I’ve ever tried to use % with a negative divisor. Python makes it negative; Pascal makes it positive. Wikipedia has a whole big chart.)

Quick test: -5 % 3 is -2 with C semantics, 1 with “positive” semantics.
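
C itself only gives you the truncating flavor, but the positive behavior is a one-liner on top of it; posmod below is my own helper, not anything standard:

#include <stdio.h>

/* "Positive" modulo, assuming a positive divisor; not a standard function. */
static int posmod(int a, int b) {
    return (a % b + b) % b;
}

int main(void) {
    printf("%d\n", -5 % 3);         /* -2: C semantics */
    printf("%d\n", posmod(-5, 3));  /*  1: "positive" semantics */

    /* The calendar example: n items arranged into rows of c columns. */
    int n = 5, c = 3;
    printf("%d\n", posmod(-n, c));  /* 1 empty cell on the last row */
    n = 6;
    printf("%d\n", posmod(-n, c));  /* 0: a full row is not a special case */
    return 0;
}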

Leftovers: C#, D, expr, Go, Java, JavaScript, OCaml, PowerShell, PHP, Rust, Scala, SQL, Swift, Unix shells, VimL, Visual Basic. Notably, some of these languages don’t even have integer division.

Paying dividends: Dart, MUMPS (#), Perl, Python, R (%%), Ruby, Smalltalk (\\), Standard ML, Tcl.

Special mention: Ada, Haskell, Julia, many Lisps, MATLAB, VHDL, and others have separate mod (Python-like) and rem (C-like) operators. CoffeeScript has separate % (C-like) and %% (Python-like) operators.

Leading zero for octal

Octal notation like 0777 has three uses.

One: to make a file mask to pass to chmod().

Two: to confuse people when they write 013 and it comes out as 11.

Three: to confuse people when they write 018 and get a syntax error.

If you absolutely must have octal (?!) in your language, it’s fine to use 0o777. Really. No one will mind. Or you can go the whole distance and allow literals written in any base, as several languages do.
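
Both confusions take only a couple of lines of C to reproduce:

#include <stdio.h>

int main(void) {
    printf("%d\n", 013);      /* 11: the leading zero silently means octal */
    /* printf("%d\n", 018); */  /* syntax error: 8 is not an octal digit */
    return 0;
}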

Gets a zero: awk (gawk only), Clojure, Go, Groovy, Java, JavaScript, m4, Perl 5, PHP, Python 2, Unix shells.

G0od: ECMAScript 6, Eiffel (0c — cute!), F#, Haskell, Julia, Nemerle, Nim, OCaml, Perl 6, Python 3, Racket (#o), Ruby, Rust, Scheme (#o), Swift, Tcl.

Based literals: Ada (8#777#), Bash (8#777), Erlang (8#777), Icon (8r777), J (8b777), Perl 6 (:8<777>), PostScript (8#777), Smalltalk (8r777).

Special mention: BASIC uses &O and &H prefixes for octal and hex. bc and Forth allow the base used to interpret literals to be changed on the fly, via ibase and BASE respectively. C#, D, expr, Lua, Scala, and Standard ML have no octal literals at all. Some COBOL extensions use O# and H#/X# prefixes for octal and hex. Fortran uses the slightly odd O'777' syntax.

No power operator

Perhaps this makes sense in C, since it doesn’t correspond to an actual instruction on most CPUs, but in JavaScript? If you can make + work for strings, I think you can add a **.

If you’re willing to ditch the bitwise operators (or lessen their importance a bit), you can even use ^, as most people would write in regular ASCII text.
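
In C, exponentiation is a library call, and the operator you might instinctively reach for does something else entirely:

#include <math.h>
#include <stdio.h>

int main(void) {  /* link with -lm */
    printf("%d\n", 2 ^ 10);            /* 8: ^ is XOR, not exponentiation */
    printf("%.0f\n", pow(2.0, 10.0));  /* 1024: the actual power function */
    return 0;
}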

Powerless: ACS, C#, Eiffel, Erlang, expr, Forth, Go.

Two out of two stars: Ada, ALGOL (↑ works too), COBOL, CoffeeScript, ECMAScript 7, Fortran, F#, Groovy, OCaml, Perl, PHP, Python, Ruby, Unix shells.

I tip my hat: awk, BASIC, bc, COBOL, fish, Lua.

Otherwise powerful: APL (*), D (^^).

Special mention: Lisps tend to have a named function rather than a dedicated operator (e.g. Math/pow in Clojure, expt in Common Lisp), but since operators are regular functions, this doesn’t stand out nearly so much. Haskell uses all three of ^, ^^, and ** for typing reasons.

C-style for loops

This construct is bad. It very rarely matches what a human actually wants to do, which 90% of the time is “go through this list of stuff” or “count from 1 to 10”. A C-style for obscures those wishes. The syntax is downright goofy, too: nothing else in the language uses ; as a delimiter and repeatedly executes only part of a line. It’s like a tuple of statements.
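
For reference, here’s the construct in question; the actual intent, “print each item”, is smeared across three clauses of bookkeeping:

#include <stdio.h>

int main(void) {
    const char *items[] = {"a", "b", "c"};
    /* Initialize; test; increment -- none of which is the point. */
    for (size_t i = 0; i < sizeof items / sizeof items[0]; i++) {
        printf("%s\n", items[i]);
    }
    return 0;
}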

I said in my previous post about iteration that having an iteration protocol requires either objects or closures, but I realize that’s not true. I even disproved it in the same post. Lua’s own iteration protocol can be implemented without closures — the semantics of for involve keeping a persistent state value and passing it to the iterator function every time. It could even be implemented in C! Awkwardly. And with a bunch of macros. Which aren’t hygienic in C. Hmm, well.

Loopy: ACS, bc, Fortran.

Cool and collected: C#, Clojure, D, Delphi (recent), ECMAScript 6, Eiffel (recent), Go, Groovy, Icon, Inform 7, Java, Julia, Logo, Lua, Nemerle, Nim, Objective-C, Perl, PHP, PostScript, Prolog, Python, R, Rust, Scala, Smalltalk, Swift, Tcl, Unix shells, Visual Basic.

Special mention: Functional languages and Lisps are laughing at the rest of us here. awk has for...in, but it doesn’t iterate arrays in order, which makes it rather less useful. JavaScript (pre ES6) has both for...in and for each...in, but both are differently broken, so you usually end up using C-style for or external iteration. BASIC has an ergonomic numeric loop, but no iteration loop. Ruby mostly uses external iteration, and its for block is actually expressed in those terms.

Switch with default fallthrough

We’ve been through this before. Wanting completely separate code per case is, by far, the most common thing to want to do. It makes no sense to have to explicitly opt out of the more obvious behavior.
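
The classic trap, in miniature:

#include <stdio.h>

int main(void) {
    int n = 1;
    switch (n) {
    case 1:
        printf("one\n");
        /* Missing break: control falls straight into the next case. */
    case 2:
        printf("two\n");
        break;
    default:
        printf("other\n");
    }
    /* Prints both "one" and "two". */
    return 0;
}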

Breaks my heart: Java, JavaScript.

Follows through: Ada, BASIC, CoffeeScript, Go (has a fallthrough statement), Lisps, Ruby, Swift (has a fallthrough statement), Unix shells.

Special mention: C# and D require break, but require something one way or the other — implicit fallthrough is disallowed except for empty cases. Perl 5 historically had no switch block built in, but it comes with a Switch module, and the last seven releases have had an experimental given block which I stress is still experimental. Python has no switch block. Erlang, Haskell, and Rust have pattern-matching instead (which doesn’t allow fallthrough at all).

Type first

int foo;

In C, this isn’t too bad. You get into problems when you remember that it’s common for type names to be all lowercase.

foo * bar;

Is that a useless expression, or a declaration? It depends entirely on whether foo is a variable or a type.
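
A self-contained illustration: whether the same spelling is a declaration or a multiplication hinges entirely on a typedef somewhere else in the program:

typedef int foo;

int main(void) {
    foo * bar = 0;  /* a declaration: bar is a pointer to foo */
    int baz = 6, qux = 7;
    baz * qux;      /* the same shape, but a useless multiplication */
                    /* (compilers flag each line differently, if at all) */
    (void)bar; (void)baz; (void)qux;
    return 0;
}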

It gets a little weirder when you consider that there are type names with spaces in them. And storage classes. And qualifiers. And sometimes part of the type comes after the name.

extern const volatile _Atomic unsigned long long int * restrict foo[];

That’s not even getting into the syntax for types of function pointers, which might have arbitrary amounts of stuff after the variable name.

And then C++ came along with generics, which means a type name might also have other type names nested arbitrarily deep.

extern const volatile std::unordered_map<unsigned long long int, std::unordered_map<const long double * const, const std::vector<std::basic_string<char>>::const_iterator>> foo;

And that’s just a declaration! Imagine if there were an assignment in there too.

The great thing about static typing is that I know the types of all the variables, but that advantage is somewhat lessened if I can’t tell what the variables are.

Between type-first, function pointer syntax, Turing-complete duck-typed templates, and C++’s initialization syntax, there are several ways where parsing C++ is ambiguous or even undecidable! “Undecidable” here means that there exist C++ programs which cannot even be parsed into a syntax tree, because the same syntax means two different things depending on whether some expression is a value or a type, and that question can depend on an endlessly recursive template instantiation. (This is also a great example of ambiguity, where x * y(z) could be either an expression or a declaration.)

Contrast with, say, Rust:

let x: ... = ...;

This is easy to parse, both for a human and a computer. The thing before the colon must be a variable name, and it stands out immediately; the thing after the colon must be a type name. Even better, Rust has pretty good type inference, so the type is probably unnecessary anyway.

Of course, languages with no type declarations whatsoever are immune to this problem.

Most vexing: ACS, ALGOL, C#, D (though [] goes on the type), Fortran, Java, Perl 6.

Looks Lovely: Ada, Boo, F#, Go, Python 3 (via annotation syntax and the typing module), Rust, Swift, TypeScript.

Special mention: BASIC uses trailing type sigils to indicate scalar types.

Weak typing

Please note: this is not the opposite of static typing. Weak typing is more about the runtime behavior of values — if I try to use a value of type T as though it were of type U, will it be implicitly converted?

C lets you assign pointers to int variables and then take square roots of them, which seems like a bad idea to me. C++ agreed and nixed this, but also introduced the ability to make your own custom types implicitly convertible to as many other types as you want.

This one is pretty clearly a spectrum, and I don’t have a clear line. For example, I don’t fault Python for implicitly converting between int and float, because int is infinite-precision and float is 64-bit, so it’s usually fine. But I’m a lot more suspicious of C, which lets you assign an int to a char without complaint. (Well, okay. Literal integers in C are ints, which poses a slight problem.)
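
A small taste of C’s willingness to convert; every line below compiles, mostly without complaint:

int main(void) {
    char c = 1000;   /* silently truncated; the stored value is implementation-defined */
    int i = 3.99;    /* silently truncated to 3, not rounded */
    double d = 'x';  /* a character is "just" a small integer */
    (void)c; (void)i; (void)d;
    return 0;
}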

I do count a combined addition/concatenation operator that accepts different types of arguments as a form of weak typing.

Weak: JavaScript (+), PHP, Unix shells (almost everything’s a string, but even arrays/scalars are somewhat interchangeable).

Strong: F#, Go (explicit numeric casts), Haskell, Python, Rust (explicit numeric casts).

Special mention: ACS only has integers; even fixed-point values are stored in integers, and the compiler has no notion of a fixed-point type, making it the weakest language imaginable. C++ and Scala both allow defining implicit conversions, for better or worse. Perl 5 is weak, but it avoids most of the ambiguity by having entirely separate sets of operators for string vs numeric operations. Python 2 is mostly strong, but that whole interchangeable bytes/text thing sure caused some ruckus. Tcl only has strings.

Integer division

“Hey, new programmers!” you may find yourself saying. “Don’t worry, it’s just like math, see? Here’s how to use $LANGUAGE as a calculator.”

“Oh boy!” says your protégé. “Let’s see what 7 ÷ 2 is! Oh, it’s 3. I think the computer is broken.”

They’re right! It is broken. I have genuinely seen a non-trivial number of people come into #python thinking division is “broken” because of this.

To be fair, C is pretty consistent about making math operations always produce a value whose type matches one of the arguments. It’s also unclear whether such division should produce a float or a double. Inferring from context would make sense, but that’s not something C is really big on.

Quick test: 7 / 2 is 3½, not 3.
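
The C version of the quick test; the spelling of a single operand changes the meaning of the operator:

#include <stdio.h>

int main(void) {
    printf("%d\n", 7 / 2);    /* 3: int / int is integer division */
    printf("%g\n", 7 / 2.0);  /* 3.5: one double operand changes everything */
    return 0;
}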

Integrous: bc, C#, D, expr, F#, Fortran, Go, OCaml, Python 2, Ruby, Rust (hard to avoid), Unix shells.

Afloat: awk (no integers), Clojure (produces a rational!), Groovy, JavaScript (no integers), Lua (no integers until 5.3), Nim, Perl 5 (no integers), Perl 6, PHP, Python 3.

Special mention: Haskell disallows / on integers. Nim, Haskell, Perl 6, Python, and probably others have separate integral division operators: div, div, div, and //, respectively.

Bytestrings

“Strings” in C are arrays of 8-bit characters. They aren’t really strings at all, since they can’t hold the vast majority of characters without some further form of encoding. Exactly what the encoding is and how to handle it is left entirely up to the programmer. This is a pain in the ass.

Some languages caught wind of this Unicode thing in the 90s and decided to solve this problem once and for all by making “wide” strings with 16-bit characters. (Even C95 has this, in the form of wchar_t* and L"..." literals.) Unicode, you see, would never have more than 65,536 characters.

Whoops, so much for that. Now we have strings encoded as UTF-16 rather than UTF-8, so we’re paying extra storage cost and we still need to write extra code to do basic operations right. Or we forget, and then later we have to track down a bunch of wonky bugs because someone typed a 💩.

Note that handling characters/codepoints is very different from handling glyphs, i.e. the distinct shapes you see on screen. Handling glyphs doesn’t even really make sense outside the context of a font, because fonts are free to make up whatever ligatures they want. Remember “diverse” emoji? Those are ligatures of three to seven characters, completely invented by a font vendor. A programming language can’t reliably count the display length of that, especially when new combining behaviors could be introduced at any time.

Also, it doesn’t matter how you solve this problem, as long as it appears to be solved. I believe Ruby uses bytestrings, for example, but they know their own encoding, so they can be correctly handled as sequences of codepoints. Having a separate non-default type or methods does not count, because everyone will still use the wrong thing first — sorry, Python 2.

Quick test: what’s the length of “💩”? If 1, you have real unencoded strings. If 2, you have UTF-16 strings. If 4, you have UTF-8 strings. If something else, I don’t know what the heck is going on.
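
In C the question barely even makes sense: strlen counts bytes, full stop. A short demonstration, where the escape sequence is the UTF-8 encoding of U+1F4A9:

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *s = "\xF0\x9F\x92\xA9";  /* 💩, spelled out as UTF-8 bytes */
    printf("%zu\n", strlen(s));          /* 4: bytes, not characters */
    return 0;
}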

Totally bytes: Go, Lua, Python 2 (separate unicode type).

Comes up short: Java, JavaScript.

One hundred emoji: Python 3, Ruby, Rust, Swift (even gets combining characters right!).

Special mention: Go’s strings are explicitly arbitrary byte sequences, but iterating over a string with for..range decodes UTF-8 code points. Perl 5 gets the quick test right if you put use utf8; at the top of the file, but Perl 5’s Unicode support is such a confusing clusterfuck that I can’t really give it a 💯.

Autoincrement and autodecrement

I don’t think there are too many compelling reasons to have ++. It means the same as += 1, which is still nice and short. The only difference is that people can do stupid unreadable tricks with ++.

One exception: it is possible to overload ++ in ways that don’t make sense as += 1 — for example, C++ uses ++ to advance iterators, which may do any arbitrary work under the hood.

Double plus ungood: ACS, awk, C#, D, Go, Java, JavaScript, Perl, Unix shells, Vala.

Double plus good: Lua (which doesn’t have += either), Python, Ruby, Rust, Swift (removed in v3).

Special mention: Perl 5 and PHP both allow ++ on strings, in which case it increments letters or something, but I don’t know if much real code has ever used this.

!

A pet peeve. Spot the difference:

if (looks_like_rain()) {
    ...
}
if (!looks_like_rain()) {
    ...
}

That single ! is ridiculously subtle, which seems wrong to me when it makes an expression mean its polar opposite. Surely it should stick out like a sore thumb. The left parenthesis makes it worse, too; it blends in slightly as just noise.

It helps a bit to space after the ! in cases like this:

if (! looks_like_rain()) {
    ...
}

But this seems to be curiously rare. The easy solution is to just spell the operator not. At which point the other two might as well be and and or.

Interestingly enough, C95 specifies and, or, not, and some others as standard alternative spellings, though I’ve never seen them in any C code and I suspect existing projects would prefer I not use them.
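
They live in iso646.h, if you’re curious, and they do work:

#include <iso646.h>
#include <stdio.h>

int main(void) {
    int rain = 0;
    if (not rain and 1 < 2) {  /* macros from iso646.h, standard since C95 */
        puts("leave the umbrella");
    }
    return 0;
}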

Not right: ACS, awk, C#, D, Go, Groovy, Java, JavaScript, Nemerle, PHP, R, Rust, Scala, Swift, Tcl, Vala.

Spelled out: Ada, ALGOL, BASIC, COBOL, Erlang, F#, Fortran, Haskell, Inform 7, Lisps, Lua, Nim, OCaml, Pascal, PostScript, Python, Smalltalk, Standard ML.

Special mention: APL and Julia both use ~, which is at least easier to pick out, which is more than I can say for most of APL. bc and expr, which are really calculators, have no concept of Boolean operations. Forth and Icon, which are not calculators, don’t seem to either. Inform 7 often blends the negation into the verb, e.g. if the player does not have.... Perl and Ruby have both symbolic and named Boolean operators (Perl 6 has even more), with different precedence (which inside if won’t matter); I believe Perl 5 prefers the words and Ruby prefers the symbols. Perl and Ruby also both have a separate unless block, with the opposite meaning to if. Python has is not and not in operators.

Single return and out parameters

Because C can only return a single value, and that value is often an indication of failure for the sake of an if, “out” parameters are somewhat common.

double x, y;
get_point(&x, &y);

It’s not immediately clear whether x and y are input or output. Sometimes they might function as both. (And of course, in this silly example, you’d be better off returning a single point struct. Or would you use a point out parameter because returning structs is potentially expensive?)
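
Here are both styles side by side, as a sketch; Point and the function names are my own invention:

#include <stdio.h>

typedef struct { double x, y; } Point;

/* Out-parameter style: & at the call site at least hints at mutation. */
static void get_point_out(double *x, double *y) {
    *x = 1.0;
    *y = 2.0;
}

/* Struct-return style: one value, clearly flowing out. */
static Point get_point(void) {
    return (Point){1.0, 2.0};
}

int main(void) {
    double x, y;
    get_point_out(&x, &y);
    Point p = get_point();
    printf("%g,%g and %g,%g\n", x, y, p.x, p.y);
    return 0;
}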

Some languages have doubled down on this by adding syntax to declare “out” parameters, which removes the ambiguity in the function definition, but makes it worse in function calls. In the above example, using & on an argument is at least a decent hint that the function wants to write to those values. If you have implicit out parameters or pass-by-reference or whatever, that would just be get_point(x, y) and you’d have no indication that those arguments are special in any way.

The vast majority of the time, this can be expressed in a more straightforward way by returning multiple values:

x, y = get_point()

That was intended as Python, but technically, Python doesn’t have multiple returns! It seems to, but it’s really a combination of several factors: a tuple type, the ability to make a tuple literal with just commas, and the ability to unpack a tuple via multiple assignment. In the end it works just as well. Also this is a way better use of the comma operator than in C.

But the exact same code could appear in Lua, which has multiple return/assignment as an explicit feature… and no tuples. The difference becomes obvious if you try to assign the return value to a single variable instead:

point = get_point()

In Python, point would be a tuple containing both return values. In Lua, point would be the x value, and y would be silently discarded. I don’t tend to be a fan of silently throwing data away, but I have to admit that Lua makes pretty good use of this in several places for “optional” return values that the caller can completely ignore if desired. An existing function can even be extended to return more values than before — that would break callers in Python, but work just fine in Lua.

(Also, to briefly play devil’s advocate: I once saw Python code that returned 14 values, all with very complicated types and semantics. Maybe don’t do that. I think I cleaned it up to return an object, which simplified the calling code considerably too.)

It’s also possible to half-ass this. ECMAScript 6:

function get_point() {
    return [1, 2];
}

var [x, y] = get_point();

It works, but it doesn’t actually look like multiple return. The trouble is that JavaScript has C’s comma operator and C’s variable declaration syntax, so neither of the above constructs could’ve left off the brackets without significantly changing the syntax:

function get_point() {
    // Whoops!  This uses the comma operator, which evaluates to its last
    // operand, so it just returns 2
    return 1, 2;
}

// Whoops!  This is multiple declaration, where each variable gets its own "=",
// so it assigns nothing to x and the return value to y
var x, y = get_point();
// Now x is undefined and y is 2

This is still better than either out parameters or returning an explicit struct that needs manual unpacking, but it’s not as good as comma-delimited tuples. Note that some languages require parentheses around tuples (and also call them tuples), and I’m arbitrarily counting that as better than brackets.

Single return: Ada, ALGOL, BASIC, C#, COBOL, Fortran, Groovy, Java, Smalltalk.

Half-assed multiple return: C++11, D, ECMAScript 6, Erlang, PHP.

Multiple return via tuples: F#, Go, Haskell, Julia, Nemerle, Nim, OCaml, Perl (just lists really), Python, Ruby, Rust, Scala, Standard ML, Swift, Tcl.

Native multiple return: Common Lisp, Lua.

Special mention: Forth is stack-based, and all return values are simply placed on the stack, so multiple return isn’t a special case. Unix shell functions don’t return values. Visual Basic sets a return value by assigning to the function’s name (?!), so good luck fitting multiple return in there.

Silent errors

Most runtime errors in C are indicated by one of two mechanisms: returning an error code, or segfaulting. Segfaulting is pretty noisy, so that’s okay, except for the exploit potential and all.

Returning an error code kinda sucks. Those tend to be important, but nothing in the language actually reminds you to check them, and of course we silly squishy humans have the habit of assuming everything will succeed at all times. Which is how I segfaulted git two days ago: I found a spot where it didn’t check for a NULL returned as an error.
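
The pattern that bites, in sketch form; the single easy-to-forget if is all that stands between you and that segfault:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *buf = malloc(64);
    if (buf == NULL) {     /* nothing in the language forces you to write this */
        fprintf(stderr, "out of memory\n");
        return 1;
    }
    strcpy(buf, "hello");  /* without the check, a NULL here would segfault */
    puts(buf);
    free(buf);
    return 0;
}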

There are several alternatives here: exceptions, statically forcing the developer to check for an error code, or using something monad-like to statically force the developer to distinguish between an error and a valid return value. Probably some others. In the end I was surprised by how many languages went the exception route.

Quietly wrong: Unix shells. Wow, yeah, I’m having a hard time naming anything else. Good job, us!

Exceptional: Ada, C++, C#, D, Erlang, Forth, Java (exceptions are even part of function signature), JavaScript, Nemerle, Nim, Objective-C, OCaml, Perl 6, Python, Ruby, Smalltalk, Standard ML, Visual Basic.

Monadic: Haskell (Either), Rust (Result).

Special mention: ACS doesn’t really have many operations that can error, and those that do simply halt the script. ALGOL apparently has something called “mending” that I don’t understand. Go tends to use secondary return values, which calling code has to unpack, making them slightly harder to forget about. Lisps have conditions and call/cc, which are different things entirely. Lua and Perl 5 handle errors by taking down the whole program, but offer a construct that can catch that further up the stack, which is clumsy but enough to emulate try..catch. PHP has exceptions, and errors (which are totally different), and a lot of builtin functions that return error codes. Swift has something that looks like exceptions, but it doesn’t involve stack unwinding and does require some light annotation, so I think it’s all sugar for a monadic return value. Visual Basic, and I believe some other BASICs, decided C wasn’t bad enough and introduced the bizarre On Error Resume Next construct which does exactly what it sounds like.

Nulls

The billion dollar mistake.

I think it’s considerably worse in a statically typed language like C, because the whole point is that you can rely on the types. But a double* might be NULL, which is not actually a pointer to a double; it’s a pointer to a segfault. Other kinds of bad pointers are possible, of course, but those are more an issue of memory safety; allowing any reference to be null violates type safety. The root of the problem is treating null as a possible value of any type, when really it’s its own type entirely.
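
A minimal demonstration of the lie: the signature promises a double, the type checker happily agrees, and then the program crashes anyway:

static double square(const double *p) {
    return *p * *p;  /* the type says "pointer to double"... */
}

int main(void) {
    double *nothing = 0;          /* ...but NULL inhabits every pointer type, */
    return (int)square(nothing);  /* so this type-checks, then segfaults */
}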

The alternatives tend to be either opt-in nullability or an “optional” generic type (a monad!) which eliminates null as its own value entirely. Notably, Swift does it both ways: optional types are indicated by a trailing ?, but that’s just syntactic sugar for Option<T>.

On the other hand, while it’s annoying to get a None where I didn’t expect one in Python, it’s not like I’m surprised. I occasionally get a string where I expected a number, too. The language explicitly leaves type concerns in my hands. My real objection is to having a static type system that lies. So I’m not going to list every single dynamic language here, because not only is it consistent with the rest of the type system, but they don’t really have any machinery to prevent this anyway.

Nothing doing: C#, D, Go, Java, Nim (non-nullable types are opt in), R.

Nullable types: Swift.

Monads: F# (Option — though technically F# also inherits null from .NET), Haskell (Maybe), Rust (Option), Swift (Optional).

Special mention: awk, Tcl, and Unix shells only have strings, so in a surprising twist, they have no concept of null whatsoever. Java recently introduced an Optional<T> type which explicitly may or may not contain a value, but since it’s still a non-primitive, it could also be null. C++17 doesn’t quite have the same problem with std::optional<T>, since non-reference values can’t be null. Inform 7’s nothing value is an object (the root of half of its type system), which means any object variable might be nothing, but any value of a more specific type cannot be nothing. JavaScript has two null values, null and undefined. Perl 6 is really big on static types, but claims its Nil object doesn’t exist, and I don’t know how to even begin to unpack that.

Assignment as expression

How common a mistake is this:

if (x = 3) {
    ...
}

Well, I don’t know, actually. Maybe not that common, save for among beginners. But I sort of wonder whether allowing this buys us anything. I can only think of two cases where it does. One is with something like iteration:

// Typical linked list
while (p = p->next) {
    ...
}

But this is only necessary in C in the first place because it has no first-class notion of iteration. The other is shorthand for checking that a function returned a useful value:

if (ptr = get_pointer()) {
    ...
}

But if a function returns NULL, that’s really an error condition, and presumably you have some other way to handle that too.

What does that leave? The only time I remotely miss this in Python (where it’s illegal) is when testing a regex. You tend to see this a lot instead.

m = re.match('x+y+z+', some_string)
if m:
    ...

re treats failure as an acceptable possibility and returns None, rather than raising an exception. I’m not sure whether this was the right thing to do or not, but off the top of my head I can’t think of too many other Python interfaces that sometimes return None.
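
Back in C, this particular bug is at least partially mitigated these days: GCC and Clang’s -Wparentheses warns about a bare assignment used as a condition, and an extra pair of parentheses silences it:

#include <stdio.h>

int main(void) {
    int x = 0;
    if (x = 3) {    /* always true: assigns 3, then tests it */
        printf("x is %d\n", x);
    }
    if ((x = 4)) {  /* the doubled parentheses say "yes, I meant it" */
        printf("x is %d\n", x);
    }
    return 0;
}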

Some languages go entirely the opposite direction and make everything an expression, including block constructs like if. In those languages, it makes sense for assignment to be an expression, for consistency with everything else.

Assignment’s an expression: ACS, C#, D, Java, JavaScript, Perl, PHP.

Everything’s an expression: Ruby, Rust.

Assignment’s a statement: Inform 7, Lua, Python, Swift, Unix shells.

Special mention: BASIC uses = for both assignment and equality testing — the meaning is determined from context. Functional languages generally don’t have an assignment operator. Rust has a special if let block that explicitly combines assignment with pattern matching, which is way nicer than the C approach.

No hyphens in identifiers

snake_case requires dancing on the shift key (unless you rearrange your keyboard, which is perfectly reasonable). It slows you down slightly and leads to occasional mistakes like snake-Case. The alternative is dromedaryCase, which is objectively wrong and doesn’t actually solve this problem anyway.

Why not just allow hyphens in identifiers, so we can avoid this argument and use kebab-case?

Ah, but then it’s ambiguous whether you mean an identifier or the subtraction operator. No problem: require spaces for subtraction. I don’t think a tiny way you’re allowed to make your code harder to read is really worth this clear advantage.

Low score: ACS, C#, D, Java, JavaScript, OCaml, Pascal, Perl 5, PHP, Python, Ruby, Rust, Swift, Unix shells.

Nicely-named: COBOL, CSS (and thus Sass), Forth, Inform 7, Lisps, Perl 6, XML.

Special mention: Perl has a built-in variable called $-, and Ruby has a few called $-n for various values of “n”, but these are very special cases.

Braces and semicolons

Okay. Hang on. Bear with me.

C code looks like this.

some block header {
    line 1;
    line 2;
    line 3;
}

The block is indicated two different ways here. The braces are for the compiler; the indentation is for humans.

Having two different ways to say the same thing means they can get out of sync. They can disagree. And that can be, as previously mentioned, really bad. This is really just a more general form of the problem of optional block delimiters.

The only solution is to eliminate one of the two. Programming languages exist for the benefit of humans, so we obviously can’t get rid of the indentation. Thus, we should get rid of the braces. QED.

As an added advantage, we reclaim all the vertical space wasted on lines containing only a }, and we can stop squabbling about where to put the {.

If you accept this, you might start to notice that there are also two different ways of indicating where a line ends: with semicolons for the compiler, and with vertical whitespace for humans. So, by the same reasoning, we should lose the semicolons.

Right? Awesome. Glad we’re all on the same page.

Some languages use keywords instead of braces, but the effect is the same. I’m not aware of any languages that use keywords instead of semicolons.

Bracing myself: C#, D, Erlang, Java, Perl, Rust.

Braces, but no semicolons: JavaScript (kinda — see below), Lua, Ruby, Swift.

Free and clear: CoffeeScript, Haskell, Python.

Special mention: Lisp, just, in general. Inform 7 has an indented style, but it still requires semicolons.

Here’s some interesting trivia. JavaScript, Lua, and Python all optionally allow semicolons at the end of a statement, but the way each language determines line continuation is very different.

JavaScript takes an “opt-out” approach: it continues reading lines until it hits a semicolon, or until reading the next line would cause a syntax error. That leaves a few corner cases like starting a new line with a (, which could look like the last thing on the previous line is a function you’re trying to call. Or you could have -foo on its own line, and it would parse as subtraction rather than unary negation. You might wonder why anyone would do that, but using unary + is one way to make function parse as an expression rather than a statement! I’m not so opposed to semicolons that I want to be debugging where the language thinks my lines end, so I just always use semicolons in JavaScript.

Python takes an “opt-in” approach: it assumes, by default, that a statement ends at the end of a line. However, newlines inside parentheses or brackets are ignored, which takes care of 99% of cases — long lines are most frequently caused by function calls (which have parentheses!) with a lot of arguments. If you really need it, you can explicitly escape a newline with \, but this is widely regarded as incredibly ugly.

Lua avoids the problem almost entirely. I believe Lua’s grammar is designed such that it’s almost always unambiguous where a statement ends, even if you have no newlines at all. This has a few weird side effects: void expressions are syntactically forbidden in Lua, for example, so you just can’t have -foo as its own statement. Also, you can’t have code immediately following a return, because it’ll be interpreted as a return value. The upside is that Lua can treat newlines just like any other whitespace, but still not need semicolons. In fact, semicolons aren’t statement terminators in Lua at all — they’re their own statement, which does nothing. Alas, not for lack of trying, Lua does have the same ( ambiguity as JavaScript (and parses it the same way), but I don’t think any of the others exist.

Oh, and the colons that Python has at the end of its block headers, like if foo:? As far as I can tell, they serve no syntactic purpose whatsoever. Purely aesthetic.

Blaming the programmer

Perhaps one of the worst misfeatures of C is the ease with which responsibility for problems can be shifted to the person who wrote the code. “Oh, you segfaulted? I guess you forgot to check for NULL.” If only I had a computer to take care of such tedium for me!

Clearly, computers can’t be expected to do everything for us. But they can be expected to do quite a bit. Programming languages are built for humans, and they ought to eliminate the sorts of rote work humans are bad at whenever possible. A programmer is already busy thinking about the actual problem they want to solve; it’s no surprise that they’ll sometimes forget some tedious detail the language forces them to worry about.

So if you’re designing a language, don’t just copy C. Don’t just copy C++ or Java. Hell, don’t even just copy Python or Ruby. Consider your target audience, consider the problems they’re trying to solve, and try to get as much else out of the way as possible. If the same “mistake” tends to crop up over and over, look for a way to modify the language to reduce or eliminate it. And be sure to look at a lot of languages for inspiration — even ones you hate, even weird ones no one uses! A lot of clever people have had a lot of other ideas in the last 44 years.


I hope you enjoyed this accidental cross-reference of several dozen languages! I enjoyed looking through them all, though it was incredibly time-consuming. Some of them look pretty interesting; maybe give them a whirl.

Also, dammit, now I’m thinking about language design again.

Let’s stop copying C

Post Syndicated from Eevee original https://eev.ee/blog/2016/12/01/lets-stop-copying-c/

Ah, C. The best lingua franca we have… because we have no other lingua francas. Linguae franca. Surgeons general?

C is fairly old — 44 years, now! — and comes from a time when there were possibly more architectures than programming languages. It works well for what it is, and what it is is a relatively simple layer of indirection atop assembly.

Alas, the popularity of C has led to a number of programming languages’ taking significant cues from its design, and parts of its design are… slightly questionable. I’ve gone through some common features that probably should’ve stayed in C and my justification for saying so. The features are listed in rough order from (I hope) least to most controversial. The idea is that C fans will give up when I call it “weakly typed” and not even get to the part where I rag on braces. Wait, crap, I gave it away.

I’ve listed some languages that do or don’t take the same approach as C. Plenty of the listed languages have no relation to C, and some even predate it — this is meant as a cross-reference of the landscape (and perhaps a list of prior art), not a genealogy. The language selections are arbitrary and based on what I could cobble together from documentation, experiments, Wikipedia, and attempts to make sense of Rosetta Code. I don’t know everything about all of them, so I might be missing some interesting quirks. Things are especially complicated for very old languages like COBOL or Fortran, which by now have numerous different versions and variants and de facto standard extensions.

“Unix shells” means some handwaved combination that probably includes bash and its descendants; for expressions, it means the (( ... )) syntax. I didn’t look too closely into, say, fish. Unqualified “Python” means both 2 and 3; likewise, unqualified “Perl” means both 5 and 6. Also some of the puns are perhaps a little too obtuse, but the first group listed is always C-like.

Textual inclusion

#include is not a great basis for a module system. It’s not even a module system. You can’t ever quite tell what symbols came from which files, or indeed whether particular files are necessary at all. And in languages with C-like header files, most headers include other headers include more headers, so who knows how any particular declaration is actually ending up in your code? Oh, and there’s the whole include guards thing.
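
For anyone who hasn’t had the pleasure, the include-guard dance looks like this (file and macro names hypothetical); without the guard, a second textual inclusion would redefine the struct and fail to compile:

/* point.h */
#ifndef POINT_H
#define POINT_H

struct point {
    double x;
    double y;
};

#endif /* POINT_H */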

It’s a little tricky to pick on individual languages here, because ultimately even the greatest module system in the world boils down to “execute this other file, and maybe do some other stuff”. I think the true differentiating feature is whether including/importing/whatevering a file creates a new namespace. If a file gets dumped into the caller’s namespace, that looks an awful lot like textual inclusion; if a file gets its own namespace, that’s a good sign of something more structured happening behind the scenes.

This tends to go hand-in-hand with how much the language relies on a global namespace. One surprising exception is Lua, which can compartmentalize required files quite well, but dumps everything into a single global namespace by default.

Quick test: if you create a new namespace and import another file within that namespace, do its contents end up in that namespace?

Included: ACS, awk, COBOL, Erlang, Forth, Fortran, most older Lisps, Perl 5 (despite the fact that required files must return true), PHP, Unix shells.

Excluded: Ada, Clojure, D, Haskell, Julia, Lua (the file’s return value is returned from require), Nim, Node (similar to Lua), Perl 6, Python, Rust.

Special mention: ALGOL appears to have been designed with the assumption that you could include other code by adding its punch cards to your stack. C#, Java, OCaml, and Swift all have some concept of “all possible code that will be in this program”, sort of like C with inferred headers, so imports are largely unnecessary; Java’s import really just does aliasing. Inform 7 has no namespacing, but does have a first-class concept of external libraries; it doesn’t, however, have a way to split a single project up between multiple files. Ruby doesn’t automatically give required files their own namespace, but doesn’t evaluate them in the caller’s namespace either.

Optional block delimiters

Old and busted and responsible for gotofail:

if (condition)
    thing;

New hotness, which reduces the amount of punctuation overall and eliminates this easy kind of error:

if condition {
    thing;
}

To be fair, and unlike most of these complaints, the original idea was a sort of clever consistency: the actual syntax was merely if (expr) stmt, and also, a single statement could always be replaced by a block of statements. Unfortunately, the cuteness doesn’t make up for the ease with which errors sneak in. If you’re stuck with a language like this, I advise you always use braces, possibly excepting the most trivial cases like immediately returning if some argument is NULL. Definitely do not do this nonsense, which I saw in actual code not 24 hours ago.

for (x = ...)
    for (y = ...) {
        ...
    }

    // more code

    for (x = ...)
        for (y = ...)
            buffer[y][x] = ...

The only real argument for omitting the braces is that the braces take up a lot of vertical space, but that’s mostly a problem if you put each { on its own line, and you could just not do that.

Some languages use keywords instead of braces, and in such cases it’s vanishingly rare to make the keywords optional.

Blockheads: ACS, awk, C#, D, Erlang (kinda?), Java, JavaScript.

New kids on the block: Go, Perl 6, Rust, Swift.

Had their braces removed: Ada, ALGOL, BASIC, COBOL, CoffeeScript, Forth, Fortran (but still requires parens), Haskell, Lua, Ruby.

Special mention: Inform 7 has several ways to delimit blocks, none of them vulnerable to this problem. Perl 5 requires both the parentheses and the braces… but it lets you leave off the semicolon on the last statement. Python just uses indentation to delimit blocks in the first place, so you can’t have a block look wrong. Lisps exist on a higher plane of existence where the very question makes no sense.

Bitwise operator precedence

For ease of transition from B, in C, the bitwise operators & | ^ have lower precedence than the comparison operators == and friends. That means they happen later. For binary math operators, this is nonsense.

1 + 2 == 3  // (1 + 2) == 3
1 * 2 == 3  // (1 * 2) == 3
1 | 2 == 3  // 1 | (2 == 3)

Many other languages have copied C’s entire set of operators and their precedence, including this. Because a new language is easier to learn if its rules are familiar, you see. Which is why we still, today, have extremely popular languages maintaining compatibility with a language from 1969 — so old that it probably couldn’t get a programming job.

Honestly, if your language is any higher-level than C, I’m not sure bit operators deserve to be operators at all. Free those characters up to do something else. Consider having a first-class bitfield type; then 99% of the use of bit operations would go away.

Quick test: 1 & 2 == 2 evaluates to 1 with C precedence, false otherwise. Or just look at a precedence table: if equality appears between bitwise ops and other math ops, that’s C style.

A bit wrong: D, expr, JavaScript, Perl 5, PHP.

Wisened up: F# (ops are &&& ||| ^^^), Go, Julia, Lua (bitwise ops are new in 5.3), Perl 6 (ops are +& +| +^), Python, Ruby, Rust, SQL, Swift, Unix shells.

Special mention: C# and Java have C’s precedence, but forbid using bitwise operators on booleans, so the quick test is a compile-time error. Lisp-likes have no operator precedence.

Negative modulo

The modulo operator, %, finds the remainder after division. Thus you might think that this always holds:

0 <= a % b < abs(b)

But no — if a is negative, C will produce a negative value. (Well, since C99; before that it was unspecified, which is probably worse.) This is so a / b * b + a % b is always equal to a. Truncating integer division rounds towards zero, so the sign of a % b always needs to be away from zero.

I’ve never found this behavior (or the above equivalence) useful. An easy example is that checking for odd numbers with x % 2 == 1 will fail for negative numbers, which produce -1. But the opposite behavior can be pretty handy.

Consider the problem of having n items that you want to arrange into rows with c columns. A calendar, say; you want to include enough empty cells to fill out the last row. n % c gives you the number of items on the last row, so c - n % c seems like it will give you the number of empty spaces. But if the last row is already full, then n % c is zero, and c - n % c equals c! You’ll have either a double-width row or a spare row of empty cells. Fixing this requires treating n % c == 0 as a special case, which is unsatisfying.

Ah, but if we have positive %, the answer is simply… -n % c! Consider this number line for n = 5 and c = 3:

-6      -3       0       3       6
 | - x x | x x x | x x x | x x - |

a % b tells you how far to count down to find a multiple of b. For positive a, that means “backtracking” over a itself and finding a smaller number. For negative a, that means continuing further away from zero. If you look at negative numbers as the mirror image of positive numbers, then % on a positive number tells you how much to file off to get a multiple, whereas % on a negative number tells you how much further to go to get a multiple. 5 % 3 is 2, but -5 % 3 is 1. And of course, -6 % 3 is still zero, so that’s not a special case.

Positive % effectively lets you choose whether to round up or down. It doesn’t come up often, but when it’s handy, it’s really handy.

(I have no strong opinion on what 5 % -3 should be; I don’t think I’ve ever tried to use % with a negative divisor. Python makes it negative; Pascal makes it positive. Wikipedia has a whole big chart.)

Quick test: -5 % 3 is -2 with C semantics, 1 with “positive” semantics.

Leftovers: C#, D, expr, Go, Java, JavaScript, OCaml, PowerShell, PHP, Rust, Scala, SQL, Swift, Unix shells, VimL, Visual Basic. Notably, some of these languages don’t even have integer division.

Paying dividends: Dart, MUMPS (#), Perl, Python, R (%%), Ruby, Smalltalk (\\), Standard ML, Tcl.

Special mention: Ada, Haskell, Julia, many Lisps, MATLAB, VHDL, and others have separate mod (Python-like) and rem (C-like) operators. CoffeeScript has separate % (C-like) and %% (Python-like) operators.

Leading zero for octal

Octal notation like 0777 has three uses.

One: to make a file mask to pass to chmod().

Two: to confuse people when they write 013 and it comes out as 11.

Three: to confuse people when they write 018 and get a syntax error.

If you absolutely must have octal (?!) in your language, it’s fine to use 0o777. Really. No one will mind. Or you can go the whole distance and allow literals written in any base, as several languages do.

Gets a zero: awk (gawk only), Clojure, Go, Groovy, Java, JavaScript, m4, Perl 5, PHP, Python 2, Unix shells.

G0od: ECMAScript 6, Eiffel (0c — cute!), F#, Haskell, Julia, Nemerle, Nim, OCaml, Perl 6, Python 3, Ruby, Rust, Scheme (#o), Swift, Tcl.

Based literals: Ada (8#777#), Bash (8#777), Erlang (8#777), Icon (8r777), J (8b777), Perl 6 (:8<777>), PostScript (8#777), Smalltalk (8r777).

Special mention: BASIC uses &O and &H prefixes for octal and hex. bc and Forth allow the base used to interpret literals to be changed on the fly, via ibase and BASE respectively. C#, D, expr, Lua, Scala, and Standard ML have no octal literals at all. Some COBOL extensions use O# and H#/X# prefixes for octal and hex. Fortran uses the slightly odd O'777' syntax.

No power operator

Perhaps this makes sense in C, since it doesn’t correspond to an actual instruction on most CPUs, but in JavaScript? If you can make + work for strings, I think you can add a **.

If you’re willing to ditch the bitwise operators (or lessen their importance a bit), you can even use ^, as most people would write in regular ASCII text.

Powerless: ACS, C#, Eiffel, Erlang, expr, Forth, Go.

Two out of two stars: Ada, ALGOL (↑ works too), COBOL, CoffeeScript, ECMAScript 7, Fortran, F#, Groovy, OCaml, Perl, PHP, Python, Ruby, Unix shells.

I tip my hat: awk, BASIC, bc, COBOL, fish, Lua.

Otherwise powerful: APL (*), D (^^).

Special mention: Lisps tend to have a named function rather than a dedicated operator (e.g. Math/pow in Clojure, expt in Common Lisp), but since operators are regular functions, this doesn’t stand out nearly so much. Haskell uses all three of ^, ^^, and ** for typing reasons.

C-style for loops

This construct is bad. It very rarely matches what a human actually wants to do, which 90% of the time is “go through this list of stuff” or “count from 1 to 10”. A C-style for obscures those wishes. The syntax is downright goofy, too: nothing else in the language uses ; as a delimiter and repeatedly executes only part of a line. It’s like a tuple of statements.

I said in my previous post about iteration that having an iteration protocol requires either objects or closures, but I realize that’s not true. I even disproved it in the same post. Lua’s own iteration protocol can be implemented without closures — the semantics of for involve keeping a persistent state value and passing it to the iterator function every time. It could even be implemented in C! Awkwardly. And with a bunch of macros. Which aren’t hygienic in C. Hmm, well.

Loopy: ACS, bc, Fortran.

Cool and collected: C#, Clojure, D, Delphi (recent), ECMAScript 6, Eiffel (recent), Go, Groovy, Icon, Inform 7, Java, Julia, Logo, Lua, Nemerle, Nim, Objective-C, Perl, PHP, PostScript, Prolog, Python, R, Rust, Scala, Smalltalk, Swift, Tcl, Unix shells, Visual Basic.

Special mention: Functional languages and Lisps are laughing at the rest of us here. awk has for...in, but it doesn’t iterate arrays in order, which makes it rather less useful. JavaScript (pre ES6) has both for...in and for each...in, but both are differently broken, so you usually end up using C-style for or external iteration. BASIC has an ergonomic numeric loop, but no iteration loop. Ruby mostly uses external iteration, and its for block is actually expressed in those terms.

Switch with default fallthrough

We’ve been through this before. Wanting completely separate code per case is, by far, the most common thing to want to do. It makes no sense to have to explicitly opt out of the more obvious behavior.

Breaks my heart: Java, JavaScript.

Follows through: Ada, BASIC, CoffeeScript, Go (has a fallthrough statement), Lisps, Ruby, Swift (has a fallthrough statement), Unix shells.

Special mention: C# and D require break, but require something one way or the other — implicit fallthrough is disallowed except for empty cases. Perl 5 historically had no switch block built in, but it comes with a Switch module, and the last seven releases have had an experimental given block which I stress is still experimental. Python has no switch block. Erlang, Haskell, and Rust have pattern-matching instead (which doesn’t allow fallthrough at all).

Type first

int foo;

In C, this isn’t too bad. You get into problems when you remember that it’s common for type names to be all lowercase.

foo * bar;

Is that a useless expression, or a declaration? It depends entirely on whether foo is a variable or a type.

It gets a little weirder when you consider that there are type names with spaces in them. And storage classes. And qualifiers. And sometimes part of the type comes after the name.

extern const volatile _Atomic unsigned long long int * restrict foo[];

That’s not even getting into the syntax for types of function pointers, which might have arbitrary amounts of stuff after the variable name.

And then C++ came along with generics, which means a type name might also have other type names nested arbitrarily deep.

extern const volatile std::unordered_map<unsigned long long int, std::unordered_map<const long double * const, const std::vector<std::basic_string<char>>::const_iterator>> foo;

And that’s just a declaration! Imagine if there were an assignment in there too.

The great thing about static typing is that I know the types of all the variables, but that advantage is somewhat lessened if I can’t tell what the variables are.

Between type-first, function pointer syntax, Turing-complete duck-typed templates, and C++’s initialization syntax, there are several ways where parsing C++ is ambiguous or even undecidable! “Undecidable” here means that there exist C++ programs which cannot even be parsed into a syntax tree, because the same syntax means two different things depending on whether some expression is a value or a type, and that question can depend on an endlessly recursive template instantiation. (This is also a great example of ambiguity, where x * y(z) could be either an expression or a declaration.)

Contrast with, say, Rust:

let x: ... = ...;

This is easy to parse, both for a human and a computer. The thing before the colon must be a variable name, and it stands out immediately; the thing after the colon must be a type name. Even better, Rust has pretty good type inference, so the type is probably unnecessary anyway.

Of course, languages with no type declarations whatsoever are immune to this problem.

Most vexing: ACS, ALGOL, C#, D (though [] goes on the type), Fortran, Java, Perl 6.

Looks Lovely: Ada, Boo, F#, Go, Python 3 (via annotation syntax and the typing module), Rust, Swift, TypeScript.

Special mention: BASIC uses trailing type sigils to indicate scalar types.

Weak typing

Please note: this is not the opposite of static typing. Weak typing is more about the runtime behavior of values — if I try to use a value of type T as though it were of type U, will it be implicitly converted?

C lets you assign pointers to int variables and then take square roots of them, which seems like a bad idea to me. C++ agreed and nixed this, but also introduced the ability to make your own custom types implicitly convertible to as many other types as you want.

This one is pretty clearly a spectrum, and I don’t have a clear line. For example, I don’t fault Python for implicitly converting between int and float, because int is infinite-precision and float is 64-bit, so it’s usually fine. But I’m a lot more suspicious of C, which lets you assign an int to a char without complaint. (Well, okay. Literal integers in C are ints, which poses a slight problem.)

I do count a combined addition/concatenation operator that accepts different types of arguments as a form of weak typing.

Weak: JavaScript (+), PHP, Unix shells (almost everything’s a string, but even arrays/scalars are somewhat interchangeable).

Strong: F#, Go (explicit numeric casts), Haskell, Python, Rust (explicit numeric casts).

Special mention: ACS only has integers; even fixed-point values are stored in integers, and the compiler has no notion of a fixed-point type, making it the weakest language imaginable. C++ and Scala both allow defining implicit conversions, for better or worse. Perl 5 is weak, but it avoids most of the ambiguity by having entirely separate sets of operators for string vs numeric operations. Python 2 is mostly strong, but that whole interchangeable bytes/text thing sure caused some ruckus. Tcl only has strings.

Integer division

“Hey, new programmers!” you may find yourself saying. “Don’t worry, it’s just like math, see? Here’s how to use $LANGUAGE as a calculator.”

“Oh boy!” says your protégé. “Let’s see what 7 ÷ 2 is! Oh, it’s 3. I think the computer is broken.”

They’re right! It is broken. I have genuinely seen a non-trivial number of people come into #python thinking division is “broken” because of this.

To be fair, C is pretty consistent about making math operations always produce a value whose type matches one of the arguments. It’s also unclear whether such division should produce a float or a double. Inferring from context would make sense, but that’s not something C is really big on.

Quick test: 7 / 2 is 3½, not 3.

Integrous: bc, C#, D, expr, F#, Fortran, Go, OCaml, Python 2, Ruby, Rust (hard to avoid), Unix shells.

Afloat: awk (no integers), Clojure (produces a rational!), Groovy, JavaScript (no integers), Lua (no integers until 5.3), Nim, Perl 5 (no integers), Perl 6, PHP, Python 3.

Special mention: Haskell disallows / on integers. Nim, Haskell, Perl 6, Python, and probably others have separate integral division operators: div, div, div, and //, respectively.

Bytestrings

“Strings” in C are arrays of 8-bit characters. They aren’t really strings at all, since they can’t hold the vast majority of characters without some further form of encoding. Exactly what the encoding is and how to handle it is left entirely up to the programmer. This is a pain in the ass.

Some languages caught wind of this Unicode thing in the 90s and decided to solve this problem once and for all by making “wide” strings with 16-bit characters. (Even C95 has this, in the form of wchar_t* and L"..." literals.) Unicode, you see, would never have more than 65,536 characters.

Whoops, so much for that. Now we have strings encoded as UTF-16 rather than UTF-8, so we’re paying extra storage cost and we still need to write extra code to do basic operations right. Or we forget, and then later we have to track down a bunch of wonky bugs because someone typed a 💩.

Note that handling characters/codepoints is very different from handling glyphs, i.e. the distinct shapes you see on screen. Handling glyphs doesn’t even really make sense outside the context of a font, because fonts are free to make up whatever ligatures they want. Remember “diverse” emoji? Those are ligatures of three to seven characters, completely invented by a font vendor. A programming language can’t reliably count the display length of that, especially when new combining behaviors could be introduced at any time.

Also, it doesn’t matter how you solve this problem, as long as it appears to be solved. I believe Ruby uses bytestrings, for example, but they know their own encoding, so they can be correctly handled as sequences of codepoints. Having a separate non-default type or methods does not count, because everyone will still use the wrong thing first — sorry, Python 2.

Quick test: what’s the length of “💩”? If 1, you have real unencoded strings. If 2, you have UTF-16 strings. If 4, you have UTF-8 strings. If something else, I don’t know what the heck is going on.
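
For instance, Python 3 makes all three answers visible from one place:

s = "💩"
print(len(s))                           # 1: one code point
print(len(s.encode("utf-8")))           # 4: four UTF-8 bytes
print(len(s.encode("utf-16-le")) // 2)  # 2: two UTF-16 code units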

Totally bytes: Go, Lua, Python 2 (separate unicode type).

Comes up short: Java, JavaScript.

One hundred emoji: Python 3, Ruby, Rust, Swift (even gets combining characters right!).

Special mention: Go’s strings are explicitly arbitrary byte sequences, but iterating over a string with for..range decodes UTF-8 code points. Perl 5 gets the quick test right if you put use utf8; at the top of the file, but Perl 5’s Unicode support is such a confusing clusterfuck that I can’t really give it a 💯.

Hmm. This one is kind of hard to track down for sure without either knowing a lot about internals or installing fifty different interpreters/compilers.

Increment and decrement

I don’t think there are too many compelling reasons to have ++. It means the same as += 1, which is still nice and short. The only difference is that people can do stupid unreadable tricks with ++.

One exception: it is possible to overload ++ in ways that don’t make sense as += 1 — for example, C++ uses ++ to advance iterators, which may do any arbitrary work under the hood.

Double plus ungood: ACS, awk, C#, D, Go, Java, JavaScript, Perl, Unix shells, Vala.

Double plus good: Lua (which doesn’t have += either), Python, Ruby, Rust, Swift (removed in v3).

Special mention: Perl 5 and PHP both allow ++ on strings, in which case it increments letters or something, but I don’t know if much real code has ever used this.

!

A pet peeve. Spot the difference:

if (looks_like_rain()) {
    ...
}
if (!looks_like_rain()) {
    ...
}

That single ! is ridiculously subtle, which seems wrong to me when it makes an expression mean its polar opposite. Surely it should stick out like a sore thumb. The left parenthesis makes it worse, too; it blends in slightly as just noise.

It helps a bit to space after the ! in cases like this:

if (! looks_like_rain()) {
    ...
}

But this seems to be curiously rare. The easy solution is to just spell the operator not. At which point the other two might as well be and and or.
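
Python, for one, spells all three out, and the negation becomes much harder to overlook (looks_like_rain is the same placeholder as above):

if not looks_like_rain():
    ...

# and/or follow suit, so conditions read almost like English
if cold and not windy:
    ...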

Interestingly enough, C95 specifies and, or, not, and some others as standard alternative spellings, though I’ve never seen them in any C code and I suspect existing projects would prefer I not use them.

Not right: ACS, awk, C#, D, Go, Groovy, Java, JavaScript, Nemerle, PHP, R, Rust, Scala, Swift, Tcl, Vala.

Spelled out: Ada, ALGOL, BASIC, COBOL, Erlang, F#, Fortran, Haskell, Inform 7, Lisps, Lua, Nim, OCaml, Pascal, PostScript, Python, Smalltalk, Standard ML.

Special mention: APL and Julia both use ~, which is at least easier to pick out, which is more than I can say for most of APL. bc and expr, which are really calculators, have no concept of Boolean operations. Forth and Icon, which are not calculators, don’t seem to either. Inform 7 often blends the negation into the verb, e.g. if the player does not have.... Perl and Ruby have both symbolic and named Boolean operators (Perl 6 has even more), with different precedence (which inside if won’t matter); I believe Perl 5 prefers the words and Ruby prefers the symbols. Perl and Ruby also both have a separate unless block, with the opposite meaning to if. Python has is not and not in operators.

Single return and out parameters

Because C can only return a single value, and that value is often an indication of failure for the sake of an if, “out” parameters are somewhat common.

double x, y;
get_point(&x, &y);

It’s not immediately clear whether x and y are input or output. Sometimes they might function as both. (And of course, in this silly example, you’d be better off returning a single point struct. Or would you use a point out parameter because returning structs is potentially expensive?)

Some languages have doubled down on this by adding syntax to declare “out” parameters, which removes the ambiguity in the function definition, but makes it worse in function calls. In the above example, using & on an argument is at least a decent hint that the function wants to write to those values. If you have implicit out parameters or pass-by-reference or whatever, that would just be get_point(x, y) and you’d have no indication that those arguments are special in any way.

The vast majority of the time, this can be expressed in a more straightforward way by returning multiple values:

x, y = get_point()

That was intended as Python, but technically, Python doesn’t have multiple returns! It seems to, but it’s really a combination of several factors: a tuple type, the ability to make a tuple literal with just commas, and the ability to unpack a tuple via multiple assignment. In the end it works just as well. Also this is a way better use of the comma operator than in C.
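
Spelled out, the pieces look like this (the names are just for illustration):

def get_point():
    return 1, 2       # a tuple literal; the parentheses are optional

x, y = get_point()    # multiple assignment unpacks the tuple
point = get_point()   # or keep the whole tuple: point == (1, 2)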

But the exact same code could appear in Lua, which has multiple return/assignment as an explicit feature… and no tuples. The difference becomes obvious if you try to assign the return value to a single variable instead:

point = get_point()

In Python, point would be a tuple containing both return values. In Lua, point would be the x value, and y would be silently discarded. I don’t tend to be a fan of silently throwing data away, but I have to admit that Lua makes pretty good use of this in several places for “optional” return values that the caller can completely ignore if desired. An existing function can even be extended to return more values than before — that would break callers in Python, but work just fine in Lua.

(Also, to briefly play devil’s advocate: I once saw Python code that returned 14 values all with very complicated values, types, and semantics. Maybe don’t do that. I think I cleaned it up to return an object, which simplified the calling code considerably too.)

It’s also possible to half-ass this. ECMAScript 6:

function get_point() {
    return [1, 2];
}

var [x, y] = get_point();

It works, but it doesn’t actually look like multiple return. The trouble is that JavaScript has C’s comma operator and C’s variable declaration syntax, so neither of the above constructs could’ve left off the brackets without significantly changing the syntax:

function get_point() {
    // Whoops!  This uses the comma operator, which evaluates to its last
    // operand, so it just returns 2
    return 1, 2;
}

// Whoops!  This is multiple declaration, where each variable gets its own "=",
// so it assigns nothing to x and the return value to y
var x, y = get_point();
// Now x is undefined and y is 2

This is still better than either out parameters or returning an explicit struct that needs manual unpacking, but it’s not as good as comma-delimited tuples. Note that some languages require parentheses around tuples (and also call them tuples), and I’m arbitrarily counting that as better than brackets.

Single return: Ada, ALGOL, BASIC, C#, COBOL, Fortran, Groovy, Java, Smalltalk.

Half-assed multiple return: C++11, D, ECMAScript 6, Erlang, PHP.

Multiple return via tuples: F#, Haskell, Julia, Nemerle, Nim, OCaml, Perl (just lists really), Python, Ruby, Rust, Scala, Standard ML, Swift, Tcl.

Native multiple return: Common Lisp, Go, Lua.

Special mention: C# has explicit syntax for out parameters, but it’s a compile-time error to not assign to all of them, which is slightly better than C. Forth is stack-based, and all return values are simply placed on the stack, so multiple return isn’t a special case. Unix shell functions don’t return values. Visual Basic sets a return value by assigning to the function’s name (?!), so good luck fitting multiple return in there.

Silent errors

Most runtime errors in C are indicated by one of two mechanisms: returning an error code, or segfaulting. Segfaulting is pretty noisy, so that’s okay, except for the exploit potential and all.

Returning an error code kinda sucks. Those tend to be important, but nothing in the language actually reminds you to check them, and of course we silly squishy humans have the habit of assuming everything will succeed at all times. Which is how I segfaulted git two days ago: I found a spot where it didn’t check for a NULL returned as an error.

There are several alternatives here: exceptions, statically forcing the developer to check for an error code, or using something monad-like to statically force the developer to distinguish between an error and a valid return value. Probably some others. In the end I was surprised by how many languages went the exception route.
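
The appeal of exceptions is that doing nothing is the safe default: an unhandled error halts the program loudly instead of letting it barrel on with garbage. A minimal Python sketch, with an invented filename:

def read_config(path):
    # open() raises FileNotFoundError rather than returning an error code,
    # so a caller who forgets about failure gets a traceback, not silence.
    with open(path) as f:
        return f.read()

try:
    config = read_config("app.conf")
except FileNotFoundError:
    config = ""  # fall back to defaults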

Quietly wrong: Unix shells. Wow, yeah, I’m having a hard time naming anything else. Good job, us! And even Unix shells have set -e; it’s just opt-in.

Exceptional: Ada, C++, C#, D, Erlang, Forth, Java (exceptions are even part of function signature), JavaScript, Nemerle, Nim, Objective-C, OCaml, Perl 6, Python, Ruby, Smalltalk, Standard ML, Visual Basic.

Monadic: Haskell (Either), Rust (Result).

Special mention: ACS doesn’t really have many operations that can error, and those that do simply halt the script. ALGOL apparently has something called “mending” that I don’t understand. Go tends to use secondary return values, which calling code has to unpack, making them slightly harder to forget about; it also allows both the assignment and the error check together in the header of an if. Lisps have conditions and call/cc, which are different things entirely. Lua and Perl 5 handle errors by taking down the whole program, but offer a construct that can catch that further up the stack, which is clumsy but enough to emulate try..catch. PHP has exceptions, and errors (which are totally different), and a lot of builtin functions that return error codes. Swift has something that looks like exceptions, but it doesn’t involve stack unwinding and does require some light annotation — apparently sugar for an “out” parameter holding an error. Visual Basic, and I believe some other BASICs, decided C wasn’t bad enough and introduced the bizarre On Error Resume Next construct which does exactly what it sounds like.

Nulls

The billion dollar mistake.

I think it’s considerably worse in a statically typed language like C, because the whole point is that you can rely on the types. But a double* might be NULL, which is not actually a pointer to a double; it’s a pointer to a segfault. Other kinds of bad pointers are possible, of course, but those are more an issue of memory safety; allowing any reference to be null violates type safety. The root of the problem is treating null as a possible value of any type, when really it’s its own type entirely.

The alternatives tend to be either opt-in nullability or an “optional” generic type (a monad!) which eliminates null as its own value entirely. Notably, Swift does it both ways: optional types are indicated by a trailing ?, but that’s just syntactic sugar for Option<T>.
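
Python’s optional type annotations can at least express the idea, though nothing enforces it at runtime; a sketch using the standard typing module (find_user is invented):

from typing import Optional

def find_user(name: str) -> Optional[str]:
    # Returns a user ID, or None if no user matches.
    users = {"alice": "u1", "bob": "u2"}
    return users.get(name)

# A static checker such as mypy will complain if a caller ignores the
# None case; the interpreter itself won't.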

On the other hand, while it’s annoying to get a None where I didn’t expect one in Python, it’s not like I’m surprised. I occasionally get a string where I expected a number, too. The language explicitly leaves type concerns in my hands. My real objection is to having a static type system that lies. So I’m not going to list every single dynamic language here, because not only is it consistent with the rest of the type system, but they don’t really have any machinery to prevent this anyway.

Nothing doing: C#, D, Go, Java, Nim (non-nullable types are opt-in).

Nullable types: Swift (sugar for a monad).

Monads: F# (Option — though technically F# also inherits null from .NET), Haskell (Maybe), Rust (Option), Swift (Optional).

Special mention: awk, Tcl, and Unix shells only have strings, so in a surprising twist, they have no concept of null whatsoever. Java recently introduced an Optional<T> type which explicitly may or may not contain a value, but since it’s still a non-primitive, it could also be null. C++17 doesn’t quite have the same problem with std::optional<T>, since non-reference values can’t be null. Inform 7’s nothing value is an object (the root of half of its type system), which means any object variable might be nothing, but any value of a more specific type cannot be nothing. JavaScript has two null values, null and undefined. Perl 6 is really big on static types, but claims its Nil object doesn’t exist, and I don’t know how to even begin to unpack that. R and SQL have a more mathematical kind of NULL, which tends to e.g. vanish from lists.

Assignment as expression

How common a mistake is this:

if (x = 3) {
    ...
}

Well, I don’t know, actually. Maybe not that common, save for among beginners. But I sort of wonder whether allowing this buys us anything. I can only think of two cases where it does. One is with something like iteration:

// Typical linked list
while (p = p->next) {
    ...
}

But this is only necessary in C in the first place because it has no first-class notion of iteration. The other is shorthand for checking that a function returned a useful value:

if (ptr = get_pointer()) {
    ...
}

But if a function returns NULL, that’s really an error condition, and presumably you have some other way to handle that too.

What does that leave? The only time I remotely miss this in Python (where it’s illegal) is when testing a regex. You tend to see this a lot instead.

m = re.match('x+y+z+', some_string)
if m:
    ...

re treats failure as an acceptable possibility and returns None, rather than raising an exception. I’m not sure whether this was the right thing to do or not, but off the top of my head I can’t think of too many other Python interfaces that sometimes return None.

Freedom of expression: ACS, C#, Java, JavaScript, Perl, PHP, Swift.

Makes a statement: Inform 7, Lua, Python, Unix shells.

Special mention: BASIC uses = for both assignment and equality testing — the meaning is determined from context. D allows variable declaration as an expression, so if (int x = 3) is allowed, but regular assignment is not. Functional languages generally don’t have an assignment operator. Go disallows assignment as an expression, but assignment and a test can appear together in an if condition, and this is an idiomatic way to check success. Ruby makes everything an expression, so assignment might as well be too. Rust makes everything an expression, but assignment evaluates to the useless () value (due to ownership rules), so it’s not actually useful. Rust and Swift both have a special if let block that explicitly combines assignment with pattern matching, which is way nicer than the C approach.

No hyphens in identifiers

snake_case requires dancing on the shift key (unless you rearrange your keyboard, which is perfectly reasonable). It slows you down slightly and leads to occasional mistakes like snake-Case. The alternative is dromedaryCase, which is objectively wrong and doesn’t actually solve this problem anyway.

Why not just allow hyphens in identifiers, so we can avoid this argument and use kebab-case?

Ah, but then it’s ambiguous whether you mean an identifier or the subtraction operator. No problem: require spaces for subtraction. Preserving a tiny way to make your code harder to read hardly seems worth giving up this clear advantage.

Low scoring: ACS, C#, D, Java, JavaScript, OCaml, Pascal, Perl 5, PHP, Python, Ruby, Rust, Swift, Unix shells.

Nicely-designed: COBOL, CSS (and thus Sass), Forth, Inform 7, Lisps, Perl 6, XML.

Special mention: Perl has a built-in variable called $-, and Ruby has a few called $-n for various values of “n”, but these are very special cases.

Braces and semicolons

Okay. Hang on. Bear with me.

C code looks like this.

some block header {
    line 1;
    line 2;
    line 3;
}

The block is indicated two different ways here. The braces are for the compiler; the indentation is for humans.

Having two different ways to say the same thing means they can get out of sync. They can disagree. And that can be, as previously mentioned, really bad. This is really just a more general form of the problem of optional block delimiters.

The only solution is to eliminate one of the two. Programming languages exist for the benefit of humans, so we obviously can’t get rid of the indentation. Thus, we should get rid of the braces. QED.

As an added advantage, we reclaim all the vertical space wasted on lines containing only a }, and we can stop squabbling about where to put the {.

If you accept this, you might start to notice that there are also two different ways of indicating where a line ends: with semicolons for the compiler, and with vertical whitespace for humans. So, by the same reasoning, we should lose the semicolons.

Right? Awesome. Glad we’re all on the same page.

Some languages use keywords instead of braces, but the effect is the same. I’m not aware of any languages that use keywords instead of semicolons.

Bracing myself: C#, D, Erlang, Java, Perl, Rust.

Braces, but no semicolons: Go (ASI), JavaScript (ASI — see below), Lua, Ruby, Swift.

Free and clear: CoffeeScript, Haskell, Python.

Special mention: Lisp, just, in general. Inform 7 has an indented style, but it still requires semicolons. MUMPS doesn’t support nesting at all, but I believe there are extensions that use dots to indicate it.

Here’s some interesting trivia. JavaScript, Lua, and Python all optionally allow semicolons at the end of a statement, but the way each language determines line continuation is very different.

JavaScript takes an “opt-out” approach: it continues reading lines until it hits a semicolon, or until reading the next line would cause a syntax error. (This approach is called automatic semicolon insertion.) That leaves a few corner cases like starting a new line with a (, which could look like the last thing on the previous line is a function you’re trying to call. Or you could have -foo on its own line, and it would parse as subtraction rather than unary negation. You might wonder why anyone would do that, but using unary + is one way to make function parse as an expression rather than a statement! I’m not so opposed to semicolons that I want to be debugging where the language thinks my lines end, so I just always use semicolons in JavaScript.

Python takes an “opt-in” approach: it assumes, by default, that a statement ends at the end of a line. However, newlines inside parentheses or brackets are ignored, which takes care of 99% of cases — long lines are most frequently caused by function calls (which have parentheses!) with a lot of arguments. If you really need it, you can explicitly escape a newline with \, but this is widely regarded as incredibly ugly.
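
In practice that means long Python statements get wrapped by leaning on brackets rather than escapes (the names here are placeholders):

# Newlines inside the parentheses are ignored, so no escapes needed
total = compute(first_argument,
                second_argument,
                third_argument)

# The escape version works, but nobody wants to see it
total = 1 + 2 + \
        3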

Lua avoids the problem almost entirely. I believe Lua’s grammar is designed such that it’s almost always unambiguous where a statement ends, even if you have no newlines at all. This has a few weird side effects: void expressions are syntactically forbidden in Lua, for example, so you just can’t have -foo as its own statement. Also, you can’t have code immediately following a return, because it’ll be interpreted as a return value. The upside is that Lua can treat newlines just like any other whitespace, but still not need semicolons. In fact, semicolons aren’t statement terminators in Lua at all — they’re their own statement, which does nothing. Alas, Lua does have the same ( ambiguity as JavaScript (and parses it the same way), but I don’t think any of the others exist.

Oh, and the colons that Python has at the end of its block headers, like if foo:? As far as I can tell, they serve no syntactic purpose whatsoever. Purely aesthetic.

Blaming the programmer

Perhaps one of the worst misfeatures of C is the ease with which responsibility for problems can be shifted to the person who wrote the code. “Oh, you segfaulted? I guess you forgot to check for NULL.” If only I had a computer to take care of such tedium for me!

Clearly, computers can’t be expected to do everything for us. But they can be expected to do quite a bit. Programming languages are built for humans, and they ought to eliminate the sorts of rote work humans are bad at whenever possible. A programmer is already busy thinking about the actual problem they want to solve; it’s no surprise that they’ll sometimes forget some tedious detail the language forces them to worry about.

So if you’re designing a language, don’t just copy C. Don’t just copy C++ or Java. Hell, don’t even just copy Python or Ruby. Consider your target audience, consider the problems they’re trying to solve, and try to get as much else out of the way as possible. If the same “mistake” tends to crop up over and over, look for a way to modify the language to reduce or eliminate it. And be sure to look at a lot of languages for inspiration — even ones you hate, even weird ones no one uses! A lot of clever people have had a lot of other ideas in the last 44 years.


I hope you enjoyed this accidental cross-reference of several dozen languages! I enjoyed looking through them all, though it was incredibly time-consuming. Some of them look pretty interesting; maybe give them a whirl.

Also, dammit, now I’m thinking about language design again.

Iteration in one language, then all the others

Post Syndicated from Eevee original https://eev.ee/blog/2016/11/18/iteration-in-one-language-then-all-the-others/

You may have noticed that I like comparing features across different languages. I hope you like it too, because I’m doing it again.

Python

I’m most familiar with Python, and iteration is one of its major concepts, so it’s a good place to start and a good overview of iteration. I’ll dive into Python a little more deeply, then draw parallels to other languages.

Python only has one form of iteration loop, for. (Note that all of these examples are written for Python 3; in Python 2, some of the names are slightly different, and fewer things are lazy.)

for value in sequence:
    ...

in is also an operator, so value in sequence is also the way you test for containment. This is either very confusing or very satisfying.

When you need indices, or specifically a range of numbers, you can use the built-in enumerate or range functions. enumerate works with lazy iterables as well.

# This makes use of tuple unpacking to effectively return two values at a time
for index, value in enumerate(sequence):
    ...

# Note that the endpoint is exclusive, and the default start point is 0.  This
# matches how list indexing works and fits the C style of numbering.
# 0 1 2 3 4
for n in range(5):
    ...

# Start somewhere other than zero, and the endpoint is still exclusive.
# 1 2 3 4
for n in range(1, 5):
    ...

# Count by 2 instead.  Can also use a negative step to count backwards.
# 1 3 5 7 9
for n in range(1, 11, 2):
    ...

dicts (mapping types) have several methods for different kinds of iteration. Additionally, iterating over a dict directly produces its keys.

for key in mapping:
    ...

for key in mapping.keys():
    ...

for value in mapping.values():
    ...

for key, value in mapping.items():
    ...

Python distinguishes between an iterable, any value that can be iterated over, and an iterator, a value that performs the actual work of iteration. Common iterable types include list, tuple, dict, str, and set. enumerate and range are also iterable.

Since Python code rarely works with iterators directly, and many iterable types also function as their own iterators, it’s common to hear “iterator” used to mean an iterable. To avoid this ambiguity, and because the words are fairly similar already, I’ll refer to iterables as containers like the Python documentation sometimes does. Don’t be fooled — an object doesn’t actually need to contain anything to be iterable. Python’s range type is iterable, but it doesn’t physically contain all the numbers in the range; it generates them on the fly as needed.
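
That laziness is easy to observe: a huge range is cheap to create, and even containment checks don’t need to walk it.

r = range(10 ** 12)      # created instantly; nothing is stored
print(len(r))            # 1000000000000
print(999999999 in r)    # True, computed arithmetically, not by iterating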

The fundamental basics of iteration are built on these two ideas. Given a container, ask for an iterator; then repeatedly advance the iterator to get new values. When the iterator runs out of values, it raises StopIteration. That’s it. In Python, those two steps can be performed manually with the iter and next functions. A for loop is roughly equivalent to:

_iterator = iter(container)
_done = False
while not _done:
    try:
        value = next(_iterator)
    except StopIteration:
        _done = True
    else:
        ...

An iterator can only move forwards. Once a value has been produced, it’s lost, at least as far as the iterator is concerned. These restrictions are occasionally limiting, but they allow iteration to be used for some unexpected tasks. For example, iterating over an open file produces its lines — even if the “file” is actually a terminal or pipe, where data only arrives once and isn’t persistently stored anywhere.
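
A quick sketch of the file case, where each advance of the iterator reads another line (the filename is invented):

with open("server.log") as f:
    for line in f:           # lines are produced one at a time, lazily
        if "ERROR" in line:
            print(line, end="")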

Generators

A more common form of “only forwards, only once” in Python is the generator, a function containing a yield statement. For example:

def inclusive_range(start, stop):
    val = start
    while val <= stop:
        yield val
        val += 1

# 6 7 8 9
for n in inclusive_range(6, 9):
    ...

Calling a generator function doesn’t execute its code, but immediately creates a generator iterator. Every time the iterator is advanced, the function executes until the next yield, at which point the yielded value is returned as the next value and the function pauses. The next iteration will then resume the function. When the function returns (or falls off the end), the iterator stops.

Since the values here are produced by running code on the fly, it’s of course impossible to rewind a generator.

The underlying protocol is straightforward. A container must have an __iter__ method that returns an iterator, corresponding to the iter function. An iterator must have a __next__ method that returns the next item, corresponding to the next function. If the iterator is exhausted, __next__ must raise StopIteration. An iterator must also have an __iter__ that returns itself — this is so an iterator can be used directly in a for loop.

The above inclusive range generator might be written out explicitly like this:

class InclusiveRange:
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

    def __iter__(self):
        return InclusiveRangeIterator(self)

class InclusiveRangeIterator:
    def __init__(self, incrange):
        self.incrange = incrange
        self.nextval = incrange.start

    def __iter__(self):
        return self

    def __next__(self):
        if self.nextval > self.incrange.stop:
            raise StopIteration

        val = self.nextval
        self.nextval += 1
        return val

This might seem like a lot of boilerplate, but note that the iterator state (here, nextval) can’t go on InclusiveRange directly, because then it’d be impossible to iterate over the same object twice at the same time. (Some types, like files, do act as their own iterators because they can’t meaningfully be iterated in parallel.)
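
The payoff of keeping state on the iterator is that one container can drive several loops at once. With the class above:

r = InclusiveRange(1, 3)
for a in r:
    for b in r:       # a second, independent iterator over the same object
        print(a, b)   # (1, 1), (1, 2), ... (3, 3): nine pairs in all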

Even Python’s internals work this way. Try iter([]) in a Python REPL; you’ll get a list_iterator object.

In truth, it is a lot of boilerplate. User code usually uses this trick:

class InclusiveRange:
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

    def __iter__(self):
        val = self.start
        while val <= self.stop:
            yield val
            val += 1

Nothing about this is special-cased in any way. Now __iter__ is a generator, and calling a generator function returns an iterator, so all the constraints are met. It’s a really easy way to convert a generator function into a type. If this class were named inclusive_range instead, it would even be backwards-compatible; consuming code wouldn’t even have to know it’s a class.

Reversal

But why would you do this? One excellent reason is to add support for other sequence-like operations, like reverse iteration support. An iterator can’t be reversed, but a container might support being iterated in reverse:

fruits = ['apple', 'orange', 'pear']
# pear, orange, apple
for value in reversed(fruits):
    ...

Iterating a lazy container doesn’t always make sense, but when it does, it’s easy to implement by returning an iterator from __reversed__.

class InclusiveRange:
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

    def __iter__(self):
        val = self.start
        while val <= self.stop:
            yield val
            val += 1

    def __reversed__(self):
        val = self.stop
        while val >= self.start:
            yield val
            val -= 1

Note that Python does not have “bi-directional” iterators, which can freely switch between forwards and reverse iteration on the fly. A bidirectional iterator is useful for cases like doubly-linked lists, where it’s easy to get from one value to the next or previous value, but not as easy to start from the beginning and get the tenth item.

Iteration is often associated with sequences, though they’re not quite the same. In Python, a sequence is a value that can be indexed in order as container[0], container[1], etc. (Indexing is implemented with __getitem__.) All sequences are iterable; in fact, if a type implements indexing but not __iter__, the iter function will automatically try indexing it from zero instead. reversed does the same, though it requires that the type implement __len__ as well so it knows what the last item is.
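
The fallback is easy to demonstrate with a class that only implements indexing; iter and reversed both cope (the class is invented for illustration):

class Squares:
    def __getitem__(self, i):
        if i >= 5:
            raise IndexError  # ends iteration, playing the role of StopIteration
        return i * i

    def __len__(self):        # only needed for reversed()
        return 5

print(list(Squares()))            # [0, 1, 4, 9, 16]
print(list(reversed(Squares())))  # [16, 9, 4, 1, 0]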

Much of this is codified more explicitly in the abstract base classes in collections.abc, which also provide default implementations of common methods.

Not all iterables are sequences, and not every value that can be indexed is a sequence! Python’s mapping type, dict, uses indexing to fetch the value for a key; but a dict has no defined order and is not a sequence. However, a dict can still be iterated over, producing its keys (in arbitrary order). A set can be iterated over, producing its values in arbitrary order, but it cannot be indexed at all. A type could conceivably use indexing for something more unusual and not be iterable at all.

A common question

It’s not really related to iteration, but people coming to Python from Ruby often ask why len() is a built-in function, rather than a method. The same question could be asked about iter() and next() (and other Python builtins), which more or less delegate directly to a “reserved” __dunder__ method anyway.

I believe the technical reason is simply the order that features were added to the language in very early days, which is not very interesting.

The philosophical reason, imo, is that Python does not reserve method names for fundamental operations. All __dunder__ names are reserved, of course, but everything else is fair game. This makes it obvious when a method is intended to add support for some language-ish-level operation, even if you don’t know what all the method names are. Occasionally a third-party library invents its own __dunder__ name, which is a little naughty, but the same reasoning applies: “this is a completely generic interface that some external mechanism is expected to use”.

This approach also avoids a namespacing problem. In Ruby, a Rectangle class might want to have width and length attributes… but the presence of length means a Rectangle looks like it functions as a sequence! Since “interface” method names aren’t namespaced in any way, there is no way to say that you don’t mean the same thing as Array.length.

It’s a minor quibble, since everything’s dynamically typed anyway, so the real solution is “well don’t try to iterate a rectangle then”. And Python does use keys as a method name in some obscure cases. Oh, well.

Some cute tricks

The distinction between sequences and iterables can cause some subtle problems. A lot of code that only needs to loop over items can be passed, e.g., a generator. But this can take some conscious care. Compare:

# This will NOT work with generators, which don't support len() or indexing
for i in range(len(container)):
    value = container[i]
    ...

# But this will
for i, value in enumerate(container):
    ...

enumerate also has a subtle, unfortunate problem: it cannot be combined with reversed. This has bitten me more than once, surprisingly.

# This produces a TypeError from reversed()
for i, value in reversed(enumerate(container)):
    ...

# This almost works, but the index goes forwards while the values go backwards
for i, value in enumerate(reversed(container)):
    ...

The problem is that enumerate can’t, in general, reverse itself. It counts up from zero as it iterates over its argument; reversing it means starting from one less than the number of items, but it doesn’t yet know how many items there are. But if you just want to run over a list or other sequence backwards, this feels very silly. A trivial helper can make it work:

def revenum(iterable, end=0):
    start = len(iterable) + end
    for value in reversed(iterable):
        start -= 1
        yield start, value
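
Used on a sequence, it pairs each item with its original index while walking backwards:

fruits = ['apple', 'orange', 'pear']
# (2, 'pear'), (1, 'orange'), (0, 'apple')
for i, value in revenum(fruits):
    ...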

I’ve run into other odd cases where it’s frustrating that a generator doesn’t have a length or indexing. This especially comes up if you make heavy use of generator expressions, which are a very compact way to write a one-off generator. (Python also has list, set, and dict “comprehensions”, which have the same syntax but use brackets or braces instead of parentheses, and are evaluated immediately instead of lazily.)

def get_big_fruits():
    fruits = ['apple', 'orange', 'pear']
    return (fruit.upper() for fruit in fruits)

# Roughly equivalent to:
def get_big_fruits():
    fruits = ['apple', 'orange', 'pear']
    def genexp():
        for fruit in fruits:
            yield fruit.upper()
    return genexp()

If you had thousands of fruits, doing this could save a little memory. The caller is probably just going to loop over them to print them out (or whatever), so using a generator expression means that each uppercase name only exists for a short time; returning a list would mean creating a lot of values all at once.

Ah, but now the caller wants to know how many fruits there are, with minimal fuss. Generators have no length, so that won’t work. Turning this generator expression into a class that also has a __len__ would be fairly ridiculous. So you resort to some slightly ugly trickery.

# Ugh.  Obvious, but feels really silly.
count = 0
for value in container:
    count += 1

# Better, but weird if you haven't seen it before.  Creates another generator
# expression that just yields 1 for every item, then sums them up.
count = sum(1 for _ in container)

Or perhaps you want the first big fruit? Well, [0] isn’t going to help. This is one of the few cases where using iter and next directly can be handy.

# Oops!  If the container is empty, this raises StopIteration, which you
# probably don't want.
first = next(iter(container))

# Catch the StopIteration explicitly.
try:
    first = next(iter(container))
except StopIteration:
    # This code runs if there are zero items
    ...

# Regular loop that terminates immediately.
# The "else" clause only runs when the container ends naturally (i.e. NOT if
# the loop breaks), which can only happen here if there are zero items.
for value in container:
    first = value
    break
else:
    ...

# next() -- but not __next__()! -- takes a second argument indicating a
# "default" value to return when the iterator is exhausted.  This only makes
# sense if you were going to substitute a default value anyway; doing this and
# then checking for None will do the wrong thing if the container actually
# contained a None.
first = next(iter(container), None)

Other tricks with iter and next include skipping the first item (or any number of initial items, though consider itertools.islice for more complex cases):

it = iter(container)
next(it, None)  # Use second arg to ignore StopIteration
for value in it:
    # Since the first item in the iterator has already been consumed, this loop
    # will start with the second item.  If the container had only one or zero
    # items, the loop will get StopIteration and end immediately.
    ...
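
The itertools.islice version mentioned above handles the general case; skipping the first three items, for example:

import itertools

# islice(iterable, start, stop); a stop of None means "to the end"
for value in itertools.islice(container, 3, None):
    ...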

Iterating two (or more) items at a time:

# Obvious way: call next() inside the loop.
it = iter(container)
for value1 in it:
    # With an odd number of items, this will raise an uncaught StopIteration!
    # Catch it or provide a default value.
    value2 = next(it)
    ...

# Moderately clever way: abuse zip().
# zip() takes some number of containers and iterates over them pairwise.  It
# stores an iterator for each container.  When it's asked for its next item, it
# in turn asks all of its iterators for their next items, and returns them as a
# tuple.  But by giving it the same exact iterator twice, it'll end up advancing
# that iterator twice and returning two consecutive items.
# Note that zip() stops early as soon as an iterator runs dry, so if the
# container has an odd number of items, this will silently skip the last one.
# If you don't want that, use itertools.zip_longest instead.
it = iter(container)
for line1, line2 in zip(it, it):
    ...

# Far too clever way: exactly the same as above, but written as a one-liner.
# zip(iter(), iter()) would create two separate iterators and break the trick.
# List multiplication produces a list containing the same iterator twice.
# One advantage of this is that the 2 can be a variable.
for value1, value2 in zip(*[iter(container)] * 2):
    ...

Wow, that got pretty weird towards the end. Somehow this turned into Stupid Python Iterator Tricks. Don’t worry; I know far less about these other languages.

C

C is an extreme example with no iterator protocol whatsoever. It barely even supports sequences; arrays are just pointer math. All it has is the humble C-style for loop:

int container[] = {...};
for (int i = 0; i < container_length; i++) {
    int value = container[i];
    ...
}

Unfortunately, it’s really the best C can do. C arrays don’t know their own length, so no matter what, the developer has to provide it some other way. Even without that, a built-in iterator protocol is impossible — iterators require persistent state (the current position) to be bundled alongside code (how to get to the next position). That pretty much means one of two things: closures or objects. C has neither.

Lua

Lua has two forms of for loop. The first is a simple numeric loop.

-- 1 3 5 7 9 11
for value = 1, 11, 2 do
    ...
end

The three values after the = are the start, end, and step. They work similarly to Python’s range(), except that everything in Lua is always inclusive, so for i = 1, 5 will count from 1 to 5.

The generic form uses in.

for value in iterate(container) do
    ...
end

iterate isn’t a special name here, but most of the time a generic for will look like this.

See, Lua doesn’t have objects. It has enough tools that you can build objects fairly easily, but the core language has no explicit concept of objects or method calls. An iterator protocol needs to bundle state and behavior somehow, so Lua uses closures for that. But you still need a way to get that closure, and that means calling a function, and a plain value can’t have functions attached to it. So iterating over a table (Lua’s single data structure) looks like this:

for key, value in pairs(container) do
    ...
end

pairs is a built-in function. Lua also has an ipairs, which iterates over consecutive keys and values starting from key 1. (Lua starts at 1, not 0. Lua also represents sequences as tables with numeric keys.)

Lua does have a way to associate “methods” with values, which is how objects are made, but for loops almost certainly came first. So iteration is almost always over a function call, not a bare value.

Also, because objects are built out of tables, having a default iteration behavior for all tables would mean having the same default for all objects. Nothing’s stopping you from using pairs on an object now, but at least that looks deliberate. It’s easy enough to give objects iteration methods and iterate over obj:iter(), though it’s slightly unfortunate that every type might look slightly different. Unfortunately, Lua has no truly generic interface for “this can produce a sequence of values”.

The iteration protocol is really just calling a function repeatedly to get new values. When the function returns nil, the iteration ends. (That means nil can never be part of an iteration! You can work around this by returning two values and making sure the first one is something else that’s never nil, like an index.) The manual explains the exact semantics of the generic for with Lua code, a move I wish every language would make.

-- This:
for var_1, ···, var_n in explist do block end

-- Is equivalent to this:
do
    local _func, _state, _lastval = explist
    while true do
        local var_1, ···, var_n = _func(_state, _lastval)
        if var_1 == nil then break end
        _lastval = var_1
        block
    end
end

Important to note here is the way multiple-return works in Lua. Lua doesn’t have tuples; multiple assignment is a distinct feature of the language, and multiple return works exactly the same way as multiple assignment. If there are too few values, the extra variables become nil; if there are too many values, the extras are silently discarded.

So in the line local _func, _state, _lastval = explist, the “state” value _state and the “last loop value” _lastval are both optional. Lua doesn’t use them, except to pass them back to the iterator function _func, and they aren’t visible to the for loop body. An iterator can thus be only a function and nothing else, letting _state and _lastval be nil — but they can be a little more convenient at times. Compare:

-- Usual approach: return only a closure, completely ignoring state and lastval
local function inclusive_range(start, stop)
    local nextval = start
    return function()
        if nextval > stop then
            return
        end
        local val = nextval
        nextval = nextval + 1
        return val
    end
end

-- Alternative approach, not using closures at all.  This is the function we
-- return; each time it's called with the same "state" value and whatever it
-- returned last time it was called.
-- This function could even be written exactly like a method (a la Python's
-- __next__), where the state value is the object itself.
local function inclusive_range_iter(stop, prev)
    -- "stop" is the state value; "prev" is the last value we returned
    local val = prev + 1
    if val > stop then
        return
    end
    return val
end
local function inclusive_range(start, stop)
    -- Return the iterator function, and pass it the stop value as its state.
    -- The "last value" is a little weird here; on the first iteration, there
    -- is no last value.  Here we can fake it by subtracting 1 from the
    -- starting number, but in other cases, it might make more sense if the
    -- "state" were a table containing both the start and stop values.
    return inclusive_range_iter, stop, start - 1
end

-- 6 7 8 9 with both implementations
for n in inclusive_range(6, 9) do
    ...
end

Lua doesn’t have generators. Surprisingly, it has fully-fledged coroutines — call stacks that can be paused at any time. Lua sometimes refers to them as “threads”, but only one can be running at a time. Effectively they’re like Python generators, except you can call a function which calls a function which calls a function which eventually yields, and the entire call stack from that point up to the top of the coroutine is paused and preserved.

In Python, the mere presence of yield causes a function to become a generator. In Lua, since any function might try to yield the coroutine it’s currently in, a function has to be explicitly called as a coroutine using functions in the coroutine library.

But this post is about iterators, not coroutines. Coroutines don’t function as iterators, but Lua provides a coroutine.wrap() that takes a function, turns it into a coroutine, and returns a function that resumes the coroutine. That’s enough to allow a coroutine to be turned into an iterator. The Lua book even has a section about this.

local function inclusive_range(start, stop)
    local val = start
    while val <= stop do
        coroutine.yield(val)
        val = val + 1
    end
end
-- Unfortunately, coroutine.wrap() doesn't have any way to pass initial
-- arguments to the function it wraps, so we need this dinky wrapper.
-- I should clarify that the ... here is literal syntax for once.
local function iter_coro(entry_point, ...)
    local args = {...}
    return coroutine.wrap(function()
        entry_point(unpack(args))
    end)
end

-- 6 7 8 9
for n in iter_coro(inclusive_range, 6, 9) do
    ...
end

So, that’s cool. Lua doesn’t do a lot for you — unfortunately, list processing tricks can be significantly more painful in Lua — but it has some pretty interesting primitives that compose with each other remarkably well.

Perl 5

Perl has a very straightforward C-style for loop, which looks and works exactly as you might expect. my, which appears frequently in these examples, is just local variable declaration.

for (my $i = 0; $i < 10; $i++) {
    ...
}

Nobody uses it. Everyone uses the iteration-style for loop. (It’s occasionally called foreach, which is extra confusing because both for and foreach can be used for both kinds of loop. Nobody actually uses the foreach keyword.)

for my $value (@container) {
    ...
}

The iteration loop can be used for numbers, as well, since Perl has a .. inclusive range operator. For iterating over an array with indexes, Perl has the slightly odd $#array syntax, which is the index of the last item in @array. Creating something like Python’s enumerate is a little tricky in Perl, because you can’t directly return a list of lists, and the workaround doesn’t support unpacking. It’s complicated.

for my $i (1..10) {
    ...
}

for my $index (0..$#array) {
    my $value = $array[$index];
    ...
}

A hash (Perl’s mapping “shape”) can’t be iterated directly. Or, well, it can, but the loop will alternate between keys and values because Perl is weird. Instead you need the keys or values built-in functions to get the keys or values as regular lists. (These functions also work on arrays as of Perl 5.12.)

for my $key (keys %container) {
    ...
}

For iterating over both keys and values at the same time, Perl has an each function. The behavior is a little weird, since every call to the function advances an internal iterator inside the hash and returns a new pair. If a loop using each terminates early, the next use of each may silently start somewhere in the middle of the hash, skipping a bunch of its keys. This is probably why I’ve never seen each actually used.

while (my ($key, $value) = each %container) {
    ...
}

Despite being very heavily built on the concept of lists, Perl doesn’t have an explicit iterator protocol, and its support for lazy iteration in general is not great. When they’re used at all, lazy iterators tend to be implemented as ad-hoc closures or callable objects, which require a while loop:

my $iter = custom_iterator($collection);
while (my $value = $iter->()) {
    ...
}

Here be dragons

It is possible to sorta-kinda fake an iterator protocol. If you’re not familiar, Perl’s variables come in several different “shapes” — hash, array, scalar — and it’s possible to “tie” a variable to a backing object which defines the operations for a particular shape. It’s a little like operator overloading, except that Perl also has operator overloading and it’s a completely unrelated mechanism. In fact, you could use operator overloading to make your object return a tied array when dereferenced as an array. I am talking gibberish now.

Anyway, the trick is to tie an array and return a new value for each consecutive fetch of an index. Like so:

use v5.12;
package ClosureIterator;

# This is the tie "constructor" and just creates a regular object to store
# our state
sub TIEARRAY {
    my ($class, $closure) = @_;
    my $self = {
        closure => $closure,
        nextindex => 0,
    };
    return bless $self, $class;
}

# This is called to fetch the item at a particular index; for an iterator,
# only the next item is valid
sub FETCH {
    my ($self, $index) = @_;

    if ($index == 0) {
        # Always allow reading index 0, both to mean a general "get next
        # item" and so that looping over the same array twice will work as
        # expected
        $self->{nextindex} = 0;
    }
    elsif ($index != $self->{nextindex}) {
        die "ClosureIterator does not support random access";
    }

    $self->{nextindex}++;
    return $self->{closure}->();
}

# The built-in shift() function means "remove and return the first item", so
# it's a good fit for a general "advance iterator"
sub SHIFT {
    my ($self) = @_;
    $self->{nextindex} = 0;
    return $self->{closure}->();
}

# Yes, an array has to be able to report its own size...  but luckily, a for
# loop fetches the size on every iteration!  As long as this returns
# increasingly large values, such a loop will continue indefinitely
sub FETCHSIZE {
    my ($self) = @_;
    return $self->{nextindex} + 1;
}

# Most other tied array operations are for modifying the array, which makes no
# sense here.  They're deliberately omitted, so trying to use them will cause a
# "can't locate object method" error.


package main;

# Create an iterator that yields successive powers of 2
tie my @array, 'ClosureIterator', sub {
    # State variables are persistent, like C statics
    state $next = 1;
    my $ret = $next;
    $next *= 2;
    return $ret;
};

# This will print out 1, 2, 4, 8, ... 1024, at which point the loop breaks
for my $i (@array) {
    say $i;
    last if $i > 1000;
}

This transparently works like any other array… sort of. You can loop over it (forever!); you can use shift to pop off the next value; you can stop a loop and then continue reading from it later.

Unfortunately, this is just plain weird, even for Perl, and I very rarely see it used. Ultimately, Perl’s array operations come in a set, and this is an array that pretends not to be able to do half of them. Even Perl developers are likely to be surprised by an array, a fundamental “shape” of the language, with quirky behavior.

The biggest problem is that, as I said, Perl is heavily built on lists. Part of that design is that @arrays are very eager to spill their contents into a surrounding context. Naïvely passing an array to a function, for example, will expand its elements into separate arguments, losing the identity of the array itself (and losing any tied-ness). Interpolating an array into a string automatically space-separates its elements.

Unlike a for loop, these operations only ask the array for its size once — so rather than printing an infinite sequence, they’ll print a completely arbitrary prefix of it. In the case above, spilling a fresh array will read one item; spilling the array after the example loop will read eleven items. So while a tied array works nicely with a for loop, it’s at odds with the most basic rules of Perl syntax.

Also, Perl’s list-based nature means it’s attracted a lot of list-processing utilities — but these naturally expect to receive a spilled list of arguments and cannot work with a lazy iterator.

I found multiple mentions of the List::Gen module while looking into this. I’d never heard of it before and I’ve never seen it used, but it tries to fill this gap (and makes use of array tying, among other things). It’s a bit weird, and its source code is extremely weird, and it took me twenty minutes to figure out how it was using <...> as a quoting construct.

(<...> in Perl does filename globbing, so it’s usually seen as <*.txt>. The same syntax is used for reading from a filehandle, which makes this confusing and ambiguous, so it’s generally discouraged in favor of the built-in glob function which does the same thing. Well, it turns out that <...> must just call glob() at Perl-level, because List::Gen manages to co-opt this syntax simply by exporting its own glob function. Perl is magical.)

Perl 6

Perl 6, a mad experiment to put literally every conceivable feature into one programming language, naturally has a more robust concept of iteration.

At first glance, many of the constructs are similar to those of Perl 5. The C-style for loop still exists for some reason, but has been disambiguated under the loop keyword.

loop (my $i = 1; $i <= 10; $i++) {
    ...
}

# More interestingly, loop can be used completely bare for an infinite loop
loop {
    ...
}

The for block has slightly different syntax and a couple new tricks.

# Unlike in Perl 5, $value is automatically declared and scoped to the block,
# without needing an explicit 'my'
for @container -> $value {
    ...
}

for 1..10 -> $i {
    ...
}

# This doesn't iterate in pairs; it reads two items at a time from a flat list!
for 1..10 -> $a, $b {
    ...
}

Not apparent in the above code is that ranges are lazy in Perl 6, as in Python; the elements are computed on demand. In fact, Perl 6 supports a range like 1..Inf.

Loop variables are also aliases. By default they’re read-only, so this appears to work like Python… but Perl has always had a C-like language-level notion of “slots” that Python does not, and it becomes apparent if the loop variable is made read-write:

my @fruits = «apple orange pear»;
for @fruits -> $fruit is rw {
    # This is "apply method inplace", i.e. shorthand for:
    # $fruit = $fruit.uc;
    # Yes, you can do that.
    $fruit .= uc;
}
say @fruits;  # APPLE ORANGE PEAR

For iterating with indexes, there’s a curious idiom:

# ^Inf is shorthand for 0..Inf, read as "up to Inf".
# Z is the zip operator, which interleaves its arguments' elements into a
# single flat list.
# This makes use of the "two at a time" trick from above.
for ^Inf Z @array -> $index, $value {
    ...
}

Iterating hashes is somewhat simpler; hashes have methods, and the .kv method returns the keys and values. (It actually returns them in a flat list interleaved, which again uses “two at a time” syntax. If you only use a single loop variable, your loop iterations will alternate between a key and a value. Iterating a hash directly produces pairs, which are a first-class data type in Perl 6, but I can’t find any syntax for directly unpacking a pair within a loop header.)

for %container.kv -> $key, $value {
    ...
}

# No surprises here
for %container.keys -> $key {
    ...
}
for %container.values -> $value {
    ...
}

Perl 6 is very big on laziness, which is perhaps why it took fifteen years to see a release. It has the same iterable versus iterator split as Python. Given a container (iterable), ask for an iterator; given an iterator, repeatedly ask for new values. When the iterator is exhausted, it returns the IterationEnd sentinel. Exactly the same ideas. I’m not clear on the precise semantics of the for block and can’t find a simple reference, but they’re probably much like Python’s… plus a thousand special cases.

Generators, kinda

Perl 6 also has its own version of generators, though with a few extra twists. Curiously, generators are a block called gather, rather than a kind of function — this means that a one-off gather is easier to create, but a gather factory must be explicitly wrapped in a function. gather can even take a single expression rather than a block, so there’s no need for separate “generator expression” syntax as in Python.

sub inclusive-range($start, $stop) {
    return gather {
        my $val = $start;
        while $val <= $stop {
            take $val;
            $val++;
        }
    };
}

# 6 7 8 9
for inclusive-range(6, 9) -> $n {
    ...
}

Unlike Python’s yield, Perl 6’s take is dynamically scoped — i.e., take can be used anywhere in the call stack, and it will apply to the most recent gather caller. That means arbitrary-depth coroutines, which seems like a big deal to me, but the documentation mentions it almost as an afterthought.
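
For comparison, here's a hypothetical Python sketch (mine, not from the docs) of why that dynamic scoping matters: Python's yield only suspends the innermost function, so a helper can't produce values on a caller's behalf without explicit delegation via yield from.

def helper():
    # This yield makes helper() itself a generator; it can't emit values
    # on behalf of whoever called it.
    yield 1

def gen():
    helper()             # creates and discards a generator; yields nothing
    yield from helper()  # explicit delegation is required
    yield 2

print(list(gen()))  # [1, 2]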

The documentation also says gather/take “can generate values lazily, depending on context,” but neglects to clarify how context factors in. The code I wrote above turns out to be lazy, but this ambiguity inclines me to use the explicit lazy marker everywhere.

Ultimately it’s a pretty flexible feature, but has a few quirks that make it a bit clumsier to use as a straightforward generator. Given that the default behavior is an eagerly-evaluated block, I think the original intention was to avoid the slightly unsatisfying pattern of “push onto an array every iteration through a loop” — instead you can now do this:

my @results = gather {
    for @source-data -> $datum {
        next unless some-test($datum);
        take process($datum);
    }
};

Using a simple (syntax-highlighted!) take puts the focus on the value being produced, rather than on the mechanics of where it ends up and how it gets there. It’s an interesting idea, and I’m surprised I’ve never seen it demonstrated this way.
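
For comparison, the nearest Python spellings of that loop: the manual append pattern gather is avoiding, and the comprehension shorthand. (The helper names are stand-ins for the hypothetical ones in the Perl example above.)

# Stand-ins for the example's hypothetical helpers
source_data = range(10)
some_test = lambda d: d % 2 == 0
process = lambda d: d * d

# The pattern gather avoids: push onto a list every iteration
results = []
for datum in source_data:
    if some_test(datum):
        results.append(process(datum))

# Python's own shorthand for the same shape
results = [process(d) for d in source_data if some_test(d)]
print(results)  # [0, 4, 16, 36, 64]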

With gather and some abuse of Perl’s exceptionally compactable syntax, I can write a much shorter version of the infinite Perl 5 iterator above.

my @powers-of-two = lazy gather take (state $n = 1) *= 2 for ^Inf;

# Binds to $_ by default
for @powers-of-two {
    # Method calls are on $_ by default
    .say;
    last if $_ > 1000;
}

It’s definitely shorter, I’ll give it that. Leaving off the lazy in this case causes an infinite loop as Perl tries to evaluate the entire list; using a $ instead of a @ produces a “Cannot .elems a lazy list” error; using $ without lazy prints a ...-terminated representation of the infinite list and then hangs forever. I don’t quite understand the semantics of stuffing a list into a scalar ($) variable in Perl 6, and to be honest the list/array semantics seem to be far more convoluted than Perl 5, so I have no idea what’s going on here. Perl 6 has a lot of fascinating toys that are very easy to use incorrectly.

Nuts and bolts

Iterables and iterators are encoded explicitly as the Iterable and Iterator roles. An Iterable has an .iterator method that should return an Iterator. An Iterator has a .pull-one method that returns the next value, or the IterationEnd sentinel when the iterator is exhausted. Both roles offer several other methods, but they have suitable default implementations.
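
As an aside, Python mostly signals exhaustion with the StopIteration exception rather than a sentinel, but it does have one sentinel-based corner: the two-argument form of iter() calls a function repeatedly until it returns the sentinel value.

import functools
import random

# Roll a die until it comes up 6; the 6 itself is swallowed as the sentinel.
roll = functools.partial(random.randint, 1, 6)
for r in iter(roll, 6):
    print(r)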

inclusive-range might be transformed into a class thusly:

class InclusiveRangeIterator does Iterator {
    has $.range is required;
    has $!nextval = $!range.start;

    method pull-one() {
        if $!nextval > $!range.stop {
            return IterationEnd;
        }

        # Perl people would probably phrase this:
        # ++$!nextval
        # and they are wrong.
        my $val = $!nextval;
        $!nextval++;
        return $val;
    }
}

class InclusiveRange does Iterable {
    has $.start is required;
    has $.stop is required;

    # Don't even ask
    method new($start, $stop) {
        self.bless(:$start, :$stop);
    }

    method iterator() {
        InclusiveRangeIterator.new(range => self);
    }
}

# 6 7 8 9
for InclusiveRange.new(6, 9) -> $n {
    ...
}

Can we use gather to avoid the need for an extra class, just as in Python? We sure can! The only catch is that Perl 6 iterators don’t also pretend to be iterables (remember, in Python, iter(it) should produce it), so we need to explicitly return a gather block’s iterator.

class InclusiveRange does Iterable {
    has $.start is required;
    has $.stop is required;

    # Don't even ask
    method new($start, $stop) {
        self.bless(:$start, :$stop);
    }

    method iterator() {
        gather {
            my $val = $!start;
            while $val <= $!stop {
                take $val;
                $val++;
            }
        }.iterator;  # <- this is important
    }
}

For sequences, Perl 6 has the Seq type. Curiously, even an infinite lazy gather is still a Seq. Indexing and length are not part of Seq — both are implemented as separate methods.

Curiously, even though Perl 6 became much stricter overall, the indexing methods don’t seem to be part of a role; you need only define them, much like Python’s __dunder__ methods. In fact, in the preceding examples, does Iterator isn’t necessary at all; the for block will blindly try to call an iterator method and doesn’t much care where it came from.
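
That's much like Python, where a for loop only cares that iter() can find an __iter__ method; even the Iterable ABC recognizes the method structurally rather than by inheritance. A tiny sketch:

from collections.abc import Iterable

class Sorta:
    # No base class, no registration; defining __iter__ is all it takes.
    def __iter__(self):
        return iter([1, 2, 3])

print(list(Sorta()))                  # [1, 2, 3]
print(isinstance(Sorta(), Iterable))  # True: the ABC just checks for __iter__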

I’m sure there are plenty of cute tricks possible with Perl 6, but, er, I’ll leave those as an exercise for the reader.

Ruby

Ruby is a popular and well-disguised Perl variant: roughly, what Perl would look like if it went completely all-in on Smalltalk. It has no C-style for, but it does have an infinite loop block and a very Python-esque for:

for value in sequence do
    ...
end

Nobody uses this. No, really, the core language documentation outright says:

The for loop is rarely used in modern ruby programs.

Instead, you’ll probably see this:

sequence.each do |value|
    ...
end

It doesn’t look it, but this is completely backwards from everything seen so far. All of these other languages have used external iterators, where an object is repeatedly asked to produce values and calling code can do whatever it wants with them. Here, something very different is happening. The entire do ... end block acts as a closure whose argument is value; it’s passed to the each method, which calls it once for each value in the sequence. This is an internal iterator.
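
To make the inversion concrete, here's internal iteration spelled in Python (a sketch of the idea, not anything Ruby actually does): the collection owns the loop and calls your function, instead of handing you an iterator to pull from.

def each(seq, fn):
    # The loop lives here; calling code only supplies the body.
    for value in seq:
        fn(value)

each([1, 2, 3], print)  # prints 1, 2, 3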

“Pass a block to a function which can then call it a lot” is a built-in syntactic feature of Ruby, so these kinds of iterators are fairly common. The upside is that they look almost like a custom block, so they fit naturally with the language. The downside is that all of these block-accepting methods are implemented on Array, rather than as generic functions: bsearch, bsearch_index, collect, collect!, combination, count, cycle, delete, delete_if, drop_while, each, each_index, fetch, fill, find_index, index, keep_if, map, map!, permutation, product, reject, reject!, repeated_combination, repeated_permutation, reverse_each, rindex, select, select!, sort, sort!, sort_by!, take_while, uniq, uniq!, zip. Some of those, as well as a number of additional methods, are provided by the Enumerable mixin, which can express them in terms of each. I suppose the other upside is that any given type can provide its own more efficient implementation of these methods, if it so desires.
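
That Enumerable arrangement is easy to imitate, which might make it clearer. Here's a minimal Python sketch (hypothetical names) of a mixin that derives richer methods from a single each(), in the same spirit:

class EnumerableLike:
    # Everything here is defined in terms of each(), which the mixing-in
    # class must provide.
    def collect(self, fn):
        results = []
        self.each(lambda v: results.append(fn(v)))
        return results

    def count(self):
        return len(self.collect(lambda v: v))

class Triple(EnumerableLike):
    def each(self, fn):
        for v in (1, 2, 3):
            fn(v)

print(Triple().collect(lambda v: v * 10))  # [10, 20, 30]
print(Triple().count())                    # 3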

I guess that huge list of methods answers most questions about how to iterate over indices or in reverse. The only missing bit is that .. range syntax exists in Ruby as well, producing Range objects that also have an each method. And if you just need to do something n times and don’t care about a container at all, there’s the cute 3.times method.

Ruby blocks are a fundamental part of the language and built right into the method-calling syntax. Even break is defined in terms of blocks, and it works with an argument!

# This just doesn't feel like it should work, but it does.  Prints 17.
# Braces are conventionally used for inline blocks, but do/end would work too.
primes = [2, 3, 5, 7, 11, 13, 17, 19]
puts primes.each { |p| break p if p > 16 }

each() doesn’t need to do anything special here; break will just cause its return value to be 17. Somehow. (Honestly, this is the sort of thing that makes me wary of Ruby; it seems so ad-hoc and raises so many questions. A language keyword that changes the return value of a different function? Does the inside of each() know about this or have any control over it? How does it actually work? Is there any opportunity for cleanup? I have no idea, and the documentation doesn’t seem to think this is worth commenting on.)
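
Python has no value-carrying break, for what it's worth; the closest spelling pushes the condition into a generator expression and pulls a single item out of it.

primes = [2, 3, 5, 7, 11, 13, 17, 19]
# next() pulls the first matching item and stops consuming immediately.
print(next(p for p in primes if p > 16))  # 17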

Using blocks

Anyway, with block-passing as a language feature, the “iterator protocol” is pretty straightforward: just write a method that takes a block.

def each
    for value in self do
        yield value
    end
end

Be careful! Though it’s handy for iteration, that yield is not the same as Python’s yield. Ruby’s yield calls the passed-in block — yields control back to the caller — with the given value(s).

I pulled a dirty trick there, because I expressed each in terms of for. So how does for work? Well, ah, it just delegates to each. Oops!

How, then, do you write an iterator completely from scratch? The obvious way is to use yield repeatedly. That gives you something that looks rather a lot like Python, though it doesn’t actually pause execution.

class InclusiveRange
    # This gets you a variety of other iteration methods, all defined in
    # terms of each()
    include Enumerable

    def initialize(start, stop)
        @start = start
        @stop = stop
    end
    def each
        val = @start
        while val <= @stop do
            yield val
            val += 1
        end
    end
end

# 6 7 8 9
# A `for` loop would also work here
InclusiveRange.new(6, 9).each do |n|
    ...
end

Enumerators

Well, that’s nice for creating a whole collection type, but what if I want an ad-hoc custom iterator? Enter the Enumerator class, which allows you to create… ah, enumerators.

Note that the relationship between Enumerable and Enumerator is not the same as the relationship between “iterable” and “iterator”. Most importantly, neither is really an interface. Enumerable is a set of common iteration methods that any collection type may want to have, and it expects an each to exist. Enumerator is a generic collection type, and in fact mixes in Enumerable. Maybe I should just show you some code.

def inclusive_range(start, stop)
    Enumerator.new do |y|
        val = start
        while val <= stop do
            y.yield val
            val += 1
        end
    end
end

# 6 7 8 9
inclusive_range(6, 9).each do |n|
    puts n
end

Enumerator turns a block into a fully-fledged data stream. The block is free to do whatever it wants, and whenever it wants to emit a value, it calls y.yield value. The y argument is a “yielder” object, an opaque magic type; y.yield is a regular method call, unrelated to the yield keyword. (y << value is equivalent; << is Ruby’s “append” operator. And also, yes, bit shift.)

The amazing bit is that you can do this:

# 6
puts inclusive_range(6, 9).first

Enumerator has all of the Enumerable methods, one of which is first. So, that’s nice.

The really amazing bit is that if you stick some debugging code into the block passed to Enumerator.new, you’ll find that… the values are produced lazily. That call to first() doesn’t generate the full sequence and then discard everything after the first item; it only generates the first item, then stops.

(Beware! The values are produced lazily, but many Enumerable methods are eager. I’ll get back to this in a moment.)

Hang on, didn’t I say yield doesn’t pause execution? Didn’t I also say the above yield is just a method call, not the keyword?

I did! And I wasn’t lying. The really truly amazing bit, which I’ve seen shockingly little excitement about while researching this, is that under the hood, this is all using Fibers. Coroutines.

Enumerator.new takes a block and turns it into a coroutine. Every time something wants a value from the enumerator, it resumes the coroutine. The yielder object’s yield method then calls Fiber.yield() to pause the coroutine. It works just like Lua, but it’s designed to work with existing Ruby conventions, like the piles of internal iteration methods developers expect to find.

So Enumerator.new can produce Python-style generators, albeit in a slightly un-native-looking way. There’s also one other significant difference: an Enumerator can restart itself for each method called on it, simply by calling the block again. This code will print 6 three times:

ir = inclusive_range(6, 9)
puts ir.first
puts ir.first
puts ir.first

For something like an inclusive range object, that’s pretty nice. For something like a file, maybe not so nice. It also means you need to be sure to put your setup code inside the block passed to Enumerator.new, or funny things will happen when the block is restarted.
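
Python generators, by contrast, are strictly one-shot, so reproducing this restart behavior takes a wrapper that re-invokes the generator function for every traversal. A rough sketch, with the same caveat that setup code reruns each time:

def inclusive_range(start, stop):
    val = start
    while val <= stop:
        yield val
        val += 1

class Restartable:
    # Calls the factory again for each traversal, like Enumerator
    # re-running its block.
    def __init__(self, factory, *args):
        self.factory = factory
        self.args = args

    def __iter__(self):
        return self.factory(*self.args)

ir = Restartable(inclusive_range, 6, 9)
print(next(iter(ir)))  # 6
print(next(iter(ir)))  # 6 again; every iter() starts over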

Something like generators

But wait, there’s more. Specifically, this common pattern, which pretty much lets you ignore Enumerator.new entirely.

def some_iterator_method
    # __method__ is the current method name.  block_given? is straightforward.
    return enum_for(__method__) unless block_given?

    # An extremely accurate simulation of a large list.
    (1..1000).each do |item|
        puts "having a look at #{item}"
        # Blocks are invisible to `yield`; this will yield to the block passed
        # to some_iterator_method.
        yield item if item.even?
    end
end

# having a look at 1
# having a look at 2
# 2
puts some_iterator_method.first

Okay, bear with me.

First, some_iterator_method() is called. It doesn’t have a block attached, so block_given? is false, and it returns enum_for(...), whatever that does. Then first() is called on the result, and that produces a single element and stops.

The above code has no magic yielder object. It uses the straightforward yield keyword. Why doesn’t it loop over the entire range from 1 to 1000?

Remember, Enumerator uses coroutines under the hood. One neat thing coroutines can do is pause code that doesn’t know it’s in a coroutine. Python’s generators pause themselves with yield, and the mere presence of yield turns a function into a generator; but in Lua or Ruby or any other language with coroutines, any function can pause at any time. You can even make a closure that pauses, then pass that closure to another function which calls it, without that function ever knowing anything happened.

(This arguably has some considerable downsides as well — it becomes difficult to know when or where your code might pause, which makes reasoning about the order of operations much harder. That’s why Python and some other languages opted to implement async IO with an await keyword — anyone reading the code knows that it can only pause where an await appears.)

(Also, I’m saying “pause” here instead of “yield” because Ruby has really complicated the hell out of this by already having a yield keyword that does something totally different, and naming its coroutine pause function yield.)

Anyway, that’s exactly what’s happening here. enum_for returns an Enumerator that wraps the whole method. (It doesn’t need to know self, because enum_for is actually a method inherited from Object, goodness gracious.) When the Enumerator needs some items, it calls the method a second time with its own block, running in a coroutine, just like a block passed to Enumerator.new. Eventually the method emits a value using the regular old yield keyword, and that value reaches the block created by Enumerator, and that block pauses the call stack. It doesn’t matter that Range.each is eager, because its iteration is still happening in code somewhere, and that code is part of a call stack in a coroutine, so it can be paused. Eventually the coroutine is no longer useful and gets thrown away, so the eager each call simply stops midway through its work, unaware that anything unusual ever happened.
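
Plain Python can't pull this trick: a generator's yield only pauses the generator itself, never code further down the call stack. You can fake the observable behavior with a worker thread and a tiny queue, though. This is only a sketch of the effect, emphatically not how Ruby implements it:

import queue
import threading

_DONE = object()

def enum_for(fn):
    # fn receives an emit() callable it may invoke from any call depth;
    # each emitted value is handed over as the generator is advanced.
    q = queue.Queue(maxsize=1)

    def run():
        fn(q.put)
        q.put(_DONE)

    threading.Thread(target=run, daemon=True).start()
    while True:
        value = q.get()
        if value is _DONE:
            return
        yield value

def noisy(emit):
    for item in range(1, 1001):
        print("having a look at", item)
        if item % 2 == 0:
            emit(item)  # blocks here once the queue is full

print(next(enum_for(noisy)))  # prints 2 after peeking at only a few items

Unlike a real fiber, the worker runs one value ahead before blocking, and abandoning the generator just leaves the daemon thread parked; coroutines get the same effect without either wart.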

In fact, despite being an Object method, enum_for isn’t special at all. It can be expressed in pure Ruby very easily:

def my_enum_for(receiver, method)
    # Enumerator.new creates a coroutine-as-iteration-source, as above.
    Enumerator.new do |y|
        # All it does is call the named method with a trivial block.  Every
        # time the method produces a value with the `yield` keyword, we pass it
        # along to the yielder object, which pauses the coroutine.
        # This is nothing more than a bridge between "yield" in the Ruby block
        # sense, and "yield" in the coroutine sense.
        receiver.send method do |value|
            y.yield value
        end
    end
end

So, that’s pretty neat. Incidentally, several built-in methods like Array.each and Enumerable.collect act like this, returning an Enumerator if called without a block.

Full laziness

I mentioned above that while an Enumerator fetches items lazily, many of the methods are eager. To clarify what I mean by that, consider:

inclusive_range(6, 9000).collect { |n|
    puts "considering #{n}"
    "a" * n
}.first(3)

collect() is one of those common Enumerable methods. You might know it by its other name, map(). Ruby is big on multiple names for the same thing: one that everyone uses in practice, and another that people who don’t use Ruby will actually recognize.

Even though this code ultimately only needs three items, and even though there’s all this coroutine machinery happening under the hood, this still evaluates the entire range. Why?

The problem is that collect() has always returned an array, and is generally expected to continue doing so. It has no way of knowing that it’s about to be fed into first. Rather than violate this API, Ruby added a new method, Enumerable.lazy. This stops after three items:

inclusive_range(6, 9000).lazy.collect { |n|
    puts "considering #{n}"
    "a" * n
}.first(3)

All this does is return an Enumerator::Lazy object, which has lazy implementations of various methods that would usually do a full iteration. Methods like first(3) are still “eager” (in the sense that they just return an array), since their results have a fixed finite size.

This seems a little clunky to me, since the end result is still an object with a collect method that doesn’t return an array. I suspect the real reason is just that Enumerator was added first; even though the coroutine support was already there, Enumerator::Lazy only came along later. Changing existing eager methods to be lazy can, ah, cause problems.
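
For comparison, Python 3 dodged this particular problem by making map() and filter() lazy from the start, with list() as the explicit eager step; the equivalent pipeline only evaluates what it needs.

from itertools import islice

def noisy(n):
    print("considering", n)
    return "a" * n

# Prints "considering" only for 6, 7, and 8; the rest of the range is
# never touched.
first_three = list(islice(map(noisy, range(6, 9001)), 3))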

The only built-in type that seems to have interesting lazy behavior is Range, which can be infinite.

# Whoops, infinite loop.
(1..Float::INFINITY).select { |n| n.even? }.first(5)
# 2 4 6 8 10
(1..Float::INFINITY).lazy.select { |n| n.even? }.first(5)

A loose end

I think the only remaining piece of this puzzle is something I stumbled upon but can’t explain. Enumerator has a next method, which returns the next value or raises StopIteration.

Wow, that sounds awfully familiar.

But I can’t find anything in the language or standard library that uses this, with one single and boring exception: the loop construct. It catches StopIteration and exits the block.

enumerator = [1, 2, 3].each
loop do
    puts enumerator.next
end

On the fourth call, next() will be out of items, so it raises StopIteration. Removing the loop block makes this quite obvious.

That’s it. That’s the only use of it in the language, as far as I can tell. It seems almost… vestigial. It’s also a little weird, since it keeps the current iteration state inside the Enumerator, unlike any of its other methods. But it’s also the only form of external iteration that I know of in Ruby, and that’s handy to have sometimes.

And, uh, so on

I intended to foray into a few more languages, including some recent lower-level friends like C++/Rust/Swift, but this post somehow spiraled out of control and hit nine thousand words. No one has read this far.

Handily, it turns out that the above languages pretty much cover the basic ways of approaching iteration; if any of this made sense, other languages will probably seem pretty familiar.

  • C++’s iteration protocols have existed for a long time in the form of ++it to advance an iterator and *it to read the current item, though these were usually written out by hand in a C-style for loop, with termination handled by comparing against an explicit end iterator.

    C++11 added the range-based for, which does basically the same stuff under the hood. Idiomatic C++ is inscrutable, but maybe you can make sense of this project which provides optionally-infinite iterable ranges.

  • Rust has an entire (extremely well-documented) iter module with numerous iterators and examples of how to create your own. The core of the Iterator trait is just a next method which returns None when exhausted. It also has a lot of handy Ruby-like chainable methods, so working directly with iterators is more common in Rust than in Python.

  • Swift also has (well-documented) simple next-based iterators, though these return nil when exhausted, which means (like Lua) that an iterator cannot produce nil as a value. (This isn’t the case with Rust, where next returns an Option<T> — a valid None would be returned as Some(None).)

I could probably keep finding more languages indefinitely, so I’m gonna take a break from this now.

Perl 6

Perl 6, a mad experiment to put literally every conceivable feature into one programming language, naturally has a more robust concept of iteration.

At first glance, many of the constructs are similar to those of Perl 5. The C-style for loop still exists for some reason, but has been disambiguated under the loop keyword.

1
2
3
4
5
6
7
8
loop (my $i = 1; $i <= 10; $i++) {
    ...
}

# More interestingly, loop can be used completely bare for an infinite loop
loop {
    ...
}

The for block has slightly different syntax and a couple new tricks.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# Unlike in Perl 5, $value is automatically declared and scoped to the block,
# without needing an explicit 'my'
for @container -> $value {
    ...
}

for 1..10 -> $i {
    ...
}

# This doesn't iterate in pairs; it reads two items at a time from a flat list!
for 1..10 -> $a, $b {
    ...
}

Not apparent in the above code is that ranges are lazy in Perl 6, as in Python; the elements are computed on demand. In fact, Perl 6 supports a range like 1..Inf.

Loop variables are also aliases. By default they’re read-only, so this appears to work like Python… but Perl has always had a C-like language-level notion of “slots” that Python does not, and it becomes apparent if the loop variable is made read-write:

1
2
3
4
5
6
7
8
my @fruits = «apple orange pear»;
for @fruits -> $fruit is rw {
    # This is "apply method inplace", i.e. shorthand for:
    # $fruit = $fruit.uc;
    # Yes, you can do that.
    $fruit .= uc;
}
say @fruits;  # APPLE ORANGE PEAR

For iterating with indexes, there’s a curious idiom:

1
2
3
4
5
6
7
# ^Inf is shorthand for 0..Inf, read as "up to Inf".
# Z is the zip operator, which interleaves its arguments' elements into a
# single flat list.
# This makes use of the "two at a time" trick from above.
for ^Inf Z @array -> $index, $value {
    ...
}

Iterating hashes is somewhat simpler; hashes have methods, and the .kv method returns the keys and values. (It actually returns them in a flat list interleaved, which again uses “two at a time” syntax. If you only use a single loop variable, your loop iterations will alternate between a key and a value. Iterating a hash directly produces pairs, which are a first-class data type in Perl 6, but I can’t find any syntax for directly unpacking a pair within a loop header.)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
for %container.kv -> $key, value {
    ...
}

# No surprises here
for %container.keys -> $key {
    ...
}
for %container.values -> $value {
    ...
}

Perl 6 is very big on laziness, which is perhaps why it took fifteen years to see a release. It has the same iterable versus iterator split as Python. Given a container (iterable), ask for an iterator; given an iterator, repeatedly ask for new values. When the iterator is exhausted, it returns the IterationEnd sentinel. Exactly the same ideas. I’m not clear on the precise semantics of the for block and can’t find a simple reference, but they’re probably much like Python’s… plus a thousand special cases.

Generators, kinda

Perl 6 also has its own version of generators, though with a few extra twists. Curiously, generators are a block called gather, rather than a kind of function — this means that a one-off gather is easier to create, but a gather factory must be explicitly wrapped in a function. gather can even take a single expression rather than a block, so there’s no need for separate “generator expression” syntax as in Python.

sub inclusive-range($start, $stop) {
    return gather {
        my $val = $start;
        while $val <= $stop {
            take $val;
            $val++;
        }
    };
}

# 6 7 8 9
for inclusive-range(6, 9) -> $n {
    ...
}

Unlike Python’s yield, Perl 6’s take is dynamically scoped — i.e., take can be used anywhere in the call stack, and it will apply to the most recent gather caller. That means arbitrary-depth coroutines, which seems like a big deal to me, but the documentation mentions it almost as an afterthought.
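Contrast with Python, where yield is lexically scoped to the generator function itself; a helper can’t produce values on its caller’s behalf unless the caller explicitly delegates with yield from.

def helper():
    yield 1
    yield 2

def gen():
    yield 0
    # Without the explicit delegation, calling helper() would merely
    # create and discard a generator object.
    yield from helper()

print(list(gen()))  # [0, 1, 2]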

The documentation also says gather/take “can generate values lazily, depending on context,” but neglects to clarify how context factors in. The code I wrote above turns out to be lazy, but this ambiguity inclines me to use the explicit lazy marker everywhere.

Ultimately it’s a pretty flexible feature, but has a few quirks that make it a bit clumsier to use as a straightforward generator. Given that the default behavior is an eagerly-evaluated block, I think the original intention was to avoid the slightly unsatisfying pattern of “push onto an array every iteration through a loop” — instead you can now do this:

my @results = gather {
    for @source-data -> $datum {
        next unless some-test($datum);
        take process($datum);
    }
};

Using a simple (syntax-highlighted!) take puts the focus on the value being taken, rather than the details of putting it where it wants to go and how it gets there. It’s an interesting idea and I’m surprised I’ve never seen it demonstrated this way.

With gather and some abuse of Perl’s exceptionally compactable syntax, I can write a much shorter version of the infinite Perl 5 iterator above.

my @powers-of-two = lazy gather take (state $n = 1) *= 2 for ^Inf;

# Binds to $_ by default
for @powers-of-two {
    # Method calls are on $_ by default
    .say;
    last if $_ > 1000;
}

It’s definitely shorter, I’ll give it that. Leaving off the lazy in this case causes an infinite loop as Perl tries to evaluate the entire list; using a $ instead of a @ produces a “Cannot .elems a lazy list” error; using $ without lazy prints a ...-terminated representation of the infinite list and then hangs forever. I don’t quite understand the semantics of stuffing a list into a scalar ($) variable in Perl 6, and to be honest the list/array semantics seem to be far more convoluted than Perl 5, so I have no idea what’s going on here. Perl 6 has a lot of fascinating toys that are very easy to use incorrectly.

Nuts and bolts

Iterables and iterators are encoded explicitly as the Iterable and Iterator roles. An Iterable has an .iterator method that should return an Iterator. An Iterator has a .pull-one method that returns the next value, or the IterationEnd sentinel when the iterator is exhausted. Both roles offer several other methods, but they have suitable default implementations.

inclusive-range might be transformed into a class thusly:

class InclusiveRangeIterator does Iterator {
    has $.range is required;
    has $!nextval = $!range.start;

    method pull-one() {
        if $!nextval > $!range.stop {
            return IterationEnd;
        }

        # Perl people would probably phrase this:
        # ++$!nextval
        # and they are wrong.
        my $val = $!nextval;
        $!nextval++;
        return $val;
    }
}

class InclusiveRange does Iterable {
    has $.start is required;
    has $.stop is required;

    # Don't even ask
    method new($start, $stop) {
        self.bless(:$start, :$stop);
    }

    method iterator() {
        InclusiveRangeIterator.new(range => self);
    }
}

# 6 7 8 9
for InclusiveRange.new(6, 9) -> $n {
    ...
}

Can we use gather to avoid the need for an extra class, just as in Python? We sure can! The only catch is that Perl 6 iterators don’t also pretend to be iterables (remember, in Python, iter(it) should produce it), so we need to explicitly return a gather block’s iterator.
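That Python convention, for reference:

it = iter([1, 2, 3])
# An iterator's iterator is itself
assert iter(it) is it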

class InclusiveRange does Iterable {
    has $.start is required;
    has $.stop is required;

    # Don't even ask
    method new($start, $stop) {
        self.bless(:$start, :$stop);
    }

    method iterator() {
        gather {
            my $val = $!start;
            while $val <= $!stop {
                take $val;
                $val++;
            }
        }.iterator;  # <- this is important
    }
}

For sequences, Perl 6 has the Seq type. Curiously, even an infinite lazy gather is still a Seq. Indexing and length are not part of Seq — both are implemented as separate methods.

Curiously, even though Perl 6 became much stricter overall, the indexing methods don’t seem to be part of a role; you need only define them, much like Python’s __dunder__ methods. In fact, in the preceding examples, does Iterator isn’t necessary at all; the for block will blindly try to call an iterator method and doesn’t much care where it came from.

I’m sure there are plenty of cute tricks possible with Perl 6, but, er, I’ll leave those as an exercise for the reader.

Ruby

Ruby is a popular and well-disguised Perl variant, as if Perl had just gone completely all-in on Smalltalk. It has no C-style for, but it does have an infinite loop block and a very Python-esque for:

for value in sequence do
    ...
end

Nobody uses this. No, really, the core language documentation outright says:

The for loop is rarely used in modern ruby programs.

Instead, you’ll probably see this:

sequence.each do |value|
    ...
end

It doesn’t look it, but this is completely backwards from everything seen so far. All of these other languages have used external iterators, where an object is repeatedly asked to produce values and calling code can do whatever it wants with them. Here, something very different is happening. The entire do ... end block acts as a closure whose argument is value; it’s passed to the each method, which calls it once for each value in the sequence. This is an internal iterator.
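To make the distinction concrete, here’s a minimal internal iterator sketched in Python: the “loop body” becomes a function passed to the collection, which drives the iteration itself.

def each(sequence, block):
    # The collection calls the block once per value; the caller never
    # touches an iterator.
    for value in sequence:
        block(value)

each([1, 2, 3], print)  # 1, 2, 3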

“Pass a block to a function which can then call it a lot” is a built-in syntactic feature of Ruby, so these kinds of iterators are fairly common. The upside is that they look almost like a custom block, so they fit naturally with the language. The downside is that all of these block-accepting methods are implemented on Array, rather than as generic functions: bsearch, bsearch_index, collect, collect!, combination, count, cycle, delete, delete_if, drop_while, each, each_index, fetch, fill, find_index, index, keep_if, map, map!, permutation, product, reject, reject!, repeated_combination, repeated_permutation, reverse_each, rindex, select, select!, sort, sort!, sort_by!, take_while, uniq, uniq!, zip. Some of those, as well as a number of additional methods, are provided by the Enumerable mixin which can express them in terms of each. I suppose the other upside is that any given type can provide its own more efficient implementation of these methods, if it so desires.

I guess that huge list of methods answers most questions about how to iterate over indices or in reverse. The only bit missing is that .. range syntax exists in Ruby as well, and it produces Range objects which also have an each method. If you don’t care about each index, you can also use the cute 3.times method.

Ruby blocks are a fundamental part of the language and built right into the method-calling syntax. Even break is defined in terms of blocks, and it works with an argument!

# This just doesn't feel like it should work, but it does.  Prints 17.
# Braces are conventionally used for inline blocks, but do/end would work too.
primes = [2, 3, 5, 7, 11, 13, 17, 19]
puts primes.each { |p| break p if p > 16 }

each() doesn’t need to do anything special here; break will just cause its return value to be 17. Somehow. (Honestly, this is the sort of thing that makes me wary of Ruby; it seems so ad-hoc and raises so many questions. A language keyword that changes the return value of a different function? Does the inside of each() know about this or have any control over it? How does it actually work? Is there any opportunity for cleanup? I have no idea, and the documentation doesn’t seem to think this is worth commenting on.)

Using blocks

Anyway, with block-passing as a language feature, the “iterator protocol” is pretty straightforward: just write a method that takes a block.

def each
    for value in self do
        yield value
    end
end

Be careful! Though it’s handy for iteration, that yield is not the same as Python’s yield. Ruby’s yield calls the passed-in block — yields control to the caller — with the given value(s).

I pulled a dirty trick there, because I expressed each in terms of for. So how does for work? Well, ah, it just delegates to each. Oops!

How, then, do you write an iterator completely from scratch? The obvious way is to use yield repeatedly. That gives you something that looks rather a lot like Python, though it doesn’t actually pause execution.

class InclusiveRange
    # This gets you a variety of other iteration methods, all defined in
    # terms of each()
    include Enumerable

    def initialize(start, stop)
        @start = start
        @stop = stop
    end
    def each
        val = @start
        while val <= @stop do
            yield val
            val += 1
        end
    end
end

# 6 7 8 9
# A `for` loop would also work here
InclusiveRange.new(6, 9).each do |n|
    ...
end

Enumerators

Well, that’s nice for creating a whole collection type, but what if I want an ad-hoc custom iterator? Enter the Enumerator class, which allows you to create… ah, enumerators.

Note that the relationship between Enumerable and Enumerator is not the same as the relationship between “iterable” and “iterator”. Most importantly, neither is really an interface. Enumerable is a set of common iteration methods that any collection type may want to have, and it expects an each to exist. Enumerator is a generic collection type, and in fact mixes in Enumerable. Maybe I should just show you some code.

def inclusive_range(start, stop)
    Enumerator.new do |y|
        val = start
        while val <= stop do
            y.yield val
            val += 1
        end
    end
end

# 6 7 8 9
inclusive_range(6, 9).each do |n|
    puts n
end

Enumerator turns a block into a fully-fledged data stream. The block is free to do whatever it wants, and whenever it wants to emit a value, it calls y.yield value. The y argument is a “yielder” object, an opaque magic type; y.yield is a regular method call, unrelated to the yield keyword. (y << value is equivalent; << is Ruby’s “append” operator. And also, yes, bit shift.)
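For comparison, the Python spelling of the same stream is a generator function, with the yield keyword playing the role of Ruby’s yielder object:

def inclusive_range(start, stop):
    val = start
    while val <= stop:
        yield val
        val += 1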

The amazing bit is that you can do this:

# 6
puts inclusive_range(6, 9).first

Enumerator has all of the Enumerable methods, one of which is first. So, that’s nice.

The really amazing bit is that if you stick some debugging code into the block passed to Enumerator.new, you’ll find that… the values are produced lazily. That call to first() doesn’t generate the full sequence and then discard everything after the first item; it only generates the first item, then stops.

(Beware! The values are produced lazily, but many Enumerable methods are eager. I’ll get back to this in a moment.)

Hang on, didn’t I say yield doesn’t pause execution? Didn’t I also say the above yield is just a method call, not the keyword?

I did! And I wasn’t lying. The really truly amazing bit, which I’ve seen shockingly little excitement about while researching this, is that under the hood, this is all using Fibers. Coroutines.

Enumerator.new takes a block and turns it into a coroutine. Every time something wants a value from the enumerator, it resumes the coroutine. The yielder object’s yield method then calls Fiber.yield() to pause the coroutine. It works just like Lua, but it’s designed to work with existing Ruby conventions, like the piles of internal iteration methods developers expect to find.

So Enumerator.new can produce Python-style generators, albeit in a slightly un-native-looking way. There’s also one other significant difference: an Enumerator can restart itself for each method called on it, simply by calling the block again. This code will print 6 three times:

ir = inclusive_range(6, 9)
puts ir.first
puts ir.first
puts ir.first

For something like an inclusive range object, that’s pretty nice. For something like a file, maybe not so nice. It also means you need to be sure to put your setup code inside the block passed to Enumerator.new, or funny things will happen when the block is restarted.
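Python generators, for contrast, are strictly one-shot. Using the generator version of inclusive_range from just above, the same three calls keep advancing a single iterator rather than restarting it:

ir = inclusive_range(6, 9)
print(next(ir))  # 6
print(next(ir))  # 7
print(next(ir))  # 8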

Something like generators

But wait, there’s more. Specifically, this common pattern, which pretty much lets you ignore Enumerator.new entirely.

def some_iterator_method
    # __method__ is the current method name.  block_given? is straightforward.
    return enum_for(__method__) unless block_given?

    # An extremely accurate simulation of a large list.
    (1..1000).each do |item|
        puts "having a look at #{item}"
        # Blocks are invisible to `yield`; this will yield to the block passed
        # to some_iterator_method.
        yield item if item.even?
    end
end

# having a look at 1
# having a look at 2
# 2
puts some_iterator_method.first

Okay, bear with me.

First, some_iterator_method() is called. It doesn’t have a block attached, so block_given? is false, and it returns enum_for(...), whatever that does. Then first() is called on the result, and that produces a single element and stops.

The above code has no magic yielder object. It uses the straightforward yield keyword. Why doesn’t it loop over the entire range from 1 to 1000?

Remember, Enumerator uses coroutines under the hood. One neat thing coroutines can do is pause code that doesn’t know it’s in a coroutine. Python’s generators pause themselves with yield, and the mere presence of yield turns a function into a generator; but in Lua or Ruby or any other language with coroutines, any function can pause at any time. You can even make a closure that pauses, then pass that closure to another function which calls it, without that function ever knowing anything happened.

(This arguably has some considerable downsides as well — it becomes difficult to know when or where your code might pause, which makes reasoning about the order of operations much harder. That’s why Python and some other languages opted to implement async IO with an await keyword — anyone reading the code knows that it can only pause where an await appears.)

(Also, I’m saying “pause” here instead of “yield” because Ruby has really complicated the hell out of this by already having a yield keyword that does something totally different, and naming its coroutine pause function yield.)

Anyway, that’s exactly what’s happening here. enum_for returns an Enumerator that wraps the whole method. (It doesn’t need to know self, because enum_for is actually a method inherited from Object, goodness gracious.) When the Enumerator needs some items, it calls the method a second time with its own block, running in a coroutine, just like a block passed to Enumerator.new. Eventually the method emits a value using the regular old yield keyword, and that value reaches the block created by Enumerator, and that block pauses the call stack. It doesn’t matter that Range.each is eager, because its iteration is still happening in code somewhere, and that code is part of a call stack in a coroutine, so it can be paused. Eventually the coroutine is no longer useful and gets thrown away, so the eager each call simply stops midway through its work, unaware that anything unusual ever happened.

In fact, despite being an Object method, enum_for isn’t special at all. It can be expressed in pure Ruby very easily:

def my_enum_for(receiver, method)
    # Enumerator.new creates a coroutine-as-iteration-source, as above.
    Enumerator.new do |y|
        # All it does is call the named method with a trivial block.  Every
        # time the method produces a value with the `yield` keyword, we pass it
        # along to the yielder object, which pauses the coroutine.
        # This is nothing more than a bridge between "yield" in the Ruby block
        # sense, and "yield" in the coroutine sense.
        receiver.send method do |value|
            y.yield value
        end
    end
end

So, that’s pretty neat. Incidentally, several built-in methods like Array.each and Enumerable.collect act like this, returning an Enumerator if called with no arguments.

Full laziness

I mentioned above that while an Enumerator fetches items lazily, many of the methods are eager. To clarify what I mean by that, consider:

inclusive_range(6, 9000).collect { |n|
    puts "considering #{n}"
    "a" * n
}.first(3)

collect() is one of those common Enumerable methods. You might know it by its other name, map(). Ruby is big on multiple names for the same thing: one that everyone uses in practice, and another that people who don’t use Ruby will actually recognize.

Even though this code ultimately only needs three items, and even though there’s all this coroutine machinery happening under the hood, this still evaluates the entire range. Why?

The problem is that collect() has always returned an array, and is generally expected to continue doing so. It has no way of knowing that it’s about to be fed into first. Rather than violate this API, Ruby added a new method, Enumerable.lazy. This stops after three items:

inclusive_range(6, 9000).lazy.collect { |n|
    puts "considering #{n}"
    "a" * n
}.first(3)

All this does is return an Enumerator::Lazy object, which has lazy implementations of various methods that would usually do a full iteration. Methods like first(3) are still “eager” (in the sense that they just return an array), since their results have a fixed finite size.
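For comparison, Python 3’s map is lazy from the start, so the equivalent pipeline stops early without any opt-in, sketched here with the generator version of inclusive_range from earlier:

from itertools import islice

# islice pulls only three results, so only three values are ever computed
first_three = list(islice(
    map(lambda n: "a" * n, inclusive_range(6, 9000)), 3))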

This seems a little clunky to me, since the end result is still an object with a collect method that doesn’t return an array. I suspect the real reason is just that Enumerator was added first; even though the coroutine support was already there, Enumerator::Lazy only came along later. Changing existing eager methods to be lazy can, ah, cause problems.

The only built-in type that seems to have interesting lazy behavior is Range, which can be infinite.

# Whoops, infinite loop.
(1..Float::INFINITY).select { |n| n.even? }.first(5)
# 2 4 6 8 10
(1..Float::INFINITY).lazy.select { |n| n.even? }.first(5)

A loose end

I think the only remaining piece of this puzzle is something I stumbled upon but can’t explain. Enumerator has a next method, which returns the next value or raises StopIteration.

Wow, that sounds awfully familiar.

But I can’t find anything in the language or standard library that uses this, with one single and boring exception: the loop construct. It catches StopIteration and exits the block.

enumerator = [1, 2, 3].each
loop do
    while true do
        puts enumerator.next
    end
end

On the fourth call, next() will be out of items, so it raises StopIteration. Removing the loop block makes this quite obvious.

That’s it. That’s the only use of it in the language, as far as I can tell. It seems almost… vestigial. It’s also a little weird, since it keeps the current iteration state inside the Enumerator, unlike any of its other methods. But it’s also the only form of external iteration that I know of in Ruby, and that’s handy to have sometimes.
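For reference, this is exactly Python’s core protocol, spelled out by hand:

it = iter([1, 2, 3])
while True:
    try:
        print(next(it))
    except StopIteration:
        break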

And, uh, so on

I intended to foray into a few more languages, including some recent lower-level friends like C++/Rust/Swift, but this post somehow spiraled out of control and hit nine thousand words. No one has read this far.

Handily, it turns out that the above languages pretty much cover the basic ways of approaching iteration; if any of this made sense, other languages will probably seem pretty familiar.

  • C++’s iteration protocols have existed for a long time in the form of ++it to advance an iterator and *it to read the current item, though this was usually written manually in a C-style for loop, and loops were generally terminated by comparing against an explicit endpoint.

    C++11 added the range-based for, which does basically the same stuff under the hood. Idiomatic C++ is inscrutable, but maybe you can make sense of this project which provides optionally-infinite iterable ranges.

  • Rust has an entire (extremely well-documented) iter module with numerous iterators and examples of how to create your own. The core of the Iterator trait is just a next method which returns None when exhausted. It also has a lot of handy Ruby-like chainable methods, so working directly with iterators is more common in Rust than in Python.

  • Swift also has (well-documented) simple next-based iterators, which return nil when exhausted, effectively the same API as Rust.

I could probably keep finding more languages indefinitely, so I’m gonna take a break from this now.

The curious case of the switch statement

Post Syndicated from Eevee original https://eev.ee/blog/2016/09/18/the-curious-case-of-the-switch-statement/

Sometimes, I lie awake at night thinking about programming languages.

That’s all the intro I’ve got here, sorry. I felt like writing about the switch statement for some reason.

Some history

1958: ALGOL 58

The earliest incarnation I can find is in ALGOL 58. The original description of ALGOL 58 is a fascinating read — it’s written like a math paper, with literal text in italics and heavy use of subscripts. Character classes are even named by Greek letters, with λ representing letters and so on. The example program at the end is completely incomprehensible, with almost every variable being a single letter and labels forming their own entire column on the left side. I guess that last bit came from FORTRAN.

I’m not surprised by this, what with ALGOL’s being originally conceived as a universal way to express mathematical algorithms — “algebraic” is right in the name, after all — but seeing it in practice is a little surreal. I can’t imagine a programming language being described this way nowadays.

ALGOL 58’s version of switch is a little different from the one we’ve all come to know and love. It’s not a statement at all; it’s a declaration. Switches are their own kind of thing, in the same way that variables and labels are different kinds of thing. (Though, curiously, it seems you can return a label — but not a switch! — from a procedure in ALGOL 58.)

switch I := (D₁, D₂, ..., Dₙ)

A switch is used with the go to statement, along with a subscript indicating which Dᵢ to select. go to I[2] would thus be equivalent to go to D₂, except that the subscript could be computed at runtime.

Each Dᵢ is a “designational expression”, which is a funny way of saying “a thing you can use with go to” — either a label or another switch-plus-subscript.

This very first appearance of switch was a very simple form of computed goto: a list of places to jump to that you could index at runtime.

Please also enjoy the two example uses of the go to statement from the original report.

go to HADES

go to exit [(i ↑ 2 ↓ -j ↑ 2 ↓ +I)/2], where exit refers to the declaration
switch exit :=(D₁, D₂, ~~~, Dₙ)

1960: ALGOL 60

ALGOL 58 was soon superseded by ALGOL 60, which has a slightly more modern-looking description. The basic idea behind switch as a declaration stayed the same, but it learned a few new tricks.

Inline if ... then ... else ... became a form of arithmetic expression, meaning it could be used as a subscript, even on a switch. Along the same lines, anywhere a “designational expression” was expected — including within a switch declaration or as the argument to go to — an inline if could be substituted. The above-linked report includes these examples:

switch S:=S1,S2,Q[m], if v>-5 then S3 else S4

switch Q:=p1,w

It also clarifies something I wasn’t sure about from the ALGOL 58 report: the expressions in a switch are evaluated when the jump is made, not when the switch is declared!

Finally, both switches and individual labels could be passed around as arguments. Don’t see that much any more.

1963: CPL

CPL — originally Cambridge Programming Language, then later Combined Programming Language when London joined in — was heavily inspired by ALGOL 60, but sought to expand beyond mathematical algorithms and into more general programming. It was described somewhat informally by a paper in 1963, but never saw much widespread use.

CPL dropped switches entirely, in favor of having labels themselves be a kind of storable variable. The paper gives an example, where “Switch” is the name of a variable and not a keyword:

label Switch = x < 0 → L1, L2

go to Switch

(Here, → is effectively the ternary operator, the equivalent to C’s ?:. a → b, c means b if a is true, c otherwise.)

Since the first statement is a regular assignment, the right-hand side is evaluated immediately. The paper doesn’t delve very deeply into this, but since labels were first-class values, you could presumably choose a label using whatever logic you wanted and then jump to it. Arrays could contain any type, too, so I assume the direct equivalent to ALGOL 60’s switch was merely an array of labels. Certainly much more flexible. If you’re cringing right now, don’t worry — you can do this in C, too! Keep readin’.

CPL also allowed referring to (and thus jumping to) a label in the middle of a different scope, as long as the label was prefixed with the containing block’s label, as BLOCK at LABEL. How convenient!

1966: ALGOL W

ALGOL W was a big step towards something recognizable as a modern language. It brought us such revolutionary innovations as null, a value that crashes your program.

It also dropped the switch kind in favor of a new compound statement, case, which was sort of like the older switch without the labels. case looked like this:

case X of begin
    WRITE("X is 1");
    WRITE("X is 2");
    WRITE("X is 3")
end

Yes, case was a regular block, and the expression indicated which one of its statements to run. Any individual statement could of course be a bare begin ... end block wrapping multiple sub-statements, so a case wasn’t limited to only executing one literal line of code. Still, that seems like it would be impossible to read with more than two or three cases.

Similar to ALGOL 60’s inline if ... else ..., ALGOL W had an inline case ... of:

case EXPR of (A, B, C, ...)

This evaluated to one of A, B, etc., as determined by the value of EXPR. In most modern languages, you’d do this by indexing an array literal.

1967: BCPL

CPL was a “large” language for the time, which is a strange thing to say when the paper describing it is the shortest I’ve read today. I can’t very well look at it through 1963 eyes, so I’ll take Wikipedia’s word that it was too big to be fully implemented until a good few years later.

In the meantime, someone else at Cambridge thought there were some good ideas in there, so he lopped off all the parts that were hard to implement and called the result “BCPL”, for “Bootstrap CPL” (as the original compiler was written in BCPL) and later “Basic CPL”.

BCPL was the origin of a few interesting ideas we now take for granted: it was the first language to delimit blocks with curly braces (or the $( and $) digraphs), and possibly the first to compile to an intermediate object code before compiling to machine code (which made porting much easier, and is roughly how LLVM and GCC work). It also treated everything as integers and had no type-checking, so, make of that what you will.

As far as I can tell, BCPL was also the origin of the modern switch statement. The BCPL reference manual describes the following block:

switchon EXPR into {
    ...
    ...
    case CONST:
    ...
    ...
    default:
    ...
    ...
}

Boy, doesn’t that look familiar! It works exactly as you’d expect, and the manual even mentions that the exact compilation may vary depending on the number of cases. I don’t think break worked in switchon, and since the behavior is described solely as a jump, fallthrough was presumably the default, and you’d have needed an explicit jump of your own to prevent it.

Curiously, BCPL also had label-type variables, so it seems that switchon was a novel addition. BCPL was first implemented only a year after ALGOL W arrived with the case statement, but I can’t find any further details on what inspired this particular design.

1969: B

Now we’re nearing the history that most computer people are familiar with. B confused the next three generations’ worth of budding programmers by changing assignment from := to =, but made up for it by introducing compound assignment operators, but then ruined it again by inventing post/pre-increment/decrement.

The B manual describes the switch statement as “the most complicated statement in B”. The general syntax is switch EXPR STATEMENT — the statement doesn’t even have to be a block, but may be. I believe it would look like this:

switch X {
    case 1;
    printf("X is 1*n");
    case 2;
    printf("X is 2 (or 1)*n");
    case 3;
    printf("X is 3 (or 2, or 1)*n");
}

Most notably, case seems to have become a statement all its own, rather than a funny kind of label. At least, I think so. The manual wasn’t scanned particularly well, and the punctuation after case is underlined, so it might be a semicolon or a colon with a bit of lint under it.

In a step backwards from BCPL, I see no mention of a default label; if none of the cases match, the entire block is just skipped.

Again, since these are just jumps and break hasn’t been overloaded to work here yet (in fact, it didn’t seem to exist in the language at all!), there’s unavoidable fallthrough. The manual contains an example program that uses a switch, and every case ends with an explicit goto out of the block.

Oh, and B used * for escape characters in strings. Imagine what might’ve been.

1972: C

And then we had C and that’s the end of the story. I don’t own a copy of The C Programming Language First Edition (gasp!), but several people have confirmed to me that even the original version of C had the modern switch statement, complete with the break overload.

The modern switch statement

As we’ve seen, the canonical switch statement has had basically the same form for 49 years. The special case labels are based on syntax derived directly from fixed-layout FORTRAN on punchcards in 1957, several months before my father was born.

switch (EXPR) {
    case 1:
        ...
        break;
    case 2:
        ...
        break;
    case 3:
        ...
        break;
    default:
        ...
}

I hate it.

Okay, “hate” is a strong word. I’m not a huge fan. From a language design standpoint, it sticks out like a sore thumb. I’ve never seen a particularly pleasing design for it, and it baffles me that we still have dedicated syntax for this weird thing — and frequent requests to add it to languages that lack it.

Some objections

It’s not like anything else in the language. Consider how almost every other kind of block works, especially in languages like C where a block and a single statement are interchangeable.

stmt ... {
    ...;
    ...;
}

if, do, while, for, even function definitions all work pretty much the same way: the block is just some stuff, and the exact contents are irrelevant. The semantics are always of the form: something happens, then the block runs, then something else happens.

switch changes the rules. The block introduced by a switch is allowed to use special syntax that can’t appear anywhere else. A programmer sprinkles this unique syntax — which sorta-kinda-half slices the block into multiple blocks — throughout, and the switch header inspects the block’s contents. It’s invasive in a way nothing else is.

It splits an expression in half. Semantically, a switch asks: which of these values is equal to this expression? That is, for some expression x and a set of values a, b, c, …, which of x == a, x == b, x == c, … is true?

Those comparisons are lost in a switch, obscured by a construct designed to obscure them. This is C, remember, the language that makes you explicitly write out all the parts of a for, yet it has a statement whose sole purpose is to hide some == from you. That’s weird!

Overloading break is clumsy. Granted, C overloads just about every keyword it has; static alone has at least thirteen meanings.

Normally, break means to immediately jump to the end of a looping block and ignore its loop condition. That distinguishes it from continue. A switch is not a loop.

This would make perfect sense if you were allowed to break out of any arbitrary block. Perl 5, for example, has roughly equivalent next and last statements that you can use even inside bare blocks (though, curiously, not inside an if). If Perl 5 had a switch, last should sensibly jump out of it, since that’s the default behavior. In C, this is a special case… to fix a problem that shouldn’t exist in the first place:

Fallthrough by default is a horrible idea. The vast majority of the time, you don’t want fallthrough. So why is fallthrough the default, and the common case needs opting into?

Yes, I know switch is fundamentally just a jump. But it’s a jump to a thing that looks like a label yet has a non-opaque value in it… unless such a label doesn’t exist, in which case it’s a jump to a label with a reserved name… unless that doesn’t exist either, in which case it’s a jump past the end of the block.

Oh! A jump past the end of the block. switch can already do that! Why not do it automatically after each case?

“It’s so useful for parsers!” I hear you cry. Hang on! This behavior wouldn’t even affect 99% of cases that want fallthrough. Consider this tiny tokenizer in a hypothetical world where break isn’t necessary:

switch (ch) {
    case ' ':
    case '\t':
    case '\n':
        tokentype = WHITESPACE;

    case '(':
        tokentype = OPENPAREN;

    ...
}

This would still work exactly as you want. When ch is a space, control would jump to tokentype = WHITESPACE;. Why? Because labels can only mark a statement, not other labels — and case is a label, not a statement! Jumping to a label always means jumping to the next statement that appears after it. Effectively, that assignment has three labels.

How would you create an empty case, then? The same way you create any other empty case: a lone semicolon.

It’s not really all that useful. The C-style switch can only match one thing at a time. If you want to match multiple literals, you can use multiple cases. That’s all you get.

It’s handy if you’re writing a parser, I suppose. By hand. Via recursive-descent. In C. In 1972.

It might also be useful for dealing with really big enums that indicate a lot of different behavior.

Offhand, I can’t think of anything else. I sympathize with the frustrations of people who reflexively want to reach for it in a language that doesn’t have it, but I find it hard to believe that it’s the best tool for the job very often.

It’s not even useful as an optimization any more; I’m pretty sure our compilers are plenty smart enough by now to recognize an if tree and turn it into a jump table or binary search or whatever. It’s one of those things that feels like it’ll make a program faster, but almost certainly won’t make any difference whatsoever.

It spread. A few language designers in the 90s made a point of inheriting all of C’s mistakes: nulls, bitwise operators with nonsense precedence, C-style for loops, and of course the switch statement.

I’m looking at you, Java and JavaScript.

What’s particularly curious is that neither language has goto — so why does either language need special computed-jump syntax? I’ve seen a switch in JavaScript maybe once.

Also, just like C, both languages have had preposterous amounts of effort put into their compilation. They also both have better “string” primitives than C and regexes built in, making switch not nearly as useful for writing parsers; JavaScript doesn’t even have enums, and Java only got them fairly recently.

Potentially fixing it

None of those are intended as huge complaints, and I’m not proposing we start a petition to remove switch from the next version of C or anything. I merely object… aesthetically. (And if we were really going to change C to match my aesthetic preferences, we’d start with something much more important, such as ditching the braces.)

People occasionally ask for it in Python, and that gets me thinking about how it might be made to fit there. Python doesn’t have labels, so presumably this would have entire blocks.

switch x:
    case 1:
        print("x is 1")
    case 2:
        print("x is 2")
    default:
        print("dunno")

The double indentation here is awful. I also don’t much like the nebulous scope inside switch. What is that? Can anything other than a case block go there? Every other kind of block in Python can contain completely arbitrary code, so this would be like nothing else in the language.

Okay, so let’s try outdenting the cases, like a try with multiple except clauses.

switch x:
case 1:
    print("x is 1")
case 2:
    print("x is 2")
default:
    print("dunno")

This is more consistent with the rest of the language, but now the block immediately following switch is empty, which is disallowed everywhere else. Maybe the default case could go there?

switch x:
    print("dunno")
case 1:
    print("x is 1")
case 2:
    print("x is 2")

I could kind of get behind this, but it’s backwards from how most people think about switch (the default case is virtually always written last) as well as backwards from elif and except, the other two kinds of block that can be stacked this way.

A bigger problem is fallthrough — these are now distinct blocks, not labels, so my previous argument doesn’t apply. A couple languages have the convention of using continue to mean explicit fallthrough, which would give us:

switch x:
    print("dunno")
case 1:
    print("x is 1")
case 2:
    print("x is 2")
case 3:
    continue
case 4:
    continue
case 5:
    print("x is 3, 4, or 5")

If the goal here was to save typing and/or vertical space, I don’t think we’re succeeding.

Alternatively, a completely empty case block could be treated as meaning fallthrough, and a no-op block could use pass. That would leave us with empty blocks again, though.

Now I’m stuck. I only had so much syntax to work with in the first place, and it doesn’t fit the weird expectations of switch. Which is probably why Python doesn’t have a switch block.
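For what it’s worth, the workaround Python folks usually reach for is dictionary dispatch, which sidesteps the syntax question entirely:

def describe(x):
    cases = {
        1: "x is 1",
        2: "x is 2",
    }
    # dict.get supplies the "default" arm
    print(cases.get(x, "dunno"))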

Here’s another question: what order is the comparison done in? That is, given switch s and case c, do we check s == c or c == s? You might argue the former because s appears first, and I wouldn’t disagree, but that’s not a particularly solid reason. And what if you wanted it the other way? It shouldn’t matter, but it can, because Python has operator overloading.

Alternatives

A few languages have put interesting spins on switch, and they probably have better ideas than I do. Here are some I happen to know about.

I note that not a single one of these alternatives has default fallthrough.

Bourne shell

case $x in
1)
    echo 'x is 1'
    ;;
2)
    echo 'x is 2'
    ;;
3|4|5)
    echo 'x is 3, 4, or 5'
    ;;
*)
    echo 'dunno'
    ;;
esac

The syntax is totally weird, but perhaps no more weird than anything else in shell.

I believe each case must have a terminator, so there’s no risk of accidental fallthrough. The use of | to separate multiple patterns takes care of the multiple-value case.

* isn’t a special symbol meaning “default”; each case is actually a pattern, so the * is a wildcard that will match anything not matched thus far. I approve.

Visual Basic

This is .NET, but I believe it’s been the same since before then.

Select Case x
    Case 1
        Debug.WriteLine("x is 1")
    Case 2
        Debug.WriteLine("x is 2")
    Case 3 To 5
        Debug.WriteLine("x is 3, 4, or 5")
    Case Else
        Debug.WriteLine("dunno")
End Select

Leaving aside Visual Basic’s insistence on Writing Everything In Title Case, wow, this is weird.

The case expressions have their own subsyntax that I don’t think exists anywhere else in the language. Each one is a comma-separated list of one or more expressions, a To b range expressions, or Is <= expression comparison thing. That last one is extra weird, especially since the Is is optional, so you can just write Case > 12. It’s convenient, sure, but what a strange thing to implement solely for one kind of block.

Case Else is probably the worst way to spell that. And it looks like the Case after Select is optional? What is even going on here.

Ruby

Ruby has a case statement, or maybe it’s a method, who can even tell.

case x
when 1
    puts "x is 1"
when 2
    puts "x is 2"
when 3, 4, 5
    puts "x is 3, 4, or 5"
else
    puts "dunno"
end

I heartily approve of the use of else here.

when has only slightly special syntax; listing multiple values separated by commas allows matching against any of them. You can do some other tricks, though, because Ruby doesn’t actually test with normal equality; it uses the terribly-named === operator, which is a slightly fuzzy kind of matching deemed to be helpful sometimes.

For example, the last when above could be written:

when 3..5
    puts "x is 3, 4, or 5"

The .. operator constructs a Range, and Range implements handwave-equality by checking for inclusion, as well.

This seems a little hokey to me — what if you want to case on a Range? — but I can’t find an exhaustive list of every non-equality use of === in core Ruby. Maybe it works out okay in practice.

Perl 6

Perl 5, historically, did not have a switch statement. Perl 6, in its quest to include literally every single language feature ever conceived, sought to remedy this.

As the documentation explains,

The given statement is Perl 6’s topicalizing keyword in a similar way that switch topicalizes in languages such as C.

Yes, exactly.

Perl 6’s answer to switch is composed of three parts.

given EXPR { ... } assigns EXPR to $_. You might know $_ from Perl 5, where it’s the “default” variable in a number of cases. In Perl 6, it acts much the same way, except that it became much more defaultier. For example, you can do a method call without an invocant: .foo is equivalent to $_.foo.

when EXPR { ... } performs a “smart match” of $_ against EXPR using the ~~ operator. As in Ruby, the precise semantics vary, and are about what you’d expect except when they’re not. If the match succeeds, the block runs, and then control breaks out of the containing block. If the match fails, nothing happens.

default { ... } runs a block and then breaks out of the containing block.

Used together, you get:

given $x {
    when 1 { say "x is 1" }
    when 2 { say "x is 2" }
    when 3 | 4 | 5 { say "x is 3, 4, or 5" }
    default { say "dunno" }
}

The magical thing about this is that none of these parts are required to be used together. A when or default block can appear anywhere, and when will merrily match against whatever happens to be in $_. Any arbitrary code can appear inside a given. Even that 3 | 4 | 5 thing? That’s not special when-only syntax; that’s a junction, a first-class type of value that is all three numbers simultaneously. Perl 6 is truly the realization of Perl 5’s mission: to be startlingly consistent, and also just plain startling.

If you need more complex conditions, Perl 6 has a kind of implicit lambda thing when you write an expression containing * as an atom. I don’t remember exactly how it works, but you can write when * > 12.

Fallthrough is possible, too: a proceed statement will immediately leave a when or default block and continue on after the block, rather than leaving the enclosing block. It’s only allowed inside those two blocks, alas.

Perl 5

Not to be outdone by… itself?, Perl 5 has two switch statements.

The first is a module in the standard library, Switch. Much like Perl 6’s smartmatch, it hardcodes a number of different comparisons depending on the shapes of the operands — including automatically calling functions and considering it a match if the function returns true. Otherwise, it looks fairly similar to the C construct:

switch ($x) {
    case 1 { say "x is 1" }
    case 2 { say "x is 2" }
    case [3..5] { say "x is 3, 4, or 5" }
    else { say "dunno" }
}

The [3..5] is the regular Perl range operator .., inside a regular Perl arrayref. One of the hardcoded comparisons is that if one side is an arrayref and the other is a scalar, case tests for array containment.

Since this isn’t an actual language feature and is implemented in a module… you don’t want to know how it works. Really. Don’t ask. (It rewrites your source code when you import it.)

The native equivalent came with Perl 5.10, the first of the regular Perl 5 releases. You have to opt into it by asking for a Perl version of 5.10 or later — use v5.20; or similar — and then you have pretty much the same given, when, and default as Perl 6. Perl 5 even has smartmatch.

It appears that all three blocks, as well as smartmatch itself, were later marked “experimental” and produce warnings when you try to use them. I’ve heard quiet grumblings about smartmatch’s unpredictability since it was first added to Perl 5.10, which is why Ruby’s similar thing gave me pause. Seriously, y’all, just add an in operator; it’ll solve most of this problem.
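For reference, Python’s in already reads about how you’d want, and covers the containment cases that smartmatch handles so unpredictably:

x = 4
if x in (3, 4, 5):
    print("x is 3, 4, or 5")
elif x in range(10, 20):
    print("x is 10 through 19")
else:
    print("dunno")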

Rust

Rust has a match block with its own sub-syntax:

match x {
    1 => println!("x is 1"),
    2 => println!("x is 2"),
    3 | 4 | 5 => println!("x is 3, 4, or 5"),
    _ => println!("dunno"),
}

Like most block constructs in Rust, match can be used as an expression as well. Each statement can likewise be a braced block instead of a single statement.

The stuff on the left side of the => is a pattern, a special syntax with a few bells and whistles. Matching against any of several values with | is one, but you can also give a range with ..., do destructuring assignment, express more complex conditions, and… some other stuff.

A really nice thing here is that _ isn’t special-cased to mean “default”; it’s only “destructuring” to a single variable, which can’t possibly fail. (The name _ is special in the language overall, in that Rust will never create a variable named _, but this isn’t unique to patterns; you can use _ in other forms of destructuring to mean “I don’t actually need this part”.)

My favorite thing about Rust’s match is that it forces the matching to be exhaustive — every possible value of the expression must match one of the patterns you give. If you try to match an enum but forget one of the possible values, your code will not compile.

My least favorite thing is the sprinkle of commas down the right side, which is a travesty.

And others

A lot of languages have switch statements, but most of them are fairly similar and I’m not going to go through them all! I’m sure plenty of commenters will mention their favorites.

In particular, the functional programming sphere makes very heavy use of pattern matching, often much more powerfully than Rust. I believe Ruby’s case came from Pascal, the Betamax to C’s VHS, so it’s a glimpse at what we could’ve had.

Duff’s device

Finally, I can’t very well talk about switch without mentioning Duff’s device, the greatest argument against C’s switch syntax. If you haven’t seen it before, have fun unraveling this.

send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}

The hardest problem in computer science

Post Syndicated from Eevee original https://eev.ee/blog/2016/07/26/the-hardest-problem-in-computer-science/

…is, of course, naming.

Not just naming variables or new technologies. Oh no. We can’t even agree on names for basic concepts.

A thousand overlapping vernaculars

Did you know that the C specification makes frequent reference to “objects”? Not in the OO sense as you might think — a C “object” is defined as a “region of data storage in the execution environment, the contents of which can represent values”. The spec then goes on to discuss, for example, “objects of type char”.

“Method” is a pretty universal term, but you may encounter a C++ programmer who only knows it as “member function”. Conversely, Java may not have functions at all, depending on who you ask. “Procedure” and “subroutine” aren’t used much any more, but a procedure is a completely different construct from a function in Pascal.

Even within the same language, we get sloppy: see if you can catch a Python programmer using “property” to refer to a regular “attribute”, when property is a special kind of attribute.
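The difference in brief: an attribute is plain data on the object, while a property runs code but is accessed with attribute syntax.

class Foo:
    def __init__(self):
        self.bar = 1      # regular attribute: just data

    @property
    def baz(self):        # property: computed on every access
        return self.bar + 1

foo = Foo()
print(foo.bar, foo.baz)  # 1 2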

There’s a difference between “argument” and “parameter”, but no one cares what it is, so we all just use “argument” which abbreviates more easily. I use the word “signature” a lot, but I rarely see anyone else use it, and sometimes I wonder if anyone understands what I mean.

A float is single-precision in C and double-precision in Python. I reflexively tense up whenever someone says “word” unqualified, because it could mean any of three or four different sizes.

Part of the problem here is that we’re not actually doing computer science. We’re doing programming, with a wide variety (hundreds!) of imperfect languages with different combinations of features and restrictions. There are only so many words to go around, so the same names get used for vaguely similar features across many languages, and native speakers naturally attach their mother tongue’s baggage to the jargon it uses. Someone who got started with JavaScript would have a very different idea of what a “class” is than someone who got started with Ruby. People come to Python or JavaScript and exclaim that they “don’t have real closures” because of a quirk of name binding.

Most of the time, this is fine. Sometimes, it’s incredibly confusing. Here are my (least?) favorite lexical clashes. (That was one too!)

Arrays, vectors, and lists

In C, an array is a contiguous block of storage, in which you can put some fixed number of values of the same type. int[5] describes enough space to store five ints, all snuggled right next to each other. There’s no such thing as a “vector”. “List” would likely be interpreted as a linked list, in which each value is stored separately and has a pointer to the next one.

C++ introduced vector, an array that automatically expands to fit an arbitrary number of values. There’s also a standard list type, which is a doubly-linked list. (The exact implementations may be anything, but the types require certain properties that make an array and a linked list the most obvious choices.) But wait! C++11 introduced the initializer_list, which is actually an array.

Lisp dialects are of course nothing but lists, but under the hood, these tend to be implemented as linked lists — which is no doubt why Lisp originally handled lists in terms of heads and tails (very easy to do with linked lists), rather than random access (very easy to do with contiguous arrays). Haskell works similarly, and additionally has a Data.Array module which offers fast random access.

Perl (5)’s sequence type is the array, though “type” is a little misleading here, because it’s really one of Perl’s handful of shapes of variables. Perl also has a distinct thing called a “list”, but it’s a transient context that only exists while evaluating an expression, and is not a type of value. It’s weird and I can’t really explain it within a single paragraph.

Meanwhile, in Python, list is the fundamental sequence type, but it has similar properties to a C++ vector and (in CPython) is implemented with a C array. The standard library also offers the rarely-used array type, which packs numbers into C arrays to save space — a source of occasional confusion for new Python programmers coming from C, who think “array” is the thing they want.
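The two side by side, for illustration:

from array import array

packed = array('i', [1, 2, 3])   # 'i' is the C int typecode; stored compactly
general = [1, 'two', 3.0]        # a list holds arbitrary objects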

JavaScript has an Array type, but it’s (semantically) built on top of the only data structure in JavaScript, which is a hash table.

PHP’s sole data structure is called array, but it’s really an ordered hash table where you are free to use integers for the keys if you want. It also has a thing called list, but it’s not a type, just quirky syntax for doing deconstructing assignment. People coming from PHP to other languages are occasionally frustrated that hash tables lose their order.

Lua likewise has only a single data structure, but is more upfront in calling its structure a “table”; there’s nothing in the language called “array”, “vector”, or “list”.

While I’m at it, the names for mapping types are all over the place:

  • C++: map (actually a binary tree; C++11’s unordered_map is a hash table)
  • JavaScript: object (!)
  • Lua: table
  • PHP: array (!)
  • Perl: hash (another “shape”, and somewhat misleading since a “hash” is also a different thing), though the documentation likes to say “associative array” a lot
  • Python: dict
  • Rust: HashMap

Pointers, references, and aliases

C has pointers, which are storage addresses. This is pretty easy for C to do, since it’s all about operating on one big array of storage. A pointer is just an index into that storage.

C++ inherited pointers from C, but chastises you for using them. As an alternative it introduced “references”, which are exactly like pointers, except you can leave off the *. This added a very strange new capability that didn’t exist in C: two regular ol’ local variables could refer to the same storage, so that a = 5; could also change the value of b.

And so all programming conversation was doomed forever, but more on that in a moment.

Rust has things called references, and uses the C++ reference syntax for their types, but they’re really “borrowed pointers” (i.e., pointers, but opaque and subject to compile-time lifetime constraints). It also has lesser-used “raw pointers”, which use C’s pointer syntax.

Perl has things called references. Two different kinds of things, in fact. The ones people generally refer to are “hard references”, which are pretty much like C pointers, except the “address” is supposed to be opaque and can’t be arbitrarily operated on. The others are “soft references”, where you use the contents of a variable as the name of another variable using much the same syntax as hard references, but this is forbidden by use strict so doesn’t see much use (and can be done other ways anyway). Perl also has things called aliases, which work like C++ references — but they don’t work on local variables, and they’re not really a type, just explicit manipulation of the symbol table. (Cool fact: Perl functions receive their arguments as aliases! It’s easy not to notice, because most people immediately assign the arguments to readable names.)

PHP has things called references, but despite PHP’s prominent Perl influence, it borrowed its references from C++. C++ declares references as part of the type, but PHP has no variable declaration whatsoever, so a variable becomes a reference if it’s involved in one of a handful of specific operations with a & involved. The variable is then permanently “infected” with reference-ness.

Python, Ruby, JavaScript, Lua, Java, and probably several hundred other high-level languages have nothing called pointers, references, or aliases. This causes endless confusion when trying to explain the language semantics to someone with a C or C++ background, because we want to say things like “this references that” or “this points to that” which seem to imply a level of indirection that doesn’t exist. For this reason (and my Perl background), I like to call C++’s reference behavior “aliasing”, which more clearly describes what it does and frees up the word “refer” to be used in its generic English sense.

Pass by value, pass by reference

I’m not done yet! I’ve explained this before for Python, but here’s the quick(ish) version. I maintain that this dichotomy makes no sense in almost all languages, because the very question hinges on C’s idea of what a value is, and it’s a relatively rare attitude outside of the C family.

The fundamental issue is that C has syntax to imply structure, but the semantics are all about bytes. A struct looks and sounds like a container, a thing with a lid on it: it’s wrapped in braces, and you have to use . to look inside it. But C just sees a blob of bytes, not much different from an int, except that it lets you look at a few of those bytes at a time. If you put one struct inside another, C will dump the inner’s structure into the outer. If you assign one struct to another, C will dutifully copy all the bytes over, same as it would for a double. The boundary is illusory. In effect, the only “true” container C has — the only form of containment that doesn’t spill its contents all over the place — is the pointer!

If you pass a struct to or from a function, C will copy the whole thing, as with any other form of assignment. If you want a function to modify a struct, you have to pass in a pointer to it, so the function can modify the original storage and not a local copy. If you want to pass a very large struct to a function, you should still use a pointer, or you’ll waste a lot of time uselessly copying data around just to throw it away.

This is so-called “pass by value”, but it’s really about the underlying storage, not any semantic notion of “value”. Pass by bytes, if you will. It’s similar to how forgetting to quote a variable in a shell script will cause it to be split on whitespace, or how passing an array to a function in Perl will copy all the elements. It’s nonsense. The semantics, the diagrams of boxes that we draw, and even the very syntax all imply that there’s something being bundled up, but then you turn your back for a second and the language scatters your data to the wind.

C++ added references to make this sort of thing more transparent, just in case C was too easy to understand. Now you can appear to pass a struct “by value”, but if the function is declared as taking references for arguments, it can still freely modify your data. The function’s argument becomes an alias for whatever you pass in, so even an atomic type like an int can be overwritten wholesale.

This is “pass by reference”, perhaps better named “pass by alias”. And it’s entirely possible to have this in a language that’s not so byte-oriented — PHP does it. But if your language isn’t all about bytes, then “pass by value” is meaningless.

The way Java, Python, Ruby, Lua, JavaScript, and countless other languages work is to have containers act as a single unit. When you have a variable containing a structure, and you assign that variable to another variable, no copying is done. Rather, both variables now refer to— err, point to— err…

And here’s the major issue with the terminology. Someone who’s asking whether X language is by-value or by-reference has likely ingrained the C model, and takes for granted that this is how programming fundamentally works. If I say “refer”, they might think there are C++ references (alias) involved somewhere. If I say “point”, they might think the language has some form of indirection like a C pointer. In most cases, the languages have neither, but there are only so many ways to express this concept in English.

Semantically, those languages act like values exist in their own right, and variables are merely names. Assignment gives another name to a value. It’s tempting to explain a = b as “now a points to b” or “now they refer to the same object”, but that introduces an indirection, implies an intermediate layer that doesn’t exist in the language. a and b both name the same value.

Function calls are a form of assignment, so the arguments inside a function name the same values that the caller passed in. You can modify them in-place, if they’re mutable, and the caller will see the changes, because it’s the same value. You can’t just reassign the variable: the variable is not an alias, and assigning to it merely makes it a name for something else instead.
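
Here’s the whole thing in Python — the function names here are mine, but the behavior is the language’s:

def mutate(items):
    # "items" names the same list the caller passed in,
    # so changing the list in place is visible to the caller.
    items.append('seen')

def rebind(items):
    # Assignment just makes "items" a name for a different list;
    # the caller's variable is untouched.
    items = ['lost']

stuff = []
mutate(stuff)
print(stuff)  # ['seen']
rebind(stuff)
print(stuff)  # still ['seen']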

Loose typing

Okay, so, this is really up for interpretation, but I’m pretty sure “loose typing” is not actually a thing. At least, I’ve never seen a particularly concrete definition for it, which is kind of ironic. To recap:

  • Strong typing means that values do not implicitly change type to fit operations performed on them. Rust is strongly typed: comparing an i32 with an i64 is an error.

  • Weak typing means that values can implicitly change type to fit operations performed on them. JavaScript is weakly typed: 5 + "3" will implicitly convert the number to a string and produce "53". Also, C is weakly typed: 5 + "3" will happily do pointer arithmetic on the string literal and produce some hilarious nonsense.

  • Static typing means that names (variables) have associated types that are known before the program runs. Java is statically typed: Java code is 70% type names by volume.

  • Dynamic typing means that names are not given types ahead of time. Ruby is dynamically typed: types are figured out on the fly while the program runs.

Strong–weak forms a spectrum, and static–dynamic forms a spectrum. Languages may have both strong and weak elements, or both static and dynamic elements, though usually one is more prominent. For example, while Go is considered statically-typed, interface{} acts much like dynamic typing. Conversely, you could argue that Python is statically-typed and every variable is of type object, but good luck with that.

Crucially, because strong–weak concerns values and static–dynamic concerns names, all four combinations exist. Haskell is strong and static. C is weak and static. Python is strong and dynamic. Shell is weak and dynamic.
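
Python demonstrates the strong-and-dynamic corner nicely; this snippet is mine, but the errors are real:

x = 5
try:
    x + "3"  # strong: no implicit conversion between int and str
except TypeError as e:
    print(e)  # unsupported operand type(s) for +: 'int' and 'str'

x = "5"  # dynamic: the name x can be rebound to a value of any type
print(x + "3")  # 53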

So, then, what exactly is “loose typing”? You’d think it would mean the same as “weak typing”, but I’ve seen a lot of people refer to Python as “loosely typed”, even though Python is mostly strong. (Stronger than C!)

As best as I can tell, “loosely typed” really means “doesn’t have C++’s type system” and is intended to be derisive. Which is kind of funny, given how flimsy C++’s type system is. What does a variable of type T* actually contain? Not necessarily a pointer to a T: it might be a null pointer (which doesn’t point to a T), or complete garbage (which is unlikely to point to a T), or uninitialized (likewise). What’s the point of static typing if your variables don’t actually have to contain the type they’re declared as?

Caching

This one is the most anecdotal, as it’s not even a language feature.

Caching is storing the results of some computations so you don’t have to compute them again later. It’s an optimization, trading memory in exchange for speed.

I think a crucial property of a cache is that if the cache is emptied or destroyed or unavailable for any reason, everything still works, just more slowly.

And yet I’ve seen a number of programmers use “cache” to refer to any form of storing a value to use later. I find this very confusing, since that’s all programming is.
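
For contrast, here’s a genuine cache — Python’s standard functools.lru_cache — where emptying it costs only speed:

from functools import lru_cache

@lru_cache(maxsize=None)
def slow_square(n):
    print("computing", n)
    return n * n

slow_square(4)  # computing 4
slow_square(4)  # nothing printed; the result came from the cache
slow_square.cache_clear()
slow_square(4)  # computing 4 -- slower again, but still correct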

A fabulous example is a handy Python utility that shows up in a number of projects. I know it by the name reify, which is how it’s spelled in Pyramid, where I first saw it. It does lazy initialization of an object attribute, for example:

class Monster:
    def think(self):
        # do something smart
        ...

    @reify
    def inventory(self):
        return []

Here, monster.inventory doesn’t actually exist until you try to read it, at which point the function is called — once — and the list it returns becomes the attribute. It’s completely transparent, and once the value is created, it’s a normal attribute with no indirection cost. You can add items to it, and you’ll see the same list every time. Hence, “to make real”: the attribute isn’t real until you summon it into being by observing it.

This is nice for objects that deal with several related but interconnected ideas (which are thus difficult to split into multiple objects). If part of the object takes time or space to set up, you can slap @reify on it, and the end user won’t have to pay the cost if they don’t use that functionality.

It wasn’t on PyPI as a separate package for the longest time, probably because it can be implemented in a dozen lines. When I said it “shows up in a number of projects”, I meant “a number of projects have copy/pasted it from each other”.
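
For the curious, those dozen lines go something like this — a minimal sketch of the idea, not any particular project’s exact code:

import functools

class reify:
    def __init__(self, func):
        self.func = func
        functools.update_wrapper(self, func)

    def __get__(self, obj, objtype=None):
        if obj is None:
            # Accessed on the class itself; return the descriptor.
            return self
        value = self.func(obj)
        # Store the result directly on the instance.  Since this is a
        # non-data descriptor (no __set__), the instance attribute now
        # shadows the descriptor, and future reads never come back here.
        obj.__dict__[self.func.__name__] = value
        return value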

It finally showed up a couple years ago, under the name… cached-property. The docs even prominently show how to “invalidate” the “cache” — by mucking with object internals.

The problem I have here is that virtually every use of this decorator I have ever seen is not a cache. The example above is silly, of course, but it immediately demonstrates the problem: “invalidating” monster.inventory would irrevocably lose the only copy of the monster’s inventory. Real uses of @reify tend to produce database connections and other kinds of mutable storage, where “invalidating” would be similarly destructive. This isn’t data you can just whip up again if need be.

It’s possible to use @reify to create a cache, but it’s also possible to use dict to create a cache, so I don’t find that very compelling.

I did try to make my case for renaming the project early on — especially as the maintainer wanted to add this to the standard library! — but no one else liked reify and the conversation degenerated into bikeshedding over an alternative name. Naming really is the hardest problem in computer science.

Bonus: cool terminology we should use more

I love that the Git changelog refers to commands as having “learned” new things:

“git remote” learned “get-url” subcommand to show the URL for a given remote name used for fetching and pushing.

Relatedly, I love to see “spelled” used to explain how to write code (especially a single construct or brief expression). Indexing is spelled a[b], etc.

A function “signature” is just the set of arguments it takes.

I realize I said “semantic” a bunch of times in this post, but it doesn’t see much use outside HTML — a lot of programmers seem to get really preoccupied with what the physical hardware is doing. “Semantic” refers to what code means, as opposed to how it works in practice.

And my favorite, which I wish I had more excuses to use: a “nybble” is four bits.