Emulated atomic operations and real-time scheduling


Unfortunately, not all CPU architectures have native support for
atomic operations, and some support only a very limited
subset. Most prominently, ARMv5 (and older) has no support besides
the most basic atomic swap operation[1]. At the same time, more and
more free code is starting to use atomic operations and lock-free
algorithms, one example being my own project, PulseAudio. If you
have ever done real-time programming you probably know that you
cannot really do it without support for atomic operations. One
question remains, however: what to do on CPUs which natively support
only the most basic atomic operations?

On the kernel side atomic ops are very easy to emulate: just disable
interrupts temporarily, do your operation non-atomically, and
afterwards enable them again. That’s relatively cheap and works fine
(unless you are on SMP, which fortunately you usually are not on
these CPUs). The Linux kernel does it this way and it is good. But
what to do in user space, where you cannot just go and disable
interrupts?
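
For reference, the kernel-side pattern boils down to something like
the following sketch, modeled on the kernel’s local_irq_save() and
local_irq_restore() helpers (a sketch only, and only correct on
uniprocessor systems):

/* With interrupts off on a uniprocessor machine nothing can preempt
 * us, so the plain read-modify-write below is effectively atomic.
 * Not SMP-safe. */
static inline int emulated_atomic_add_return(int i, volatile int *v)
{
        unsigned long flags;
        int ret;

        local_irq_save(flags);    /* disable interrupts, remember state */
        ret = (*v += i);          /* ordinary, non-atomic RMW */
        local_irq_restore(flags); /* restore previous interrupt state */

        return ret;
}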

Let’s see how the different userspace libraries/frameworks do it for
ARMv5, a very relevant architecture that only knows an atomic swap
(exchange) but no CAS or even atomic arithmetic. Let’s start with an
excerpt from glibc’s atomic operations implementation for ARM:

/* Atomic compare and exchange.  These sequences are not actually atomic;
   there is a race if *MEM != OLDVAL and we are preempted between the two
   swaps.  However, they are very close to atomic, and are the best that a
   pre-ARMv6 implementation can do without operating system support.
   LinuxThreads has been using these sequences for many years.  */

This comment says it all. Not good. The more you make use of atomic
operations, the more likely you’re going to hit this race. Let’s
hope glibc is not a heavy user of atomic operations. PulseAudio
however is, and PulseAudio happens to be my focus.
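
To see the race concretely, here is a C-level paraphrase of the
two-swap sequence that comment refers to (the real code is inline
assembly; swap() here is a made-up stand-in for the hardware swp
instruction):

/* Paraphrase of the pre-ARMv6 "almost CAS". The window between the
 * two swap() calls is the race the glibc comment admits to: another
 * thread can briefly observe newval even though the operation is
 * about to be undone. */
int almost_cas(volatile int *mem, int oldval, int newval)
{
    for (;;) {
        int cur = *mem;
        if (cur != oldval)
            return 0;                  /* values differ: CAS fails */

        int prev = swap(mem, newval);  /* speculatively store newval */
        if (prev == cur)
            return 1;                  /* nobody interfered: success */

        swap(mem, prev);               /* undo and retry; newval was
                                          visible for a moment here */
    }
}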

Let’s have a look at how Qt4 does it:

extern Q_CORE_EXPORT char q_atomic_lock;

inline char q_atomic_swp(volatile char *ptr, char newval)
{
    register int ret;
    asm volatile("swpb %0,%1,[%2]"
                 : "=&r"(ret)
                 : "r"(newval), "r"(ptr)
                 : "cc", "memory");
    return ret;
}

inline int q_atomic_test_and_set_int(volatile int *ptr, int expected, int newval)
{
    int ret = 0;
    while (q_atomic_swp(&q_atomic_lock, ~0) != 0);
    if (*ptr == expected) {
        *ptr = newval;
        ret = 1;
    }
    q_atomic_swp(&q_atomic_lock, 0);
    return ret;
}

So, what do we have here? A slightly better version. In standard
situations it actually works. But it sucks big time, too. Why? It
contains a spin lock: the variable q_atomic_lock is used for
locking the atomic operation. The code tries to set it to non-zero,
and if that fails it tries again, until it succeeds, in the hope that
the other thread, which currently holds the lock, gives it up. The
big problem here is: it might take a while until that happens, up to
1/HZ time on Linux. Usually you want to use atomic operations to
minimize the need for mutexes and thus speed things up. Now, here you
got a lock, and it’s the worst kind: the spinning lock. Not
good. Also, if used from a real-time thread the machine simply locks
up when we enter the loop in the contended state: a spinning RT
thread is never preempted in favour of the lower-priority thread
holding the lock, so the loop spins forever. Evil. And then, there’s
another problem: it’s a big bottleneck, because all atomic operations
are synchronized via a single variable, q_atomic_lock. Not good
either. And let’s not forget that only code that has access to
q_atomic_lock can actually execute this code safely. If you want to
use it for lock-free IPC via shared memory this is going to
break. And let’s not forget that it is unusable from signal handlers
(which probably doesn’t matter much, though). So, in summary: this
code sucks, too.

Next try: let’s have a look at how glib does it:

static volatile int atomic_spin = 0;

static int atomic_spin_trylock (void)
{
  int result;

  asm volatile (
    "swp %0, %1, [%2]\n"
    : "=&r,&r" (result)
    : "r,0" (1), "r,r" (&atomic_spin)
    : "memory");
  if (result == 0)
    return 0;
  else
    return -1;
}

static void atomic_spin_lock (void)
{
  while (atomic_spin_trylock())
    sched_yield();
}

static void atomic_spin_unlock (void)
{
  atomic_spin = 0;
}

gint
g_atomic_int_exchange_and_add (volatile gint *atomic,
			       gint           val)
{
  gint result;

  atomic_spin_lock();
  result = *atomic;
  *atomic += val;
  atomic_spin_unlock();

  return result;
}

Once again, a spin loop. However, this implementation makes use of
sched_yield() for asking the OS to reschedule. It’s a bit better
than the Qt version, since it doesn’t just spin burning CPU, but
instead tells the kernel to execute something else, increasing the
chance that the thread currently holding the lock is scheduled. It’s
a bit friendlier, but not great either, because this might still
delay execution quite a bit. It is also probably one of the very few
legitimate occasions where using sched_yield() is OK. It still
doesn’t work for RT, though: sched_yield() is in most cases
effectively a NOP for RT threads (if no other thread is runnable at
the same priority, the spinner is simply rescheduled immediately),
so you still get a machine lockup. And it still has the
one-lock-to-rule-them-all bottleneck. And it still is not compatible
with shared memory.

Then, there’s libatomic_ops. It’s
the most complex code, so I’ll spare you the paste here. Basically
it uses the same spin loop, with three differences however (a
condensed sketch follows the list):

  1. 16 lock variables instead of a single one are used. The variable
    that is used is picked via simple hashing of the pointer to the atomic variable
    that shall be modified. This removes the one-lock-to-rule-them-all
    bottleneck.
  2. Instead of pthread_yield() it uses select() with
    a small timeval parameter, to give the current holder of the lock
    some time to give it up. To make sure that the select() is not
    optimized away by the kernel, and the thread thus never preempted,
    the sleep time is increased on every loop iteration.
  3. It explicitly disables signals before doing the atomic operation.
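
Put together, a condensed sketch of this scheme might look as
follows. This is not the actual libatomic_ops code: the helper
names, the hash function, the backoff constants and the use of
nanosleep() instead of select() are my own assumptions.

#include <pthread.h>
#include <signal.h>
#include <stdint.h>
#include <time.h>

#define N_LOCKS 16

static volatile int locks[N_LOCKS];

/* Wrapper around the ARMv5 swp instruction, as in the excerpts above. */
static int ao_swap(volatile int *ptr, int newval)
{
    int ret;
    __asm__ __volatile__("swp %0, %1, [%2]"
                         : "=&r" (ret)
                         : "r" (newval), "r" (ptr)
                         : "memory");
    return ret;
}

/* Difference 1: pick one of 16 locks by hashing the pointer. */
static volatile int *lock_for(const volatile void *p)
{
    return &locks[((uintptr_t) p >> 4) & (N_LOCKS - 1)];
}

int emulated_cas(volatile int *mem, int oldval, int newval)
{
    volatile int *lock = lock_for(mem);
    struct timespec ts = { 0, 1000 };          /* start with a 1 us sleep */
    sigset_t all, saved;
    int ret = 0;

    /* Difference 3: block signals around the critical section. */
    sigfillset(&all);
    pthread_sigmask(SIG_BLOCK, &all, &saved);

    while (ao_swap(lock, 1) != 0) {            /* lock taken: back off */
        nanosleep(&ts, NULL);                  /* difference 2: sleep */
        if (ts.tv_nsec < 1000000)
            ts.tv_nsec *= 2;                   /* increase the sleep time */
    }

    if (*mem == oldval) {                      /* the actual CAS */
        *mem = newval;
        ret = 1;
    }

    *lock = 0;                                 /* release the lock */
    pthread_sigmask(SIG_SETMASK, &saved, NULL);
    return ret;
}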

It’s certainly the best implementation of the ones discussed here:
it doesn’t suffer from the one-lock-to-rule-them-all bottleneck, and
it’s (supposedly) signal handler safe (which however comes at the
cost of doing two syscalls on every atomic operation, probably a
very high price). It actually works with RT scheduling, due to
sleeping for an explicit time. However, it still doesn’t deal with
priority inversion problems, which are a big issue in real-time
programming. Also, the time slept in the select() call might be
relatively long, since at least on Linux the timeout passed to
select() is rounded up to 1/HZ; not good for RT either. And then, it
still doesn’t work for shared memory IPC.

So, what do we learn from this? At least one thing: better not do
real-time programming with ARMv5[2]. But more practically: what
would a good emulation of atomic ops, based solely on atomic swap,
look like? Here are a few ideas:

  • Use an implementation inspired by libatomic_ops. Right
    now it’s the best available. It’s probably a good idea, though, to
    replace select() with nanosleep(), since on recent kernels the
    latter no longer rounds up to 1/HZ, at least when you have
    high-resolution timers[3]. Then, if you can live without signal
    handler safety, drop the signal mask changing.
  • If you use something based on libatomic_ops and want to
    use it for shared memory IPC, then you have the option to move the
    lock variables into shared memory too. Note, however, that this
    allows evil applications to lock up your process by taking the
    locks and never giving them up (which is always a problem if not
    all atomic operations you need are available in hardware). So if
    you do this, make sure that only trusted processes can attach to
    your memory segment.
  • Alternatively, spend some time and investigate whether it is
    possible to use futexes to sleep on the lock variables. This is
    not trivial though, since futexes right now expect the
    availability of an atomic increment operation. But it might be
    possible to emulate this well enough with the swap
    operation. There’s now even a FUTEX_LOCK_PI operation which would
    allow priority inheritance.
  • Alternatively, find a way to allow user space to disable
    interrupts cheaply (requires kernel patching). Since enabling
    RT scheduling is a privileged operation already (since you may
    easily lock up your machine with it), it might not be too
    problematic to extend the ability to disable interrupts to user
    space: it’s just yet another way to lock up your machine.
  • For the libatomic_ops-based algorithm: if you’re lucky
    and defined a struct type for your atomic integer types, like the
    kernel does, or like I do in PulseAudio with pa_atomic_t,
    then you can stick the lock variable directly into your structure
    (see the sketch after this list). This makes shared memory support
    transparent, and removes the one-lock-to-rule-them-all bottleneck
    completely. Of course, OTOH it increases the memory consumption a
    bit and increases cache pressure (though I’d assume that this is
    negligible).
  • For the libatomic_ops-based algorithm: start sleeping for
    the time returned by clock_getres() (cache the result!); you
    cannot sleep for a shorter time than that anyway.
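
For illustration, here is a sketch of that embedded-lock idea,
reusing the ao_swap() helper (and headers) from the sketch
above. The type and function names are mine, not PulseAudio’s actual
implementation:

/* Each atomic integer carries its own lock word: there is no shared
 * lock table, unrelated atomic variables never contend with each
 * other, and the pair can live in shared memory as a unit. */
typedef struct emu_atomic {
    volatile int value;
    volatile int lock;    /* taken with the swap-and-sleep loop */
} emu_atomic_t;

/* Fetch-and-add built on the per-struct lock. */
int emu_atomic_add(emu_atomic_t *a, int i)
{
    struct timespec ts = { 0, 1000 };
    int ret;

    while (ao_swap(&a->lock, 1) != 0) {  /* contended: back off */
        nanosleep(&ts, NULL);
        if (ts.tv_nsec < 1000000)
            ts.tv_nsec *= 2;
    }

    ret = a->value;                      /* the actual fetch-and-add */
    a->value += i;

    a->lock = 0;                         /* release */
    return ret;
}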

Yepp, that’s as good as it gets. Unfortunately I cannot serve you
the optimal solution on a silver platter. I never actually did
development for ARMv5; this blog story just sums up my thoughts on
all the code I saw which emulates atomic ops on ARMv5. But maybe
someone who actually cares about atomic operations on ARM finds this
interesting and maybe invests some time to prepare patches for Qt,
glib, glibc, and PulseAudio.

Update: I added two more ideas to the list above.

Update 2: Andrew Haley just posted something like the optimal solution for the problem. It would be great if people would start using this.

Footnotes

[1] The Nokia 770 has an ARMv5 chip; the N800 has ARMv6. The OpenMoko phone apparently uses ARMv5.

[2] And let’s not even think about CPUs which don’t even have an atomic swap!

[3] Which however you probably won’t, given that they’re only available on x86 on stable Linux kernels for now — but still, it’s cleaner.