Defending UNIX and the OOM Killer

Dec 01, 2009 04:40

I found this via reddit and wrote an entirely too-long response. I figured I'd put my response here, mostly so I can find it again later, but maybe someone will find it interesting. I doubt it. ;-) Obviously you must read the original rant first for this to make any sense.



There was an LWN article recently about someone working on the OOM killer to improve process selection, so things may get better soon.

I must say, however, that you're being unduly hard on Linux. Every UNIX implementation of which I'm aware has a sbrk() system call which succeeds without regard to memory pressure and an OOM killer which behaves stupidly in one scenario or another. FreeBSD 8.0's OOM killer, for example, assumes the memory footprint of a process for which it can't immediately acquire PROC_LOCK is zero; in a particular system I've been working on, the runaway application has a highly contended PROC_LOCK, so the OOM killer basically always picks the largest of the remaining processes. In my case, that's always /sbin/init--yay, kernel panic.

The sbrk() and mmap() system calls--malloc() maps onto a combination of the two, these days--return success in almost every situation because, on modern UNIX implementations, they allocate address space, *not* memory. The implementations on which malloc() demonstrably *does* return NULL probably have RLIMIT_DATA set to some "reasonable" value--most Linux distros set it to infinity--but it's pretty trivial to construct 1) a situation where RLIMIT_DATA causes malloc() to return NULL even though there's plenty of remaining memory and 2) a situation where RLIMIT_DATA (even a very low limit) hasn't been hit but the system still runs out of memory and triggers the OOM killer.
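
To make 2) concrete, here's a rough sketch. It assumes a Linux box of roughly this vintage, where RLIMIT_DATA only constrains the brk()-grown heap and glibc services requests this large with mmap(); needless to say, don't run it anywhere you care about.

    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* A "very low" data segment limit: 1 MB. */
        struct rlimit rl = { 1 << 20, 1 << 20 };
        size_t chunk = 1 << 20;   /* well above glibc's mmap threshold */

        setrlimit(RLIMIT_DATA, &rl);

        /* Each request is serviced by mmap(), which isn't charged against  */
        /* RLIMIT_DATA here, so malloc() keeps returning non-NULL while the */
        /* memset()s fault in real pages--until the OOM killer steps in.    */
        for (;;) {
            void *p = malloc(chunk);
            if (p == NULL)
                break;            /* doesn't happen in practice */
            memset(p, 0xa5, chunk);
        }
        return 0;
    }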

I mentioned above that malloc() allocates address space. The way you allocate actual memory is, of course, by touching the address space to cause a page fault. A naive way to remove (part of) the OOM problem would be to have sbrk() and mmap() allocate both address space and memory, making the page faults unnecessary, but that solution is unacceptably wasteful in practice. First, the heap algorithms in a typical malloc() implementation--say, jemalloc from FreeBSD--impose a correlation between address and allocation size in order to keep virtual memory fragmentation under control. That is, 8-byte malloc()s are much more likely to be near other 8-byte malloc()s than near 12K malloc()s, for example. Even on a 64-bit machine, you can't just ignore VA fragmentation; there's still a finite amount of address space, and using it sparsely costs both memory (for page tables) and TLB entries. Second, even without also allocating memory, the sbrk() and mmap() system calls are expensive.

So the reality is that a malloc() implementation will tend to spread different-sized allocations out in the address space because of the different size pools, but must request large (contiguous) chunks of address space from sbrk() and mmap() in order to amortize the cost--address space it has no way of knowing whether it will actually use, since it can't predict what malloc() calls will be made in the future. So maybe you can hope for some kind of SIGBACONATOR that puts you on notice that you'd better go on a diet or face the OOM killer, but malloc() returning NULL is basically out of the question; it can't determine in advance whether there's enough free memory because communication with the OS is far too costly, and the page fault can't affect the return value because it happens *after* malloc() returns!
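
If you want to see the address-space-versus-memory distinction for yourself, something like the following sketch works on Linux (watch VmSize versus VmRSS in /proc/<pid>/status, or just use top): the malloc() only moves VmSize; it's the writes that move VmRSS.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        size_t size = (size_t)1 << 30;            /* 1 GB */
        long   page = sysconf(_SC_PAGESIZE);
        char  *buf  = malloc(size);               /* address space only */
        size_t i;

        if (buf == NULL)
            return 1;

        printf("pid %d: allocated--go compare VmSize and VmRSS\n", (int)getpid());
        sleep(15);

        /* Touch one byte per page; each write faults a real page in. */
        for (i = 0; i < size; i += (size_t)page)
            buf[i] = 1;

        printf("touched--now VmRSS has caught up\n");
        sleep(15);

        free(buf);
        return 0;
    }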

Along those lines: you mentioned that applications written to handle OOM conditions are able to degrade gracefully. That isn't actually true in practice--or at least, the likelihood that graceful degradation in OOM conditions is impossible is high enough that the code to implement it is almost never worth the effort, unless you are exercising complete control over a system (to the extent that you're willing to do non-portable things like setting oom_adj to influence oom_score). The reason is that, while an application can certainly avoid calling malloc() while trying to recover, it is much, much more difficult to avoid page faulting: it may need another page in some data buffer it has already allocated, or another page of stack--presumably, performing the recovery requires calling one or more functions. (Signal handling requires stack space too, of course.) Or it may need another page of code--after all, the no-more-memory code path by definition won't have been executed before, so it probably isn't even resident and will have to be paged in from disk. If you're using C++, maybe the tables used by the compiler to unwind the stack when throwing an exception aren't resident. In any of those situations, an application that needs memory for any reason deadlocks unless the OOM killer runs to free some physical memory and/or swap space.
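
For completeness, the non-portable knob I mentioned looks something like this on Linux (at the time of writing, oom_adj accepts -16 through 15 to bias victim selection, and -17 exempts the process entirely; it requires appropriate privileges, and it obviously does nothing for the rest of the system):

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Exempt a process from the OOM killer by writing -17 (OOM_DISABLE) */
    /* to its oom_adj file.  Returns 0 on success, -1 on failure.        */
    static int oom_exempt(pid_t pid)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/oom_adj", (int)pid);
        f = fopen(path, "w");
        if (f == NULL)
            return -1;
        fprintf(f, "-17\n");
        return fclose(f) == 0 ? 0 : -1;
    }

    int main(void)
    {
        return oom_exempt(getpid()) == 0 ? 0 : 1;
    }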

unix, oom, linux
