New Callout and Timer Facilities for NetBSD

Adam M. Costello and George Varghese
Computer and Communications Research Center
Washington University in St. Louis

Current BSD kernels (4.4BSD-Lite and derivatives) take time proportional to the number of outstanding timers (“callouts”) to set or cancel timers. Our implementation takes constant time to start, stop, and maintain timers; this leads to a highly scalable design that can support thousands of outstanding timers without much overhead. Unlike the existing implementation, our routines are guaranteed to lock out interrupts only for a small, bounded amount of time. We also extend the setitimer() interface to allow a process to have multiple outstanding timers, thereby reducing the need for users to maintain their own timer packages. The changes to the kernel are small (548 lines of code added, 80 removed), and are available below. See also:

a journal article (not online):
Adam M. Costello and George Varghese. Redesigning the BSD Timer Facilities. Software--Practice and Experience. 28(8):883-896, July 1998.
a nearly identical technical report: Redesigning the BSD Callout and Timer Facilities
a talk first presented on 10 Jan 1996

Patches for NetBSD:

machdep.c.patch applies to arch/sparc/sparc/machdep.c version 1.46
param.c.patch applies to conf/param.c version 1.15
kern_clock.c.patch applies to kern/kern_clock.c version 1.22
kern_exit.c.patch applies to kern/kern_exit.c version 1.30
kern_time.c.patch applies to kern/kern_time.c version 1.14
callout.h.patch applies to sys/callout.h version 1.10
proc.h.patch applies to sys/proc.h version 1.40
systm.h.patch applies to sys/systm.h version 1.36
time.h.patch applies to sys/time.h version 1.16

To activate the new facilities, you need additional configuration options.

For convenience, patches.tar.gz contains all of the patches, and the options.
Warning: They will all untar into the current directory.

Fixed bug: The hash function used when the performance results in the paper were generated accidentally used bitwise-or instead of bitwise-xor. Somehow, it didn't seem to adversely affect the performance, but it's been fixed in the patch above. Here is the original flawed version of kern_clock.c.patch.

Known bug: timeout_guts() inserts the new callout structure at the head of the list, rather than the tail. This causes two problems. First, callouts scheduled for the same time are not executed in causal order (the order in which they were scheduled), but rather in the reverse order, which differs from the existing implementation. Second, callouts that happen to get inserted into the list currently being traversed by softclock() don't get seen until callwheelsize ticks in the future, although they ought to get processed immediately. (Thanks to Glenn Waters for noticing this.)

The solution to both problems is to have timeout_guts() insert at the end of the list. This could be achieved by making the list fully circular, with each callwheel element being a dummy list item with both a forward and backward pointer. All code that operates on these lists would have to be modified.

Another way would be to leave the callwheel array as-is, use the c_back member of the first callout as a tail-pointer to the last callout, leave the c_next member of the last callout pointing to NULL, and enlarge the callout structure with a new pointer to the callwheel element at the head of the list (needed for updating the tail-pointer when the last callout is canceled). This would probably be easier to code, but...

Justin Gibbs has further improved on that idea. Store the absolute expiration time in the callout instead of the number of callwheel cycles until the expiration. This allows you to deduce which callwheel entry heads the list containing the callout, eliminating the need for the new pointer.

No matter how we choose to append callouts to the end of the list, timeout_guts() will need to change nextsoftcheck to point to the new callout if nextsoftcheck is NULL and (ticks + to_ticks - softticks) & callwheelbits is 0 (i.e. if softclock() is just about to fall off the end of the list being appended).

Aesthetic nitpick: The code uses “mrtimer” and “mrt” for some names, and “mritimer” and “mrit” for others. Eventually, all the names should be made consistent (preferring the latter convention).

I am no longer set up to compile and test the code, so I haven't tried implementing the changes outlined above. However, Justin Gibbs is continuing the development in the FreeBSD environment (at least, the callout portion, maybe not the mritimer stuff).

Prepared by Adam M. Costello
Last modified: 1998-Oct-17-Sat 03:54:54 GMT