sys/doc/sleep.ms

   1 .HTML "Process Sleep and Wakeup on a Shared-memory Multiprocessor
   2 .TL
   3 Process Sleep and Wakeup on a Shared-memory Multiprocessor
   4 .AU
   5 Rob Pike
   6 Dave Presotto
   7 Ken Thompson
   8 Gerard Holzmann
   9 .sp
  10 rob,presotto,ken,gerard@plan9.bell-labs.com
  11 .AB
  12 .FS
  13 Appeared in a slightly different form in
  14 .I
  15 Proceedings of the Spring 1991 EurOpen Conference,
  16 .R
  17 Tromsø, Norway, 1991, pp. 161-166.
  18 .FE
  19 The problem of enabling a `sleeping' process on a shared-memory multiprocessor
  20 is a difficult one, especially if the process is to be awakened by an interrupt-time
  21 event.  We present here the code
  22 for sleep and wakeup primitives that we use in our multiprocessor system.
  23 The code has been exercised by years of active use and by a verification
  24 system.
  25 .AE
  26 .LP
  27 Our problem is to synchronise processes on a symmetric shared-memory multiprocessor.
  28 Processes suspend execution, or
  29 .I sleep,
  30 while awaiting an enabling event such as an I/O interrupt.
  31 When the event occurs, the process is issued a
  32 .I wakeup
  33 to resume its execution.
  34 During these events, other processes may be running and other interrupts
  35 occurring on other processors.
  36 .LP
  37 More specifically, we wish to implement subroutines called
  38 .CW sleep ,
  39 callable by a process to relinquish control of its current processor,
  40 and
  41 .CW wakeup ,
  42 callable by another process or an interrupt to resume the execution
  43 of a suspended process.
  44 The calling conventions of these subroutines will remain unspecified
  45 for the moment.
  46 .LP
  47 We assume the processors have an atomic test-and-set or equivalent
  48 operation but no other synchronisation method.  Also, we assume interrupts
  49 can occur on any processor at any time, except on a processor that has
  50 locally inhibited them.
  51 .LP
  52 The problem is the generalisation to a multiprocessor of a familiar
  53 and well-understood uniprocessor problem.  It may be reduced to a
  54 uniprocessor problem by using a global test-and-set to serialise the
  55 sleeps and wakeups,
  56 which is equivalent to synchronising through a monitor.
  57 For performance and cleanliness, however,
  58 we prefer to allow the interrupt handling and process control to be multiprocessed.
  59 .LP
  60 Our attempts to solve the sleep/wakeup problem in Plan 9
  61 [Pik90]
  62 prompted this paper.
  63 We implemented solutions several times over several months and each
  64 time convinced ourselves \(em wrongly \(em they were correct.
  65 Multiprocessor algorithms can be
  66 difficult to prove correct by inspection and formal reasoning about them
  67 is impractical.  We finally developed an algorithm we trust by
  68 verifying our code using an
  69 empirical testing tool.
  70 We present that code here, along with some comments about the process by
  71 which it was designed.
  72 .SH
  73 History
  74 .LP
  75 Since processes in Plan 9 and the UNIX
  76 system have similar structure and properties, one might ask if
  77 UNIX
  78 .CW sleep
  79 and
  80 .CW wakeup
  81 [Bac86]
  82 could not easily be adapted from their standard uniprocessor implementation
  83 to our multiprocessor needs.
  84 The short answer is, no.
  85 .LP
  86 The
  87 UNIX
  88 routines
  89 take as argument a single global address
  90 that serves as a unique
  91 identifier to connect the wakeup with the appropriate process or processes.
  92 This has several inherent disadvantages.
  93 From the point of view of
  94 .CW sleep
  95 and
  96 .CW wakeup ,
  97 it is difficult to associate a data structure with an arbitrary address;
  98 the routines are unable to maintain a state variable recording the
  99 status of the event and processes.
 100 (The reverse is of course easy \(em we could
 101 require the address to point to a special data structure \(em
 102 but we are investigating
 103 UNIX
 104 .CW sleep
 105 and
 106 .CW wakeup ,
 107 not the code that calls them.)
 108 Also, multiple processes sleep `on' a given address, so
 109 .CW wakeup
 110 must enable them all, and let process scheduling determine which process
 111 actually benefits from the event.
 112 This is inefficient;
 113 a queueing mechanism would be preferable
 114 but, again, it is difficult to associate a queue with a general address.
 115 Moreover, the lack of state means that
 116 .CW sleep
 117 and
 118 .CW wakeup
 119 cannot know what the corresponding process (or interrupt) is doing;
 120 .CW sleep
 121 and
 122 .CW wakeup
 123 must be executed atomically.
 124 On a uniprocessor it suffices to disable interrupts during their
 125 execution.
 126 On a multiprocessor, however,
 127 most processors
 128 can inhibit interrupts only on the current processor,
 129 so while a process is executing
 130 .CW sleep
 131 the desired interrupt can come and go on another processor.
 132 If the wakeup is to be issued by another process, the problem is even harder.
 133 Some inter-process mutual exclusion mechanism must be used,
 134 which, yet again, is difficult to do without a way to communicate state.
 135 .LP
 136 In summary, to be useful on a multiprocessor,
 137 UNIX
 138 .CW sleep
 139 and
 140 .CW wakeup
 141 must either be made to run atomically on a single
 142 processor (such as by using a monitor)
 143 or they need a richer model for their communication.
 144 .SH
 145 The design
 146 .LP
 147 Consider the case of an interrupt waking up a sleeping process.
 148 (The other case, a process awakening a second process, is easier because
 149 atomicity can be achieved using an interlock.)
 150 The sleeping process is waiting for some event to occur, which may be
 151 modeled by a condition coming true.
 152 The condition could be just that the event has happened, or something
 153 more subtle such as a queue draining below some low-water mark.
 154 We represent the condition by a function of one
 155 argument of type
 156 .CW void* ;
 157 the code supporting the device generating the interrupts
 158 provides such a function to be used by
 159 .CW sleep
 160 and
 161 .CW wakeup
 162 to synchronise.  The function returns
 163 .CW false
 164 if the event has not occurred, and
 165 .CW true
 166 some time after the event has occurred.
 167 The
 168 .CW sleep
 169 and
 170 .CW wakeup
 171 routines must, of course, work correctly if the
 172 event occurs while the process is executing
 173 .CW sleep .
 174 .LP
 175 We assume that a particular call to
 176 .CW sleep
 177 corresponds to a particular call to
 178 .CW wakeup ,
 179 that is,
 180 at most one process is asleep waiting for a particular event.
 181 This can be guaranteed in the code that calls
 182 .CW sleep
 183 and
 184 .CW wakeup
 185 by appropriate interlocks.
 186 We also assume for the moment that there will be only one interrupt
 187 and that it may occur at any time, even before
 188 .CW sleep
 189 has been called.
 190 .LP
 191 For performance,
 192 we desire that multiple instances of
 193 .CW sleep
 194 and
 195 .CW wakeup
 196 may be running simultaneously on our multiprocessor.
 197 For example, a process calling
 198 .CW sleep
 199 to await a character from an input channel need not
 200 wait for another process to finish executing
 201 .CW sleep
 202 to await a disk block.
 203 At a finer level, we would like a process reading from one input channel
 204 to be able to execute
 205 .CW sleep
 206 in parallel with a process reading from another input channel.
 207 A standard approach to synchronisation is to interlock the channel `driver'
 208 so that only one process may be executing in the channel code at once.
 209 This method is clearly inadequate for our purposes; we need
 210 fine-grained synchronisation, and in particular to apply
 211 interlocks at the level of individual channels rather than at the level
 212 of the channel driver.
 213 .LP
 214 Our approach is to use an object called a
 215 .I rendezvous ,
 216 which is a data structure through which
 217 .CW sleep
 218 and
 219 .CW wakeup
 220 synchronise.
 221 (The similarly named construct in Ada is a control structure;
 222 ours is an unrelated data structure.)
 223 A rendezvous
 224 is allocated for each active source of events:
 225 one for each I/O channel,
 226 one for each end of a pipe, and so on.
 227 The rendezvous serves as an interlockable structure in which to record
 228 the state of the sleeping process, so that
 229 .CW sleep
 230 and
 231 .CW wakeup
 232 can communicate if the event happens before or while
 233 .CW sleep
 234 is executing.
 235 .LP
 236 Our design for
 237 .CW sleep
 238 is therefore a function
 239 .P1
 240 void sleep(Rendezvous *r, int (*condition)(void*), void *arg)
 241 .P2
 242 called by the sleeping process.
 243 The argument
 244 .CW r
 245 connects the call to
 246 .CW sleep
 247 with the call to
 248 .CW wakeup ,
 249 and is part of the data structure for the (say) device.
 250 The function
 251 .CW condition
 252 is described above;
 253 called with argument
 254 .CW arg ,
 255 it is used by
 256 .CW sleep
 257 to decide whether the event has occurred.
 258 .CW Wakeup
 259 has a simpler specification:
 260 .P1
 261 void wakeup(Rendezvous *r).
 262 .P2
 263 .CW Wakeup
 264 must be called after the condition has become true.
 265 .SH
 266 An implementation
 267 .LP
 268 The
 269 .CW Rendezvous
 270 data type is defined as
 271 .P1
 272 typedef struct{
 273         Lock    l;
 274         Proc    *p;
 275 }Rendezvous;
 276 .P2
 277 Our
 278 .CW Locks
 279 are test-and-set spin locks.
 280 The routine
 281 .CW lock(Lock\ *l)
 282 returns when the current process holds that lock;
 283 .CW unlock(Lock\ *l)
 284 releases the lock.
 285 .LP
 286 Here is our implementation of
 287 .CW sleep .
 288 Its details are discussed below.
 289 .CW Thisp
 290 is a pointer to the current process on the current processor.
 291 (Its value differs on each processor.)
 292 .P1
 293 void
 294 sleep(Rendezvous *r, int (*condition)(void*), void *arg)
 295 {
 296         int s;
 297
 298         s = inhibit();          /* interrupts */
 299         lock(&r->l);
 300
 301         /*
 302          * if condition happened, never mind
 303          */
 304         if((*condition)(arg)){
 305                 unlock(&r->l);
 306                 allow();        /* interrupts */
 307                 return;
 308         }
 309
 310         /*
 311          * now we are committed to
 312          * change state and call scheduler
 313          */
 314         if(r->p)
 315                 error("double sleep %d %d", r->p->pid, thisp->pid);
 316         thisp->state = Wakeme;
 317         r->p = thisp;
 318         unlock(&r->l);
 319         allow(s);       /* interrupts */
 320         sched();        /* relinquish CPU */
 321 }
 322 .P2
 323 .ne 3i
 324 Here is
 325 .CW wakeup.
 326 .P1
 327 void
 328 wakeup(Rendezvous *r)
 329 {
 330         Proc *p;
 331         int s;
 332
 333         s = inhibit();  /* interrupts; return old state */
 334         lock(&r->l);
 335         p = r->p;
 336         if(p){
 337                 r->p = 0;
 338                 if(p->state != Wakeme)
 339                         panic("wakeup: not Wakeme");
 340                 ready(p);
 341         }
 342         unlock(&r->l);
 343         if(s)
 344                 allow();
 345 }
 346 .P2
 347 .CW Sleep
 348 and
 349 .CW wakeup
 350 both begin by disabling interrupts
 351 and then locking the rendezvous structure.
 352 Because
 353 .CW wakeup
 354 may be called in an interrupt routine, the lock must be set only
 355 with interrupts disabled on the current processor,
 356 so that if the interrupt comes during
 357 .CW sleep
 358 it will occur only on a different processor;
 359 if it occurred on the processor executing
 360 .CW sleep ,
 361 the spin lock in
 362 .CW wakeup
 363 would hang forever.
 364 At the end of each routine, the lock is released and processor priority
 365 returned to its previous value.
 366 .CW Wakeup "" (
 367 needs to inhibit interrupts in case
 368 it is being called by a process;
 369 this is a no-op if called by an interrupt.)
 370 .LP
 371 .CW Sleep
 372 checks to see if the condition has become true, and returns if so.
 373 Otherwise the process posts its name in the rendezvous structure where
 374 .CW wakeup
 375 may find it, marks its state as waiting to be awakened
 376 (this is for error checking only) and goes to sleep by calling
 377 .CW sched() .
 378 The manipulation of the rendezvous structure is all done under the lock,
 379 and
 380 .CW wakeup
 381 only examines it under lock, so atomicity and mutual exclusion
 382 are guaranteed.
 383 .LP
 384 .CW Wakeup
 385 has a simpler job.  When it is called, the condition has implicitly become true,
 386 so it locks the rendezvous, sees if a process is waiting, and readies it to run.
 387 .SH
 388 Discussion
 389 .LP
 390 The synchronisation technique used here
 391 is similar to known methods, even as far back as Saltzer's thesis
 392 [Sal66].
 393 The code looks trivially correct in retrospect: all access to data structures is done
 394 under lock, and there is no place that things may get out of order.
 395 Nonetheless, it took us several iterations to arrive at the above
 396 implementation, because the things that
 397 .I can
 398 go wrong are often hard to see.  We had four earlier implementations
 399 that were examined at great length and only found faulty when a new,
 400 different style of device or activity was added to the system.
 401 .LP
 402 .ne 3i
 403 Here, for example, is an incorrect implementation of wakeup,
 404 closely related to one of our versions.
 405 .P1
 406 void
 407 wakeup(Rendezvous *r)
 408 {
 409         Proc *p;
 410         int s;
 411
 412         p = r->p;
 413         if(p){
 414                 s = inhibit();
 415                 lock(&r->l);
 416                 r->p = 0;
 417                 if(p->state != Wakeme)
 418                         panic("wakeup: not Wakeme");
 419                 ready(p);
 420                 unlock(&r->l);
 421                 if(s)
 422                         allow();
 423         }
 424 }
 425 .P2
 426 The mistake is that the reading of
 427 .CW r->p
 428 may occur just as the other process calls
 429 .CW sleep ,
 430 so when the interrupt examines the structure it sees no one to wake up,
 431 and the sleeping process misses its wakeup.
 432 We wrote the code this way because we reasoned that the fetch
 433 .CW p
 434 .CW =
 435 .CW r->p
 436 was inherently atomic and need not be interlocked.
 437 The bug was found by examination when a new, very fast device
 438 was added to the system and sleeps and interrupts were closely overlapped.
 439 However, it was in the system for a couple of months without causing an error.
 440 .LP
 441 How many errors lurk in our supposedly correct implementation above?
 442 We would like a way to guarantee correctness; formal proofs are beyond
 443 our abilities when the subtleties of interrupts and multiprocessors are
 444 involved.
 445 With that in mind, the first three authors approached the last to see
 446 if his automated tool for checking protocols
 447 [Hol91]
 448 could be
 449 used to verify our new
 450 .CW sleep
 451 and
 452 .CW wakeup
 453 for correctness.
 454 The code was translated into the language for that system
 455 (with, unfortunately, no way of proving that the translation is itself correct)
 456 and validated by exhaustive simulation.
 457 .LP
 458 The validator found a bug.
 459 Under our assumption that there is only one interrupt, the bug cannot
 460 occur, but in the more general case of multiple interrupts synchronising
 461 through the same condition function and rendezvous,
 462 the process and interrupt can enter a peculiar state.
 463 A process may return from
 464 .CW sleep
 465 with the condition function false
 466 if there is a delay between
 467 the condition coming true and
 468 .CW wakeup
 469 being called,
 470 with the delay occurring
 471 just as the receiving process calls
 472 .CW sleep .
 473 The condition is now true, so that process returns immediately,
 474 does whatever is appropriate, and then (say) decides to call
 475 .CW sleep
 476 again.  This time the condition is false, so it goes to sleep.
 477 The wakeup process then finds a sleeping process,
 478 and wakes it up, but the condition is now false.
 479 .LP
 480 There is an easy (and verified) solution: at the end of
 481 .CW sleep
 482 or after
 483 .CW sleep
 484 returns,
 485 if the condition is false, execute
 486 .CW sleep
 487 again.  This re-execution cannot repeat; the second synchronisation is guaranteed
 488 to function under the external conditions we are supposing.
 489 .LP
 490 Even though the original code is completely
 491 protected by interlocks and had been examined carefully by all of us
 492 and believed correct, it still had problems.
 493 It seems to us that some exhaustive automated analysis is
 494 required of multiprocessor algorithms to guarantee their safety.
 495 Our experience has confirmed that it is almost impossible to
 496 guarantee by inspection or simple testing the correctness
 497 of a multiprocessor algorithm.  Testing can demonstrate the presence
 498 of bugs but not their absence
 499 [Dij72].
 500 .LP
 501 We close by claiming that the code above with
 502 the suggested modification passes all tests we have for correctness
 503 under the assumptions used in the validation.
 504 We would not, however, go so far as to claim that it is universally correct.
 505 .SH
 506 References
 507 .LP
 508 [Bac86] Maurice J. Bach,
 509 .I "The Design of the UNIX Operating System,
 510 Prentice-Hall,
 511 Englewood Cliffs,
 512 1986.
 513 .LP
 514 [Dij72] Edsger W. Dijkstra,
 515 ``The Humble Programmer \- 1972 Turing Award Lecture'',
 516 .I "Comm. ACM,
 517 15(10), pp. 859-866,
 518 October 1972.
 519 .LP
 520 [Hol91] Gerard J. Holzmann,
 521 .I "Design and Validation of Computer Protocols,
 522 Prentice-Hall,
 523 Englewood Cliffs,
 524 1991.
 525 .LP
 526 [Pik90]
 527 Rob Pike,
 528 Dave Presotto,
 529 Ken Thompson,
 530 Howard Trickey,
 531 ``Plan 9 from Bell Labs'',
 532 .I "Proceedings of the Summer 1990 UKUUG Conference,
 533 pp. 1-9,
 534 London,
 535 July, 1990.
 536 .LP
 537 [Sal66] Jerome H. Saltzer,
 538 .I "Traffic Control in a Multiplexed Computer System
 539 MIT,
 540 Cambridge, Mass.,
 541 1966.