The Organization of Networks in Plan 9

Dave Presotto
Phil Winterbottom

presotto,philw@plan9.bell-labs.com

Originally appeared in Proc. of the Winter 1993 USENIX Conf., San Diego, CA.
ABSTRACT

In a distributed system networks are of paramount importance. This paper describes the implementation, design philosophy, and organization of network support in Plan 9. Topics include network requirements for distributed systems, our kernel implementation, network naming, user interfaces, and performance. We also observe that much of this organization is relevant to current systems.
Introduction

Plan 9 [Pike90] is a general-purpose, multi-user, portable distributed system implemented on a variety of computers and networks. What distinguishes Plan 9 is its organization. The goals of this organization were to reduce administration and to promote resource sharing. One of the keys to its success as a distributed system is the organization and management of its networks.
A Plan 9 system comprises file servers, CPU servers and terminals. The file servers and CPU servers are typically centrally located multiprocessor machines with large memories and high speed interconnects. A variety of workstation-class machines serve as terminals connected to the central servers using several networks and protocols. The architecture of the system demands a hierarchy of network speeds matching the needs of the components. Connections between file servers and CPU servers are high-bandwidth point-to-point fiber links. Connections from the servers fan out to local terminals using medium speed networks such as Ethernet [Met80] and Datakit [Fra80]. Low speed connections via the Internet and the AT&T backbone serve users in Oregon and Illinois. Basic Rate ISDN data service and 9600 baud serial lines provide slow links to users at home.
Since CPU servers and terminals use the same kernel, users may choose to run programs locally on their terminals or remotely on CPU servers. The organization of Plan 9 hides the details of system connectivity, allowing both users and administrators to configure their environment to be as distributed or centralized as they wish. Simple commands support the construction of a locally represented name space spanning many machines and networks. At work, users tend to use their terminals like workstations, running interactive programs locally and reserving the CPU servers for data or compute intensive jobs such as compiling and computing chess endgames. At home or when connected over a slow network, users tend to do most work on the CPU server to minimize traffic on the slow links. The goal of the network organization is to provide the same environment to the user wherever resources are used.
Kernel Network Support
Networks play a central role in any distributed system. This is particularly true in Plan 9 where most resources are provided by servers external to the kernel. The importance of the networking code within the kernel is reflected by its size; of 25,000 lines of kernel code, 12,500 are network and protocol related. Networks are continually being added and the fraction of code devoted to communications is growing. Moreover, the network code is complex. Protocol implementations consist almost entirely of synchronization and dynamic memory management, areas demanding subtle error recovery strategies. The kernel currently supports Datakit, point-to-point fiber links, an Internet (IP) protocol suite and ISDN data service. The variety of networks and machines has raised issues not addressed by other systems running on commercial hardware supporting only Ethernet or FDDI.
The File System protocol
A central idea in Plan 9 is the representation of a resource as a hierarchical file system. Each process assembles a view of the system by building a name space [Needham] connecting its resources. File systems need not represent disc files; in fact, most Plan 9 file systems have no permanent storage. A typical file system dynamically represents some resource like a set of network connections or the process table. Communication between the kernel, device drivers, and local or remote file servers uses a protocol called 9P. The protocol consists of 17 messages describing operations on files and directories. Kernel resident device and protocol drivers use a procedural version of the protocol while external file servers use an RPC form. Nearly all traffic between Plan 9 systems consists of 9P messages.

9P relies on several properties of the underlying transport protocol. It assumes messages arrive reliably and in sequence and that delimiters between messages are preserved. When a protocol does not meet these requirements (for example, TCP does not preserve delimiters), we provide mechanisms to marshal messages before handing them to the system.
A kernel data structure, the channel, is a handle to a file server. Operations on a channel generate the following 9P messages. The session and attach messages authenticate a connection, established by means external to 9P, and validate its user. The result is an authenticated channel referencing the root of the server. The clone message makes a new channel identical to an existing channel, much like the dup system call. A channel may be moved to a file on the server using a walk message to descend each level in the hierarchy. The stat and wstat messages read and write the attributes of the file referenced by a channel. The open message prepares a channel for subsequent read and write messages to access the contents of the file. Create and remove perform the actions implied by their names on the file referenced by the channel. The clunk message discards a channel without affecting the file.
A kernel resident file server called the mount driver converts the procedural version of 9P into RPCs. The mount system call provides a file descriptor, which can be a pipe to a user process or a network connection to a remote machine, to be associated with the mount point. After a mount, operations on the file tree below the mount point are sent as messages to the file server. The mount driver manages buffers, packs and unpacks parameters from messages, and demultiplexes among processes using the file server.
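To make the flow concrete, the following minimal sketch (the server path /n/remote is hypothetical) shows ordinary file I/O whose every step the mount driver translates into 9P RPCs:

#include <u.h>
#include <libc.h>

void
main(void)
{
	char buf[512];
	int fd, n;

	/* the path lookup generates walk messages, then an open */
	fd = open("/n/remote/doc/README", OREAD);
	if(fd < 0)
		sysfatal("open: %r");
	/* each read becomes a 9P read RPC to the file server */
	while((n = read(fd, buf, sizeof buf)) > 0)
		write(1, buf, n);
	close(fd);	/* clunk discards the channel */
	exits(nil);
}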
Kernel Organization

The network code in the kernel is divided into three layers: hardware interface, protocol processing, and program interface. A device driver typically uses streams to connect the two interface layers. Additional stream modules may be pushed on a device to process protocols. Each device driver is a kernel-resident file system. Simple device drivers serve a single level directory containing just a few files; for example, we represent each UART by a data and a control file.

--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1ctl
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2ctl
The control file is used to control the device; writing the string b1200 to /dev/eia1ctl sets the line to 1200 baud.
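In program form, device control is just file I/O; a minimal sketch, assuming the /dev/eia1ctl file above:

#include <u.h>
#include <libc.h>

void
main(void)
{
	int fd;

	fd = open("/dev/eia1ctl", OWRITE);
	if(fd < 0)
		sysfatal("open: %r");
	/* the driver parses the ASCII command and sets the line speed */
	if(write(fd, "b1200", 5) != 5)
		sysfatal("write: %r");
	exits(nil);
}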
Multiplexed devices present a more complex interface structure. For example, the LANCE Ethernet driver serves a two level file tree (Figure 1) providing

- device control and configuration
- user-level protocols like ARP
- diagnostic interfaces for snooping software.

The top directory contains a single clone file and a numbered directory for each connection. Each connection directory corresponds to an Ethernet packet type. Opening the clone file finds an unused connection directory and opens its ctl file. Reading the control file returns the ASCII connection number; the user process can use this value to construct the name of the proper connection directory. In each connection directory files named ctl, data, stats, and type provide access to the connection.
Writing the string connect 2048 to the ctl file sets the packet type to 2048 and configures the connection to receive all IP packets sent to the machine. Subsequent reads of the type file yield the string 2048. The data file accesses the media; reading it returns the next packet of the selected type. Writing it queues a packet for transmission after appending a packet header containing the source address and packet type. The stats file returns ASCII text containing the interface address, packet input/output counts, error statistics, and general information about the state of the interface.

If several connections on an interface are configured for a particular packet type, each receives a copy of the incoming packets. The special packet type -1 selects all packets. Writing the string connect -1 to the ctl file configures a conversation to receive all packets on the Ethernet.
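The following minimal sketch performs that sequence by hand; the interface directory /net/ether and the buffer sizes are assumptions:

#include <u.h>
#include <libc.h>

void
main(void)
{
	char num[16], path[64], buf[2048];
	int cfd, dfd, n;

	/* opening the clone file reserves an unused connection */
	cfd = open("/net/ether/clone", ORDWR);
	if(cfd < 0)
		sysfatal("clone: %r");
	n = read(cfd, num, sizeof num - 1);	/* ASCII connection number */
	if(n <= 0)
		sysfatal("read: %r");
	num[n] = 0;
	/* select Ethernet packet type 2048 (IP) */
	if(write(cfd, "connect 2048", 12) < 0)
		sysfatal("connect: %r");
	snprint(path, sizeof path, "/net/ether/%s/data", num);
	dfd = open(path, ORDWR);
	if(dfd < 0)
		sysfatal("data: %r");
	n = read(dfd, buf, sizeof buf);		/* next IP packet */
	print("got %d byte packet\n", n);
	exits(nil);
}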
Although the driver interface may seem elaborate, the representation of a device as a set of files using ASCII strings for communication has several advantages. Any mechanism supporting remote access to files immediately allows a remote machine to use our interfaces as gateways. Using ASCII strings to control the interface avoids byte order problems and ensures a uniform representation for devices on the same machine and even allows devices to be accessed remotely. Representing dissimilar devices by the same set of files allows common tools to serve several networks or interfaces. Programs like stty are replaced by echo and shell redirection.
Protocol devices

Network connections are represented as pseudo-devices called protocol devices. Protocol device drivers exist for the Datakit URP protocol and for each of the Internet IP protocols TCP, UDP, and IL. IL, described below, is a new communication protocol used by Plan 9 for transmitting file system RPC's. All protocol devices look identical so user programs contain no network-specific code.
Each protocol device driver serves a directory structure similar to that of the Ethernet driver. The top directory contains a clone file and a numbered directory for each connection. Each connection directory contains files to control one connection and to send and receive information. A TCP connection directory looks like this:
--rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 ctl
--rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 data
--rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 listen
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 local
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 remote
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 status
cpu% cat local remote status
tcp/2 1 Established connect
The local, remote, and status files supply information about the state of the connection. The data and ctl files provide access to the process end of the stream implementing the protocol. The listen file is used to accept incoming calls from the network.
The following steps establish a connection.

1. The clone device of the appropriate protocol directory is opened to reserve an unused connection.
2. The file descriptor returned by the open points to the ctl file of the new connection. Reading that file descriptor returns an ASCII string containing the connection number.
3. A protocol/network specific ASCII address string is written to the ctl file.
4. The path of the data file is constructed using the connection number. When the data file is opened the connection is established.

A process can read and write this file descriptor to send and receive messages from the network. If the process opens the listen file it blocks until an incoming call is received. An address string written to the ctl file before the listen selects the ports or services the process is prepared to accept. When an incoming call is received, the open completes and returns a file descriptor pointing to the ctl file of the new connection. Reading the ctl file yields a connection number used to construct the path of the data file. A connection remains established while any of the files in the connection directory are referenced or until a close is received from the network.
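A minimal sketch of these four steps for a TCP connection; the destination address and port are illustrative:

#include <u.h>
#include <libc.h>

void
main(void)
{
	char num[16], path[64];
	int cfd, dfd, n;

	cfd = open("/net/tcp/clone", ORDWR);		/* step 1 */
	if(cfd < 0)
		sysfatal("clone: %r");
	n = read(cfd, num, sizeof num - 1);		/* step 2 */
	if(n <= 0)
		sysfatal("read: %r");
	num[n] = 0;
	/* step 3: protocol-specific ASCII address string */
	if(write(cfd, "connect 135.104.9.31!564", 24) < 0)
		sysfatal("connect: %r");
	snprint(path, sizeof path, "/net/tcp/%s/data", num);
	dfd = open(path, ORDWR);			/* step 4 */
	if(dfd < 0)
		sysfatal("data: %r");
	/* dfd may now be read and written to exchange data */
	exits(nil);
}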
Streams

A stream [Rit84a][Presotto] is a bidirectional channel connecting a physical or pseudo-device to user processes. The user processes insert and remove data at one end of the stream. Kernel processes acting on behalf of a device insert data at the other end. Asynchronous communications channels such as pipes, TCP conversations, Datakit conversations, and RS232 lines are implemented using streams.
A stream comprises a linear list of processing modules. Each module has both an upstream (toward the process) and downstream (toward the device) put routine. Calling the put routine of the module on either end of the stream inserts data into the stream. Each module calls the succeeding one to send data up or down the stream.

An instance of a processing module is represented by a pair of queues, one for each direction. The queues point to the put procedures and can be used to queue information traveling along the stream. Some put routines queue data locally and send it along the stream at some later time, either due to a subsequent call or an asynchronous event such as a retransmission timer or a device interrupt. Processing modules create helper kernel processes to provide a context for handling asynchronous events. For example, a helper kernel process awakens periodically to perform any necessary TCP retransmissions. The use of kernel processes instead of serialized run-to-completion service routines differs from the implementation of Unix streams. Unix service routines cannot use any blocking kernel resource and they lack local, long-lived state. Helper kernel processes solve these problems and simplify the stream code.
There is no implicit synchronization in our streams. Each processing module must ensure that concurrent processes using the stream are synchronized. This maximizes concurrency but introduces the possibility of deadlock. However, deadlocks are easily avoided by careful programming; to date they have not caused us problems.

Information is represented by linked lists of kernel structures called blocks. Each block contains a type, some state flags, and pointers to an optional buffer. Block buffers can hold either data or control information, i.e., directives to the processing modules. Blocks and block buffers are dynamically allocated from kernel memory.
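The declarations below are a rough sketch of these structures; the field names and layout are illustrative rather than the kernel's exact definitions:

typedef struct Block Block;
typedef struct Queue Queue;

struct Block {
	Block		*next;	/* next block in the list */
	int		type;	/* data or control */
	int		flags;	/* state, e.g. a delimiter mark */
	unsigned char	*base;	/* optional buffer */
	unsigned char	*rptr;	/* first unconsumed byte */
	unsigned char	*wptr;	/* first free byte */
};

struct Queue {
	void	(*put)(Queue*, Block*);	/* this module's put routine */
	Queue	*next;			/* next module in this direction */
	Block	*first;			/* blocks queued locally */
	Block	*last;
};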
A stream is represented at user level as two files, ctl and data. The actual names can be changed by the device driver using the stream, as we saw earlier in the example of the UART driver. The first process to open either file creates the stream automatically. The last close destroys it.

Writing to the data file copies the data into kernel blocks and passes them to the downstream put routine of the first processing module. A write of less than 32K is guaranteed to be contained by a single block. Concurrent writes to the same stream are not synchronized, although the 32K block size assures atomic writes for most protocols. The last block written is flagged with a delimiter to alert downstream modules that care about write boundaries. In most cases the first put routine calls the second, the second calls the third, and so on until the data is output. As a consequence, most data is output without context switching.

Reading the data file returns data queued at the top of the stream. The read terminates when the read count is reached or when the end of a delimited block is encountered. A per stream read lock ensures only one process can read from a stream at a time and guarantees that the bytes read were contiguous bytes from the stream.
Like UNIX streams [Rit84a], Plan 9 streams can be dynamically configured. The stream system intercepts and interprets the following control blocks:

push name - adds an instance of the processing module name to the top of the stream.
pop - removes the top module of the stream.
hangup - sends a hangup message up the stream from the device end.
Other control blocks are module-specific and are interpreted by each processing module as they pass.

The convoluted syntax and semantics of the UNIX ioctl system call convinced us to leave it out of Plan 9. Instead, device control operations are implemented by writing to the ctl file of the stream. Writing to the ctl file is identical to writing to a data file except the blocks are of type control. A processing module parses each control block it sees. Commands in control blocks are ASCII strings, so byte ordering is not an issue when one system controls streams in a name space implemented on another processor. The time to parse control blocks is not important, since control operations are rare.
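For example, pushing a module is an ordinary write of an ASCII command; in this minimal sketch the device path and module name are illustrative:

#include <u.h>
#include <libc.h>

void
main(void)
{
	int cfd;

	cfd = open("/net/dk/5/ctl", OWRITE);
	if(cfd < 0)
		sysfatal("open: %r");
	/* the stream system interprets this control block itself */
	if(write(cfd, "push urp", 8) < 0)
		sysfatal("push: %r");
	exits(nil);
}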
Device Interfaces

The module at the downstream end of the stream is part of a device interface. The particulars of the interface vary with the device. Most device interfaces consist of an interrupt routine, an output put routine, and a kernel process. The output put routine stages data for the device and starts the device if it is stopped. The interrupt routine wakes up the kernel process whenever the device has input to be processed or needs more output staged. The kernel process puts information up the stream or stages more data for output. The division of labor among the different pieces varies depending on how much must be done at interrupt level. However, the interrupt routine may not allocate blocks or call a put routine since both actions require a process context.
Multiplexing

The conversations using a protocol device must be multiplexed onto a single physical wire. We push a multiplexer processing module onto the physical device stream to group the conversations. The device end modules on the conversations add the necessary header onto downstream messages and then put them to the module downstream of the multiplexer. The multiplexing module looks at each message moving up its stream and puts it to the correct conversation stream after stripping the header controlling the demultiplexing.

This is similar to the Unix implementation of multiplexer streams. The major difference is that we have no general structure that corresponds to a multiplexer. Each attempt to produce a generalized multiplexer created a more complicated structure and underlined the basic difficulty of generalizing this mechanism. We now code each multiplexer from scratch and favor simplicity over generality.
Reflections

Despite five years' experience and the efforts of many programmers, we remain dissatisfied with the stream mechanism. Performance is not an issue; the time to process protocols and drive device interfaces continues to dwarf the time spent allocating, freeing, and moving blocks of data. However the mechanism remains inordinately complex. Much of the complexity results from our efforts to make streams dynamically configurable, to reuse processing modules on different devices and to provide kernel synchronization to ensure data structures don't disappear under foot. This is particularly irritating since we seldom use these properties.

Streams remain in our kernel because we are unable to devise a better alternative. Larry Peterson's X-kernel [Pet89a] is the closest contender but doesn't offer enough advantage to switch. If we were to rewrite the streams code, we would probably statically allocate resources for a large fixed number of conversations and burn memory in favor of less complexity.
The IL Protocol

None of the standard IP protocols is suitable for transmission of 9P messages over an Ethernet or the Internet. TCP has a high overhead and does not preserve delimiters. UDP, while cheap, does not provide reliable sequenced delivery. Early versions of the system used a custom protocol that was efficient but unsatisfactory for internetwork transmission. When we implemented IP, TCP, and UDP we looked around for a suitable replacement with the following properties:

- Reliable datagram service with sequenced delivery
- Internetworking using IP
- Low complexity, high performance
- Adaptive timeouts

None met our needs so a new protocol was designed.
IL is a lightweight protocol designed to be encapsulated by IP. It is a connection-based protocol providing reliable transmission of sequenced messages between machines. No provision is made for flow control since the protocol is designed to transport RPC messages between client and server. A small outstanding message window prevents too many incoming messages from being buffered; messages outside the window are discarded and must be retransmitted. Connection setup uses a two way handshake to generate initial sequence numbers at each end of the connection; subsequent data messages increment the sequence numbers, allowing the receiver to resequence out of order messages. In contrast to other protocols, IL does not do blind retransmission. If a message is lost and a timeout occurs, a query message is sent. The query message is a small control message containing the current sequence numbers as seen by the sender. The receiver responds to a query by retransmitting missing messages. This allows the protocol to behave well in congested networks, where blind retransmission would cause further congestion.

Like TCP, IL has adaptive timeouts. A round-trip timer is used to calculate acknowledge and retransmission times in terms of the network speed. This allows the protocol to perform well on both the Internet and on local Ethernets.

In keeping with the minimalist design of the rest of the kernel, IL is small. The entire protocol is 847 lines of code, compared to 2200 lines for TCP. IL is our protocol of choice.
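The header sketched below is a reconstruction from this description, not a definitive wire format; an IL message rides inside an IP packet, carries source and destination ports, and labels each message with a sequence id and an acknowledgement. The fields are byte arrays so the layout is independent of host byte order:

typedef struct Ilhdr Ilhdr;
struct Ilhdr {
	unsigned char	sum[2];		/* checksum */
	unsigned char	len[2];		/* length of the IL message */
	unsigned char	type;		/* data, ack, query, reply, ... */
	unsigned char	spec;		/* special/window field */
	unsigned char	src[2];		/* source port */
	unsigned char	dst[2];		/* destination port */
	unsigned char	id[4];		/* sequence number of this message */
	unsigned char	ack[4];		/* highest in-sequence message seen */
};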
Network Addressing

A uniform interface to protocols and devices is not sufficient to support the transparency we require. Since each network uses a different addressing scheme, the ASCII strings written to a control file have no common format. As a result, every tool must know the specifics of the networks it is capable of addressing. Moreover, since each machine supplies a subset of the available networks, each user must be aware of the networks supported by every terminal and server machine. This is obviously unacceptable.
Several possible solutions were considered and rejected; one deserves more discussion. We could have used a user-level file server to represent the network name space as a Plan 9 file tree. This global naming scheme has been implemented in other distributed systems. The file hierarchy provides paths to directories representing network domains. Each directory contains files representing the names of the machines in that domain; an example might be the path /net/name/usa/edu/mit/ai. Each machine file contains information like the IP address of the machine. We rejected this representation for several reasons. First, it is hard to devise a hierarchy encompassing all representations of the various network addressing schemes in a uniform manner. Datakit and Ethernet address strings have nothing in common. Second, the address of a machine is often only a small part of the information required to connect to a service on the machine. For example, the IP protocols require symbolic service names to be mapped into numeric port numbers, some of which are privileged and hence special. Information of this sort is hard to represent in terms of file operations. Finally, the size and number of the networks being represented burden users with an unacceptably large amount of information about the organization of the network and its connectivity. In this case the Plan 9 representation of a resource as a file is not appropriate.
If tools are to be network independent, a third-party server must resolve network names. A server on each machine, with local knowledge, can select the best network for any particular destination machine or service. Since the network devices present a common interface, the only operation which differs between networks is name resolution. A symbolic name must be translated to the path of the clone file of a protocol device and an ASCII address string to write to the ctl file. A connection server (CS) provides this service.
The Database

On most systems several files such as /etc/hosts, /etc/networks, /etc/services, and /etc/hosts.equiv hold network information. Much time and effort is spent administering these files and keeping them mutually consistent. Tools attempt to automatically derive one or more of the files from information in other files but maintenance continues to be difficult and error prone.
Since we were writing an entirely new system, we were free to try a simpler approach. One database on a shared server contains all the information needed for network administration. Two ASCII files comprise the main database: /lib/ndb/local contains locally administered information and /lib/ndb/global contains information imported from elsewhere. The files contain sets of attribute/value pairs of the form attr=value, where attr and value are alphanumeric strings. Systems are described by multi-line entries; a header line at the left margin begins each entry followed by zero or more indented attribute/value pairs specifying names, addresses, properties, etc. For example, the entry for our CPU server specifies a domain name, an IP address, an Ethernet address, a Datakit address, a boot file, and supported protocols.
sys = helix
	dom=helix.research.bell-labs.com
	bootf=/mips/9power
	ip=135.104.9.31 ether=0800690222f0
	dk=nj/astro/helix
	proto=il
If several systems share entries such as network mask and gateway, we specify that information with the network or subnetwork instead of the system. The following entries define a Class B IP network and a few subnets derived from it. The entry for the network specifies the IP mask, file system, and authentication server for all systems on the network. Each subnetwork specifies its default IP gateway.
ipnet=mh-astro-net ip=135.104.0.0 ipmask=255.255.255.0
	fs=bootes.research.bell-labs.com
	auth=p9auth.research.bell-labs.com
ipnet=unix-room ip=135.104.117.0
	ipgw=135.104.117.1
ipnet=third-floor ip=135.104.51.0
	ipgw=135.104.51.1
ipnet=fourth-floor ip=135.104.52.0
	ipgw=135.104.52.1
Database entries also define the mapping of service names to port numbers for TCP, UDP, and IL.
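For example, entries of the following form associate service names with ports; the values shown are illustrative, using the attr=value syntax above, the well-known TCP ports, and the IL 9fs port seen in the connection server example below:

tcp=echo	port=7
tcp=discard	port=9
tcp=telnet	port=23
il=9fs		port=17008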
All programs read the database directly so consistency problems are rare. However the database files can become large. Our global file, containing all information about both Datakit and Internet systems in AT&T, has 43,000 lines. To speed searches, we build hash table files for each attribute we expect to search often. The hash file entries point to entries in the master files. Every hash file contains the modification time of its master file so we can avoid using an out-of-date hash table. Searches for attributes that aren't hashed or whose hash table is out-of-date still work, they just take longer.
Connection Server

On each system a user level connection server process, CS, translates symbolic names to addresses. CS uses information about available networks, the network database, and other servers (such as DNS) to translate names. CS is a file server serving a single file, /net/cs. A client writes a symbolic name to /net/cs then reads one line for each matching destination reachable from this host. The lines are of the form filename message, where filename is the path of the clone file to open for a new connection and message is the string to write to it to make the connection. The following example illustrates this. ndb/csquery is a program that prompts for strings to write to /net/cs and prints the replies.
cpu% ndb/csquery
> net!helix!9fs
/net/il/clone 135.104.9.31!17008
/net/dk/clone nj/astro/helix!9fs
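In program form, a query to CS is just file I/O on /net/cs; the following minimal sketch (error handling abbreviated) performs the query shown above:

#include <u.h>
#include <libc.h>

void
main(void)
{
	char buf[128];
	int fd, n;

	fd = open("/net/cs", ORDWR);
	if(fd < 0)
		sysfatal("open /net/cs: %r");
	if(write(fd, "net!helix!9fs", 13) < 0)
		sysfatal("write: %r");
	seek(fd, 0, 0);		/* rewind to read the answers */
	while((n = read(fd, buf, sizeof buf - 1)) > 0){
		buf[n] = 0;
		print("%s\n", buf);	/* e.g. /net/il/clone 135.104.9.31!17008 */
	}
	exits(nil);
}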
CS provides meta-name translation to perform complicated searches. The special network name net selects any network in common between source and destination supporting the specified service. A host name of the form $attr is the name of an attribute in the network database. The database search returns the value of the matching attribute/value pair most closely associated with the source host. Most closely associated is defined on a per network basis. For example, the symbolic name tcp!$auth!rexauth causes CS to search for the auth attribute in the database entry for the source system, then its subnetwork (if there is one) and then its network.
cpu% ndb/csquery
> net!$auth!rexauth
/net/il/clone 135.104.9.34!17021
/net/dk/clone nj/astro/p9auth!rexauth
/net/il/clone 135.104.9.6!17021
/net/dk/clone nj/astro/musca!rexauth
Normally CS derives naming information from its database files. For domain names however, CS first consults another user level process, the domain name server (DNS). If no DNS is reachable, CS relies on its own tables.
Like CS, the domain name server is a user level process providing one file, /net/dns. A client writes a request of the form domain-name type, where type is a domain name service resource record type. DNS performs a recursive query through the Internet domain name system producing one line per resource record found. The client reads /net/dns to retrieve the records. Like other domain name servers, DNS caches information learned from the network. DNS is implemented as a multi-process shared memory application with separate processes listening for network and local requests.
Library routines

The section on protocol devices described the details of making and receiving connections across a network. The dance is straightforward but tedious. Library routines are provided to relieve the programmer of the details.
The dial library call establishes a connection to a remote destination. It returns an open file descriptor for the data file in the connection directory.

	int dial(char *dest, char *local, char *dir, int *cfdp)

dest is the symbolic name/address of the destination. local is the local address. Since most networks do not support this, it is usually zero. dir is a pointer to a buffer to hold the path name of the protocol directory representing this connection. dial fills this buffer if the pointer is non-zero. cfdp is a pointer to a file descriptor for the ctl file of the connection. If the pointer is non-zero, dial opens the control file and stores the file descriptor there.
Most calls to dial specify only a destination name, leaving all other arguments zero. dial uses CS to translate the symbolic name to all possible destination addresses and attempts to connect to each in turn until one works. Specifying the special name net in the network portion of the destination allows CS to pick a network/protocol in common with the destination for which the requested service is valid. For example, assume the system research.bell-labs.com has the Datakit address nj/astro/research and IP addresses 135.104.117.5 and 129.11.4.1. The call

	fd = dial("net!research.bell-labs.com!login", 0, 0, 0);

tries in succession to connect to nj/astro/research!login on the Datakit and both 135.104.117.5!513 and 129.11.4.1!513 on the Internet, until one succeeds.

dial accepts addresses as well as symbolic names. For example, the destinations tcp!135.104.117.5!513 and tcp!research.bell-labs.com!login are both legal references to the same machine.
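A complete client needs little more than the dial call itself; in this minimal sketch the destination name and service are illustrative:

#include <u.h>
#include <libc.h>

void
main(void)
{
	char buf[128];
	int fd, n;

	/* let CS pick any common network offering the echo service */
	fd = dial("net!helix!echo", 0, 0, 0);
	if(fd < 0)
		sysfatal("dial: %r");
	write(fd, "hello\n", 6);
	if((n = read(fd, buf, sizeof buf)) > 0)
		write(1, buf, n);	/* the echoed bytes */
	exits(nil);
}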
A process uses four routines to listen for incoming connections. announce registers the process's intention to receive connections, listen waits for calls, and finally accept or reject is called to complete the connection.

announce returns an open file descriptor for the ctl file of a connection and fills dir with the path of the protocol directory for the announcement.

	int announce(char *addr, char *dir)

addr is the symbolic name/address announced; if it does not contain a service, the announcement is for all services not explicitly announced. Thus, one can easily write the equivalent of the inetd program without having to announce each separate service. An announcement remains in force until the control file is closed.

listen returns an open file descriptor for the ctl file and fills ldir with the path of the protocol directory for the received connection. It is passed dir from the announcement.

	int listen(char *dir, char *ldir)

accept and reject are called with the control file descriptor and ldir returned by listen. Some networks such as Datakit accept a reason for a rejection; networks such as IP ignore the third argument.

	int accept(int ctl, char *ldir)
	int reject(int ctl, char *ldir, char *reason)
The following code implements a typical TCP listener. It announces itself, listens for connections, and forks a new process for each. The new process echoes data on the connection until the remote end closes it. The "*" in the symbolic name means the announcement is valid for any addresses bound to the machine the program is run on.
#include <u.h>
#include <libc.h>

int
main(void)
{
	int afd, lcfd, dfd, n;
	char adir[40], ldir[40];
	char buf[256];

	/* announce the echo service on any local address */
	afd = announce("tcp!*!echo", adir);
	if(afd < 0)
		return -1;

	for(;;){
		/* listen for a call */
		lcfd = listen(adir, ldir);
		if(lcfd < 0)
			return -1;

		/* fork a process to echo */
		switch(fork()){
		case 0:
			/* accept the call and open the data file */
			dfd = accept(lcfd, ldir);
			if(dfd < 0)
				return -1;

			/* echo until EOF */
			while((n = read(dfd, buf, sizeof(buf))) > 0)
				write(dfd, buf, n);
			return 0;
		case -1:
			perror("forking");
		default:
			close(lcfd);
			break;
		}
	}
}
User Level

Communication between Plan 9 machines is done almost exclusively in terms of 9P messages. Only the two services cpu and exportfs are used. The cpu service is analogous to rlogin. However, rather than emulating a terminal session across the network, cpu creates a process on the remote machine whose name space is an analogue of the window in which it was invoked. exportfs is a user level file server which allows a piece of name space to be exported from machine to machine across a network. It is used by the cpu command to serve the files in the terminal's name space when they are accessed from the cpu server.
By convention, the protocol and device driver file systems are mounted in a directory called /net. Although the per-process name space allows users to configure an arbitrary view of the system, in practice their profiles build a conventional name space.
Exportfs

Exportfs is invoked by an incoming network call. The listener (the Plan 9 equivalent of inetd) runs the profile of the user requesting the service to construct a name space before starting exportfs. After an initial protocol establishes the root of the file tree being exported, the remote process mounts the connection, allowing exportfs to act as a relay file server. Operations in the imported file tree are executed on the remote server and the results returned. As a result the name space of the remote machine appears to be exported into a local file tree.

The import command calls exportfs on a remote machine, mounts the result in the local name space, and then exits. No local process is required to serve mounts; 9P messages are generated by the kernel's mount driver and sent directly over the network.
Exportfs must be multithreaded since the system calls open, read and write may block. Plan 9 does not implement the select system call but does allow processes to share file descriptors, memory and other resources. Exportfs and the configurable name space provide a means of sharing resources between machines. It is a building block for constructing complex name spaces served from many machines.
The simplicity of the interfaces encourages naive users to exploit the potential of a richly connected environment. Using these tools it is easy to gateway between networks. For example a terminal with only a Datakit connection can import from the server helix:

	import -a helix /net
The import command makes a Datakit connection to the machine helix where it starts an instance of exportfs to serve /net. The import command mounts the remote /net directory after (the -a option to import) the existing contents of the local /net directory. The directory then contains the union of the local and remote contents of /net. Local entries supersede remote ones of the same name so networks on the local machine are chosen in preference to those supplied remotely. However, unique entries in the remote directory are now visible in the local /net directory. All the networks connected to helix are now available in the terminal. The effect on the name space is shown by the following example:

	philw-gnot% import -a musca /net
Ftpfs

We decided to make our interface to FTP a file system rather than the traditional command. ftpfs dials the FTP port of a remote system, prompts for login and password, sets image mode, and mounts the remote file system onto /n/ftp. Files and directories are cached to reduce traffic. The cache is updated whenever a file is created. Ftpfs works with TOPS-20, VMS, and various Unix flavors as the remote system.
Cyclone Fiber Links

The file servers and CPU servers are connected by high-performance point-to-point links. A link consists of two VME cards connected by a pair of optical fibers. The VME cards use 33MHz Intel 960 processors and AMD's TAXI fiber transmitter/receivers to drive the lines at 125 Mbit/sec. Software in the VME card reduces latency by copying messages from system memory to fiber without intermediate buffering.
Performance

We measured both latency and throughput of reading and writing bytes between two processes for a number of different paths. Measurements were made on two- and four-CPU SGI Power Series processors. The CPUs are 25 MHz MIPS 3000s. The latency is measured as the round trip time for a byte sent from one process to another and back again. Throughput is measured using 16k writes from one process to another.

Table 1 - Performance

	test		throughput	latency
			MBytes/sec	millisec
	URP/Datakit	0.22		1.75
Conclusion

The representation of all resources as file systems coupled with an ASCII interface has proved more powerful than we had originally imagined. Resources can be used by any computer in our networks independent of byte ordering or CPU type. The connection server provides an elegant means of decoupling tools from the networks they use. Users successfully use Plan 9 without knowing the topology of the system or the networks they use. More information about 9P can be found in Section 5 of the Plan 9 Programmer's Manual.
References

[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey, ``Plan 9 from Bell Labs'', UKUUG Proc. of the Summer 1990 Conf., London, England, 1990.

[Needham] R. Needham, ``Names'', in Distributed Systems, S. Mullender, ed., Addison Wesley, 1989.

[Presotto] D. Presotto, ``Multiprocessor Streams for Plan 9'', UKUUG Proc. of the Summer 1990 Conf., London, England, 1990.

[Met80] R. Metcalfe, D. Boggs, C. Crane, E. Taft and J. Hupp, ``The Ethernet Local Network: Three Reports'', XEROX Palo Alto Research Center, February 1980.

[Fra80] A. G. Fraser, ``Datakit - A Modular Network for Synchronous and Asynchronous Traffic'', Proc. Int'l Conf. on Communication, 1980.

[Pet89a] L. Peterson, ``RPC in the X-Kernel: Evaluating New Design Techniques'', Proc. Twelfth Symp. on Op. Sys. Princ., Litchfield Park, AZ, December 1989.

[Rit84a] D. M. Ritchie, ``A Stream Input-Output System'', AT&T Bell Laboratories Technical Journal, 63(8), October 1984.