.HTML "How to Use the Plan 9 C Compiler
.TL
How to Use the Plan 9 C Compiler
.AU
Rob Pike
rob@plan9.bell-labs.com
.SH
Introduction
.PP
The C compiler on Plan 9 is a wholly new program; in fact
it was the first piece of software written for what would
eventually become Plan 9 from Bell Labs.
Programmers familiar with existing C compilers will find
a number of differences in both the language the Plan 9 compiler
accepts and in how the compiler is used.
.PP
The compiler is really a set of compilers, one for each
architecture \(em MIPS, SPARC, Motorola 68020, Intel 386, etc. \(em
that accept a dialect of ANSI C and efficiently produce
fairly good code for the target machine.
There is a packaging of the compiler that accepts strict ANSI C for
a POSIX environment, but this document focuses on the
native Plan 9 environment, that in which all the system source and
almost all the utilities are written.
.SH
Source
.PP
The language accepted by the compilers is the core ANSI C language
with some modest extensions,
a greatly simplified preprocessor,
a smaller library that includes system calls and related facilities,
and a completely different structure for include files.
.PP
Official ANSI C accepts the old (K&R) style of declarations for
functions; the Plan 9 compilers
are more demanding.
Without an explicit compiler flag
.CW -B ) (
whose use is discouraged, the compilers insist
on new-style function declarations, that is, prototypes for
function arguments.
The function declarations in the libraries' include files are
all in the new style so the interfaces are checked at compile time.
For C programmers who have not yet switched to function prototypes
the clumsy syntax may seem repellent but the payoff in stronger typing
is substantial.
Those who wish to import existing software to Plan 9 are urged
to use the opportunity to update their code.
.PP
The compilers include an integrated preprocessor that accepts the familiar
.CW #include ,
.CW #define
for macros both with and without arguments,
.CW #undef ,
.CW #line ,
.CW #ifdef ,
.CW #ifndef ,
and
.CW #endif .
It
supports neither
.CW #if
nor
.CW ## ,
although it does
honor a few
.CW #pragmas .
The
.CW #if
directive was omitted because it greatly complicates the
preprocessor, is never necessary, and is usually abused.
Conditional compilation in general makes code hard to understand;
the Plan 9 source uses it sparingly.
Also, because the compilers remove dead code, regular
.CW if
statements with constant conditions are more readable equivalents to many
.CW #ifs .
To compile imported code ineluctably fouled by
.CW #if
there is a separate command,
.CW /bin/cpp ,
that implements the complete ANSI C preprocessor specification.
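.PP
As a minimal sketch of the point about constant conditions, a debugging
branch can be written as an ordinary
.CW if
on a constant instead of being wrapped in a preprocessor conditional.
The names
.CW Debug
and
.CW checksum
are hypothetical, chosen only for illustration, and the sketch is written in
portable ANSI C with Standard I/O so that it can be tried outside Plan 9.
It assumes the dead-code removal described above discards the branch when
the constant is zero; the branch is still parsed and type-checked either way.

```c
#include <stdio.h>

/* Hypothetical compile-time switch; an enum constant needs no preprocessor. */
enum { Debug = 0 };

/* Sum the bytes of buf, with tracing controlled by a constant condition. */
unsigned long
checksum(const unsigned char *buf, int n)
{
	unsigned long sum;
	int i;

	sum = 0;
	for(i = 0; i < n; i++){
		sum += buf[i];
		/* With Debug == 0 this branch is dead code, yet it is still type-checked. */
		if(Debug)
			fprintf(stderr, "checksum: i=%d sum=%lu\n", i, sum);
	}
	return sum;
}
```

Changing
.CW Debug
to 1 turns the tracing on without touching the body of the function.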
.PP
Include files fall into two groups: machine-dependent and machine-independent.
The machine-independent files occupy the directory
.CW /sys/include ;
the others are placed in a directory appropriate to the machine, such as
.CW /mips/include .
The compiler searches for include files
first in the machine-dependent directory and then
in the machine-independent directory.
At the time of writing there are thirty-one machine-independent include
files and two (per machine) machine-dependent ones:
.CW <ureg.h>
and
.CW <u.h> .
The first describes the layout of registers on the system stack,
for use by the debugger.
The second defines some
architecture-dependent types such as
.CW jmp_buf
for
.CW setjmp
and the
.CW va_list
type and
.CW va_arg
macro for handling arguments to variadic functions,
as well as a set of
.CW typedef
abbreviations for
.CW unsigned
.CW short
and so on.
.PP
Here is an excerpt from
.CW /68020/include/u.h :
.P1
#define nil             ((void*)0)
typedef unsigned short  ushort;
typedef unsigned char   uchar;
typedef unsigned long   ulong;
typedef unsigned int    uint;
typedef   signed char   schar;
typedef long long       vlong;

typedef long    jmp_buf[2];
#define JMPBUFSP        0
#define JMPBUFPC        1
#define JMPBUFDPC       0
.P2
Plan 9 programs use
.CW nil
for the name of the zero-valued pointer.
The type
.CW vlong
is the largest integer type available; on most architectures it
is a 64-bit value.
A couple of other types in
.CW <u.h>
are
.CW u32int ,
which is guaranteed to have exactly 32 bits (a possibility on all the supported architectures) and
.CW mpdigit ,
which is used by the multiprecision math package
.CW <mp.h> .
The
.CW #define
constants permit an architecture-independent (but compiler-dependent)
implementation of stack-switching using
.CW setjmp
and
.CW longjmp .
.PP
Every Plan 9 C program begins
.P1
#include <u.h>
.P2
because all the other installed header files use the
.CW typedefs
declared in
.CW <u.h> .
.PP
In strict ANSI C, include files are grouped to collect related functions
in a single file: one for string functions, one for memory functions,
one for I/O, and none for system calls.
Each include file is protected by an
.CW #ifdef
to guarantee its contents are seen by the compiler only once.
Plan 9 takes a different approach.  Other than a few include
files that define external formats such as archives, the files in
.CW /sys/include
correspond to
.I libraries.
If a program is using a library, it includes the corresponding header.
The default C library comprises string functions, memory functions, and
so on, largely as in ANSI C, some formatted I/O routines,
plus all the system calls and related functions.
To use these functions, one must
.CW #include
the file
.CW <libc.h> ,
which in turn must follow
.CW <u.h> ,
to define their prototypes for the compiler.
Here is the complete source to the traditional first C program:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
        print("hello world\en");
        exits(0);
}
.P2
The
.CW print
routine and its relatives
.CW fprint
and
.CW sprint
resemble the similarly-named functions in Standard I/O but are not
attached to a specific I/O library.
In Plan 9
.CW main
is not integer-valued; it should call
.CW exits ,
which takes a string argument (or null; here ANSI C promotes the 0 to a
.CW char* ).
All these functions are, of course, documented in the Programmer's Manual.
.PP
To use
.CW printf ,
.CW <stdio.h>
must be included to define the function prototype for
.CW printf :
.P1
#include <u.h>
#include <libc.h>
#include <stdio.h>

void
main(int argc, char *argv[])
{
        printf("%s: hello world; argc = %d\en", argv[0], argc);
        exits(0);
}
.P2
In practice, Standard I/O is not used much in Plan 9.  I/O libraries are
discussed in a later section of this document.
.PP
There are libraries for handling regular expressions, raster graphics,
windows, and so on, and each has an associated include file.
The manual for each library states which include files are needed.
The files are not protected against multiple inclusion and themselves
contain no nested
.CW #includes .
Instead the
programmer is expected to sort out the requirements
and to
.CW #include
the necessary files once at the top of each source file.  In practice this is
trivial: this way of handling include files is so straightforward
that it is rare for a source file to contain more than half a dozen
.CW #includes .
.PP
The compilers do their own register allocation so the
.CW register
keyword is ignored.
For different reasons,
.CW volatile
and
.CW const
are also ignored.
.PP
To make it easier to share code with other systems, Plan 9 has a version
of the compiler,
.CW pcc ,
that provides the standard ANSI C preprocessor, headers, and libraries
with POSIX extensions.
.CW Pcc
is recommended only
when broad external portability is mandated.  It compiles slower,
produces slower code (it takes extra work to simulate POSIX on Plan 9),
eliminates those parts of the Plan 9 interface
not related to POSIX, and illustrates the clumsiness of an environment
designed by committee.
.CW Pcc
is described in more detail in
.I
APE\(emThe ANSI/POSIX Environment,
.R
by Howard Trickey.
.SH
Process
.PP
Each CPU architecture supported by Plan 9 is identified by a single,
arbitrary, alphanumeric character:
.CW k
for SPARC,
.CW q
for Motorola Power PC 630 and 640,
.CW v
for MIPS,
.CW 0
for little-endian MIPS,
.CW 1
for Motorola 68000,
.CW 2
for Motorola 68020 and 68040,
.CW 5
for Acorn ARM 7500,
.CW 6
for AMD 64,
.CW 7
for DEC Alpha,
.CW 8
for Intel 386, and
.CW 9
for AMD 29000.
The character labels the support tools and files for that architecture.
For instance, for the 68020 the compiler is
.CW 2c ,
the assembler is
.CW 2a ,
the link editor/loader is
.CW 2l ,
the object files are suffixed
.CW \&.2 ,
and the default name for an executable file is
.CW 2.out .
Before we can use the compiler we therefore need to know which
machine we are compiling for.
The next section explains how this decision is made; for the moment
assume we are building 68020 binaries and make the mental substitution for
.CW 2
appropriate to the machine you are actually using.
.PP
Converting source to an executable binary is a two-step process.
First run the compiler,
.CW 2c ,
on the source, say
.CW file.c ,
to generate an object file
.CW file.2 .
Then run the loader,
.CW 2l ,
to generate an executable
.CW 2.out
that may be run (on a 680X0 machine):
.P1
2c file.c
2l file.2
2.out
.P2
The loader automatically links with whatever libraries the program
needs, usually including the standard C library as defined by
.CW <libc.h> .
Of course the compiler and loader have lots of options, both familiar and new;
see the manual for details.
The compiler does not generate an executable automatically;
the output of the compiler must be given to the loader.
Since most compilation is done under the control of
.CW mk
(see below), this is rarely an inconvenience.
.PP
The distribution of work between the compiler and loader is unusual.
The compiler integrates preprocessing, parsing, register allocation,
code generation and some assembly.
Combining these tasks in a single program is part of the reason for
the compiler's efficiency.
The loader does instruction selection, branch folding,
instruction scheduling,
and writes the final executable.
There is no separate C preprocessor and no assembler in the usual pipeline.
Instead the intermediate object file
(here a
.CW \&.2
file) is a type of binary assembly language.
The instructions in the intermediate format are not exactly those in
the machine.  For example, on the 68020 the object file may specify
a MOVE instruction but the loader will decide just which variant of
the MOVE instruction \(em MOVE immediate, MOVE quick, MOVE address,
etc. \(em is most efficient.
.PP
The assembler,
.CW 2a ,
is just a translator between the textual and binary
representations of the object file format.
It is not an assembler in the traditional sense.  It has limited
macro capabilities (the same as the integral C preprocessor in the compiler),
clumsy syntax, and minimal error checking.  For instance, the assembler
will accept an instruction (such as memory-to-memory MOVE on the MIPS) that the
machine does not actually support; only when the output of the assembler
is passed to the loader will the error be discovered.
The assembler is intended only for writing things that need access to instructions
invisible from C,
such as the machine-dependent
part of an operating system;
very little code in Plan 9 is in assembly language.
.PP
The compilers take an option
.CW -S
that causes them to print on their standard output the generated code
in a format acceptable as input to the assemblers.
This is of course merely a formatting of the
data in the object file; therefore the assembler is just
an
ASCII-to-binary converter for this format.
Other than the specific instructions, the input to the assemblers
is largely architecture-independent; see
``A Manual for the Plan 9 Assembler'',
by Rob Pike,
for more information.
.PP
The loader is an integral part of the compilation process.
Each library header file contains a
.CW #pragma
that tells the loader the name of the associated archive; it is
not necessary to tell the loader which libraries a program uses.
The C run-time startup is found, by default, in the C library.
The loader starts with an undefined
symbol,
.CW _main ,
that is resolved by pulling in the run-time startup code from the library.
(The loader undefines
.CW _mainp
when profiling is enabled, to force loading of the profiling start-up
instead.)
.PP
Unlike its counterpart on other systems, the Plan 9 loader rearranges
data to optimize access.  This means the order of variables in the
loaded program is unrelated to their order in the source.
Most programs don't care, but some assume that, for example, the
variables declared by
.P1
int a;
int b;
.P2
will appear at adjacent addresses in memory.  On Plan 9, they won't.
.SH
Heterogeneity
.PP
When the system starts or a user logs in, the environment is configured
so the appropriate binaries are available in
.CW /bin .
The configuration process is controlled by an environment variable,
.CW $cputype ,
with a value such as
.CW mips ,
.CW 68020 ,
.CW 386 ,
or
.CW sparc .
For each architecture there is a directory in the root,
with the appropriate name,
that holds the binary and library files for that architecture.
Thus
.CW /mips/lib
contains the object code libraries for MIPS programs,
.CW /mips/include
holds MIPS-specific include files, and
.CW /mips/bin
has the MIPS binaries.
These binaries are attached to
.CW /bin
at boot time by binding
.CW /$cputype/bin
to
.CW /bin ,
so
.CW /bin
always contains the correct files.
.PP
The MIPS compiler,
.CW vc ,
by definition
produces object files for the MIPS architecture,
regardless of the architecture of the machine on which the compiler is running.
There is a version of
.CW vc
compiled for each architecture:
.CW /mips/bin/vc ,
.CW /68020/bin/vc ,
.CW /sparc/bin/vc ,
and so on,
each capable of producing MIPS object files regardless of the native
instruction set.
If one is running on a SPARC,
.CW /sparc/bin/vc
will compile programs for the MIPS;
if one is running on machine
.CW $cputype ,
.CW /$cputype/bin/vc
will compile programs for the MIPS.
.PP
Because of the bindings that assemble
.CW /bin ,
the shell always looks for a command, say
.CW date ,
in
.CW /bin
and automatically finds the file
.CW /$cputype/bin/date .
Therefore the MIPS compiler is known as just
.CW vc ;
the shell will invoke
.CW /bin/vc
and that is guaranteed to be the version of the MIPS compiler
appropriate for the machine running the command.
Regardless of the architecture of the compiling machine,
.CW /bin/vc
is
.I always
the MIPS compiler.
.PP
Also, the output of
.CW vc
and
.CW vl
is completely independent of the machine type on which they are executed:
.CW \&.v
files compiled (with
.CW vc )
on a SPARC may be linked (with
.CW vl )
on a 386.
(The resulting
.CW v.out
will run, of course, only on a MIPS.)
Similarly, the MIPS libraries in
.CW /mips/lib
are suitable for loading with
.CW vl
on any machine; there is only one set of MIPS libraries, not one
set for each architecture that supports the MIPS compiler.
.SH
Heterogeneity and \f(CWmk\fP
.PP
Most software on Plan 9 is compiled under the control of
.CW mk ,
a descendant of
.CW make
that is documented in the Programmer's Manual.
A convention used throughout the
.CW mkfiles
makes it easy to compile the source into binary suitable for any architecture.
.PP
The variable
.CW $cputype
is advisory: it reports the architecture of the current environment, and should
not be modified.  A second variable,
.CW $objtype ,
is used to set which architecture is being
.I compiled
for.
The value of
.CW $objtype
can be used by a
.CW mkfile
to configure the compilation environment.
.PP
In each machine's root directory there is a short
.CW mkfile
that defines a set of macros for the compiler, loader, etc.
Here is
.CW /mips/mkfile :
.P1
</sys/src/mkfile.proto

CC=vc
LD=vl
O=v
AS=va
.P2
The line
.P1
</sys/src/mkfile.proto
.P2
causes
.CW mk
to include the file
.CW /sys/src/mkfile.proto ,
which contains general definitions:
.P1
#
# common mkfile parameters shared by all architectures
#

OS=v486xq7
CPUS=mips 386 power alpha
CFLAGS=-FVw
LEX=lex
YACC=yacc
MK=/bin/mk
.P2
.CW CC
is obviously the compiler,
.CW AS
the assembler, and
.CW LD
the loader.
.CW O
is the suffix for the object files and
.CW CPUS
and
.CW OS
are used in special rules described below.
.PP
Here is a
.CW mkfile
to build the installed source for
.CW sam :
.P1
</$objtype/mkfile
OBJ=sam.$O address.$O buffer.$O cmd.$O disc.$O error.$O \e
        file.$O io.$O list.$O mesg.$O moveto.$O multi.$O \e
        plan9.$O rasp.$O regexp.$O string.$O sys.$O xec.$O

$O.out: $OBJ
        $LD $OBJ

install:        $O.out
        cp $O.out /$objtype/bin/sam

installall:
        for(objtype in $CPUS) mk install

%.$O:   %.c
        $CC $CFLAGS $stem.c

$OBJ:   sam.h errors.h mesg.h
address.$O cmd.$O parse.$O xec.$O unix.$O:      parse.h

clean:V:
        rm -f [$OS].out *.[$OS] y.tab.?
.P2
(The actual
.CW mkfile
imports most of its rules from other secondary files, but
this example works and is not misleading.)
The first line causes
.CW mk
to include the contents of
.CW /$objtype/mkfile
in the current
.CW mkfile .
If
.CW $objtype
is
.CW mips ,
this inserts the MIPS macro definitions into the
.CW mkfile .
In this case the rule for
.CW $O.out
uses the MIPS tools to build
.CW v.out .
The
.CW %.$O
rule in the file uses
.CW mk 's
pattern matching facilities to convert the source files to the object
files through the compiler.
(The text of the rules is passed directly to the shell,
.CW rc ,
without further translation.
See the
.CW mk
manual if any of this is unfamiliar.)
Because the default rule builds
.CW $O.out
rather than
.CW sam ,
it is possible to maintain binaries for multiple machines in the
same source directory without conflict.
This is also, of course, why the output files from the various
compilers and loaders
have distinct names.
.PP
The rest of the
.CW mkfile
should be easy to follow; notice how the rules for
.CW clean
and
.CW installall
(that is, install versions for all architectures) use other macros
defined in
.CW /$objtype/mkfile .
In Plan 9,
.CW mkfiles
for commands conventionally contain rules to
.CW install
(compile and install the version for
.CW $objtype ),
.CW installall
(compile and install for all
.CW $objtypes ),
and
.CW clean
(remove all object files, binaries, etc.).
.PP
The
.CW mkfile
is easy to use.  To build a MIPS binary,
.CW v.out :
.P1
% objtype=mips
% mk
.P2
To build and install a MIPS binary:
.P1
% objtype=mips
% mk install
.P2
To build and install all versions:
.P1
% mk installall
.P2
These conventions make cross-compilation as easy to manage
as traditional native compilation.
Plan 9 programs compile and run without change on machines from
large multiprocessors to laptops.  For more information about this process, see
``Plan 9 Mkfiles'',
by Bob Flandrena.
.SH
Portability
.PP
Within Plan 9, it is painless to write portable programs, programs whose
source is independent of the machine on which they execute.
The operating system is fixed and the compiler, headers and libraries
are constant so most of the stumbling blocks to portability are removed.
Attention to a few details can avoid those that remain.
.PP
Plan 9 is a heterogeneous environment, so programs must
.I expect
that external files will be written by programs on machines of different
architectures.
The compilers, for instance, must handle without confusion
object files written by other machines.
The traditional approach to this problem is to pepper the source with
.CW #ifdefs
to turn byte-swapping on and off.
Plan 9 takes a different approach: of the handful of machine-dependent
.CW #ifdefs
in all the source, almost all are deep in the libraries.
Instead programs read and write files in a defined format,
either (for low volume applications) as formatted text, or
(for high volume applications) as binary in a known byte order.
If the external data were written with the most significant
byte first, the following code reads a 4-byte integer correctly
regardless of the architecture of the executing machine (assuming
an unsigned long holds 4 bytes):
.P1
ulong
getlong(void)
{
        ulong l;

        l = (getchar()&0xFF)<<24;
        l |= (getchar()&0xFF)<<16;
        l |= (getchar()&0xFF)<<8;
        l |= (getchar()&0xFF)<<0;
        return l;
}
.P2
Note that this code does not `swap' the bytes; instead it just reads
them in the correct order.
Variations of this code will handle any binary format
and also avoid problems
involving how structures are padded, how words are aligned,
and other impediments to portability.
Be aware, though, that extra care is needed to handle floating point data.
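.PP
One such variation might look like the following sketch, which reads a
least-significant-byte-first integer and writes a most-significant-byte-first
one.
The names
.CW getlelong ,
.CW getbelong ,
and
.CW putbelong
are hypothetical, and the routines work on a byte buffer rather than on
.CW getchar
only so that the example is self-contained.
It is written in portable ANSI C, so the
.CW typedefs
that
.CW <u.h>
would normally supply are declared locally.

```c
/* Local stand-ins for the typedefs that <u.h> provides on Plan 9. */
typedef unsigned char uchar;
typedef unsigned long ulong;

/* Read a 4-byte integer stored least significant byte first. */
ulong
getlelong(const uchar *p)
{
	ulong l;

	l = (ulong)p[0];
	l |= (ulong)p[1]<<8;
	l |= (ulong)p[2]<<16;
	l |= (ulong)p[3]<<24;
	return l;
}

/* Read a 4-byte integer stored most significant byte first. */
ulong
getbelong(const uchar *p)
{
	ulong l;

	l = (ulong)p[0]<<24;
	l |= (ulong)p[1]<<16;
	l |= (ulong)p[2]<<8;
	l |= (ulong)p[3];
	return l;
}

/* Store the low 32 bits of l most significant byte first. */
void
putbelong(uchar *p, ulong l)
{
	p[0] = (l>>24) & 0xFF;
	p[1] = (l>>16) & 0xFF;
	p[2] = (l>>8) & 0xFF;
	p[3] = l & 0xFF;
}
```

Each byte is widened to
.CW ulong
before it is shifted, so the result does not depend on the width of
.CW int .
Neither the reader nor the writer ever asks what the byte order of the
executing machine is.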
.PP
Efficiency hounds will argue that this method is unnecessarily slow and clumsy
when the executing machine has the same byte order (and padding and alignment)
as the data.
The CPU cost of I/O processing
is rarely the bottleneck for an application, however,
and the gain in simplicity of porting and maintaining the code greatly outweighs
the minor speed loss from handling data in this general way.
This method is how the Plan 9 compilers, the window system, and even the file
servers transmit data between programs.
.PP
To port programs beyond Plan 9, where the system interface is more variable,
it is probably necessary to use
.CW pcc
and hope that the target machine supports ANSI C and POSIX.
.SH
I/O
.PP
The default C library, defined by the include file
.CW <libc.h> ,
contains no buffered I/O package.
It does have several entry points for printing formatted text:
.CW print
outputs text to the standard output,
.CW fprint
outputs text to a specified integer file descriptor, and
.CW sprint
places text in a character array.
To access library routines for buffered I/O, a program must
explicitly include the header file associated with an appropriate library.
.PP
The recommended I/O library, used by most Plan 9 utilities, is
.CW bio
(buffered I/O), defined by
.CW <bio.h> .
There also exists an implementation of ANSI Standard I/O,
.CW stdio .
.PP
.CW Bio
is small and efficient, particularly for buffer-at-a-time or
line-at-a-time I/O.
Even for character-at-a-time I/O, however, it is significantly faster than
the Standard I/O library,
.CW stdio .
Its interface is compact and regular, although it lacks a few conveniences.
The most noticeable is that one must explicitly define buffers for standard
input and output;
.CW bio
does not predefine them.  Here is a program to copy input to output a byte
at a time using
.CW bio :
.P1
#include <u.h>
#include <libc.h>
#include <bio.h>

Biobuf  bin;
Biobuf  bout;

void
main(void)
{
        int c;

        Binit(&bin, 0, OREAD);          /* buffer file descriptor 0, standard input */
        Binit(&bout, 1, OWRITE);        /* buffer file descriptor 1, standard output */

        while((c=Bgetc(&bin)) != Beof)
                Bputc(&bout, c);
        exits(0);
}
.P2
For peak performance, we could replace
.CW Bgetc
and
.CW Bputc
by their equivalent in-line macros
.CW BGETC
and
.CW BPUTC
but
the performance gain would be modest.
For more information on
.CW bio ,
see the Programmer's Manual.
.PP
Perhaps the most dramatic difference in the I/O interface of Plan 9 from other
systems' is that text is not ASCII.
The format for
text in Plan 9 is a byte-stream encoding of 16-bit characters.
The character set is based on the Unicode Standard and is backward compatible with
ASCII:
characters with value 0 through 127 are the same in both sets.
The 16-bit characters, called
.I runes
in Plan 9, are encoded using a representation called
UTF,
an encoding that is becoming accepted as a standard.
(ISO calls it UTF-8;
throughout Plan 9 it's just called
UTF.)
UTF
defines multibyte sequences to
represent character values from 0 to 65535.
In
UTF,
character values up to 127 decimal, 7F hexadecimal, represent themselves,
so straight
ASCII
files are also valid
UTF.
Also,
UTF
guarantees that bytes with values 0 to 127 (NUL to DEL, inclusive)
will appear only when they represent themselves, so programs that read bytes
looking for plain ASCII characters will continue to work.
Any program that expects a one-to-one correspondence between bytes and
characters will, however, need to be modified.
An example is parsing file names.
File names, like all text, are in
UTF,
so it is incorrect to search for a character in a string by
.CW strchr(filename,
.CW c)
because the character might have a multi-byte encoding.
The correct method is to call
.CW utfrune(filename,
.CW c) ,
defined in
.I rune (2),
which interprets the file name as a sequence of encoded characters
rather than bytes.
In fact, even when you know the character is a single byte
that can represent only itself,
it is safer to use
.CW utfrune
because that assumes nothing about the character set
and its representation.
.PP
The library defines several symbols relevant to the representation of characters.
Any byte with unsigned value less than
.CW Runesync
will not appear in any multi-byte encoding of a character.
.CW Utfrune
compares the character being searched against
.CW Runesync
to see if it is sufficient to call
.CW strchr
or if the byte stream must be interpreted.
Any character with value less than
.CW Runeself
is represented by a single byte with the same value.
Finally, when errors are encountered converting
to runes from a byte stream, the library returns the rune value
.CW Runeerror
and advances a single byte.  This permits programs to find runes
embedded in binary data.
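.PP
To make the difference between searching bytes and searching characters
concrete, here is a minimal sketch of a decoder and a character search for
the one-, two-, and three-byte sequences that cover 16-bit values.
It is not the library's code:
.CW decoderune ,
.CW findrune ,
and the error value
.CW Badrune
are hypothetical names, and the value chosen for
.CW Badrune
is an assumption rather than the library's
.CW Runeerror .
The sketch is in portable ANSI C so that it can be tried anywhere.

```c
#include <stddef.h>

enum {
	Badrune = 0xFFFD	/* hypothetical error value, not taken from the library */
};

/*
 * Decode one character of at most 16 bits from the NUL-terminated
 * byte string s. Store its value in *r and return the number of
 * bytes consumed. A malformed sequence yields Badrune and consumes
 * exactly one byte, so the caller resynchronizes on the next byte.
 */
int
decoderune(const unsigned char *s, unsigned *r)
{
	unsigned c0, c1, c2, v;

	c0 = s[0];
	if(c0 < 0x80){			/* one byte: represents itself */
		*r = c0;
		return 1;
	}
	c1 = s[1];			/* safe: s[0] was not the terminator */
	if((c1 & 0xC0) == 0x80){
		if((c0 & 0xE0) == 0xC0){	/* two-byte sequence */
			v = (c0 & 0x1F)<<6 | (c1 & 0x3F);
			if(v >= 0x80){
				*r = v;
				return 2;
			}
		}else if((c0 & 0xF0) == 0xE0){	/* three-byte sequence */
			c2 = s[2];		/* safe: s[1] was not the terminator */
			if((c2 & 0xC0) == 0x80){
				v = (c0 & 0x0F)<<12 | (c1 & 0x3F)<<6 | (c2 & 0x3F);
				if(v >= 0x800){
					*r = v;
					return 3;
				}
			}
		}
	}
	*r = Badrune;
	return 1;
}

/* Return a pointer to the first occurrence of the character want in s, or NULL. */
const char*
findrune(const char *s, unsigned want)
{
	const unsigned char *p;
	unsigned r;
	int n;

	for(p = (const unsigned char*)s; *p != '\0'; p += n){
		n = decoderune(p, &r);
		if(r == want)
			return (const char*)p;
	}
	return NULL;
}
```

Searching the encoded form of
.CW abcÿ
for the value 255 with
.CW findrune
locates the two-byte sequence that encodes
.CW ÿ ,
whereas
.CW strchr
finds nothing, because no byte in the string has that value.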
.PP
.CW Bio
includes routines
.CW Bgetrune
and
.CW Bputrune
to transform the external byte stream
UTF
format to and from
internal 16-bit runes.
Also, the
.CW %s
format to
.CW print
accepts
UTF;
.CW %c
prints a character after narrowing it to 8 bits.
The
.CW %S
format prints a null-terminated sequence of runes;
.CW %C
prints a character after narrowing it to 16 bits.
For more information, see the Programmer's Manual, in particular
.I utf (6)
and
.I rune (2),
and the paper,
``Hello world, or
Καλημέρα κόσμε, or\
\f(Jpこんにちは 世界\f1'',
by Rob Pike and
Ken Thompson;
there is not room for the full story here.
.PP
These issues affect the compiler in several ways.
First, the C source is in
UTF.
ANSI says C variables are formed from
ASCII
alphanumerics, but comments and literal strings may contain any characters
encoded in the native encoding, here
UTF.
The declaration
.P1
char *cp = "abcÿ";
.P2
initializes the variable
.CW cp
to point to an array of bytes holding the
UTF
representation of the characters
.CW abcÿ .
The type
.CW Rune
is defined in
.CW <u.h>
to be
.CW ushort ,
which is also the `wide character' type in the compiler.
Therefore the declaration
.P1
Rune *rp = L"abcÿ";
.P2
initializes the variable
.CW rp
to point to an array of unsigned short integers holding the 16-bit
values of the characters
.CW abcÿ .
Note that in both these declarations the characters in the source
that represent
.CW "abcÿ"
are the same; what changes is how those characters are represented
in memory in the program.
The following two lines:
.P1
print("%s\en", "abcÿ");
print("%S\en", L"abcÿ");
.P2
produce the same
UTF
string on their output, the first by copying the bytes, the second
by converting from runes to bytes.
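.PP
The conversion from runes to bytes can be sketched as follows.
This is an illustration in portable ANSI C, not the library's implementation;
.CW encoderune
and
.CW encoderunes
are hypothetical names, and the sketch assumes values of at most 16 bits,
which need at most three bytes each.

```c
/* Encode the 16-bit value r as UTF-8 in buf; return the byte count (1 to 3). */
int
encoderune(unsigned char *buf, unsigned r)
{
	r &= 0xFFFF;
	if(r < 0x80){			/* represents itself */
		buf[0] = r;
		return 1;
	}
	if(r < 0x800){			/* 11 bits: two bytes */
		buf[0] = 0xC0 | (r>>6);
		buf[1] = 0x80 | (r & 0x3F);
		return 2;
	}
	buf[0] = 0xE0 | (r>>12);	/* 16 bits: three bytes */
	buf[1] = 0x80 | ((r>>6) & 0x3F);
	buf[2] = 0x80 | (r & 0x3F);
	return 3;
}

/*
 * Convert the zero-terminated rune array src to a NUL-terminated
 * byte string in dst, which must have room for 3 bytes per rune
 * plus the terminator. Return the number of bytes written,
 * not counting the terminator.
 */
int
encoderunes(unsigned char *dst, const unsigned short *src)
{
	int n;

	n = 0;
	while(*src != 0)
		n += encoderune(dst+n, *src++);
	dst[n] = '\0';
	return n;
}
```

Applied to the four 16-bit values of
.CW abcÿ ,
the sketch produces five bytes: three that represent themselves and a
two-byte sequence for
.CW ÿ .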
 994 .PP
 995 In C, character constants are integers but narrowed through the
 996 .CW char
 997 type.
 998 The Unicode character
 999 .CW ÿ
1000 has value 255, so if the
1001 .CW char
1002 type is signed,
1003 the constant
1004 .CW 'ÿ'
1005 has value \-1 (which is equal to EOF).
1006 On the other hand,
1007 .CW L'ÿ'
1008 narrows through the wide character type,
1009 .CW ushort ,
1010 and therefore has value 255.
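.PP
The two narrowings can be imitated in portable C with explicit casts.
This is only a sketch: the function names are hypothetical, and the
result of converting 255 to a signed 8-bit type is
implementation-defined in standard C, although on the usual
two's-complement machines it is \-1.

```c
/* Narrow through a signed 8-bit char, as 'ÿ' is when char is signed. */
int
narrow_signed_char(int c)
{
        return (signed char)c;
}

/* Narrow through an unsigned 16-bit type, as L'ÿ' is. */
int
narrow_ushort(int c)
{
        return (unsigned short)c;
}
```

Given the value 255, the first function is expected to return \-1
on such machines and the second returns 255.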
1011 .PP
1012 Finally, although it's not ANSI C, the Plan 9 C compilers
1013 assume any character with value above
1014 .CW Runeself
1015 is an alphanumeric,
1016 so α is a legal, if non-portable, variable name.
1017 .SH
1018 Arguments
1019 .PP
1020 Some macros are defined
1021 in
1022 .CW <libc.h>
1023 for parsing the arguments to
1024 .CW main() .
1025 They are described in
1026 .I ARG (2)
1027 but are fairly self-explanatory.
1028 There are four macros:
1029 .CW ARGBEGIN
1030 and
1031 .CW ARGEND
1032 are used to bracket a hidden
1033 .CW switch
1034 statement within which
1035 .CW ARGC
1036 returns the current option character (rune) being processed and
1037 .CW ARGF
1038 returns the argument to the option, as in the loader option
1039 .CW -o
1040 .CW file .
1041 Here, for example, is the code at the beginning of
1042 .CW main()
1043 in
1044 .CW ramfs.c
1045 (see
1046 .I ramfs (1))
1047 that cracks its arguments:
1048 .P1
1049 void
1050 main(int argc, char *argv[])
1051 {
1052         char *defmnt;
1053         int p[2];
1054         int mfd[2];
1055         int stdio = 0;
1056
1057         defmnt = "/tmp";
1058         ARGBEGIN{
1059         case 'i':
1060                 defmnt = 0;
1061                 stdio = 1;
1062                 mfd[0] = 0;
1063                 mfd[1] = 1;
1064                 break;
1065         case 's':
1066                 defmnt = 0;
1067                 break;
1068         case 'm':
1069                 defmnt = ARGF();
1070                 break;
1071         default:
1072                 usage();
1073         }ARGEND
1074 .P2
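.PP
To make the hidden
.CW switch
less mysterious, here is a hand-written sketch in portable C of
roughly what such a loop does for single-byte option characters.
It is not the definition from
.CW <libc.h> :
the real macros decode runes, and the names
.CW Opts
and
.CW parseargs
are hypothetical, introduced only for this example.
The treatment of the
.CW m
option assumes that
.CW ARGF
yields the rest of the current argument if there is any,
and otherwise the next argument, or 0 if none remains.

```c
#include <stddef.h>

/* Hypothetical record of the parsed options; not part of ramfs.c. */
typedef struct Opts Opts;
struct Opts
{
        char    *defmnt;
        int     stdio;
        int     bad;            /* set when an unknown option is seen */
};

/* Returns the number of arguments left after the options. */
int
parseargs(int argc, char *argv[], Opts *o)
{
        char *args, *argt;
        int c;

        o->defmnt = "/tmp";
        o->stdio = 0;
        o->bad = 0;
        for(argv++, argc--;
            argv[0] && argv[0][0] == '-' && argv[0][1];
            argc--, argv++){
                args = &argv[0][1];
                if(args[0] == '-' && args[1] == 0){     /* "--" ends the options */
                        argc--;
                        argv++;
                        break;
                }
                while((c = *args) != 0){
                        args++;
                        switch(c){
                        case 'i':
                                o->defmnt = NULL;
                                o->stdio = 1;
                                break;
                        case 's':
                                o->defmnt = NULL;
                                break;
                        case 'm':
                                /* rest of this word, else the next word */
                                argt = args;
                                args = "";
                                if(*argt)
                                        o->defmnt = argt;
                                else if(argv[1]){
                                        argc--;
                                        o->defmnt = *++argv;
                                }else
                                        o->defmnt = NULL;
                                break;
                        default:
                                o->bad = 1;
                        }
                }
        }
        return argc;
}
```

With this sketch both
.CW "-m /n/x"
and
.CW -m/n/x
set the mount point, and several flags may share one argument, as in
.CW -is .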
1075 .SH
1076 Extensions
1077 .PP
1078 The compiler has several extensions to ANSI C, all of which are used
1079 extensively in the system source.
1080 First,
1081 .I structure
1082 .I displays
1083 permit
1084 .CW struct
1085 expressions to be formed dynamically.
1086 Given these declarations:
1087 .P1
1088 typedef struct Point Point;
1089 typedef struct Rectangle Rectangle;
1090
1091 struct Point
1092 {
1093         int x, y;
1094 };
1095
1096 struct Rectangle
1097 {
1098         Point min, max;
1099 };
1100
1101 Point   p, q, add(Point, Point);
1102 Rectangle r;
1103 int     x, y;
1104 .P2
1105 this assignment may appear anywhere an assignment is legal:
1106 .P1
1107 r = (Rectangle){add(p, q), (Point){x, y+3}};
1108 .P2
1109 The syntax is the same as for initializing a structure but with
1110 a leading cast.
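.PP
The same notation was later adopted by standard C (C99) under the
name compound literal, so a small sketch can be tried with any
C99 compiler.
The body of
.CW add
and the wrapper
.CW mkrect
are assumptions made for the example; only the declarations and
the assignment come from the text above.

```c
typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
        int x, y;
};

struct Rectangle
{
        Point min, max;
};

/* Assumed meaning of add: component-wise sum. */
Point
add(Point a, Point b)
{
        return (Point){a.x + b.x, a.y + b.y};
}

/* Hypothetical wrapper so the assignment can be exercised. */
Rectangle
mkrect(Point p, Point q, int x, int y)
{
        Rectangle r;

        r = (Rectangle){add(p, q), (Point){x, y+3}};
        return r;
}
```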
1111 .PP
1112 If an
1113 .I anonymous
1114 .I structure
1115 or
1116 .I union
1117 is declared within another structure or union, the members of the internal
1118 structure or union are addressable without prefix in the outer structure.
1119 This feature eliminates the clumsy naming of nested structures and,
1120 particularly, unions.
1121 For example, after these declarations,
1122 .P1
1123 struct Lock
1124 {
1125         int     locked;
1126 };
1127
1128 struct Node
1129 {
1130         int     type;
1131         union{
1132                 double  dval;
1133                 double  fval;
1134                 long    lval;
1135         };              /* anonymous union */
1136         struct Lock;    /* anonymous structure */
1137 } *node;
1138
1139 void    lock(struct Lock*);
1140 .P2
1141 one may refer to
1142 .CW node->type ,
1143 .CW node->dval ,
1144 .CW node->fval ,
1145 .CW node->lval ,
1146 and
1147 .CW node->locked .
1148 Moreover, the address of a
1149 .CW struct
1150 .CW Node
1151 may be used without a cast anywhere that the address of a
1152 .CW struct
1153 .CW Lock
1154 is used, such as in argument lists.
1155 The compiler automatically promotes the type and adjusts the address.
1156 Thus one may invoke
1157 .CW lock(node) .
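.PP
For comparison, here is a sketch of how the same structure might be
written for a standard C11 compiler, which accepts the anonymous union
but not the anonymous
.CW struct
.CW Lock .
The member name
.CW lk
and the function
.CW locknode
are hypothetical; the explicit
.CW &node->lk
spells out by hand the address adjustment that the Plan 9 compiler
performs automatically.

```c
struct Lock
{
        int     locked;
};

struct Node
{
        int     type;
        union{
                double  dval;
                long    lval;
        };                      /* anonymous union: standard since C11 */
        struct Lock lk;         /* named member; lk is a hypothetical name */
};

/* Assumed body, for the example only. */
void
lock(struct Lock *l)
{
        l->locked = 1;
}

void
locknode(struct Node *node)
{
        lock(&node->lk);        /* Plan 9 C would accept lock(node) */
}
```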
1158 .PP
1159 Anonymous structures and unions may be accessed by type name
1160 if (and only if) they are declared using a
1161 .CW typedef
1162 name.
1163 For example, using the above declaration for
1164 .CW Point ,
1165 one may declare
1166 .P1
1167 struct
1168 {
1169         int     type;
1170         Point;
1171 } p;
1172 .P2
1173 and refer to
1174 .CW p.Point .
1175 .PP
1176 In the initialization of arrays, a number in square brackets before an
1177 element sets the index for the initialization.  For example, to initialize
1178 some elements in
1179 a table of function pointers indexed by
1180 ASCII
1181 character,
1182 .P1
1183 void    percent(void), slash(void);
1184
1185 void    (*func[128])(void) =
1186 {
1187         ['%']   percent,
1188         ['/']   slash,
1189 };
1190 .P2
1191 .LP
1192 A similar syntax allows one to initialize structure elements:
1193 .P1
1194 Point p =
1195 {
1196         .y 100,
1197         .x 200
1198 };
1199 .P2
1200 These initialization syntaxes were later added to ANSI C, with the addition of an
1201 equals sign between the index or tag and the value.
1202 The Plan 9 compiler accepts either form.
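.PP
For reference, the same two initializations in the standard (C99)
spelling, with the equals signs, look like this.
The empty function bodies and the variable name
.CW pt
are placeholders for the example.

```c
typedef struct Point Point;
struct Point
{
        int x, y;
};

void
percent(void)
{
}

void
slash(void)
{
}

/* Elements not named in the list are initialized to zero. */
void    (*func[128])(void) =
{
        ['%'] = percent,
        ['/'] = slash,
};

Point pt =
{
        .y = 100,
        .x = 200
};
```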
1203 .PP
1204 Finally, the declaration
1205 .P1
1206 extern register reg;
1207 .P2
1208 .I this "" (
1209 appearance of the register keyword is not ignored)
1210 allocates a global register to hold the variable
1211 .CW reg .
1212 External registers must be used carefully: they need to be declared in
1213 .I all
1214 source files and libraries in the program to guarantee the register
1215 is not allocated temporarily for other purposes.
1216 Especially on machines with few registers, such as the i386,
1217 it is easy to link accidentally with code that has already usurped
1218 the global registers and there is no diagnostic when this happens.
1219 Used wisely, though, external registers are powerful.
1220 The Plan 9 operating system uses them to access per-process and
1221 per-machine data structures on a multiprocessor.  The storage class they provide
1222 is hard to create in other ways.
1223 .SH
1224 The compile-time environment
1225 .PP
1226 The code generated by the compilers is `optimized' by default:
1227 variables are placed in registers and peephole optimizations are
1228 performed.
1229 The compiler flag
1230 .CW -N
1231 disables these optimizations.
1232 Registerization is done locally rather than throughout a function:
1233 whether a variable occupies a register or
1234 the memory location identified in the symbol
1235 table depends on the activity of the variable and may change
1236 throughout the life of the variable.
1237 The
1238 .CW -N
1239 flag is rarely needed;
1240 its main use is to simplify debugging.
1241 There is no information in the symbol table to identify the
1242 registerization of a variable, so
1243 .CW -N
1244 guarantees the variable is always where the symbol table says it is.
1245 .PP
1246 Another flag,
1247 .CW -w ,
1248 turns
1249 .I on
1250 warnings about portability and problems detected in flow analysis.
1251 Most code in Plan 9 is compiled with warnings enabled;
1252 these warnings plus the type checking offered by function prototypes
1253 provide most of the support of the Unix tool
1254 .CW lint
1255 more accurately and with less chatter.
1256 Two of the warnings,
1257 `used and not set' and `set and not used', are almost always accurate but
1258 may be triggered spuriously by code with invisible control flow,
1259 such as in routines that call
1260 .CW longjmp .
1261 The compiler statements
1262 .P1
1263 SET(v1);
1264 USED(v2);
1265 .P2
1266 decorate the flow graph to silence the compiler.
1267 Either statement accepts a comma-separated list of variables.
1268 Use them carefully: they may silence real errors.
1269 For the common case of unused parameters to a function,
1270 leaving the name off the declaration silences the warnings.
1271 That is, listing the type of a parameter but giving it no
1272 associated variable name does the trick.
1273 .SH
1274 Debugging
1275 .PP
1276 There are two debuggers available on Plan 9.
1277 The first, and older, is
1278 .CW db ,
1279 a revision of Unix
1280 .CW adb .
1281 The other,
1282 .CW acid ,
1283 is a source-level debugger whose commands are statements in
1284 a true programming language.
1285 .CW Acid
1286 is the preferred debugger, but since it
1287 borrows some elements of
1288 .CW db ,
1289 notably the formats for displaying values, it is worth knowing a little bit about
1290 .CW db .
1291 .PP
1292 Both debuggers support multiple architectures in a single program; that is,
1293 the programs are
1294 .CW db
1295 and
1296 .CW acid ,
1297 not for example
1298 .CW vdb
1299 and
1300 .CW vacid .
1301 They also support cross-architecture debugging comfortably:
1302 one may debug a 68020 binary on a MIPS.
1303 .PP
1304 Imagine a program has crashed mysteriously:
1305 .P1
1306 % X11/X
1307 Fatal server bug!
1308 failed to create default stipple
1309 X 106: suicide: sys: trap: fault read addr=0x0 pc=0x00105fb8
1310 %
1311 .P2
1312 When a process dies on Plan 9 it hangs in the `broken' state
1313 for debugging.
1314 Attach a debugger to the process by naming its process id:
1315 .P1
1316 % acid 106
1317 /proc/106/text:mips plan 9 executable
1318
1319 /sys/lib/acid/port
1320 /sys/lib/acid/mips
1321 acid:
1322 .P2
1323 The
1324 .CW acid
1325 function
1326 .CW stk()
1327 reports the stack traceback:
1328 .P1
1329 acid: stk()
1330 At pc:0x105fb8:abort+0x24 /sys/src/ape/lib/ap/stdio/abort.c:6
1331 abort() /sys/src/ape/lib/ap/stdio/abort.c:4
1332         called from FatalError+#4e
1333                 /sys/src/X/mit/server/dix/misc.c:421
1334 FatalError(s9=#e02, s8=#4901d200, s7=#2, s6=#72701, s5=#1,
1335     s4=#7270d, s3=#6, s2=#12, s1=#ff37f1c, s0=#6, f=#7270f)
1336     /sys/src/X/mit/server/dix/misc.c:416
1337         called from gnotscreeninit+#4ce
1338                 /sys/src/X/mit/server/ddx/gnot/gnot.c:792
1339 gnotscreeninit(snum=#0, sc=#80db0)
1340     /sys/src/X/mit/server/ddx/gnot/gnot.c:766
1341         called from AddScreen+#16e
1342                 /n/bootes/sys/src/X/mit/server/dix/main.c:610
1343 AddScreen(pfnInit=0x0000129c,argc=0x00000001,argv=0x7fffffe4)
1344     /sys/src/X/mit/server/dix/main.c:530
1345         called from InitOutput+0x80
1346                 /sys/src/X/mit/server/ddx/brazil/brddx.c:522
1347 InitOutput(argc=0x00000001,argv=0x7fffffe4)
1348     /sys/src/X/mit/server/ddx/brazil/brddx.c:511
1349         called from main+0x294
1350                 /sys/src/X/mit/server/dix/main.c:225
1351 main(argc=0x00000001,argv=0x7fffffe4)
1352     /sys/src/X/mit/server/dix/main.c:136
1353         called from _main+0x24
1354                 /sys/src/ape/lib/ap/mips/main9.s:8
1355 .P2
1356 The function
1357 .CW lstk()
1358 is similar but
1359 also reports the values of local variables.
1360 Note that the traceback includes full file names; this is a boon to debugging,
1361 although it makes the output much noisier.
1362 .PP
1363 To use
1364 .CW acid
1365 well you will need to learn its input language; see the
1366 ``Acid Manual'',
1367 by Phil Winterbottom,
1368 for details.  For simple debugging, however, the information in the manual page is
1369 sufficient.  In particular, it describes the most useful functions
1370 for examining a process.
1371 .PP
1372 The compiler does not place
1373 information describing the types of variables in the executable,
1374 but a compile-time flag provides crude support for symbolic debugging.
1375 The
1376 .CW -a
1377 flag to the compiler suppresses code generation
1378 and instead emits source text in the
1379 .CW acid
1380 language to format and display data structure types defined in the program.
1381 The easiest way to use this feature is to put a rule in the
1382 .CW mkfile :
1383 .P1
1384 syms:   main.$O
1385         $CC -a main.c > syms
1386 .P2
1387 Then from within
1388 .CW acid ,
1389 .P1
1390 acid: include("sourcedirectory/syms")
1391 .P2
1392 to read in the relevant definitions.
1393 (For multi-file source, you need to be a little fancier;
1394 see
1395 .I 2c (1)).
1396 This text includes, for each defined compound
1397 type, a function with that name that may be called with the address of a structure
1398 of that type to display its contents.
1399 For example, if
1400 .CW rect
1401 is a global variable of type
1402 .CW Rectangle ,
1403 one may execute
1404 .P1
1405 Rectangle(*rect)
1406 .P2
1407 to display it.
1408 The
1409 .CW *
1410 (indirection) operator is necessary because
1411 of the way
1412 .CW acid
1413 works: each global symbol in the program is defined as a variable by
1414 .CW acid ,
1415 with value equal to the
1416 .I address
1417 of the symbol.
1418 .PP
1419 Another common technique is to write special
1420 .CW acid
1421 code by hand to define functions that aid debugging, initialize the debugger, and so on.
1422 Conventionally, this is placed in a file called
1423 .CW acid
1424 in the source directory; it has a line
1425 .P1
1426 include("sourcedirectory/syms");
1427 .P2
1428 to load the compiler-produced symbols.  One may edit the compiler output directly but
1429 it is wiser to keep the hand-generated
1430 .CW acid
1431 separate from the machine-generated.
1432 .PP
1433 To make things simple, the default rules in the system
1434 .CW mkfiles
1435 include entries to make
1436 .CW foo.acid
1437 from
1438 .CW foo.c ,
1439 so one may use
1440 .CW mk
1441 to automate the production of
1442 .CW acid
1443 definitions for a given C source file.
1444 .PP
1445 There is much more to say here.  See the
1446 .CW acid
1447 manual page, the reference manual, or the paper
1448 ``Acid: A Debugger Built From A Language'',
1449 also by Phil Winterbottom.