sys/doc/sam/sam.ms

   1 .HTML "The Text Editor sam
   2 .Vx 17 11 November 87 1 32 "ROB PIKE" "THE TEXT EDITOR SAM"
   3 .ds DY "31 May 1987
   4 .ds DR "Revised 1 July 1987
   5 .de CW          \" puts first arg in CW font, same as UL; maintains font
   6 \%\&\\$3\f(CW\\$1\fP\&\\$2
   7 ..
   8 .de Cs
   9 .br
  10 .fi
  11 .ft 2
  12 .ps -2
  13 .vs -2
  14 ..
  15 .de Ce
  16 .br
  17 .nf
  18 .ft 1
  19 .ps
  20 .vs
  21 .sp
  22 ..
  23 .de XP
  24 .ie h .html - <center><img src="\\$1.gif" /></center>
  25 .el .BP \\$1.ps \\$2
  26 ..
  27 .TL
  28 The Text Editor \&\f(CWsam\fP
  29 .AU
  30 Rob Pike
  31 rob@plan9.bell-labs.com
  32 .AB
  33 .LP
  34 .CW Sam
  35 is an interactive multi-file text editor intended for
  36 bitmap displays.
  37 A textual command language
  38 supplements the mouse-driven, cut-and-paste interface
  39 to make complex or
  40 repetitive editing tasks easy to specify.
  41 The language is characterized by the composition of regular expressions
  42 to describe the structure of the text being modified.
  43 The treatment of files as a database, with changes logged
  44 as atomic transactions, guides the implementation and
  45 makes a general `undo' mechanism straightforward.
  46 .PP
  47 .CW Sam
  48 is implemented as two processes connected by a low-bandwidth stream,
  49 one process handling the display and the other the editing
  50 algorithms.  Therefore it can run with the display process
  51 in a bitmap terminal and the editor on a local host,
  52 with both processes on a bitmap-equipped host, or with
  53 the display process in the terminal and the editor in a
  54 remote host.
  55 By suppressing the display process,
  56 it can even run without a bitmap terminal.
  57 .PP
  58 This paper is reprinted from Software\(emPractice and Experience,
  59 Vol 17, number 11, pp. 813-845, November 1987.
  60 The paper has not been updated for the Plan 9 manuals.  Although
  61 .CW Sam
  62 has not changed much since the paper was written, the system around it certainly has.
  63 Nonetheless, the description here still stands as the best introduction to the editor.
  64 .AE
  65 .SH
  66 Introduction
  67 .LP
  68 .CW Sam
  69 is an interactive text editor that combines cut-and-paste interactive editing with
  70 an unusual command language based on the composition of regular expressions.
  71 It is written as two programs: one, the `host part,' runs on a UNIX system
  72 and implements the command language and provides file access; the other, the
  73 `terminal part,' runs asynchronously
  74 on a machine with a mouse and bitmap display
  75 and supports the display and interactive editing.
  76 The host part may be even run in isolation on an ordinary terminal
  77 to edit text using the command
  78 language, much like a traditional line editor,
  79 without assistance from a mouse or display.
  80 Most often,
  81 the terminal part runs on a Blit\u\s-4\&1\s+4\d terminal
  82 (actually on a Teletype DMD 5620, the production version of the Blit), whose
  83 host connection is an ordinary 9600 bps RS232 link;
  84 on the SUN computer the host and display processes run on a single machine,
  85 connected by a pipe.
  86 .PP
  87 .CW Sam
  88 edits uninterpreted
  89 ASCII text.
  90 It has no facilities for multiple fonts, graphics or tables,
  91 unlike MacWrite,\u\s-4\&2\s+4\d Bravo,\u\s-4\&3\s+4\d Tioga\u\s-4\&4\s+4\d
  92 or Lara.\u\s-4\&5\s+4\d
  93 Also unlike them, it has a rich command language.
  94 (Throughout this paper, the phrase
  95 .I
  96 command language
  97 .R
  98 refers to
  99 textual commands; commands activated from the mouse form the
 100 .I mouse
 101 .I language. )
 102 .CW Sam
 103 developed as an editor for use by programmers, and tries to join
 104 the styles of the UNIX text editor
 105 .CW ed \u\s-4\&6,7\s+4\d
 106 with that of interactive cut-and-paste editors by
 107 providing a comfortable mouse-driven interface
 108 to a program with a solid command language driven by regular expressions.
 109 The command language developed more than the mouse language, and
 110 acquired a notation for describing the structure of files
 111 more richly than as a sequence of lines,
 112 using a dataflow-like syntax for specifying changes.
 113 .PP
 114 The interactive style was influenced by
 115 .CW jim ,\u\s-4\&1\s+4\d
 116 an early cut-and-paste editor for the Blit, and by
 117 .CW mux ,\u\s-4\&8\s+4\d
 118 the Blit window system.
 119 .CW Mux
 120 merges the original Blit window system,
 121 .CW mpx ,\u\s-4\&1\s+4\d
 122 with cut-and-paste editing, forming something like a
 123 multiplexed version of
 124 .CW jim
 125 that edits the output of (and input to) command sessions rather than files.
 126 .PP
 127 The first part of this paper describes the command language, then the mouse
 128 language, and explains how they interact.
 129 That is followed by a description of the implementation,
 130 first of the host part, then of the terminal part.
 131 A principle that influenced the design of
 132 .CW sam
 133 is that it should have no explicit limits, such as upper limits on
 134 file size or line length.
 135 A secondary consideration is that it be efficient.
 136 To honor these two goals together requires a method for efficiently
 137 manipulating
 138 huge strings (files) without breaking them into lines,
 139 perhaps while making thousands of changes
 140 under control of the command language.
 141 .CW Sam 's
 142 method is to
 143 treat the file as a transaction database, implementing changes as atomic
 144 updates.  These updates may be unwound easily to `undo' changes.
 145 Efficiency is achieved through a collection of caches that minimizes
 146 disc traffic and data motion, both within the two parts of the program
 147 and between them.
 148 .PP
 149 The terminal part of
 150 .CW sam
 151 is fairly straightforward.
 152 More interesting is how the two halves of the editor stay
 153 synchronized when either half may initiate a change.
 154 This is achieved through a data structure that organizes the
 155 communications and is maintained in parallel by both halves.
 156 .PP
 157 The last part of the paper chronicles the writing of
 158 .CW sam
 159 and discusses the lessons that were learned through its development and use.
 160 .PP
 161 The paper is long, but is composed largely of two papers of reasonable length:
 162 a description of the user interface of
 163 .CW sam
 164 and a discussion of its implementation.
 165 They are combined because the implementation is strongly influenced by
 166 the user interface, and vice versa.
 167 .SH
 168 The Interface
 169 .LP
 170 .CW Sam
 171 is a text editor for multiple files.
 172 File names may be provided when it is invoked:
 173 .P1
 174 sam file1 file2 ...
 175 .P2
 176 and there are commands
 177 to add new files and discard unneeded ones.
 178 Files are not read until necessary
 179 to complete some command.
 180 Editing operations apply to an internal copy
 181 made when the file is read; the UNIX file associated with the copy
 182 is changed only by an explicit command.
 183 To simplify the discussion, the internal copy is here called a
 184 .I file ,
 185 while the disc-resident original is called a
 186 .I
 187 disc file.
 188 .R
 189 .PP
 190 .CW Sam
 191 is usually connected to a bitmap display that presents a cut-and-paste
 192 editor driven by the mouse.
 193 In this mode, the command language is still available:
 194 text typed in a special window, called the
 195 .CW sam
 196 .I window,
 197 is interpreted
 198 as commands to be executed in the current file.
 199 Cut-and-paste editing may be used in any window \(em even in the
 200 .CW sam
 201 window to construct commands.
 202 The other mode of operation, invoked by starting
 203 .CW sam
 204 with the option
 205 .CW -d
 206 (for `no download'),
 207 does not use the mouse or bitmap display, but still permits
 208 editing using the textual command language, even on an ordinary terminal,
 209 interactively or from a script.
 210 .PP
 211 The following sections describe first the command language (under
 212 .CW sam\ -d
 213 and in the
 214 .CW sam
 215 window), and then the mouse interface.
 216 These two languages are nearly independent, but connect through the
 217 .I current
 218 .I text,
 219 described below.
 220 .SH 2
 221 The Command Language
 222 .LP
 223 A file consists of its contents, which are an array of characters
 224 (that is, a string); the
 225 .I name
 226 of the associated disc file; the
 227 .I
 228 modified bit
 229 .R
 230 that states whether the contents match those of
 231 the disc file;
 232 and a substring of the contents, called the
 233 .I
 234 current text
 235 .R
 236 or
 237 .I dot
 238 (see Figures 1 and 2).
 239 If the current text is a null string, dot falls between characters.
 240 The
 241 .I value
 242 of dot is the location of the current text; the
 243 .I contents
 244 of dot are the characters it contains.
 245 .CW Sam
 246 imparts to the text no two-dimensional interpretation such as columns
 247 or fields; text is always one-dimensional.
 248 Even the idea of a `line' of text as understood by most UNIX programs
 249 \(em a sequence of characters terminated by a newline character \(em
 250 is only weakly supported.
 251 .PP
 252 The
 253 .I
 254 current file
 255 .R
 256 is the file to which editing commands refer.
 257 The current text is therefore dot in the current file.
 258 If a command doesn't explicitly name a particular file or piece of text,
 259 the command is assumed to apply to the current text.
 260 For the moment, ignore the presence of multiple files and consider
 261 editing a single file.
 262 .KF L
 263 .XP fig1 3.5i
 264 .Cs
 265 Figure 1. A typical
 266 .CW sam
 267 screen, with the editing menu presented.
 268 The
 269 .CW sam
 270 (command language) window is in the middle, with file windows above and below.
 271 (The user interface makes it easy to create these abutting windows.)
 272 The partially obscured window is a third file window.
 273 The uppermost window is that to which typing and mouse operations apply,
 274 as indicated by its heavy border.
 275 Each window has its current text highlighted in reverse video.
 276 The
 277 .CW sam
 278 window's current text is the null string on the last visible line,
 279 indicated by a vertical bar.
 280 See also Figure 2.
 281 .Ce
 282 .KE
 283 .PP
 284 Commands have one-letter names.
 285 Except for non-editing commands such as writing
 286 the file to disc, most commands make some change
 287 to the text in dot and leave dot set to the text resulting from the change.
 288 For example, the delete command,
 289 .CW d ,
 290 deletes the text in dot, replacing it by the null string and setting dot
 291 to the result.
 292 The change command,
 293 .CW c ,
 294 replaces dot by text delimited by an arbitrary punctuation character,
 295 conventionally
 296 a slash.  Thus,
 297 .P1
 298 c/Peter/
 299 .P2
 300 replaces the text in dot by the string
 301 .CW Peter .
 302 Similarly,
 303 .P1
 304 a/Peter/
 305 .P2
 306 (append) adds the string after dot, and
 307 .P1
 308 i/Peter/
 309 .P2
 310 (insert) inserts before dot.
 311 All three leave dot set to the new text,
 312 .CW Peter .
 313 .PP
 314 Newlines are part of the syntax of commands:
 315 the newline character lexically terminates a command.
 316 Within the inserted text, however, newlines are never implicit.
 317 But since it is often convenient to insert multiple lines of text,
 318 .CW sam
 319 has a special
 320 syntax for that case:
 321 .P1
 322 a
 323 some lines of text
 324 to be inserted in the file,
 325 terminated by a period
 326 on a line by itself
 327 \&.
 328 .P2
 329 In the one-line syntax, a newline character may be specified by a C-like
 330 escape, so
 331 .P1
 332 c/\en/
 333 .P2
 334 replaces dot by a single newline character.
 335 .PP
 336 .CW Sam
 337 also has a substitute command,
 338 .CW s :
 339 .P1
 340 s/\f2expression\fP/\f2replacement\fP/
 341 .P2
 342 substitutes the replacement text for the first match, in dot,
 343 of the regular expression.
 344 Thus, if dot is the string
 345 .CW Peter ,
 346 the command
 347 .P1
 348 s/t/st/
 349 .P2
 350 changes it to
 351 .CW Pester .
 352 In general,
 353 .CW s
 354 is unnecessary, but it was inherited from
 355 .CW ed
 356 and it has some convenient variations.
 357 For instance, the replacement text may include the matched text,
 358 specified by
 359 .CW & :
 360 .P1
 361 s/Peter/Oh, &, &, &, &!/
 362 .P2
 363 .PP
 364 There are also three commands that apply programs
 365 to text:
 366 .P1
 367 < \f2UNIX program\fP
 368 .P2
 369 replaces dot by the output of the UNIX program.
 370 Similarly, the
 371 .CW >
 372 command
 373 runs the program with dot as its standard input, and
 374 .CW |
 375 does both.  For example,
 376 .P1
 377 | sort
 378 .P2
 379 replaces dot by the result of applying the standard sorting utility to it.
 380 Again, newlines have no special significance for these
 381 .CW sam
 382 commands.
 383 The text acted upon and resulting from these commands is not necessarily
 384 bounded by newlines, although for connection with UNIX programs,
 385 newlines may be necessary to obey conventions.
 386 .PP
 387 One more command:
 388 .CW p
 389 prints the contents of dot.
 390 Table I summarizes
 391 .CW sam 's
 392 commands.
 393 .KF
 394 .TS
 395 center;
 396 c s
 397 lfCW l.
 398 Table I. \f(CWSam\fP commands
 399 .sp .4
 400 .ft CW
 401 _
 402 .ft
 403 .sp .4
 404 \f1Text commands\fP
 405 .sp .4
 406 _
 407 .sp .4
 408 a/\f2text\fP/   Append text after dot
 409 c/\f2text\fP/   Change text in dot
 410 i/\f2text\fP/   Insert text before dot
 411 d       Delete text in dot
 412 s/\f2regexp\fP/\f2text\fP/      Substitute text for match of regular expression in dot
 413 m \f2address\fP Move text in dot after address
 414 t \f2address\fP Copy text in dot after address
 415 .sp .4
 416 _
 417 .sp .4
 418 \f1Display commands\fP
 419 .sp .4
 420 _
 421 .sp .2
 422 p       Print contents of dot
 423 \&=     Print value (line numbers and character numbers) of dot
 424 .sp .4
 425 _
 426 .sp .4
 427 \f1File commands\fP
 428 .sp .4
 429 _
 430 .sp .2
 431 b \f2file-list\fP       Set current file to first file in list that \f(CWsam\fP has in menu
 432 B \f2file-list\fP       Same as \f(CWb\fP, but load new files
 433 n       Print menu lines of all files
 434 D \f2file-list\fP       Delete named files from \f(CWsam\fP
 435 .sp .4
 436 _
 437 .sp .4
 438 \f1I/O commands\fP
 439 .sp .4
 440 _
 441 .sp .2
 442 e \f2filename\fP        Replace file with named disc file
 443 r \f2filename\fP        Replace dot by contents of named disc file
 444 w \f2filename\fP        Write file to named disc file
 445 f \f2filename\fP        Set file name and print new menu line
 446 < \f2UNIX-command\fP    Replace dot by standard output of command
 447 > \f2UNIX-command\fP    Send dot to standard input of command
 448 | \f2UNIX-command\fP    Replace dot by result of command applied to dot
 449 ! \f2UNIX-command\fP    Run the command
 450 .sp .4
 451 _
 452 .sp .4
 453 \f1Loops and conditionals\fP
 454 .sp .4
 455 _
 456 .sp .2
 457 x/\f2regexp\fP/ \f2command\fP   For each match of regexp, set dot and run command
 458 y/\f2regexp\fP/ \f2command\fP   Between adjacent matches of regexp, set dot and run command
 459 X/\f2regexp\fP/ \f2command\fP   Run command in each file whose menu line matches regexp
 460 Y/\f2regexp\fP/ \f2command\fP   Run command in each file whose menu line does not match
 461 g/\f2regexp\fP/ \f2command\fP   If dot contains a match of regexp, run command
 462 v/\f2regexp\fP/ \f2command\fP   If dot does not contain a match of regexp, run command
 463 .sp .4
 464 _
 465 .sp .4
 466 \f1Miscellany\fP
 467 .sp .4
 468 _
 469 .sp .2
 470 k       Set address mark to value of dot
 471 q       Quit
 472 u \f2n\fP       Undo last \f2n\fP (default 1) changes
 473 { }     Braces group commands
 474 .sp .3
 475 .ft CW
 476 _
 477 .ft
 478 .TE
 479 .sp
 480 .KE
 481 .PP
 482 The value of dot may be changed by
 483 specifying an
 484 .I address
 485 for the command.
 486 The simplest address is a line number:
 487 .P1
 488 3
 489 .P2
 490 refers to the third line of the file, so
 491 .P1
 492 3d
 493 .P2
 494 deletes the third line of the file, and implicitly renumbers
 495 the lines so the old line 4 is now numbered 3.
 496 (This is one of the few places where
 497 .CW sam
 498 deals with lines directly.)
 499 Line
 500 .CW 0
 501 is the null string at the beginning of the file.
 502 If a command consists of only an address, a
 503 .CW p
 504 command is assumed, so typing an unadorned
 505 .CW 3
 506 prints line 3 on the terminal.
 507 There are a couple of other basic addresses:
 508 a period addresses dot itself; and
 509 a dollar sign
 510 .CW $ ) (
 511 addresses the null string at the end of the file.
 512 .PP
 513 An address is always a single substring of the file.
 514 Thus, the address
 515 .CW 3
 516 addresses the characters
 517 after the second newline of
 518 the file through the third newline of the file.
 519 A
 520 .I
 521 compound address
 522 .R
 523 is constructed by the comma operator
 524 .P1
 525 \f2address1\fP,\f2address2\fP
 526 .P2
 527 and addresses the substring of the file from the beginning of
 528 .I address1
 529 to the end of
 530 .I address2 .
 531 For example, the command
 532 .CW 3,5p
 533 prints the third through fifth lines of the file and
 534 .CW .,$d
 535 deletes the text from the beginning of dot to the end of the file.
 536 .PP
 537 These addresses are all absolute positions in the file, but
 538 .CW sam
 539 also has relative addresses, indicated by
 540 .CW +
 541 or
 542 .CW - .
 543 For example,
 544 .P1
 545 $-3
 546 .P2
 547 is the third line before the end of the file and
 548 .P1
 549 \&.+1
 550 .P2
 551 is the line after dot.
 552 If no address appears to the left of the
 553 .CW +
 554 or
 555 .CW - ,
 556 dot is assumed;
 557 if nothing appears to the right,
 558 .CW 1
 559 is assumed.
 560 Therefore,
 561 .CW .+1
 562 may be abbreviated to just a plus sign.
 563 .PP
 564 The
 565 .CW +
 566 operator acts relative to the end of its first argument, while the
 567 .CW -
 568 operator acts relative to the beginning.  Thus
 569 .CW .+1
 570 addresses the first line after dot,
 571 .CW .-
 572 addresses the first line before dot, and
 573 .CW +-
 574 refers to the line containing the end of dot.  (Dot may span multiple lines, and
 575 .CW +
 576 selects the line after the end of dot, then
 577 .CW -
 578 backs up one line.)
 579 .PP
 580 The final type of address is a regular expression, which addresses the
 581 text matched by the expression.  The expression is enclosed in slashes, as in
 582 .P1
 583 /\f2expression\fP/
 584 .P2
 585 The expressions are the same as those in the UNIX program
 586 .CW egrep ,\u\s-4\&6,7\s+4\d
 587 and include closures, alternations, and so on.
 588 They find the
 589 .I
 590 leftmost longest
 591 .R
 592 string that matches the expression, that is,
 593 the first match after the point where the search is started,
 594 and if more than one match begins at the same spot, the longest such match.
 595 (I assume familiarity with the syntax for regular expressions in UNIX programs.\u\s-4\&9\s+4\d)
 596 For example,
 597 .P1
 598 /x/
 599 .P2
 600 matches the next
 601 .CW x
 602 character in the file,
 603 .P1
 604 /xx*/
 605 .P2
 606 matches the next run of one or more
 607 .CW x 's,
 608 and
 609 .P1
 610 /x|Peter/
 611 .P2
 612 matches the next
 613 .CW x
 614 or
 615 .CW Peter .
 616 For compatibility with other UNIX programs, the `any character' operator,
 617 a period,
 618 does not match a newline, so
 619 .P1
 620 /.*/
 621 .P2
 622 matches the text from dot to the end of the line, but excludes the newline
 623 and so will not match across
 624 the line boundary.
 625 .PP
 626 Regular expressions are always relative addresses.
 627 The direction is forwards by default,
 628 so
 629 .CW /Peter/
 630 is really an abbreviation for
 631 .CW +/Peter/ .
 632 The search can be reversed with a minus sign, so
 633 .P1
 634 .CW -/Peter/
 635 .P2
 636 finds the first
 637 .CW Peter
 638 before dot.
 639 Regular expressions may be used with other address forms, so
 640 .CW 0+/Peter/
 641 finds the first
 642 .CW Peter
 643 in the file and
 644 .CW $-/Peter/
 645 finds the last.
 646 Table II summarizes
 647 .CW sam 's
 648 addresses.
 649 .KF
 650 .TS
 651 center;
 652 c s
 653 lfCW l.
 654 Table II. \f(CWSam\fP addresses
 655 .sp .4
 656 .ft CW
 657 _
 658 .ft
 659 .sp .4
 660 \f1Simple addresses\fP
 661 .sp .4
 662 _
 663 .sp .2
 664 #\f2n\fP        The empty string after character \f2n\fP
 665 \f2n\fP Line \f2n\fP.
 666 /\f2regexp\fP/  The first following match of the regular expression
 667 -/\f2regexp\fP/ The first previous match of the regular expression
 668 $       The null string at the end of the file
 669 \&.     Dot
 670 \&'     The address mark, set by \f(CWk\fP command
 671 "\f2regexp\fP"  Dot in the file whose menu line matches regexp
 672 .sp .4
 673 _
 674 .sp .4
 675 \f1Compound addresses\fP
 676 .sp .4
 677 _
 678 .sp .2
 679 \f2a1\fP+\f2a2\fP       The address \f2a2\fP evaluated starting at right of \f2a1\fP
 680 \f2a1\fP-\f2a2\fP       \f2a2\fP evaluated in the reverse direction starting at left of \f2a1\fP
 681 \f2a1\fP,\f2a2\fP       From the left of \f2a1\fP to the right of \f2a2\fP (default \f(CW0,$\fP)
 682 \f2a1\fP;\f2a2\fP       Like \f(CW,\fP but sets dot after evaluating \f2a1\fP
 683 .sp .4
 684 _
 685 .sp .4
 686 .T&
 687 c s.
 688 T{
 689 The operators
 690 .CW +
 691 and
 692 .CW -
 693 are high precedence, while
 694 .CW ,
 695 and
 696 .CW ;
 697 are low precedence.
 698 In both
 699 .CW +
 700 and
 701 .CW -
 702 forms,
 703 .I a2
 704 defaults to 1 and
 705 .I a1
 706 defaults to dot.
 707 If both
 708 .I a1
 709 and
 710 .I a2
 711 are present,
 712 .CW +
 713 may be elided.
 714 T}
 715 .sp .5
 716 .ft CW
 717 _
 718 .ft
 719 .TE
 720 .sp
 721 .KE
 722 .PP
 723 The language discussed so far will not seem novel
 724 to people who use UNIX text editors
 725 such as
 726 .CW ed
 727 or
 728 .CW vi .\u\s-4\&9\s+4\d
 729 Moreover, the kinds of editing operations these commands allow, with the exception
 730 of regular expressions and line numbers,
 731 are clearly more conveniently handled by a mouse-based interface.
 732 Indeed,
 733 .CW sam 's
 734 mouse language (discussed at length below) is the means by which
 735 simple changes are usually made.
 736 For large or repetitive changes, however, a textual language
 737 outperforms a manual interface.
 738 .PP
 739 Imagine that, instead of deleting just one occurrence of the string
 740 .CW Peter ,
 741 we wanted to eliminate every
 742 .CW Peter .
 743 What's needed is an iterator that runs a command for each occurrence of some
 744 text.
 745 .CW Sam 's
 746 iterator is called
 747 .CW x ,
 748 for extract:
 749 .P1
 750 x/\f2expression\fP/ \f2command\fP
 751 .P2
 752 finds all matches in dot of the specified expression, and for each
 753 such match, sets dot to the text matched and runs the command.
 754 So to delete all the
 755 .CW Peters:
 756 .P1
 757 0,$ x/Peter/ d
 758 .P2
 759 (Blanks in these examples are to improve readability;
 760 .CW sam
 761 neither requires nor interprets them.)
 762 This searches the entire file
 763 .CW 0,$ ) (
 764 for occurrences of the string
 765 .CW Peter ,
 766 and runs the
 767 .CW d
 768 command with dot set to each such occurrence.
 769 (By contrast, the comparable
 770 .CW ed
 771 command would delete all
 772 .I lines
 773 containing
 774 .CW Peter ;
 775 .CW sam
 776 deletes only the
 777 .CW Peters .)
 778 The address
 779 .CW 0,$
 780 is commonly used, and may be abbreviated to just a comma.
 781 As another example,
 782 .P1
 783 , x/Peter/ p
 784 .P2
 785 prints a list of
 786 .CW Peters,
 787 one for each appearance in the file, with no intervening text (not even newlines
 788 to separate the instances).
 789 .PP
 790 Of course, the text extracted by
 791 .CW x
 792 may be selected by a regular expression,
 793 which complicates deciding what set of matches is chosen \(em
 794 matches may overlap.  This is resolved by generating the matches
 795 starting from the beginning of dot using the leftmost-longest rule,
 796 and searching for each match starting from the end of the previous one.
 797 Regular expressions may also match null strings, but a null match
 798 adjacent to a non-null match is never selected; at least one character
 799 must intervene.
 800 For example,
 801 .P1
 802 , c/AAA/
 803 x/B*/ c/-/
 804 , p
 805 .P2
 806 produces as output
 807 .P1
 808 -A-A-A-
 809 .P2
 810 because the pattern
 811 .CW B*
 812 matches the null strings separating the
 813 .CW A 's.
 814 .PP
 815 The
 816 .CW x
 817 command has a complement,
 818 .CW y ,
 819 with similar syntax, that executes the command with dot set to the text
 820 .I between
 821 the matches of the expression.
 822 For example,
 823 .P1
 824 , c/AAA/
 825 y/A/ c/-/
 826 , p
 827 .P2
 828 produces the same result as the example above.
 829 .PP
 830 The
 831 .CW x
 832 and
 833 .CW y
 834 commands are looping constructs, and
 835 .CW sam
 836 has a pair of conditional commands to go with them.
 837 They have similar syntax:
 838 .P1
 839 g/\f2expression\fP/ \f2command\fP
 840 .P2
 841 (guard)
 842 runs the command exactly once if dot contains a match of the expression.
 843 This is different from
 844 .CW x ,
 845 which runs the command for
 846 .I each
 847 match:
 848 .CW x
 849 loops;
 850 .CW g
 851 merely tests, without changing the value of dot.
 852 Thus,
 853 .P1
 854 , x/Peter/ d
 855 .P2
 856 deletes all occurrences of
 857 .CW Peter ,
 858 but
 859 .P1
 860 , g/Peter/ d
 861 .P2
 862 deletes the whole file (reduces it to a null string) if
 863 .CW Peter
 864 occurs anywhere in the text.
 865 The complementary conditional is
 866 .CW v ,
 867 which runs the command if there is
 868 .I no
 869 match of the expression.
 870 .PP
 871 These control-structure-like commands may be composed to construct more
 872 involved operations.  For example, to print those lines of text that
 873 contain the string
 874 .CW Peter :
 875 .P1
 876 , x/.*\en/ g/Peter/ p
 877 .P2
 878 The
 879 .CW x
 880 breaks the file into lines, the
 881 .CW g
 882 selects those lines containing
 883 .CW Peter ,
 884 and the
 885 .CW p
 886 prints them.
 887 This command gives an address for the
 888 .CW x
 889 command (the whole file), but because
 890 .CW g
 891 does not have an explicit address, it applies to the value of
 892 dot produced by the
 893 .CW x
 894 command, that is, to each line.
 895 All commands in
 896 .CW sam
 897 except for the command to write a file to disc use dot for the
 898 default address.
 899 .PP
 900 Composition may be continued indefinitely.
 901 .P1
 902 , x/.*\en/ g/Peter/ v/SaltPeter/ p
 903 .P2
 904 prints those lines containing
 905 .CW Peter
 906 but
 907 .I not
 908 those containing
 909 .CW SaltPeter .
 910 .SH 2
 911 Structural Regular Expressions
 912 .LP
 913 Unlike other UNIX text editors,
 914 including the non-interactive ones such as
 915 .CW sed
 916 and
 917 .CW awk ,\u\s-4\&7\s+4\d
 918 .CW sam
 919 is good for manipulating files with multi-line `records.'
 920 An example is an on-line phone book composed of records,
 921 separated by blank lines, of the form
 922 .P1
 923 Herbert Tic
 924 44 Turnip Ave., Endive, NJ
 925 201-5555642
 926
 927 Norbert Twinge
 928 16 Potato St., Cabbagetown, NJ
 929 201-5553145
 930
 931 \&...
 932 .P2
 933 The format may be encoded as a regular expression:
 934 .P1
 935 (.+\en)+
 936 .P2
 937 that is, a sequence of one or more non-blank lines.
 938 The command to print Mr. Tic's entire record is then
 939 .P1
 940 , x/(.+\en)+/ g/^Herbert Tic$/ p
 941 .P2
 942 and that to extract just the phone number is
 943 .P1
 944 , x/(.+\en)+/ g/^Herbert Tic$/ x/^[0-9]*-[0-9]*\en/ p
 945 .P2
 946 The latter command breaks the file into records,
 947 chooses Mr. Tic's record,
 948 extracts the phone number from the record,
 949 and finally prints the number.
 950 .PP
 951 A more involved problem is that of
 952 renaming a particular variable, say
 953 .CW n ,
 954 to
 955 .CW num
 956 in a C program.
 957 The obvious first attempt,
 958 .P1
 959 , x/n/ c/num/
 960 .P2
 961 is badly flawed: it changes not only the variable
 962 .CW n
 963 but any letter
 964 .CW n
 965 that appears.
 966 We need to extract all the variables, and select those that match
 967 .CW n
 968 and only
 969 .CW n :
 970 .P1
 971 , x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
 972 .P2
 973 The pattern
 974 .CW [A-Za-z_][A-Za-z_0-9]*
 975 matches C identifiers.
 976 Next
 977 .CW g/n/
 978 selects those containing an
 979 .CW n .
 980 Then
 981 .CW v/../
 982 rejects those containing two (or more) characters, and finally
 983 .CW c/num/
 984 changes the remainder (identifiers
 985 .CW n )
 986 to
 987 .CW num .
 988 This version clearly works much better, but there may still be problems.
 989 For example, in C character and string constants, the sequence
 990 .CW \en
 991 is interpreted as a newline character, and we don't want to change it to
 992 .CW \enum.
 993 This problem can be forestalled with a
 994 .CW y
 995 command:
 996 .P1
 997 , y/\e\en/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
 998 .P2
 999 (the second
1000 .CW \e
1001 is necessary because of lexical conventions in regular expressions),
1002 or we could even reject character constants and strings outright:
1003 .P1 0
1004 ,y/'[^']*'/ y/"[^"]*"/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
1005 .P2
1006 The
1007 .CW y
1008 commands in this version exclude from consideration all character constants
1009 and strings.
1010 The only remaining problem is to deal with the possible occurrence of
1011 .CW \e'
1012 or
1013 .CW \e"
1014 within these sequences, but it's easy to see how to resolve this difficulty.
1015 .PP
1016 The point of these composed commands is successive refinement.
1017 A simple version of the command is tried, and if it's not good enough,
1018 it can be honed by adding a clause or two.
1019 (Mistakes can be undone; see below.
1020 Also, the mouse language makes it unnecessary to retype the command each time.)
1021 The resulting chains of commands are somewhat reminiscent of
1022 shell pipelines.\u\s-4\&7\s+4\d
1023 Unlike pipelines, though, which pass along modified
1024 .I data ,
1025 .CW sam
1026 commands pass a
1027 .I view
1028 of the data.
1029 The text at each step of the command is the same, but which pieces
1030 are selected is refined step by step until the correct piece is
1031 available to the final step of the command line, which ultimately makes the change.
1032 .PP
1033 In other UNIX programs, regular expressions are used only for selection,
1034 as in the
1035 .CW sam
1036 .CW g
1037 command, never for extraction as in the
1038 .CW x
1039 or
1040 .CW y
1041 command.
1042 For example, patterns in
1043 .CW awk \u\s-4\&7\s+4\d
1044 are used to select lines to be operated on, but cannot be used
1045 to describe the format of the input text, or to handle newline-free text.
1046 The use of regular expressions to describe the structure of a piece
1047 of text rather than its contents, as in the
1048 .CW x
1049 command,
1050 has been given a name:
1051 .I
1052 structural regular expressions.
1053 .R
1054 When they are composed, as in the above example,
1055 they are pleasantly expressive.
1056 Their use is discussed at greater length elsewhere.\u\s-4\&10\s+4\d
1057 .PP
1058 .SH 2
1059 Multiple files
1060 .LP
1061 .CW Sam
1062 has a few other commands, mostly relating to input and output.
1063 .P1
1064 e discfilename
1065 .P2
1066 replaces the contents and name of the current file with those of the named
1067 disc file;
1068 .P1
1069 w discfilename
1070 .P2
1071 writes the contents to the named disc file; and
1072 .P1
1073 r discfilename
1074 .P2
1075 replaces dot with the contents of the named disc file.
1076 All these commands use the current file's name if none is specified.
1077 Finally,
1078 .P1
1079 f discfilename
1080 .P2
1081 changes the name associated with the file and displays the result:
1082 .P1
1083 \&'-. discfilename
1084 .P2
1085 This output is called the file's
1086 .I
1087 menu line,
1088 .R
1089 because it is the contents of the file's line in the button 3 menu (described
1090 in the
1091 next section).
1092 The first three characters are a concise notation for the state of the file.
1093 The apostrophe signifies that the file is modified.
1094 The minus sign indicates the number of windows
1095 open on the file (see the next section):
1096 .CW -
1097 means none,
1098 .CW +
1099 means one, and
1100 .CW *
1101 means more than one.
1102 Finally, the period indicates that this is the current file.
1103 These characters are useful for controlling the
1104 .CW X
1105 command, described shortly.
1106 .PP
1107 .CW Sam
1108 may be started with a set of disc files (such as all the source for
1109 a program) by invoking it with a list of file names as arguments, and
1110 more may be added or deleted on demand.
1111 .P1
1112 B discfile1 discfile2 ...
1113 .P2
1114 adds the named files to
1115 .CW sam 's
1116 list, and
1117 .P1
1118 D discfile1 discfile2 ...
1119 .P2
1120 removes them from
1121 .CW sam 's
1122 memory (without effect on associated disc files).
1123 Both these commands have a syntax for using the shell\u\s-4\&7\s+4\d
1124 (the UNIX command interpreter) to generate the lists:
1125 .P1
1126 B <echo *.c
1127 .P2
1128 will add all C source files, and
1129 .P1
1130 B <grep -l variable *.c
1131 .P2
1132 will add all C source files referencing a particular variable
1133 (the UNIX command
1134 .CW grep\ -l
1135 lists all files in its arguments that contain matches of
1136 the specified regular expression).
1137 Finally,
1138 .CW D
1139 without arguments deletes the current file.
1140 .PP
1141 There are two ways to change which file is current:
1142 .P1
1143 b filename
1144 .P2
1145 makes the named file current.
1146 The
1147 .CW B
1148 command
1149 does the same, but also adds any new files to
1150 .CW sam 's
1151 list.
1152 (In practice, of course, the current file
1153 is usually chosen by mouse actions, not by textual commands.)
1154 The other way is to use a form of address that refers to files:
1155 .P1
1156 "\f2expression\fP" \f2address\fP
1157 .P2
1158 refers to the address evaluated in the file whose menu line
1159 matches the expression (there must be exactly one match).
1160 For example,
1161 .P1
1162 "peter.c" 3
1163 .P2
1164 refers to the third line of the file whose name matches
1165 .CW peter.c .
1166 This is most useful in the move
1167 .CW m ) (
1168 and copy
1169 .CW t ) (
1170 commands:
1171 .P1
1172 0,$ t "peter.c" 0
1173 .P2
1174 makes a copy of the current file at the beginning of
1175 .CW peter.c .
1176 .PP
1177 The
1178 .CW X
1179 command
1180 is a looping construct, like
1181 .CW x ,
1182 that refers to files instead of strings:
1183 .P1
1184 X/\f2expression\fP/ \f2command\fP
1185 .P2
1186 runs the command in all
1187 files whose menu lines match the expression.  The best example is
1188 .P1
1189 X/'/ w
1190 .P2
1191 which writes to disc all modified files.
1192 .CW Y
1193 is the complement of
1194 .CW X :
1195 it runs the command on all files whose menu lines don't match the expression:
1196 .P1
1197 Y/\e.c/ D
1198 .P2
1199 deletes all files that don't have
1200 .CW \&.c
1201 in their names, that is, it keeps all C source files and deletes the rest.
1202 .PP
1203 Braces allow commands to be grouped, so
1204 .P1
1205 {
1206         \f2command1\fP
1207         \f2command2\fP
1208 }
1209 .P2
1210 is syntactically a single command that runs two commands.
1211 Thus,
1212 .P1
1213 X/\e.c/ ,g/variable/ {
1214         f
1215         , x/.*\en/ g/variable/ p
1216 }
1217 .P2
1218 finds all occurrences of
1219 .CW variable
1220 in C source files, and prints
1221 out the file names and lines of each match.
1222 The precise semantics of compound operations is discussed in the implementation
1223 sections below.
1224 .PP
1225 Finally,
1226 the undo command,
1227 .CW u ,
1228 undoes the last command,
1229 no matter how many files were affected.
1230 Multiple undo operations move further back in time, so
1231 .P1
1232 u
1233 u
1234 .P2
1235 (which may be abbreviated
1236 .CW u2 )
1237 undoes the last two commands.  An undo may not be undone, however, nor
1238 may any command that adds or deletes files.
1239 Everything else is undoable, though, including for example
1240 .CW e
1241 commands:
1242 .P1
1243 e filename
1244 u
1245 .P2
1246 restores the state of the file completely, including its name, dot,
1247 and modified bit.  Because of the undo, potentially dangerous commands
1248 are not guarded by confirmations.  Only
1249 .CW D ,
1250 which destroys the information necessary to restore itself, is protected.
1251 It will not delete a modified file, but a second
1252 .CW D
1253 of the same file will succeed regardless.
1254 The
1255 .CW q
1256 command, which exits
1257 .CW sam ,
1258 is similarly guarded.
1259 .SH 2
1260 Mouse Interface
1261 .LP
1262 .CW Sam
1263 is most commonly run
1264 connected to a bitmap display and mouse for interactive editing.
1265 The only difference in the command language
1266 between regular, mouse-driven
1267 .CW sam
1268 and
1269 .CW sam\ -d
1270 is that if an address
1271 is provided without a command,
1272 .CW sam\ -d
1273 will print the text referenced by the address, but
1274 regular
1275 .CW sam
1276 will highlight it on the screen \(em in fact,
1277 dot is always highlighted (see Figure 2).
1278 .WS 1
1279 .KF
1280 .XP fig3 2.04i
1281 .Cs
1282 Figure 2. A
1283 .CW sam
1284 window.  The scroll bar down the left
1285 represents the file, with the bubble showing the fraction
1286 visible in the window.
1287 The scroll bar may be manipulated by the mouse for convenient browsing.
1288 The current text,
1289 which is highlighted, need not fit on a line.  Here it consists of one partial
1290 line, one complete line, and final partial line.
1291 .Ce
1292 .KE
1293 .PP
1294 Each file may have zero or more windows open on the display.
1295 At any time, only one window in all of
1296 .CW sam
1297 is the
1298 .I
1299 current window,
1300 .R
1301 that is, the window to which typing and mouse actions refer;
1302 this may be the
1303 .CW sam
1304 window (that in which commands may be typed)
1305 or one of the file windows.
1306 When a file has multiple windows, the image of the file in each window
1307 is always kept up to date.
1308 The current file is the last file affected by a command,
1309 so if the
1310 .CW sam
1311 window is current,
1312 the current window is not a window on the current file.
1313 However, each window on a file has its own value of dot,
1314 and when switching between windows on a single file,
1315 the file's value of dot is changed to that of the window.
1316 Thus, flipping between windows behaves in the obvious, convenient way.
1317 .PP
1318 The mouse on the Blit has three buttons, numbered left to right.
1319 Button 3 has a list of commands to manipulate windows,
1320 followed by a list of `menu lines' exactly as printed by the
1321 .CW f
1322 command, one per file (not one per window).
1323 These menu lines are sorted by file name.
1324 If the list is long, the Blit menu software will make it more manageable
1325 by generating a scrolling menu instead of an unwieldy long list.
1326 Using the menu to select a file from the list makes that file the current
1327 file, and the most recently current window in that file the current window.
1328 But if that file is already current, selecting it in the menu cycles through
1329 the windows on the file; this simple trick avoids a special menu to
1330 choose windows on a file.
1331 If there is no window open on the file,
1332 .CW sam
1333 changes the mouse cursor to prompt the user to create one.
1334 .PP
1335 The commands on the button 3 menu are straightforward (see Figure 3), and
1336 are like the commands to manipulate windows in
1337 .CW mux ,\u\s-4\&8\s+4\d
1338 the Blit's window system.
1339 .CW New
1340 makes a new file, and gives it one empty window, whose size is determined
1341 by a rectangle swept by the mouse.
1342 .CW Zerox
1343 prompts for a window to be selected, and
1344 makes a clone of that window; this is how multiple windows are created on one file.
1345 .CW Reshape
1346 changes the size of the indicated window, and
1347 .CW close
1348 deletes it.  If that is the last window open on the file,
1349 .CW close
1350 first does a
1351 .CW D
1352 command on the file.
1353 .CW Write
1354 is identical to a
1355 .CW w
1356 command on the file; it is in the menu purely for convenience.
1357 Finally,
1358 .CW ~~sam~~
1359 is a menu item that appears between the commands and the file names.
1360 Selecting it makes the
1361 .CW sam
1362 window the current window,
1363 causing subsequent typing to be interpreted as commands.
1364 .KF
1365 .XP fig2 2.74i
1366 .Cs
1367 Figure 3. The menu on button 3.
1368 The black rectangle on the left is a scroll bar; the menu is limited to
1369 the length shown to prevent its becoming unwieldy.
1370 Above the
1371 .CW ~~sam~~
1372 line is a list of commands;
1373 beneath it is a list of files, presented exactly as with the
1374 .CW f
1375 command.
1376 .Ce
1377 .KE
1378 .PP
1379 When
1380 .CW sam
1381 requests that a window be swept, in response to
1382 .CW new ,
1383 .CW zerox
1384 or
1385 .CW reshape ,
1386 it changes the mouse cursor from the usual arrow to a box with
1387 a small arrow.
1388 In this state, the mouse may be used to indicate an arbitrary rectangle by
1389 pressing button 3 at one corner and releasing it at the opposite corner.
1390 More conveniently,
1391 button 3 may simply be clicked,
1392 whereupon
1393 .CW sam
1394 creates the maximal rectangle that contains the cursor
1395 and abuts the
1396 .CW sam
1397 window.
1398 By placing the
1399 .CW sam
1400 window in the middle of the screen, the user can define two regions (one above,
1401 one below) in which stacked fully-overlapping
1402 windows can be created with minimal fuss (see Figure 1).
1403 This simple user interface trick makes window creation noticeably easier.
1404 .PP
1405 The cut-and-paste editor is essentially the same as that in Smalltalk-80.\u\s-4\&11\s+4\d
1406 The text in dot is always highlighted on the screen.
1407 When a character is typed it replaces dot, and sets dot to the null
1408 string after the character.  Thus, ordinary typing inserts text.
1409 Button 1 is used for selection:
1410 pressing the button, moving the mouse, and lifting the button
1411 selects (sets dot to) the text between the points where the
1412 button was pressed and released.
1413 Pressing and releasing at the same point selects a null string; this
1414 is called clicking.  Clicking twice quickly, or
1415 .I
1416 double clicking,
1417 .R
1418 selects larger objects;
1419 for example, double clicking in a word selects the word,
1420 double clicking just inside an opening bracket selects the text
1421 contained in the brackets (handling nested brackets correctly),
1422 and similarly for
1423 parentheses, quotes, and so on.
1424 The double-clicking rules reflect a bias toward
1425 programmers.
1426 If
1427 .CW sam
1428 were intended more for word processing, double-clicks would probably
1429 select linguistic structures such as sentences.
1430 .PP
1431 If button 1 is pressed outside the current window, it makes the indicated
1432 window current.
1433 This is the easiest way to switch between windows and files.
1434 .PP
1435 Pressing button 2 brings up a menu of editing functions (see Figure 4).
1436 These mostly apply to the selected text:
1437 .CW cut
1438 deletes the selected text, and remembers it in a hidden buffer called the
1439 .I
1440 snarf buffer,
1441 .R
1442 .CW paste
1443 replaces the selected text by the contents of the snarf buffer,
1444 .CW snarf
1445 just copies the selected text to the snarf buffer,
1446 .CW look
1447 searches forward for the next literal occurrence of the selected text, and
1448 .CW <mux>
1449 exchanges snarf buffers with the window system in which
1450 .CW sam
1451 is running.
1452 Finally, the last regular expression used appears as a menu entry
1453 to search
1454 forward for the next occurrence of a match for the expression.
1455 .WS 1
1456 .KF
1457 .XP fig4 1.20i
1458 .Cs
1459 Figure 4. The menu on button 2.
1460 The bottom entry tracks the most recently used regular expression, which may
1461 be literal text.
1462 .Ce
1463 .KE
1464 .PP
1465 The relationship between the command language and the mouse language is
1466 entirely due to the equality of dot and the selected text chosen
1467 with button 1 on the mouse.
1468 For example, to make a set of changes in a C subroutine, dot can be
1469 set by double clicking on the left brace that begins the subroutine,
1470 which sets dot for the command language.
1471 An address-free command then typed in the
1472 .CW sam
1473 window will apply only to the text between the opening and closing
1474 braces of the function.
1475 The idea is to select what you want, and then say what you want
1476 to do with it, whether invoked by a menu selection or by a typed command.
1477 And of course, the value of dot is highlighted on
1478 the display after the command completes.
1479 This relationship between mouse interface and command language
1480 is clumsy to explain, but comfortable, even natural, in practice.
1481 .SH
1482 The Implementation
1483 .LP
1484 The next few sections describe how
1485 .CW sam
1486 is put together, first the host part,
1487 then the inter-component communication,
1488 then the terminal part.
1489 After explaining how the command language is implemented,
1490 the discussion follows (roughly) the path of a character
1491 from the temporary file on disc to the screen.
1492 The presentation centers on the data structures,
1493 because that is how the program was designed and because
1494 the algorithms are easy to provide, given the right data
1495 structures.
1496 .SH 2
1497 Parsing and execution
1498 .LP
1499 The command language is interpreted by parsing each command with a
1500 table-driven recursive
1501 descent parser, and when a complete command is assembled, invoking a top-down
1502 executor.
1503 Most editors instead employ a simple character-at-a-time
1504 lexical scanner.
1505 Use of a parser makes it
1506 easy and unambiguous to detect when a command is complete,
1507 which has two advantages.
1508 First, escape conventions such as backslashes to quote
1509 multiple-line commands are unnecessary;  if the command isn't finished,
1510 the parser keeps reading.  For example, a multiple-line append driven by an
1511 .CW x
1512 command is straightforward:
1513 .P1
1514 x/.*\en/ g/Peter/ a
1515 one line about Peter
1516 another line about Peter
1517 \&.
1518 .P2
1519 Other UNIX editors would require a backslash after all but the last line.
1520 .PP
1521 The other advantage is specific to the two-process structure of
1522 .CW sam .
1523 The host process must decide when a command is completed so the
1524 command interpreter can be called.  This problem is easily resolved
1525 by having the lexical analyzer read the single stream of events from the
1526 terminal, directly executing all typing and mouse commands,
1527 but passing to the parser characters typed to the
1528 .CW sam
1529 command window.
1530 This scheme is slightly complicated by the availability of cut-and-paste
1531 editing in the
1532 .CW sam
1533 window, but that difficulty is resolved by applying the rules
1534 used in
1535 .CW mux :
1536 when a newline is typed to the
1537 .CW sam
1538 window, all text between the newline and the previously typed newline
1539 is made available to the parser.
1540 This permits arbitrary editing to be done to a command before
1541 typing newline and thereby requesting execution.
1542 .PP
1543 The parser is driven by a table because the syntax of addresses
1544 and commands is regular enough
1545 to be encoded compactly.  There are few special cases, such as the
1546 replacement text in a substitution, so the syntax of almost all commands
1547 can be encoded with a few flags.
1548 These include whether the command allows an address (for example,
1549 .CW e
1550 does not), whether it takes a regular expression (as in
1551 .CW x
1552 and
1553 .CW s ),
1554 whether it takes replacement text (as in
1555 .CW c
1556 or
1557 .CW i ),
1558 which may be multi-line, and so on.
1559 The internal syntax of regular expressions is handled by a separate
1560 parser; a regular expression is a leaf of the command parse tree.
1561 Regular expressions are discussed fully in the next section.
1562 .PP
1563 The parser table also has information about defaults, so the interpreter
1564 is always called with a complete tree.  For example, the parser fills in
1565 the implicit
1566 .CW 0
1567 and
1568 .CW $
1569 in the abbreviated address
1570 .CW ,
1571 (comma),
1572 inserts a
1573 .CW +
1574 to the left of an unadorned regular expression in an address,
1575 and provides the usual default address
1576 .CW .
1577 (dot) for commands that expect an address but are not given one.
1578 .PP
1579 Once a complete command is parsed, the evaluation is easy.
1580 The address is evaluated left-to-right starting from the value of dot,
1581 with a mostly ordinary expression evaluator.
1582 Addresses, like many of the data structures in
1583 .CW sam ,
1584 are held in a C structure and passed around by value:
1585 .P1
1586 typedef long Posn;    /* Position in a file */
1587 typedef struct Range{
1588         Posn    p1, p2;
1589 }Range;
1590 typedef struct Address{
1591         Range   r;
1592         File    *f;
1593 }Address;
1594 .P2
1595 An address is encoded as a substring (character positions
1596 .CW p1
1597 to
1598 .CW p2 )
1599 in a file
1600 .CW f .
1601 (The data type
1602 .CW File
1603 is described in detail below.)
1604 .PP
1605 The address interpreter is an
1606 .CW Address -valued
1607 function that traverses the parse tree describing an address (the
1608 parse tree for the address has type
1609 .CW Addrtree ):
1610 .P1
1611 Address
1612 address(ap, a, sign)
1613         Addrtree *ap;
1614         Address a;
1615         int sign;
1616 {
1617         Address a2;
1618         do
1619                 switch(ap->type){
1620                 case '.':
1621                         a=a.f->dot;
1622                         break;
1623                 case '$':
1624                         a.r.p1=a.r.p2=a.f->nbytes;
1625                         break;
1626                 case '"':
1627                         a=matchfile(a, ap->aregexp)->dot;
1628                         break;
1629                 case ',':
1630                         a2=address(ap->right, a, 0);
1631                         a=address(ap->left, a, 0);
1632                         if(a.f!=a2.f || a2.r.p2<a.r.p1)
1633                                 error(Eorder);
1634                         a.r.p2=a2.r.p2;
1635                         return a;
1636                 /* and so on */
1637                 }
1638         while((ap=ap->right)!=0);
1639         return a;
1640 }
1641 .P2
1642 .PP
1643 Throughout, errors are handled by a non-local
1644 .CW goto
1645 (a
1646 .CW setjmp/longjmp
1647 in C terminology)
1648 hidden in a routine called
1649 .CW error
1650 that immediately aborts the execution, retracts any
1651 partially made changes (see the section below on `undoing'), and
1652 returns to the top level of the parser.
1653 The argument to
1654 .CW error
1655 is an enumeration type that
1656 is translated to a terse but possibly helpful
1657 message such as `?addresses out of order.'
1658 Very common messages are kept short; for example the message for
1659 a failed regular expression search is `?search.'
1660 .PP
1661 Character addresses such as
1662 .CW #3
1663 are trivial to implement, as the
1664 .CW File
1665 data structure is accessible by character number.
1666 However,
1667 .CW sam
1668 keeps no information about the position of newlines \(em it is too
1669 expensive to track dynamically \(em so line addresses are computed by reading
1670 the file, counting newlines.  Except in very large files, this has proven
1671 acceptable: file access is fast enough to make the technique practical,
1672 and lines are not central to the structure of the command language.
1673 .PP
1674 The command interpreter, called
1675 .CW cmdexec ,
1676 is also straightforward.  The parse table includes a
1677 function to call to interpret a particular command.  That function
1678 receives as arguments
1679 the calculated address
1680 for the command
1681 and the command tree (of type
1682 .CW Cmdtree ),
1683 which may contain information such as the subtree for compound commands.
1684 Here, for example, is the function for the
1685 .CW g
1686 and
1687 .CW v
1688 commands:
1689 .P1
1690 int
1691 g_cmd(a, cp)
1692         Address a;
1693         Cmdtree *cp;
1694 {
1695         compile(cp->regexp);
1696         if(execute(a.f, a.r.p1, a.r.p2)!=(cp->cmdchar=='v')){
1697                 a.f->dot=a;
1698                 return cmdexec(a, cp->subcmd);
1699         }
1700         return TRUE;    /* cause execution to continue */
1701 }
1702 .P2
1703 .CW Compile "" (
1704 and
1705 .CW execute
1706 are part of the regular expression code, described in the next section.)
1707 Because the parser and the
1708 .CW File
1709 data structure do most of the work, most commands
1710 are similarly brief.
1711 .SH 2
1712 Regular expressions
1713 .LP
1714 The regular expression code in
1715 .CW sam
1716 is an interpreted, rather than compiled on-the-fly, implementation of Thompson's
1717 non-deterministic finite automaton algorithm.\u\s-4\&12\s+4\d
1718 The syntax and semantics of the expressions are as in the UNIX program
1719 .CW egrep ,
1720 including alternation, closures, character classes, and so on.
1721 The only changes in the notation are two additions:
1722 .CW \en
1723 is translated to, and matches, a newline character, and
1724 .CW @
1725 matches any character.  In
1726 .CW egrep ,
1727 the character
1728 .CW \&.
1729 matches any character except newline, and in
1730 .CW sam
1731 the same rule seemed safest, to prevent idioms like
1732 .CW \&.*
1733 from spanning newlines.
1734 .CW Egrep
1735 expressions are arguably too complicated for an interactive editor \(em
1736 certainly it would make sense if all the special characters were two-character
1737 sequences, so that most of the punctuation characters wouldn't have
1738 peculiar meanings \(em but for an interesting command language, full
1739 regular expressions are necessary, and
1740 .CW egrep
1741 defines the full regular expression syntax for UNIX programs.
1742 Also, it seemed superfluous to define a new syntax, since various UNIX programs
1743 .CW ed , (
1744 .CW egrep
1745 and
1746 .CW vi )
1747 define too many already.
1748 .PP
1749 The expressions are compiled by a routine,
1750 .CW compile ,
1751 that generates the description of the non-deterministic finite state machine.
1752 A second routine,
1753 .CW execute ,
1754 interprets the machine to generate the leftmost-longest match of the
1755 expression in a substring of the file.
1756 The algorithm is described elsewhere.\u\s-4\&12,13\s+4\d
1757 .CW Execute
1758 reports
1759 whether a match was found, and sets a global variable,
1760 of type
1761 .CW Range ,
1762 to the substring matched.
1763 .PP
1764 A trick is required to evaluate the expression in reverse, such as when
1765 searching backwards for an expression.
1766 For example,
1767 .P1
1768 -/P.*r/
1769 .P2
1770 looks backwards through the file for a match of the expression.
1771 The expression, however, is defined for a forward search.
1772 The solution is to construct a machine identical to the machine
1773 for a forward search except for a reversal of all the concatenation
1774 operators (the other operators are symmetric under direction reversal),
1775 to exchange the meaning of the operators
1776 .CW ^
1777 and
1778 .CW $ ,
1779 and then to read the file backwards, looking for the
1780 usual earliest longest match.
1781 .PP
1782 .CW Execute
1783 generates only one match each time it is called.
1784 To interpret looping constructs such as the
1785 .CW x
1786 command,
1787 .CW sam
1788 must therefore synchronize between
1789 calls of
1790 .CW execute
1791 to avoid
1792 problems with null matches.
1793 For example, even given the leftmost-longest rule,
1794 the expression
1795 .CW a*
1796 matches three times in the string
1797 .CW ab
1798 (the character
1799 .CW a ,
1800 the null string between the
1801 .CW a
1802 and
1803 .CW b ,
1804 and the final null string).
1805 After returning a match for the
1806 .CW a ,
1807 .CW sam
1808 must not match the null string before the
1809 .CW b .
1810 The algorithm starts
1811 .CW execute
1812 at the end of its previous match, and
1813 if the match it returns
1814 is null and abuts the previous match, rejects the match and advances
1815 the initial position one character.
1816 .SH 2
1817 Memory allocation
1818 .LP
1819 The C language has no memory allocation primitives, although a standard
1820 library routine,
1821 .CW malloc ,
1822 provides adequate service for simple programs.
1823 For specific uses, however,
1824 it can be better to write a custom allocator.
1825 The allocator (or rather, pair of allocators) described here
1826 work in both the terminal and host parts of
1827 .CW sam .
1828 They are designed for efficient manipulation of strings,
1829 which are allocated and freed frequently and vary in length from essentially
1830 zero to 32 Kbytes (very large strings are written to disc).
1831 More important, strings may be large and change size often,
1832 so to minimize memory usage it is helpful to reclaim and to coalesce the
1833 unused portions of strings when they are truncated.
1834 .PP
1835 Objects to be allocated in
1836 .CW sam
1837 are of two flavors:
1838 the first is C
1839 .CW structs ,
1840 which are small and often addressed by pointer variables;
1841 the second is variable-sized arrays of characters
1842 or integers whose
1843 base pointer is always used to access them.
1844 The memory allocator in
1845 .CW sam
1846 is therefore in two parts:
1847 first, a traditional first-fit allocator that provides fixed storage for
1848 .CW structs ;
1849 and second, a garbage-compacting allocator that reduces storage
1850 overhead for variable-sized objects, at the cost of some bookkeeping.
1851 The two types of objects are allocated from adjoining arenas, with
1852 the garbage-compacting allocator controlling the arena with higher addresses.
1853 Separating into two arenas simplifies compaction and prevents fragmentation due
1854 to immovable objects.
1855 The access rules for garbage-compactable objects
1856 (discussed in the next paragraph) allow them to be relocated, so when
1857 the first-fit arena needs space, it moves the garbage-compacted arena
1858 to higher addresses to make room.  Storage is therefore created only
1859 at successively higher addresses, either when more garbage-compacted
1860 space is needed or when the first-fit arena pushes up the other arena.
1861 .PP
1862 Objects that may be compacted declare to the
1863 allocator a cell that is guaranteed to be the sole repository of the
1864 address of the object whenever a compaction can occur.
1865 The compactor can then update the address when the object is moved.
1866 For example, the implementation of type
1867 .CW List
1868 (really a variable-length array)
1869 is:
1870 .P1
1871 typedef struct List{
1872         int     nused;
1873         long    *ptr;
1874 }List;
1875 .P2
1876 The
1877 .CW ptr
1878 cell must always be used directly, and never copied.  When a
1879 .CW List
1880 is to be created the
1881 .CW List
1882 structure is allocated in the ordinary first-fit arena
1883 and its
1884 .CW ptr
1885 is allocated in the garbage-compacted arena.
1886 A similar data type for strings, called
1887 .CW String ,
1888 stores variable-length character arrays of up to 32767 elements.
1889 .PP
1890 A related matter of programming style:
1891 .CW sam
1892 frequently passes structures by value, which
1893 simplifies the code.
1894 Traditionally, C programs have
1895 passed structures by reference, but implicit allocation on
1896 the stack is easier to use.
1897 Structure passing is a relatively new feature of C
1898 (it is not in the
1899 standard reference manual for C\u\s-4\&14\s+4\d), and is poorly supported in most
1900 commercial C compilers.
1901 It's convenient and expressive, though,
1902 and simplifies memory management by
1903 avoiding the allocator altogether
1904 and eliminating pointer aliases.
1905 .SH 2
1906 Data structures for manipulating files
1907 .LP
1908 Experience with
1909 .CW jim
1910 showed that the requirements
1911 of the file data structure were few, but strict.
1912 First, files need to be read and written quickly;
1913 adding a fresh file must be painless.
1914 Second, the implementation must place no arbitrary upper limit on
1915 the number or sizes of files.  (It should be practical to edit many files,
1916 and files up to megabytes in length should be handled gracefully.)
1917 This implies that files be stored on disc, not in main memory.
1918 (Aficionados of virtual memory may argue otherwise, but the
1919 implementation of virtual
1920 memory in our system is not something to depend on
1921 for good performance.)
1922 Third, changes to files need be made by only two primitives:
1923 deletion and insertion.
1924 These are inverses of each other,
1925 which simplifies the implementation of the undo operation.
1926 Finally,
1927 it must be easy and efficient to access the file, either
1928 forwards or backwards, a byte at a time.
1929 .PP
1930 The
1931 .CW File
1932 data type is constructed from three simpler data structures that hold arrays
1933 of characters.
1934 Each of these types has an insertion and deletion operator, and the
1935 insertion and deletion operators of the
1936 .CW File
1937 type itself are constructed from them.
1938 .PP
1939 The simplest type is the
1940 .CW String ,
1941 which is used to hold strings in main memory.
1942 The code that manages
1943 .CW Strings
1944 guarantees that they will never be longer
1945 than some moderate size, and in practice they are rarely larger than 8 Kbytes.
1946 .CW Strings
1947 have two purposes: they hold short strings like file names with little overhead,
1948 and because they are deliberately small, they are efficient to modify.
1949 They are therefore used as the data structure for in-memory caches.
1950 .PP
1951 The disc copy of the file is managed by a data structure called a
1952 .CW Disc ,
1953 which corresponds to a temporary file.  A
1954 .CW Disc
1955 has no storage in main memory other than bookkeeping information;
1956 the actual data being held is all on the disc.
1957 To reduce the number of open files needed,
1958 .CW sam
1959 opens a dozen temporary UNIX files and multiplexes the
1960 .CW Discs
1961 upon them.
1962 This permits many files to
1963 be edited; the entire
1964 .CW sam
1965 source (48 files) may be edited comfortably with a single
1966 instance of
1967 .CW sam .
1968 Allocating one temporary file per
1969 .CW Disc
1970 would strain the operating system's limit on the number of open files.
1971 Also, spreading the traffic among temporary files keeps the files shorter,
1972 and shorter files are more efficiently implemented by the UNIX
1973 I/O subsystem.
1974 .PP
1975 A
1976 .CW Disc
1977 is an array of fixed-length blocks, each of which contains
1978 between 1 and 4096 characters of active data.
1979 (The block size of our UNIX file system is 4096 bytes.)
1980 The block addresses within the temporary file and the length of each
1981 block are stored in a
1982 .CW List .
1983 When changes are made the live part of blocks may change size.
1984 Blocks are created and coalesced when necessary to try to keep the sizes
1985 between 2048 and 4096 bytes.
1986 An actively changing part of the
1987 .CW Disc
1988 therefore typically has about a kilobyte of slop that can be
1989 inserted or deleted
1990 without changing more than one block or affecting the block order.
1991 When an insertion would overflow a block, the block is split, a new one
1992 is allocated to receive the overflow, and the memory-resident list of blocks
1993 is rearranged to reflect the insertion of the new block.
1994 .PP
1995 Obviously, going to the disc for every modification to the file is
1996 prohibitively expensive.
1997 The data type
1998 .CW Buffer
1999 consists of a
2000 .CW Disc
2001 to hold the data and a
2002 .CW String
2003 that acts as a cache.
2004 This is the first of a series of caches throughout the data structures in
2005 .CW sam.
2006 The caches not only improve performance, they provide a way to organize
2007 the flow of data, particularly in the communication between the host
2008 and terminal.
2009 This idea is developed below, in the section on communications.
2010 .PP
2011 To reduce disc traffic, changes to a
2012 .CW Buffer
2013 are mediated by a variable-length string, in memory, that acts as a cache.
2014 When an insertion or deletion is made to a
2015 .CW Buffer ,
2016 if the change can be accommodated by the cache, it is done there.
2017 If the cache becomes bigger than a block because of an insertion,
2018 some of it is written to the
2019 .CW Disc
2020 and deleted from the cache.
2021 If the change does not intersect the cache, the cache is flushed.
2022 The cache is only loaded at the new position if the change is smaller than a block;
2023 otherwise, it is sent directly to the
2024 .CW Disc .
2025 This is because
2026 large changes are typically sequential,
2027 whereupon the next change is unlikely to overlap the current one.
2028 .PP
2029 A
2030 .CW File
2031 comprises a
2032 .CW String
2033 to hold the file name and some ancillary data such as dot and the modified bit.
2034 The most important components, though, are a pair of
2035 .CW Buffers ,
2036 one called the transcript and the other the contents.
2037 Their use is described in the next section.
2038 .PP
2039 The overall structure is shown in Figure 5.
2040 Although it may seem that the data is touched many times on its
2041 way from the
2042 .CW Disc ,
2043 it is read (by one UNIX system call) directly into the cache of the
2044 associated
2045 .CW Buffer ;
2046 no extra copy is done.
2047 Similarly, when flushing the cache, the text is written
2048 directly from the cache to disc.
2049 Most operations act directly on the text in the cache.
2050 A principle applied throughout
2051 .CW sam
2052 is that the fewer times the data is copied, the faster the program will run
2053 (see also the paper by Waite\u\s-4\&15\s+4\d).
2054 .KF
2055 .PS
2056 copy "fig5.pic"
2057 .PE
2058 .Cs
2059 Figure 5. File data structures.
2060 The temporary files are stored in the standard repository for such files
2061 on the host system.
2062 .Ce
2063 .KE
2064 .PP
2065 The contents of a
2066 .CW File
2067 are accessed by a routine that
2068 copies to a buffer a substring of a file starting at a specified offset.
2069 To read a byte at a time, a
2070 .CW File "" per-
2071 array is loaded starting from a specified initial position,
2072 and bytes may then be read from the array.
2073 The implementation is done by a macro similar to the C standard I/O
2074 .CW getc
2075 macro.\u\s-4\&14\s+4\d
2076 Because the reading may be done at any address, a minor change to the
2077 macro allows the file to be read backwards.
2078 This array is read-only; there is no
2079 .CW putc .
2080 .SH 2
2081 Doing and undoing
2082 .LP
2083 .CW Sam
2084 has an unusual method for managing changes to files.
2085 The command language makes it easy to specify multiple variable-length changes
2086 to a file millions of bytes long, and such changes
2087 must be made efficiently if the editor is to be practical.
2088 The usual techniques for inserting and deleting strings
2089 are inadequate under these conditions.
2090 The
2091 .CW Buffer
2092 and
2093 .CW Disc
2094 data structures are designed for efficient random access to long strings,
2095 but care must be taken to avoid super-linear behavior when making
2096 many changes simultaneously.
2097 .PP
2098 .CW Sam
2099 uses a two-pass algorithm for making changes, and treats each file as a database
2100 against which transactions are registered.
2101 Changes are not made directly to the contents.
2102 Instead, when a command is started, a `mark' containing
2103 a sequence number is placed in the transcript
2104 .CW Buffer ,
2105 and each change made to the file, either an insertion or deletion
2106 or a change to the file name,
2107 is appended to the end of the transcript.
2108 When the command is complete, the transcript is rewound to the
2109 mark and applied to the contents.
2110 .PP
2111 One reason for separating evaluation from
2112 application in this way is to simplify tracking the addresses of changes
2113 made in the middle of a long sequence.
2114 The two-pass algorithm also allows all changes to apply to the
2115 .I original
2116 data: no change can affect another change made in the same command.
2117 This is particularly important when evaluating an
2118 .CW x
2119 command because it prevents regular expression matches
2120 from stumbling over changes made earlier in the execution.
2121 Also, the two-pass
2122 algorithm is cleaner than the way other UNIX editors allow changes to
2123 affect each other;
2124 for example,
2125 .CW ed 's
2126 idioms to do things like delete every other line
2127 depend critically on the implementation.
2128 Instead,
2129 .CW sam 's
2130 simple model, in which all changes in a command occur effectively
2131 simultaneously, is easy to explain and to understand.
2132 .PP
2133 The records in the transcript are of the form ``delete substring from
2134 locations
2135 123 to 456'' and ``insert 11 characters `hello there' at location 789.''
2136 (It is an error if the changes are not at monotonically greater
2137 positions through the file.)
2138 While the update is occurring, these numbers must be
2139 offset by earlier changes, but that is straightforward and
2140 local to the update routine;
2141 moreover, all the numbers have been computed
2142 before the first is examined.
2143 .PP
2144 Treating the file as a transaction system has another advantage:
2145 undo is trivial.
2146 All it takes is to invert the transcript after it has been
2147 implemented, converting insertions
2148 into deletions and vice versa, and saving them in a holding
2149 .CW Buffer .
2150 The `do' transcript can then be deleted from
2151 the transcript
2152 .CW Buffer
2153 and replaced by the `undo' transcript.
2154 If an undo is requested, the transcript is rewound and the undo transcript
2155 executed.
2156 Because the transcript
2157 .CW Buffer
2158 is not truncated after each command, it accumulates
2159 successive changes.
2160 A sequence of undo commands
2161 can therefore back up the file arbitrarily,
2162 which is more helpful than the more commonly implemented self-inverse form of undo.
2163 .CW Sam "" (
2164 provides no way to undo an undo, but if it were desired,
2165 it would be easy to provide by re-interpreting the `do' transcript.)
2166 Each mark in the transcript contains a sequence number and the offset into
2167 the transcript of the previous mark, to aid in unwinding the transcript.
2168 Marks also contain the value of dot and the modified bit so these can be
2169 restored easily.
2170 Undoing multiple files is easy; it merely demands undoing all files whose
2171 latest change has the same sequence number as the current file.
2172 .PP
2173 Another benefit of having a transcript is that errors encountered in the middle
2174 of a complicated command need not leave the files in an intermediate state.
2175 By rewinding the transcript to the mark beginning the command,
2176 the partial command can be trivially undone.
2177 .PP
2178 When the update algorithm was first implemented, it was unacceptably slow,
2179 so a cache was added to coalesce nearby changes,
2180 replacing multiple small changes by a single larger one.
2181 This reduced the number
2182 of insertions into the transaction
2183 .CW Buffer ,
2184 and made a dramatic improvement in performance,
2185 but made it impossible
2186 to handle changes in non-monotonic order in the file; the caching method
2187 only works if changes don't overlap.
2188 Before the cache was added, the transaction could in principle be sorted
2189 if the changes were out of order, although
2190 this was never done.
2191 The current status is therefore acceptable performance with a minor
2192 restriction on global changes, which is sometimes, but rarely, an annoyance.
2193 .PP
2194 The update algorithm obviously paws the data more than simpler
2195 algorithms, but it is not prohibitively expensive;
2196 the caches help.
2197 (The principle of avoiding copying the data is still honored here,
2198 although not as piously:
2199 the data is moved from contents' cache to
2200 the transcript's all at once and through only one internal buffer.)
2201 Performance figures confirm the efficiency.
2202 To read from a dead start a hundred kilobyte file on a VAX-11/750
2203 takes 1.4 seconds of user time, 2.5 seconds of system time,
2204 and 5 seconds of real time.
2205 Reading the same file in
2206 .CW ed
2207 takes 6.0 seconds of user time, 1.7 seconds of system time,
2208 and 8 seconds of real time.
2209 .CW Sam
2210 uses about half the CPU time.
2211 A more interesting example is the one stated above:
2212 inserting a character between every pair of characters in the file.
2213 The
2214 .CW sam
2215 command is
2216 .P1
2217 ,y/@/ a/x/
2218 .P2
2219 and takes 3 CPU seconds per kilobyte of input file, of which
2220 about a third is spent in the regular expression code.
2221 This translates to about 500 changes per second.
2222 .CW Ed
2223 takes 1.5 seconds per kilobyte to make a similar change (ignoring newlines),
2224 but cannot undo it.
2225 The same example in
2226 .CW ex ,\u\s-4\&9\s+4\d
2227 a variant of
2228 .CW ed
2229 done at the University of California at Berkeley,
2230 which allows one level of undoing, again takes 3 seconds.
2231 In summary,
2232 .CW sam 's
2233 performance is comparable to that of other UNIX editors, although it solves
2234 a harder problem.
2235 .SH 2
2236 Communications
2237 .LP
2238 The discussion so far has described the implementation of the host part of
2239 .CW sam ;
2240 the next few sections explain how a machine with mouse and bitmap display
2241 can be engaged to improve interaction.
2242 .CW Sam
2243 is not the first editor to be written as two processes,\u\s-4\&16\s+4\d
2244 but its implementation
2245 has some unusual aspects.
2246 .PP
2247 There are several ways
2248 .CW sam 's
2249 host and terminal parts may be connected.
2250 The first and simplest is to forgo the terminal part and use the host
2251 part's command language to edit text on an ordinary terminal.
2252 This mode is invoked by starting
2253 .CW sam
2254 with the
2255 .CW -d
2256 option.
2257 With no options,
2258 .CW sam
2259 runs separate host and terminal programs,
2260 communicating with a message protocol over the physical
2261 connection that joins them.
2262 Typically, the connection is an RS-232 link between a Blit
2263 (the prototypical display for
2264 .CW sam )
2265 and a host running
2266 the Ninth Edition of the UNIX operating system.\u\s-4\&8\s+4\d
2267 (This is the version of the system used in the Computing Sciences Research
2268 Center at AT&T Bell Laboratories [now Lucent Technologies, Bell Labs], where I work.  Its relevant
2269 aspects are discussed in the Blit paper.\u\s-4\&1\s+4\d)
2270 The implementation of
2271 .CW sam
2272 for the SUN computer runs both processes on the same machine and
2273 connects them by a pipe.
2274 .PP
2275 The low bandwidth of an RS-232 link
2276 necessitated the split between
2277 the two programs.
2278 The division is a mixed blessing:
2279 a program in two parts is much harder to write and to debug
2280 than a self-contained one,
2281 but the split makes several unusual configurations possible.
2282 The terminal may be physically separated from the host, allowing the conveniences
2283 of a mouse and bitmap display to be taken home while leaving the files at work.
2284 It is also possible to run the host part on a remote machine:
2285 .P1
2286 sam -r host
2287 .P2
2288 connects to the terminal in the usual way, and then makes a call
2289 across the network to establish the host part of
2290 .CW sam
2291 on the named machine.
2292 Finally, it cross-connects the I/O to join the two parts.
2293 This allows
2294 .CW sam
2295 to be run on machines that do not support bitmap displays;
2296 for example,
2297 .CW sam
2298 is the editor of choice on our Cray X-MP/24.
2299 .CW Sam
2300 .CW -r
2301 involves
2302 .I three
2303 machines: the remote host, the terminal, and the local host.
2304 The local host's job is simple but vital: it passes the data
2305 between the remote host and terminal.
2306 .PP
2307 The host and terminal exchange messages asynchronously
2308 (rather than, say, as remote procedure calls) but there is no
2309 error detection or correction
2310 because, whatever the configuration, the connection is reliable.
2311 Because the terminal handles mundane interaction tasks such as
2312 popping up menus and interpreting the responses, the messages are about
2313 data, not actions.
2314 For example, the host knows nothing about what is displayed on the screen,
2315 and when the user types a character, the message sent to the host says
2316 ``insert a one-byte string at location 123 in file 7,'' not ``a character
2317 was typed at the current position in the current file.''
2318 In other words, the messages look very much like the transaction records
2319 in the transcripts.
2320 .PP
2321 Either the host or terminal part of
2322 .CW sam
2323 may initiate a change to a file.
2324 The command language operates on the host, while typing and some
2325 mouse operations are executed directly in the terminal to optimize response.
2326 Changes initiated by the host program must be transmitted to the terminal,
2327 and
2328 vice versa.
2329 (A token is exchanged to determine which end is in control,
2330 which means that characters typed while a time-consuming command runs
2331 must be buffered and do not appear until the command is complete.)
2332 To maintain consistent information,
2333 the host and terminal track changes through a per-file
2334 data structure that records what portions of the file
2335 the terminal has received.
2336 The data structure, called a
2337 .CW Rasp
2338 (a weak pun: it's a file with holes)
2339 is held and updated by both the host and terminal.
2340 A
2341 .CW Rasp
2342 is a list of
2343 .CW Strings
2344 holding those parts of the file known to the terminal,
2345 separated by counts of the number of bytes in the interstices.
2346 Of course, the host doesn't keep a separate copy of the data (it only needs
2347 the lengths of the various pieces),
2348 but the structure is the same on both ends.
2349 .PP
2350 The
2351 .CW Rasp
2352 in the terminal doubles as a cache.
2353 Since the terminal keeps the text for portions of the file it has displayed,
2354 it need not request data from the host when revisiting old parts of the file
2355 or redrawing obscured windows, which speeds things up considerably
2356 over low-speed links.
2357 .PP
2358 It's trivial for the terminal to maintain its
2359 .CW Rasp ,
2360 because all changes made on the terminal apply to parts of the file
2361 already loaded there.
2362 Changes made by the host are compared against the
2363 .CW Rasp
2364 during the update sequence after each command.
2365 Small changes to pieces of the file loaded in the terminal
2366 are sent in their entirety.
2367 Larger changes, and changes that fall entirely in the holes,
2368 are transmitted as messages without literal data:
2369 only the lengths of the deleted and inserted strings are transmitted.
2370 When a command is completed, the terminal examines its visible
2371 windows to see if any holes in their
2372 .CW Rasps
2373 intersect the visible portion of the file.
2374 It then requests the missing data from the host,
2375 along with up to 512 bytes of surrounding data, to minimize
2376 the number of messages when visiting a new portion of the file.
2377 This technique provides a kind of two-level lazy evaluation for the terminal.
2378 The first level sends a minimum of information about
2379 parts of the file not being edited interactively;
2380 the second level waits until a change is displayed before
2381 transmitting the new data.
2382 Of course,
2383 performance is also helped by having the terminal respond immediately to typing
2384 and simple mouse requests.
2385 Except for small changes to active pieces of the file, which are
2386 transmitted to the terminal without negotiation,
2387 the terminal is wholly responsible for deciding what is displayed;
2388 the host uses the
2389 .CW Rasp
2390 only to tell the terminal what might be relevant.
2391 .PP
2392 When a change is initiated by the host,
2393 the messages to the terminal describing the change
2394 are generated by the routine that applies the transcript of the changes
2395 to the contents of the
2396 .CW File .
2397 Since changes are undone by the same update routine,
2398 undoing requires
2399 no extra code in the communications;
2400 the usual messages describing changes to the file are sufficient
2401 to back up the screen image.
2402 .PP
2403 The
2404 .CW Rasp
2405 is a particularly good example of the way caches are used in
2406 .CW sam .
2407 First, it facilitates access to the active portion of the text by placing
2408 the busy text in main memory.
2409 In so doing, it provides efficient access
2410 to a large data structure that does not fit in memory.
2411 Since the form of data is to be imposed by the user, not by the program,
2412 and because characters will frequently be scanned sequentially,
2413 files are stored as flat objects.
2414 Caches help keep performance good and linear when working with such
2415 data.
2416 .PP
2417 Second, the
2418 .CW Rasp
2419 and several of the other caches have some
2420 .I read-ahead;
2421 that is, the cache is loaded with more information than is needed for
2422 the job immediately at hand.
2423 When manipulating linear structures, the accesses are usually sequential,
2424 and read-ahead can significantly reduce the average time to access the
2425 next element of the object.
2426 Sequential access is a common mode for people as well as programs;
2427 consider scrolling through a document while looking for something.
2428 .PP
2429 Finally, like any good data structure,
2430 the cache guides the algorithm, or at least the implementation.
2431 The
2432 .CW Rasp
2433 was actually invented to control the communications between the host and
2434 terminal parts, but I realized very early that it was also a form of
2435 cache.  Other caches were more explicitly intended to serve a double
2436 purpose: for example, the caches in
2437 .CW Files
2438 that coalesce updates not only reduce traffic to the
2439 transcript and contents
2440 .CW Buffers ,
2441 they also clump screen updates so that complicated changes to the
2442 screen are achieved in
2443 just a few messages to the terminal.
2444 This saved me considerable work: I did not need to write special
2445 code to optimize the message traffic to the
2446 terminal.
2447 Caches pay off in surprising ways.
2448 Also, they tend to be independent, so their performance improvements
2449 are multiplicative.
2450 .SH 2
2451 Data structures in the terminal
2452 .LP
2453 The terminal's job is to display and to maintain a consistent image of
2454 pieces of the files being edited.
2455 Because the text is always in memory, the data structures are
2456 considerably simpler than those in the host part.
2457 .PP
2458 .CW Sam
2459 typically has far more windows than does
2460 .CW mux ,
2461 the window system within which its Blit implementation runs.
2462 .CW Mux
2463 has a fairly small number of asynchronously updated windows;
2464 .CW sam
2465 needs a large number of synchronously updated windows that are
2466 usually static and often fully obscured.
2467 The different tradeoffs guided
2468 .CW sam
2469 away from the memory-intensive implementation of windows, called
2470 .CW Layers ,\u\s-4\&17\s+4\d
2471 used in
2472 .CW mux.
2473 Rather than depending on a complete bitmap image of the display for each window,
2474 .CW sam
2475 regenerates the image from its in-memory text
2476 (stored in the
2477 .CW Rasp )
2478 when necessary, although it will use such an image if it is available.
2479 Like
2480 .CW Layers ,
2481 though,
2482 .CW sam
2483 uses the screen bitmap as active storage in which to update the image using
2484 .CW bitblt .\u\s-4\&18,19\s+4\d
2485 The resulting organization, pictured in Figure 6,
2486 has a global array of windows, called
2487 .CW Flayers ,
2488 each of which holds an image of a piece of text held in a data structure
2489 called a
2490 .CW Frame ,
2491 which in turn represents
2492 a rectangular window full of text displayed in some
2493 .CW Bitmap .
2494 Each
2495 .CW Flayer
2496 appears in a global list that orders them all front-to-back
2497 on the display, and simultaneously as an element of a per-file array
2498 that holds all the open windows for that file.
2499 The complement in the terminal of the
2500 .CW File
2501 on the host is called a
2502 .CW Text ;
2503 each connects its
2504 .CW Flayers
2505 to the associated
2506 .CW Rasp .
2507 .KF
2508 .PS
2509 copy "fig6.pic"
2510 .PE
2511 .Cs
2512 Figure 6. Data structures in the terminal.
2513 .CW Flayers
2514 are also linked together into a front-to-back list.
2515 .CW Boxes
2516 are discussed in the next section.
2517 .Ce
2518 .KE
2519 .PP
2520 The
2521 .CW Bitmap
2522 for a
2523 .CW Frame
2524 contains the image of the text.
2525 For a fully visible window, the
2526 .CW Bitmap
2527 will be the screen (or at least the
2528 .CW Layer
2529 in which
2530 .CW sam
2531 is being run),
2532 while for partially obscured windows the
2533 .CW Bitmap
2534 will be off-screen.
2535 If the window is fully obscured, the
2536 .CW Bitmap
2537 will be null.
2538 .PP
2539 The
2540 .CW Bitmap
2541 is a kind of cache.
2542 When making changes to the display, most of the original image will
2543 look the same in the final image, and the update algorithms exploit this.
2544 The
2545 .CW Frame
2546 software updates the image in the
2547 .CW Bitmap
2548 incrementally; the
2549 .CW Bitmap
2550 is not just an image, it is a data structure.\u\s-4\&18,19\s+4\d
2551 The job of the software that updates the display is therefore
2552 to use as much as possible of the existing image (converting the
2553 text from ASCII characters to pixels is expensive) in a sort of two-dimensional
2554 string insertion algorithm.
2555 The details of this process are described in the next section.
2556 .PP
2557 The
2558 .CW Frame
2559 software has no code to support overlapping windows;
2560 its job is to keep a single
2561 .CW Bitmap
2562 up to date.
2563 It falls to the
2564 .CW Flayer
2565 software to multiplex the various
2566 .CW Bitmaps
2567 onto the screen.
2568 The problem of maintaining overlapping
2569 .CW Flayers
2570 is easier than for
2571 .CW Layers \u\s-4\&17\s+4\d
2572 because changes are made synchronously and because the contents of the window
2573 can be reconstructed from the data stored in the
2574 .CW Frame ;
2575 the
2576 .CW Layers
2577 software
2578 makes no such assumptions.
2579 In
2580 .CW sam ,
2581 the window being changed is almost always fully visible, because the current
2582 window is always fully visible, by construction.
2583 However, when multi-file changes are being made, or when
2584 more than one window is open on a file,
2585 it may be necessary to update partially obscured windows.
2586 .PP
2587 There are three cases: the window is
2588 fully visible, invisible (fully obscured), or partially visible.
2589 If fully visible, the
2590 .CW Bitmap
2591 is part of the screen, so when the
2592 .CW Flayer
2593 update routine calls the
2594 .CW Frame
2595 update routine, the screen will be updated directly.
2596 If the window is invisible,
2597 there is no associated
2598 .CW Bitmap ,
2599 and all that is necessary is to update the
2600 .CW Frame
2601 data structure, not the image.
2602 If the window is partially visible, the
2603 .CW Frame
2604 routine is called to update the image in the off-screen
2605 .CW Bitmap ,
2606 which may require regenerating it from the text of the window.
2607 The
2608 .CW Flayer
2609 code then clips this
2610 .CW Bitmap
2611 against the
2612 .CW Bitmaps
2613 of all
2614 .CW Frames
2615 in front of the
2616 .CW Frame
2617 being modified, and the remainder is copied to the display.
2618 .PP
2619 This is much faster than recreating the image off-screen
2620 for every change, or clipping all the changes made to the image
2621 during its update.
2622 Unfortunately, these caches can also consume prohibitive amounts of
2623 memory, so they are freed fairly liberally \(em after every change to the
2624 front-to-back order of the
2625 .CW Flayers .
2626 The result is that
2627 the off-screen
2628 .CW Bitmaps
2629 exist only while multi-window changes are occurring,
2630 which is the only time the performance improvement they provide is needed.
2631 Also, the user interface causes fully-obscured windows to be the
2632 easiest to make \(em
2633 creating a canonically sized and placed window requires only a button click
2634 \(em which reduces the need for caching still further.
2635 .PP
2636 .SH 2
2637 Screen update
2638 .LP
2639 Only two low-level primitives are needed for incremental update:
2640 .CW bitblt ,
2641 which copies rectangles of pixels, and
2642 .CW string
2643 (which in turn calls
2644 .CW bitblt ),
2645 which draws a null-terminated character string in a
2646 .CW Bitmap .
2647 A
2648 .CW Frame
2649 contains a list of
2650 .CW Boxes ,
2651 each of which defines a horizontal strip of text in the window
2652 (see Figure 7).
2653 A
2654 .CW Box
2655 has a character string
2656 .CW str ,
2657 and a
2658 .CW Rectangle
2659 .CW rect
2660 that defines the location of the strip in the window.
2661 (The text in
2662 .CW str
2663 is stored in the
2664 .CW Box
2665 separately from the
2666 .CW Rasp
2667 associated with the window's file, so
2668 .CW Boxes
2669 are self-contained.)
2670 The invariant is that
2671 the image of the
2672 .CW Box
2673 can be reproduced by calling
2674 .CW string
2675 with argument
2676 .CW str
2677 to draw the string in
2678 .CW rect ,
2679 and the resulting picture fits perfectly within
2680 .CW rect .
2681 In other words, the
2682 .CW Boxes
2683 define the tiling of the window.
2684 The tiling may be complicated by long lines of text, which
2685 are folded onto the next line.
2686 Some editors use horizontal scrolling to avoid this complication,
2687 but to be comfortable this technique requires that lines not be
2688 .I too
2689 long;
2690 .CW sam
2691 has no such restriction.
2692 Also, and perhaps more importantly, UNIX programs and terminals traditionally fold
2693 long lines to make their contents fully visible.
2694 .PP
2695 Two special kinds of
2696 .CW Boxes
2697 contain a single
2698 character: either a newline or a tab.
2699 Newlines and tabs are white space.
2700 A newline
2701 .CW Box
2702 always extends to the right edge of the window,
2703 forcing the following
2704 .CW Box
2705 to the next line.
2706 The width of a tab depends on where it is located:
2707 it forces the next
2708 .CW Box
2709 to begin at a tab location.
2710 Tabs also
2711 have a minimum width equivalent to a blank (blanks are
2712 drawn by
2713 .CW string
2714 and are not treated specially); newlines have a minimum width of zero.
2715 .KF
2716 .PS
2717 copy "fig7.pic"
2718 .PE
2719 .sp .5
2720 .Cs
2721 Figure 7. A line of text showing its
2722 .CW Boxes .
2723 The first two blank
2724 .CW Boxes
2725 contain tabs; the last contains a newline.
2726 Spaces are handled as ordinary characters.
2727 .Ce
2728 .KE
2729 .PP
2730 The update algorithms always use the
2731 .CW Bitmap
2732 image of the text (either the display or cache
2733 .CW Bitmap );
2734 they never examine the characters within a
2735 .CW Box
2736 except when the
2737 .CW Box
2738 needs to be split in two.
2739 Before a change, the window consists of a tiling of
2740 .CW Boxes ;
2741 after the change the window is tiled differently.
2742 The update algorithms rearrange the tiles in place, without
2743 backup storage.
2744 The algorithms are not strictly optimal \(em for example, they can
2745 clear a pixel that is later going to be written upon \(em
2746 but they never move a tile that doesn't need to be moved,
2747 and they move each tile at most once.
2748 .CW Frinsert
2749 on a Blit can absorb over a thousand characters a second if the strings
2750 being inserted are a few tens of characters long.
2751 .PP
2752 Consider
2753 .CW frdelete .
2754 Its job is to delete a substring from a
2755 .CW Frame
2756 and restore the image of the
2757 .CW Frame .
2758 The image of a substring has a peculiar shape (see Figure 2) comprising
2759 possibly a partial line,
2760 zero or more full lines,
2761 and possibly a final partial line.
2762 For reference, call this the
2763 .I
2764 Z-shape.
2765 .R
2766 .CW Frdelete
2767 begins by splitting, if necessary, the
2768 .CW Boxes
2769 containing the ends of
2770 the substring so the substring begins and ends on
2771 .CW Box
2772 boundaries.
2773 Because the substring is being deleted, its image is not needed,
2774 so the Z-shape is then cleared.
2775 Then, tiles (that is, the images of
2776 .CW Boxes )
2777 are copied, using
2778 .CW bitblt ,
2779 from immediately after the Z-shape to
2780 the beginning of the Z-shape,
2781 resulting in a new Z-shape.
2782 .CW Boxes "" (
2783 whose contents would span two lines in the new position must first be split.)
2784 .PP
2785 Copying the remainder of the
2786 .CW Frame
2787 tile by tile
2788 this way will clearly accomplish the deletion but eventually,
2789 typically when the copying algorithm encounters a tab or newline,
2790 the old and new
2791 .CW x
2792 coordinates of the tile
2793 to be copied are the same.
2794 This correspondence implies
2795 that the Z-shape has its beginning and ending edges aligned
2796 vertically, and a sequence of at most two
2797 .CW bitblts
2798 can be used to copy the remaining tiles.
2799 The last step is to clear out the resulting empty space at the bottom
2800 of the window;
2801 the number of lines to be cleared is the number of complete lines in the
2802 Z-shape closed by the final
2803 .CW bitblts.
2804 The final step is to merge horizontally adjacent
2805 .CW Boxes
2806 of plain text.
2807 The complete source to
2808 .CW frdelete
2809 is less than 100 lines of C.
2810 .PP
2811 .CW frinsert
2812 is more complicated because it must do four passes:
2813 one to construct the
2814 .CW Box
2815 list for the inserted string,
2816 one to reconnoitre,
2817 one to copy (in opposite order to
2818 .CW frdelete )
2819 the
2820 .CW Boxes
2821 to make the hole for the new text,
2822 and finally one to copy the new text into place.
2823 Overall, though,
2824 .CW frinsert
2825 has a similar flavor to
2826 .CW frdelete ,
2827 and needn't be described further.
2828 .CW Frinsert
2829 and its subsidiary routines comprise 211 lines of C.
2830 .PP
2831 The terminal source code is 3024 lines of C,
2832 and the host source is 5797 lines.
2833 .SH
2834 Discussion
2835 .SH 2
2836 History
2837 .LP
2838 The immediate ancestor of
2839 .CW sam
2840 was the original text editor for the Blit, called
2841 .CW jim .
2842 .CW Sam
2843 inherited
2844 .CW jim 's
2845 two-process structure and mouse language almost unchanged, but
2846 .CW jim
2847 suffered from several drawbacks that were addressed in the design of
2848 .CW sam .
2849 The most important of these was the lack of a command language.
2850 Although
2851 .CW jim
2852 was easy to use for simple editing, it provided no direct help with
2853 large or repetitive editing tasks.  Instead, it provided a command to pass
2854 selected text through a shell pipeline,
2855 but this was no more satisfactory than could be expected of a stopgap measure.
2856 .PP
2857 .CW Jim
2858 was written primarily as a vehicle for experimenting with a mouse-based
2859 interface to text, and the experiment was successful.
2860 .CW Jim
2861 had some spin-offs:
2862 .CW mux ,
2863 the second window system for the Blit, is essentially a multiplexed
2864 version of the terminal part of
2865 .CW jim ;
2866 and the debugger
2867 .CW pi 's
2868 user interface\u\s-4\&20\s+4\d was closely modeled on
2869 .CW jim 's.
2870 But after a couple of years,
2871 .CW jim
2872 had become difficult to maintain and limiting to use,
2873 and its replacement was overdue.
2874 .PP
2875 I began the design of
2876 .CW sam
2877 by asking
2878 .CW jim
2879 customers what they wanted.
2880 This was probably a mistake; the answers were essentially a list of features
2881 to be found in other editors, which did not provide any of the
2882 guiding principles I was seeking.
2883 For instance, one common request was for a ``global substitute,''
2884 but no one suggested how to provide it within a cut-and-paste editor.
2885 I was looking for a scheme that would
2886 support such specialized features comfortably in the context of some
2887 general command language.
2888 Ideas were not forthcoming, though, particularly given my insistence
2889 on removing all limits on file sizes, line lengths and so on.
2890 Even worse, I recognized that, since the mouse could easily
2891 indicate a region of the screen that was not an integral number of lines,
2892 the command language would best forget about newlines altogether,
2893 and that meant the command language had to treat the file as a single
2894 string, not an array of lines.
2895 .PP
2896 Eventually, I decided that thinking was not getting me very far and it was
2897 time to try building.
2898 I knew that the terminal part could be built easily \(em
2899 that part of
2900 .CW jim
2901 behaved acceptably well \(em and that most of the hard work was going
2902 to be in the host part: the file interface, command interpreter and so on.
2903 Moreover, I had some ideas about how the architecture of
2904 .CW jim
2905 could be improved without destroying its basic structure, which I liked
2906 in principle but which hadn't worked out as well as I had hoped.
2907 So I began by designing the file data structure,
2908 starting with the way
2909 .CW jim
2910 worked \(em comparable to a single structure merging
2911 .CW Disc
2912 and
2913 .CW Buffer ,
2914 which I split to make the cache more general
2915 \(em and thinking about how global substitute could be implemented.
2916 The answer was clearly that it had to be done in two passes,
2917 and the transcript-oriented implementation fell out naturally.
2918 .PP
2919 .CW Sam
2920 was written bottom-up,
2921 starting from the data structures and algorithms for manipulating text,
2922 through the command language and up to the code for maintaining
2923 the display.
2924 In retrospect, it turned out well, but this implementation method is
2925 not recommended in general.
2926 There were several times when I had a large body of interesting code
2927 assembled and no clue how to proceed with it.
2928 The command language, in particular, took almost a year to figure out,
2929 but can be implemented (given what was there at the beginning of that year)
2930 in a day or two.  Similarly, inventing the
2931 .CW Rasp
2932 data structure delayed the
2933 connection of the host and terminal pieces by another few months.
2934 .CW Sam
2935 took about two years to write, although only about four months were
2936 spent actually working on it.
2937 .PP
2938 Part of the design process was unusual:
2939 the subset of the protocol that maintains the
2940 .CW Rasp
2941 was simulated, debugged
2942 and verified by an automatic protocol analyzer,\u\s-4\&21\s+4\d and was bug-free
2943 from the start.
2944 The rest of the protocol, concerned mostly
2945 with keeping menus up to date,
2946 was unfortunately too unwieldy for such analysis,
2947 and was debugged by more traditional methods, primarily
2948 by logging in a file all messages in and out of the host.
2949 .SH 2
2950 Reflections
2951 .LP
2952 .CW Sam
2953 is essentially the only interactive editor used by the sixty or so members of
2954 the computing science research center in which I work.
2955 The same could not be said of
2956 .CW jim ;
2957 the lack of a command language kept some people from adopting it.
2958 The union of a user interface as comfortable as
2959 .CW jim 's
2960 with a command language as powerful as
2961 .CW ed 's†
2962 .FS
2963 .vs 9
2964 †The people who criticize
2965 .CW ed
2966 as an interactive program often forget that it and its close relative
2967 .CW sed \u\s-4\&7\s+4\d
2968 still thrive as programmable editors.  The strength of these programs is
2969 independent of their convenience for interactive editing.
2970 .br
2971 .vs
2972 .FE
2973 is essential to
2974 .CW sam 's
2975 success.
2976 When
2977 .CW sam
2978 was first made available to the
2979 .CW jim
2980 community,
2981 almost everyone switched to it within two or three days.
2982 In the months that followed, even people who had never adopted
2983 .CW jim
2984 started using
2985 .CW sam
2986 exclusively.
2987 .PP
2988 To be honest,
2989 .CW ed
2990 still gets occasional use, but usually when
2991 something quick needs to be done and the overhead of
2992 downloading the terminal part of
2993 .CW sam
2994 isn't worth the trouble.
2995 Also, as a `line' editor,
2996 .CW sam
2997 .CW -d
2998 is a bit odd;
2999 when using a good old ASCII terminal, it's comforting to have
3000 a true line editor.
3001 But it is fair to say that
3002 .CW sam 's
3003 command language has displaced
3004 .CW ed 's
3005 for most of the complicated editing that has kept line editors
3006 (that is, command-driven editors) with us.
3007 .PP
3008 .CW Sam 's
3009 command language is even fancier than
3010 .CW ed 's,
3011 and most
3012 .CW sam
3013 customers don't come near to using all its capabilities.
3014 Does it need to be so sophisticated?
3015 I think the answer is yes, for two reasons.
3016 .PP
3017 First, the
3018 .I model
3019 for
3020 .CW sam 's
3021 command language is really relatively simple, and certainly simpler than that of
3022 .CW ed .
3023 For instance, there is only one kind of textual loop in
3024 .CW sam
3025 \(em the
3026 .CW x
3027 command \(em
3028 while
3029 .CW ed
3030 has three (the
3031 .CW g
3032 command, the global flag on substitutions, and the implicit loop over
3033 lines in multi-line substitutions).
3034 Also,
3035 .CW ed 's
3036 substitute command is necessary to make changes within lines, but in
3037 .CW sam
3038 the
3039 .CW s
3040 command is more of a familiar convenience than a necessity;
3041 .CW c
3042 and
3043 .CW t
3044 can do all the work.
3045 .PP
3046 Second,
3047 given a community that expects an editor to be about as powerful as
3048 .CW ed ,
3049 it's hard to see how
3050 .CW sam
3051 could really be much simpler and still satisfy that expectation.
3052 People want to do ``global substitutes,'' and most are content
3053 to have the recipe for that and a few other fancy changes.
3054 The sophistication of the command language is really just a veneer
3055 over a design that makes it possible to do global substitutes
3056 in a screen editor.
3057 Some people will always want something more, however, and it's gratifying to
3058 be able to provide it.
3059 The real power of
3060 .CW sam 's
3061 command language comes from composability of the operators, which is by
3062 nature orthogonal to the underlying model.
3063 In other words,
3064 .CW sam
3065 is not itself complex, but it makes complex things possible.
3066 If you don't want to do anything complex, you can ignore the
3067 complexity altogether, and many people do so.
3068 .PP
3069 Sometimes I am asked the opposite question: why didn't I just make
3070 .CW sam
3071 a real programmable editor, with macros and variables and so on?
3072 The main reason is a matter of taste: I like the editor
3073 to be the same every time I use it.
3074 There is one technical reason, though:
3075 programmability in editors is largely a workaround for insufficient
3076 interactivity.
3077 Programmable editors are used to make particular, usually short-term,
3078 things easy to do, such as by providing shorthands for common actions.
3079 If things are generally easy to do in the first place,
3080 shorthands are not as helpful.
3081 .CW Sam
3082 makes common editing operations very easy, and the solutions to
3083 complex editing problems seem commensurate with the problems themselves.
3084 Also, the ability to edit the
3085 .CW sam
3086 window makes it easy to repeat commands \(em it only takes a mouse button click
3087 to execute a command again.
3088 .SH 2
3089 Pros and cons
3090 .LP
3091 .CW Sam
3092 has several other good points,
3093 and its share of problems.
3094 Among the good things is the idea of
3095 structural regular expressions,
3096 whose usefulness has only begun to be explored.
3097 They were arrived at serendipitously when I attempted to distill the essence of
3098 .CW ed 's
3099 way of doing global substitution and recognized that the looping command in
3100 .CW ed
3101 was implicitly imposing a structure (an array of lines) on the file.
3102 .PP
3103 Another of
3104 .CW sam 's
3105 good things is its undo capability.
3106 I had never before used an editor with a true undo,
3107 but I would never go back now.
3108 Undo
3109 .I must
3110 be done well, but if it is, it can be relied on.
3111 For example,
3112 it's safe to experiment if you're not sure how to write some intricate command,
3113 because if you make a mistake, it can be fixed simply and reliably.
3114 I learned two things about undo from writing
3115 .CW sam :
3116 first, it's easy to provide if you design it in from the beginning, and
3117 second, it's necessary, particularly if the system has some subtle
3118 properties that may be unfamiliar or error-prone for users.
3119 .PP
3120 .CW Sam 's
3121 lack of internal limits and sizes is a virtue.
3122 Because it avoids all fixed-size tables and data structures,
3123 .CW sam
3124 is able to make global changes to files that some of our other
3125 tools cannot even read.
3126 Moreover, the design keeps the performance linear when doing such
3127 operations, although I must admit
3128 .CW sam
3129 does get slow when editing a huge file.
3130 .PP
3131 Now, the problems.
3132 Externally, the most obvious is that it is poorly integrated into the
3133 surrounding window system.
3134 By design, the user interface in
3135 .CW sam
3136 feels almost identical to that of
3137 .CW mux ,
3138 but a thick wall separates text in
3139 .CW sam
3140 from the programs running in
3141 .CW mux .
3142 For instance, the `snarf buffer' in
3143 .CW sam
3144 must be maintained separately from that in
3145 .CW mux .
3146 This is regrettable, but probably necessary given the unusual configuration
3147 of the system, with a programmable terminal on the far end of an RS-232 link.
3148 .PP
3149 .CW Sam
3150 is reliable; otherwise, people wouldn't use it.
3151 But it was written over such a long time, and has so many new (to me)
3152 ideas in it, that I would like to see it done over again to clean
3153 up the code and remove many of the lingering problems in the implementation.
3154 The worst part is in the interconnection of the host and terminal parts,
3155 which might even be able to go away in a redesign for a more
3156 conventional window system.
3157 The program must be split in two to use the terminal effectively,
3158 but the low bandwidth of the connection forces the separation to
3159 occur in an inconvenient part of the design if performance is to be acceptable.
3160 A simple remote procedure call
3161 protocol driven by the host, emitting only graphics
3162 commands, would be easy to write but wouldn't have nearly the
3163 necessary responsiveness.  On the other hand, if the terminal were in control
3164 and requested much simpler file services from the host, regular expression
3165 searches would require that the terminal read the entire file over its RS-232
3166 link, which would be unreasonably slow.
3167 A compromise in which either end can take control is necessary.
3168 In retrospect, the communications protocol should have been
3169 designed and verified formally, although I do not know of any tool
3170 that can adequately relate the protocol to
3171 its implementation.
3172 .PP
3173 Not all of
3174 .CW sam 's
3175 users are comfortable with its command language, and few are adept.
3176 Some (venerable) people use a sort of
3177 .CW ed \& ``
3178 subset'' of
3179 .CW sam 's
3180 command language,
3181 and even ask why
3182 .CW sam 's
3183 command language is not exactly
3184 .CW ed 's.
3185 (The reason, of course, is that
3186 .CW sam 's
3187 model for text does not include newlines, which are central to
3188 .CW ed .
3189 Making the text an array of newlines to the command language would
3190 be too much of a break from the seamless model provided by the mouse.
3191 Some editors, such as
3192 .CW vi ,
3193 are willing to make this break, though.)
3194 The difficulty is that
3195 .CW sam 's
3196 syntax is so close to
3197 .CW ed 's
3198 that people believe it
3199 .I should
3200 be the same.
3201 I thought, with some justification in hindsight,
3202 that making
3203 .CW sam
3204 similar to
3205 .CW ed
3206 would make it easier to learn and to accept.
3207 But I may have overstepped and raised the users'
3208 expectations too much.
3209 It's hard to decide which way to resolve this problem.
3210 .PP
3211 Finally, there is a tradeoff in
3212 .CW sam
3213 that was decided by the environment in which it runs:
3214 .CW sam
3215 is a multi-file editor, although in a different system there might instead be
3216 multiple single-file editors.
3217 The decision was made primarily because starting a new program in a Blit is
3218 time-consuming.
3219 If the choice could be made freely, however, I would
3220 still choose the multi-file architecture, because it allows
3221 groups of files to be handled as a unit;
3222 the usefulness of the multi-file commands is incontrovertible.
3223 It is delightful to have the source to an entire program
3224 available at your fingertips.
3225 .SH
3226 Acknowledgements
3227 .LP
3228 Tom Cargill suggested the idea behind the
3229 .CW Rasp
3230 data structure.
3231 Norman Wilson and Ken Thompson influenced the command language.
3232 This paper was improved by comments from
3233 Al Aho,
3234 Jon Bentley,
3235 Chris Fraser,
3236 Gerard Holzmann,
3237 Brian Kernighan,
3238 Ted Kowalski,
3239 Doug McIlroy
3240 and
3241 Dennis Ritchie.