1 .HTML "The Text Editor sam
2 .Vx 17 11 November 87 1 32 "ROB PIKE" "THE TEXT EDITOR SAM"
4 .ds DR "Revised 1 July 1987
5 .de CW \" puts first arg in CW font, same as UL; maintains font
6 \%\&\\$3\f(CW\\$1\fP\&\\$2
24 .ie h .html - <center><img src="\\$1.gif" /></center>
28 The Text Editor \&\f(CWsam\fP
31 rob@plan9.bell-labs.com
35 is an interactive multi-file text editor intended for
37 A textual command language
38 supplements the mouse-driven, cut-and-paste interface
40 repetitive editing tasks easy to specify.
41 The language is characterized by the composition of regular expressions
42 to describe the structure of the text being modified.
43 The treatment of files as a database, with changes logged
44 as atomic transactions, guides the implementation and
45 makes a general `undo' mechanism straightforward.
48 is implemented as two processes connected by a low-bandwidth stream,
49 one process handling the display and the other the editing
50 algorithms. Therefore it can run with the display process
51 in a bitmap terminal and the editor on a local host,
52 with both processes on a bitmap-equipped host, or with
53 the display process in the terminal and the editor in a
55 By suppressing the display process,
56 it can even run without a bitmap terminal.
58 This paper is reprinted from Software\(emPractice and Experience,
59 Vol 17, number 11, pp. 813-845, November 1987.
60 The paper has not been updated for the Plan 9 manuals. Although
62 has not changed much since the paper was written, the system around it certainly has.
63 Nonetheless, the description here still stands as the best introduction to the editor.
69 is an interactive text editor that combines cut-and-paste interactive editing with
70 an unusual command language based on the composition of regular expressions.
71 It is written as two programs: one, the `host part,' runs on a UNIX system
72 and implements the command language and provides file access; the other, the
73 `terminal part,' runs asynchronously
74 on a machine with a mouse and bitmap display
75 and supports the display and interactive editing.
76 The host part may be even run in isolation on an ordinary terminal
77 to edit text using the command
78 language, much like a traditional line editor,
79 without assistance from a mouse or display.
81 the terminal part runs on a Blit\u\s-4\&1\s+4\d terminal
82 (actually on a Teletype DMD 5620, the production version of the Blit), whose
83 host connection is an ordinary 9600 bps RS232 link;
84 on the SUN computer the host and display processes run on a single machine,
90 It has no facilities for multiple fonts, graphics or tables,
91 unlike MacWrite,\u\s-4\&2\s+4\d Bravo,\u\s-4\&3\s+4\d Tioga\u\s-4\&4\s+4\d
92 or Lara.\u\s-4\&5\s+4\d
93 Also unlike them, it has a rich command language.
94 (Throughout this paper, the phrase
99 textual commands; commands activated from the mouse form the
103 developed as an editor for use by programmers, and tries to join
104 the styles of the UNIX text editor
105 .CW ed \u\s-4\&6,7\s+4\d
106 with that of interactive cut-and-paste editors by
107 providing a comfortable mouse-driven interface
108 to a program with a solid command language driven by regular expressions.
109 The command language developed more than the mouse language, and
110 acquired a notation for describing the structure of files
111 more richly than as a sequence of lines,
112 using a dataflow-like syntax for specifying changes.
114 The interactive style was influenced by
115 .CW jim ,\u\s-4\&1\s+4\d
116 an early cut-and-paste editor for the Blit, and by
117 .CW mux ,\u\s-4\&8\s+4\d
118 the Blit window system.
120 merges the original Blit window system,
121 .CW mpx ,\u\s-4\&1\s+4\d
122 with cut-and-paste editing, forming something like a
123 multiplexed version of
125 that edits the output of (and input to) command sessions rather than files.
127 The first part of this paper describes the command language, then the mouse
128 language, and explains how they interact.
129 That is followed by a description of the implementation,
130 first of the host part, then of the terminal part.
131 A principle that influenced the design of
133 is that it should have no explicit limits, such as upper limits on
134 file size or line length.
135 A secondary consideration is that it be efficient.
136 To honor these two goals together requires a method for efficiently
138 huge strings (files) without breaking them into lines,
139 perhaps while making thousands of changes
140 under control of the command language.
143 treat the file as a transaction database, implementing changes as atomic
144 updates. These updates may be unwound easily to `undo' changes.
145 Efficiency is achieved through a collection of caches that minimizes
146 disc traffic and data motion, both within the two parts of the program
151 is fairly straightforward.
152 More interesting is how the two halves of the editor stay
153 synchronized when either half may initiate a change.
154 This is achieved through a data structure that organizes the
155 communications and is maintained in parallel by both halves.
157 The last part of the paper chronicles the writing of
159 and discusses the lessons that were learned through its development and use.
161 The paper is long, but is composed largely of two papers of reasonable length:
162 a description of the user interface of
164 and a discussion of its implementation.
165 They are combined because the implementation is strongly influenced by
166 the user interface, and vice versa.
171 is a text editor for multiple files.
172 File names may be provided when it is invoked:
176 and there are commands
177 to add new files and discard unneeded ones.
178 Files are not read until necessary
179 to complete some command.
180 Editing operations apply to an internal copy
181 made when the file is read; the UNIX file associated with the copy
182 is changed only by an explicit command.
183 To simplify the discussion, the internal copy is here called a
185 while the disc-resident original is called a
191 is usually connected to a bitmap display that presents a cut-and-paste
192 editor driven by the mouse.
193 In this mode, the command language is still available:
194 text typed in a special window, called the
198 as commands to be executed in the current file.
199 Cut-and-paste editing may be used in any window \(em even in the
201 window to construct commands.
202 The other mode of operation, invoked by starting
207 does not use the mouse or bitmap display, but still permits
208 editing using the textual command language, even on an ordinary terminal,
209 interactively or from a script.
211 The following sections describe first the command language (under
215 window), and then the mouse interface.
216 These two languages are nearly independent, but connect through the
223 A file consists of its contents, which are an array of characters
224 (that is, a string); the
226 of the associated disc file; the
230 that states whether the contents match those of
232 and a substring of the contents, called the
238 (see Figures 1 and 2).
239 If the current text is a null string, dot falls between characters.
242 of dot is the location of the current text; the
244 of dot are the characters it contains.
246 imparts to the text no two-dimensional interpretation such as columns
247 or fields; text is always one-dimensional.
248 Even the idea of a `line' of text as understood by most UNIX programs
249 \(em a sequence of characters terminated by a newline character \(em
250 is only weakly supported.
256 is the file to which editing commands refer.
257 The current text is therefore dot in the current file.
258 If a command doesn't explicitly name a particular file or piece of text,
259 the command is assumed to apply to the current text.
260 For the moment, ignore the presence of multiple files and consider
261 editing a single file.
267 screen, with the editing menu presented.
270 (command language) window is in the middle, with file windows above and below.
271 (The user interface makes it easy to create these abutting windows.)
272 The partially obscured window is a third file window.
273 The uppermost window is that to which typing and mouse operations apply,
274 as indicated by its heavy border.
275 Each window has its current text highlighted in reverse video.
278 window's current text is the null string on the last visible line,
279 indicated by a vertical bar.
284 Commands have one-letter names.
285 Except for non-editing commands such as writing
286 the file to disc, most commands make some change
287 to the text in dot and leave dot set to the text resulting from the change.
288 For example, the delete command,
290 deletes the text in dot, replacing it by the null string and setting dot
294 replaces dot by text delimited by an arbitrary punctuation character,
300 replaces the text in dot by the string
306 (append) adds the string after dot, and
310 (insert) inserts before dot.
311 All three leave dot set to the new text,
314 Newlines are part of the syntax of commands:
315 the newline character lexically terminates a command.
316 Within the inserted text, however, newlines are never implicit.
317 But since it is often convenient to insert multiple lines of text,
320 syntax for that case:
324 to be inserted in the file,
325 terminated by a period
329 In the one-line syntax, a newline character may be specified by a C-like
334 replaces dot by a single newline character.
337 also has a substitute command,
340 s/\f2expression\fP/\f2replacement\fP/
342 substitutes the replacement text for the first match, in dot,
343 of the regular expression.
344 Thus, if dot is the string
354 is unnecessary, but it was inherited from
356 and it has some convenient variations.
357 For instance, the replacement text may include the matched text,
361 s/Peter/Oh, &, &, &, &!/
364 There are also three commands that apply programs
369 replaces dot by the output of the UNIX program.
373 runs the program with dot as its standard input, and
375 does both. For example,
379 replaces dot by the result of applying the standard sorting utility to it.
380 Again, newlines have no special significance for these
383 The text acted upon and resulting from these commands is not necessarily
384 bounded by newlines, although for connection with UNIX programs,
385 newlines may be necessary to obey conventions.
389 prints the contents of dot.
398 Table I. \f(CWSam\fP commands
408 a/\f2text\fP/ Append text after dot
409 c/\f2text\fP/ Change text in dot
410 i/\f2text\fP/ Insert text before dot
412 s/\f2regexp\fP/\f2text\fP/ Substitute text for match of regular expression in dot
413 m \f2address\fP Move text in dot after address
414 t \f2address\fP Copy text in dot after address
418 \f1Display commands\fP
422 p Print contents of dot
423 \&= Print value (line numbers and character numbers) of dot
431 b \f2file-list\fP Set current file to first file in list that \f(CWsam\fP has in menu
432 B \f2file-list\fP Same as \f(CWb\fP, but load new files
433 n Print menu lines of all files
434 D \f2file-list\fP Delete named files from \f(CWsam\fP
442 e \f2filename\fP Replace file with named disc file
443 r \f2filename\fP Replace dot by contents of named disc file
444 w \f2filename\fP Write file to named disc file
445 f \f2filename\fP Set file name and print new menu line
446 < \f2UNIX-command\fP Replace dot by standard output of command
447 > \f2UNIX-command\fP Send dot to standard input of command
448 | \f2UNIX-command\fP Replace dot by result of command applied to dot
449 ! \f2UNIX-command\fP Run the command
453 \f1Loops and conditionals\fP
457 x/\f2regexp\fP/ \f2command\fP For each match of regexp, set dot and run command
458 y/\f2regexp\fP/ \f2command\fP Between adjacent matches of regexp, set dot and run command
459 X/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line matches regexp
460 Y/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line does not match
461 g/\f2regexp\fP/ \f2command\fP If dot contains a match of regexp, run command
462 v/\f2regexp\fP/ \f2command\fP If dot does not contain a match of regexp, run command
470 k Set address mark to value of dot
472 u \f2n\fP Undo last \f2n\fP (default 1) changes
473 { } Braces group commands
482 The value of dot may be changed by
486 The simplest address is a line number:
490 refers to the third line of the file, so
494 deletes the third line of the file, and implicitly renumbers
495 the lines so the old line 4 is now numbered 3.
496 (This is one of the few places where
498 deals with lines directly.)
501 is the null string at the beginning of the file.
502 If a command consists of only an address, a
504 command is assumed, so typing an unadorned
506 prints line 3 on the terminal.
507 There are a couple of other basic addresses:
508 a period addresses dot itself; and
511 addresses the null string at the end of the file.
513 An address is always a single substring of the file.
516 addresses the characters
517 after the second newline of
518 the file through the third newline of the file.
523 is constructed by the comma operator
525 \f2address1\fP,\f2address2\fP
527 and addresses the substring of the file from the beginning of
531 For example, the command
533 prints the third through fifth lines of the file and
535 deletes the text from the beginning of dot to the end of the file.
537 These addresses are all absolute positions in the file, but
539 also has relative addresses, indicated by
547 is the third line before the end of the file and
551 is the line after dot.
552 If no address appears to the left of the
557 if nothing appears to the right,
562 may be abbreviated to just a plus sign.
566 operator acts relative to the end of its first argument, while the
568 operator acts relative to the beginning. Thus
570 addresses the first line after dot,
572 addresses the first line before dot, and
574 refers to the line containing the end of dot. (Dot may span multiple lines, and
576 selects the line after the end of dot, then
580 The final type of address is a regular expression, which addresses the
581 text matched by the expression. The expression is enclosed in slashes, as in
585 The expressions are the same as those in the UNIX program
586 .CW egrep ,\u\s-4\&6,7\s+4\d
587 and include closures, alternations, and so on.
592 string that matches the expression, that is,
593 the first match after the point where the search is started,
594 and if more than one match begins at the same spot, the longest such match.
595 (I assume familiarity with the syntax for regular expressions in UNIX programs.\u\s-4\&9\s+4\d)
602 character in the file,
606 matches the next run of one or more
616 For compatibility with other UNIX programs, the `any character' operator,
618 does not match a newline, so
622 matches the text from dot to the end of the line, but excludes the newline
623 and so will not match across
626 Regular expressions are always relative addresses.
627 The direction is forwards by default,
630 is really an abbreviation for
632 The search can be reversed with a minus sign, so
639 Regular expressions may be used with other address forms, so
654 Table II. \f(CWSam\fP addresses
660 \f1Simple addresses\fP
664 #\f2n\fP The empty string after character \f2n\fP
665 \f2n\fP Line \f2n\fP.
666 /\f2regexp\fP/ The first following match of the regular expression
667 -/\f2regexp\fP/ The first previous match of the regular expression
668 $ The null string at the end of the file
670 \&' The address mark, set by \f(CWk\fP command
671 "\f2regexp\fP" Dot in the file whose menu line matches regexp
675 \f1Compound addresses\fP
679 \f2a1\fP+\f2a2\fP The address \f2a2\fP evaluated starting at right of \f2a1\fP
680 \f2a1\fP-\f2a2\fP \f2a2\fP evaluated in the reverse direction starting at left of \f2a1\fP
681 \f2a1\fP,\f2a2\fP From the left of \f2a1\fP to the right of \f2a2\fP (default \f(CW0,$\fP)
682 \f2a1\fP;\f2a2\fP Like \f(CW,\fP but sets dot after evaluating \f2a1\fP
693 are high precedence, while
723 The language discussed so far will not seem novel
724 to people who use UNIX text editors
728 .CW vi .\u\s-4\&9\s+4\d
729 Moreover, the kinds of editing operations these commands allow, with the exception
730 of regular expressions and line numbers,
731 are clearly more conveniently handled by a mouse-based interface.
734 mouse language (discussed at length below) is the means by which
735 simple changes are usually made.
736 For large or repetitive changes, however, a textual language
737 outperforms a manual interface.
739 Imagine that, instead of deleting just one occurrence of the string
741 we wanted to eliminate every
743 What's needed is an iterator that runs a command for each occurrence of some
750 x/\f2expression\fP/ \f2command\fP
752 finds all matches in dot of the specified expression, and for each
753 such match, sets dot to the text matched and runs the command.
759 (Blanks in these examples are to improve readability;
761 neither requires nor interprets them.)
762 This searches the entire file
764 for occurrences of the string
768 command with dot set to each such occurrence.
769 (By contrast, the comparable
771 command would delete all
780 is commonly used, and may be abbreviated to just a comma.
787 one for each appearance in the file, with no intervening text (not even newlines
788 to separate the instances).
790 Of course, the text extracted by
792 may be selected by a regular expression,
793 which complicates deciding what set of matches is chosen \(em
794 matches may overlap. This is resolved by generating the matches
795 starting from the beginning of dot using the leftmost-longest rule,
796 and searching for each match starting from the end of the previous one.
797 Regular expressions may also match null strings, but a null match
798 adjacent to a non-null match is never selected; at least one character
812 matches the null strings separating the
817 command has a complement,
819 with similar syntax, that executes the command with dot set to the text
821 the matches of the expression.
828 produces the same result as the example above.
834 commands are looping constructs, and
836 has a pair of conditional commands to go with them.
837 They have similar syntax:
839 g/\f2expression\fP/ \f2command\fP
842 runs the command exactly once if dot contains a match of the expression.
843 This is different from
845 which runs the command for
851 merely tests, without changing the value of dot.
856 deletes all occurrences of
862 deletes the whole file (reduces it to a null string) if
864 occurs anywhere in the text.
865 The complementary conditional is
867 which runs the command if there is
869 match of the expression.
871 These control-structure-like commands may be composed to construct more
872 involved operations. For example, to print those lines of text that
876 , x/.*\en/ g/Peter/ p
880 breaks the file into lines, the
882 selects those lines containing
887 This command gives an address for the
889 command (the whole file), but because
891 does not have an explicit address, it applies to the value of
894 command, that is, to each line.
897 except for the command to write a file to disc use dot for the
900 Composition may be continued indefinitely.
902 , x/.*\en/ g/Peter/ v/SaltPeter/ p
904 prints those lines containing
911 Structural Regular Expressions
913 Unlike other UNIX text editors,
914 including the non-interactive ones such as
917 .CW awk ,\u\s-4\&7\s+4\d
919 is good for manipulating files with multi-line `records.'
920 An example is an on-line phone book composed of records,
921 separated by blank lines, of the form
924 44 Turnip Ave., Endive, NJ
928 16 Potato St., Cabbagetown, NJ
933 The format may be encoded as a regular expression:
937 that is, a sequence of one or more non-blank lines.
938 The command to print Mr. Tic's entire record is then
940 , x/(.+\en)+/ g/^Herbert Tic$/ p
942 and that to extract just the phone number is
944 , x/(.+\en)+/ g/^Herbert Tic$/ x/^[0-9]*-[0-9]*\en/ p
946 The latter command breaks the file into records,
947 chooses Mr. Tic's record,
948 extracts the phone number from the record,
949 and finally prints the number.
951 A more involved problem is that of
952 renaming a particular variable, say
957 The obvious first attempt,
961 is badly flawed: it changes not only the variable
966 We need to extract all the variables, and select those that match
971 , x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
974 .CW [A-Za-z_][A-Za-z_0-9]*
975 matches C identifiers.
978 selects those containing an
982 rejects those containing two (or more) characters, and finally
984 changes the remainder (identifiers
988 This version clearly works much better, but there may still be problems.
989 For example, in C character and string constants, the sequence
991 is interpreted as a newline character, and we don't want to change it to
993 This problem can be forestalled with a
997 , y/\e\en/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
1001 is necessary because of lexical conventions in regular expressions),
1002 or we could even reject character constants and strings outright:
1004 ,y/'[^']*'/ y/"[^"]*"/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/
1008 commands in this version exclude from consideration all character constants
1010 The only remaining problem is to deal with the possible occurrence of
1014 within these sequences, but it's easy to see how to resolve this difficulty.
1016 The point of these composed commands is successive refinement.
1017 A simple version of the command is tried, and if it's not good enough,
1018 it can be honed by adding a clause or two.
1019 (Mistakes can be undone; see below.
1020 Also, the mouse language makes it unnecessary to retype the command each time.)
1021 The resulting chains of commands are somewhat reminiscent of
1022 shell pipelines.\u\s-4\&7\s+4\d
1023 Unlike pipelines, though, which pass along modified
1029 The text at each step of the command is the same, but which pieces
1030 are selected is refined step by step until the correct piece is
1031 available to the final step of the command line, which ultimately makes the change.
1033 In other UNIX programs, regular expressions are used only for selection,
1037 command, never for extraction as in the
1042 For example, patterns in
1043 .CW awk \u\s-4\&7\s+4\d
1044 are used to select lines to be operated on, but cannot be used
1045 to describe the format of the input text, or to handle newline-free text.
1046 The use of regular expressions to describe the structure of a piece
1047 of text rather than its contents, as in the
1050 has been given a name:
1052 structural regular expressions.
1054 When they are composed, as in the above example,
1055 they are pleasantly expressive.
1056 Their use is discussed at greater length elsewhere.\u\s-4\&10\s+4\d
1062 has a few other commands, mostly relating to input and output.
1066 replaces the contents and name of the current file with those of the named
1071 writes the contents to the named disc file; and
1075 replaces dot with the contents of the named disc file.
1076 All these commands use the current file's name if none is specified.
1081 changes the name associated with the file and displays the result:
1085 This output is called the file's
1089 because it is the contents of the file's line in the button 3 menu (described
1092 The first three characters are a concise notation for the state of the file.
1093 The apostrophe signifies that the file is modified.
1094 The minus sign indicates the number of windows
1095 open on the file (see the next section):
1101 means more than one.
1102 Finally, the period indicates that this is the current file.
1103 These characters are useful for controlling the
1105 command, described shortly.
1108 may be started with a set of disc files (such as all the source for
1109 a program) by invoking it with a list of file names as arguments, and
1110 more may be added or deleted on demand.
1112 B discfile1 discfile2 ...
1114 adds the named files to
1118 D discfile1 discfile2 ...
1122 memory (without effect on associated disc files).
1123 Both these commands have a syntax for using the shell\u\s-4\&7\s+4\d
1124 (the UNIX command interpreter) to generate the lists:
1128 will add all C source files, and
1130 B <grep -l variable *.c
1132 will add all C source files referencing a particular variable
1135 lists all files in its arguments that contain matches of
1136 the specified regular expression).
1139 without arguments deletes the current file.
1141 There are two ways to change which file is current:
1145 makes the named file current.
1149 does the same, but also adds any new files to
1152 (In practice, of course, the current file
1153 is usually chosen by mouse actions, not by textual commands.)
1154 The other way is to use a form of address that refers to files:
1156 "\f2expression\fP" \f2address\fP
1158 refers to the address evaluated in the file whose menu line
1159 matches the expression (there must be exactly one match).
1164 refers to the third line of the file whose name matches
1166 This is most useful in the move
1174 makes a copy of the current file at the beginning of
1180 is a looping construct, like
1182 that refers to files instead of strings:
1184 X/\f2expression\fP/ \f2command\fP
1186 runs the command in all
1187 files whose menu lines match the expression. The best example is
1191 which writes to disc all modified files.
1193 is the complement of
1195 it runs the command on all files whose menu lines don't match the expression:
1199 deletes all files that don't have
1201 in their names, that is, it keeps all C source files and deletes the rest.
1203 Braces allow commands to be grouped, so
1210 is syntactically a single command that runs two commands.
1213 X/\e.c/ ,g/variable/ {
1215 , x/.*\en/ g/variable/ p
1218 finds all occurrences of
1220 in C source files, and prints
1221 out the file names and lines of each match.
1222 The precise semantics of compound operations is discussed in the implementation
1228 undoes the last command,
1229 no matter how many files were affected.
1230 Multiple undo operations move further back in time, so
1235 (which may be abbreviated
1237 undoes the last two commands. An undo may not be undone, however, nor
1238 may any command that adds or deletes files.
1239 Everything else is undoable, though, including for example
1246 restores the state of the file completely, including its name, dot,
1247 and modified bit. Because of the undo, potentially dangerous commands
1248 are not guarded by confirmations. Only
1250 which destroys the information necessary to restore itself, is protected.
1251 It will not delete a modified file, but a second
1253 of the same file will succeed regardless.
1256 command, which exits
1258 is similarly guarded.
1263 is most commonly run
1264 connected to a bitmap display and mouse for interactive editing.
1265 The only difference in the command language
1266 between regular, mouse-driven
1270 is that if an address
1271 is provided without a command,
1273 will print the text referenced by the address, but
1276 will highlight it on the screen \(em in fact,
1277 dot is always highlighted (see Figure 2).
1284 window. The scroll bar down the left
1285 represents the file, with the bubble showing the fraction
1286 visible in the window.
1287 The scroll bar may be manipulated by the mouse for convenient browsing.
1289 which is highlighted, need not fit on a line. Here it consists of one partial
1290 line, one complete line, and final partial line.
1294 Each file may have zero or more windows open on the display.
1295 At any time, only one window in all of
1301 that is, the window to which typing and mouse actions refer;
1304 window (that in which commands may be typed)
1305 or one of the file windows.
1306 When a file has multiple windows, the image of the file in each window
1307 is always kept up to date.
1308 The current file is the last file affected by a command,
1312 the current window is not a window on the current file.
1313 However, each window on a file has its own value of dot,
1314 and when switching between windows on a single file,
1315 the file's value of dot is changed to that of the window.
1316 Thus, flipping between windows behaves in the obvious, convenient way.
1318 The mouse on the Blit has three buttons, numbered left to right.
1319 Button 3 has a list of commands to manipulate windows,
1320 followed by a list of `menu lines' exactly as printed by the
1322 command, one per file (not one per window).
1323 These menu lines are sorted by file name.
1324 If the list is long, the Blit menu software will make it more manageable
1325 by generating a scrolling menu instead of an unwieldy long list.
1326 Using the menu to select a file from the list makes that file the current
1327 file, and the most recently current window in that file the current window.
1328 But if that file is already current, selecting it in the menu cycles through
1329 the windows on the file; this simple trick avoids a special menu to
1330 choose windows on a file.
1331 If there is no window open on the file,
1333 changes the mouse cursor to prompt the user to create one.
1335 The commands on the button 3 menu are straightforward (see Figure 3), and
1336 are like the commands to manipulate windows in
1337 .CW mux ,\u\s-4\&8\s+4\d
1338 the Blit's window system.
1340 makes a new file, and gives it one empty window, whose size is determined
1341 by a rectangle swept by the mouse.
1343 prompts for a window to be selected, and
1344 makes a clone of that window; this is how multiple windows are created on one file.
1346 changes the size of the indicated window, and
1348 deletes it. If that is the last window open on the file,
1352 command on the file.
1356 command on the file; it is in the menu purely for convenience.
1359 is a menu item that appears between the commands and the file names.
1360 Selecting it makes the
1362 window the current window,
1363 causing subsequent typing to be interpreted as commands.
1367 Figure 3. The menu on button 3.
1368 The black rectangle on the left is a scroll bar; the menu is limited to
1369 the length shown to prevent its becoming unwieldy.
1372 line is a list of commands;
1373 beneath it is a list of files, presented exactly as with the
1381 requests that a window be swept, in response to
1386 it changes the mouse cursor from the usual arrow to a box with
1388 In this state, the mouse may be used to indicate an arbitrary rectangle by
1389 pressing button 3 at one corner and releasing it at the opposite corner.
1391 button 3 may simply be clicked,
1394 creates the maximal rectangle that contains the cursor
1400 window in the middle of the screen, the user can define two regions (one above,
1401 one below) in which stacked fully-overlapping
1402 windows can be created with minimal fuss (see Figure 1).
1403 This simple user interface trick makes window creation noticeably easier.
1405 The cut-and-paste editor is essentially the same as that in Smalltalk-80.\u\s-4\&11\s+4\d
1406 The text in dot is always highlighted on the screen.
1407 When a character is typed it replaces dot, and sets dot to the null
1408 string after the character. Thus, ordinary typing inserts text.
1409 Button 1 is used for selection:
1410 pressing the button, moving the mouse, and lifting the button
1411 selects (sets dot to) the text between the points where the
1412 button was pressed and released.
1413 Pressing and releasing at the same point selects a null string; this
1414 is called clicking. Clicking twice quickly, or
1418 selects larger objects;
1419 for example, double clicking in a word selects the word,
1420 double clicking just inside an opening bracket selects the text
1421 contained in the brackets (handling nested brackets correctly),
1423 parentheses, quotes, and so on.
1424 The double-clicking rules reflect a bias toward
1428 were intended more for word processing, double-clicks would probably
1429 select linguistic structures such as sentences.
1431 If button 1 is pressed outside the current window, it makes the indicated
1433 This is the easiest way to switch between windows and files.
1435 Pressing button 2 brings up a menu of editing functions (see Figure 4).
1436 These mostly apply to the selected text:
1438 deletes the selected text, and remembers it in a hidden buffer called the
1443 replaces the selected text by the contents of the snarf buffer,
1445 just copies the selected text to the snarf buffer,
1447 searches forward for the next literal occurrence of the selected text, and
1449 exchanges snarf buffers with the window system in which
1452 Finally, the last regular expression used appears as a menu entry
1454 forward for the next occurrence of a match for the expression.
1459 Figure 4. The menu on button 2.
1460 The bottom entry tracks the most recently used regular expression, which may
1465 The relationship between the command language and the mouse language is
1466 entirely due to the equality of dot and the selected text chosen
1467 with button 1 on the mouse.
1468 For example, to make a set of changes in a C subroutine, dot can be
1469 set by double clicking on the left brace that begins the subroutine,
1470 which sets dot for the command language.
1471 An address-free command then typed in the
1473 window will apply only to the text between the opening and closing
1474 braces of the function.
1475 The idea is to select what you want, and then say what you want
1476 to do with it, whether invoked by a menu selection or by a typed command.
1477 And of course, the value of dot is highlighted on
1478 the display after the command completes.
1479 This relationship between mouse interface and command language
1480 is clumsy to explain, but comfortable, even natural, in practice.
1484 The next few sections describe how
1486 is put together, first the host part,
1487 then the inter-component communication,
1488 then the terminal part.
1489 After explaining how the command language is implemented,
1490 the discussion follows (roughly) the path of a character
1491 from the temporary file on disc to the screen.
1492 The presentation centers on the data structures,
1493 because that is how the program was designed and because
1494 the algorithms are easy to provide, given the right data
1497 Parsing and execution
1499 The command language is interpreted by parsing each command with a
1500 table-driven recursive
1501 descent parser, and when a complete command is assembled, invoking a top-down
1503 Most editors instead employ a simple character-at-a-time
1505 Use of a parser makes it
1506 easy and unambiguous to detect when a command is complete,
1507 which has two advantages.
1508 First, escape conventions such as backslashes to quote
1509 multiple-line commands are unnecessary; if the command isn't finished,
1510 the parser keeps reading. For example, a multiple-line append driven by an
1512 command is straightforward:
1515 one line about Peter
1516 another line about Peter
1519 Other UNIX editors would require a backslash after all but the last line.
1521 The other advantage is specific to the two-process structure of
1523 The host process must decide when a command is completed so the
1524 command interpreter can be called. This problem is easily resolved
1525 by having the lexical analyzer read the single stream of events from the
1526 terminal, directly executing all typing and mouse commands,
1527 but passing to the parser characters typed to the
1530 This scheme is slightly complicated by the availability of cut-and-paste
1533 window, but that difficulty is resolved by applying the rules
1536 when a newline is typed to the
1538 window, all text between the newline and the previously typed newline
1539 is made available to the parser.
1540 This permits arbitrary editing to be done to a command before
1541 typing newline and thereby requesting execution.
1543 The parser is driven by a table because the syntax of addresses
1544 and commands is regular enough
1545 to be encoded compactly. There are few special cases, such as the
1546 replacement text in a substitution, so the syntax of almost all commands
1547 can be encoded with a few flags.
1548 These include whether the command allows an address (for example,
1550 does not), whether it takes a regular expression (as in
1554 whether it takes replacement text (as in
1558 which may be multi-line, and so on.
1559 The internal syntax of regular expressions is handled by a separate
1560 parser; a regular expression is a leaf of the command parse tree.
1561 Regular expressions are discussed fully in the next section.
1563 The parser table also has information about defaults, so the interpreter
1564 is always called with a complete tree. For example, the parser fills in
1569 in the abbreviated address
1574 to the left of an unadorned regular expression in an address,
1575 and provides the usual default address
1577 (dot) for commands that expect an address but are not given one.
1579 Once a complete command is parsed, the evaluation is easy.
1580 The address is evaluated left-to-right starting from the value of dot,
1581 with a mostly ordinary expression evaluator.
1582 Addresses, like many of the data structures in
1584 are held in a C structure and passed around by value:
1586 typedef long Posn; /* Position in a file */
1587 typedef struct Range{
1590 typedef struct Address{
1595 An address is encoded as a substring (character positions
1603 is described in detail below.)
1605 The address interpreter is an
1607 function that traverses the parse tree describing an address (the
1608 parse tree for the address has type
1612 address(ap, a, sign)
1624 a.r.p1=a.r.p2=a.f->nbytes;
1627 a=matchfile(a, ap->aregexp)->dot;
1630 a2=address(ap->right, a, 0);
1631 a=address(ap->left, a, 0);
1632 if(a.f!=a2.f || a2.r.p2<a.r.p1)
1638 while((ap=ap->right)!=0);
1643 Throughout, errors are handled by a non-local
1648 hidden in a routine called
1650 that immediately aborts the execution, retracts any
1651 partially made changes (see the section below on `undoing'), and
1652 returns to the top level of the parser.
1655 is an enumeration type that
1656 is translated to a terse but possibly helpful
1657 message such as `?addresses out of order.'
1658 Very common messages are kept short; for example the message for
1659 a failed regular expression search is `?search.'
1661 Character addresses such as
1663 are trivial to implement, as the
1665 data structure is accessible by character number.
1668 keeps no information about the position of newlines \(em it is too
1669 expensive to track dynamically \(em so line addresses are computed by reading
1670 the file, counting newlines. Except in very large files, this has proven
1671 acceptable: file access is fast enough to make the technique practical,
1672 and lines are not central to the structure of the command language.
1674 The command interpreter, called
1676 is also straightforward. The parse table includes a
1677 function to call to interpret a particular command. That function
1678 receives as arguments
1679 the calculated address
1681 and the command tree (of type
1683 which may contain information such as the subtree for compound commands.
1684 Here, for example, is the function for the
1695 compile(cp->regexp);
1696 if(execute(a.f, a.r.p1, a.r.p2)!=(cp->cmdchar=='v')){
1698 return cmdexec(a, cp->subcmd);
1700 return TRUE; /* cause execution to continue */
1706 are part of the regular expression code, described in the next section.)
1707 Because the parser and the
1709 data structure do most of the work, most commands
1710 are similarly brief.
1714 The regular expression code in
1716 is an interpreted, rather than compiled on-the-fly, implementation of Thompson's
1717 non-deterministic finite automaton algorithm.\u\s-4\&12\s+4\d
1718 The syntax and semantics of the expressions are as in the UNIX program
1720 including alternation, closures, character classes, and so on.
1721 The only changes in the notation are two additions:
1723 is translated to, and matches, a newline character, and
1725 matches any character. In
1729 matches any character except newline, and in
1731 the same rule seemed safest, to prevent idioms like
1733 from spanning newlines.
1735 expressions are arguably too complicated for an interactive editor \(em
1736 certainly it would make sense if all the special characters were two-character
1737 sequences, so that most of the punctuation characters wouldn't have
1738 peculiar meanings \(em but for an interesting command language, full
1739 regular expressions are necessary, and
1741 defines the full regular expression syntax for UNIX programs.
1742 Also, it seemed superfluous to define a new syntax, since various UNIX programs
1747 define too many already.
1749 The expressions are compiled by a routine,
1751 that generates the description of the non-deterministic finite state machine.
1754 interprets the machine to generate the leftmost-longest match of the
1755 expression in a substring of the file.
1756 The algorithm is described elsewhere.\u\s-4\&12,13\s+4\d
1759 whether a match was found, and sets a global variable,
1762 to the substring matched.
1764 A trick is required to evaluate the expression in reverse, such as when
1765 searching backwards for an expression.
1770 looks backwards through the file for a match of the expression.
1771 The expression, however, is defined for a forward search.
1772 The solution is to construct a machine identical to the machine
1773 for a forward search except for a reversal of all the concatenation
1774 operators (the other operators are symmetric under direction reversal),
1775 to exchange the meaning of the operators
1779 and then to read the file backwards, looking for the
1780 usual earliest longest match.
1783 generates only one match each time it is called.
1784 To interpret looping constructs such as the
1788 must therefore synchronize between
1792 problems with null matches.
1793 For example, even given the leftmost-longest rule,
1796 matches three times in the string
1800 the null string between the
1804 and the final null string).
1805 After returning a match for the
1808 must not match the null string before the
1810 The algorithm starts
1812 at the end of its previous match, and
1813 if the match it returns
1814 is null and abuts the previous match, rejects the match and advances
1815 the initial position one character.
1819 The C language has no memory allocation primitives, although a standard
1822 provides adequate service for simple programs.
1823 For specific uses, however,
1824 it can be better to write a custom allocator.
1825 The allocator (or rather, pair of allocators) described here
1826 work in both the terminal and host parts of
1828 They are designed for efficient manipulation of strings,
1829 which are allocated and freed frequently and vary in length from essentially
1830 zero to 32 Kbytes (very large strings are written to disc).
1831 More important, strings may be large and change size often,
1832 so to minimize memory usage it is helpful to reclaim and to coalesce the
1833 unused portions of strings when they are truncated.
1835 Objects to be allocated in
1840 which are small and often addressed by pointer variables;
1841 the second is variable-sized arrays of characters
1843 base pointer is always used to access them.
1844 The memory allocator in
1846 is therefore in two parts:
1847 first, a traditional first-fit allocator that provides fixed storage for
1849 and second, a garbage-compacting allocator that reduces storage
1850 overhead for variable-sized objects, at the cost of some bookkeeping.
1851 The two types of objects are allocated from adjoining arenas, with
1852 the garbage-compacting allocator controlling the arena with higher addresses.
1853 Separating into two arenas simplifies compaction and prevents fragmentation due
1854 to immovable objects.
1855 The access rules for garbage-compactable objects
1856 (discussed in the next paragraph) allow them to be relocated, so when
1857 the first-fit arena needs space, it moves the garbage-compacted arena
1858 to higher addresses to make room. Storage is therefore created only
1859 at successively higher addresses, either when more garbage-compacted
1860 space is needed or when the first-fit arena pushes up the other arena.
1862 Objects that may be compacted declare to the
1863 allocator a cell that is guaranteed to be the sole repository of the
1864 address of the object whenever a compaction can occur.
1865 The compactor can then update the address when the object is moved.
1866 For example, the implementation of type
1868 (really a variable-length array)
1871 typedef struct List{
1878 cell must always be used directly, and never copied. When a
1880 is to be created the
1882 structure is allocated in the ordinary first-fit arena
1885 is allocated in the garbage-compacted arena.
1886 A similar data type for strings, called
1888 stores variable-length character arrays of up to 32767 elements.
1890 A related matter of programming style:
1892 frequently passes structures by value, which
1893 simplifies the code.
1894 Traditionally, C programs have
1895 passed structures by reference, but implicit allocation on
1896 the stack is easier to use.
1897 Structure passing is a relatively new feature of C
1899 standard reference manual for C\u\s-4\&14\s+4\d), and is poorly supported in most
1900 commercial C compilers.
1901 It's convenient and expressive, though,
1902 and simplifies memory management by
1903 avoiding the allocator altogether
1904 and eliminating pointer aliases.
1906 Data structures for manipulating files
1910 showed that the requirements
1911 of the file data structure were few, but strict.
1912 First, files need to be read and written quickly;
1913 adding a fresh file must be painless.
1914 Second, the implementation must place no arbitrary upper limit on
1915 the number or sizes of files. (It should be practical to edit many files,
1916 and files up to megabytes in length should be handled gracefully.)
1917 This implies that files be stored on disc, not in main memory.
1918 (Aficionados of virtual memory may argue otherwise, but the
1919 implementation of virtual
1920 memory in our system is not something to depend on
1921 for good performance.)
1922 Third, changes to files need be made by only two primitives:
1923 deletion and insertion.
1924 These are inverses of each other,
1925 which simplifies the implementation of the undo operation.
1927 it must be easy and efficient to access the file, either
1928 forwards or backwards, a byte at a time.
1932 data type is constructed from three simpler data structures that hold arrays
1934 Each of these types has an insertion and deletion operator, and the
1935 insertion and deletion operators of the
1937 type itself are constructed from them.
1939 The simplest type is the
1941 which is used to hold strings in main memory.
1942 The code that manages
1944 guarantees that they will never be longer
1945 than some moderate size, and in practice they are rarely larger than 8 Kbytes.
1947 have two purposes: they hold short strings like file names with little overhead,
1948 and because they are deliberately small, they are efficient to modify.
1949 They are therefore used as the data structure for in-memory caches.
1951 The disc copy of the file is managed by a data structure called a
1953 which corresponds to a temporary file. A
1955 has no storage in main memory other than bookkeeping information;
1956 the actual data being held is all on the disc.
1957 To reduce the number of open files needed,
1959 opens a dozen temporary UNIX files and multiplexes the
1962 This permits many files to
1963 be edited; the entire
1965 source (48 files) may be edited comfortably with a single
1968 Allocating one temporary file per
1970 would strain the operating system's limit on the number of open files.
1971 Also, spreading the traffic among temporary files keeps the files shorter,
1972 and shorter files are more efficiently implemented by the UNIX
1977 is an array of fixed-length blocks, each of which contains
1978 between 1 and 4096 characters of active data.
1979 (The block size of our UNIX file system is 4096 bytes.)
1980 The block addresses within the temporary file and the length of each
1981 block are stored in a
1983 When changes are made the live part of blocks may change size.
1984 Blocks are created and coalesced when necessary to try to keep the sizes
1985 between 2048 and 4096 bytes.
1986 An actively changing part of the
1988 therefore typically has about a kilobyte of slop that can be
1990 without changing more than one block or affecting the block order.
1991 When an insertion would overflow a block, the block is split, a new one
1992 is allocated to receive the overflow, and the memory-resident list of blocks
1993 is rearranged to reflect the insertion of the new block.
1995 Obviously, going to the disc for every modification to the file is
1996 prohibitively expensive.
2001 to hold the data and a
2003 that acts as a cache.
2004 This is the first of a series of caches throughout the data structures in
2006 The caches not only improve performance, they provide a way to organize
2007 the flow of data, particularly in the communication between the host
2009 This idea is developed below, in the section on communications.
2011 To reduce disc traffic, changes to a
2013 are mediated by a variable-length string, in memory, that acts as a cache.
2014 When an insertion or deletion is made to a
2016 if the change can be accommodated by the cache, it is done there.
2017 If the cache becomes bigger than a block because of an insertion,
2018 some of it is written to the
2020 and deleted from the cache.
2021 If the change does not intersect the cache, the cache is flushed.
2022 The cache is only loaded at the new position if the change is smaller than a block;
2023 otherwise, it is sent directly to the
2026 large changes are typically sequential,
2027 whereupon the next change is unlikely to overlap the current one.
2033 to hold the file name and some ancillary data such as dot and the modified bit.
2034 The most important components, though, are a pair of
2036 one called the transcript and the other the contents.
2037 Their use is described in the next section.
2039 The overall structure is shown in Figure 5.
2040 Although it may seem that the data is touched many times on its
2043 it is read (by one UNIX system call) directly into the cache of the
2046 no extra copy is done.
2047 Similarly, when flushing the cache, the text is written
2048 directly from the cache to disc.
2049 Most operations act directly on the text in the cache.
2050 A principle applied throughout
2052 is that the fewer times the data is copied, the faster the program will run
2053 (see also the paper by Waite\u\s-4\&15\s+4\d).
2059 Figure 5. File data structures.
2060 The temporary files are stored in the standard repository for such files
2067 are accessed by a routine that
2068 copies to a buffer a substring of a file starting at a specified offset.
2069 To read a byte at a time, a
2071 array is loaded starting from a specified initial position,
2072 and bytes may then be read from the array.
2073 The implementation is done by a macro similar to the C standard I/O
2075 macro.\u\s-4\&14\s+4\d
2076 Because the reading may be done at any address, a minor change to the
2077 macro allows the file to be read backwards.
2078 This array is read-only; there is no
2084 has an unusual method for managing changes to files.
2085 The command language makes it easy to specify multiple variable-length changes
2086 to a file millions of bytes long, and such changes
2087 must be made efficiently if the editor is to be practical.
2088 The usual techniques for inserting and deleting strings
2089 are inadequate under these conditions.
2094 data structures are designed for efficient random access to long strings,
2095 but care must be taken to avoid super-linear behavior when making
2096 many changes simultaneously.
2099 uses a two-pass algorithm for making changes, and treats each file as a database
2100 against which transactions are registered.
2101 Changes are not made directly to the contents.
2102 Instead, when a command is started, a `mark' containing
2103 a sequence number is placed in the transcript
2105 and each change made to the file, either an insertion or deletion
2106 or a change to the file name,
2107 is appended to the end of the transcript.
2108 When the command is complete, the transcript is rewound to the
2109 mark and applied to the contents.
2111 One reason for separating evaluation from
2112 application in this way is to simplify tracking the addresses of changes
2113 made in the middle of a long sequence.
2114 The two-pass algorithm also allows all changes to apply to the
2116 data: no change can affect another change made in the same command.
2117 This is particularly important when evaluating an
2119 command because it prevents regular expression matches
2120 from stumbling over changes made earlier in the execution.
2122 algorithm is cleaner than the way other UNIX editors allow changes to
2126 idioms to do things like delete every other line
2127 depend critically on the implementation.
2130 simple model, in which all changes in a command occur effectively
2131 simultaneously, is easy to explain and to understand.
2133 The records in the transcript are of the form ``delete substring from
2135 123 to 456'' and ``insert 11 characters `hello there' at location 789.''
2136 (It is an error if the changes are not at monotonically greater
2137 positions through the file.)
2138 While the update is occurring, these numbers must be
2139 offset by earlier changes, but that is straightforward and
2140 local to the update routine;
2141 moreover, all the numbers have been computed
2142 before the first is examined.
2144 Treating the file as a transaction system has another advantage:
2146 All it takes is to invert the transcript after it has been
2147 implemented, converting insertions
2148 into deletions and vice versa, and saving them in a holding
2150 The `do' transcript can then be deleted from
2153 and replaced by the `undo' transcript.
2154 If an undo is requested, the transcript is rewound and the undo transcript
2156 Because the transcript
2158 is not truncated after each command, it accumulates
2160 A sequence of undo commands
2161 can therefore back up the file arbitrarily,
2162 which is more helpful than the more commonly implemented self-inverse form of undo.
2164 provides no way to undo an undo, but if it were desired,
2165 it would be easy to provide by re-interpreting the `do' transcript.)
2166 Each mark in the transcript contains a sequence number and the offset into
2167 the transcript of the previous mark, to aid in unwinding the transcript.
2168 Marks also contain the value of dot and the modified bit so these can be
2170 Undoing multiple files is easy; it merely demands undoing all files whose
2171 latest change has the same sequence number as the current file.
2173 Another benefit of having a transcript is that errors encountered in the middle
2174 of a complicated command need not leave the files in an intermediate state.
2175 By rewinding the transcript to the mark beginning the command,
2176 the partial command can be trivially undone.
2178 When the update algorithm was first implemented, it was unacceptably slow,
2179 so a cache was added to coalesce nearby changes,
2180 replacing multiple small changes by a single larger one.
2181 This reduced the number
2182 of insertions into the transaction
2184 and made a dramatic improvement in performance,
2185 but made it impossible
2186 to handle changes in non-monotonic order in the file; the caching method
2187 only works if changes don't overlap.
2188 Before the cache was added, the transaction could in principle be sorted
2189 if the changes were out of order, although
2190 this was never done.
2191 The current status is therefore acceptable performance with a minor
2192 restriction on global changes, which is sometimes, but rarely, an annoyance.
2194 The update algorithm obviously paws the data more than simpler
2195 algorithms, but it is not prohibitively expensive;
2197 (The principle of avoiding copying the data is still honored here,
2198 although not as piously:
2199 the data is moved from contents' cache to
2200 the transcript's all at once and through only one internal buffer.)
2201 Performance figures confirm the efficiency.
2202 To read from a dead start a hundred kilobyte file on a VAX-11/750
2203 takes 1.4 seconds of user time, 2.5 seconds of system time,
2204 and 5 seconds of real time.
2205 Reading the same file in
2207 takes 6.0 seconds of user time, 1.7 seconds of system time,
2208 and 8 seconds of real time.
2210 uses about half the CPU time.
2211 A more interesting example is the one stated above:
2212 inserting a character between every pair of characters in the file.
2219 and takes 3 CPU seconds per kilobyte of input file, of which
2220 about a third is spent in the regular expression code.
2221 This translates to about 500 changes per second.
2223 takes 1.5 seconds per kilobyte to make a similar change (ignoring newlines),
2226 .CW ex ,\u\s-4\&9\s+4\d
2229 done at the University of California at Berkeley,
2230 which allows one level of undoing, again takes 3 seconds.
2233 performance is comparable to that of other UNIX editors, although it solves
2238 The discussion so far has described the implementation of the host part of
2240 the next few sections explain how a machine with mouse and bitmap display
2241 can be engaged to improve interaction.
2243 is not the first editor to be written as two processes,\u\s-4\&16\s+4\d
2244 but its implementation
2245 has some unusual aspects.
2247 There are several ways
2249 host and terminal parts may be connected.
2250 The first and simplest is to forgo the terminal part and use the host
2251 part's command language to edit text on an ordinary terminal.
2252 This mode is invoked by starting
2259 runs separate host and terminal programs,
2260 communicating with a message protocol over the physical
2261 connection that joins them.
2262 Typically, the connection is an RS-232 link between a Blit
2263 (the prototypical display for
2266 the Ninth Edition of the UNIX operating system.\u\s-4\&8\s+4\d
2267 (This is the version of the system used in the Computing Sciences Research
2268 Center at AT&T Bell Laboratories [now Lucent Technologies, Bell Labs], where I work. Its relevant
2269 aspects are discussed in the Blit paper.\u\s-4\&1\s+4\d)
2270 The implementation of
2272 for the SUN computer runs both processes on the same machine and
2273 connects them by a pipe.
2275 The low bandwidth of an RS-232 link
2276 necessitated the split between
2278 The division is a mixed blessing:
2279 a program in two parts is much harder to write and to debug
2280 than a self-contained one,
2281 but the split makes several unusual configurations possible.
2282 The terminal may be physically separated from the host, allowing the conveniences
2283 of a mouse and bitmap display to be taken home while leaving the files at work.
2284 It is also possible to run the host part on a remote machine:
2288 connects to the terminal in the usual way, and then makes a call
2289 across the network to establish the host part of
2291 on the named machine.
2292 Finally, it cross-connects the I/O to join the two parts.
2295 to be run on machines that do not support bitmap displays;
2298 is the editor of choice on our Cray X-MP/24.
2303 machines: the remote host, the terminal, and the local host.
2304 The local host's job is simple but vital: it passes the data
2305 between the remote host and terminal.
2307 The host and terminal exchange messages asynchronously
2308 (rather than, say, as remote procedure calls) but there is no
2309 error detection or correction
2310 because, whatever the configuration, the connection is reliable.
2311 Because the terminal handles mundane interaction tasks such as
2312 popping up menus and interpreting the responses, the messages are about
2314 For example, the host knows nothing about what is displayed on the screen,
2315 and when the user types a character, the message sent to the host says
2316 ``insert a one-byte string at location 123 in file 7,'' not ``a character
2317 was typed at the current position in the current file.''
2318 In other words, the messages look very much like the transaction records
2321 Either the host or terminal part of
2323 may initiate a change to a file.
2324 The command language operates on the host, while typing and some
2325 mouse operations are executed directly in the terminal to optimize response.
2326 Changes initiated by the host program must be transmitted to the terminal,
2329 (A token is exchanged to determine which end is in control,
2330 which means that characters typed while a time-consuming command runs
2331 must be buffered and do not appear until the command is complete.)
2332 To maintain consistent information,
2333 the host and terminal track changes through a per-file
2334 data structure that records what portions of the file
2335 the terminal has received.
2336 The data structure, called a
2338 (a weak pun: it's a file with holes)
2339 is held and updated by both the host and terminal.
2344 holding those parts of the file known to the terminal,
2345 separated by counts of the number of bytes in the interstices.
2346 Of course, the host doesn't keep a separate copy of the data (it only needs
2347 the lengths of the various pieces),
2348 but the structure is the same on both ends.
2352 in the terminal doubles as a cache.
2353 Since the terminal keeps the text for portions of the file it has displayed,
2354 it need not request data from the host when revisiting old parts of the file
2355 or redrawing obscured windows, which speeds things up considerably
2356 over low-speed links.
2358 It's trivial for the terminal to maintain its
2360 because all changes made on the terminal apply to parts of the file
2361 already loaded there.
2362 Changes made by the host are compared against the
2364 during the update sequence after each command.
2365 Small changes to pieces of the file loaded in the terminal
2366 are sent in their entirety.
2367 Larger changes, and changes that fall entirely in the holes,
2368 are transmitted as messages without literal data:
2369 only the lengths of the deleted and inserted strings are transmitted.
2370 When a command is completed, the terminal examines its visible
2371 windows to see if any holes in their
2373 intersect the visible portion of the file.
2374 It then requests the missing data from the host,
2375 along with up to 512 bytes of surrounding data, to minimize
2376 the number of messages when visiting a new portion of the file.
2377 This technique provides a kind of two-level lazy evaluation for the terminal.
2378 The first level sends a minimum of information about
2379 parts of the file not being edited interactively;
2380 the second level waits until a change is displayed before
2381 transmitting the new data.
2383 performance is also helped by having the terminal respond immediately to typing
2384 and simple mouse requests.
2385 Except for small changes to active pieces of the file, which are
2386 transmitted to the terminal without negotiation,
2387 the terminal is wholly responsible for deciding what is displayed;
2390 only to tell the terminal what might be relevant.
2392 When a change is initiated by the host,
2393 the messages to the terminal describing the change
2394 are generated by the routine that applies the transcript of the changes
2395 to the contents of the
2397 Since changes are undone by the same update routine,
2399 no extra code in the communications;
2400 the usual messages describing changes to the file are sufficient
2401 to back up the screen image.
2405 is a particularly good example of the way caches are used in
2407 First, it facilitates access to the active portion of the text by placing
2408 the busy text in main memory.
2409 In so doing, it provides efficient access
2410 to a large data structure that does not fit in memory.
2411 Since the form of data is to be imposed by the user, not by the program,
2412 and because characters will frequently be scanned sequentially,
2413 files are stored as flat objects.
2414 Caches help keep performance good and linear when working with such
2419 and several of the other caches have some
2421 that is, the cache is loaded with more information than is needed for
2422 the job immediately at hand.
2423 When manipulating linear structures, the accesses are usually sequential,
2424 and read-ahead can significantly reduce the average time to access the
2425 next element of the object.
2426 Sequential access is a common mode for people as well as programs;
2427 consider scrolling through a document while looking for something.
2429 Finally, like any good data structure,
2430 the cache guides the algorithm, or at least the implementation.
2433 was actually invented to control the communications between the host and
2434 terminal parts, but I realized very early that it was also a form of
2435 cache. Other caches were more explicitly intended to serve a double
2436 purpose: for example, the caches in
2438 that coalesce updates not only reduce traffic to the
2439 transcript and contents
2441 they also clump screen updates so that complicated changes to the
2442 screen are achieved in
2443 just a few messages to the terminal.
2444 This saved me considerable work: I did not need to write special
2445 code to optimize the message traffic to the
2447 Caches pay off in surprising ways.
2448 Also, they tend to be independent, so their performance improvements
2451 Data structures in the terminal
2453 The terminal's job is to display and to maintain a consistent image of
2454 pieces of the files being edited.
2455 Because the text is always in memory, the data structures are
2456 considerably simpler than those in the host part.
2459 typically has far more windows than does
2461 the window system within which its Blit implementation runs.
2463 has a fairly small number of asynchronously updated windows;
2465 needs a large number of synchronously updated windows that are
2466 usually static and often fully obscured.
2467 The different tradeoffs guided
2469 away from the memory-intensive implementation of windows, called
2470 .CW Layers ,\u\s-4\&17\s+4\d
2473 Rather than depending on a complete bitmap image of the display for each window,
2475 regenerates the image from its in-memory text
2478 when necessary, although it will use such an image if it is available.
2483 uses the screen bitmap as active storage in which to update the image using
2484 .CW bitblt .\u\s-4\&18,19\s+4\d
2485 The resulting organization, pictured in Figure 6,
2486 has a global array of windows, called
2488 each of which holds an image of a piece of text held in a data structure
2491 which in turn represents
2492 a rectangular window full of text displayed in some
2496 appears in a global list that orders them all front-to-back
2497 on the display, and simultaneously as an element of a per-file array
2498 that holds all the open windows for that file.
2499 The complement in the terminal of the
2501 on the host is called a
2512 Figure 6. Data structures in the terminal.
2514 are also linked together into a front-to-back list.
2516 are discussed in the next section.
2524 contains the image of the text.
2525 For a fully visible window, the
2527 will be the screen (or at least the
2532 while for partially obscured windows the
2535 If the window is fully obscured, the
2542 When making changes to the display, most of the original image will
2543 look the same in the final image, and the update algorithms exploit this.
2546 software updates the image in the
2550 is not just an image, it is a data structure.\u\s-4\&18,19\s+4\d
2551 The job of the software that updates the display is therefore
2552 to use as much as possible of the existing image (converting the
2553 text from ASCII characters to pixels is expensive) in a sort of two-dimensional
2554 string insertion algorithm.
2555 The details of this process are described in the next section.
2559 software has no code to support overlapping windows;
2560 its job is to keep a single
2565 software to multiplex the various
2568 The problem of maintaining overlapping
2571 .CW Layers \u\s-4\&17\s+4\d
2572 because changes are made synchronously and because the contents of the window
2573 can be reconstructed from the data stored in the
2578 makes no such assumptions.
2581 the window being changed is almost always fully visible, because the current
2582 window is always fully visible, by construction.
2583 However, when multi-file changes are being made, or when
2584 more than one window is open on a file,
2585 it may be necessary to update partially obscured windows.
2587 There are three cases: the window is
2588 fully visible, invisible (fully obscured), or partially visible.
2589 If fully visible, the
2591 is part of the screen, so when the
2593 update routine calls the
2595 update routine, the screen will be updated directly.
2596 If the window is invisible,
2597 there is no associated
2599 and all that is necessary is to update the
2601 data structure, not the image.
2602 If the window is partially visible, the
2604 routine is called to update the image in the off-screen
2606 which may require regenerating it from the text of the window.
2609 code then clips this
2617 being modified, and the remainder is copied to the display.
2619 This is much faster than recreating the image off-screen
2620 for every change, or clipping all the changes made to the image
2622 Unfortunately, these caches can also consume prohibitive amounts of
2623 memory, so they are freed fairly liberally \(em after every change to the
2624 front-to-back order of the
2629 exist only while multi-window changes are occurring,
2630 which is the only time the performance improvement they provide is needed.
2631 Also, the user interface causes fully-obscured windows to be the
2632 easiest to make \(em
2633 creating a canonically sized and placed window requires only a button click
2634 \(em which reduces the need for caching still further.
2639 Only two low-level primitives are needed for incremental update:
2641 which copies rectangles of pixels, and
2643 (which in turn calls
2645 which draws a null-terminated character string in a
2651 each of which defines a horizontal strip of text in the window
2655 has a character string
2660 that defines the location of the strip in the window.
2667 associated with the window's file, so
2669 are self-contained.)
2670 The invariant is that
2673 can be reproduced by calling
2677 to draw the string in
2679 and the resulting picture fits perfectly within
2683 define the tiling of the window.
2684 The tiling may be complicated by long lines of text, which
2685 are folded onto the next line.
2686 Some editors use horizontal scrolling to avoid this complication,
2687 but to be comfortable this technique requires that lines not be
2691 has no such restriction.
2692 Also, and perhaps more importantly, UNIX programs and terminals traditionally fold
2693 long lines to make their contents fully visible.
2695 Two special kinds of
2698 character: either a newline or a tab.
2699 Newlines and tabs are white space.
2702 always extends to the right edge of the window,
2703 forcing the following
2706 The width of a tab depends on where it is located:
2709 to begin at a tab location.
2711 have a minimum width equivalent to a blank (blanks are
2714 and are not treated specially); newlines have a minimum width of zero.
2721 Figure 7. A line of text showing its
2725 contain tabs; the last contains a newline.
2726 Spaces are handled as ordinary characters.
2730 The update algorithms always use the
2732 image of the text (either the display or cache
2734 they never examine the characters within a
2738 needs to be split in two.
2739 Before a change, the window consists of a tiling of
2741 after the change the window is tiled differently.
2742 The update algorithms rearrange the tiles in place, without
2744 The algorithms are not strictly optimal \(em for example, they can
2745 clear a pixel that is later going to be written upon \(em
2746 but they never move a tile that doesn't need to be moved,
2747 and they move each tile at most once.
2749 on a Blit can absorb over a thousand characters a second if the strings
2750 being inserted are a few tens of characters long.
2754 Its job is to delete a substring from a
2756 and restore the image of the
2758 The image of a substring has a peculiar shape (see Figure 2) comprising
2759 possibly a partial line,
2760 zero or more full lines,
2761 and possibly a final partial line.
2762 For reference, call this the
2767 begins by splitting, if necessary, the
2769 containing the ends of
2770 the substring so the substring begins and ends on
2773 Because the substring is being deleted, its image is not needed,
2774 so the Z-shape is then cleared.
2775 Then, tiles (that is, the images of
2779 from immediately after the Z-shape to
2780 the beginning of the Z-shape,
2781 resulting in a new Z-shape.
2783 whose contents would span two lines in the new position must first be split.)
2785 Copying the remainder of the
2788 this way will clearly accomplish the deletion but eventually,
2789 typically when the copying algorithm encounters a tab or newline,
2792 coordinates of the tile
2793 to be copied are the same.
2794 This correspondence implies
2795 that the Z-shape has its beginning and ending edges aligned
2796 vertically, and a sequence of at most two
2798 can be used to copy the remaining tiles.
2799 The last step is to clear out the resulting empty space at the bottom
2801 the number of lines to be cleared is the number of complete lines in the
2802 Z-shape closed by the final
2804 The final step is to merge horizontally adjacent
2807 The complete source to
2809 is less than 100 lines of C.
2812 is more complicated because it must do four passes:
2813 one to construct the
2815 list for the inserted string,
2817 one to copy (in opposite order to
2821 to make the hole for the new text,
2822 and finally one to copy the new text into place.
2825 has a similar flavor to
2827 and needn't be described further.
2829 and its subsidiary routines comprise 211 lines of C.
2831 The terminal source code is 3024 lines of C,
2832 and the host source is 5797 lines.
2838 The immediate ancestor of
2840 was the original text editor for the Blit, called
2845 two-process structure and mouse language almost unchanged, but
2847 suffered from several drawbacks that were addressed in the design of
2849 The most important of these was the lack of a command language.
2852 was easy to use for simple editing, it provided no direct help with
2853 large or repetitive editing tasks. Instead, it provided a command to pass
2854 selected text through a shell pipeline,
2855 but this was no more satisfactory than could be expected of a stopgap measure.
2858 was written primarily as a vehicle for experimenting with a mouse-based
2859 interface to text, and the experiment was successful.
2863 the second window system for the Blit, is essentially a multiplexed
2864 version of the terminal part of
2868 user interface\u\s-4\&20\s+4\d was closely modeled on
2870 But after a couple of years,
2872 had become difficult to maintain and limiting to use,
2873 and its replacement was overdue.
2875 I began the design of
2879 customers what they wanted.
2880 This was probably a mistake; the answers were essentially a list of features
2881 to be found in other editors, which did not provide any of the
2882 guiding principles I was seeking.
2883 For instance, one common request was for a ``global substitute,''
2884 but no one suggested how to provide it within a cut-and-paste editor.
2885 I was looking for a scheme that would
2886 support such specialized features comfortably in the context of some
2887 general command language.
2888 Ideas were not forthcoming, though, particularly given my insistence
2889 on removing all limits on file sizes, line lengths and so on.
2890 Even worse, I recognized that, since the mouse could easily
2891 indicate a region of the screen that was not an integral number of lines,
2892 the command language would best forget about newlines altogether,
2893 and that meant the command language had to treat the file as a single
2894 string, not an array of lines.
2896 Eventually, I decided that thinking was not getting me very far and it was
2897 time to try building.
2898 I knew that the terminal part could be built easily \(em
2901 behaved acceptably well \(em and that most of the hard work was going
2902 to be in the host part: the file interface, command interpreter and so on.
2903 Moreover, I had some ideas about how the architecture of
2905 could be improved without destroying its basic structure, which I liked
2906 in principle but which hadn't worked out as well as I had hoped.
2907 So I began by designing the file data structure,
2908 starting with the way
2910 worked \(em comparable to a single structure merging
2914 which I split to make the cache more general
2915 \(em and thinking about how global substitute could be implemented.
2916 The answer was clearly that it had to be done in two passes,
2917 and the transcript-oriented implementation fell out naturally.
2920 was written bottom-up,
2921 starting from the data structures and algorithms for manipulating text,
2922 through the command language and up to the code for maintaining
2924 In retrospect, it turned out well, but this implementation method is
2925 not recommended in general.
2926 There were several times when I had a large body of interesting code
2927 assembled and no clue how to proceed with it.
2928 The command language, in particular, took almost a year to figure out,
2929 but can be implemented (given what was there at the beginning of that year)
2930 in a day or two. Similarly, inventing the
2932 data structure delayed the
2933 connection of the host and terminal pieces by another few months.
2935 took about two years to write, although only about four months were
2936 spent actually working on it.
2938 Part of the design process was unusual:
2939 the subset of the protocol that maintains the
2941 was simulated, debugged
2942 and verified by an automatic protocol analyzer,\u\s-4\&21\s+4\d and was bug-free
2944 The rest of the protocol, concerned mostly
2945 with keeping menus up to date,
2946 was unfortunately too unwieldy for such analysis,
2947 and was debugged by more traditional methods, primarily
2948 by logging in a file all messages in and out of the host.
2953 is essentially the only interactive editor used by the sixty or so members of
2954 the computing science research center in which I work.
2955 The same could not be said of
2957 the lack of a command language kept some people from adopting it.
2958 The union of a user interface as comfortable as
2960 with a command language as powerful as
2964 †The people who criticize
2966 as an interactive program often forget that it and its close relative
2967 .CW sed \u\s-4\&7\s+4\d
2968 still thrive as programmable editors. The strength of these programs is
2969 independent of their convenience for interactive editing.
2978 was first made available to the
2981 almost everyone switched to it within two or three days.
2982 In the months that followed, even people who had never adopted
2990 still gets occasional use, but usually when
2991 something quick needs to be done and the overhead of
2992 downloading the terminal part of
2994 isn't worth the trouble.
2995 Also, as a `line' editor,
2999 when using a good old ASCII terminal, it's comforting to have
3001 But it is fair to say that
3003 command language has displaced
3005 for most of the complicated editing that has kept line editors
3006 (that is, command-driven editors) with us.
3009 command language is even fancier than
3013 customers don't come near to using all its capabilities.
3014 Does it need to be so sophisticated?
3015 I think the answer is yes, for two reasons.
3021 command language is really relatively simple, and certainly simpler than that of
3023 For instance, there is only one kind of textual loop in
3032 command, the global flag on substitutions, and the implicit loop over
3033 lines in multi-line substitutions).
3036 substitute command is necessary to make changes within lines, but in
3040 command is more of a familiar convenience than a necessity;
3044 can do all the work.
3047 given a community that expects an editor to be about as powerful as
3049 it's hard to see how
3051 could really be much simpler and still satisfy that expectation.
3052 People want to do ``global substitutes,'' and most are content
3053 to have the recipe for that and a few other fancy changes.
3054 The sophistication of the command language is really just a veneer
3055 over a design that makes it possible to do global substitutes
3057 Some people will always want something more, however, and it's gratifying to
3058 be able to provide it.
3061 command language comes from composability of the operators, which is by
3062 nature orthogonal to the underlying model.
3065 is not itself complex, but it makes complex things possible.
3066 If you don't want to do anything complex, you can ignore the
3067 complexity altogether, and many people do so.
3069 Sometimes I am asked the opposite question: why didn't I just make
3071 a real programmable editor, with macros and variables and so on?
3072 The main reason is a matter of taste: I like the editor
3073 to be the same every time I use it.
3074 There is one technical reason, though:
3075 programmability in editors is largely a workaround for insufficient
3077 Programmable editors are used to make particular, usually short-term,
3078 things easy to do, such as by providing shorthands for common actions.
3079 If things are generally easy to do in the first place,
3080 shorthands are not as helpful.
3082 makes common editing operations very easy, and the solutions to
3083 complex editing problems seem commensurate with the problems themselves.
3084 Also, the ability to edit the
3086 window makes it easy to repeat commands \(em it only takes a mouse button click
3087 to execute a command again.
3092 has several other good points,
3093 and its share of problems.
3094 Among the good things is the idea of
3095 structural regular expressions,
3096 whose usefulness has only begun to be explored.
3097 They were arrived at serendipitously when I attempted to distill the essence of
3099 way of doing global substitution and recognized that the looping command in
3101 was implicitly imposing a structure (an array of lines) on the file.
3105 good things is its undo capability.
3106 I had never before used an editor with a true undo,
3107 but I would never go back now.
3110 be done well, but if it is, it can be relied on.
3112 it's safe to experiment if you're not sure how to write some intricate command,
3113 because if you make a mistake, it can be fixed simply and reliably.
3114 I learned two things about undo from writing
3116 first, it's easy to provide if you design it in from the beginning, and
3117 second, it's necessary, particularly if the system has some subtle
3118 properties that may be unfamiliar or error-prone for users.
3121 lack of internal limits and sizes is a virtue.
3122 Because it avoids all fixed-size tables and data structures,
3124 is able to make global changes to files that some of our other
3125 tools cannot even read.
3126 Moreover, the design keeps the performance linear when doing such
3127 operations, although I must admit
3129 does get slow when editing a huge file.
3132 Externally, the most obvious is that it is poorly integrated into the
3133 surrounding window system.
3134 By design, the user interface in
3136 feels almost identical to that of
3138 but a thick wall separates text in
3140 from the programs running in
3142 For instance, the `snarf buffer' in
3144 must be maintained separately from that in
3146 This is regrettable, but probably necessary given the unusual configuration
3147 of the system, with a programmable terminal on the far end of an RS-232 link.
3150 is reliable; otherwise, people wouldn't use it.
3151 But it was written over such a long time, and has so many new (to me)
3152 ideas in it, that I would like to see it done over again to clean
3153 up the code and remove many of the lingering problems in the implementation.
3154 The worst part is in the interconnection of the host and terminal parts,
3155 which might even be able to go away in a redesign for a more
3156 conventional window system.
3157 The program must be split in two to use the terminal effectively,
3158 but the low bandwidth of the connection forces the separation to
3159 occur in an inconvenient part of the design if performance is to be acceptable.
3160 A simple remote procedure call
3161 protocol driven by the host, emitting only graphics
3162 commands, would be easy to write but wouldn't have nearly the
3163 necessary responsiveness. On the other hand, if the terminal were in control
3164 and requested much simpler file services from the host, regular expression
3165 searches would require that the terminal read the entire file over its RS-232
3166 link, which would be unreasonably slow.
3167 A compromise in which either end can take control is necessary.
3168 In retrospect, the communications protocol should have been
3169 designed and verified formally, although I do not know of any tool
3170 that can adequately relate the protocol to
3175 users are comfortable with its command language, and few are adept.
3176 Some (venerable) people use a sort of
3183 command language is not exactly
3185 (The reason, of course, is that
3187 model for text does not include newlines, which are central to
3189 Making the text an array of newlines to the command language would
3190 be too much of a break from the seamless model provided by the mouse.
3191 Some editors, such as
3193 are willing to make this break, though.)
3194 The difficulty is that
3196 syntax is so close to
3198 that people believe it
3201 I thought, with some justification in hindsight,
3206 would make it easier to learn and to accept.
3207 But I may have overstepped and raised the users'
3208 expectations too much.
3209 It's hard to decide which way to resolve this problem.
3211 Finally, there is a tradeoff in
3213 that was decided by the environment in which it runs:
3215 is a multi-file editor, although in a different system there might instead be
3216 multiple single-file editors.
3217 The decision was made primarily because starting a new program in a Blit is
3219 If the choice could be made freely, however, I would
3220 still choose the multi-file architecture, because it allows
3221 groups of files to be handled as a unit;
3222 the usefulness of the multi-file commands is incontrovertible.
3223 It is delightful to have the source to an entire program
3224 available at your fingertips.
3228 Tom Cargill suggested the idea behind the
3231 Norman Wilson and Ken Thompson influenced the command language.
3232 This paper was improved by comments from