1 .HTML "Fossil, an Archival File Server
4 ... .fp 5 CW LucidaSansCW83
6 Fossil, an Archival File Server
11 jmk,rsc@plan9.bell-labs.com
13 This paper describes the internals and
14 operation of Fossil, an archival file server built for Plan 9.
15 Fossil has not yet replaced the current Plan 9 file server
18 but that is our eventual intent.
19 Both fossil and this documentation are
20 works in progress. Comments on either are most welcome.
28 Fossil is an archival file server built for Plan 9.
29 In a typical configuration, it maintains a traditional file system
30 in a local disk partition and periodically archives snapshots of the file system
31 to a Venti server. These archives are made available through
32 a file system interface.
33 Fossil can also be run without a Venti server, in which case the
34 snapshots (if any) occupy local disk space.
36 The bulk of this paper explains the underlying data structures:
37 Venti trees, the Venti archival file system format, and finally Fossil's
39 The end of the paper discusses the architecture of the Fossil server.
41 The presentation of the data structures is very detailed, perhaps
42 too detailed for most readers.
43 The intent is to record all the details necessary to make structural
44 changes to the file system format.
45 Feel free to jump ahead when boredom strikes.
47 Venti trees and directory hierarchies
49 Venti [3] is an archival block storage server.
50 Once a block is stored, it can be retrieved by presenting the 20-byte
51 SHA1 hash of its contents, called a
53 Blocks on Venti have a maximum length of about 56 kilobytes,
54 though in practice smaller blocks are used.
55 To store a byte stream of arbitrary length, Venti uses a hash tree.
56 Conceptually, the data stream is broken into fixed-size (say,
58 chunks, which are stored on the Venti server.
59 The resulting scores are concatenated into a new pointer stream, which is
60 broken into fixed size (say,
62 chunks, which are stored on the Venti server.
66 so that we can ensure that each pointer block holds an
67 integral number of pointers.)
68 This yields a new pointer stream, and so on, until there is a single block
69 and finally a single score describing the entire tree.
70 The resulting structure looks like:
77 B0: box invis wid 1 "\f(CWVtDataType\fP"
83 L0c: box invis wid 0.2 "..."
90 L0f: box invis wid 0.2 "..."
97 L0i: box invis wid 0.2 "..."
104 L0l: box invis wid 0.2 "..."
109 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
110 line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
111 X: box invis at 0.1<$1.nw,$1.ne>
112 Y: box invis at 0.1<$1.sw,$1.se>
113 line -> from 0.5<X,Y> to $2.nw
114 X: box invis at 0.3<$1.nw,$1.ne>
115 Y: box invis at 0.3<$1.sw,$1.se>
116 line -> from 0.5<X,Y> to $3.nw
117 "..." at 0.7<$1.w,$1.e>
121 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
122 line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
123 line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
124 X: box invis at 0.1<$1.nw,$1.ne>
125 Y: box invis at 0.1<$1.sw,$1.se>
126 line -> from 0.5<X,Y> to $2.nw
127 X: box invis at 0.3<$1.nw,$1.ne>
128 Y: box invis at 0.3<$1.sw,$1.se>
129 line -> from 0.5<X,Y> to $3.nw
130 "..." at 0.6<$1.w,$1.e>
131 X: box invis at 0.9<$1.nw,$1.ne>
132 Y: box invis at 0.9<$1.sw,$1.se>
133 line -> from 0.5<X,Y> to $4.nw
137 line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
138 line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
139 X: box invis at 0.1<$1.nw,$1.ne>
140 Y: box invis at 0.1<$1.sw,$1.se>
141 line -> from 0.5<X,Y> to $2.nw
142 "..." at 0.5<$1.w,$1.e>
143 X: box invis at 0.9<$1.nw,$1.ne>
144 Y: box invis at 0.9<$1.sw,$1.se>
145 line -> from 0.5<X,Y> to $3.nw
149 L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
150 boxppddd(L1abc, L0a, L0b)
151 L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
152 boxppddd(L1def, L0d, L0e)
153 L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
154 boxppddd(L1ghi, L0g, L0h)
155 L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
156 boxppdddp(L1jklm, L0j, L0k, L0m)
157 B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
159 L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
160 boxppddd(L2abcdef, L1abc, L1def)
161 L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
162 boxpdddp(L2ghijklm, L1ghi, L1jklm)
163 B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
165 L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
166 boxpdddp(L3atom, L2abcdef, L2ghijklm)
167 B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
170 The leaves are the original data stream. Those blocks have type
172 The first pointer stream has type
177 The figure ends with a single block of type
179 but in general trees can have height up to
186 of 8180 bytes (409 pointers),
187 this gives a maximum stream size of approximately 10 zettabytes
188 (2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
190 Data blocks are truncated to remove trailing runs of zeros before
191 storage to Venti; they are zero-filled back to
193 bytes after retrieval from Venti.
194 Similarly, trailing runs of pointers to zero-length blocks are
195 removed from and added back to pointer blocks.
196 These simple rules happen to make it particularly efficient to store
197 large runs of zeros, as might occur in a data stream with ``holes:''
198 the zero-length block itself can be interpreted as a tree of any
199 depth encoding an all-zero data stream.
201 Reconstructing the data stream requires the score and type of the
202 topmost block in the tree, the data chunk size, the pointer chunk size,
203 and the data stream size.
204 (From the data stream size and the chunk sizes we could derive the
205 depth of the tree and thus the type of the topmost block, but it is convenient
206 to allow trees that are deeper than necessary.)
207 This information is kept in a 40-byte structure called a
212 gen[4] \fRgeneration number\fP
213 psize[2] \fRsize of pointer blocks\fP
214 dsize[2] \fRsize of data blocks\fP
217 size[6] \fRlength of file\fP
218 score[20] \fRscore of root block in tree\fP
226 Integers are stored in big-endian order.
228 really is a 48-bit field.)
230 is made up of the following bit fields.
233 0x01 VtEntryActive \fRentry is allocated\fP
234 0x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP
235 0x1C VtEntryDepthMask \fRmask for tree depth\fP
236 0x20 VtEntryLocal \fRreserved (q.v.)\fP
239 The depth of the described tree is stored in the 3 bits indicated:
240 a tree with a topmost node of type
246 we can build more complicated data structures,
247 ones with multiple or nested data streams.
248 A data stream consisting of
250 structures is called a Venti directory.
251 It is identical in structure to the Venti data stream
252 we described earlier except that the bottom-level type is
257 describing a Venti directory has the
262 for a Venti directory
263 is a multiple of 40 so that each data chunk holds
267 By analogy with Venti directories,
268 we call the original data stream a
270 Note that Venti files are assumed
272 to contain pointers to other Venti blocks.
273 The only pointers to Venti blocks occur in
277 (and in the internal hash tree structure of the
278 individual files and directories).
279 Note also that these directories are nothing more than pointer lists.
280 In particular, there are no names or metadata like in a file system.
282 To make it easier to pass hierarchies between applications,
283 the root of a hierarchy is described in a 300-byte structure
290 name[128] \fRname of structure (just a comment)\fP
291 type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW
292 score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW
293 blockSize[2] \fRmaximum block size in structure\fP
294 prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP
297 This structure is stored to Venti and its score is passed
298 between applications, typically in the form
299 ``\fItype\f(CW:\fIrootscore\fR,''
302 is the type field from the
310 structures can be chained together using the
312 field to encode an archival history
313 of the data structure.
315 For example, a small Venti hierarchy might look like:
328 "\f(CWVtRoot\fP" ljust
337 ] with .nw at VtRoot.sw+(0.2,-.1)
345 ] with .nw at RootMeta.sw+(0,-.1)
357 ] with .nw at MetaSource.sw+(0,-.1)
358 SB1: box invis at Source.B1
359 SB2: box invis at Source.B2
360 SB3: box invis at Source.B3
361 ] with .nw at Root.sw+(0.4,-.1)
377 ] with .nw at MetaSource.sw+(0,-.1)
378 File: box wid 0.8 with .nw at Source.sw+(0,-.1)
379 ] with .nw at Level1.sw+(0.6,-.1)
381 line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
382 line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
383 line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
384 line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
386 line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
387 line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
388 line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
391 KEY: box wid 1.5 invis "Key"
392 line from KEY.sw to KEY.se
397 "Venti file" ljust with .w at last box .w+(kk,0)
398 ] with .nw at KEY.sw+(0,2*k)
401 "Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
402 ] with .nw at A.sw+(0,k)
409 "Venti directory" ljust with .w at CC.w+(kk,0)
410 ] with .nw at B.sw+(0,k)
412 line -> right 3*boxwid
413 "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
414 ] with .nw at C.sw+(0,k)
415 ] with .nw at VtRoot.nw+(3,0)
418 Venti files are shown as white boxes, while directories are shown
419 as shaded boxes. Each shaded square represents a
421 Arrows represent pointers from
424 Venti files or directories.
426 The hierarchical structure provided by Venti files and directories
427 can be used as the base for more complicated data structures.
428 Because this structure captures all the information
429 about pointers to other blocks, tools written to traverse
430 Venti hierarchies can traverse the more complicated
431 data structures as well.
436 copies a Venti hierarchy from one Venti server to another,
439 Because the traditional file system described in later sections is
440 layered on a Venti hierarchy,
442 can copy it without fully understanding its structure.
444 Vac file system format
446 The Venti archive format
448 builds a traditional file system using a Venti hierarchy.
449 Each vac file is implemented as a Venti file;
450 each vac directory is implemented as a Venti
451 directory and a Venti file to provide traditional file system metadata.
452 The metadata is stored in a structure called a
457 magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP
459 elem[s] \fRname (final path element only)\fP
460 entry[4] \fRentry number for Venti file or directory\fP
461 gen[4] \fRgeneration number\fP
462 mentry[4] \fRentry number for Venti file holding metadata\fP
463 mgen[4] \fRgeneration number\fP
464 qid[8] \fRunique file serial number\fP
467 mid[s] \fRlast modified by\fP
468 mtime[4] \fRlast modification time\fP
469 ctime[4] \fRcreation time\fP
470 atime[4] \fRlast access time\fP
471 mode[4] \fRmode bits\fP
475 denotes a string stored as a two-byte length
476 and then that many bytes.
477 The above describes Version 9 of the
479 format. Versions 7 and 8 are very similar; they can be
482 source code but are not written.
483 Earlier versions were not widespread.
486 may be followed by optional extension sections, though none
490 bits include bits commonly used by
491 Unix and Windows, in addition to those used by Plan 9.
495 field is an index into the parallel Venti directory.
500 field in the corresponding
509 are the index and generation number
510 for the metadata Venti file,
513 describes a vac directory.
515 The relation between Venti files and directories and
516 vac files and directories can be seen in this figure:
529 "\f(CWVtRoot\fP" ljust
536 "fs root block" ljust
537 ] with .nw at VtRoot.sw + (0.2, -.2)
544 "root directory info block" ljust
545 ] with .nw at SuperRoot.sw+(0.2, -.2)
550 "root metadata" ljust
559 ] with .nw at RootMeta.sw+(0,-.2)
560 MB1: box wid mb invis at MetaSource.B1
561 MB2: box wid mb invis at MetaSource.B2
562 MB3: box wid mb invis at MetaSource.B3
563 MB4: box wid mb invis at MetaSource.B4
564 MB5: box wid mb invis at MetaSource.B5
576 ] with .nw at MetaSource.sw+(0,-.1)
577 SB1: box invis at Source.B1
578 SB2: box invis at Source.B2
579 SB3: box invis at Source.B3
580 SB4: box invis at Source.B4
581 SB5: box invis at Source.B5
582 SB6: box invis at Source.B6
583 SB7: box invis at Source.B7
584 SB8: box invis at Source.B8
585 ] with .nw at Root.sw+(0.4,-.2)
605 ] with .nw at MetaSource.sw+(0,-.1)
606 File: box wid 0.8 with .nw at Source.sw+(0,-.2)
607 ] with .nw at Level1.sw+(0.6,-.2)
609 line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
610 line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
611 line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
612 line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
613 line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
615 line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
616 line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
617 line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
619 arrowwid = arrowwid/2
621 line -> from Level1.MB1 to Level1.SB1.n
622 line -> from Level1.MB2 to Level1.SB2.n
623 line -> from Level1.MB2 to Level1.SB3.n
624 line -> from Level1.MB4 to Level1.SB7.n
625 line -> from Level1.MB5 to Level1.SB5.n
626 arrowwid = arrowwid * 2
627 arrowht = arrowht * 2
629 box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
630 box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
631 box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
634 KEY: box wid 1.5 invis "Key"
635 line from KEY.sw to KEY.se
640 "Venti file" ljust with .w at last box .w+(kk,0)
641 ] with .nw at KEY.sw+(0,2*k)
644 "Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
645 ] with .nw at A.sw+(0,k)
652 "Venti directory" ljust with .w at CC.w+(kk,0)
653 ] with .nw at B.sw+(0,k)
655 line -> right 3*boxwid
656 "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
657 ] with .nw at C.sw+(0,k)
659 box dotted wid 4*boxwid
660 "Vac file" ljust with .w at last box .w+(kk,0)
661 ] with .nw at D.sw+(0,k)
664 "Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
665 ] with .nw at DD.sw+(0,k)
667 box dashed wid 4*boxwid
668 "Vac directory" ljust with .w at last box .w+(kk,0)
669 ] with .nw at E.sw+(0,k)
671 arrowwid = arrowwid/2
673 line -> right 1.5*boxwid
674 "Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
675 arrowwid = arrowwid * 2
676 arrowht = arrowht * 2
677 ] with .nw at G.sw+(0,k)
678 ] with .nw at VtRoot.nw+(3,0)
681 In reality, the story is slightly more complicated.
682 The metadata file in a Vac directory
683 is not just the concatenation of
686 Instead, it is the concatenation of
690 contains some number of
692 structures along with a sorted index to make it easy
693 to look for a particular
698 The details are in the source code.
700 As shown in the diagram,
701 the root directory of the file system is summarized by
704 structures describing
705 the Venti directory for the children of the root,
706 the Venti file for the metadata describing the children of the root,
707 and a Venti file holding metadata for the root directory itself.
710 structures are placed in a Venti directory of their own,
711 described by the single
716 Fossil file system format
718 Fossil uses the vac format, with some small changes.
719 The changes only affect the data on the local disk; the data
720 archived to Venti is exactly in vac format.
722 Blocks stored on local disk may contain scores pointing at local disk
723 blocks or at Venti blocks.
724 Local block addresses are stored as 20-byte scores in which the first 16 bytes
725 are all zero and the last 4 bytes specify a block number in the disk.
726 Before a block is archived, all the
727 blocks it points to must be archived, and the local scores in the block
728 must be changed to Venti scores.
729 Using block addresses rather than content hashes for local data
730 makes the local file system easier to manage: if a local block's contents
731 change, the pointer to the block does not need to change.
735 Fossil is an archival file server.
736 It takes periodic snapshots of the file system,
737 which are made accessible through the file system.
738 Specifically, the active file system is presented in
740 Ephemeral snapshots (those that are kept on local disk and eventually deleted)
742 \f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
755 Archival snapshots (those that are archived to Venti and persist forever)
757 \f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
763 are year, month, and day as before,
766 is a sequence number if more than one
767 archival snapshot is done in a day.
768 For the first snapshot,
771 For the subsequent snapshots,
779 To implement the snapshots, the file server maintains a
782 for the active file system.
783 Each local block has a label that records, among other things,
784 the epoch in which the block was allocated.
785 If a block was allocated in an epoch earlier than the current one,
786 it is immutable and treated as copy-on-write.
787 Taking a snapshot can be accomplished by
788 recording the address of the current root block and then
789 incrementing the epoch number.
790 Notice that the copy-on-write method makes
791 snapshots both time efficient and space efficient.
792 The only time cost is waiting for all current file system
793 requests to finish and then incrementing a counter.
794 After a snapshot, blocks only get copied when they are
795 next modified, so the per-snapshot
796 space requirement is proportional
797 to the amount of new data rather than the total
798 size of the file system.
800 The blocks in the archival snapshots are moved to Venti,
801 but the blocks in the ephemeral snapshots take up space
802 in the local disk file.
803 To allow reclamation of this disk space, the file system
807 which is the epoch of the earliest ephemeral snapshot
809 Fossil only allows access to snapshots with epoch numbers
811 low epoch and the current epoch
812 (also called the high epoch).
813 Incrementing the low epoch thus makes old
814 snapshots inaccessible.
815 The space required to store those snapshots can then
816 be reclaimed, as described below.
820 The bulk of the local disk file is the local blocks.
821 Each block has a 14-byte label associated with it, of the format:
825 state[1] \fRblock state\fP
826 type[1] \fRblock type\fP
827 epoch[4] \fRallocation epoch\fP
828 epochClose[4] \fRclose epoch\fP
829 tag[4] \fRrandom tag\fP
834 is an analogue of the block types described earlier,
835 though different names are used, to distinguish between
836 pointers blocks in a hash tree for a data stream
837 and pointer blocks for a directory stream.
840 was mentioned in the last section.
841 The other fields are explained below.
843 There are two distinguished blocks states
849 which mark blocks that are available for allocation
850 and blocks that are bad and should be avoided.
853 is not one of these values, it is a bitwise
855 of the following flags:
858 0x01 BsAlloc \fRblock is in use\fP
859 0x02 BsCopied \fRblock has been copied\fP
860 0x04 BsVenti \fRblock has been stored on Venti\fP
861 0x08 BsClosed \fRblock has been unlinked from active file system\fP
864 The flags are explained as they arise in the discussions below.
866 It is convenient to store some extra fields in the
868 structure when it describes a Venti file or directory
869 stored on local disk.
870 Specifically, we set the
873 and then use the bytes 7-16 of the score (which would
874 otherwise be zero, since it is a local score) to hold these fields:
877 archive[1] \fRboolean: this is an archival snapshot\fP
878 snap[4] \fRepoch number if root of snapshot\fP
879 tag[4] \fRrandom tag\fP
884 structure is called an
893 is used to identify dangling pointers or other file system corruption:
894 all the local blocks in a hash tree must
895 have tags matching the tag in the
899 points at the root of a snapshot,
902 field is the epoch of the snapshot.
903 If the snapshot is intended to be archived to Venti,
910 The blocks in the active file system form a tree: each
911 block has only one parent.
912 Once a copy-on-write block
914 is replaced by its copy, it is no longer
915 needed by the active file system.
918 is unlinked from the active file system.
923 it is needed only for snapshots.
924 When a block is closed, the
926 bit is set in its state, and the current epoch (called the block's closing epoch)
935 A block is referenced by snapshots with epochs
936 between the block's allocation epoch and its closing epoch.
937 Once the file system's low epoch grows to be greater than or equal to the block's
938 closing epoch, the block is no longer needed for any snapshots
941 In a typical configuration, where nightly archival snapshots
942 are taken and written to Venti, it is desirable to reclaim
943 the space occupied by now-archived blocks if possible.
944 To do this, Fossil keeps track of whether the pointers
945 in each block are unique to that block.
948 is allocated, a pointer to
950 is written into exactly one active block (say,
952 In the absence of snapshots, the pointer to
954 will remain unique to
956 so that if the pointer is zeroed,
958 can be immediately reused.
959 Snapshots complicate this invariant:
962 is copied-on-write, all its pointers
963 are no longer unique to it.
964 At time of the copy, the
966 state bit in the block's label
967 is set to note the duplication of the pointers contained within.
971 The file system header describes the file system layout and has this format:
975 magic[4] \fR0x3776AE89 (HeaderMagic)\fP
976 version[2] \fR1 (HeaderVersion)\fP
977 blockSize[2] \fIfile system block size\fP
978 super[4] \fRblock offset of super block\fP
979 label[4] \fRblock offset of labels\fP
980 data[4] \fRdata blocks\fP
981 end[4] \fRend of file system\fP
984 The corresponding file system layout is:
990 Empty: box "empty" ht 0.25
991 Header: box "header" with .n at Empty.s
992 Empty2: box "empty" with .n at Header.s
993 Super: box "super block" with .n at Empty2.s
994 Label: box "label" "blocks" with .n at Super.s ht 0.25
995 Data: box "data" "blocks" with .n at Label.s ht 0.3
996 " 0" ljust at Empty.ne
997 " 128kB" ljust at Header.ne
998 " \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
999 " \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
1000 " \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
1001 " \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
1006 The numbers to the right of the blocks are byte offsets
1009 The super block describes the file system itself and looks like:
1013 magic[4] \fR0x2340A3B1 (SuperMagic)\fP
1014 version[2] \fR1 (SuperVersion)\fP
1015 epochLow[4] \fRfile system low epoch\fP
1016 epochHigh[4] \fRfile system high (active) epoch\fP
1017 qid[8] \fRnext qid to allocate\fP
1018 active[4] \fRdata block number: root of active file system\fP
1019 next[4] \fRdata block number: root of next file system to archive\fP
1020 current[4] \fRdata block number: root of file system currently being archived\fP
1021 last[20] \fRVenti score of last successful archive\fP
1022 name[128] \fRname of file system (just a comment)\fP
1028 The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
1032 The file server is structured as a set of processes synchronizing
1033 mostly through message passing along queues.
1034 The processes are given names, which can be seen in the output of
1039 processes announce on various network addresses.
1042 process handles each incoming connection, reading 9P requests
1043 and adding them to a central message queue.
1045 processes remove 9P requests from the queue,
1046 handle them, and write the responses to the appropriate
1051 process handles disk I/O requests made by the other processes.
1054 process writes dirty blocks from the in-memory block cache to disk.
1057 process frees previously linked blocks once the blocks that point at them
1058 have been written to disk.
1062 reads from each console file (typically a pipe posted in
1064 adding the typed characters to the input queue.
1067 process echoes input and runs the commands, saving
1068 output in a ring buffer.
1069 Because there is only one
1071 process, only one console command may be executing at a time.
1074 process copies this ring buffer to each console file.
1078 process runs periodic events, like
1079 flushing the root metadata to disk or
1080 taking snapshots of the file system.
1084 Fossil maintains an in-memory block cache which
1085 holds both local disk blocks and Venti blocks.
1086 Cache eviction follows a least recently used policy.
1087 Dirty blocks are restricted to at most half the cache.
1088 This can be changed by editing
1093 The block cache uses soft updates [1] to ensure that the on-disk
1094 file system is always self-consistent.
1098 and no need to check a file system
1099 that was shut down without halting.
1103 A background process writes blocks in archival snapshots to Venti.
1105 .CW /archive/\fIyyyy\fP/\fImmdds\fR
1108 at the time of the snapshot,
1109 the archival process archives the
1110 entire file tree rather than just
1111 the subtree rooted at
1114 .CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
1115 are stored as empty directories.
1116 Once all the blocks have been archived,
1119 header for the file system is archived.
1120 The score of that header is recorded in
1122 and also printed on the file server console.
1123 The score can used by
1125 to restore a file system (see
1128 Contrast with the old file server
1130 The most obvious difference between Fossil and the
1131 old Plan 9 file server [2] is that Fossil uses a Venti server as
1132 its archival storage in place of a WORM juke box.
1133 There are a few other architectural differences to be
1136 Fossil is a user-level program run on a standard kernel.
1138 Fossil does not have any way to concatenate, stripe, or
1139 mirror disk files. For functionality similar to the old file server's
1140 configuration strings, use the experimental file stack device
1144 Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported.
1146 ... XXX words about converting an old file system to fossil?
1150 [1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
1152 ``Soft Updates: A Solution to the Metadata Update Problem
1154 .I "ACM Transactions on Computer Systems" ,
1155 Vol 18., No. 2, May 2000, pp. 127\-153.
1157 [2] Sean Quinlan, ``A Cached WORM File System,''
1158 .I "Software\(emPractice and Experience" ,
1159 Vol 21., No 12., December 1991, pp. 1289\-1299.
1161 [3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
1162 .I "Usenix Conference on File and Storage Technologies" ,