docs/ARCHITECTURE.md

   1 # Design and open questions about libsyntax
   2
   3
   4 The high-level description of the architecture is in RFC.md. You might
   5 also want to dig through https://github.com/matklad/fall/ which
   6 contains some pretty interesting stuff built using similar ideas
   7 (warning: it is completely undocumented, poorly written and in general
   8 not the thing which I recommend to study (yes, this is
   9 self-contradictory)).
  10
  11 ## Tree
  12
  13 The centerpiece of this whole endeavor is the syntax tree, in the
  14 `tree` module. Open questions:
  15
  16 - how to best represent errors, to take advantage of the fact that
  17   they are rare, but to enable fully-persistent style structure
  18   sharing between tree nodes?
  19
  20 - should we make the red/green split from Roslyn more pronounced?
  21
  22 - one can lay out nodes in a single array in such a way that the children
  23   of a node form a contiguous slice. Seems nifty, but do we need it?
  24
  25 - should we use SoA or AoS for NodeData?
  26
  27 - should we split leaf nodes and internal nodes into separate arrays?
  28   Can we use it to save some bits here and there? (leaves don't need
  29   first_child field, for example).
  30
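To make the contiguous-slice question concrete, here is a minimal sketch of one possible layout. Everything in it is an assumption for illustration: the field names, the string kinds, and the breadth-first ordering are not taken from the `tree` module. It uses the AoS form; an SoA variant would split `NodeData` into three parallel vectors.

```rust
/// Hypothetical AoS node record; not the actual `NodeData` from `tree`.
#[derive(Debug, Clone, PartialEq)]
pub struct NodeData {
    pub kind: &'static str,
    /// Index of the first child in the shared array (unused for leaves).
    pub first_child: u32,
    /// Number of children; zero for leaves.
    pub n_children: u32,
}

/// All nodes of one tree, stored in a single array.
pub struct FlatTree {
    pub nodes: Vec<NodeData>,
}

impl FlatTree {
    /// Children of node `id` as one contiguous slice of the node array.
    pub fn children(&self, id: usize) -> &[NodeData] {
        let node = &self.nodes[id];
        let start = node.first_child as usize;
        &self.nodes[start..start + node.n_children as usize]
    }
}

/// A hand-built tree for `fn foo {}`-like input, laid out breadth-first,
/// which is one ordering that keeps siblings adjacent.
pub fn example_tree() -> FlatTree {
    FlatTree {
        nodes: vec![
            NodeData { kind: "FILE", first_child: 1, n_children: 1 },
            NodeData { kind: "FN", first_child: 2, n_children: 3 },
            NodeData { kind: "FN_KW", first_child: 0, n_children: 0 },
            NodeData { kind: "NAME", first_child: 0, n_children: 0 },
            NodeData { kind: "BLOCK", first_child: 0, n_children: 0 },
        ],
    }
}
```

In this sketch the leaves carry a dead `first_child` field, which is exactly the waste that a separate leaf array could avoid.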
  31
  32 ## Parser
  33
  34 The syntax tree is produced using a three-stage process.
  35
  36 First, raw text is split into tokens by a lexer (the `lexer` module).
  37 The lexer has a peculiar signature: it is an `Fn(&str) -> Token`, where
  38 `Token` is a pair of a `SyntaxKind` (you should have read the `tree` module
  39 and RFC by this time! :)) and a length. That is, the lexer chomps only the
  40 first token of the input. This forces the lexer to be stateless, and makes
  41 it possible to implement incremental relexing easily.
  42
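A rough sketch of what a lexer with this shape could look like. The four token kinds and the helper names (`next_token`, `prefix_len`, `tokenize`) are made up for the example and are much simpler than the real `lexer` module. The point is that `tokenize` is just a loop around a stateless function, so relexing can restart at any token boundary and produce the same tokens as a full pass.

```rust
/// Toy token kinds; the real `SyntaxKind` is assumed to be far richer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxKind {
    Whitespace,
    Ident,
    Number,
    Punct,
}

/// A token is only a kind and a length in bytes; it does not store text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub kind: SyntaxKind,
    pub len: usize,
}

/// Length in bytes of the longest prefix whose chars all satisfy `pred`.
fn prefix_len(text: &str, pred: impl Fn(char) -> bool) -> usize {
    text.find(|c: char| !pred(c)).unwrap_or(text.len())
}

/// Lexes only the first token of `text`, which must be non-empty.
/// No state is carried between calls.
pub fn next_token(text: &str) -> Token {
    let first = text.chars().next().expect("input must be non-empty");
    if first.is_whitespace() {
        let len = prefix_len(text, char::is_whitespace);
        Token { kind: SyntaxKind::Whitespace, len }
    } else if first.is_alphabetic() || first == '_' {
        let len = prefix_len(text, |c| c.is_alphanumeric() || c == '_');
        Token { kind: SyntaxKind::Ident, len }
    } else if first.is_ascii_digit() {
        let len = prefix_len(text, |c| c.is_ascii_digit());
        Token { kind: SyntaxKind::Number, len }
    } else {
        // Anything else is a single-character punctuation token here.
        Token { kind: SyntaxKind::Punct, len: first.len_utf8() }
    }
}

/// Lexes the whole input by repeatedly chomping the first token.
pub fn tokenize(mut text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    while !text.is_empty() {
        let token = next_token(text);
        tokens.push(token);
        text = &text[token.len..];
    }
    tokens
}
```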
  43 Then, in the bulk of the work, the parser turns a stream of tokens into a
  44 stream of events (the `parser` module; of particular interest are
  45 the `parser/event` and `parser/parser` modules, which contain the parsing
  46 API, and the `parser/grammar` module, which contains the actual parsing code
  47 for various Rust syntactic constructs). Note that the parser **does not**
  48 construct a tree right away. This is done for several reasons:
  49
  50 * to decouple the actual tree data structure from the parser: you can
  51   build any data structure you want from the stream of events
  52
  53 * to make parsing fast: you can produce a list of events without
  54   allocations
  55
  56 * to make it easy to tweak tree structure. Consider this code:
  57
  58   ```
  59   #[cfg(test)]
  60   pub fn foo() {}
  61   ```
  62
  63   Here, the attribute and the `pub` keyword must be the children of
  64   the `fn` node. However, when parsing them, we don't yet know if
  65   a function follows: it very well might be a `struct` instead.
  66   If we use events, we generally don't care about this *in the
  67   parser* and just emit them in order.
  68
  69 * (Is this true?)  to make incremental reparsing easier: you can reuse
  70   the same rope data structure for all of the original string, the
  71   tokens and the events.
  72
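To illustrate the decoupling point, here is a small sketch in which the same flat event list feeds two unrelated consumers, neither of which builds a tree. The `Event` enum, its three variants, and the kind names are assumptions made for this example; the real type lives in `parser/event` and may look quite different.

```rust
/// Hypothetical parser event; kinds are plain strings to keep the sketch short.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// A composite node begins.
    Start(&'static str),
    /// A single token (leaf).
    Token(&'static str),
    /// The most recently started node ends.
    Finish,
}

/// Events a parser might emit for `fn foo() {}` (hand-written here).
pub fn sample_events() -> Vec<Event> {
    use Event::*;
    vec![
        Start("FN"),
        Token("FN_KW"),
        Token("IDENT"),
        Start("PARAM_LIST"),
        Token("L_PAREN"),
        Token("R_PAREN"),
        Finish,
        Start("BLOCK"),
        Token("L_CURLY"),
        Token("R_CURLY"),
        Finish,
        Finish,
    ]
}

/// Consumer 1: render the events as an S-expression string.
pub fn to_sexpr(events: &[Event]) -> String {
    let mut out = String::new();
    for event in events {
        match *event {
            Event::Start(kind) => {
                if !out.is_empty() {
                    out.push(' ');
                }
                out.push('(');
                out.push_str(kind);
            }
            Event::Token(kind) => {
                out.push(' ');
                out.push_str(kind);
            }
            Event::Finish => out.push(')'),
        }
    }
    out
}

/// Consumer 2: compute the maximum nesting depth of composite nodes.
pub fn max_depth(events: &[Event]) -> usize {
    let (mut depth, mut max) = (0usize, 0usize);
    for event in events {
        match event {
            Event::Start(_) => {
                depth += 1;
                max = max.max(depth);
            }
            Event::Finish => depth = depth.saturating_sub(1),
            Event::Token(_) => {}
        }
    }
    max
}
```

Both consumers walk the list once and never learn what the final tree type is, which is the property the first bullet is after.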
  73
  74 The parser also does not know about whitespace tokens: it's the job of
  75 the next layer to assign whitespace and comments to nodes. However,
  76 the parser can remap contextual tokens, like `>>` or `union`, so it has
  77 access to the text.
  78
  79 Finally, the TreeBuilder converts a flat stream of events into a
  80 tree structure. It also *should* be responsible for attaching comments
  81 and rebalancing the tree, but it does not do this yet :)
  82
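The event-to-tree step can be sketched with a plain stack. This is not the actual TreeBuilder: the `Event` and `Node` types and the `build_tree` function are hypothetical, the tree is a naive owned one, and comment attachment and rebalancing are left out entirely.

```rust
/// Hypothetical parser event, with string kinds for brevity.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Start(&'static str),
    Token(&'static str),
    Finish,
}

/// Naive owned tree node; the real tree is assumed to be more compact.
#[derive(Debug, PartialEq, Eq)]
pub struct Node {
    pub kind: &'static str,
    pub children: Vec<Node>,
}

/// Folds a flat event stream into a single root node.
/// Returns `None` if the events are unbalanced or have no single root.
pub fn build_tree(events: &[Event]) -> Option<Node> {
    // Nodes that have been started but not yet finished, innermost last.
    let mut stack: Vec<Node> = Vec::new();
    let mut root = None;
    for event in events {
        match *event {
            Event::Start(kind) => stack.push(Node { kind, children: Vec::new() }),
            Event::Token(kind) => {
                // A token becomes a leaf of the innermost open node.
                let leaf = Node { kind, children: Vec::new() };
                stack.last_mut()?.children.push(leaf);
            }
            Event::Finish => {
                let node = stack.pop()?;
                match stack.last_mut() {
                    Some(parent) => parent.children.push(node),
                    None => {
                        if root.is_some() {
                            return None; // a second root is not allowed
                        }
                        root = Some(node);
                    }
                }
            }
        }
    }
    if stack.is_empty() { root } else { None }
}
```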
  83 ## Validator
  84
  85 The parser and lexer intentionally accept a lot of *invalid* code. The
  86 idea is to post-process the tree and do proper error reporting,
  87 literal conversion and quick-fix suggestions there. There is no
  88 design/implementation for this yet.
  89
  90
  91 ## AST
  92
  93 Nothing yet, see `AstNode` in `fall`.