docs/dev/syntax.md

   1 # Syntax in rust-analyzer
   2
   3 ## About the guide
   4
   5 This guide describes the current state of syntax trees and parsing in rust-analyzer as of 2020-01-09 ([link to commit](https://github.com/rust-analyzer/rust-analyzer/tree/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6)).
   6
   7 ## Source Code
   8
   9 The things described are implemented in three places:
  10
  11 * [rowan](https://github.com/rust-analyzer/rowan/tree/v0.9.0) -- a generic library for red-green syntax trees.
  12 * [ra_syntax](https://github.com/rust-analyzer/rust-analyzer/tree/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax) crate inside rust-analyzer, which wraps `rowan` into a rust-analyzer-specific API.
  13   Nothing in rust-analyzer except this crate knows about `rowan`.
  14 * [parser](https://github.com/rust-analyzer/rust-analyzer/tree/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/parser) crate, which parses input tokens into an `ra_syntax` tree.
  15
  16 ## Design Goals
  17
  18 * Syntax trees are lossless, or full fidelity. All comments and whitespace get preserved.
  19 * Syntax trees are semantic-less. They describe *strictly* the structure of a sequence of characters, they don't have hygiene, name resolution or type information attached.
  20 * Syntax trees are simple value types. It is possible to create trees for a syntax without any external context.
  21 * Syntax trees have intuitive traversal API (parent, children, siblings, etc).
  22 * Parsing is lossless (even if the input is invalid, the tree produced by the parser represents it exactly).
  23 * Parsing is resilient (even if the input is invalid, parser tries to see as much syntax tree fragments in the input as it can).
  24 * Performance is important, it's OK to use `unsafe` if it means better memory/cpu usage.
  25 * Keep the parser and the syntax tree isolated from each other, such that they can vary independently.
  26
  27 ## Trees
  28
  29 ### Overview
  30
  31 The syntax tree consists of three layers:
  32
  33 * GreenNodes
  34 * SyntaxNodes (aka RedNode)
  35 * AST
  36
  37 Of these, only GreenNodes store the actual data; the other two layers are (non-trivial) views into the green tree.
  38 Red-green terminology comes from Roslyn ([link](https://ericlippert.com/2012/06/08/red-green-trees/)) and gives the name to the `rowan` library. Green and syntax nodes are defined in rowan, ast is defined in rust-analyzer.
  39
  40 Syntax trees are a semi-transient data structure.
  41 In general, the frontend does not keep syntax trees for all files in memory.
  42 Instead, it *lowers* syntax trees to a more compact and rigid representation, which is not full-fidelity, but which can be mapped back to a syntax tree if so desired.
  43
  44
  45 ### GreenNode
  46
  47 GreenNode is a purely functional tree with arbitrary arity. Conceptually, it is equivalent to the following run-of-the-mill struct:
  48
  49 ```rust
  50 #[derive(PartialEq, Eq, Clone, Copy)]
  51 struct SyntaxKind(u16);
  52
  53 #[derive(PartialEq, Eq, Clone)]
  54 struct Node {
  55     kind: SyntaxKind,
  56     text_len: usize,
  57     children: Vec<Arc<Either<Node, Token>>>,
  58 }
  59
  60 #[derive(PartialEq, Eq, Clone)]
  61 struct Token {
  62     kind: SyntaxKind,
  63     text: String,
  64 }
  65 ```
  66
  67 All the differences between the above sketch and the real implementation are strictly due to optimizations.
  68
  69 Points of note:
  70 * The tree is untyped. Each node has a "type tag", `SyntaxKind`.
  71 * Interior and leaf nodes are distinguished on the type level.
  72 * Trivia and non-trivia tokens are not distinguished on the type level.
  73 * Each token carries its full text.
  74 * The original text can be recovered by concatenating the texts of all tokens in order.
  75 * Accessing a child of particular type (for example, parameter list of a function) generally involves linearly traversing the children, looking for a specific `kind`.
  76 * Modifying the tree is roughly `O(depth)`.
  77   We don't make special efforts to guarantee that the depth is not linear, but, in practice, syntax trees are branchy and shallow.
  78 * If a mandatory (grammar-wise) node is missing from the input, it's just missing from the tree.
  79 * If extra erroneous input is present, it is wrapped into a node with `ERROR` kind, and treated just like any other node.
  80 * Parser errors are not a part of the syntax tree.
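
Several of the points above can be checked on a small model. The following is a minimal sketch, not the `rowan` implementation: it uses a hypothetical `Element` enum in place of `Either<Node, Token>` and a tiny hand-picked `Kind` enum, and shows that the text is recovered by concatenating token texts and that finding a child of a given kind is a linear scan.

```rust
use std::sync::Arc;

// A tiny, made-up subset of syntax kinds, just for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind { Fn, FnKw, Whitespace, Name, Ident, ParamList, LParen, RParen }

// Stand-in for `Either<Node, Token>` from the struct above.
#[derive(Debug, PartialEq, Eq)]
enum Element {
    Node(Node),
    Token(Token),
}

#[derive(Debug, PartialEq, Eq)]
struct Node {
    kind: Kind,
    text_len: usize,
    children: Vec<Arc<Element>>,
}

#[derive(Debug, PartialEq, Eq)]
struct Token {
    kind: Kind,
    text: String,
}

fn token(kind: Kind, text: &str) -> Arc<Element> {
    Arc::new(Element::Token(Token { kind, text: text.to_string() }))
}

fn node(kind: Kind, children: Vec<Arc<Element>>) -> Arc<Element> {
    // The length of a node is the sum of the lengths of its children.
    let text_len: usize = children.iter().map(|c| c.text_len()).sum();
    Arc::new(Element::Node(Node { kind, text_len, children }))
}

impl Element {
    fn kind(&self) -> Kind {
        match self {
            Element::Node(n) => n.kind,
            Element::Token(t) => t.kind,
        }
    }

    fn text_len(&self) -> usize {
        match self {
            Element::Node(n) => n.text_len,
            Element::Token(t) => t.text.len(),
        }
    }

    // Recovers the original text by concatenating all token texts in order.
    fn text(&self) -> String {
        match self {
            Element::Token(t) => t.text.clone(),
            Element::Node(n) => n.children.iter().map(|c| c.text()).collect(),
        }
    }
}

impl Node {
    // Linear scan over the direct children, looking for a specific kind.
    fn child_of_kind(&self, kind: Kind) -> Option<&Element> {
        self.children.iter().map(|c| &**c).find(|c| c.kind() == kind)
    }
}

// Builds the green tree for the text `fn f()`.
fn sample() -> Arc<Element> {
    node(Kind::Fn, vec![
        token(Kind::FnKw, "fn"),
        token(Kind::Whitespace, " "),
        node(Kind::Name, vec![token(Kind::Ident, "f")]),
        node(Kind::ParamList, vec![token(Kind::LParen, "("), token(Kind::RParen, ")")]),
    ])
}
```

Whitespace is an ordinary token here, which is why `text()` needs no special handling for trivia.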
  81
  82 An input like `fn f() { 90 + 2 }` might be parsed as
  83
  84 ```
  85 FN@0..17
  86   FN_KW@0..2 "fn"
  87   WHITESPACE@2..3 " "
  88   NAME@3..4
  89     IDENT@3..4 "f"
  90   PARAM_LIST@4..6
  91     L_PAREN@4..5 "("
  92     R_PAREN@5..6 ")"
  93   WHITESPACE@6..7 " "
  94   BLOCK_EXPR@7..17
  95     L_CURLY@7..8 "{"
  96     WHITESPACE@8..9 " "
  97     BIN_EXPR@9..15
  98       LITERAL@9..11
  99         INT_NUMBER@9..11 "90"
 100       WHITESPACE@11..12 " "
 101       PLUS@12..13 "+"
 102       WHITESPACE@13..14 " "
 103       LITERAL@14..15
 104         INT_NUMBER@14..15 "2"
 105     WHITESPACE@15..16 " "
 106     R_CURLY@16..17 "}"
 107 ```
 108
 109 #### Optimizations
 110
 111 (A significant amount of the implementation work here was done by [CAD97](https://github.com/cad97).)
 112
 113 To reduce the amount of allocations, the GreenNode is a [DST](https://doc.rust-lang.org/reference/dynamically-sized-types.html), which uses a single allocation for header and children. Thus, it is only usable behind a pointer.
 114
 115 ```
 116 *-----------+------+----------+------------+--------+--------+-----+--------*
 117 | ref_count | kind | text_len | n_children | child1 | child2 | ... | childn |
 118 *-----------+------+----------+------------+--------+--------+-----+--------*
 119 ```
 120
 121 To more compactly store the children, we box *both* interior nodes and tokens, and represent
 122 `Either<Arc<Node>, Arc<Token>>` as a single pointer with a tag in the last bit.
 123
 124 To avoid allocating EVERY SINGLE TOKEN on the heap, syntax trees use interning.
 125 Because the tree is fully immutable, it's valid to structurally share subtrees.
 126 For example, in `1 + 1`, there will be a *single* token for `1` with ref count 2; the same goes for the ` ` whitespace token.
 127 Interior nodes are shared as well (for example in `(1 + 1) * (1 + 1)`).
 128
 129 Note that the result of the interning is an `Arc<Node>`.
 130 That is, it's not an index into interning table, so you don't have to have the table around to do anything with the tree.
 131 Each tree is fully self-contained (although different trees might share parts).
 132 Currently, the interner is created per-file, but it will be easy to use a per-thread or per-some-context one.
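
To illustrate the sharing, here is a minimal sketch of token interning. The `Interner` type, its `HashMap` layout and the `u16` kind constants are assumptions made for this example, not `rowan`'s actual data structures. The point it shows is that interning hands out plain `Arc`s, so the tokens stay usable after the interner itself is dropped.

```rust
use std::collections::HashMap;
use std::sync::Arc;

#[derive(Debug, PartialEq, Eq)]
struct Token {
    kind: u16,
    text: String,
}

// Hypothetical per-file interner: equal tokens map to one shared allocation.
#[derive(Default)]
struct Interner {
    tokens: HashMap<(u16, String), Arc<Token>>,
}

impl Interner {
    fn token(&mut self, kind: u16, text: &str) -> Arc<Token> {
        let shared = self
            .tokens
            .entry((kind, text.to_string()))
            .or_insert_with(|| Arc::new(Token { kind, text: text.to_string() }));
        // Hand out another reference to the shared allocation.
        Arc::clone(shared)
    }
}

// Made-up kind tags for this sketch.
const INT_NUMBER: u16 = 0;
const WHITESPACE: u16 = 1;
const PLUS: u16 = 2;

// Tokens of `1 + 1`. The interner is dropped when this function returns,
// yet the returned tokens remain fully usable.
fn one_plus_one() -> Vec<Arc<Token>> {
    let mut interner = Interner::default();
    vec![
        interner.token(INT_NUMBER, "1"),
        interner.token(WHITESPACE, " "),
        interner.token(PLUS, "+"),
        interner.token(WHITESPACE, " "),
        interner.token(INT_NUMBER, "1"),
    ]
}
```

After `one_plus_one()` returns, both `1` tokens are the same allocation with a reference count of 2, matching the description above.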
 133
 134 We use a `TextSize`, a newtyped `u32`, to store the length of the text.
 135
 136 We currently use `SmolStr`, a small-object-optimized string, to store text.
 137 This was mostly relevant *before* we implemented tree interning, to avoid allocating common keywords and identifiers. We should switch to storing text data alongside the interned tokens.
 138
 139 #### Alternative designs
 140
 141 ##### Dealing with trivia
 142
 143 In the above model, whitespace is not treated specially.
 144 Another alternative (used by swift and roslyn) is to explicitly divide the set of tokens into trivia and non-trivia tokens, and represent non-trivia tokens as
 145
 146 ```rust
 147 struct Token {
 148     kind: NonTriviaTokenKind,
 149     text: String,
 150     leading_trivia: Vec<TriviaToken>,
 151     trailing_trivia: Vec<TriviaToken>,
 152 }
 153 ```
 154
 155 The tree then contains only non-trivia tokens.
 156
 157 Another approach (from Dart) is to, in addition to a syntax tree, link all the tokens into a doubly linked list.
 158 That way, the tree again contains only non-trivia tokens.
 159
 160 Explicit trivia nodes, like in `rowan`, are used by IntelliJ.
 161
 162 ##### Accessing Children
 163
 164 As noted before, accessing a specific child in the node requires a linear traversal of the children (though we can skip tokens, because the tag is encoded in the pointer itself).
 165 It is possible to recover O(1) access with another representation.
 166 We explicitly store optional and missing (required by the grammar, but not present) nodes.
 167 That is, we use `Option<Node>` for children.
 168 We also remove trivia tokens from the tree.
 169 This way, each child kind generally occupies a fixed position in a parent, and we can use index access to fetch it.
 170 The cost is that we now need to allocate space for all not-present optional nodes.
 171 So, `fn foo() {}` will have slots for visibility, unsafeness, attributes, abi and return type.
 172
 173 IntelliJ uses linear traversal.
 174 Roslyn and Swift do `O(1)` access.
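
A minimal sketch of the fixed-slot alternative follows. The `FnSlot` enum, the slot order and the `Child` type are invented for illustration and do not reflect how Roslyn or Swift actually lay out their nodes. It shows the trade-off described above: index access is `O(1)`, but absent optional children still occupy a slot.

```rust
// Hypothetical slot layout for a function node. Every possible child has a
// fixed index, whether or not it is present in the source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FnSlot {
    Attrs = 0,
    Visibility,
    Unsafe,
    Abi,
    Name,
    ParamList,
    RetType,
    Body,
}

const N_FN_SLOTS: usize = 8;

#[derive(Debug, Clone, PartialEq, Eq)]
struct Child {
    text: String,
}

struct FnNode {
    slots: [Option<Child>; N_FN_SLOTS],
}

impl FnNode {
    // O(1): no scan over the children is needed.
    fn get(&self, slot: FnSlot) -> Option<&Child> {
        self.slots[slot as usize].as_ref()
    }

    // Number of slots that take up space without holding a node.
    fn empty_slots(&self) -> usize {
        self.slots.iter().filter(|s| s.is_none()).count()
    }
}

// Slots for `fn foo() {}`: only name, parameter list and body are present.
fn fn_foo() -> FnNode {
    let mut slots: [Option<Child>; N_FN_SLOTS] = Default::default();
    slots[FnSlot::Name as usize] = Some(Child { text: "foo".to_string() });
    slots[FnSlot::ParamList as usize] = Some(Child { text: "()".to_string() });
    slots[FnSlot::Body as usize] = Some(Child { text: "{}".to_string() });
    FnNode { slots }
}
```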
 175
 176 ##### Mutable Trees
 177
 178 IntelliJ uses mutable trees.
 179 Overall, it creates a lot of additional complexity.
 180 However, the API for *editing* syntax trees is nice.
 181
 182 For example the assist to move generic bounds to where clause has this code:
 183
 184 ```kotlin
 185 for (typeBound in typeBounds) {
 186     typeBound.typeParamBounds?.delete()
 187 }
 188 ```
 189
 190 Modeling this with immutable trees is possible, but annoying.
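
For comparison, here is a rough sketch of what a deletion looks like on an immutable tree. The `Green` struct and the `remove_at` helper are hypothetical and much simpler than the real green nodes (no `text_len`, tokens are just childless nodes). The edit rebuilds the nodes on the path from the root to the removed child and shares everything else, which is where the `O(depth)` cost comes from.

```rust
use std::sync::Arc;

#[derive(Debug, PartialEq, Eq)]
struct Green {
    kind: &'static str,
    text: String, // non-empty only for tokens in this sketch
    children: Vec<Arc<Green>>,
}

fn leaf(text: &str) -> Arc<Green> {
    Arc::new(Green { kind: "TOKEN", text: text.to_string(), children: Vec::new() })
}

fn node(children: Vec<Arc<Green>>) -> Arc<Green> {
    Arc::new(Green { kind: "NODE", text: String::new(), children })
}

fn text(node: &Green) -> String {
    let mut out = node.text.clone();
    for child in &node.children {
        out.push_str(&text(child));
    }
    out
}

// Returns a new tree with the child at `path` removed. `path` is a list of
// child indices, starting at the root. The old tree is left untouched.
fn remove_at(node: &Arc<Green>, path: &[usize]) -> Arc<Green> {
    let (&index, rest) = path.split_first().expect("path must not be empty");
    // Cloning the vector only clones pointers, so siblings stay shared.
    let mut children = node.children.clone();
    if rest.is_empty() {
        children.remove(index);
    } else {
        let new_child = remove_at(&children[index], rest);
        children[index] = new_child;
    }
    Arc::new(Green { kind: node.kind, text: node.text.clone(), children })
}

// A tree with the text `abcd`: root[a, mid[b, c], d].
fn sample() -> Arc<Green> {
    node(vec![leaf("a"), node(vec![leaf("b"), leaf("c")]), leaf("d")])
}
```

Calling `remove_at(&sample(), &[1, 1])` yields a tree with the text `abd`. The caller has to thread the new root through by hand, which is the "annoying" part compared to `delete()` on a mutable tree.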
 191
 192 ### Syntax Nodes
 193
 194 A purely functional green tree is not super-convenient to use.
 195 The biggest problem is accessing parents (there are no parent pointers!).
 196 But there are also "identity" issues.
 197 Let's say you want to write code which builds a list of expressions in a file: `fn collect_expressions(file: GreenNode) -> HashSet<GreenNode>`.
 198 For the input like
 199
 200 ```rust
 201 fn main() {
 202     let x = 90i8;
 203     let x = x + 2;
 204     let x = 90i64;
 205     let x = x + 2;
 206 }
 207 ```
 208
 209 both copies of the `x + 2` expression are represented by equal (and, with interning in mind, actually the same) green nodes.
 210 Green trees just can't differentiate between the two.
 211
 212 `SyntaxNode` adds parent pointers and identity semantics to green nodes.
 213 They can be called cursors or [zippers](https://en.wikipedia.org/wiki/Zipper_(data_structure)) (fun fact: zipper is a derivative (as in ′) of a data structure).
 214
 215 Conceptually, a `SyntaxNode` looks like this:
 216
 217 ```rust
 218 type SyntaxNode = Arc<SyntaxData>;
 219
 220 struct SyntaxData {
 221     offset: usize,
 222     parent: Option<SyntaxNode>,
 223     green: Arc<GreenNode>,
 224 }
 225
 226 impl SyntaxNode {
 227     fn new_root(root: Arc<GreenNode>) -> SyntaxNode {
 228         Arc::new(SyntaxData {
 229             offset: 0,
 230             parent: None,
 231             green: root,
 232         })
 233     }
 234     fn parent(&self) -> Option<SyntaxNode> {
 235         self.parent.clone()
 236     }
 237     fn children(&self) -> impl Iterator<Item = SyntaxNode> {
 238         let mut offset = self.offset;
 239         self.green.children().map(|green_child| {
 240             let child_offset = offset;
 241             offset += green_child.text_len;
 242             Arc::new(SyntaxData {
 243                 offset: child_offset,
 244                 parent: Some(Arc::clone(self)),
 245                 green: Arc::clone(green_child),
 246             })
 247         })
 248     }
 249 }
 250
 251 impl PartialEq for SyntaxNode {
 252     fn eq(&self, other: &SyntaxNode) -> bool {
 253         self.offset == other.offset
 254             && Arc::ptr_eq(&self.green, &other.green)
 255     }
 256 }
 257 ```
 258
 259 Points of note:
 260
 261 * SyntaxNode remembers its parent node (and, transitively, the path to the root of the tree)
 262 * SyntaxNode knows its *absolute* text offset in the whole file
 263 * Equality is based on identity. Comparing nodes from different trees does not make sense.
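
The conceptual code above can be turned into a small compiling sketch. The version below makes a few assumptions of its own: `SyntaxNode` is a newtype around `Arc<SyntaxData>` (so that it can have inherent methods), `children` returns a `Vec` instead of an iterator, and the green node carries only a length. It shows that two occurrences of the same interned green node become distinct syntax nodes.

```rust
use std::sync::Arc;

#[derive(Debug)]
struct GreenNode {
    text_len: usize,
    children: Vec<Arc<GreenNode>>,
}

#[derive(Debug, Clone)]
struct SyntaxNode(Arc<SyntaxData>);

#[derive(Debug)]
struct SyntaxData {
    offset: usize,
    parent: Option<SyntaxNode>,
    green: Arc<GreenNode>,
}

impl SyntaxNode {
    fn new_root(root: Arc<GreenNode>) -> SyntaxNode {
        SyntaxNode(Arc::new(SyntaxData { offset: 0, parent: None, green: root }))
    }

    fn offset(&self) -> usize {
        self.0.offset
    }

    fn parent(&self) -> Option<SyntaxNode> {
        self.0.parent.clone()
    }

    fn children(&self) -> Vec<SyntaxNode> {
        // Running absolute offset of the next child.
        let mut offset = self.0.offset;
        self.0
            .green
            .children
            .iter()
            .map(|green| {
                let child = SyntaxNode(Arc::new(SyntaxData {
                    offset,
                    parent: Some(self.clone()),
                    green: Arc::clone(green),
                }));
                offset += green.text_len;
                child
            })
            .collect()
    }
}

// Identity: same green node at the same absolute offset.
impl PartialEq for SyntaxNode {
    fn eq(&self, other: &SyntaxNode) -> bool {
        self.0.offset == other.0.offset && Arc::ptr_eq(&self.0.green, &other.0.green)
    }
}

// A root whose two children are the *same* green node (think of the two
// `x + 2` expressions, each five characters long).
fn demo() -> (SyntaxNode, Vec<SyntaxNode>) {
    let expr = Arc::new(GreenNode { text_len: 5, children: Vec::new() });
    let root = Arc::new(GreenNode {
        text_len: 10,
        children: vec![Arc::clone(&expr), Arc::clone(&expr)],
    });
    let root = SyntaxNode::new_root(root);
    let children = root.children();
    (root, children)
}
```

Both children wrap the same `GreenNode` allocation, but they sit at offsets 0 and 5 and therefore compare as different nodes.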
 264
 265 #### Optimization
 266
 267 The reality is different though :-)
 268 Traversal of trees is a common operation, and it makes sense to optimize it.
 269 In particular, the above code allocates and does atomic operations during a traversal.
 270
 271 To get rid of atomics, `rowan` uses non thread-safe `Rc`.
 272 This is OK because tree traversals mostly (always, in the case of rust-analyzer) run on a single thread. If you need to send a `SyntaxNode` to another thread, you can send a pair of the **root** `GreenNode` (which is thread safe) and a `Range<usize>`.
 273 The other thread can restore the `SyntaxNode` by traversing from the root green node and looking for a node with specified range.
 274 You can also use a similar trick to store a `SyntaxNode`.
 275 That is, a data structure that holds a `(GreenNode, Range<usize>)` will be `Sync`.
 276 However, rust-analyzer goes even further.
 277 It treats trees as semi-transient and instead of storing a `GreenNode`, it generally stores just the id of the file from which the tree originated: `(FileId, Range<usize>)`.
 278 The `SyntaxNode` is then restored by reparsing the file and traversing it from the root.
 279 With this trick, rust-analyzer holds only a small number of trees in memory at the same time, which reduces memory usage.
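
The "root plus range" trick can be sketched as follows. The `NodePtr` type and the `resolve` function are hypothetical names for this example. Storing the kind next to the range is an assumption of the sketch, added because a node and its only child (such as `NAME` and `IDENT`) cover the same range. Zero-length nodes are ignored for simplicity.

```rust
use std::ops::Range;
use std::sync::Arc;

#[derive(Debug)]
struct Green {
    kind: &'static str,
    text_len: usize,
    children: Vec<Arc<Green>>,
}

// A thread-safe "pointer" to a node: plain data, no `Rc` inside.
#[derive(Debug, Clone, PartialEq, Eq)]
struct NodePtr {
    kind: &'static str,
    range: Range<usize>,
}

// Walks down from the root, always descending into the child that covers
// the requested range, until a node with matching range and kind is found.
fn resolve<'a>(root: &'a Green, ptr: &NodePtr) -> Option<&'a Green> {
    let mut node = root;
    let mut start = 0;
    loop {
        let end = start + node.text_len;
        if start == ptr.range.start && end == ptr.range.end && node.kind == ptr.kind {
            return Some(node);
        }
        let mut offset = start;
        let mut next = None;
        for child in &node.children {
            let child_end = offset + child.text_len;
            if offset <= ptr.range.start && ptr.range.end <= child_end {
                next = Some((&**child, offset));
                break;
            }
            offset = child_end;
        }
        // No child covers the range: the pointer does not match this tree.
        let (child, child_start) = next?;
        node = child;
        start = child_start;
    }
}

// Green tree for the text `f()`: FN[NAME[IDENT], PARAM_LIST].
fn sample() -> Arc<Green> {
    let ident = Arc::new(Green { kind: "IDENT", text_len: 1, children: Vec::new() });
    let name = Arc::new(Green { kind: "NAME", text_len: 1, children: vec![ident] });
    let params = Arc::new(Green { kind: "PARAM_LIST", text_len: 2, children: Vec::new() });
    Arc::new(Green { kind: "FN", text_len: 3, children: vec![name, params] })
}
```

Resolution costs a walk from the root, so it is proportional to the depth of the tree times the number of siblings scanned at each level.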
 280
 281 Additionally, only the root `SyntaxNode` owns an `Arc` to the (root) `GreenNode`.
 282 All other `SyntaxNode`s point to corresponding `GreenNode`s with a raw pointer.
 283 They also point to the parent (and, consequently, to the root) with an owning `Rc`, so this is sound.
 284 In other words, one needs *one* arc bump when initiating a traversal.
 285
 286 To get rid of allocations, `rowan` takes advantage of `SyntaxNode: !Sync` and uses a thread-local free list of `SyntaxNode`s.
 287 In a typical traversal, you only directly hold a few `SyntaxNode`s at a time (and their ancestors indirectly), so a free list proportional to the depth of the tree removes all allocations in a typical case.
 288
 289 So, while traversal is not exactly incrementing a pointer, it's still pretty cheap: TLS + rc bump!
 290
 291 Traversal also yields (cheap) owned nodes, which improves ergonomics quite a bit.
 292
 293 #### Alternative Designs
 294
 295 ##### Memoized RedNodes
 296
 297 C# and Swift follow the design where the red nodes are memoized, which would look roughly like this in Rust:
 298
 299 ```rust
 300 type SyntaxNode = Arc<SyntaxData>;
 301
 302 struct SyntaxData {
 303     offset: usize,
 304     parent: Option<SyntaxNode>,
 305     green: Arc<GreenNode>,
 306     children: Vec<OnceCell<SyntaxNode>>,
 307 }
 308 ```
 309
 310 This allows using true pointer equality for comparison of identities of `SyntaxNodes`.
 311 rust-analyzer used to have this design as well, but we've since switched to cursors.
 312 The main problem with memoizing the red nodes is that it more than doubles the memory requirements for fully realized syntax trees.
 313 In contrast, cursors generally retain only a path to the root.
 314 C# combats increased memory usage by using weak references.
 315
 316 ### AST
 317
 318 `GreenTree`s are untyped and homogeneous, because it makes accommodating error nodes, arbitrary whitespace and comments natural, and because it makes it possible to write generic tree traversals.
 319 However, when working with a specific node, like a function definition, one would want a strongly typed API.
 320
 321 This is what is provided by the AST layer. AST nodes are transparent wrappers over untyped syntax nodes:
 322
 323 ```rust
 324 pub trait AstNode {
 325     fn cast(syntax: SyntaxNode) -> Option<Self>
 326     where
 327         Self: Sized;
 328
 329     fn syntax(&self) -> &SyntaxNode;
 330 }
 331 ```
 332
 333 Concrete nodes are generated (there are 117 of them), and look roughly like this:
 334
 335 ```rust
 336 #[derive(Debug, Clone, PartialEq, Eq, Hash)]
 337 pub struct FnDef {
 338     syntax: SyntaxNode,
 339 }
 340
 341 impl AstNode for FnDef {
 342     fn cast(syntax: SyntaxNode) -> Option<Self> {
 343         match syntax.kind() {
 344             FN => Some(FnDef { syntax }),
 345             _ => None,
 346         }
 347     }
 348     fn syntax(&self) -> &SyntaxNode {
 349         &self.syntax
 350     }
 351 }
 352
 353 impl FnDef {
 354     pub fn param_list(&self) -> Option<ParamList> {
 355         self.syntax.children().find_map(ParamList::cast)
 356     }
 357     pub fn ret_type(&self) -> Option<RetType> {
 358         self.syntax.children().find_map(RetType::cast)
 359     }
 360     pub fn body(&self) -> Option<BlockExpr> {
 361         self.syntax.children().find_map(BlockExpr::cast)
 362     }
 363     // ...
 364 }
 365 ```
 366
 367 Variants like expressions, patterns or items are modeled with `enum`s, which also implement `AstNode`:
 368
 369 ```rust
 370 #[derive(Debug, Clone, PartialEq, Eq, Hash)]
 371 pub enum AssocItem {
 372     FnDef(FnDef),
 373     TypeAliasDef(TypeAliasDef),
 374     ConstDef(ConstDef),
 375 }
 376
 377 impl AstNode for AssocItem {
 378     ...
 379 }
 380 ```
 381
 382 Shared AST substructures are modeled via (object safe) traits:
 383
 384 ```rust
 385 trait HasVisibility: AstNode {
 386     fn visibility(&self) -> Option<Visibility>;
 387 }
 388
 389 impl HasVisibility for FnDef {
 390     fn visibility(&self) -> Option<Visibility> {
 391         self.syntax.children().find_map(Visibility::cast)
 392     }
 393 }
 394 ```
 395
 396 Points of note:
 397
 398 * Like `SyntaxNode`s, AST nodes are cheap-to-clone, pointer-sized owned values.
 399 * All "fields" are optional, to accommodate incomplete and/or erroneous source code.
 400 * It's always possible to go from an ast node to an untyped `SyntaxNode`.
 401 * It's possible to go in the opposite direction with a checked cast.
 402 * `enum`s allow modeling of arbitrary intersecting subsets of AST types.
 403 * Most of rust-analyzer works with the ast layer, with notable exceptions of:
 404   * macro expansion, which needs access to raw tokens and works with `SyntaxNode`s
 405   * some IDE-specific features like syntax highlighting are more conveniently implemented over a homogeneous `SyntaxNode` tree
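
To see the checked cast and the optional "fields" in action, here is a self-contained sketch. The `SyntaxNode` below is a toy stand-in (an `Rc` around a kind and a child list), not the real cursor type, and the kind enum has only four made-up variants. The AST wrappers follow the shape shown above.

```rust
use std::rc::Rc;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyntaxKind { Fn, ParamList, BlockExpr, Visibility }

// Toy untyped node: just a kind and children, cheap to clone.
#[derive(Debug, Clone)]
struct SyntaxNode(Rc<NodeData>);

#[derive(Debug)]
struct NodeData {
    kind: SyntaxKind,
    children: Vec<SyntaxNode>,
}

impl SyntaxNode {
    fn new(kind: SyntaxKind, children: Vec<SyntaxNode>) -> SyntaxNode {
        SyntaxNode(Rc::new(NodeData { kind, children }))
    }
    fn kind(&self) -> SyntaxKind {
        self.0.kind
    }
    fn children(&self) -> impl Iterator<Item = SyntaxNode> + '_ {
        self.0.children.iter().cloned()
    }
}

trait AstNode: Sized {
    fn cast(syntax: SyntaxNode) -> Option<Self>;
    fn syntax(&self) -> &SyntaxNode;
}

#[derive(Debug, Clone)]
struct FnDef { syntax: SyntaxNode }

#[derive(Debug, Clone)]
struct ParamList { syntax: SyntaxNode }

#[derive(Debug, Clone)]
struct Visibility { syntax: SyntaxNode }

impl AstNode for FnDef {
    fn cast(syntax: SyntaxNode) -> Option<Self> {
        match syntax.kind() {
            SyntaxKind::Fn => Some(FnDef { syntax }),
            _ => None,
        }
    }
    fn syntax(&self) -> &SyntaxNode { &self.syntax }
}

impl AstNode for ParamList {
    fn cast(syntax: SyntaxNode) -> Option<Self> {
        match syntax.kind() {
            SyntaxKind::ParamList => Some(ParamList { syntax }),
            _ => None,
        }
    }
    fn syntax(&self) -> &SyntaxNode { &self.syntax }
}

impl AstNode for Visibility {
    fn cast(syntax: SyntaxNode) -> Option<Self> {
        match syntax.kind() {
            SyntaxKind::Visibility => Some(Visibility { syntax }),
            _ => None,
        }
    }
    fn syntax(&self) -> &SyntaxNode { &self.syntax }
}

impl FnDef {
    // Every accessor returns `Option`, because the child may be missing.
    fn param_list(&self) -> Option<ParamList> {
        self.syntax.children().find_map(ParamList::cast)
    }
    fn visibility(&self) -> Option<Visibility> {
        self.syntax.children().find_map(Visibility::cast)
    }
}
```

Given a `Fn` node with a `ParamList` and a `BlockExpr` child, `FnDef::cast` succeeds, `param_list()` returns `Some` and `visibility()` returns `None`, while `ParamList::cast` on the same node fails.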
 406
 407 #### Alternative Designs
 408
 409 ##### Semantic Full AST
 410
 411 In IntelliJ the AST layer (dubbed **P**rogram **S**tructure **I**nterface) can have semantics attached, and is usually backed by either syntax tree, indices, or metadata from compiled libraries.
 412 The backend for PSI can change dynamically.
 413
 414 ### Syntax Tree Recap
 415
 416 At its core, the syntax tree is a purely functional n-ary tree, which stores text at the leaf nodes and node "kinds" at all nodes.
 417 A cursor layer is added on top, which gives owned, cheap to clone nodes with identity semantics, parent links and absolute offsets.
 418 An AST layer is added on top, which reifies each node `Kind` as a separate Rust type with the corresponding API.
 419
 420 ## Parsing
 421
 422 The (green) tree is constructed by a DFS "traversal" of the desired tree structure:
 423
 424 ```rust
 425 pub struct GreenNodeBuilder { ... }
 426
 427 impl GreenNodeBuilder {
 428     pub fn new() -> GreenNodeBuilder { ... }
 429
 430     pub fn token(&mut self, kind: SyntaxKind, text: &str) { ... }
 431
 432     pub fn start_node(&mut self, kind: SyntaxKind) { ... }
 433     pub fn finish_node(&mut self) { ... }
 434
 435     pub fn finish(self) -> GreenNode { ... }
 436 }
 437 ```
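
One way such a builder could work internally is with a stack of unfinished nodes. The sketch below is an assumption about the technique, not the `rowan` source: `Builder` and `Green` are simplified stand-ins, kinds are plain strings, and there is no interning.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Green {
    Node { kind: &'static str, children: Vec<Green> },
    Token { kind: &'static str, text: String },
}

#[derive(Default)]
struct Builder {
    // Stack of unfinished nodes: (kind, children collected so far).
    stack: Vec<(&'static str, Vec<Green>)>,
    root: Option<Green>,
}

impl Builder {
    fn token(&mut self, kind: &'static str, text: &str) {
        let parent = self.stack.last_mut().expect("token outside of a node");
        parent.1.push(Green::Token { kind, text: text.to_string() });
    }

    fn start_node(&mut self, kind: &'static str) {
        self.stack.push((kind, Vec::new()));
    }

    fn finish_node(&mut self) {
        let (kind, children) = self.stack.pop().expect("unbalanced finish_node");
        let node = Green::Node { kind, children };
        // Attach the finished node to its parent, or make it the root.
        match self.stack.last_mut() {
            Some(parent) => parent.1.push(node),
            None => self.root = Some(node),
        }
    }

    fn finish(self) -> Green {
        assert!(self.stack.is_empty(), "unfinished nodes remain");
        self.root.expect("no root node was built")
    }
}

fn text(green: &Green) -> String {
    match green {
        Green::Token { text, .. } => text.clone(),
        Green::Node { children, .. } => children.iter().map(text).collect(),
    }
}

// Depth-first "traversal" that builds the tree for `1 + 1`.
fn build_one_plus_one() -> Green {
    let mut b = Builder::default();
    b.start_node("BIN_EXPR");
    b.start_node("LITERAL");
    b.token("INT_NUMBER", "1");
    b.finish_node();
    b.token("WHITESPACE", " ");
    b.token("PLUS", "+");
    b.token("WHITESPACE", " ");
    b.start_node("LITERAL");
    b.token("INT_NUMBER", "1");
    b.finish_node();
    b.finish_node();
    b.finish()
}
```

The calls to `start_node`, `token` and `finish_node` happen in exactly the order of a depth-first walk over the finished tree.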
 438
 439 The parser, ultimately, needs to invoke the `GreenNodeBuilder`.
 440 There are two principal sources of inputs for the parser:
 441   * source text, which contains trivia tokens (whitespace and comments)
 442   * token trees from macros, which lack trivia
 443
 444 Additionally, input tokens do not correspond 1-to-1 with output tokens.
 445 For example, two consecutive `>` tokens might be glued, by the parser, into a single `>>`.
 446
 447 For these reasons, the parser crate defines callback interfaces for both input tokens and output trees.
 448 The explicit glue layer then bridges various gaps.
 449
 450 The parser interface looks like this:
 451
 452 ```rust
 453 pub struct Token {
 454     pub kind: SyntaxKind,
 455     pub is_joined_to_next: bool,
 456 }
 457
 458 pub trait TokenSource {
 459     fn current(&self) -> Token;
 460     fn lookahead_nth(&self, n: usize) -> Token;
 461     fn is_keyword(&self, kw: &str) -> bool;
 462
 463     fn bump(&mut self);
 464 }
 465
 466 pub trait TreeSink {
 467     fn token(&mut self, kind: SyntaxKind, n_tokens: u8);
 468
 469     fn start_node(&mut self, kind: SyntaxKind);
 470     fn finish_node(&mut self);
 471
 472     fn error(&mut self, error: ParseError);
 473 }
 474
 475 pub fn parse(
 476     token_source: &mut dyn TokenSource,
 477     tree_sink: &mut dyn TreeSink,
 478 ) { ... }
 479 ```
 480
 481 Points of note:
 482
 483 * The parser and the syntax tree are independent: they live in different crates, neither of which depends on the other.
 484 * The parser doesn't know anything about textual contents of the tokens, with an isolated hack for checking contextual keywords.
 485 * For gluing tokens, the `TreeSink::token` might advance further than one atomic token ahead.
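
The gluing and the trivia handling in the sink can be illustrated with a simplified sketch. `TextSink` and `RawToken` are hypothetical names, kinds are strings, and node events are left out entirely. Only the `token(kind, n_tokens)` shape mirrors the `TreeSink` method above.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct RawToken {
    kind: &'static str,
    text: String,
}

fn tok(kind: &'static str, text: &str) -> RawToken {
    RawToken { kind, text: text.to_string() }
}

// Sink that owns the raw token stream (including whitespace) and produces
// output leaves as the parser reports tokens.
struct TextSink {
    raw: Vec<RawToken>,
    pos: usize,
    out: Vec<(&'static str, String)>,
}

impl TextSink {
    fn new(raw: Vec<RawToken>) -> TextSink {
        TextSink { raw, pos: 0, out: Vec::new() }
    }

    // The parser never sees whitespace, so the sink re-attaches it here.
    fn eat_trivia(&mut self) {
        while self.pos < self.raw.len() && self.raw[self.pos].kind == "WHITESPACE" {
            let t = &self.raw[self.pos];
            self.out.push((t.kind, t.text.clone()));
            self.pos += 1;
        }
    }

    // Emits one output token that covers `n_tokens` raw tokens.
    // Panics if the parser asks for more tokens than the input has.
    fn token(&mut self, kind: &'static str, n_tokens: u8) {
        self.eat_trivia();
        let mut text = String::new();
        for _ in 0..n_tokens {
            text.push_str(&self.raw[self.pos].text);
            self.pos += 1;
        }
        self.out.push((kind, text));
    }
}

// Input `a >> b` lexed with two separate `>` tokens. The calls below are
// what a parser would report after deciding to glue them into a shift.
fn glue_demo() -> Vec<(&'static str, String)> {
    let raw = vec![
        tok("IDENT", "a"),
        tok("WHITESPACE", " "),
        tok("R_ANGLE", ">"),
        tok("R_ANGLE", ">"),
        tok("WHITESPACE", " "),
        tok("IDENT", "b"),
    ];
    let mut sink = TextSink::new(raw);
    sink.token("IDENT", 1);
    sink.token("SHR", 2);
    sink.token("IDENT", 1);
    sink.out
}
```

The parser only ever names kinds and counts, and the sink supplies the text, which is how the parser can stay ignorant of token contents.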
 486
 487 ### Reporting Syntax Errors
 488
 489 Syntax errors are not stored directly in the tree.
 490 The primary motivation for this is that a syntax tree is not necessarily produced by the parser; it may also be assembled manually from pieces (which happens all the time in refactorings).
 491 Instead, parser reports errors to an error sink, which stores them in a `Vec`.
 492 If possible, errors are not reported during parsing and are postponed for a separate validation step.
 493 For example, parser accepts visibility modifiers on trait methods, but then a separate tree traversal flags all such visibilities as erroneous.
 494
 495 ### Macros
 496
 497 The primary difficulty with macros is that individual tokens have identities, which need to be preserved in the syntax tree for hygiene purposes.
 498 This is handled by the `TreeSink` layer.
 499 Specifically, `TreeSink` constructs the tree in lockstep with draining the original token stream.
 500 In the process, it records which tokens of the tree correspond to which tokens of the input, by using text ranges to identify syntax tokens.
 501 The end result is that parsing expanded code yields a syntax tree and a mapping of text ranges of the tree to original tokens.
 502
 503 To deal with precedence in cases like `$expr * 1`, we use special invisible parentheses, which are explicitly handled by the parser.
 504
 505 ### Whitespace & Comments
 506
 507 Parser does not see whitespace nodes.
 508 Instead, they are attached to the tree in the `TreeSink` layer.
 509
 510 For example, in
 511
 512 ```rust
 513 // non doc comment
 514 fn foo() {}
 515 ```
 516
 517 the comment will be (heuristically) made a child of the function node.
 518
 519 ### Incremental Reparse
 520
 521 Green trees are cheap to modify, so incremental reparse works by patching a previous tree, without maintaining any additional state.
 522 The reparse is based on a heuristic: we try to contain a change to a single `{}` block, and reparse only this block.
 523 To do this, we maintain the invariant that, even for invalid code, curly braces are always paired correctly.
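
The block-finding step of this heuristic can be sketched on raw text. The `enclosing_block` function below is a hypothetical simplification: it scans bytes rather than a syntax tree, and it assumes braces never appear inside strings or comments. It returns the innermost `{}` block that strictly contains the edited range, which is the region one would reparse.

```rust
use std::ops::Range;

// Finds the innermost `{ ... }` block such that `edit` lies strictly between
// the braces. Returns `None` if there is no such block or if the braces are
// unbalanced.
fn enclosing_block(text: &str, edit: Range<usize>) -> Option<Range<usize>> {
    let mut open_stack: Vec<usize> = Vec::new();
    for (i, byte) in text.bytes().enumerate() {
        match byte {
            b'{' => open_stack.push(i),
            b'}' => {
                let start = open_stack.pop()?;
                let block = start..i + 1;
                // Inner blocks close before outer ones, so the first closed
                // block that contains the edit is the innermost one.
                if block.start < edit.start && edit.end < block.end {
                    return Some(block);
                }
            }
            _ => {}
        }
    }
    None
}
```

An edit that touches a brace itself is not contained in that block, so it falls through to the next enclosing block or to a full reparse.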
 524
 525 In practice, incremental reparsing doesn't actually matter much for IDE use-cases; parsing from scratch seems to be fast enough.
 526
 527 ### Parsing Algorithm
 528
 529 We use a boring hand-crafted recursive descent + pratt combination, with a special effort of continuing the parsing if an error is detected.
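
As a rough illustration of the Pratt part and of continuing after an error, here is a minimal sketch. The token type, the binding powers and the S-expression output are all choices made for this example and are unrelated to the actual parser code. When an operand is missing, the parser records an error, substitutes a placeholder and keeps going.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tok {
    Num(u32),
    Plus,
    Star,
    Eof,
}

// Left and right binding power of infix operators. A higher left power
// binds tighter, and right > left makes the operator left-associative.
fn binding_power(op: Tok) -> Option<(u8, u8)> {
    match op {
        Tok::Plus => Some((1, 2)),
        Tok::Star => Some((3, 4)),
        _ => None,
    }
}

struct Parser {
    tokens: Vec<Tok>,
    pos: usize,
    errors: Vec<String>,
}

impl Parser {
    fn peek(&self) -> Tok {
        self.tokens.get(self.pos).copied().unwrap_or(Tok::Eof)
    }

    fn bump(&mut self) {
        self.pos += 1;
    }

    // Parses an expression whose operators all bind at least as tightly
    // as `min_bp`, and renders it as an S-expression.
    fn expr(&mut self, min_bp: u8) -> String {
        let mut lhs = match self.peek() {
            Tok::Num(n) => {
                self.bump();
                n.to_string()
            }
            _ => {
                // Resilience: report the problem, do not consume the token,
                // and continue with a placeholder operand.
                self.errors.push(format!("expected expression at token {}", self.pos));
                "<error>".to_string()
            }
        };
        loop {
            let op = self.peek();
            let (l_bp, r_bp) = match binding_power(op) {
                Some(bp) => bp,
                None => break,
            };
            if l_bp < min_bp {
                break;
            }
            self.bump();
            let rhs = self.expr(r_bp);
            let sym = if op == Tok::Plus { '+' } else { '*' };
            lhs = format!("({} {} {})", sym, lhs, rhs);
        }
        lhs
    }
}

fn parse(tokens: Vec<Tok>) -> (String, Vec<String>) {
    let mut parser = Parser { tokens, pos: 0, errors: Vec::new() };
    let tree = parser.expr(0);
    (tree, parser.errors)
}
```

For the tokens of `1 + 2 * 3` this gives `(+ 1 (* 2 3))`. For the invalid `1 + * 2` it still produces a tree, `(+ 1 (* <error> 2))`, along with one recorded error.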
 530
 531 ### Parser Recap
 532
 533 Parser itself defines traits for token sequence input and syntax tree output.
 534 It doesn't care about where the tokens come from, or what the resulting syntax tree looks like.