docs/dev/syntax.md

   1 # Syntax in rust-analyzer
   2
   3 ## About the guide
   4
   5 This guide describes the current state of syntax trees and parsing in rust-analyzer as of 2020-01-09 ([link to commit](https://github.com/rust-analyzer/rust-analyzer/tree/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6)).
   6
   7 ## Source Code
   8
   9 The things described are implemented in two places
  10
  11 * [rowan](https://github.com/rust-analyzer/rowan/tree/v0.9.0) -- a generic library for rowan syntax trees.
  12 * [ra_syntax](https://github.com/rust-analyzer/rust-analyzer/tree/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax) crate inside rust-analyzer which wraps `rowan` into rust-analyzer specific API.
  13   Nothing in rust-analyzer except this crate knows about `rowan`.
  14 * [ra_parser](https://github.com/rust-analyzer/rust-analyzer/tree/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_parser) crate parses input tokens into an `ra_syntax` tree
  15
  16 ## Design Goals
  17
  18 * Syntax trees are lossless, or full fidelity. All comments and whitespace are preserved.
  19 * Syntax trees are semantic-less. They describe *strictly* the structure of a sequence of characters, they don't have hygiene, name resolution or type information attached.
  20 * Syntax trees are simple value type. It is possible to create trees for a syntax without any external context.
  21 * Syntax trees have intuitive traversal API (parent, children, siblings, etc).
  22 * Parsing is lossless (even if the input is invalid, the tree produced by the parser represents it exactly).
  23 * Parsing is resilient (even if the input is invalid, parser tries to see as much syntax tree fragments in the input as it can).
  24 * Performance is important, it's OK to use `unsafe` if it means better memory/cpu usage.
  25 * Keep the parser and the syntax tree isolated from each other, such that they can vary independently.
  26
  27 ## Trees
  28
  29 ### Overview
  30
  31 The syntax tree consists of three layers:
  32
  33 * GreenNodes
  34 * SyntaxNodes (aka RedNode)
  35 * AST
  36
  37 Of these, only GreenNodes store the actual data, the other two layers are (non-trivial) views into green tree.
  38 Red-green terminology comes from Roslyn ([link](https://ericlippert.com/2012/06/08/red-green-trees/)) and gives the name to the `rowan` library. Green and syntax nodes are defined in rowan, ast is defined in rust-analyzer.
  39
  40 Syntax trees are a semi-transient data structure.
  41 In general, frontend does not keep syntax trees for all files in memory.
  42 Instead, it *lowers* syntax trees to more compact and rigid representation, which is not full-fidelity, but which can be mapped back to a syntax tree if so desired.
  43
  44
  45 ### GreenNode
  46
  47 GreenNode is a purely-functional tree with arbitrary arity. Conceptually, it is equivalent to the following run of the mill struct:
  48
  49 ```rust
  50 #[derive(PartialEq, Eq, Clone, Copy)]
  51 struct SyntaxKind(u16);
  52
  53 #[derive(PartialEq, Eq, Clone)]
  54 struct Node {
  55     kind: SyntaxKind,
  56     text_len: usize,
  57     children: Vec<Arc<Either<Node, Token>>>,
  58 }
  59
  60 #[derive(PartialEq, Eq, Clone)]
  61 struct Token {
  62     kind: SyntaxKind,
  63     text: String,
  64 }
  65 ```
  66
  67 All the difference between the above sketch and the real implementation are strictly due to optimizations.
  68
  69 Points of note:
  70 * The tree is untyped. Each node has a "type tag", `SyntaxKind`.
  71 * Interior and leaf nodes are distinguished on the type level.
  72 * Trivia and non-trivia tokens are not distinguished on the type level.
  73 * Each token carries its full text.
  74 * The original text can be recovered by concatenating the texts of all tokens in order.
  75 * Accessing a child of particular type (for example, parameter list of a function) generally involves linerary traversing the children, looking for a specific `kind`.
  76 * Modifying the tree is roughly `O(depth)`.
  77   We don't make special efforts to guarantree that the depth is not liner, but, in practice, syntax trees are branchy and shallow.
  78 * If mandatory (grammar wise) node is missing from the input, it's just missing from the tree.
  79 * If an extra erroneous input is present, it is wrapped into a node with `ERROR` kind, and treated just like any other node.
  80 * Parser errors are not a part of syntax tree.
  81
  82 An input like `fn f() { 90 + 2 }` might be parsed as
  83
  84 ```
  85 FN@0..17
  86   FN_KW@0..2 "fn"
  87   WHITESPACE@2..3 " "
  88   NAME@3..4
  89     IDENT@3..4 "f"
  90   PARAM_LIST@4..6
  91     L_PAREN@4..5 "("
  92     R_PAREN@5..6 ")"
  93   WHITESPACE@6..7 " "
  94   BLOCK_EXPR@7..17
  95     BLOCK@7..17
  96       L_CURLY@7..8 "{"
  97       WHITESPACE@8..9 " "
  98       BIN_EXPR@9..15
  99         LITERAL@9..11
 100           INT_NUMBER@9..11 "90"
 101         WHITESPACE@11..12 " "
 102         PLUS@12..13 "+"
 103         WHITESPACE@13..14 " "
 104         LITERAL@14..15
 105           INT_NUMBER@14..15 "2"
 106       WHITESPACE@15..16 " "
 107       R_CURLY@16..17 "}"
 108 ```
 109
 110 #### Optimizations
 111
 112 (significant amount of implementation work here was done by [CAD97](https://github.com/cad97)).
 113
 114 To reduce the amount of allocations, the GreenNode is a DST, which uses a single allocation for header and children. Thus, it is only usable behind a pointer
 115
 116 ```
 117 *-----------+------+----------+------------+--------+--------+-----+--------*
 118 | ref_count | kind | text_len | n_children | child1 | child2 | ... | childn |
 119 *-----------+------+----------+------------+--------+--------+-----+--------*
 120 ```
 121
 122 To more compactly store the children, we box *both* interior nodes and tokens, and represent
 123 `Either<Arc<Node>, Arc<Token>>` as a single pointer with a tag in the last bit.
 124
 125 To avoid allocating EVERY SINGLE TOKEN on the heap, syntax trees use interning.
 126 Because the tree is fully immutable, it's valid to structurally share subtrees.
 127 For example, in `1 + 1`, there will be a *single* token for `1` with ref count 2; the same goes for the ` ` whitespace token.
 128 Interior nodes are shared as well (for example in `(1 + 1) * (1 + 1)`).
 129
 130 Note that, the result of the interning is an `Arc<Node>`.
 131 That is, it's not an index into interning table, so you don't have to have the table around to do anything with the tree.
 132 Each tree is fully self-contained (although different trees might share parts).
 133 Currently, the interner is created per-file, but it will be easy to use a per-thread or per-some-contex one.
 134
 135 We use a `TextSize`, a newtyped `u32`, to store the length of the text.
 136
 137 We currently use `SmolStr`, a small object optimized string to store text.
 138 This was mostly relevant *before* we implemented tree interning, to avoid allocating common keywords and identifiers. We should switch to storing text data alongside the interned tokens.
 139
 140 #### Alternative designs
 141
 142 ##### Dealing with trivia
 143
 144 In the above model, whitespace is not treated specially.
 145 Another alternative (used by swift and roslyn) is to explicitly divide the set of tokens into trivia and non-trivia tokens, and represent non-trivia tokens as
 146
 147 ```rust
 148 struct Token {
 149     kind: NonTriviaTokenKind
 150     text: String,
 151     leading_trivia: Vec<TriviaToken>,
 152     trailing_trivia: Vec<TriviaToken>,
 153 }
 154 ```
 155
 156 The tree then contains only non-trivia tokens.
 157
 158 Another approach (from Dart) is to, in addition to a syntax tree, link all the tokens into a bidirectional link list.
 159 That way, the tree again contains only non-trivia tokens.
 160
 161 Explicit trivia nodes, like in `rowan`, are used by IntelliJ.
 162
 163 ##### Accessing Children
 164
 165 As noted before, accessing a specific child in the node requires a linear traversal of the children (though we can skip tokens, because the tag is encoded in the pointer itself).
 166 It is possible to recover O(1) access with another representation.
 167 We explicitly store optional and missing (required by the grammar, but not present) nodes.
 168 That is, we use `Option<Node>` for children.
 169 We also remove trivia tokens from the tree.
 170 This way, each child kind generally occupies a fixed position in a parent, and we can use index access to fetch it.
 171 The cost is that we now need to allocate space for all not-present optional nodes.
 172 So, `fn foo() {}` will have slots for visibility, unsafeness, attributes, abi and return type.
 173
 174 IntelliJ uses linear traversal.
 175 Roslyn and Swift do `O(1)` access.
 176
 177 ##### Mutable Trees
 178
 179 IntelliJ uses mutable trees.
 180 Overall, it creates a lot of additional complexity.
 181 However, the API for *editing* syntax trees is nice.
 182
 183 For example the assist to move generic bounds to where clause has this code:
 184
 185 ```kotlin
 186  for typeBound in typeBounds {
 187      typeBound.typeParamBounds?.delete()
 188 }
 189 ```
 190
 191 Modeling this with immutable trees is possible, but annoying.
 192
 193 ### Syntax Nodes
 194
 195 A function green tree is not super-convenient to use.
 196 The biggest problem is accessing parents (there are no parent pointers!).
 197 But there are also "identify" issues.
 198 Let's say you want to write a code which builds a list of expressions in a file: `fn collect_exrepssions(file: GreenNode) -> HashSet<GreenNode>`.
 199 For the input like
 200
 201 ```rust
 202 fn main() {
 203     let x = 90i8;
 204     let x = x + 2;
 205     let x = 90i64;
 206     let x = x + 2;
 207 }
 208 ```
 209
 210 both copies of the `x + 2` expression are representing by equal (and, with interning in mind, actually the same) green nodes.
 211 Green trees just can't differentiate between the two.
 212
 213 `SyntaxNode` adds parent pointers and identify semantics to green nodes.
 214 They can be called cursors or [zippers](https://en.wikipedia.org/wiki/Zipper_(data_structure)) (fun fact: zipper is a derivative (as in ′) of a data structure).
 215
 216 Conceptually, a `SyntaxNode` looks like this:
 217
 218 ```rust
 219 type SyntaxNode = Arc<SyntaxData>;
 220
 221 struct SyntaxData {
 222     offset: usize,
 223     parent: Option<SyntaxNode>,
 224     green: Arc<GreeNode>,
 225 }
 226
 227 impl SyntaxNode {
 228     fn new_root(root: Arc<GreenNode>) -> SyntaxNode {
 229         Arc::new(SyntaxData {
 230             offset: 0,
 231             parent: None,
 232             green: root,
 233         })
 234     }
 235     fn parent(&self) -> Option<SyntaxNode> {
 236         self.parent.clone()
 237     }
 238     fn children(&self) -> impl Iterator<Item = SyntaxNode> {
 239         let mut offset = self.offset
 240         self.green.children().map(|green_child| {
 241             let child_offset = offset;
 242             offset += green_child.text_len;
 243             Arc::new(SyntaxData {
 244                 offset: child_offset;
 245                 parent: Some(Arc::clone(self)),
 246                 green: Arc::clone(green_child),
 247             })
 248         })
 249     }
 250 }
 251
 252 impl PartialEq for SyntaxNode {
 253     fn eq(&self, other: &SyntaxNode) {
 254         self.offset == other.offset
 255             && Arc::ptr_eq(&self.green, &other.green)
 256     }
 257 }
 258 ```
 259
 260 Points of note:
 261
 262 * SyntaxNode remembers its parent node (and, transitively, the path to the root of the tree)
 263 * SyntaxNode knows its *absolute* text offset in the whole file
 264 * Equality is based on identity. Comparing nodes from different trees does not make sense.
 265
 266 #### Optimization
 267
 268 The reality is different though :-)
 269 Traversal of trees is a common operation, and it makes sense to optimize it.
 270 In particular, the above code allocates and does atomic operations during a traversal.
 271
 272 To get rid of atomics, `rowan` uses non thread-safe `Rc`.
 273 This is OK because trees traversals mostly (always, in case of rust-analyzer) run on a single thread. If you need to send a `SyntaxNode` to another thread, you can send a pair of **root**`GreenNode` (which is thread safe) and a `Range<usize>`.
 274 The other thread can restore the `SyntaxNode` by traversing from the root green node and looking for a node with specified range.
 275 You can also use the similar trick to store a `SyntaxNode`.
 276 That is, a data structure that holds a `(GreenNode, Range<usize>)` will be `Sync`.
 277 However rust-analyzer goes even further.
 278 It treats trees as semi-transient and instead of storing a `GreenNode`, it generally stores just the id of the file from which the tree originated: `(FileId, Range<usize>)`.
 279 The `SyntaxNode` is the restored by reparsing the file and traversing it from root.
 280 With this trick, rust-analyzer holds only a small amount of trees in memory at the same time, which reduces memory usage.
 281
 282 Additionally, only the root `SyntaxNode` owns an `Arc` to the (root) `GreenNode`.
 283 All other `SyntaxNode`s point to corresponding `GreenNode`s with a raw pointer.
 284 They also point to the parent (and, consequently, to the root) with an owning `Rc`, so this is sound.
 285 In other words, one needs *one* arc bump when initiating a traversal.
 286
 287 To get rid of allocations, `rowan` takes advantage of `SyntaxNode: !Sync` and uses a thread-local free list of `SyntaxNode`s.
 288 In a typical traversal, you only directly hold a few `SyntaxNode`s at a time (and their ancestors indirectly), so a free list proportional to the depth of the tree removes all allocations in a typical case.
 289
 290 So, while traversal is not exactly incrementing a pointer, it's still pretty cheap: TLS + rc bump!
 291
 292 Traversal also yields (cheap) owned nodes, which improves ergonomics quite a bit.
 293
 294 #### Alternative Designs
 295
 296 ##### Memoized RedNodes
 297
 298 C# and Swift follow the design where the red nodes are memoized, which would look roughly like this in Rust:
 299
 300 ```rust
 301 type SyntaxNode = Arc<SyntaxData>;
 302
 303 struct SyntaxData {
 304     offset: usize,
 305     parent: Option<SyntaxNode>,
 306     green: Arc<GreeNode>,
 307     children: Vec<OnceCell<SyntaxNode>>,
 308 }
 309 ```
 310
 311 This allows using true pointer equality for comparison of identities of `SyntaxNodes`.
 312 rust-analyzer used to have this design as well, but we've since switched to cursors.
 313 The main problem with memoizing the red nodes is that it more than doubles the memory requirements for fully realized syntax trees.
 314 In contrast, cursors generally retain only a path to the root.
 315 C# combats increased memory usage by using weak references.
 316
 317 ### AST
 318
 319 `GreenTree`s are untyped and homogeneous, because it makes accommodating error nodes, arbitrary whitespace and comments natural, and because it makes possible to write generic tree traversals.
 320 However, when working with a specific node, like a function definition, one would want a strongly typed API.
 321
 322 This is what is provided by the AST layer. AST nodes are transparent wrappers over untyped syntax nodes:
 323
 324 ```rust
 325 pub trait AstNode {
 326     fn cast(syntax: SyntaxNode) -> Option<Self>
 327     where
 328         Self: Sized;
 329
 330     fn syntax(&self) -> &SyntaxNode;
 331 }
 332 ```
 333
 334 Concrete nodes are generated (there are 117 of them), and look roughly like this:
 335
 336 ```rust
 337 #[derive(Debug, Clone, PartialEq, Eq, Hash)]
 338 pub struct FnDef {
 339     syntax: SyntaxNode,
 340 }
 341
 342 impl AstNode for FnDef {
 343     fn cast(syntax: SyntaxNode) -> Option<Self> {
 344         match kind {
 345             FN => Some(FnDef { syntax }),
 346             _ => None,
 347         }
 348     }
 349     fn syntax(&self) -> &SyntaxNode {
 350         &self.syntax
 351     }
 352 }
 353
 354 impl FnDef {
 355     pub fn param_list(&self) -> Option<ParamList> {
 356         self.syntax.children().find_map(ParamList::cast)
 357     }
 358     pub fn ret_type(&self) -> Option<RetType> {
 359         self.syntax.children().find_map(RetType::cast)
 360     }
 361     pub fn body(&self) -> Option<BlockExpr> {
 362         self.syntax.children().find_map(BlockExpr::cast)
 363     }
 364     // ...
 365 }
 366 ```
 367
 368 Variants like expressions, patterns or items are modeled with `enum`s, which also implement `AstNode`:
 369
 370 ```rust
 371 #[derive(Debug, Clone, PartialEq, Eq, Hash)]
 372 pub enum AssocItem {
 373     FnDef(FnDef),
 374     TypeAliasDef(TypeAliasDef),
 375     ConstDef(ConstDef),
 376 }
 377
 378 impl AstNode for AssocItem {
 379     ...
 380 }
 381 ```
 382
 383 Shared AST substructures are modeled via (object safe) traits:
 384
 385 ```rust
 386 trait HasVisibility: AstNode {
 387     fn visibility(&self) -> Option<Visibility>;
 388 }
 389
 390 impl HasVisbility for FnDef {
 391     fn visibility(&self) -> Option<Visibility> {
 392         self.syntax.children().find_map(Visibility::cast)
 393     }
 394 }
 395 ```
 396
 397 Points of note:
 398
 399 * Like `SyntaxNode`s, AST nodes are cheap to clone pointer-sized owned values.
 400 * All "fields" are optional, to accommodate incomplete and/or erroneous source code.
 401 * It's always possible to go from an ast node to an untyped `SyntaxNode`.
 402 * It's possible to go in the opposite direction with a checked cast.
 403 * `enum`s allow modeling of arbitrary intersecting subsets of AST types.
 404 * Most of rust-analyzer works with the ast layer, with notable exceptions of:
 405   * macro expansion, which needs access to raw tokens and works with `SyntaxNode`s
 406   * some IDE-specific features like syntax highlighting are more conveniently implemented over a homogeneous `SyntaxNode` tree
 407
 408 #### Alternative Designs
 409
 410 ##### Semantic Full AST
 411
 412 In IntelliJ the AST layer (dubbed **P**rogram **S**tructure **I**nterface) can have semantics attached, and is usually backed by either syntax tree, indices, or metadata from compiled libraries.
 413 The backend for PSI can change dynamically.
 414
 415 ### Syntax Tree Recap
 416
 417 At its core, the syntax tree is a purely functional n-ary tree, which stores text at the leaf nodes and node "kinds" at all nodes.
 418 A cursor layer is added on top, which gives owned, cheap to clone nodes with identity semantics, parent links and absolute offsets.
 419 An AST layer is added on top, which reifies each node `Kind` as a separate Rust type with the corresponding API.
 420
 421 ## Parsing
 422
 423 The (green) tree is constructed by a DFS "traversal" of the desired tree structure:
 424
 425 ```rust
 426 pub struct GreenNodeBuilder { ... }
 427
 428 impl GreenNodeBuilder {
 429     pub fn new() -> GreenNodeBuilder { ... }
 430
 431     pub fn token(&mut self, kind: SyntaxKind, text: &str) { ... }
 432
 433     pub fn start_node(&mut self, kind: SyntaxKind) { ... }
 434     pub fn finish_node(&mut self) { ... }
 435
 436     pub fn finish(self) -> GreenNode { ... }
 437 }
 438 ```
 439
 440 The parser, ultimately, needs to invoke the `GreenNodeBuilder`.
 441 There are two principal sources of inputs for the parser:
 442   * source text, which contains trivia tokens (whitespace and comments)
 443   * token trees from macros, which lack trivia
 444
 445 Additionally, input tokens do not correspond 1-to-1 with output tokens.
 446 For example, two consecutive `>` tokens might be glued, by the parser, into a single `>>`.
 447
 448 For these reasons, the parser crate defines a callback interfaces for both input tokens and output trees.
 449 The explicit glue layer then bridges various gaps.
 450
 451 The parser interface looks like this:
 452
 453 ```rust
 454 pub struct Token {
 455     pub kind: SyntaxKind,
 456     pub is_joined_to_next: bool,
 457 }
 458
 459 pub trait TokenSource {
 460     fn current(&self) -> Token;
 461     fn lookahead_nth(&self, n: usize) -> Token;
 462     fn is_keyword(&self, kw: &str) -> bool;
 463
 464     fn bump(&mut self);
 465 }
 466
 467 pub trait TreeSink {
 468     fn token(&mut self, kind: SyntaxKind, n_tokens: u8);
 469
 470     fn start_node(&mut self, kind: SyntaxKind);
 471     fn finish_node(&mut self);
 472
 473     fn error(&mut self, error: ParseError);
 474 }
 475
 476 pub fn parse(
 477     token_source: &mut dyn TokenSource,
 478     tree_sink: &mut dyn TreeSink,
 479 ) { ... }
 480 ```
 481
 482 Points of note:
 483
 484 * The parser and the syntax tree are independent, they live in different crates neither of which depends on the other.
 485 * The parser doesn't know anything about textual contents of the tokens, with an isolated hack for checking contextual keywords.
 486 * For gluing tokens, the `TreeSink::token` might advance further than one atomic token ahead.
 487
 488 ### Reporting Syntax Errors
 489
 490 Syntax errors are not stored directly in the tree.
 491 The primary motivation for this is that syntax tree is not necessary produced by the parser, it may also be assembled manually from pieces (which happens all the time in refactorings).
 492 Instead, parser reports errors to an error sink, which stores them in a `Vec`.
 493 If possible, errors are not reported during parsing and are postponed for a separate validation step.
 494 For example, parser accepts visibility modifiers on trait methods, but then a separate tree traversal flags all such visibilities as erroneous.
 495
 496 ### Macros
 497
 498 The primary difficulty with macros is that individual tokens have identities, which need to be preserved in the syntax tree for hygiene purposes.
 499 This is handled by the `TreeSink` layer.
 500 Specifically, `TreeSink` constructs the tree in lockstep with draining the original token stream.
 501 In the process, it records which tokens of the tree correspond to which tokens of the input, by using text ranges to identify syntax tokens.
 502 The end result is that parsing an expanded code yields a syntax tree and a mapping of text-ranges of the tree to original tokens.
 503
 504 To deal with precedence in cases like `$expr * 1`, we use special invisible parenthesis, which are explicitly handled by the parser
 505
 506 ### Whitespace & Comments
 507
 508 Parser does not see whitespace nodes.
 509 Instead, they are attached to the tree in the `TreeSink` layer.
 510
 511 For example, in
 512
 513 ```rust
 514 // non doc comment
 515 fn foo() {}
 516 ```
 517
 518 the comment will be (heuristically) made a child of function node.
 519
 520 ### Incremental Reparse
 521
 522 Green trees are cheap to modify, so incremental reparse works by patching a previous tree, without maintaining any additional state.
 523 The reparse is based on heuristic: we try to contain a change to a single `{}` block, and reparse only this block.
 524 To do this, we maintain the invariant that, even for invalid code, curly braces are always paired correctly.
 525
 526 In practice, incremental reparsing doesn't actually matter much for IDE use-cases, parsing from scratch seems to be fast enough.
 527
 528 ### Parsing Algorithm
 529
 530 We use a boring hand-crafted recursive descent + pratt combination, with a special effort of continuting the parsing if an error is detected.
 531
 532 ### Parser Recap
 533
 534 Parser itself defines traits for token sequence input and syntax tree output.
 535 It doesn't care about where the tokens come from, and how the resulting syntax tree looks like.