docs/RFC.md

   1 - Feature Name: libsyntax2.0
   2 - Start Date: 2017-12-30
   3 - RFC PR: (leave this empty)
   4 - Rust Issue: (leave this empty)
   5
   6
   7 >I think the lack of reusability comes in object-oriented languages,
   8 >not functional languages. Because the problem with object-oriented
   9 >languages is they’ve got all this implicit environment that they
  10 >carry around with them. You wanted a banana but what you got was a
  11 >gorilla holding the banana and the entire jungle.
  12 >
  13 >If you have referentially transparent code, if you have pure
  14 >functions — all the data comes in its input arguments and everything
  15 >goes out and leave no state behind — it’s incredibly reusable.
  16 >
  17 > **Joe Armstrong**
  18
  19 # Summary
  20 [summary]: #summary
  21
  22 The long-term plan is to rewrite libsyntax parser and syntax tree data
  23 structure to create a software component independent of the rest of
  24 rustc compiler and suitable for the needs of IDEs and code
  25 editors. This RFCs is the first step of this plan, whose goal is to
  26 find out if this is possible at least in theory. If it is possible,
  27 the next steps would be a prototype implementation as a crates.io
  28 crate and a separate RFC for integrating the prototype with rustc,
  29 other tools, and eventual libsyntax removal.
  30
  31 Note that this RFC does not propose to stabilize any API for working
  32 with rust syntax: the semver version of the hypothetical library would
  33 be `0.1.0`. It is intended to be used by tools, which are currently
  34 closely related to the compiler: `rustc`, `rustfmt`, `clippy`, `rls`
  35 and hypothetical `rustfix`. While it would be possible to create
  36 third-party tools on top of the new libsyntax, the burden of adopting
  37 to breaking changes would be on authors of such tools.
  38
  39
  40 # Motivation
  41 [motivation]: #motivation
  42
  43 There are two main drawbacks with the current version of libsyntax:
  44
  45 * It is tightly integrated with the compiler and hard to use
  46   independently
  47
  48 * The AST representation is not well-suited for use inside IDEs
  49
  50
  51 ## IDE support
  52
  53 There are several differences in how IDEs and compilers typically
  54 treat source code.
  55
  56 In the compiler, it is convenient to transform the source
  57 code into Abstract Syntax Tree form, which is independent of the
  58 surface syntax. For example, it's convenient to discard comments,
  59 whitespaces and desugar some syntactic constructs in terms of the
  60 simpler ones.
  61
  62 In contrast, IDEs work much closer to the source code, so it is
  63 crucial to preserve full information about the original text. For
  64 example, IDE may adjust indentation after typing a `}` which closes a
  65 block, and to do this correctly, IDE must be aware of syntax (that is,
  66 that `}` indeed closes some block, and is not a syntax error) and of
  67 all whitespaces and comments. So, IDE suitable AST should explicitly
  68 account for syntactic elements, not considered important by the
  69 compiler.
  70
  71 Another difference is that IDEs typically work with incomplete and
  72 syntactically invalid code. This boils down to two parser properties.
  73 First, the parser must produce syntax tree even if some required input
  74 is missing. For example, for input `fn foo` the function node should
  75 be present in the parse, despite the fact that there is no parameters
  76 or body. Second, the parser must be able to skip over parts of input
  77 it can't recognize and aggressively recover from errors. That is, the
  78 syntax tree data structure should be able to handle both missing and
  79 extra nodes.
  80
  81 IDEs also need the ability to incrementally reparse and relex source
  82 code after the user types. A smart IDE would use syntax tree structure
  83 to handle editing commands (for example, to add/remove trailing commas
  84 after join/split lines actions), so parsing time can be very
  85 noticeable.
  86
  87
  88 Currently rustc uses the classical AST approach, and preserves some of
  89 the source code information in the form of spans in the AST. It is not
  90 clear if this structure can full fill all IDE requirements.
  91
  92
  93 ## Reusability
  94
  95 In theory, the parser can be a pure function, which takes a `&str` as
  96 an input, and produces a `ParseTree` as an output.
  97
  98 This is great for reusability: for example, you can compile this
  99 function to WASM and use it for fast client-side validation of syntax
 100 on the rust playground, or you can develop tools like `rustfmt` on
 101 stable Rust outside of rustc repository, or you can embed the parser
 102 into your favorite IDE or code editor.
 103
 104 This is also great for correctness: with such simple interface, it's
 105 possible to write property-based tests to thoroughly compare two
 106 different implementations of the parser. It's also straightforward to
 107 create a comprehensive test suite, because all the inputs and outputs
 108 are trivially serializable to human-readable text.
 109
 110 Another benefit is performance: with this signature, you can cache a
 111 parse tree for each file, with trivial strategy for cache invalidation
 112 (invalidate an entry when the underling file changes). On top of such
 113 a cache it is possible to build a smart code indexer which maintains
 114 the set of symbols in the project, watches files for changes and
 115 automatically reindexes only changed files.
 116
 117 Unfortunately, the current libsyntax is far from this ideal. For
 118 example, even the lexer makes use of the `FileMap` which is
 119 essentially a global state of the compiler which represents all know
 120 files. As a data point, it turned out to be easier to move `rustfmt`
 121 into the main `rustc` repository than to move libsyntax outside!
 122
 123
 124 # Guide-level explanation
 125 [guide-level-explanation]: #guide-level-explanation
 126
 127 Not applicable.
 128
 129
 130 # Reference-level explanation
 131 [reference-level-explanation]: #reference-level-explanation
 132
 133 It is not clear if a single parser can accommodate the needs of the
 134 compiler and the IDE, but there is hope that it is possible. The RFC
 135 proposes to develop libsynax2.0 as an experimental crates.io crate. If
 136 the experiment turns out to be a success, the second RFC will propose
 137 to integrate it with all existing tools and `rustc`.
 138
 139 Next, a syntax tree data structure is proposed for libsyntax2.0. It
 140 seems to have the following important properties:
 141
 142 * It is lossless and faithfully represents the original source code,
 143   including explicit nodes for comments and whitespace.
 144
 145 * It is flexible and allows to encode arbitrary node structure,
 146   even for invalid syntax.
 147
 148 * It is minimal: it stores small amount of data and has no
 149   dependencies. For instance, it does not need compiler's string
 150   interner or literal data representation.
 151
 152 * While the tree itself is minimal, it is extensible in a sense that
 153   it possible to associate arbitrary data with certain nodes in a
 154   type-safe way.
 155
 156
 157 It is not clear if this representation is the best one. It is heavily
 158 inspired by [PSI] data structure which used in [IntelliJ] based IDEs
 159 and in the [Kotlin] compiler.
 160
 161 [PSI]: http://www.jetbrains.org/intellij/sdk/docs/reference_guide/custom_language_support/implementing_parser_and_psi.html
 162 [IntelliJ]: https://github.com/JetBrains/intellij-community/
 163 [Kotlin]: https://kotlinlang.org/
 164
 165
 166 ## Untyped Tree
 167
 168 The main idea is to store the minimal amount of information in the
 169 tree itself, and instead lean heavily on the source code for the
 170 actual data about identifier names, constant values etc.
 171
 172 All nodes in the tree are of the same type and store a constant for
 173 the syntactic category of the element and a range in the source code.
 174
 175 Here is a minimal implementation of this data structure with some Rust
 176 syntactic categories
 177
 178
 179 ```rust
 180 #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
 181 pub struct NodeKind(u16);
 182
 183 pub struct File {
 184         text: String,
 185         nodes: Vec<NodeData>,
 186 }
 187
 188 struct NodeData {
 189         kind: NodeKind,
 190         range: (u32, u32),
 191         parent: Option<u32>,
 192         first_child: Option<u32>,
 193         next_sibling: Option<u32>,
 194 }
 195
 196 #[derive(Clone, Copy)]
 197 pub struct Node<'f> {
 198         file: &'f File,
 199         idx: u32,
 200 }
 201
 202 pub struct Children<'f> {
 203         next: Option<Node<'f>>,
 204 }
 205
 206 impl File {
 207         pub fn root<'f>(&'f self) -> Node<'f> {
 208                 assert!(!self.nodes.is_empty());
 209                 Node { file: self, idx: 0 }
 210         }
 211 }
 212
 213 impl<'f> Node<'f> {
 214         pub fn kind(&self) -> NodeKind {
 215                 self.data().kind
 216         }
 217
 218         pub fn text(&self) -> &'f str {
 219                 let (start, end) = self.data().range;
 220                 &self.file.text[start as usize..end as usize]
 221         }
 222
 223         pub fn parent(&self) -> Option<Node<'f>> {
 224                 self.as_node(self.data().parent)
 225         }
 226
 227         pub fn children(&self) -> Children<'f> {
 228                 Children { next: self.as_node(self.data().first_child) }
 229         }
 230
 231         fn data(&self) -> &'f NodeData {
 232                 &self.file.nodes[self.idx as usize]
 233         }
 234
 235         fn as_node(&self, idx: Option<u32>) -> Option<Node<'f>> {
 236                 idx.map(|idx| Node { file: self.file, idx })
 237         }
 238 }
 239
 240 impl<'f> Iterator for Children<'f> {
 241         type Item = Node<'f>;
 242
 243         fn next(&mut self) -> Option<Node<'f>> {
 244                 let next = self.next;
 245                 self.next = next.and_then(|node| node.as_node(node.data().next_sibling));
 246                 next
 247         }
 248 }
 249
 250 pub const ERROR: NodeKind = NodeKind(0);
 251 pub const WHITESPACE: NodeKind = NodeKind(1);
 252 pub const STRUCT_KW: NodeKind = NodeKind(2);
 253 pub const IDENT: NodeKind = NodeKind(3);
 254 pub const L_CURLY: NodeKind = NodeKind(4);
 255 pub const R_CURLY: NodeKind = NodeKind(5);
 256 pub const COLON: NodeKind = NodeKind(6);
 257 pub const COMMA: NodeKind = NodeKind(7);
 258 pub const AMP: NodeKind = NodeKind(8);
 259 pub const LINE_COMMENT: NodeKind = NodeKind(9);
 260 pub const FILE: NodeKind = NodeKind(10);
 261 pub const STRUCT_DEF: NodeKind = NodeKind(11);
 262 pub const FIELD_DEF: NodeKind = NodeKind(12);
 263 pub const TYPE_REF: NodeKind = NodeKind(13);
 264 ```
 265
 266 Here is a rust snippet and the corresponding parse tree:
 267
 268 ```rust
 269 struct Foo {
 270         field1: u32,
 271         &
 272         // non-doc comment
 273         field2:
 274 }
 275 ```
 276
 277
 278 ```
 279 FILE
 280   STRUCT_DEF
 281     STRUCT_KW
 282     WHITESPACE
 283     IDENT
 284     WHITESPACE
 285     L_CURLY
 286     WHITESPACE
 287     FIELD_DEF
 288       IDENT
 289       COLON
 290       WHITESPACE
 291       TYPE_REF
 292         IDENT
 293     COMMA
 294     WHITESPACE
 295     ERROR
 296       AMP
 297     WHITESPACE
 298     FIELD_DEF
 299       LINE_COMMENT
 300       WHITESPACE
 301       IDENT
 302       COLON
 303       ERROR
 304     WHITESPACE
 305     R_CURLY
 306 ```
 307
 308 Note several features of the tree:
 309
 310 * All whitespace and comments are explicitly accounted for.
 311
 312 * The node for `STRUCT_DEF` contains the error element for `&`, but
 313   still represents the following field correctly.
 314
 315 * The second field of the struct is incomplete: `FIELD_DEF` node for
 316   it contains an `ERROR` element, but nevertheless has the correct
 317   `NodeKind`.
 318
 319 * The non-documenting comment is correctly attached to the following
 320   field.
 321
 322
 323 ## Typed Tree
 324
 325 It's hard to work with this raw parse tree, because it is untyped:
 326 node containing a struct definition has the same API as the node for
 327 the struct field. But it's possible to add a strongly typed layer on
 328 top of this raw tree, and get a zero-cost AST. Here is an example
 329 which adds type-safe wrappers for structs and fields:
 330
 331 ```rust
 332 // generic infrastructure
 333
 334 pub trait AstNode<'f>: Copy + 'f {
 335         fn new(node: Node<'f>) -> Option<Self>;
 336         fn node(&self) -> Node<'f>;
 337 }
 338
 339 pub fn child_of_kind<'f>(node: Node<'f>, kind: NodeKind) -> Option<Node<'f>> {
 340         node.children().find(|child| child.kind() == kind)
 341 }
 342
 343 pub fn ast_children<'f, A: AstNode<'f>>(node: Node<'f>) -> Box<Iterator<Item=A> + 'f> {
 344         Box::new(node.children().filter_map(A::new))
 345 }
 346
 347 // AST elements, specific to Rust
 348
 349 #[derive(Clone, Copy)]
 350 pub struct StructDef<'f>(Node<'f>);
 351
 352 #[derive(Clone, Copy)]
 353 pub struct FieldDef<'f>(Node<'f>);
 354
 355 #[derive(Clone, Copy)]
 356 pub struct TypeRef<'f>(Node<'f>);
 357
 358 pub trait NameOwner<'f>: AstNode<'f> {
 359         fn name_ident(&self) -> Node<'f> {
 360                 child_of_kind(self.node(), IDENT).unwrap()
 361         }
 362
 363         fn name(&self) -> &'f str { self.name_ident().text() }
 364 }
 365
 366
 367 impl<'f> AstNode<'f> for StructDef<'f> {
 368         fn new(node: Node<'f>) -> Option<Self> {
 369                 if node.kind() == STRUCT_DEF { Some(StructDef(node)) } else { None }
 370         }
 371         fn node(&self) -> Node<'f> { self.0 }
 372 }
 373
 374 impl<'f> NameOwner<'f> for StructDef<'f> {}
 375
 376 impl<'f> StructDef<'f> {
 377         pub fn fields(&self) -> Box<Iterator<Item=FieldDef<'f>> + 'f> {
 378                 ast_children(self.node())
 379         }
 380 }
 381
 382
 383 impl<'f> AstNode<'f> for FieldDef<'f> {
 384         fn new(node: Node<'f>) -> Option<Self> {
 385                 if node.kind() == FIELD_DEF { Some(FieldDef(node)) } else { None }
 386         }
 387         fn node(&self) -> Node<'f> { self.0 }
 388 }
 389
 390 impl<'f> FieldDef<'f> {
 391         pub fn type_ref(&self) -> Option<TypeRef<'f>> {
 392                 ast_children(self.node()).next()
 393         }
 394 }
 395
 396 impl<'f> NameOwner<'f> for FieldDef<'f> {}
 397
 398
 399 impl<'f> AstNode<'f> for TypeRef<'f> {
 400         fn new(node: Node<'f>) -> Option<Self> {
 401                 if node.kind() == TYPE_REF { Some(TypeRef(node)) } else { None }
 402         }
 403         fn node(&self) -> Node<'f> { self.0 }
 404 }
 405 ```
 406
 407 Note that although AST wrappers provide a type-safe access to the
 408 tree, they are still represented as indexes, so clients of the syntax
 409 tree can easily associated additional data with AST nodes by storing
 410 it in a side-table.
 411
 412
 413 ## Missing Source Code
 414
 415 The crucial feature of this syntax tree is that it is just a view into
 416 the original source code. And this poses a problem for the Rust
 417 language, because not all compiled Rust code is represented in the
 418 form of source code! Specifically, Rust has a powerful macro system,
 419 which effectively allows to create and parse additional source code at
 420 compile time. It is not entirely clear that the proposed parsing
 421 framework is able to handle this use case, and it's the main purpose
 422 of this RFC to figure it out. The current idea for handling macros is
 423 to make each macro expansion produce a triple of (expansion text,
 424 syntax tree, hygiene information), where hygiene information is a side
 425 table, which colors different ranges of the expansion text according
 426 to the original syntactic context.
 427
 428
 429 ## Implementation plan
 430
 431 This RFC proposes huge changes to the internals of the compiler, so
 432 it's important to proceed carefully and incrementally. The following
 433 plan is suggested:
 434
 435 * RFC discussion about the theoretical feasibility of the proposal,
 436   and the best representation representation for the syntax tree.
 437
 438 * Implementation of the proposal as a completely separate crates.io
 439   crate, by refactoring existing libsyntax source code to produce a
 440   new tree.
 441
 442 * A prototype implementation of the macro expansion on top of the new
 443   sytnax tree.
 444
 445 * Additional round of discussion/RFC about merging with the mainline
 446   compiler.
 447
 448
 449 # Drawbacks
 450 [drawbacks]: #drawbacks
 451
 452 - No harm will be done as long as the new libsyntax exists as an
 453   experiemt on crates.io. However, actually using it in the compiler
 454   and other tools would require massive refactorings.
 455
 456 - It's difficult to know upfront if the proposed syntax tree would
 457   actually work well in both the compiler and IDE. It may be possible
 458   that some drawbacks will be discovered during implementation.
 459
 460
 461 # Rationale and alternatives
 462 [alternatives]: #alternatives
 463
 464 - Incrementally add more information about source code to the current
 465   AST.
 466
 467 - Move the current libsyntax to crates.io as is. In the past, there
 468   were several failed attempts to do that.
 469
 470 - Explore alternative representations for the parse tree.
 471
 472 - Use parser generator instead of hand written parser. Using the
 473   parser from libsyntax directly would be easier, and hand-written
 474   LL-style parsers usually have much better error recovery than
 475   generated LR-style ones.
 476
 477 # Unresolved questions
 478 [unresolved]: #unresolved-questions
 479
 480 - Is it at all possible to represent Rust parser as a pure function of
 481   the source code? It seems like the answer is yes, because the
 482   language and especially macros were cleverly designed with this
 483   use-case in mind.
 484
 485
 486 - Is it possible to implement macro expansion using the proposed
 487   framework? This is the main question of this RFC. The proposed
 488   solution of synthesizing source code on the fly seems workable: it's
 489   not that different from the current implementation, which
 490   synthesizes token trees.
 491
 492
 493 - How to actually phase out current libsyntax, if libsyntax2.0 turns
 494   out to be a success?