guide.md

   1 # Guide to rust-analyzer
   2
   3 ## About the guide
   4
   5 This guide describes the current start of the rust-analyzer as of 2019-01-20
   6 (commit hash guide-2019-01). Its purpose is to
   7 document various problems and architectural solutions related to the problem of
   8 building IDE-first compiler.
   9
  10 ## The big picture
  11
  12 On the highest possible level, rust analyzer is a stateful component. Client may
  13 apply changes to the analyzer (new contents of `foo.rs` file is "fn main() {}")
  14 and it may ask semantic questions about the current state (what is the
  15 definition of the identifier with offset 92 in file `bar.rs`?). Two important
  16 properties hold:
  17
  18 * Analyzer does not do any IO. It starts in an empty state and all input data is
  19   provided via `apply_change` API.
  20
  21 * Only queries about the current state are supported. One can, of course,
  22   simulate undo and redo by keeping log of changes and inverse-changes.
  23
  24 ## IDE API
  25
  26 To see this big picture, let's take a look at the [`AnalysisHost`] and
  27 [`Analysis`] pair of types. `AnalysisHost` has three methods:
  28
  29 * `default` for creating an empty analysis
  30 * `apply_change(&mut self)` to make changes (this is how you get from an empty
  31   state to something interesting)
  32 * `analysis(&self)` to get an instance of `Analysis`
  33
  34 `Analysis` has a ton of methods for IDEs, like `goto_definition`, or
  35 `completions`. Both inputs and outputs of `Analysis`' methods are formulated in
  36 terms of files and offsets, and **not** in terms of Rust concepts like structs,
  37 traits, etc. The "typed" API with Rust specific types is slightly lower in the
  38 stack, we'll talk about it later.
  39
  40 [`AnalysisHost`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L265-L284
  41 [`Analysis`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L291-L478
  42
  43 The reason for `Analysis` and `AnalysisHost` separation is that we want apply
  44 changes "uniquely", but we might want to fork an `Analysis` and send it to
  45 another thread for background processing. That is, there is only a single
  46 `AnalysisHost`, but there may be several (equivalent) `Analysis`.
  47
  48 Note that all of the `Analysis` API return `Cancelable<T>`. This is required to
  49 be responsive in IDE setting. Sometimes a long-running query is being computed
  50 and the user types something in the editor and asks for completion. In this
  51 case, we cancel the long-running computation (so it returns `Err(Canceled)`),
  52 apply the change and execute request for completion. We never use stale data to
  53 answer requests. Under the cover, `AnalysisHost` "remembers" all outstanding
  54 `Analysis` instances. `AnalysisHost::apply_change` method cancels all
  55 `Analysis`es, blocks until of them are `Dropped` and then applies change
  56 in-place. This is the familiar to rustaceans read-write lock interior
  57 mutability.
  58
  59 Next, lets talk about what are inputs to the Analysis, precisely.
  60
  61 ## Inputs
  62
  63 Rust Analyzer never does any IO itself, all inputs get passed explicitly via
  64 `AnalysisHost::apply_change` method, which accepts a single argument:
  65 `AnalysisChange`. [`AnalysisChange`] is a builder for a single change
  66 "transaction", so it suffices to study its methods to understand all of the
  67 input data.
  68
  69 [`AnalysisChange`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L119-L167
  70
  71 The `(add|change|remove)_file` methods control the set of the input files, where
  72 each file has an integer id (`FileId`, picked by the client), text (`String`)
  73 and a filesystem path. Paths are tricky, they'll be explained in source roots
  74 section, together with `add_root` method. `add_library` method allows to add a
  75 group of files which are assumed to rarely change. It's mostly an optimization
  76 and does not change fundamental picture.
  77
  78 `set_crate_graph` method allows to control how the input files are partitioned
  79 into compilation unites -- crates. It also controls (in theory, not implemented
  80 yet) `cfg` flags. `CrateGraph` is a directed acyclic graph of crates. Each crate
  81 has a root `FileId`, a set of active `cfg` flags and a set of dependencies. Each
  82 dependency is a pair of a crate and a name. It is possible to have two crates
  83 with the same root `FileId` but different `cfg`-flags/dependencies. This model
  84 is lower than Cargo's model of packages: each Cargo package consists of several
  85 targets, each of which is a separate crate (or several crates, if you try
  86 different feature combinations).
  87
  88 Procedural macros should become inputs as well, but currently they are not
  89 supported. Procedural macro will be a black box `Box<dyn Fn(TokenStream) -> TokenStream>`
  90 function, and will be inserted into the crate graph just like dependencies.
  91
  92 Soon we'll talk how we build an LSP server on top of `Analysis`, but first,
  93 let's deal with that paths issue.
  94
  95
  96 ## Source roots (aka filesystems are horrible)
  97
  98 This is a non-essential section, feel free to skip.
  99
 100 The previous section said that the file system path is an attribute of a file,
 101 but this is not a whole truth. Making it an absolute `PathBuf` will be bad for
 102 several reasons. First, file-systems are full of (platform-dependent) edge cases:
 103
 104 * it's hard (requires a syscall) to decide if two paths are equivalent
 105 * some file-systems are case-sensitive
 106 * paths are not necessary UTF-8
 107 * symlinks can form cycles
 108
 109 Second, this might hurt reproducibility and hermeticity of builds. In theory,
 110 moving a project from `/foo/bar/my-project` to `/spam/eggs/my-project` should
 111 not change a bit in the output. However, if absolute path is a part of the
 112 input, it is at least in theory observable, and *could* affect the output.
 113
 114 Yet another problem is that we really-really want to avoid doing IO, but with
 115 Rust the set of "input" files is not necessary known up-front. In theory, you
 116 can have `#[path="/dev/random"] mod foo;`.
 117
 118 To solve (or explicitly refuse to solve) these problems rust analyzer uses the
 119 concept of source root. Roughly speaking, source roots is a contents of a
 120 directory on a file systems, like `/home/matklad/projects/rustraytracer/**.rs`.
 121
 122 More precisely, all files (`FileId`s) are partitioned into disjoint
 123 `SourceRoot`s. Each file has a relative utf-8 path within the `SourceRoot`.
 124 `SourceRoot` has an identity (integer id). Crucially, the root path of the
 125 source root itself is unknown to the analyzer: client is supposed to maintain a
 126 mapping between SourceRoot ids (which are assigned by the client) and actual
 127 `PathBuf`s. `SourceRoot`s give a sane tree model of the file system to the
 128 analyzer.
 129
 130 Note that `mod`, `#[path]` and `include!()` can only reference files from the
 131 same source root. It is of course is possible to explicitly add extra files to
 132 the source root, even `/dev/random`.
 133
 134 ## Language Server Protocol
 135
 136 Now let's see how `Analysis` API is exposed via JSON RPC based LSP protocol. The
 137 hard part here is managing changes (which can come either from the file system
 138 or from the editor) and concurrency (we want to spawn background jobs for things
 139 like syntax highlighting). We use the event loop pattern to manage the zoo, and
 140 the loop is the [`main_loop_inner`] function. The [`main_loop`] does a one-time
 141 initialization and tearing down of the resources.
 142
 143 [`main_loop`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L51-L110
 144 [`main_loop_inner`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L156-L258
 145
 146
 147 Let's walk through a typical analyzer session!
 148
 149 First, we need to figure out what to analyze. To do this, we run `cargo
 150 metadata` to learn about Cargo packages for current workspace and dependencies,
 151 and we run `rustc --print sysroot` and scan sysroot to learn about crates like
 152 `std`. Currently we load this configuration once at the start of the server, but
 153 it should be possible to dynamically reconfigure it later without restart.
 154
 155 [main_loop.rs#L62-L70](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L62-L70)
 156
 157 The [`ProjectModel`] we get after this step is very Cargo and sysroot specific,
 158 it needs to be lowered to get the input in the form of `AnalysisChange`. This
 159 happens in [`ServerWorldState::new`] method. Specifically
 160
 161 * Create a `SourceRoot` for each Cargo package and sysroot.
 162 * Schedule a file system scan of the roots.
 163 * Create an analyzer's `Crate` for each Cargo **target** and sysroot crate.
 164 * Setup dependencies between the crates.
 165
 166 [`ProjectModel`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/project_model.rs#L16-L20
 167 [`ServerWorldState::new`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L38-L160
 168
 169 The results of the scan (which may take a while) will be processed in the body
 170 of the main loop, just like any other change. Here's where we handle
 171
 172 * [File system changes](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L194)
 173 * [Changes from the editor](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L377)
 174
 175 After a single loop's turn, we group them into one `AnalysisChange` and
 176 [apply] it. This always happens on the main thread and blocks the loop.
 177
 178 [apply]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L216
 179
 180 To handle requests, like ["goto definition"], we create an instance of the
 181 `Analysis` and [`schedule`] the task (which consumes `Analysis`) onto
 182 threadpool. [The task] calls the corresponding `Analysis` method, while
 183 massaging the types into the LSP representation. Keep in mind that if we are
 184 executing "goto definition" on the threadpool and a new change comes in, the
 185 task will be canceled as soon as the main loop calls `apply_change` on the
 186 `AnalysisHost`.
 187
 188 ["goto definition"]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L216
 189 [`schedule`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L426-L455
 190 [The task]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop/handlers.rs#L205-L223
 191
 192 This concludes the overview of the analyzer's programing *interface*. Next, lets
 193 dig into the implementation!
 194
 195 ## Salsa
 196
 197 The most straightforward way to implement "apply change, get analysis, repeat"
 198 API would be to maintain the input state and to compute all possible analysis
 199 information from scratch after every change. This works, but scales poorly with
 200 the size of the project. To make this fast, we need to take advantage of the
 201 fact that most of the changes are small, and that analysis results are unlikely
 202 to change significantly between invocations.
 203
 204 To do this we use [salsa]: a framework for incremental on-demand computation.
 205 You can skip the rest of the section if you are familiar with rustc red-green
 206 algorithm.
 207
 208 [salsa]: https://github.com/salsa-rs/salsa
 209
 210 It's better to refer to salsa's docs to learn about it. Here's a small excerpt:
 211
 212 The key idea of salsa is that you define your program as a set of queries. Every
 213 query is used like function K -> V that maps from some key of type K to a value
 214 of type V. Queries come in two basic varieties:
 215
 216 * **Inputs**: the base inputs to your system. You can change these whenever you
 217   like.
 218
 219 * **Functions**: pure functions (no side effects) that transform your inputs
 220   into other values. The results of queries is memoized to avoid recomputing
 221   them a lot. When you make changes to the inputs, we'll figure out (fairly
 222   intelligently) when we can re-use these memoized values and when we have to
 223   recompute them.
 224
 225
 226 For further discussion, its important to understand one bit of "fairly
 227 intelligently". Suppose we have to functions, `f1` and `f2`, and one input, `i`.
 228 We call `f1(X)` which in turn calls `f2(Y)` which inspects `i(Z)`. `i(Z)`
 229 returns some value `V1`, `f2` uses that and returns `R1`, `f1` uses that and
 230 returns `O`. Now, let's change `i` at `Z` to `V2` from `V1` and try to compute
 231 `f1(X)` again. Because `f1(X)` (transitively) depends on `i(Z)`, we can't just
 232 reuse its value as is. However, if `f2(Y)` is *still* equal to `R1` (despite the
 233 `i`'s change), we, in fact, *can* reuse `O` as result of `f1(X)`. And that's how
 234 salsa works: it recomputes results in *reverse* order, starting from inputs and
 235 progressing towards outputs, stopping as soon as it sees an intermediate value
 236 that hasn't changed.
 237
 238 ## Salsa Input Queries
 239
 240 All analyzer information is stored in a salsa database. `Analysis` and
 241 `AnalysisHost` types are newtype wrappers for [`RootDatabase`] -- a salsa
 242 database.
 243
 244 [`RootDatabase`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/db.rs#L88-L134
 245
 246 Salsa input queries are defined in [`FilesDatabase`] (which is a part of
 247 `RootDatabase`). They closely mirror the familiar `AnalysisChange` structure:
 248 indeed, what `apply_change` does is it sets the values of input queries.
 249
 250 [`FilesDatabase`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_db/src/input.rs#L150-L174
 251
 252 ## From text to semantic model
 253
 254 The bulk of the rust-analyzer is transforming input text into semantic model of
 255 Rust code: a web of entities like modules, structs, functions and traits.
 256
 257 An important fact to realize is that (unlike most other languages like C# or
 258 Java) there isn't a one-to-one mapping between source code and semantic model. A
 259 single function definition in the source code might result in several semantic
 260 functions: for example, the same source file might be included as a module into
 261 several crate, or a single "crate" might be present in the compilation DAG
 262 several times, with different sets of `cfg`s enabled.
 263
 264 The semantic interface is declared in [`code_model_api`] module. Each entity is
 265 identified by integer id and has a bunch of methods which take a salsa database
 266 as an argument and returns other entities (which are ids). Internally, this
 267 methods invoke various queries on the database to build the model on demand.
 268 Here's [the list of queries].
 269
 270 [`code_model_api`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/code_model_api.rs
 271 [the list of queries]: https://github.com/rust-analyzer/rust-analyzer/blob/7e84440e25e19529e4ff8a66e521d1b06349c6ec/crates/ra_hir/src/db.rs#L20-L106
 272
 273 The first step of building the model is parsing the source code.
 274
 275 ## Syntax trees
 276
 277 An important property of the Rust language is that each file can be parsed in
 278 isolation. Unlike, say, `C++`, an `include` can't change the meaning of the
 279 syntax. For this reason, Rust analyzer can build a syntax tree for each "source
 280 file", which could then be reused by several semantic models if this file
 281 happens to be a part of several crates.
 282
 283 Rust analyzer uses a similar representation of syntax trees to that of `Roslyn`
 284 and Swift's new [libsyntax]. Swift's docs give an excellent overview of the
 285 approach, so I skip this part here and instead outline the main characteristics
 286 of the syntax trees:
 287
 288 * Syntax trees are fully lossless. Converting **any** text to a syntax tree and
 289   back is a total identity function. All whitespace and comments are explicitly
 290   represented in the tree.
 291
 292 * Syntax nodes have generic `(next|previous)_sibling`, `parent`,
 293   `(first|last)_child` functions. You can get from any one node to any other
 294   node in the file using only these functions.
 295
 296 * Syntax nodes know their range (start offset and length) in the file.
 297
 298 * Syntax nodes share the ownership of their syntax tree: if you keep a reference
 299   to a single function, the whole enclosing file is alive.
 300
 301 * Syntax trees are immutable and the cost of replacing the subtree is
 302   proportional to the depth of the subtree. Read Swift's docs to learn how
 303   immutable + parent pointers + cheap modification is possible.
 304
 305 * Syntax trees are build on best-effort basis. All accessor methods return
 306   `Option`s. The tree for `fn foo` will contain a function declaration with
 307   `None` for parameter list and body.
 308
 309 * Syntax trees do not know the file they are build from, they only know about
 310   the text.
 311
 312 The implementation is based on the generic [rowan] crate on top of which a
 313 [rust-specific] AST is generated.
 314
 315 [libsyntax]: https://github.com/apple/swift/tree/5e2c815edfd758f9b1309ce07bfc01c4bc20ec23/lib/Syntax
 316 [rowan]: https://github.com/rust-analyzer/rowan/tree/100a36dc820eb393b74abe0d20ddf99077b61f88
 317 [rust-specific]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_syntax/src/ast/generated.rs
 318
 319 The next step in constructing the semantic model is ...
 320
 321 ## Building a Module Tree
 322
 323 The algorithm for building a tree of modules is to start with a crate root
 324 (remember, each `Crate` from a `CrateGraph` has a `FileId`), collect all mod
 325 declarations and recursively process child modules. This is handled by the
 326 [`module_tree_query`], with a two slight variations.
 327
 328 [`module_tree_query`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/module_tree.rs#L116-L123
 329
 330 First, rust analyzer builds a module tree for all crates in a source root
 331 simultaneously. The main reason for this is historical (`module_tree` predates
 332 `CrateGraph`), but this approach also allows to account for files which are not
 333 part of any crate. That is, if you create a file but do not include it as a
 334 submodule anywhere, you still get semantic completion, and you get a warning
 335 about free-floating module (the actual warning is not implemented yet).
 336
 337 The second difference is that `module_tree_query` does not *directly* depend on
 338 the "parse" query (which is confusingly called `source_file`). Why would calling
 339 the parse directly be bad? Suppose the user changes the file slightly, by adding
 340 an insignificant whitespace. Adding whitespace changes the parse tree (because
 341 it includes whitespace), and that means recomputing the whole module tree.
 342
 343 We deal with this problem by introducing an intermediate [`submodules_query`].
 344 This query processes the syntax tree an extract a set of declared submodule
 345 names. Now, changing the whitespace results in `submodules_query` being
 346 re-executed for a *single* module, but because the result of this query stays
 347 the same, we don't have to re-execute [`module_tree_query`]. In fact, we only
 348 need to re-execute it when we add/remove new files or when we change mod
 349 declarations.
 350
 351 [`submodules_query`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/module_tree.rs#L41)
 352
 353 We store the resulting modules in a `Vec`-based indexed arena. The indices in
 354 the arena becomes module identifiers.
 355
 356
 357 ## Location Interner pattern
 358
 359 ## Macros and recursive locations
 360
 361 ## Name resolution
 362
 363 ## Source Map pattern
 364
 365 ## Tying it all together: completion