src/doc/guide-strings.md

% The Guide to Rust Strings

Strings are an important concept to master in any programming language. If you
come from a managed language background, you may be surprised at the complexity
of string handling in a systems programming language. Efficient access and
allocation of memory for a dynamically sized structure involves a lot of
details. Luckily, Rust has lots of tools to help us here.

A **string** is a sequence of Unicode scalar values encoded as a stream of
UTF-8 bytes. All strings are guaranteed to be validly encoded UTF-8 sequences.
Additionally, strings are not null-terminated and can contain null bytes.
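
To make these guarantees concrete, here is a minimal sketch, assuming a current stable Rust toolchain. The helper name `byte_and_char_counts` is made up for illustration. It shows that a string's length is counted in UTF-8 bytes rather than in characters, and that an embedded null byte is ordinary data.

```rust
// Hypothetical helper: returns (length in bytes, number of Unicode scalar values).
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // U+00E9 is a single scalar value that takes two bytes in UTF-8.
    let (bytes, chars) = byte_and_char_counts("caf\u{e9}");
    println!("{} bytes, {} scalar values", bytes, chars);

    // A null byte does not terminate the string; it is counted like any other byte.
    let with_nul = "a\0b";
    println!("{} bytes", with_nul.len());
}
```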

Rust has two main types of strings: `&str` and `String`.

# &str

The first kind is a `&str`. This is called a 'string slice'.
String literals are of the type `&str`:

```{rust}
let string = "Hello there.";
```

Like any reference in Rust, a string slice has an associated lifetime. A string
literal is a `&'static str`. A string slice can be written without an explicit
lifetime in many cases, such as in function arguments. In these cases the
lifetime will be inferred:

```{rust}
fn takes_slice(slice: &str) {
    println!("Got: {}", slice);
}
```

Like vector slices, string slices are simply a pointer plus a length. This
means that they're a 'view' into an already-allocated string, such as a
`&'static str` or a `String`.
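
The 'pointer plus a length' description can be checked directly. The following is a sketch under the assumption of a current stable toolchain, and `first_bytes` is a hypothetical helper, not part of the standard library.

```rust
use std::mem::size_of;

// Hypothetical helper: borrows the first `n` bytes of `s` as a new slice.
// Panics if `n` is out of range or falls inside a multi-byte character.
fn first_bytes(s: &str, n: usize) -> &str {
    &s[..n]
}

fn main() {
    let owned = String::from("Hello there.");
    let hello = first_bytes(&owned, 5);
    println!("{}", hello);

    // The slice points into `owned`'s buffer; no bytes were copied.
    assert_eq!(hello.as_ptr(), owned.as_ptr());

    // A `&str` occupies two machine words: a pointer and a length.
    assert_eq!(size_of::<&str>(), 2 * size_of::<usize>());
}
```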

# String

A `String` is a heap-allocated string. This string is growable, and is also
guaranteed to be UTF-8.

```{rust}
let mut s = "Hello".to_string();
println!("{}", s);

s.push_str(", world.");
println!("{}", s);
```

You can get a `&str` view of a `String` with the `as_slice()` method:

```{rust}
fn takes_slice(slice: &str) {
    println!("Got: {}", slice);
}

fn main() {
    let s = "Hello".to_string();
    takes_slice(s.as_slice());
}
```
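
Toolchains differ in how this conversion is spelled. As a sketch, assuming a current stable toolchain (where `as_slice()` is not available on `String`), the same call can be written with `as_str()` or by borrowing the `String` directly. The function `describe` is a hypothetical stand-in for `takes_slice` that returns its text so the result can be inspected.

```rust
// Hypothetical stand-in for `takes_slice`: formats instead of printing.
fn describe(slice: &str) -> String {
    format!("Got: {}", slice)
}

fn main() {
    let s = String::from("Hello");

    // Explicit conversion to a string slice.
    println!("{}", describe(s.as_str()));

    // Borrowing a `String` where a `&str` is expected also works,
    // because `&String` is automatically dereferenced to `&str`.
    println!("{}", describe(&s));
}
```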

You can also get a `&str` from a stack-allocated array of bytes. The bytes are
checked for valid UTF-8, so the conversion can fail, which is why the result
has to be unwrapped:

```{rust}
use std::str;

let x: &[u8] = &[b'a', b'b'];
let stack_str: &str = str::from_utf8(x).unwrap();
```
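
Calling `unwrap()` aborts the program if the bytes are not valid UTF-8. A minimal sketch of handling that case instead, assuming a current stable toolchain where `str::from_utf8` returns a `Result`; the helper `text_or_placeholder` and its fallback text are made up for illustration.

```rust
use std::str;

// Hypothetical helper: returns the text if `bytes` is valid UTF-8,
// or a fixed placeholder otherwise.
fn text_or_placeholder(bytes: &[u8]) -> &str {
    match str::from_utf8(bytes) {
        Ok(text) => text,
        Err(_) => "<invalid UTF-8>",
    }
}

fn main() {
    // Valid input: the bytes are reinterpreted as a string slice without copying.
    println!("{}", text_or_placeholder(&[b'a', b'b']));

    // The byte 0xFF never appears in well-formed UTF-8, so this falls back.
    println!("{}", text_or_placeholder(&[0xff, 0xfe]));
}
```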

# Best Practices

## `String` vs. `&str`

In general, you should prefer `String` when you need ownership, and `&str` when
you just need to borrow a string. This is very similar to using `Vec<T>` vs. `&[T]`,
and `T` vs. `&T` in general.

This means starting off with this:

```{rust,ignore}
fn foo(s: &str) {
```

and only moving to this:

```{rust,ignore}
fn foo(s: String) {
```

if you have a good reason. It's not polite to hold on to ownership you don't
need, and it can make your lifetimes more complex.
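
As an illustration of what a 'good reason' can look like, here is a sketch assuming a current stable toolchain. The `Greeter` type and its methods are hypothetical: the constructor keeps the string, so it takes a `String`, while `greet` only reads its argument, so it takes a `&str`.

```rust
// Hypothetical type that stores a name, so it needs to own the string.
struct Greeter {
    name: String,
}

impl Greeter {
    // Taking `String` is justified here: the value is kept in the struct.
    fn new(name: String) -> Greeter {
        Greeter { name }
    }

    // `punctuation` is only read, so borrowing is enough.
    fn greet(&self, punctuation: &str) -> String {
        format!("Hello, {}{}", self.name, punctuation)
    }
}

fn main() {
    let greeter = Greeter::new("world".to_string());
    println!("{}", greeter.greet("."));
}
```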

## Generic functions

To write a function that accepts both kinds of strings, take a `&str`.

```{rust}
fn some_string_length(x: &str) -> uint {
    x.len()
}

fn main() {
    let s = "Hello, world";

    println!("{}", some_string_length(s));

    let s = "Hello, world".to_string();

    println!("{}", some_string_length(s.as_slice()));
}
```

Both `println!` calls print `12`.

## Comparisons

To compare a `String` to a string literal, prefer `as_slice()`...

```{rust}
fn compare(x: String) {
    if x.as_slice() == "Hello" {
        println!("yes");
    }
}
```

... over `to_string()`:

```{rust}
fn compare(x: String) {
    if x == "Hello".to_string() {
        println!("yes");
    }
}
```

Converting a `String` to a `&str` is cheap, but converting the `&str` to a
`String` involves an allocation.
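
The same advice, avoiding an allocation just to compare, can be sketched for a current stable toolchain, which is an assumption here. On such a toolchain a `String` can be compared with a string literal directly, so neither conversion is needed. The name `is_hello` is hypothetical.

```rust
// Hypothetical helper: compares an owned string against a literal
// without creating a second `String`.
fn is_hello(x: String) -> bool {
    x == "Hello"
}

fn main() {
    if is_hello("Hello".to_string()) {
        println!("yes");
    }
}
```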

## Indexing strings

You may be tempted to try to access a certain character of a `String`, like
this:

```{rust,ignore}
let s = "hello".to_string();

println!("{}", s[0]);
```

This does not compile, and that is on purpose. In the world of UTF-8, direct
indexing is basically never what you want to do. The reason is that each
character can take a variable number of bytes, so finding the n-th character
means walking the string from the start, which is an O(n) operation.
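
A small sketch of what that walk looks like, assuming a current stable toolchain; `nth_char` is a hypothetical helper. It also shows why a byte offset is not a character position.

```rust
// Hypothetical helper: finds the `n`-th Unicode scalar value by iterating
// from the start of the string, which takes O(n) time.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    // "h", then U+00E9 (two bytes in UTF-8), then "llo".
    let s = "h\u{e9}llo";

    // Character position 2 is 'l'...
    println!("{:?}", nth_char(s, 2));

    // ...but byte offset 2 is the second byte of U+00E9.
    println!("{:#x}", s.as_bytes()[2]);
}
```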

There are three basic levels of Unicode (and its encodings):

- code units, the underlying data type used to store everything
- code points/Unicode scalar values (`char`)
- graphemes (visible characters)

Rust provides iterators for each of these situations:

- `.bytes()` will iterate over the underlying bytes
- `.chars()` will iterate over the code points
- `.graphemes()` will iterate over each grapheme

Usually, the `graphemes()` method on `&str` is what you want:

```{rust}
let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé";

for l in s.graphemes(true) {
    println!("{}", l);
}
```

This prints:

```{notrust,ignore}
u͔
n͈̰̎
i̙̮͚̦
c͚̉
o̼̩̰͗
d͔̆̓ͥ
é
```

Each visible character is printed in turn, as you'd expect. Note that `l` has
the type `&str` here: a single grapheme can consist of multiple code points, so
a `char` wouldn't be appropriate.

If you want each individual code point of each grapheme, you can use `.chars()`:

```{rust}
let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé";

for l in s.chars() {
    println!("{}", l);
}
```

This prints:

```{notrust,ignore}
u
͔
n
̎
͈
̰
i
̙
̮
͚
̦
c
̉
͚
o
͗
̼
̩
̰
d
̆
̓
ͥ
͔
e
́
```

Some of these are combining characters, which is why the output looks a bit
odd.

If you want the individual byte representation of each code point, you can use
`.bytes()`:

```{rust}
let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé";

for l in s.bytes() {
    println!("{}", l);
}
```

This will print:

```{notrust,ignore}
117
205
148
110
204
142
205
136
204
176
105
204
153
204
174
205
154
204
166
99
204
137
205
154
111
205
151
204
188
204
169
204
176
100
204
134
205
131
205
165
205
148
101
204
129
```

Many more bytes than graphemes!
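
The last grapheme of the example makes the same point on a smaller scale: it is an `e` followed by the combining acute accent U+0301, which matches the final bytes `101`, `204`, `129` in the listing. The sketch below assumes a current stable toolchain and uses a hypothetical helper, `bytes_and_chars`. It leaves out grapheme iteration, since this sketch does not assume `graphemes()` is available in the standard library on such a toolchain.

```rust
// Hypothetical helper: collects a string's bytes and its Unicode scalar values.
fn bytes_and_chars(s: &str) -> (Vec<u8>, Vec<char>) {
    (s.bytes().collect(), s.chars().collect())
}

fn main() {
    // 'e' followed by U+0301 (combining acute accent): one visible character.
    let (bytes, chars) = bytes_and_chars("e\u{301}");
    println!("{} bytes: {:?}", bytes.len(), bytes);
    println!("{} code points: {:?}", chars.len(), chars);
}
```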

# Other Documentation

* [the `&str` API documentation](std/str/index.html)
* [the `String` API documentation](std/string/index.html)