doc/manual/gg-ref.tex

   1 \section{{\tt gg}, the grammar generator}
   2
   3 \verb|gg| is the grammar generator, the library with which Metalua
   4 parser is built. Knowing it allows you to easily write your own
   5 parsers, and to plug them into mlp, the existing Metalua source
   6 parser. It defines a couple of generators, which take parsers as
   7 parameters, and return a more complex parser as a result by combining
   8 them.
   9
  10 Notice that \verb|gg| sources are thoroughly commented, and try to be
  11 readable as a secondary source of documentation.
  12
  13 There are four main classes in gg, which allow to generate:
  14 \begin{itemize}
  15 \item sequences of keywords and subparsers;
  16 \item keyword-driven sequence sets, i.e. parsers which select a
  17   sequence parser depending on an initial keyword;
  18 \item lists, i.e. series of an undetermined number of elements of the
  19   same type, optionnaly separated by a keyword (typically ``{\tt,}'');
  20 \item expressions, built with infix, prefix and suffix operators
  21   around a primary expression element;
  22 \end{itemize}
  23
  24
  25
  26 \subsection{Sequences}
  27
  28 A sequence parser combines sub-parsers together, calling them one after
  29 the other. A lot of these sub-parsers will simply read a keyword, and
  30 do nothing of it except making sure that it is indeed here: these can
  31 be specified by a simple string. For instance, the following
  32 declarations create parsers that read function declarations, thanks
  33 to some subparsers:
  34 \begin{itemize}
  35 \item \verb|func_stat_name| reads a function name (a list of
  36   identifiers separated by dots, plus optionally a semicolon and a
  37   method name);
  38 \item \verb|func_params_content| reads a possibly empty list of
  39   identifiers separated with commas, plus an optional ``\verb|...|'';
  40 \item \verb|mlp.block| reads a block of statements (here the function
  41   body);
  42 \item \verb|mlp.id| reads an identifier.
  43 \end{itemize}
  44
  45 \begin{Verbatim}[fontsize=\scriptsize]
  46 -- Read a function definition statement
  47 func_stat = gg.sequence{ "function", func_stat_name, "(",
  48                          func_params_content, ")", mlp.block, "end" }
  49
  50 -- Read a local function definition statement
  51 func_stat = gg.sequence{ "local", "function", mlp.id, "(",
  52                          func_params_content, ")", mlp.block, "end" }
  53
  54 \end{Verbatim}
  55
  56 \subsubsection{Constructor {\tt gg.sequence (config\_table) }}
  57
  58 This function returns a sequence parser. \verb|config_table| contains,
  59 in its array part, a sequence of string representing keyword parsers,
  60 and arbitrary sub-parsers. Moreover, the following fields are allowed
  61 in the hash part of the table:
  62
  63 \begin{itemize}
  64 \item\verb|name = <string>|: the parser's name, used to generate error
  65   messages;
  66 \item\verb|builder = <function>|: if present, whenever the parser is
  67   called, the list of the sub-parsers results is passed to this
  68   function, and the function's return value is returned as the
  69   parser's result. If absent, the list of sub-parser results is simply
  70   returned. It can be updated at anytime with:\\
  71   \verb|x.builder = <newval>|.
  72 \item\verb|builder = <string>|: the string is added as a tag to the
  73   list of sub-parser results. \verb|builder = "foobar"| is equivalent
  74   to:\\
  75  \verb|builder = function(x) x.tag="foobar"; return x end|
  76 \item\verb|transformers = <function list>|: applies all the functions
  77   of the list, in sequence, to the result of the parsing: these
  78   functions must be of type AST$\rightarrow$AST. For instance, if the
  79   transformers list is {\tt\{f1, f2, f3\}}, and the builder returns
  80   {\tt x}, then the whole parser returns {\tt f3(f2(f1(x)))}.
  81 \end{itemize}
  82
  83 \subsubsection{Method {\tt :parse(lexstream)}}
  84
  85 Read a sequence from the lexstream. If the sequence can't be entirely
  86 read, an error occurs. The result is either the list of results of
  87 sub-parsers, or the result of \verb|builder| if it is non-nil. In the
  88 \verb|func_stat| example above, the result would be a list of 3
  89 elements: the results of \verb|func_stat_name|,
  90 \verb|func_params_content| and \verb|mlp.block|.
  91
  92 It can also be directly called as simply \verb|x(lexstream)| instead of
  93 \verb|x:parse(lexstream)|.
  94
  95 \subsubsection{Method {\tt .transformers:add(f)}}
  96 Adds a function at the end of the transformers list.
  97
  98 \subsection{Sequence sets}
  99
 100 In many cases, several sequence parsers can be applied at a given
 101 point, and the choice of the right parser is determined by the next
 102 keyword in the lexstream. This is typically the case for Metalua's
 103 statement parser. All the sequence parsers must start with a keyword
 104 rather than a sub-parser, and that initial keyword must be different
 105 for each sequence parser in the sequence set parser. The sequence set
 106 parser takes care of selecting the appropriate sequence
 107 parser. Moreover, if the next token in the lexstream is not a keyword, or
 108 if it is a keyword but no sequence parser starts with it, the sequence
 109 set parser can have a default parser which is used as a fallback.
 110
 111 For instance, the declaration of \verb|mlp.stat| the Metalua statement
 112 parser looks like:
 113
 114 \begin{verbatim}
 115 mlp.stat = gg.multisequence{
 116   mlp.do_stat, mlp.while_stat, mlp.repeat_stat, mlp.if_stat... }
 117 \end{verbatim}
 118
 119 \subsubsection{Constructor {\tt gg.multisequence (config\_table)}}
 120
 121
 122 This function returns a sequence set parser. The array part of
 123 \verb|config_table| contains a list of parsers. It also accepts tables
 124 instead of sequence parsers: in this case, these tables are supposed
 125 to be config tables for \verb|gg.sequence| constructor, and are
 126 converted into sequence parsers on-the-fly by calling
 127 \verb|gg.sequence| on them. Each of these sequence parsers has to
 128 start with a keyword, distinct from all other initial keywords,
 129 e.g. it's illegal for two sequences in the same multisequence to start
 130 with keyword {\tt do}; it's also illegal for any parser but the
 131 default one to start with a subparser, e.g. {\tt mlp.id}.
 132
 133 It also accepts the following fields in the table's hash part:
 134 \begin{itemize}
 135 \item\verb|default = <parser>|: if no sequence can be chosen, because
 136   the next token is not a keyword, or no sequence parser in the set
 137   starts with that keyword, then the default parser is run instead; if
 138   no default parser is provided and no sequence parser can be chosen,
 139   an error is generated when parsing.
 140 \item\verb|name = <string>|: the parser's name, used to generate arror
 141   messages;
 142 \item\verb|builder = <function>|: if present, whenever the parser is
 143   called, the selected parser's result is passed to this
 144   function, and the function's return value is returned as the
 145   parser's result. If absent, the selected parser's result is simply
 146   returned. It can be updated at anytime with
 147   \verb|x.builder = <newval>|.
 148 \item\verb|builder = <string>|: the string is added as a tag to the
 149   list of sub-parser results. \verb|builder = "foobar"| is equivalent
 150   to:\\
 151  \verb|builder = function(x) return { tag = "foobar"; unpack (x) } end|
 152 \item\verb|transformers = <function list>|: applies all the functions
 153   of the list, in sequence, to the result of the parsing: these
 154   functions must be of type AST$\rightarrow$AST.
 155 \end{itemize}
 156
 157 \subsubsection{Method {\tt :parse(lexstream)}}
 158
 159 Read from the lexstream. The result returned is the result of the selected
 160 parser (one of the sequence parsers, or the default parser). That
 161 result is either returned directly, or passed through \verb|builder|
 162 if this field is non-nil.
 163
 164 It can also be directly called as simply \verb|x(lexstream)| instead of
 165 \verb|x:parse(lexstream)|.
 166
 167 \subsubsection{Method {\tt :add(sequence\_parser)}}
 168
 169 Take a sequence parser, or a config table that would be accepted by
 170 \verb|gg.sequence| to build a sequence parser. Add that parser to the
 171 set of sequence parsers handled by x. Cause an error if the parser
 172 doesn't start with a keyword, or if that initial keyword is already
 173 reserved by a registered sequence parser, or if the parser is not a
 174 sequence parser.
 175
 176 \subsubsection{Field {\tt .default}}
 177 This field contains the default parser, and can be set to another
 178 parser at any time.
 179
 180 \subsubsection{Method {\tt .transformers:add}}
 181 Adds a function at the end of the transformers list.
 182
 183 \subsubsection{Method {\tt :get(keyword)}}
 184 Takes a keyword (as a string), and returns the sequence in the set
 185 starting with that keyword, or nil if there is no such sequence.
 186
 187 \subsubsection{Method {\tt :del(keyword)}}
 188 Removes the sequence parser starting with keyword {\tt kw}.
 189
 190 \subsection{List parser}
 191
 192 Sequence parsers allow to chain several different sub-parser. Another
 193 common need is to read a series of identical elements into a list, but
 194 without knowing in advance how many of such elements will be
 195 found. This allows to read lists of arguments in a call, lists of
 196 parameters in a function definition, lists of statements in a
 197 block\ldots
 198
 199 Another common feature of such lists is that elements of the list are
 200 separated by keywords, typically semicolons or commas. These are
 201 handled by the list parser generator.
 202
 203 A list parser needs a way to know when the list is finished. When
 204 elements are separated by keyword separators, this is easy to
 205 determine: the list stops when an element is not followed by a
 206 separator. But two cases remain unsolved:
 207 \begin{itemize}
 208 \item When a list is allowed to be empty, no separator keyword allows
 209   the parser to realize that it is in front of an empty list: it would
 210   call the element parser, and that parser would fail;
 211 \item when there are no separator keyword separator specified, they
 212   can't be used to determine the end of the list.
 213 \end{itemize}
 214
 215 For these two cases, a list parser can specify a set of terminator
 216 keywords: in separator-less lists, the parser returns as soon as a
 217 terminator keyword is found where an element would otherwise have been
 218 read. In lists with separators, if terminators are specified, and such
 219 a terminator is found at the beginning of the list, then no element is
 220 parsed, and an empty list is returned. For instance, for argument
 221 lists, ``\verb|)|'' would be specified as a terminator, so that empty
 222 argument lists ``\verb|()|'' are handled properly.
 223
 224 Beware that separators are consumed from the lexstream stream, but
 225 terminators are not.
 226
 227 %\caveat{FIXME: check that it works as advertized (for list parserd
 228 %  with separators AND terminators; it's related to the trailing
 229 %  separator issue in block and table\_content parsers). There still is
 230 %  a design issue to settle here.}
 231
 232 \subsubsection{Constructor {\tt gg.list (config\_table)}}
 233
 234 This function returns a list parser. \verb|config_table| can contain
 235 the following fields:
 236 \begin{itemize}
 237
 238 \item\verb|primary = <parser>| (mandatory): the parser used to read
 239   elemetns of the list;
 240
 241 \item\verb|separators = <list> |: list of strings representing the
 242   keywords accepted as element separators. If only one separator is
 243   allowed, then the string can be passed outside the list:\\
 244   \verb|separators = "foo"| is the same as
 245   \verb|separators = { "foo" }|.
 246
 247 \item\verb|terminators = <list> |: list of strings representing the
 248   keywords accepted as list terminators. If only one separator is
 249   allowed, then the string can be passed outside the list.
 250
 251 \item\verb|name = <string>|: the parser's name, used to generate arror
 252   messages;
 253
 254 \item\verb|builder = <function>|: if present, whenever the parser is
 255   called, the list of primary parser results is passed to this
 256   function, and the function's return value is returned as the
 257   parser's result. If absent, the list of sub-parser results is simply
 258   returned. It can be updated at anytime with
 259   \verb|x.builder = <newval>|.
 260
 261 \item\verb|builder = <string>|: the string is added as a tag to the
 262   list of sub-parser results. \verb|builder = "foobar"| is equivalent
 263   to:\\
 264  \verb|builder = function(x) return { tag = "foobar"; unpack (x) } end|
 265
 266 \item\verb|transformers = <function list>|: applies all the functions
 267   of the list, in sequence, to the result of the parsing: these
 268   functions must be of type AST$\rightarrow$AST.
 269
 270 \item if keyless element is found in \verb|config_table|, and there is
 271   no \verb|primary| key in the table, then it is expected to be a
 272   parser, and it is considered to be the primary parser.
 273 \end{itemize}
 274
 275 \subsubsection{Method {\tt :parse (lexstream)}}
 276
 277 Read a list from the lexstream. The result is either the list of elements
 278 as read by the primary parser, or the result of that list passed
 279 through \verb|builder| if it is specified.
 280
 281 It can also be directly called as simply \verb|x(lexstream)| instead of
 282 \verb|x:parse(lexstream)|.
 283
 284 \subsubsection{Method {\tt .transformers:add}}
 285 Adds a function at the end of the transformers list.
 286
 287 \subsection{Method {\tt .separators:add}}
 288 Adds a string to the list of separators.
 289
 290 \subsection{Method {\tt .terminators:add}}
 291 Adds a string to the list of terminators.
 292
 293 \subsection{Expression parser}
 294
 295 This is a very powerfull parser generator, but it ensues that its API
 296 is quite large. An expression parser relies on a primary parser, and
 297 the elements read by this parser can be:
 298 \begin{itemize}
 299 \item combined pairwise by infix operators;
 300 \item modified by prefix operators;
 301 \item modified by suffix operators.
 302 \end{itemize}
 303
 304 All operators are described in a way analoguous to sequence config
 305 tables: a sequence of keywords-as-strings and subparsers in a table,
 306 plus the usual \verb|builder| and \verb|transformers|
 307 fields. Each kind of operator has its own signature for \verb|builder|
 308 functions, and some specific additional information such as precedence
 309 or associativity. As in multisequences, the choice among operators is
 310 determined by the initial keyword. Therefore, it is illegal to have
 311 two operator sequences which start with the same keyword (for
 312 instance, two infix sequences starting with keyword ``\$'').
 313
 314 Most of the time, the sequences representing operators will have a
 315 single keyword, and no subparser. For instance, the addition is
 316 represented as:\\
 317 \verb~{ "+",  prec=60, assoc="left", builder= |a, _, b| `Op{ `Add, a, b } }~
 318
 319
 320 \paragraph{Infix operators}
 321 Infix operators are described by a table whose array-part works as for
 322 sequence parsers. Besides this array-part and the usual {\tt
 323   transformers} list, the table accepts the following fields in its hash-part:
 324 \begin{itemize}
 325
 326 \item\verb|prec = <number>| its precedence. The higher the precedence,
 327   the tighter the operator bind with respect to other operators. For
 328   instance, in Metalua, addition precedence is 60, whereas
 329   multiplication precedence is 70.
 330
 331 \item\verb|assoc = <string>| is one of \verb|"left"|, \verb|"right"|,
 332   \verb|"flat"| or \verb|"none"|, and specifies how an operator
 333   associates. If not specified, the default associativity is
 334   \verb|"left"|.
 335
 336   Left and right describe how to read sequences of operators with the
 337   same precedence, e.g. addition is left associative ({\tt 1+2+3} reads as
 338   {\tt(1+2)+3}), whereas exponentiation is right-associative (\verb|1^2^3|
 339   reads as \verb|1^(2^3)|).
 340
 341   If an operator is non-associative and an ambiguity is found, a
 342   parsing error occurs.
 343
 344   Finally, flat operators get series of them collected in a list,
 345   which is passed to the corresponding builder as a single
 346   parameter. For instance, if \verb|++| is declared as flat and its
 347   builder is \verb|f|, then whenever {\tt 1++2++3++4} is parsed, the
 348   result returned is {\tt f\{1, 2, 3, 4\}}.
 349
 350 \item\verb|builder = <function>| the usual result transformer. The
 351   function takes as parameters the left operand, the result of the
 352   sequence parser (i.e. \verb|{ }| if the sequence contains no
 353   subparser), and the right operand; it must return the resulting AST.
 354
 355 \end{itemize}
 356
 357 %For instance, Metalua's addition parser could be described by:
 358
 359 %\begin{verbatim}
 360 %{ "+", assoc = "left", prec = 60, builder = |a,_,b| `Op{ `Add, a, b } }
 361 %\end{verbatim}
 362
 363 \paragraph{Prefix operators}
 364 These operators are placed before the sub-expression they modify. They
 365 have the same properties as infix operators, except that they don't
 366 have an \verb|assoc| field, and \verb|builder| takes {\tt|operator,
 367   operand|} instead of {\tt|left\_operand, operator, right\_operand|}.
 368
 369 \paragraph{Suffix operators}
 370 Same as prefix operators, except that \verb|builder| takes
 371 {\tt|operand, operator|} instead of {\tt|operator, operand|}.
 372
 373 \subsubsection{Constructor {\tt gg.expr (config\_table)}}
 374
 375 This function returns an expression parser. \verb|config_table|
 376 is a table of fields which describes the kind of expression to be
 377 read by the parser. The following fields can appear in the table:
 378
 379 \begin{itemize}
 380 \item\verb|primary| (mandatory): the primary parser, which reads the
 381   primary elements linked by operators. In Metalua expression parsers,
 382   that would be numbers, strings, identifiers\ldots It is often a
 383   multisequence parser, although that's not mandatory.
 384 \item\verb|prefix|: a list of tables representing prefix operator
 385   sequences, as described above. It supports a {\tt default} parser:
 386   this parser is considered to have succesfully parsed a prefix
 387   operator if it returns a non-false result.
 388 \item\verb|infix|: a list of tables representing infix operator
 389   sequences, as described above. Supports a {\tt default} parser.
 390 \item\verb|suffix|: a list of tables representing suffix operator
 391   sequences, as described above. Supports a {\tt default} parser.
 392 \end{itemize}
 393
 394 \subsubsection{Methods {\tt .prefix:add()}, {\tt .infix:add()}, {\tt
 395     .suffix:add()}}
 396 Add an operator in the relevant table. The argument must be an
 397 operator sequence table, as described above.
 398
 399 \subsubsection{Method {\tt :add()}}
 400 This is just a shortcut for {\tt primary.add}. Unspecified behavior if
 401 {\tt primary} doesn't support method {\tt add}.
 402
 403
 404 \subsubsection{Method {\tt :parse (lexstream)}}
 405 Read a list from the lexstream. The result is built by \verb|builder1| calls.
 406
 407 It can also be directly called as simply \verb|x(lexstream)| instead of
 408 \verb|x:parse(lexstream)|.
 409
 410 \subsubsection{Method {\tt :tostring()}}
 411
 412 Returns a string representing the parser. Mainly useful for error
 413 message generation.
 414
 415 \subsection{{\tt onkeyword} parser}
 416
 417 Takes a list of keywords and a parser: if the next token is one of the
 418 keywords in the list, runs the parser; if not, simply returns
 419 \verb|false|.
 420
 421 Notice that by default, the keyword is consumed by the
 422 \verb|onkeyword| parser. If you want it not to be consumed, but
 423 instead passed to the internal parser, add a \verb|peek=true| entry in
 424 the config table.
 425
 426 \subsubsection{Constructor {\tt gg.onkeyword (config\_table)}}
 427
 428 Create a keyword-conditionnal parser. \verb|config_table| can contain:
 429
 430 \begin{itemize}
 431 \item strings, representing the triggerring keywords (at least one);
 432 \item the parser to run if one of the keywords is found (exactly one);
 433 \item \verb|peek=<boolean>| to indicate whether recognized keywords
 434   must be consumed or passed to the inner parser.
 435 \end{itemize}
 436 The order of elements in the list is not relevant.
 437
 438 \subsubsection{Method {\tt :parse (lexstream)}}
 439
 440 Run the parser. The result is the internal parser's result, or
 441 \verb|false| if the next token in the lexstream wasn't one of the
 442 specified keywords.
 443
 444 It can also be directly called as simply \verb|x(lexstream)| instead of
 445 \verb|x:parse(lexstream)|.
 446
 447 \subsection{{\tt optkeyword} parser}
 448
 449 Watch for optional keywords: an \verb|optkeyword| parser has a list of
 450 keyword strings as a configuration. If such a keyword is found as the
 451 nex lexstream element upon parsing, the keyword is consumed and that
 452 string is returned. If not, \verb|false| is returned.
 453
 454 \subsubsection{Constructor {\tt gg.optkeyword (keyword1, keyword2, ...)}}
 455
 456 Return a \verb|gg.optkeyword| parser, which accepts all of the
 457 keywords given as parameters, and returns either the found keyword, or
 458 \verb|false| if none is found.