,

Contents · Lexing, parsing (LL/LR/GLR), parser generators


Lexing vs parsing

  • Lexing maps characters → tokens using regular languages (regex/DFAs).
  • Parsing maps tokens → AST using context-free grammars (CFGs).
  • Separation simplifies grammar and improves performance.

Regular languages, DFAs/NFAs, and tokenization

  • Define tokens with regex; convert to NFA (Thompson), then DFA (subset construction).
  • Minimize DFA for speed; handle longest-match and priority rules.
  • Unicode and whitespace/comments require practical care.
IDENT = [A-Za-z_][A-Za-z0-9_]*
NUMBER = [0-9]+(\.[0-9]+)?

Top-down parsing: LL(k), recursive descent, PEG

  • LL parsers predict production by lookahead; left recursion must be eliminated.
  • PEG uses prioritized choice and backtracking; packrat parsing gives linear time with memoization.
  • Hand-written recursive descent is simple and flexible for small grammars.
Expr  → Term (("+"|"-") Term)*
Term  → Factor (("*"|"/") Factor)*
Factor→ NUMBER | IDENT | "(" Expr ")"

Bottom-up parsing: LR(0)/SLR/LALR/CLR

  • LR parses by shift/reduce using parse tables; handles a wide class of CFGs.
  • LALR merges LR states to reduce table size (yacc/bison default).
  • Conflicts (shift/reduce, reduce/reduce) resolved by precedence/associativity.
Actions: shift s_i, reduce A→α, goto transitions; parse stack drives decisions

Generalized LR (GLR) for ambiguous grammars

  • Explores multiple parses in parallel with graph-structured stack; cubic worst-case, linear often.
  • Useful for natural languages or languages with intentional ambiguity.
  • Variants: Earley, CYK for different trade-offs.

Parser generators: lex/flex, yacc/bison, ANTLR

  • flex + bison: classic C/C++ toolchain (LALR(1)).
  • ANTLR: LL(*) with modern targets (Java, C#, Python, Go, etc.).
  • tree-sitter: incremental parsing for editors; GLR-based.
# bison example (skeleton)
bison -d parser.y && flex lexer.l && cc parser.tab.c lex.yy.c -o parser

Error recovery and diagnostics

  • Panic-mode recovery: discard tokens until a synchronizing set.
  • Phrase-level recovery: insert/delete minimal tokens heuristically.
  • Great diagnostics: point to source span, suggest fixes, show expected tokens.

Tools and tips

  • Keep grammar unambiguous where possible; add precedence/associativity rules.
  • Test with large corpora; fuzz with random token streams.
  • Design AST for later passes: types, optimization, codegen.

Exercises

  1. Write a lexer for a tiny expression language using regex → NFA → DFA.
  2. Implement a recursive-descent parser for the expression grammar with precedence.
  3. Convert the grammar to LALR(1) and build with bison; compare error messages.
Pick parsing strategy based on grammar and tooling needs: LL for simplicity, LR for coverage, GLR for ambiguity.