Skip to content

q-uint/zpars

Repository files navigation

zpars logo

zpars

A grammar parser toolkit written in Zig - ABNF, BNF, PEG, CFG, ERE, and S-expressions
with comptime parser combinators, a bytecode VM, native JIT compilers, and AOT compilation for AArch64 and x86_64.

Zig 0.16.0+ Open VSX License: MPL-2.0

Features

  • Multi-format parsing - ABNF, BNF, PEG, CFG, ERE, and S-expression grammars, each with tokenizer, parser, and error diagnostics.
  • Validation - Detects duplicate rules, undefined references, unused rules, and unproductive cycles.
  • Runtime matching - Match input strings against any rule in a dynamically loaded grammar.
  • Formatting - Pretty-print grammars back to canonical form with aligned operators.
  • Comptime combinators - Zero-overhead parser combinator library resolved entirely at comptime.
  • ABNF compiler - Compile ABNF grammar strings into combinator types at comptime - define your grammar in standard ABNF and get a parser for free.
  • Grammar VM - Bytecode virtual machine inspired by LPeg. Compiles any grammar to instructions and executes with backtracking and capture groups.
  • Peephole optimizer - Optimizes compiled bytecode with charset-to-char reduction, consecutive char fusion into string instructions, common prefix factoring, and optional char fusion.
  • JIT compilers - Compiles bytecode to native machine code (AArch64 and x86_64), eliminating interpreter dispatch overhead.
  • AOT compilation - Compile grammars ahead of time to portable .zpar binary blobs containing native machine code. Load and execute them without the grammar compiler.
  • Parse trees - Run any multi-rule grammar with auto-rule-captures to produce a hierarchical tree (S-expression or JSON), with each rule call becoming a node typed by rule name. Works on both the VM and the JIT, with optional packrat memoization and Warth-style left recursion.
  • Tree queries - Tree-sitter-style structural queries over parse trees, with @captures, alternation, quantifiers, anchors, field selectors (name: (pattern), declared via backwards-compatible #@ field directives), anonymous-token literal matches (#@ tokens "..." or --tokens=all), and #eq?/#match? predicates (regex compiled at query-compile time via the bundled ERE engine).
  • Benchmarks - 1M-iteration benchmarks comparing comptime vs runtime, optimized vs unoptimized VM, and interpreter vs JIT.

Comptime ABNF Compiler

Define a grammar in ABNF and compile it to a parser type at comptime - zero runtime overhead:

const zpars = @import("zpars");

const HttpVersion = zpars.abnf.Compiler.Compile(
   \\version = "HTTP/" 1*DIGIT "." 1*DIGIT
, "version");

test "parse HTTP version" {
   const r = HttpVersion.parse("HTTP/1.1 OK").?;
   try std.testing.expectEqualStrings("HTTP/1.1", r.value);
   try std.testing.expectEqualStrings(" OK", r.rest);
}

Multi-rule grammars with cross-references work too:

const Pair = zpars.abnf.Compiler.Compile(
   \\number = 1*DIGIT
   \\pair   = number "," number
, "pair");

const r = Pair.parse("42,7!").?;
// r.value == "42,7", r.rest == "!"

All 16 RFC 5234 core rules (ALPHA, DIGIT, SP, CRLF, etc.) are available implicitly.

Parser Combinators

The combinator library can also be used directly:

const c = zpars.combinators;

const Digit = c.CharRange('0', '9');
const Number = c.Capture(c.Many(Digit, .{ .min = 1 }));
const P = c.Seq(.{ Number, c.Literal(","), Number });

const r = c.Capture(P).parse("42,7!").?;
// r.value == "42,7"

Available primitives: Literal, Char, CharRange, AnyOf, NoneOf, ByteLiteral, CaseInsensitiveLiteral, Any, Eof.

Available combinators: Seq, Alt, Many, Optional, Not, Map, Capture.

Runtime Matcher

For grammars loaded at runtime, use the Matcher:

const zpars = @import("zpars");

var scanner = zpars.abnf.Scanner.init(grammar);
const tokens = scanner.scanTokens();
var parser = zpars.abnf.Parser.init(tokens, grammar);
const rules = try parser.parse();
var validator = zpars.Validator.init(allocator, rules);
const merged = try validator.validate();
const matcher = try zpars.Matcher.init(allocator, merged);

const r = matcher.match("version", "HTTP/1.1 OK").?;
// r.value == "HTTP/1.1", r.rest == " OK"

Grammar VM

A bytecode virtual machine that compiles grammar ASTs into instructions and executes them with backtracking. Inspired by the LPeg parsing machine.

Disassemble a grammar to see the generated bytecode:

$ zpars vm grammar.peg
   0: call    -> 2
   1: match
   2: call    -> 8
   3: choice  -> 7
   4: char    '+'
   ...

Run a match with capture groups (ERE):

$ zpars vm email.ere "user@example.com"
match: 16 bytes "user@example.com"
  group 0: "user"
  group 1: "example"
  group 2: "com"

Trace execution step-by-step with -t:

$ zpars vm -t pattern.ere "abcbd"
   0: sp=0   pos=0   "abcbd" char 'a'
   1: sp=0   pos=1   "bcbd" choice -> 7
   2: sp=1   pos=1   "bcbd" choice -> 5
   3: sp=2   pos=1   "bcbd" char 'b'
      backtrack -> pc=5 pos=2
   ...

The VM supports all grammar formats (ABNF, BNF, PEG, ERE).

Peephole Optimizer

The optimizer runs after compilation, rewriting bytecode in-place before execution:

  • Charset-to-char - Replaces set instructions with a single-element charset by a cheaper char instruction.
  • Char fusion - Fuses runs of consecutive char instructions into a single string instruction.
  • Common prefix factoring - Factors shared prefixes out of alternation branches (e.g. https|http shares http).
  • Optional char fusion - Collapses choice/char/commit sequences for a? patterns into a single optional_char instruction.

JIT Compilers

The JIT compiler translates VM bytecode to native machine code, removing interpreter dispatch overhead entirely. Both AArch64 (ARM64) and x86_64 backends are supported, selected automatically at compile time. The JIT uses the same API as the interpreter VM:

const zpars = @import("zpars");

var compiler = zpars.vm.Compiler.compile(rules);
var jit = try zpars.vm.Jit.init(
    compiler.getCode(),
    compiler.getCharsets(),
    compiler.getStringData(),
    input,
);
defer jit.deinit();

const result = jit.execute();

AOT Compilation

The AOT compiler produces standalone .zpar binary blobs containing native machine code and all static data. Unlike the JIT (which compiles at runtime), AOT compiles once and produces a portable blob that can be loaded and executed without the grammar compiler:

$ zpars compile examples/calc.peg -o calc.zpar
$ zpars run calc.zpar "1+2*3"
match: 5 bytes "1+2*3"

The blob format includes the native code, charset tables, string data, and jump tables. At runtime, the code is mmap'd as executable and called directly -- the only overhead is a single mmap syscall at load time.

Parse Trees

Run any multi-rule grammar with rules_as_captures to produce a hierarchical parse tree where every rule call is a node typed by its rule name. The compiler emits event_open / event_close instructions around each rule body; the VM/JIT records them on a side channel so they survive backtracking, and a post-pass folds the events into a tree:

$ zpars tree examples/calc.peg "1+2*3"
(Expr [0,5]
  (Term [0,1]
    (Factor [0,1]))
  (Term [2,5]
    (Factor [2,3])
    (Factor [4,5])))

-j switches to compact JSON, --jit runs through the native JIT, -p enables packrat memoization (which transparently caches and replays the per-rule events on cache hits). The output format mirrors tree-sitter's S-expressions for easy inspection and diffing.

Error recovery (PEG, experimental)

PEG grammars can opt into labeled-failure recovery via backwards-compatible #@ comment directives that other PEG tools see as plain # comments:

Program <- Stmt*
Stmt    <- Expr ";"
         / Expr     #@ throw missing_semi
Expr    <- [a-z]+

#@ rule Stmt catches missing_semi -> recover_missing

#@ throw <label> ends an alternative with a labeled failure that propagates past ordered-choice points. #@ rule <name> catches <label> -> <handler> wraps the rule's body in a catch frame; recover_missing is a builtin handler that emits a zero-width MISSING(label) marker. zpars tree runs the resulting bytecode through the VM with capture events on and renders synthesized ERROR / MISSING nodes and partial rule subtrees:

$ zpars tree examples/recovery.peg "foo;bar"
(Program [0,7]
  (Stmt [0,4]
    (Expr [0,3]))
  (Stmt [4,7]
    (Expr [4,7])
    (MISSING [7,7] missing_semi)))

Recovery directives can't be mixed with PEG () capture groups in the same grammar.

Tree queries (experimental)

A tree-sitter-style query language runs structural patterns against parse trees. Queries are S-expression-shaped with @captures, [alternation], ?/*/+ quantifiers, . anchors, and #predicates?:

; Capture every Factor.
(Factor) @factor

; Capture Factors whose text matches a regex.
((Factor) @odd
 (#match? @odd "[13579]"))

; Capture Terms that contain at least two Factors.
(Term (Factor) (Factor)) @mul

; Capture the first Term in an Expr.
(Expr . (Term) @first-term)

Run with zpars query:

$ zpars query examples/calc.peg examples/calc.scm "1+2*3"
pattern: 3
  capture: name=first-term, range=[0,1], text='1'
pattern: 0
  capture: name=factor, range=[0,1], text='1'
pattern: 1
  capture: name=odd, range=[0,1], text='1'
pattern: 2
  capture: name=mul, range=[2,5], text='2*3'
pattern: 0
  capture: name=factor, range=[2,3], text='2'
pattern: 0
  capture: name=factor, range=[4,5], text='3'
pattern: 1
  capture: name=odd, range=[4,5], text='3'

The query layer also recognises the recovery-era node kinds: (ERROR) @e matches error_nodes, (MISSING) @m matches missing_nodes, and (Rule partial) filters to rule_partial nodes specifically.

Built-in predicates: #eq?, #match? (ERE, compiled at query-compile time), #any-of?, #contains?, each with a #not- form. #vim-match? / #lua-match? are aliased to #match? for nvim-treesitter compatibility. Unknown predicates pass through (tree-sitter convention).

Anonymous tokens

Bare string atoms in queries match anonymous-token nodes (literal-text matches in the parse tree). Token emission is opt-in via --tokens=<mode> on the CLI (or Compiler.Options.token_events programmatically). Three modes:

  • --tokens=off -- no token nodes (default for non-PEG grammars).
  • --tokens=tagged (default for PEG) -- only literals listed in #@ tokens "..." directives become token nodes. Backwards-compatible: other PEG tools see the directives as plain comments.
  • --tokens=all -- every literal-matching opcode emits a token node. Most ecosystem-compatible, but inflates the event log.
#@ tokens "+" "*"
Expr   <- Term ("+" Term)*
Term   <- Factor ("*" Factor)*
Factor <- "(" Expr ")" / [0-9]+
; operators.scm -- captures every "+" and "*" anonymous-token node.
"+" @plus
"*" @star
$ zpars query examples/calc-tokens.peg examples/operators.scm "1+2*3"
pattern: 0
  capture: name=plus, range=[1,2], text='+'
pattern: 1
  capture: name=star, range=[3,4], text='*'

Both the VM and JIT emit event_token natively.

Fields

Field annotations let queries select children by role rather than position. zpars stays PEG-compliant: fields are declared via backwards-compatible #@ field comment directives (other PEG tools see them as plain comments), then referenced from queries with the standard tree-sitter name: (pattern) syntax.

#@ tokens "function"
#@ field Func name = Ident
#@ field Func body = Body
Func  <- "function" _ Ident _ Body
Ident <- [a-z]+
Body  <- "{" _ "}"
_     <- " "*

When the same target appears multiple times in a rule body, an explicit ordinal disambiguates: #@ field Bin left = Expr#1, #@ field Bin right = Expr#2. Targets can also be quoted literals (= "function") for tagging anonymous-token children.

; func.scm
(Func
  name: (Ident) @fname
  body: (Body) @fbody)
$ zpars query examples/func.peg examples/func.scm "function foo{}"
pattern: 0
  capture: name=fname, range=[9,12], text='foo'
  capture: name=fbody, range=[12,14], text='{}'

Field events are auto-enabled when the PEG parser collects at least one #@ field directive. Both the VM and JIT emit event_field natively.

Multi-pattern groupings are supported: ((a) (b) (#pred? ...)) matches when the visited node has children matching a and b in order (gap-allowed by default; use . between them for strict adjacency). Predicates fire after the full sequence binds, so a failed predicate triggers backtracking through quantifiers naturally. Single-pattern groupings (((Rule (Factor)+ @f) (#eq? @f "x"))) get the same treatment: the group's predicates fold into the inner's child-sequence match so a rejected greedy bind retries with a shorter count.

The CLI prints one entry per pattern match by default; pass -c (or --captures) to flatten the output into one entry per capture, ordered by source position. JSON forms are available for both modes via -j. An empty result set prints no matches to stderr.

See examples/ for sample grammars (calc, email, JSON, URI, recovery, queries) and inputs.

CLI

zpars check   <file>                          # validate a grammar
zpars compile <file> -o <output>              # compile grammar to native .zpar blob
zpars fmt     <file>                          # format a grammar
zpars match   -r <rule> <file> <input>        # match input against a rule
zpars run     <blob> <input>                  # run a compiled .zpar blob
zpars tree    [-j] [-p|--jit] <file> <input>  # print parse tree (S-exp default, -j JSON)
zpars vm      [-t] [-p] <file> [<input>]      # disassemble and run via VM
zpars query   [-j] [-c] [--tokens=off|all|tagged] <grammar> <query-file> <input>  # run a tree-sitter-style query

check

Validate an ABNF grammar, reporting syntax errors and semantic issues:

$ zpars check grammar.abnf
grammar.abnf:1:12: error: expected element, found ')'
   foo = (a / )
              ^
$ zpars check grammar.abnf
grammar.abnf: warning: rule 'helper' is defined but never referenced
grammar.abnf: error: rule 'start' references undefined rule 'missing'

fmt

Parse and reformat a grammar with aligned = signs:

$ zpars fmt grammar.abnf
number = 1*DIGIT
pair   = number "," number

match

Match an input string against a rule:

$ zpars match -r version grammar.abnf "HTTP/1.1 OK"
HTTP/1.1

Editor Support

The VSCode extension provides semantic highlighting for ABNF, BNF, PEG, CFG, and S-expression grammars, powered by the zpars WASM build.

Install from Open VSX.

Building

Requires Zig 0.16.0+.

zig build                      # build the CLI executable
zig build test                 # run all tests
zig build bench                # run benchmarks (ReleaseFast, 1M iterations)
zig build lsp                  # build the LSP server
zig build vsx                  # build the Open VSX extension (WASM + TypeScript)

References

About

A grammar parser toolkit in Zig with comptime combinators, a bytecode VM, JIT, and AOT compilation

Topics

Resources

License

Stars

Watchers

Forks

Contributors

Languages