A grammar parser toolkit written in Zig - ABNF, BNF, PEG, CFG, ERE, and S-expressions
with comptime parser combinators, a bytecode VM, native JIT compilers, and AOT compilation for AArch64 and x86_64.
- Multi-format parsing - ABNF, BNF, PEG, CFG, ERE, and S-expression grammars, each with tokenizer, parser, and error diagnostics.
- Validation - Detects duplicate rules, undefined references, unused rules, and unproductive cycles.
- Runtime matching - Match input strings against any rule in a dynamically loaded grammar.
- Formatting - Pretty-print grammars back to canonical form with aligned operators.
- Comptime combinators - Zero-overhead parser combinator library resolved entirely at comptime.
- ABNF compiler - Compile ABNF grammar strings into combinator types at comptime - define your grammar in standard ABNF and get a parser for free.
- Grammar VM - Bytecode virtual machine inspired by LPeg. Compiles any grammar to instructions and executes with backtracking and capture groups.
- Peephole optimizer - Optimizes compiled bytecode with charset-to-char reduction, consecutive char fusion into string instructions, common prefix factoring, and optional char fusion.
- JIT compilers - Compiles bytecode to native machine code (AArch64 and x86_64), eliminating interpreter dispatch overhead.
- AOT compilation - Compile grammars ahead of time to portable
.zparbinary blobs containing native machine code. Load and execute them without the grammar compiler. - Parse trees - Run any multi-rule grammar with auto-rule-captures to produce a hierarchical tree (S-expression or JSON), with each rule call becoming a node typed by rule name. Works on both the VM and the JIT, with optional packrat memoization and Warth-style left recursion.
- Tree queries - Tree-sitter-style structural queries over parse trees, with
@captures, alternation, quantifiers, anchors, field selectors (name: (pattern), declared via backwards-compatible#@ fielddirectives), anonymous-token literal matches (#@ tokens "..."or--tokens=all), and#eq?/#match?predicates (regex compiled at query-compile time via the bundled ERE engine). - Benchmarks - 1M-iteration benchmarks comparing comptime vs runtime, optimized vs unoptimized VM, and interpreter vs JIT.
Define a grammar in ABNF and compile it to a parser type at comptime - zero runtime overhead:
const zpars = @import("zpars");
const HttpVersion = zpars.abnf.Compiler.Compile(
\\version = "HTTP/" 1*DIGIT "." 1*DIGIT
, "version");
test "parse HTTP version" {
const r = HttpVersion.parse("HTTP/1.1 OK").?;
try std.testing.expectEqualStrings("HTTP/1.1", r.value);
try std.testing.expectEqualStrings(" OK", r.rest);
}Multi-rule grammars with cross-references work too:
const Pair = zpars.abnf.Compiler.Compile(
\\number = 1*DIGIT
\\pair = number "," number
, "pair");
const r = Pair.parse("42,7!").?;
// r.value == "42,7", r.rest == "!"All 16 RFC 5234 core rules (ALPHA, DIGIT, SP, CRLF, etc.) are available implicitly.
The combinator library can also be used directly:
const c = zpars.combinators;
const Digit = c.CharRange('0', '9');
const Number = c.Capture(c.Many(Digit, .{ .min = 1 }));
const P = c.Seq(.{ Number, c.Literal(","), Number });
const r = c.Capture(P).parse("42,7!").?;
// r.value == "42,7"Available primitives: Literal, Char, CharRange, AnyOf, NoneOf, ByteLiteral, CaseInsensitiveLiteral, Any, Eof.
Available combinators: Seq, Alt, Many, Optional, Not, Map, Capture.
For grammars loaded at runtime, use the Matcher:
const zpars = @import("zpars");
var scanner = zpars.abnf.Scanner.init(grammar);
const tokens = scanner.scanTokens();
var parser = zpars.abnf.Parser.init(tokens, grammar);
const rules = try parser.parse();
var validator = zpars.Validator.init(allocator, rules);
const merged = try validator.validate();
const matcher = try zpars.Matcher.init(allocator, merged);
const r = matcher.match("version", "HTTP/1.1 OK").?;
// r.value == "HTTP/1.1", r.rest == " OK"A bytecode virtual machine that compiles grammar ASTs into instructions and executes them with backtracking. Inspired by the LPeg parsing machine.
Disassemble a grammar to see the generated bytecode:
$ zpars vm grammar.peg
0: call -> 2
1: match
2: call -> 8
3: choice -> 7
4: char '+'
...
Run a match with capture groups (ERE):
$ zpars vm email.ere "user@example.com"
match: 16 bytes "user@example.com"
group 0: "user"
group 1: "example"
group 2: "com"
Trace execution step-by-step with -t:
$ zpars vm -t pattern.ere "abcbd"
0: sp=0 pos=0 "abcbd" char 'a'
1: sp=0 pos=1 "bcbd" choice -> 7
2: sp=1 pos=1 "bcbd" choice -> 5
3: sp=2 pos=1 "bcbd" char 'b'
backtrack -> pc=5 pos=2
...
The VM supports all grammar formats (ABNF, BNF, PEG, ERE).
The optimizer runs after compilation, rewriting bytecode in-place before execution:
- Charset-to-char - Replaces
setinstructions with a single-element charset by a cheapercharinstruction. - Char fusion - Fuses runs of consecutive
charinstructions into a singlestringinstruction. - Common prefix factoring - Factors shared prefixes out of alternation branches (e.g.
https|httpshareshttp). - Optional char fusion - Collapses
choice/char/commitsequences fora?patterns into a singleoptional_charinstruction.
The JIT compiler translates VM bytecode to native machine code, removing interpreter dispatch overhead entirely. Both AArch64 (ARM64) and x86_64 backends are supported, selected automatically at compile time. The JIT uses the same API as the interpreter VM:
const zpars = @import("zpars");
var compiler = zpars.vm.Compiler.compile(rules);
var jit = try zpars.vm.Jit.init(
compiler.getCode(),
compiler.getCharsets(),
compiler.getStringData(),
input,
);
defer jit.deinit();
const result = jit.execute();The AOT compiler produces standalone .zpar binary blobs containing native machine code and all static data. Unlike the JIT (which compiles at runtime), AOT compiles once and produces a portable blob that can be loaded and executed without the grammar compiler:
$ zpars compile examples/calc.peg -o calc.zpar
$ zpars run calc.zpar "1+2*3"
match: 5 bytes "1+2*3"
The blob format includes the native code, charset tables, string data, and jump tables. At runtime, the code is mmap'd as executable and called directly -- the only overhead is a single mmap syscall at load time.
Run any multi-rule grammar with rules_as_captures to produce a hierarchical parse tree where every rule call is a node typed by its rule name. The compiler emits event_open / event_close instructions around each rule body; the VM/JIT records them on a side channel so they survive backtracking, and a post-pass folds the events into a tree:
$ zpars tree examples/calc.peg "1+2*3"
(Expr [0,5]
(Term [0,1]
(Factor [0,1]))
(Term [2,5]
(Factor [2,3])
(Factor [4,5])))
-j switches to compact JSON, --jit runs through the native JIT, -p enables packrat memoization (which transparently caches and replays the per-rule events on cache hits). The output format mirrors tree-sitter's S-expressions for easy inspection and diffing.
PEG grammars can opt into labeled-failure recovery via backwards-compatible #@ comment directives that other PEG tools see as plain # comments:
Program <- Stmt*
Stmt <- Expr ";"
/ Expr #@ throw missing_semi
Expr <- [a-z]+
#@ rule Stmt catches missing_semi -> recover_missing
#@ throw <label> ends an alternative with a labeled failure that propagates past ordered-choice points. #@ rule <name> catches <label> -> <handler> wraps the rule's body in a catch frame; recover_missing is a builtin handler that emits a zero-width MISSING(label) marker. zpars tree runs the resulting bytecode through the VM with capture events on and renders synthesized ERROR / MISSING nodes and partial rule subtrees:
$ zpars tree examples/recovery.peg "foo;bar"
(Program [0,7]
(Stmt [0,4]
(Expr [0,3]))
(Stmt [4,7]
(Expr [4,7])
(MISSING [7,7] missing_semi)))
Recovery directives can't be mixed with PEG () capture groups in the same grammar.
A tree-sitter-style query language runs structural patterns against parse trees. Queries are S-expression-shaped with @captures, [alternation], ?/*/+ quantifiers, . anchors, and #predicates?:
; Capture every Factor.
(Factor) @factor
; Capture Factors whose text matches a regex.
((Factor) @odd
(#match? @odd "[13579]"))
; Capture Terms that contain at least two Factors.
(Term (Factor) (Factor)) @mul
; Capture the first Term in an Expr.
(Expr . (Term) @first-term)Run with zpars query:
$ zpars query examples/calc.peg examples/calc.scm "1+2*3"
pattern: 3
capture: name=first-term, range=[0,1], text='1'
pattern: 0
capture: name=factor, range=[0,1], text='1'
pattern: 1
capture: name=odd, range=[0,1], text='1'
pattern: 2
capture: name=mul, range=[2,5], text='2*3'
pattern: 0
capture: name=factor, range=[2,3], text='2'
pattern: 0
capture: name=factor, range=[4,5], text='3'
pattern: 1
capture: name=odd, range=[4,5], text='3'
The query layer also recognises the recovery-era node kinds: (ERROR) @e matches error_nodes, (MISSING) @m matches missing_nodes, and (Rule partial) filters to rule_partial nodes specifically.
Built-in predicates: #eq?, #match? (ERE, compiled at query-compile time), #any-of?, #contains?, each with a #not- form. #vim-match? / #lua-match? are aliased to #match? for nvim-treesitter compatibility. Unknown predicates pass through (tree-sitter convention).
Bare string atoms in queries match anonymous-token nodes (literal-text matches in the parse tree). Token emission is opt-in via --tokens=<mode> on the CLI (or Compiler.Options.token_events programmatically). Three modes:
--tokens=off-- no token nodes (default for non-PEG grammars).--tokens=tagged(default for PEG) -- only literals listed in#@ tokens "..."directives become token nodes. Backwards-compatible: other PEG tools see the directives as plain comments.--tokens=all-- every literal-matching opcode emits a token node. Most ecosystem-compatible, but inflates the event log.
#@ tokens "+" "*"
Expr <- Term ("+" Term)*
Term <- Factor ("*" Factor)*
Factor <- "(" Expr ")" / [0-9]+
; operators.scm -- captures every "+" and "*" anonymous-token node.
"+" @plus
"*" @star$ zpars query examples/calc-tokens.peg examples/operators.scm "1+2*3"
pattern: 0
capture: name=plus, range=[1,2], text='+'
pattern: 1
capture: name=star, range=[3,4], text='*'
Both the VM and JIT emit event_token natively.
Field annotations let queries select children by role rather than position. zpars stays PEG-compliant: fields are declared via backwards-compatible #@ field comment directives (other PEG tools see them as plain comments), then referenced from queries with the standard tree-sitter name: (pattern) syntax.
#@ tokens "function"
#@ field Func name = Ident
#@ field Func body = Body
Func <- "function" _ Ident _ Body
Ident <- [a-z]+
Body <- "{" _ "}"
_ <- " "*
When the same target appears multiple times in a rule body, an explicit ordinal disambiguates: #@ field Bin left = Expr#1, #@ field Bin right = Expr#2. Targets can also be quoted literals (= "function") for tagging anonymous-token children.
; func.scm
(Func
name: (Ident) @fname
body: (Body) @fbody)$ zpars query examples/func.peg examples/func.scm "function foo{}"
pattern: 0
capture: name=fname, range=[9,12], text='foo'
capture: name=fbody, range=[12,14], text='{}'
Field events are auto-enabled when the PEG parser collects at least one #@ field directive. Both the VM and JIT emit event_field natively.
Multi-pattern groupings are supported: ((a) (b) (#pred? ...)) matches when the visited node has children matching a and b in order (gap-allowed by default; use . between them for strict adjacency). Predicates fire after the full sequence binds, so a failed predicate triggers backtracking through quantifiers naturally. Single-pattern groupings (((Rule (Factor)+ @f) (#eq? @f "x"))) get the same treatment: the group's predicates fold into the inner's child-sequence match so a rejected greedy bind retries with a shorter count.
The CLI prints one entry per pattern match by default; pass -c (or --captures) to flatten the output into one entry per capture, ordered by source position. JSON forms are available for both modes via -j. An empty result set prints no matches to stderr.
See examples/ for sample grammars (calc, email, JSON, URI, recovery, queries) and inputs.
zpars check <file> # validate a grammar
zpars compile <file> -o <output> # compile grammar to native .zpar blob
zpars fmt <file> # format a grammar
zpars match -r <rule> <file> <input> # match input against a rule
zpars run <blob> <input> # run a compiled .zpar blob
zpars tree [-j] [-p|--jit] <file> <input> # print parse tree (S-exp default, -j JSON)
zpars vm [-t] [-p] <file> [<input>] # disassemble and run via VM
zpars query [-j] [-c] [--tokens=off|all|tagged] <grammar> <query-file> <input> # run a tree-sitter-style query
Validate an ABNF grammar, reporting syntax errors and semantic issues:
$ zpars check grammar.abnf
grammar.abnf:1:12: error: expected element, found ')'
foo = (a / )
^
$ zpars check grammar.abnf
grammar.abnf: warning: rule 'helper' is defined but never referenced
grammar.abnf: error: rule 'start' references undefined rule 'missing'
Parse and reformat a grammar with aligned = signs:
$ zpars fmt grammar.abnf
number = 1*DIGIT
pair = number "," number
Match an input string against a rule:
$ zpars match -r version grammar.abnf "HTTP/1.1 OK"
HTTP/1.1
The VSCode extension provides semantic highlighting for ABNF, BNF, PEG, CFG, and S-expression grammars, powered by the zpars WASM build.
Install from Open VSX.
Requires Zig 0.16.0+.
zig build # build the CLI executable
zig build test # run all tests
zig build bench # run benchmarks (ReleaseFast, 1M iterations)
zig build lsp # build the LSP server
zig build vsx # build the Open VSX extension (WASM + TypeScript)
- RFC 5234 - Augmented BNF for Syntax Specifications: ABNF
- RFC 7405 - Case-Sensitive String Support in ABNF
- Report on the Algorithmic Language ALGOL 60 (1960) - original BNF definition (Section 1.1)
- Parsing Expression Grammars: A Recognition-Based Syntactic Foundation (2004) - Bryan Ford's PEG paper
- IEEE Std 1003.1 - Extended Regular Expressions (ERE) - POSIX ERE specification (Section 9.4)
- RFC 9804 - S-Expressions
- A Text Pattern-Matching Tool based on Parsing Expression Grammars (2009) - Ierusalimschy's LPeg parsing machine
- Packrat Parsers Can Support Left Recursion (2008) - Warth et al.'s seed-growing algorithm for left recursion in PEGs
- Maidl, A. M., Mascarenhas, F., Medeiros, S., & Ierusalimschy, R. (2016). Error Reporting in Parsing Expression Grammars - labeled-failure model
- Medeiros, F., & Ierusalimschy, R. (2018). Syntax Error Recovery in Parsing Expression Grammars - the recovery model zpars's
#@directives implement