PEG grammar optimizer and formatter. Supports any grammar supported by PackCC parser generator.
Pegof parses peg grammar from input file and extracts it's AST. Then, based on the command it either just prints it out in nice and consistent format or directly as AST. It can also perform multi-pass optimization process that goes through the AST and tries to simplify it as much as possible to reduce number of rules and terms.
pegof [<options>] [--] [<input_file> ...]
-h/--help
Show help (this text)
-V/--version
Show version and exit
-c/--conf FILE
Use given configuration file
-v/--verbose [LEVEL]
Increase verbosity of logging by LEVEL (defaults to 1), may be repeated
-d/--debug
Output very verbose debug info, implies max verbosity
-S/--skip-validation
Skip result validation (useful only for debugging purposes)
-b/--benchmark SCRIPT
Benchmarking script, see documentation for details
-D/--debug-script SCRIPT
Debugging script, see documentation for details
-f/--format
Output formatted grammar (default)
-a/--ast
Output abstract syntax tree representation
-g/--graph
Output description of the grammar in GraphViz format
-p/--packcc
Output source files as if the grammar was passed to packcc
-P/--packcc-options
Additional comma separated options passed to packcc
Supported options are 'lines', 'ascii' and 'debug' and also their short forms 'a', 'l' and 'd'
Note: --lines might not work as expected, because temporary file is used
-n/--inplace
Modify the input files, use with caution
-i/--input FILE
Path to file with PEG grammar, multiple paths can be given
Value "-" can be used to specify standard input
Mainly useful in config file
If no file or --input is given, read standard input
-o/--output FILE
Output to file (should be repeated if there is more inputs)
Value "-" can be used to specify standard output
-I/--import PATH
Directory where to search for import files
May be repeated for multiple locations
-H/--header
Whether to write "Generated by pegof" header
Possible values are 'never, 'always' or 'auto' (which is the default)
-q/--quotes single/double
Switch between double and single quoted strings (defaults to double)
-w/--wrap-limit N
Wrap alternations with more than N sequences (default 1)
-t/--indent FORMAT
How to indent long rules
FORMAT may be 'sN' for spaces or 'tN' for tabs, where N is a number (e.g.: 't1' for single tab)
Some sane literal strings are also accepted (e.g.: ' ' or '\t\t')
Default is 4 spaces
-O/--optimize OPT[,...]
Comma separated list of optimizations to apply
-X/--exclude OPT[,...]
Comma separated list of optimizations that should not be applied
-l/--inline-limit N
Minimum inlining score needed for rule to be inlined
Number between 0.0 (inline everything) and 1.0 (most conservative)
Default is 0.2
Only applied when inlining is enabled
-N/--no-follow
Do not inline imported files while optimizing
-
all
All optimizations: Shorthand option for combination of all available optimizations. -
concat-char-classes
Character class concatenation: Join adjacent character classes in alternations into one. E.g.[AB] / [CD]
becomes[ABCD]
. -
concat-strings
String concatenation: Join adjacent string nodes into one. E.g."A" "B"
becomes"AB"
. -
double-negation
Removing double negations: Negation of negation can be ignored, because it results in the original term (e.g.!(!TERM)
->TERM
). -
double-quantification
Removing double quantifications: If a single term is quantified twice, it is always possible to convert this into a single potfix operator with equel meaning (e.g.(X+)?
->X*
). -
empty-action
Removing empty actions: Actions that contain only whitespace are discarded. -
inline
Rule inlining: Some simple rules can be inlined directly into rules that reference them. Reducing number of rules improves the speed of generated parser. -
none
No optimizations: Shorthand option for no optimizations. -
normalize-char-class
Character class optimization: Normalize character classes to avoid duplicities and use ranges where possible. E.g.[ABCDEFX0-53-9X]
becomes[0-9A-FX]
. -
remove-group
Remove unnecessary groups: Some parenthesis can be safely removed without changeing the meaning of the grammar. E.g.:A (B C) D
becomesA B C D
orX (Y)* Z
becomesX Y* Z
. -
repeats
Removing unnecessary repeats: Joins repeated rules to single quantity. E.g. "A A*" -> "A+", "B* B*" -> "B*" etc. -
single-char-class
Convert single character classes to strings: The code generated for strings is simpler than that generated for character classes. So we can convert for example[\n]
to"\n"
. -
unused-capture
Removing unused captures: Captures denoted in grammar, which are not used in any source block, error block or referenced (via$n
) are discarded. -
unused-variable
Removing unused variables: Variables denoted in grammar (e.g.e:expression
) which are not used in any source oe error block are discarded.
Configuration file can be specified on command line. This file can contain the same options as can be given on command line, only without the leading dashes. Short versions are supported as well, but not recommended, because config file should be easy to read. It is encouraged to keep one option per line, but any whitespace character can be used as separator.
Following config file would result in the same behavior as if pegof was called without any arguments:
format
input -
output -
double-quotes
inline-limit 10
wrap-limit 1
Pegof makes it simple to measure how the optimizations affect the size and speed of the final code.
When the parameter --benchmark
is passed, it's argument will be used as benchmarking script.
The benchmark is done twice - first when the input grammar is parsed and second time when
all the optimizations are done, so the results can be easily compared.
The actual script is called with two parameters:
<benchmark_script> [setup|benchmark|teardown] <basename>
setup
: when called with this argument, the script should set up the environment for the benchmark (e.g.: compile the code, prepare input data)benchmark
: this phase is the only one actually measured, so it should only do the actual parsingteardown
: is passed to clean-up after the benchmark (e.g. delete the compiled files or input data)<basename>
is always the base path to sources generated from the grammar
To get a better idea how to implement such script, see examples in the benchmark/scripts directory. The usage is like this:
pegof --optimize all --benchmark benchmark/scripts/json.sh --output /dev/null benchmark/grammars/json.peg
The output will look somewhat like this:
| lines | bytes | rules | terms | duration | memory
---------+------------+------------+------------+------------+------------+-----------
input | 1773 | 60271 | 10 | 56 | 341 | 29972
output | 1819 | 62053 | 4 | 69 | 298 | 21512
output % | 102% | 102% | 40% | 123% | 87% | 71%
Here input
line shows values before any optimization and output
are values after everything is optimized
according to the provided options. Meaning of the columns in the benchmark result table is following:
lines
: number of lines in the C code generated bypackcc
bytes
: length of the generated C code in bytesrules
: number of rules in the grammarterms
: number of terms in the grammarduration
: how long the benchmark ran in millisecondsmemory
: peak resident set memory in kB (only measured if GNU Time or BusyBox are installed)
Since pegof is still under development, it may sometimes contain bugs. There are two options that help to find out
what is happening under the hood. The first is --debug
which turns on maximum verbosity and show in great detail
all the optimization decision that are considered and used. Second option is --debug-script
, which is aimed
at more subtle bugs, where the grammar can be processed and compiled, but produces unexpected results. The script
passed to this option is run after each optimization pass, so it can be used to pinpoint which optimization causes
the problem. The only parameter passed to the scripts commandline is a name of file containing current state of the
grammar. The debug script should return 0 if the grammar works as expected. Non-zero exit code means that this
optimization iteration broke the grammar and pegof will immediately stop its execution.
This application was written with readability and maintainability in mind. Speed of execution is not a focus. Some of very big grammars (e.g. for Kotlin language) can take few minutes to process.
Pegof uses cmake. To build it just run:
cmake -B ./build
cmake --build ./build --parallel 4
cmake --build ./build --target test # optional, but recommended
Building on non-linux platforms has not been tested and might require some modifications to the process or even to the application itself.
If you just want to give pegof a quick try, without going through the hassle of compilation, there are two simple options.
There is a slightly limited version of pegof running in browser, which you can find at https://dolik-rce.github.io/pegof. It is only intended as a playground, not for serious work.
It uses the same code as the command-line, just compiled using emscripten to WebAssembly. This limits how it can be used, so some features are not available or do not make sense (e.g. operating on multiple files, imports or in-place changes).
If you want to try full version of pegof, but still avoid compilation, you can use a docker image, e.g.:
docker pull dolik/pegof:latest
# running without arguments just prints usage information
docker run dolik/pegof:latest
# to optimize a grammar you can call it like this:
docker run -i dolik/pegof:latest --verbose --optimize all < grammar.peg > optimized.peg
# to format all grammars in a directory, you'll have to mount it:
docker run -it -v "$PWD/grammars:/grammars" dolik/pegof:latest --verbose --inplace grammars/*.peg
Big thanks go to Arihiro Yoshida, author of PackCC for maintaining the great and very useful tool.