Counting parse tree nodes
Antlr4 parse trees have two node types:
- Parse tree rule nodes, which correspond to rules in the parser grammar;
- Parse tree leaf nodes, which correspond to rules in the lexer grammar.
Sometimes we are interested in the general size of a tree or subtree, so we can reason about the performance of a grammar. Antlr4 trees are converted in Trash to trees that interleave “off-channel” text. So, with Trash, there are additional nodes.
Parse tree rule nodes
Parser rules in an Antlr grammar begin with lowercase letters. We use this trait to query the nodes in a parse tree that begin with lowercase letters. XPath version 2 does not have a direct way to specify a character range, so we need to test the first character of the node name 26 different times. In practice, though, this is usually fast.
trparse input-file | trxgrep ' count(//*/name()[
starts-with(.,"a") or
starts-with(.,"b") or
starts-with(.,"c") or
starts-with(.,"d") or
starts-with(.,"e") or
starts-with(.,"f") or
starts-with(.,"g") or
starts-with(.,"h") or
starts-with(.,"i") or
starts-with(.,"j") or
starts-with(.,"k") or
starts-with(.,"l") or
starts-with(.,"m") or
starts-with(.,"n") or
starts-with(.,"o") or
starts-with(.,"p") or
starts-with(.,"q") or
starts-with(.,"r") or
starts-with(.,"s") or
starts-with(.,"t") or
starts-with(.,"u") or
starts-with(.,"v") or
starts-with(.,"w") or
starts-with(.,"x") or
starts-with(.,"y") or
starts-with(.,"z")
])'
Parse tree leaf nodes
Lexer rules in an Antlr grammar begin with uppercase letters. We use this trait to query the nodes in a parse tree to find the number of tokens in the tree. Note this does not consider off-channel nodes.
trparse input-file | trxgrep ' count(//*/name()[
starts-with(.,"A") or
starts-with(.,"B") or
starts-with(.,"C") or
starts-with(.,"D") or
starts-with(.,"E") or
starts-with(.,"F") or
starts-with(.,"G") or
starts-with(.,"H") or
starts-with(.,"I") or
starts-with(.,"J") or
starts-with(.,"K") or
starts-with(.,"L") or
starts-with(.,"M") or
starts-with(.,"N") or
starts-with(.,"O") or
starts-with(.,"P") or
starts-with(.,"Q") or
starts-with(.,"R") or
starts-with(.,"S") or
starts-with(.,"T") or
starts-with(.,"U") or
starts-with(.,"V") or
starts-with(.,"W") or
starts-with(.,"X") or
starts-with(.,"Y") or
starts-with(.,"Z")
])'
Alternative way to counting
Trash can be used in an alternative manner to count the number of node types
without the use of starts-with()
function call. Remember, Trash is quite
powerful because it can be combined with a shell scripting language.
trparse input-file | trxgrep ' //*/name()' | grep '^[a-z]' | wc
trparse input-file | trxgrep ' //*/name()' | grep '^[A-Z]' | wc
Further notes
It is natural to wonder whether one could use compare()
or >=
and <=
operators
instead of starts-with(...)
.
I have tried various alternatives, but they do not work.
trxgrep 'count(//*[substring(name(),1,1) >= "a" and substring(name(),1,1) <= "z"])'
trxgrep 'count(//*[compare(substring(name(),1,1),"a") >= 0 and compare(substring(name(),1,1),"z") <= 0])'
The reason is that comparison operators do not perform a lexicographic comparison
(e.g., strcmp()
in C), and compare()
is case-insensitive.