Counting parse tree nodes

Antlr4 parse trees have two node types:

Parse tree rule nodes, which correspond to rules in the parser grammar;
Parse tree leaf nodes, which correspond to rules in the lexer grammar.

Sometimes we are interested in the general size of a tree or subtree, so we can reason about the performance of a grammar. Antlr4 trees are converted in Trash to trees that interleave “off-channel” text. So, with Trash, there are additional nodes.

Parse tree rule nodes

Parser rules in an Antlr grammar begin with lowercase letters. We use this trait to query the nodes in a parse tree that begin with lowercase letters. XPath version 2 does not have a direct way to specify a character range, so we need to test the first character of the node name 26 different times. In practice, though, this is usually fast.

trparse input-file | trxgrep ' count(//*/name()[
		starts-with(.,"a") or
		starts-with(.,"b") or
		starts-with(.,"c") or
		starts-with(.,"d") or
		starts-with(.,"e") or
		starts-with(.,"f") or
		starts-with(.,"g") or
		starts-with(.,"h") or
		starts-with(.,"i") or
		starts-with(.,"j") or
		starts-with(.,"k") or
		starts-with(.,"l") or
		starts-with(.,"m") or
		starts-with(.,"n") or
		starts-with(.,"o") or
		starts-with(.,"p") or
		starts-with(.,"q") or
		starts-with(.,"r") or
		starts-with(.,"s") or
		starts-with(.,"t") or
		starts-with(.,"u") or
		starts-with(.,"v") or
		starts-with(.,"w") or
		starts-with(.,"x") or
		starts-with(.,"y") or
		starts-with(.,"z")
		])'

Parse tree leaf nodes

Lexer rules in an Antlr grammar begin with uppercase letters. We use this trait to query the nodes in a parse tree to find the number of tokens in the tree. Note this does not consider off-channel nodes.

trparse input-file | trxgrep ' count(//*/name()[
		starts-with(.,"A") or
		starts-with(.,"B") or
		starts-with(.,"C") or
		starts-with(.,"D") or
		starts-with(.,"E") or
		starts-with(.,"F") or
		starts-with(.,"G") or
		starts-with(.,"H") or
		starts-with(.,"I") or
		starts-with(.,"J") or
		starts-with(.,"K") or
		starts-with(.,"L") or
		starts-with(.,"M") or
		starts-with(.,"N") or
		starts-with(.,"O") or
		starts-with(.,"P") or
		starts-with(.,"Q") or
		starts-with(.,"R") or
		starts-with(.,"S") or
		starts-with(.,"T") or
		starts-with(.,"U") or
		starts-with(.,"V") or
		starts-with(.,"W") or
		starts-with(.,"X") or
		starts-with(.,"Y") or
		starts-with(.,"Z")
		])'

Alternative way to counting

Trash can be used in an alternative manner to count the number of node types without the use of starts-with() function call. Remember, Trash is quite powerful because it can be combined with a shell scripting language.

trparse input-file | trxgrep ' //*/name()' | grep '^[a-z]' | wc
trparse input-file | trxgrep ' //*/name()' | grep '^[A-Z]' | wc

Further notes

It is natural to wonder whether one could use compare() or >= and <= operators instead of starts-with(...). I have tried various alternatives, but they do not work.

trxgrep 'count(//*[substring(name(),1,1) >= "a" and substring(name(),1,1) <= "z"])'
trxgrep 'count(//*[compare(substring(name(),1,1),"a") >= 0 and compare(substring(name(),1,1),"z") <= 0])'

The reason is that comparison operators do not perform a lexicographic comparison (e.g., strcmp() in C), and compare() is case-insensitive.