Calculating the depth of a parse tree

Antlr4 grammars can sometimes produce very deep parse trees. This can be a problem because deep parse trees often indicate grammars that are unoptimized, which means they can be slow. The question is: How can we compute the maximum parse tree depth?

Trash can compute the answer to this question.

Computing the maximum parse tree depth

XPath can compute the maximum tree depth using the following expression: max(//*[not(*)]/count(ancestor::*)). This expression means “find a leaf node (any node that has no children, //*[not(*)]), compute the number of ancestors it has (count(ancestor::*)), and find the maximum of all these values (max(...)).”

This Bash script, max-depth.sh, computes the max depth of the parse tree:

#

while getopts 'xh' opt; do
    case "$opt" in
    x)
        set -x
        ;;
    ?|h)
        cat - <<EOF
NAME
       $(basename $0) - get the depth of the deepest node in the parse tree

SYNOPSIS
       $(basename $0) ([-x | -h])* [files]

DESCRIPTION
       This script must be run under Linux Bash or Windows MSYS2 Bash or Windows WSL Linux.

OPTIONS
    -h
        Output this help message.
    -x
        Execute "set -x" to debug script.

EXAMPLE USAGE

EOF
        exit 0
        ;;
    esac
done
shift $((OPTIND - 1))
files=("$@")
temp=`mktemp`
if [ ${#files[@]} -gt 0 ]
then
    trparse ${files[@]} > $temp
else
    cat - > $temp
fi

cat $temp | trxgrep ' max(//*[not(*)]/count(ancestor::*))'

rm -f $temp

Example

Grammar:

grammar Expr;
start_ : e (';' e)* EOF ;
e : '-' e | e ('*'|'/') e | e ('+'|'-') e | '(' e ')' | INT ;
INT : [0-9]+ ('.' [0-9]+)? ;
WS : [ \r\n\t] + -> channel(HIDDEN) ;

Input:

1+((2))/3*4

Parse tree for input, with max depth of 7:

Execution of Script:

10/12-20:58:46 ~/temp/Generated-CSharp
$ cat in.txt
1+((2))/3*4
10/12-20:58:49 ~/temp/Generated-CSharp
$ trparse in.txt | bash /c/Users/Kenne/Documents/GitHub/g4-scripts/max-depth.sh
CSharp 0 in.txt success 0.0306775
7
10/12-20:59:19 ~/temp/Generated-CSharp

Some interesting notes

The Antlr4 grammar and input with the greatest depth is java20 and ManyStringsConcat.java, which has a max depth of 932. The input contains an expression of 900 string concatenations. The parse tree looks absolutely ridiculous.

We can find the location in the input where the tree has maximum depth as well, via: trparse -l examples/ManyStringConcat.java | trxgrep ' //*[not(*) and count(ancestor::*) = max(//*[not(*)]/count(ancestor::*))]/(self::node|..)' | trcaret: However, I should note that computing this XPath expression is very time consuming (see this StackOverflow answer). It takes over 25 minutes on my machine.

$ date;trparse -l examples/ManyStringsConcat.java | trxgrep ' //*[not(*) and count(ancestor::*) = max(//*[not(*)]/count(ancestor::*))]/(self::node|..)' | trcaret;date
Fri, Oct 13, 2023  6:26:29 AM
CSharp 0 examples/ManyStringsConcat.java success 0.1356061
L4:         "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" +
            ^
Fri, Oct 13, 2023  6:52:27 AM
10/13-06:52:27 ~/issues/g4-3766/java/java20
$

A better expression would be to compute the max depth first in a for-loop, then another for-loop to go through each of the values.

$ date; trparse -l examples/ManyStringsConcat.java | trxgrep ' for $i in max(//*[not(*)]/count(ancestor::*))
    return
        (for $j in (//*[not(*) and count(ancestor::*) = $i])
            return $j/(self::node|..))' | trcaret; date
Fri, Oct 13, 2023  7:09:55 AM
CSharp 0 examples/ManyStringsConcat.java success 0.1353174
L4:         "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" + "x" +
            ^
Fri, Oct 13, 2023  7:09:57 AM
10/13-07:09:57 ~/issues/g4-3766/java/java20