A comparison of Antlr grammars for parsing Java

This is an article about the ongoing research I am working on regarding the state of Antlr grammars for parsing Java. Some of the tests are taking weeks of computing time, so the results are preliminary.

Antlr is a popular LL(*) parser generator for recognizers of C#, Java, and many other programming languages. For Java, there are three grammars available on the Antlr grammar website: Java, Java8, and Java9. If you are a developer who hasn’t followed the maintenance history, it is unclear which grammar one should choose. Some of the changes that have been made to one grammar have not been applied to the other grammars. The basis for all grammars, however, is The Java Language Specification. Unfortunately, the latest available now is version 13 making all of them out of date.

The Java grammar was written by Parr originally with left recursion and common left-factors removed. (For additional information on left recursion removal, see this blog series.) The code has not been updated in the last year and apparently does accept anything more recent than Java 8. After the release of Antlr version 4, left recursion and common left-factors were allowed. The Java8 code was written by Harwell and Parr with this in mind. It has been last updated a few months ago. The Java9 code was forked by Chan from the Java8 grammar and has been last updated two years ago. It apparently accepts Java 9 source code. Both Java8 and Java9 are very slow.

Which brings us to the following questions: How well do these grammars perform? How well do they accept currently available open-source code? Is it possible to improve the performance of the grammars? Which one would be best to choose to update to the current specification?

In order to answer these questions, we studied the Java and Java9 grammars for performance and acceptance of several open-source projects.

Methods

1. Two grammars for Java that were tested:

2. Tested runtime targets

  • Java
  • C#

3. The Java source code that was used for the tests:

  • http://hg.openjdk.java.net/jdk/jdk
    • 17234 files; 4816078 lines. (Using the Java grammar, 25349955 tokens; 38019443 tree nodes.
  • )https://github.com/AndroidSDKSources/android-sdk-sources-for-api-level-5.git
    • 11101 files; 2803192 lines. (Using the Java grammar, 16567260 tokens; 25457752 tree nodes in all parse trees generated.)
  • https://github.com/AndroidSDKSources/android-sdk-sources-for-api-level-29.git
    • 11473 files; 4175779 lines. (Using the Java grammar, 23624317 tokens; 35668427 tree nodes.)

4. Source and scripts for testing are here.

5. Machine AMD Ryzen 7 2700 eight-core processor, ASRock B450 Gaming-ITX/ac, 16 GB DDR4-2666 (1333 MHz) memory; Antlr 4.7.2 tool and runtime; Visual Studio 2019; Java SE 11.0.4.

Preliminary results:

  • In the Java grammar, white space is shunted into the HIDDEN channel. In the Java9 grammar, white space is thrown away via the “skip” Antlr keyword. After adjusting the grammars and token types, Antlr generates lexers that produce the same token input stream from anecdotal evidence of a few large test cases.
  • A Java source that ran particularly slow for the Java9 grammar compared to the Java grammar was android-sdk-sources-for-api-level-5/com/android/phone/BluetoothHandsfree.java. This file produced a large number of parse tree nodes associated with expressions, e.g., primary, assignmentExpression, expression, conditionalExpression, conditionalOrExpression, conditionalAndExpression, exclusiveOrExpression, inclusiveOrExpression, andExpression, equalityExpression, relationalExpression, shiftExpression, additiveExpression, multiplicativeExpression, postfixExpression, unaryExpressionNotPlusMinus, unaryExpression, identifier, etc. The Java9 grammar disambiguates expression using the usual refactoring, but notably includes left-recursive productions for andExpression, equalityExpression, relationalExpression, shiftExpression, additiveExpression, etc. Although Antlr handles these productions, it it very expensive to use.
  • Running a Net Core program through dotnet.exe against a stub input program has a minimum runtime of 1.2s.
  • For Net Core, the Java9 grammar ran on average 40x’s slower compared to the Java grammar for the Android 5 source code.
  • There were not any differences in the parse trees constructed between C# and Java for the Java grammar.
  • Out of 39673 Java source files, 134 parsed with errors with the Java grammar.
  • The file android-sdk-sources-for-api-level-29/com/android/server/pm/PackageManagerService.java parsed particularly slow with Java9: 18 m 50 s!

Update November 11, 2019

Posted in Tip

Notes on Language Server Protocol

Surprise! I’ve been implementing something like the Language Server Protocol (LSP) with AntlrVSIX. The LSP is an abstraction for an editor (ed) using JSON as the lingua franca between the editor and a GUI client. The editor server: offers the persistence of code files; organizes files in a “workspace”; provides an API for edits, tagging, go to def, find all refs, reformat, code completion, defs for tooltips, etc. LSP is a step in the right direction because it separates an editor server backend from a GUI frontend, which is what AntlrVSIX is about.

What is not surprising is that the LSP handles things a little different than what I did. In AntlrVSIX, parse trees were shared with the GUI code. But LSP doesn’t share them, rightly so if you don’t want to swamp the link between GUI and server. But, LSP doesn’t seem to offer a “workspace” model that has nested “projects”, or attributes associated with a document. In VS, this information encodes whether a file participates in the build, compiler options, etc. LSP also hardwires the classes of symbols in a file–it can’t handle an Antlr grammar because there is no class for terminal or non-terminal symbols. In AntlrVSIX, classifications are strings, so it can represent anything.

The next release of AntlrVSIX is taking a while. I’ve been cleaning it up considerably and fixing bugs.

–Ken

Posted in Tip

Adding “workspaces” to AntlrVSIX

After a lot of hemming and hawing, I am now adding the concept of “workspaces” to AntlrVSIX. By this I mean an equivalent of workspaces that is defined in Roslyn. In AntlrVSIX, a Document will be a source file; a Project will be a collection of Documents; a program will be a Workspace, a collection of Projects. Properties on a Project or Document are copied to the equivalent AntlrVSIX object as a property list.

The reason for this is clear if you consider what constitutes a program for languages like C# or Java. But, Antlr has a similar issue: a grammar can be “imported” into another, and the scope of the grammar symbols is local to the project.

This required a lot of changes to AntlrVSIX, and will be released as v4.0. Although there aren’t many new features, it is a significant change nonetheless.

Note: I found that obtaining the property lists for the older csproj formats very, very slow, while the newer format much faster. For AntlrVSIX itself in VS 2019, it takes an additional 10 seconds if I query the property lists for the 10 projects contained in the solution. Investigating this with the profiler, I found that querying the Value of a property runs especially slow. I now only query the FullPath property and ignore the others. I’m planning on performing a lazy evaluation of EnvDTE properties for properties in general because the Microsoft code that performs the parsing and semantic analysis of build files is exceedingly slow.

–Ken

Posted in Tip

Updates to AntlrVSIX

I’ve been busy extending AntlrVSIX to include new features and correct bugs. Some of the things added with the 40 or so changes since the beginning of September are:

  • Persistent option settings;
  • Tagging of Java symbols using a symbol table;
  • Improved reformatter for languages;
  • Improved symbol click-on highlighter;
  • Improved co-existence with other extensions.

What is more important is that I’ve been settling on how to write attribute evaluation for the parse tree using Antlr’s Listener pattern, using a symbol table on the Java parse tree. If done in an unstructured, undisciplined, and ill informed manner, the code could become quite disastrous.

For example, I looked for “listeners in parse tree traversal” and Google Search found in the top answers Saumitra’s blog, Jakub Dziworski’s blog, MIT OpenCouseWare page on Antlr, Positive Technoligies page on Antlr, but there is no discussion of how this framework would be used in attribute evaluation, which is discussed nicely in these lecture notes. As I’m not that disciplined to write a system to declare the formal semantic equations and an evaluator of those equations, I’ve settled to make sure to try to set attributes for a node within the node’s listener (either “enter” or “exit”). Otherwise, I wouldn’t know which listener is setting what attribute for a node. This is “basic” compiler stuff, but “basic” information gets lost when people start using these tools.

All this is important, and will be eventually applied to the Piggy transformation system.

More folks seem to be using it, maybe because there is a link from the Antlr.org developer page to the extension.

–Ken

Posted in Tip

Adding in a symbol table into AntlrVSIX

I’m now working on adding in a full symbol table implementation into AntlrVSIX. This will help make the extension much more powerful with tasks such as tagging and navigating to defs.

For better or worse, I’m starting with the implementation in Antlr Symtab. It looks like it’ll work out pretty well, but it is missing certain classes, and needs a little polishing. For example, VariableSymbol is the base class for FieldSymbol and ParameterSymbol, but there isn’t a class for variables defined and referenced in a method body. This makes it somewhat difficult to distinguish between a field and a local variable in a block.

Also, for better or worse, I’ve added Mono.Cecil into the symbol table code. I’m not sure whether I’ll need this, but it’s there just in case. Mono contains a reader of PE files, which might be useful. But, what I’m really thinking of is the equivalent over in Java and other languages.

Posted in Tip

Extending AntlrVSIX with Rust; Adding a general wrapper for CLI tools to Msbuild

I’ve started two new tasks.

(1) I’m adding a Rust parser to AntlrVSIX. The grammar I’m using is an old grammar from Jason Orendorff. Unfortunately, it hasn’t been updated for several years. Also, it looks like it isn’t correct because there are cases where tokens are not correctly defined, e.g., ‘>’ ‘=’ instead of ‘>=’. It’s incorrect because the parser would allow a space between the ‘>’ and ‘=’ in the input, which is probably not intended. I’m first updating the grammar by separating the parser and lexer grammars, then I’ll add in rules recognized by another grammar.

(2) I’m really disappointed in the fact that no one seems to have written a general wrapper for command-line tools for Msbuild. I wrote a wrapper for the Antlr Java tool, but I’ve noticed that I’m going to have to write yet again another wrapper for Rust’s tools. I now realize that all of these wrappers can be generalized into a single wrapper with a specification of inputs and outputs. When I’m finished with the Rust grammar, I plan to write this tool, and rewrite the Antlr Java wrapper with it. It is no wonder why it seems everyone is using Visual Studio Code. I would too, but it’s written is the most awful of all languages, Javascript.

Posted in Tip

Towards a Visual Studio IDE extension supporting any programming language

As I pointed out before, it now seems possible to write a VS extension that supports any programming language. I’ve been updating AntlrVSIX to input a description of the syntax of a programming language and tag a file of that language in the editor. I have this now working for Antlr and Java, and plan to try a dozen or so languages.

At issue is identifying what are the commonalities of a programming language, such as defining and applied occurrences, and how the editor should use that.

However, some extensions tag the programming language that my extension is also trying to tag, which results in multiple competing and conflicting classifiers and formatters. I solved this problem by changing the IContentType associated with a buffer to one that I define for the extension. But, ideally, I’d prefer to just “turn off” any formatters that are competing with my formatter.

It may not look like much, but here is how Java is formatted by Visual Studio by default, and how my extension formats it (first attempt to get something working).

Current VS 2019 “code++.java” formatting. Variable “token” labeled erroneously in comments.
Formatting from the Antlr extension. Squiggles indicate no defining occurrence in the file, but should be found via imported libraries.
Posted in Tip

Managed Extensibility Framework — a Static Type System

Now that I have my AntlrVSIX extension working for Visual Studio IDE, I have now set my sights on a “meta-language” editor. This picks up the idea of an add-in that supports Antlr, and brings it forward to an add-in that supports any language. For this to work, the grammar for the language would be specified in Antlr. I would also need to add in information on what parts of the parse tree constitute what classified types for tagging in the editor. A corpus would be needed for Codebuff to work on reformatting a file containing code in the language. All seems doable. Or is it?

This idea, however, isn’t quite possible with the current framework of Visual Studio Extensions, at least not dynamically. I was hoping to have the extension read this information and act on it at runtime.

As I pointed out in a Gitter message system–and never received a reply–the problem with this idea is that I use MEF (Managed Extensibility Framework), and virtually all samples that I can find are based in MEF. Unfortunately, MEF works on a statically defined type system, i.e., the types are created upfront and compiled into a .VSIX file. You must write C# attributes to describe the language it implements. E.g., this tagger provider and the file suffix associated with the language is hardwired at compile time:

    [Export(typeof(ITaggerProvider))]
    [ContentType(AntlrVSIX.Constants.ContentType)]
    [TagType(typeof(ClassificationTag))]
    internal sealed class AntlrClassifierProvider : ITaggerProvider
    {
        [Export]
        [Name(/* AntlrVSIX.Constants.LanguageName */ "Antlr")]
        [BaseDefinition("code")]
        internal static ContentTypeDefinition AntlrContentType = null;

        [Export]
        [FileExtension(/* AntlrVSIX.Constants.FileExtension */ “.g;*.g4”)] 
        [ContentType(AntlrVSIX.Constants.ContentType)]
        internal static FileExtensionToContentTypeDefinition AntlrFileType = null;
    …
 }

I can, of course, generate a per-language extension via a template, then build the extension. But I was hoping for something that would be more dynamic. Is there something more modern than MEF for implementing VS extensions, something that isn’t based on a static type system framework? Essentially, no. MEF is integrated very tightly with Visual Studio IDE.

I don’t understand why ITaggerProvider couldn’t simply require methods Name(string), BaseDefinition(string), and FileExtension(string), which could all be called during initialization in order to determine what the “contract” ITaggerProvider supports?

Update Aug 19 2019 — As it turns out, it is possible to create the language describing a type system dynamically without using MEF. Also, it seems one can create a tagger with a provider that doesn’t require the FileExtension to be specified. I’ve decided to try to enhance AntlrVSIX to test out whether I can isolate the description of the Antlr itself into a grammar and tables. If it works, I’ll write an extension that will support any language.

Posted in Tip

AntlrVSIX v2.0

Due to my work on Piggy and Campy, I’m extending my Antlr4 extension for Visual Studio in a number of ways. The plug-in hasn’t been updated for two years, and there are no extensions for Antlr that work with Visual Studio 2019, so it is due for an update. In fact, there isn’t a single extension for Antlr for Visual Studio 2019, and Antlr.org removed Visual Studio from the list of developer tools for Antlr.

Changes

Targeting Visual Studio 2019 and 2017

I’m not sure where I read it, but “good practice” says I should only support the current version and one version previous.

Improved Tagging

I’ve added new tags for channels and modes, and the whole classification and tagging routines now check if the source has changed before reparsing, which improves performance.

A menu for AntlrVSIX in Extensions

In Visual Studio 2019, the menus for extensions are where you should look–under “Extensions”. Previously, I have to admit that it was hard to figure out the UI for AntlrVSIX, hidden under various menus, or available through a right-click context menu. That’s basically bad UI design.

Although the right-click context menu for AntlrVSIX is still available, it mirrors exactly what you see in the main menu under “Extensions -> AntlrVSIX”.

Navigation to next and previous rules in a grammar file

I’m a big fan of the old-style Emacs and even older Vi! Back then, UI was keyboard-oriented. It was much more efficient instead of moving around this stupid mouse, pointing to something I can hardly see on the screen.

I’ve added a few new ways to navigate around a grammar file with some shortcuts. “Next rule” jumps to the next rule in a grammar; “Previous rule” jumps to the previous rule in the grammar; “Go to Visitor” navigates to a visitor method for a tree node corresponding to a non-terminal.

Generation of Visitors and Listeners

Go to visitor/listener jumps to C# code for grammar symbol. If the method doesn’t exist, then the Visitor or Listener classes and methods are generated.

Options for the extension

An options dialog box for extension-specific parameters is now included. It is needed for things like where to find the “corpus” of examples of formatted grammars that Codebuff will use to format your grammar.

Performance

While AntlrVSIX is the only extension for VS 2019, it’s rarely used compared to several other older–and in my opinion, less useful extensions. AntlrVSIX has had only one review, which says the performance sucks. (Thanks, dude for not saying *what* specifically are you trying to do that is slow. I’d never have you on my QA team–not helpful descriptions).

Performance has taken a more front seat with this release. Before parsing, I check whether the code buffer has changed.

Significant cleanup and bug fix of the source code

I spent a lot of the time cleaning up the source code. At the time when I first wrote the extension, I had no clue how to modify which feature in the UI because the documentation for Visual Studio extensions is absolutely terrible. I now have a better command of what is going on and fixed a lot of the code.

Migration from packages.config to PackageReference builds

It’s absolutely appalling that the examples for extensions Microsoft provides is out of date and are so poor. Many still use the “packages.config” file to list the NuGet dependencies of a C# project. I’ve updated AntrlVSIX to use the latest format, which required the tool to be migrated to Net 4.7.2.

Integrating builds with Antlr4BuildTasks

You can’t really have an extension for a language like Antlr if you don’t include some way of actually *building* the project that uses the language. After first writing AntlrVSIX, I then modified Antlr4cs, a wrapper for a C# version of the Antlr tool, to just be a wrapper the standard Antlr4 Java tool. So, it now generates the parsers/lexers in the IDE.

In addition, the Antlr4BuildTasks had a bug where one couldn’t change the build mode of the .g4 file to none. This is required for multi-file grammars that use the “import” statement in Antlr.

Fixing Intellisense

I fixed the Intellisense for the extension. The tooltip now works when hovering over a single-character grammar symbol. Command completion now offers a list of grammar symbols in the grammar.

*****

The work on the extension has been going on for two weeks and should be finished in the next week.

Posted in Tip

Porting extensions to Visual Studio 2019

I was recently was trying to use my Visual Studio extension for Antlr in Visual Studio 2019, when I found that it just isn’t working anymore. In fact, I couldn’t even install the extension because it wouldn’t even show up in the search for the plug-in. In fact, there weren’t any extensions available for “Antlr” for Visual Studio 2019! I guess I hadn’t ported the add-in to VS 2019, so I decided to work on that.

Continue reading
Posted in Tip