Produce a parser for complex strings with ANTLR

Facebook allows multiple features to produce a more friendly post like @mention to tag your friends. Imagine they want to allow HTML to be inserted into users' posts to produce more dynamic ads. Facebook will have to parse each entry to check the structure, and also the link inside to protect against JS injectors and the like.

The next question is, how do you parse the user's post?

The first common answer is to build your own parser. Create recursive loops to handle recursive <a><p><a><p><a></a></p></a></p></a> cases, to detect the class on  <a class="small main-text"> tags.

The parser can be as complicated as the features available on the chat.

You can already imagine how many hours of work you will have to do to develop this parser.

ANTLR (ANother Tool for Language Recognition) helps you build your parser faster. ANTLR is to parser what EntityFramework is to database access, a must-have to increase your productivity for a specific use.

An input always represents a language. It can be formal language in a conversation `Hello, I'm Pierre`, an arithmetic operation for maths `2+3-4*12`, or in our case, an post input language `<p>This is my <a href="http://...">post</a></p>`. We separate HTML from the post language because we can assume that Facebook will disable some HTML tags.

In other words, a parser transforms a sentence in a specific language into an object.

An input always represents a language. It can be a formal language in a conversation Hello, I'm Pierre, an arithmetic operation for math 2+3-4*12, or in our case, a post language <p>This is my <a href="http://...">message</a></p>. We separate the HTML from the post language because we can assume that Facebook will disable some HTML tags.

In other words, a parser turns a sentence in a specific language into an object.

antlr4/index.md at master · antlr/antlr4
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. - antlr4/index.md at master · antlr/...

Let's dive into ANTLR.

Build ANTLR grammar

Every language has its own grammar. It represents the way a sentence is produced. The major European languages have the same grammar for producing a sentence: it starts with a capital letter, contains an unlimited combination of letters and punctuation marks, and ends with a period.

In our case, we will try to produce a simple arithmetic language with addition and subtraction. No parentheses, no multiplication, keep it simple.

The simplest operation is 2+3. It consists of a number, followed by a symbol, followed by another number. A more complex operation is composed of an unlimited number of operations and numbers: 2+3-4+5+6.

An operation is represented by a number followed by an unlimited combination of a symbol and a number. It is written number (symbol number)* where * means that the pattern in parentheses can be repeated indefinitely. A symbol can be + or -, and a number can be an integer or a decimal.

We describe these steps in a ArithmeticLanguage.g4 file with the following grammar:

grammar ArithmeticLanguage; 

// Core building blocks 
DIGIT: [0-9]; 

// identity includes keywords too 
int: '-'? DIGIT+; 
decimal: '-'? DIGIT+ '.' DIGIT+; 

symbol: '+' 
	| '-'; 
number: int 
	| decimal; 
    
operation: number (symbol number)*;

ANTLR owns a lot of examples on their GitHub, check it out:

GitHub - antlr/grammars-v4: Grammars written for ANTLR v4; expectation that the grammars are free of actions.
Grammars written for ANTLR v4; expectation that the grammars are free of actions. - GitHub - antlr/grammars-v4: Grammars written for ANTLR v4; expectation that the grammars are free of actions.

Parse

The grammar is useless on its own. Without a parser, it's impossible to produce the architecture from the string. The output of the parser is a tree representation of the input.

operation
    number
    	int
            2
    symbol
    	+
    number
    	int
            3

Don't worry, ANTLR has planned everything. They produce excellent Nuget packages to create the parser instead of doing it manually.

  • Antlr4.Runtime to interpret Antlr files.
  • Antlr4.CodeGenerator to generate Parser classes from the grammar file.

Build the project with the grammar inside to generated parser classes. Check it out in obj > Debug > net

We can use them to produce the arithmetic language parser.

private static ArithmeticLanguageParser GetArithmeticLanguageParser(string target) 
{ 
    var inputStream = new AntlrInputStream(target); 
    var lexer = new ArithmeticLanguageLexer(inputStream); 
    var commonTokenStream = new CommonTokenStream(lexer); 
    var parser = new ArithmeticLanguageParser(commonTokenStream); 
    return parser; 
}

There is two important tools:

  • Lexer which reads the input and apply tokens to each parts
  • Parser which build the tree from the tokens

The parser contains a butch of functions that allows calling each kind of element. Depends on the string, we need to call the right function. ANTLR defines by it own all function from the grammar file. Since we evaluate an entire operation, we'll use  parser.operation().

public static void ParseQuery(string query) 
{ 
    var parser = GetArithmeticLanguageParser(query);
    var tree = parser.operation();
}

The method ToStringTree() shows a stringified representation of the expression. The tree contains each element of our operation: the number 2, the symbol +, the number 3.

Visitor

The last step consists to navigate inside the tree to produce the result. It can be an object, a list of objects, or in our case, a decimal result.

ANTLR generated the class ArithmeticLanguageBaseVisitor<T>, which takes a type and contains all available functions to override. Each element declared in the grammar has a specific visitor function (e.g. VisitDecimal() and VisitOperation()). The return type of these functions depends on the Visitor class.

Since VisitDecimal must return a decimal, and VisitSymbol must return an enumerator, two visitor classes are needed to solve this problem:  ArithmeticLanguageDecimalVisitor and ArithmeticSymbolVisitor, for the decimal and ESymbol types respectively (assuming it is an enumeration).

Each Visit function takes a context as parameter. It contains multiple functions, including two important ones.

  • GetText() to convert the content as a string.
  • GetChilds() to get the inside of the context.

Let's take an example with OperationContext. GetText() will return the integer expression 2+3 and GetChilds() will return 3 elements: ArgumentContext for the number 2, SymbolContext for the symbol and another ArgumentContext for the number 3.

The easiest way to produce the visitor is to start from the bottom, i.e. from the final parts of the parser. In our case: VisitDecimal, VisitInt and VisitSymbol.

public class ArithmeticLanguageDecimalVisitor : ArithmeticLanguageBaseVisitor<decimal>
{
	public override decimal VisitDecimal(ArithmeticLanguageParser.DecimalContext context)
	{
		return decimal.Parse(context.GetText());
	} 

	public override decimal VisitInt(ArithmeticLanguageParser.IntContext context)
	{
		return decimal.Parse(context.GetText());
	}
}
public class ArithmeticLanguageSymbolVisitor : ArithmeticLanguageBaseVisitor<ESymbol>
{
    public override ESymbol VisitSymbol(ArithmeticLanguageParser.SymbolContext context)
    {
        switch (context.GetText())
        {
            case "+":
                return ESymbol.Plus;
            case "-":
                return ESymbol.Minus;
            default:
                throw new NotImplementedException();
        }
    }
}

Every part of the operation is processed. Symbols and numbers are supported. Only the global operation is missing.

As a reminder, OperationContext is composed :
operation: number (symbol number)*;.

OperationContext starts with a number, then contains an infinite number of (symbol, number) pairs in that order. The result will contain the value of the first child, then a loop will apply the symbol between the result and the value of the current child. In this way, the result is calculated during the visit of the operation.

public class ArithmeticLanguageDecimalVisitor : ArithmeticLanguageBaseVisitor<decimal>
{
    #[...]
    public override decimal VisitOperation(ArithmeticLanguageParser.OperationContext context)
    {
        var symbolVisitor = new ArithmeticLanguageSymbolVisitor();
        var result = VisitArgument(context.GetChild(0) as ArithmeticLanguageParser.ArgumentContext);

        // We group two by two, since we have in this case 1 symbol followed by 1 number
        for (int i = 1; i < context.ChildCount; i = i+2)
        {
            var symbol = symbolVisitor.VisitSymbol(context.GetChild(i) as ArithmeticLanguageParser.SymbolContext);
            var number = VisitArgument(context.GetChild(i+1) as ArithmeticLanguageParser.ArgumentContext);

            switch (symbol)
            {
                case ESymbol.Plus:
                    result = result + number;
                    break;
                case ESymbol.Minus:
                    result = result - number;
                    break;
                default:
                    throw new NotImplementedException();
            }
        }
        return result;
    }
}

The ParseQuery function now includes the ArithmeticLanguageDecimalVisitor() to get a decimal result from the tree, and returns it.

public static decimal ParseQuery(string query)
{
    var parser = GetArithmeticLanguageParser(query);
    var tree = parser.operation();
    var stringifiedTree = tree.ToStringTree();
    var visitor = new ArithmeticLanguageDecimalVisitor();
    var result = visitor.Visit(tree);
    return result;
}

We can use our new function to calculate simple operation, like "2+3-4".

public void Test1()
{
    var query = "2+3-4";
    var result = ParseQuery(query);
}

It gets 1 as a result. You can add as many operations as you want, like "2+3-4-8+12+3+20", it will still work.

Perfect!

To conclude

This is a simple example to show the power of ANTLR. It is composed of 2 important parts:

  • Parser to transform the input into a tree using the grammar.
  • Visitor to go inside the tree and transform it into an object.

Remember to use it only in special cases where it is easier to represent an object as a string than as a structured object. Arithmetic operations are one of these cases.

To go further, you can implement more operations to create a complete arithmetic language. You can check this article, which allows you to go deeper.

The ANTLR Mega Tutorial
The definitive ANTLR mega tutorial on ANTLR4. Learn everything you need to know; with code in JavaScript, Python, Java and C#.

Enjoy!