Recognizing the Unrecognizable
And just like that, the parsing is done! It almost feels too easy... so you decide to ask that same coworker who helped you with the Scanner whether it's time to build an AST! They seem to be hiding a smirk as they ask you: what does the following program mean?
5 (( * 17 / --- (
It's obvious, right? It's, it's umm, hmm, take the (( and put down the * and then...
Okay, this actually makes no sense. "Arr", you say to yourself, "I know why they're smirking: while I can recognise all the valid tokens in the language, I still cannot recognise all the valid programs that can be written." You are all over it!
Hello Context Free Grammars (CFGs), the older, more powerful sibling to the much loved Regular Expressions. The reality is, there is about a 3% chance that CFGs could be explained in this one paragraph. So instead, below is a list of great resources that together cover them in detail:
- Neso Academy [CFG Overview: 7:51m]
- The Coding Train [Intro to Session 7: 14:29m]
- Ravindrababu Ravula [Intro to Lexical Analyser & Grammars: 19:50m]
- Stanford Online [Compiler Course Unit 3]
- Matt Might Article [The language of languages]
Assuming you stayed up all night learning all about CFGs, your manager is now asking why you haven't started running calculations on the new PDP-11s you were so excited about. You can't tell them the truth, and so you let them know that you are first double checking someone else's work. That should be enough to get them off your back for now! Okay, where were we again? That's right, CFGs. Working with nothing but intuition, you write up your first attempt at the grammar.
// Grammar 1.0
E -> E + E
| E - E
| E * E
| E / E
| (E)
| INTLITERAL
Hey, that looks good! It's small, easy to read, and straightforward enough to quickly understand what kind of programs will be generated. However, there is still a little issue hiding away in that grammar. Consider what happens when you take the leftmost (LM) and rightmost (RM) derivations for the program:
1 + 2 + 3
Leftmost
We know the drill! The goal here is to keep applying productions to the leftmost non terminal until we reach the target string. This means that INTLITERAL -> 1 should be the first token matched.
(1) E => E + E
(2) => E + E + E
(3) => INTLITERAL + E + E
(4) => 1 + E + E
(5) => 1 + INTLITERAL + E
(6) => 1 + 2 + E
(7) => 1 + 2 + INTLITERAL
(8) => 1 + 2 + 3
Got it! We can evaluate the string by doing a depth first [inorder] traversal of the parse tree, which gives us the expression below:
(1 + 2) + 3 = 6
Rightmost
If we didn't know the drill before, we do now! Similar goal, except this time we keep applying productions to the rightmost non terminal until we reach the target string. This means that INTLITERAL -> 3 should be the first token matched.
(1) E => E + E
(2) => E + E + E
(3) => E + E + INTLITERAL
(4) => E + E + 3
(5) => E + INTLITERAL + 3
(6) => E + 2 + 3
(7) => INTLITERAL + 2 + 3
(8) => 1 + 2 + 3
Lucky us, we managed to find it again, this time going the other way. Evaluating the parse tree this time gives us a slightly different expression below:
1 + (2 + 3) = 6
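To double-check that claim, here is a tiny sketch that hand-builds both tree shapes as plain nested objects and evaluates them. It's plain JavaScript with hypothetical names (node, evaluate); the trees are written out by hand rather than produced by an actual parser.

// Leaves are plain numbers; interior nodes are { op, left, right }.
const node = (op, left, right) => ({ op, left, right });

// Evaluate a tree bottom up: children first, then apply the operator.
function evaluate(n) {
  if (typeof n === 'number') return n;
  const l = evaluate(n.left);
  const r = evaluate(n.right);
  return n.op === '+' ? l + r : l - r; // only + is needed for this example
}

const leftmostTree = node('+', node('+', 1, 2), 3);  // (1 + 2) + 3
const rightmostTree = node('+', 1, node('+', 2, 3)); // 1 + (2 + 3)

console.log(evaluate(leftmostTree));  // 6
console.log(evaluate(rightmostTree)); // 6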
Notice how we get different parse trees! In other words, the grammar is ambiguous. That doesn't sound too good. But wait a minute, both parse trees evaluated to 6 (thank you kindly, associativity). So, how about this: can we just promise never to use the rightmost derivation and instead always use the leftmost, which gives us (1 + 2) + 3? Almost.
Even if we could promise that, consider the LM derivation for the program
1 + 2 * 3
If we were to evaluate this program we would see that it would be
(1 + 2) * 3 = 9
Hang on, that doesn't look right! We know that multiplication has a higher precedence than addition, and therefore it should always be evaluated as
1 + (2 * 3) = 7
Turns out there is a way to enforce operator precedence by leveraging levels in the grammar. The basic idea is that multiplicative expressions can only be generated by going through an additive expression first. That really doesn't make a whole bunch of sense written down. Instead, feast your eyes on the new and improved grammar 2.0 below.
// Grammar 2.0
E -> E + T
| E - T
| T
T -> T * F
| T / F
| F
F -> (E)
| INTLITERAL
Now it doesn't matter which derivation you use to generate the program 1 + 2 * 3. Not convinced? I don't blame you. See if you can create a parse tree for the expression where the + operator is a descendant of the * operator and there are no () in between (a hand-derived tree for reference is just below). I'll sweeten the deal: if you can create a parse tree where that is true, email me and $100 Aussie dollars are coming straight your way!
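For reference, here is a quick hand-derivation (so trust, but verify) of the one parse tree Grammar 2.0 permits for 1 + 2 * 3. Notice the * sits deeper in the tree than the +, which is exactly why it gets evaluated first:

E
├─ E
│  └─ T
│     └─ F
│        └─ INTLITERAL (1)
├─ +
└─ T
   ├─ T
   │  └─ F
   │     └─ INTLITERAL (2)
   ├─ *
   └─ F
      └─ INTLITERAL (3)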
I know what you are thinking, "maybe I should just learn to be effective with postfix notation and abandon this whole idea of building a calculator language". Hang in there, you'll be getting your hands dirty in no time. Actually, speaking of getting our hands dirty, how are we going to actually implement the grammar? Like, it's great and all having a nice notation to work with, but we need to realise the grammar using a parsing algorithm. LL(1) parsing is one such algorithm. In a mouthful, an LL(1) parser is a top down parser that reads the input left-to-right, using leftmost derivations to apply production rules, with at most 1 token of lookahead (don't worry if not a single word of that sentence made sense).
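To make "at most 1 token of lookahead" concrete, here is a minimal sketch of the machinery such a parser leans on. The names here (TokenStream, peek, match) are hypothetical; the stub in compiler/Parser/Parser.js may well spell things differently.

// A minimal sketch of 1-token lookahead (hypothetical helper names).
class TokenStream {
  constructor(tokens) {
    this.tokens = tokens;
    this.pos = 0;
  }
  // The single token of lookahead: inspect without consuming.
  peek() {
    return this.tokens[this.pos];
  }
  // Consume the next token, but only if it is what we predicted.
  match(type) {
    const token = this.peek();
    if (!token || token.type !== type) {
      throw new Error(`Expected ${type} but saw ${token ? token.type : 'end of input'}`);
    }
    this.pos += 1;
    return token;
  }
}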
Knowing little about LL(1) parsers, you decide to consult your coworker again. They advise you that using an LL(1) parser is a great choice, phew! But, before you can run back to your desk, they qualify their statement by saying it's a great choice as long as the grammar does not have any left recursion. Left recursion? That doesn't ring a bell. You look at your grammar, trying to find signs of something that may be left recursion. Ah ha! You think you've found the culprit.
// Grammar 2.0
E -> E + T
...
But how does this cause a problem? Imagine applying an LM derivation to the following program
1 + 2 + 3 + 4
(1) E => E + T
(2) => E + T + T
(3) => E + T + T + T
(4) => T + T + T + T
(5) => F + T + T + T
(6) => INTLITERAL + T + T + T
(7) => 1 + T + T + T
...
(16) => 1 + 2 + 3 + 4
The burning question is, how did you know, at step (4), to stop applying E -> E + T and apply E -> T instead?
(4) => T + T + T + T
The answer probably goes along the lines of: well, I could see that the string contained three instances of the token +, and so once I had produced three of them I knew to stop recursing and apply E -> T. Without even realising it, you've used an LL(k) parsing algorithm with k tokens of lookahead, where k is the number of tokens in the string. In other words, the parser could look at the whole string! Unfortunately, our parser doesn't have this sweet luxury but instead has to live its life constrained to only 1 token of lookahead. It's important to note there are two well known ways to implement LL(k) parsers. The first is a recursive descent parser, and the second is a table driven parser.
The problem with left recursion is that a recursive descent parser will recurse forever, never terminating, while a table driven parser will end up with multiple entries in a given cell and not know which one to apply. How do we get ourselves out of this bleak situation? Left recursion elimination. As if it was going to be called anything else haha. You decide that in your next lunch break you will read up on it!
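To see the recursive descent half of that problem concretely, here is a sketch (hypothetical function names) of what a naive function for the production E -> E + T would look like:

// E -> E + T: to parse an E we must first parse an E, which must first
// parse an E, which... the function calls itself before consuming a
// single token, so it never terminates.
function E() {
  E();        // parse the left hand E... and recurse forever
  match('+'); // never reached
  T();        // never reached
}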
After what I can only describe as a successful lunch break, you return to your desk with a revised (and hopefully final) version of the grammar.
// Grammar 3.0
E -> TE'
E' -> +TE' | -TE' | <eps> // <eps> is epsilon
T -> FT'
T' -> *FT' | /FT' | <eps>
F -> (E) | INTLITERAL
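In case your own lunch break got cut short, the revision follows the standard left recursion elimination recipe, sketched informally below:

// Left recursion elimination, in general: for productions of the form
// A -> Aa | b    (a, b are any sequences of symbols; A -> Aa is left recursive)
// rewrite as
// A  -> bA'
// A' -> aA' | <eps>
//
// Applied to E -> E + T | E - T | T (so b = T, and a = "+ T" or "- T"):
// E  -> TE'
// E' -> +TE' | -TE' | <eps>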
Your coworker notices the revised grammar as they walk past. Interested, they take a closer look. Before you can turn around, you hear them announce "That's the one!". They continue, "You've nailed it! The grammar has the correct operator precedence, each operator has the correct associativity, each production is free of left recursion and there are no common prefixes. All you have to do now is to compute the first, follow and select sets of the grammar." You sink at the sound of more unfamiliar words: first, follow and select sets. When will this end?
Turns out there is a special type of top down parser that we want to use to maximise the efficiency of the parsing algorithm. Remember, we need to run this on a PDP-11! It is called a predictive top down parser, which just means that it doesn't require backtracking. The key thing to realise is that without a predictive parser, we essentially need to try all productions, and then choose the one that gives us the longest match (if any). If we can compute the first, follow and select sets for the grammar then we will have everything needed to implement the parser! Ravindrababu Ravula does a fantastic job walking us through that process over here on their YouTube channel.
After being enlightened on how to compute the sets, you sit down and decide to have a crack at the final edition of the grammar. You can feel the implementation of the parser only minutes away. I should add, select sets are simply a convenience set that wraps around first and follow sets to make building the parsing table or recursive descent parser more straightforward. You'll see the definition in the stub file for the Parser! One long coffee later, and there they are below in all their glory!
First
// Terminals
First(eps) = { eps }
First(+) = { + }
First(-) = { - }
First(/) = { / }
First(*) = { * }
First("(") = { ( }
First(")") = { ) }
First(INTLITERAL) = { INTLITERAL }
// Non terminals
First(E) = First(T)
= { (, INTLITERAL }
First(E') = First(+) U First(-) U First(eps)
= { +, -, eps }
First(T) = First(F)
= { (, INTLITERAL }
First(T') = First(*) U First(/) U First(eps)
= { *, /, eps }
First(F) = First("(") U First(INTLITERAL)
= { (, INTLITERAL }
Follow
// Terminals Only
Follow(E) = { $ } U First(")")
= { $, ) }
Follow(E') = Follow(E)
= { $, ) }
Follow(T) = (First(E') - {eps}) U Follow(E') U Follow(E)
= (First(E') - {eps}) U Follow(E') // by Follow(E') = Follow(E)
= { +, -, $, ) }
Follow(T') = Follow(T)
= { +, -, $, ) }
Follow(F) = (First(T') - {eps}) U Follow(T') U Follow(T)
= (First(T') - {eps}) U Follow(T') // by Follow(T') = Follow(T)
= { *, /, +, -, $, ) }
Select
// Terminals Only
Select(E -> TE') = First(TE') = First(T)
= { (, INTLITERAL }
Select(E' -> +TE') = First(+TE') = First(+)
= { + }
Select(E' -> -TE') = First(-TE') = First(-)
= { - }
Select(E' -> eps) = (First(eps) - {eps}) U Follow(E')
= { $, ) }
Select(T -> FT') = First(FT') = First(F)
= { (, INTLITERAL }
Select(T' -> *FT') = First(*FT') = First(*)
= { * }
Select(T' -> /FT') = First(/FT') = First(/)
= { / }
Select(T' -> eps) = (First(eps) - {eps}) U Follow(T')
= { +, -, $, ) }
Select(F -> (E)) = First((E)) = First("(")
= { ( }
Select(F -> INTLITERAL) = First(INTLITERAL)
= { INTLITERAL }
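To preview how the select sets pay off, here is a sketch of what one production's function might look like in a recursive descent parser. The helper names (peek, match) are hypothetical and this is not necessarily the shape the stub expects; it just shows how 1 token of lookahead plus the select sets pick a production.

// E' -> +TE' | -TE' | <eps>
function EPrime() {
  const token = peek(); // the single token of lookahead
  if (token.type === '+') {
    // Select(E' -> +TE') = { + }
    match('+'); T(); EPrime();
  } else if (token.type === '-') {
    // Select(E' -> -TE') = { - }
    match('-'); T(); EPrime();
  } else if (token.type === ')' || token.type === '$') {
    // Select(E' -> eps) = { $, ) }: derive eps by consuming nothing
  } else {
    throw new Error(`Unexpected token ${token.type}`);
  }
}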
To complete this stage, you need to implement a [predictive] recursive descent LL(1) parser. A template parser has been defined in compiler/Parser/Parser.js. To test your implementation as you go along, run:
$ yarn run test:Recognizer
Don't forget to check out the handy tool a little further down to help visualise what the parse tree looks like for a given program. When a production is applied, for example E => TE', that corresponds to a node called E that has two children, T and E' (as is the case with the root node below). An interesting side note: if you change the stream of characters to be just the program 1, the tree requires 9 nodes in total!
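If you're wondering where the 9 comes from, here is one way to draw (and count) that tree, counting the two <eps> leaves and the literal's value as their own nodes. A hand count under one particular convention, so treat it as a sanity check rather than gospel:

E                   (1)
├─ T                (2)
│  ├─ F             (3)
│  │  └─ INTLITERAL (4)
│  │     └─ 1       (5)
│  └─ T'            (6)
│     └─ <eps>      (7)
└─ E'               (8)
   └─ <eps>         (9)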