In class we studied parsing using memoization,
but I've heard in the real world many tools use techniques such as LALR(1).
Why didn't we learn those?
That is a good question, and actually there are a number of answers to it.
I'll enumerate maybe five of them.
One of the first is that this memoization or chart parsing technique that we learned
is actually very similar to--almost isomorphic to-- some of the more advanced techniques
used in practice.
You can read up on or you may have heard of GLR--generalized LR parsing.
That's commonly used in tools such as Oink--the real name for a software project,
go check it out--which can be used to parse complicated languages like C++.
It turns out that C++'s grammar doesn't really easily fit into these
restrictive little cubbyholes that some of the faster tools prefer.
It can really be a contortion that distorts meaning
to try to get C++ to fit into one of these other tools.
By teaching you a very general technique early on that can handle any context-free grammar,
I've made sure you know how parsing works even in these obscure corner cases,
and I'm teaching you the approach that is used in the more advanced tools.
Another reason related to this is that all of these tools have the same interface.
If you want to go try Oink out later on, it accepts this same sort of context-free grammar format
that we're used to.
So does Yacc. So does Bison. So does Ocamlyacc. So does Java CUP. So does Racc, Ruby's yacc.
I've taught you the way to phrase it,
and then the particular tool under the hood can use whatever parsing algorithm it prefers.
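To make that concrete, here's a hypothetical fragment of the sort of context-free grammar specification that tools in the yacc family accept. The exact token names and directives vary from tool to tool, so treat this as a sketch of the shared format rather than valid input for any particular one:

```
/* exp -> exp + exp | exp * exp | ( exp ) | NUMBER */
exp : exp '+' exp
    | exp '*' exp
    | '(' exp ')'
    | NUMBER
    ;
```

Whether a tool turns this into LALR(1) tables under the hood or handles it with GLR machinery, the language described is the same.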
In the same way that we write regular expressions--ab+--
and then depending on whether we're writing a Perl program or a Python program
or some other program, it might convert that regular expression into a finite state machine
slightly differently, but we know what the final output is going to be.
We know what strings are going to be matched.
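As a quick illustration, here is that regular expression in Python's re module. Another engine might compile the pattern into a different finite state machine, but the set of matched strings is fixed by the pattern itself:

```python
import re

# ab+ means: an 'a' followed by one or more 'b's.
pattern = re.compile(r"ab+")

print(bool(pattern.fullmatch("ab")))    # → True
print(bool(pattern.fullmatch("abbb")))  # → True
print(bool(pattern.fullmatch("a")))     # → False: needs at least one 'b'
```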
Similarly here, when you write your grammar, regardless of what parser-generating algorithm
is eventually used, we know what the behavior is going to be.
That's what we've covered in this class. There's a related argument.
You might say, well, why are we using Python in this class
when the real world uses C or Java or C++
or whichever language you think is commonly used in the real world--perhaps C#.
That's the same argument here.
Why are we using this chart parsing when the real world uses these slightly different tools?
This is a big question in education and pedagogy.
I honestly believe as a practitioner and a researcher
that there is a key difference between knowing how to program
and knowing how to program in a particular language.
The former is much more important.
If you know how to program--if you survived this class--
you can pick up another language like Ruby, Java, or Python and do the same thing:
knowing how to program means thinking about recursion, thinking about laying out data,
thinking about algorithms.
That's much more important than knowing the particular syntax for any language.
Similarly here, I believe that really understanding context-free grammars and parsing
is more important than knowing which particular parser generator tool implementation
is being used under the hood.
You know one particular algorithm for creating parsers,
and that may as well be the one that's used in practice.
Another reason that I focused on this algorithm is that it is simple.
It relates grammars to parsing, which is one of the concepts in this class.
We lay out context-free grammars, but now we need to be able to tell
if a string or an utterance is in the language of that grammar and build up its parse tree.
It's only a few lines long, and it allows us to introduce lovely concepts like memoization
or list comprehensions.
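To give a flavor of that, here is a minimal memoized recognizer for a toy grammar. This is a sketch in the spirit of the memoized parsing we covered, not the exact code from the lectures; the grammar and helper names are made up for this example, and this simple top-down version would loop forever on left-recursive rules, which a proper chart parser handles:

```python
from functools import lru_cache

# Hypothetical toy grammar: S -> 'a' S 'b' | (empty), i.e. a^n b^n.
# Nonterminals are dictionary keys; any other string is a terminal.
GRAMMAR = {
    "S": [["a", "S", "b"], []],
}

def make_recognizer(grammar, text):
    @lru_cache(maxsize=None)        # the memoization doing the real work
    def ends(symbol, pos):
        """All positions where `symbol` can finish, starting at `pos`."""
        if symbol not in grammar:   # terminal: match it literally
            if text.startswith(symbol, pos):
                return frozenset([pos + len(symbol)])
            return frozenset()
        results = set()
        for production in grammar[symbol]:
            positions = {pos}       # an empty production consumes nothing
            for sym in production:
                positions = {e for p in positions for e in ends(sym, p)}
            results |= positions
        return frozenset(results)
    return ends

ends = make_recognizer(GRAMMAR, "aabb")
print(len("aabb") in ends("S", 0))   # → True: "aabb" is in the language
ends2 = make_recognizer(GRAMMAR, "aab")
print(len("aab") in ends2("S", 0))   # → False
```

Because results are memoized per (symbol, position) pair, each combination is computed only once, which is the same idea that keeps the chart parser's work bounded.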
A number of you might wonder: if we were to write our parser in another way--
maybe you've heard of techniques like recursive descent, or even using LALR(1) grammars--
would it be just as convenient?
I have done them all, and the answer is no.
By no means is a recursive descent parser or an LALR(1) grammar
for the language fragment that we're considering here
as easy to write as the parsing algorithm that I showed you.
Now, this is in fact the most elegant approach.
Then finally a number of students were concerned about the time taken by our parser,
which is cubic or runs in time proportional to the size times the size times the size of the input program,
in the worst case for an ambiguous grammar.
Now, this is a bit beyond the scope of this course,
but it's worth pointing out that the comparison with these other tools
that claim to be faster is not actually a fair one.
The parsing algorithm I taught you handles any grammar.
The tools that claim to be faster don't.
They only handle special restricted grammars.
An interesting note that I didn't cover in this course,
but that's nonetheless true, is that the particular parsing algorithm
that we covered also runs in linear time--
is very fast--for the same small class of grammars--LR(k) grammars, they're called--
that are accepted by these faster tools.
If you were the sort of person who would take all the work and time required
to shoehorn your language's description into the particular straitjacket format
favored by these other, faster tools,
then those faster tools are actually just as slow or just as fast--your choice--
as the algorithm I taught you, in the limit. Their asymptotic complexity is the same.
They're both linear time. You don't lose anything in terms of time either.
Again, getting back to the first point, I think it's more of an historical accident
that most of these tools happen to use LALR(1) or other hacks for speed in practice.
In the last few years--the last 10 years, say--research has focused increasingly
on GLR parsing or this sort of chart parsing that I've shown you
as a way of allowing people to write the grammars that they want,
get linear time, super fast performance most of the time,
pay a little bit when things are ambiguous, and then get on with their lives.
I foresee that languages like C++ or Java or whatnot--
more complicated languages--will be analyzed using tools like that in the future.
It's my honest belief as an educator that I'm teaching you the right approach.
Teaching something like LALR(1)--that information is really not necessary in the modern world,
and it's long--those lectures feel like they go on forever.
By contrast, this chart-based approach is elegant, simple,
allows us to introduce fun concepts, handles all grammars,
which the other approaches do not,
fits in just a few slides, and is just as fast in the limit. What's not to like?