So in lecture, we spent a lot of time learning about context-free grammars and languages.
Can context-free grammars be used
to do anything other than parsing a programming language?
That's a good point, Peter, and the answer is yes.
In fact, there are a large number of applications for language formalisms
or context-free grammars.
I'll just list 4 or 5 of them.
One of the first is security.
We saw last time that regular expressions can be used in virus checkers.
And in fact, that's how the vast majority of virus checkers work.
They're like a big lexer.
It turns out that these days the most commonly reported attacks or exploits
involve cross-site scripting
or SQL injection--injecting code into database queries.
Both of these involve an application, a web application like Udacity's servers,
reading in a string from the user and treating it as if it were part of another important language
like HTML or SQL, the language of database queries.
It turns out that there are a number of techniques available
for using context-free grammars to detect whether a string from a user is normal
or whether it's one of these attacks,
trying to take advantage of some special system structure.
And essentially, the way to do that is we pretend to parse the final string
given the context that the user has provided.
And if the user's changes--the string provided by the user--
end up influencing a large part of the parse tree, of the abstract syntax tree;
in essence, if the high watermark of the tainted information from the user
reaches very, very high, then we know it's a SQL code injection
or cross-site scripting attack, because those attacks are based on destroying
or deforming the essential structure of the output language.
So 1 possible use for them is in security.
Some of the best ways to figure out if something is SQL code injection
or cross-site scripting, let's say, before you actually run it and find out,
involve the use of grammars.
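The high-watermark idea can be sketched very roughly in code. This is a toy illustration, not a real defense--the tokenizer, the query template, and the `is_injection` check are all invented here: a benign value fills exactly one literal slot in the query's token stream, while an attack string breaks out of the literal and changes the stream's shape (and hence the shape of any parse tree built over it).

```python
import re

# Toy SQL tokenizer: string literals, identifiers, numbers, operators.
TOKEN = re.compile(r"'(?:[^']|'')*'|[A-Za-z_]\w*|\d+|<>|<=|>=|[=<>;*(),]|--")

def tokens(sql):
    return TOKEN.findall(sql)

def is_injection(template, user_input):
    # Tokenize the query once with a benign placeholder and once with
    # the real input. A benign value is a single literal, so the token
    # count is unchanged; an attack string adds structural tokens.
    benign = tokens(template.format(value="'x'"))
    actual = tokens(template.format(value="'" + user_input + "'"))
    return len(actual) != len(benign)

q = "SELECT name FROM users WHERE id = {value}"
is_injection(q, "alice")        # False: one literal, same shape
is_injection(q, "' OR '1'='1")  # True: escapes the literal, adds tokens
```

Real grammar-based detectors compare whole parse trees rather than token counts, but the principle is the same: untrusted input should fill in a leaf, not reshape the tree.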
Another is in the world of optimization.
A production compiler or interpreter is often very interested in getting the same results
but faster or using less memory or--and this is a big one these days--
using less power.
If you're running a program on your smartphone or some other mobile device,
you really want it to last as long as possible.
Later on in the class, around Unit 6, we'll actually cover some basic optimizations.
I'll teach some of them to you, and you'll get a chance to use them
in your JavaScript interpreter for the web browser.
But for now, just take my word for it.
In order to do optimizations, we have to figure out what the values and variables are.
For example, if I can figure out that X is always 0
and you have a line in your program that says Y gets Y + X,
I don't have to bother interpreting that line
because I know that adding 0 to something doesn't change its value.
That sort of reasoning is called a data flow analysis, and the resulting transformation is an optimization,
because if I can reason about how values flow through your program,
then I can make it faster, higher, stronger, use less power.
It's a really nice idea.
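That Y gets Y + X example can be sketched as a tiny constant-propagation pass. The statement format and the two statement kinds here are invented for illustration; a real compiler would work over a full intermediate representation.

```python
# Toy data flow sketch: track which variables are known constants and
# drop statements that provably do nothing (adding a known zero).

def optimize(stmts):
    consts = {}   # variable -> known constant value
    kept = []
    for var, expr in stmts:           # expr is ("const", n) or ("add", a, b)
        if expr[0] == "const":
            consts[var] = expr[1]
            kept.append((var, expr))
        else:                          # ("add", a, b) means var = a + b
            _, a, b = expr
            if a == var and consts.get(b) == 0:
                continue               # y = y + 0 is a no-op: drop it
            consts.pop(var, None)      # result no longer a known constant
            kept.append((var, expr))
    return kept

prog = [("x", ("const", 0)),
        ("y", ("add", "y", "x")),      # y = y + x, and x is always 0
        ("z", ("add", "z", "w"))]
optimize(prog)   # the y = y + x statement is removed
```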
It's going to turn out that the best known research idea for figuring out values
as they flow through various functions in your program
is related to context-free grammars or, more formally, context-free language reachability.
It turns out that interprocedural data flow analysis is, in a theoretical sense,
the same problem as context-free language reachability,
which in turn is the same problem as asking whether a context-free grammar can generate a given string.
It turns out all of you are experts at figuring out if we could generate a string
in the language of a context-free grammar,
and exactly that sort of knowledge is one of the things program optimizers use
to squeeze every last drop of goodness out of a program.
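That membership question--can this grammar generate this string?--can be decided with the classic CYK algorithm. The sketch below is generic, not from the lecture: the toy grammar is in Chomsky normal form and generates strings of the form a^n b^n.

```python
from itertools import product

# Toy CNF grammar for a^n b^n:  S -> A B | A C,  C -> S B,  A -> a,  B -> b
TERMINAL = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "C"): {"S"}, ("S", "B"): {"C"}}

def generates(s, start="S"):
    n = len(s)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive s[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):
        table[i][0] = set(TERMINAL.get(ch, ()))
    for span in range(2, n + 1):               # substring length
        for i in range(n - span + 1):
            for split in range(1, span):       # left part length
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for l, r in product(left, right):
                    table[i][span - 1] |= BINARY.get((l, r), set())
    return start in table[0][n - 1]

generates("aabb")   # True:  S => a S b => a a b b
generates("abab")   # False: not of the form a^n b^n
```

CFL reachability runs essentially this same dynamic program over paths in a graph instead of substrings of an input.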
A third possibility is in linguistics or computational linguistics.
There are a number of references to these underneath the lecture videos,
but it turns out that a lot of these notions of context-free grammars
or generative grammars or universal grammars actually came out of psychology
or linguistics or computational linguistics
to help us understand human and natural language.
And it's a really important use.
But let me skip past that one since it's too easy in some sense.
I'll give you 2 more.
One more--near and dear to my heart--is specification mining.
It turns out that often we want to figure out what a program should be doing
just by looking at its source code.
And this is super tricky.
It's like trying to learn the rules of English grammar
by reading a bunch of high school student essays
and looking at what they get right and get wrong in common.
As you can imagine, it's very difficult to get this perfectly
because just because a number of students say something
doesn't necessarily mean that they're following the rules.
They might all get it wrong the same way.
And if you've seen people post on forums,
maybe not everyone has perfect grammar.
So just taking samples might not be the right way to do it.
It turns out that the output of specification mining often takes the form of a formal grammar
or some sort of state machine: You should always call open
and then you should call close.
A lot of important security or software engineering policies
can be described in terms of some sort of grammar.
It turns out that learning grammars from examples is really, really difficult.
Both that and this notion I mentioned earlier of interprocedural data flow analysis
are the sorts of things that one might get into in a subsequent course after this one
but where you'd build on the knowledge you learned here.
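The "open then close" policy above is exactly the kind of state machine a mining tool emits, and checking a trace against it is simple. The states, events, and transition table here are invented for illustration.

```python
# Mined policy as a little state machine over API-call traces.
# States: "closed" (start and accept) and "open"; a missing transition
# means the trace violates the policy.
POLICY = {("closed", "open"): "open",
          ("open", "close"): "closed",
          ("open", "read"): "open"}     # reads allowed while open

def follows_policy(trace):
    state = "closed"
    for event in trace:
        state = POLICY.get((state, event))
        if state is None:
            return False                # e.g. read or close before open
    return state == "closed"            # every open must be closed

follows_policy(["open", "read", "close"])   # True
follows_policy(["read", "close"])           # False: read before open
```

The hard research problem the lecture alludes to is learning the `POLICY` table itself from imperfect example traces; checking against it, as here, is the easy direction.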
Here's my last example, and this one is perhaps the farthest from the norm,
but I think students may appreciate it.
It turns out that if you're using a cellular phone,
people might be able to intercept what you're saying if they just listen very hard.
So these days, modern cell phones or mobile phones might try to encrypt the data packets
that they send containing your vocal information so that, in theory,
eavesdroppers can't figure out what's going on.
However, if I've studied a lot of that computational linguistic information,
if I've studied the world, it turns out that in different spoken languages
or with different accents we have different patterns of speech and pauses.
And some of the initial implementations of phone encryption tried to be smart
and save power.
They would only send information when you actually said something.
So this meant that attackers could listen in and, using linguistic analysis--
even if they couldn't tell what you were saying, even if it was all scrambled--
just by doing a statistical analysis of how long you spoke and then how long you paused
and then how long you spoke again, reliably identify what language you're speaking
and, worse than that, regional accents. Say you're speaking English:
we can tell whether you're speaking with some sort of New York/Brooklyn accent or from the South
or from California,
based just on the patterns of how long you speak and how long the pauses are.
That has very unfortunate Orwellian implications for some sort of surveillance state.
Conveniently, it is more or less totally defeated by encrypting the silence as well--
using more power to get rid of that side channel
and prevent people from figuring out when you're speaking and when you're not.
But that's an example of an application that also uses this sort of linguistic analysis--
perhaps a little less related to context-free grammars but still in the same vein.
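To see why suppressing silence leaks information, here is a toy illustration; the talk/pause "signal" and the per-frame packet model are invented, not measured data.

```python
# 1 = a frame where the speaker is talking, 0 = a pause.
speech = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

# Variable-rate (silence-suppressing) encryption: a packet is sent only
# while talking, so packet timing mirrors the talk/pause rhythm exactly.
variable_rate = [t for t, talking in enumerate(speech) if talking]

# Constant-rate encryption: a padded, encrypted packet every frame; the
# stream looks the same no matter what was said, closing the side channel.
constant_rate = list(range(len(speech)))

variable_rate   # [0, 1, 2, 5, 9, 10] -- leaks the speech rhythm
constant_rate   # [0, 1, ..., 10]     -- independent of the speech
```

The statistical attack described above needs only the `variable_rate` timing stream, never the decrypted audio.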
All in all, context-free grammars, context-free languages
come up in significantly more of computer science than just parsing.