So in lecture, we spent a lot of time learning about context-free grammars and languages.
Can context-free grammars be used
to do anything other than parsing a programming language?
That's a good point, Peter, and the answer is yes.
In fact, there are a large number of applications for language formalisms
or context-free grammars.
I'll just list 4 or 5 of them.
One of the first is security.
We saw last time that regular expressions can be used in virus checkers.
And in fact, that's how the vast majority of virus checkers work.
They're like a big lexer.
It turns out that these days the most commonly reported attacks or exploits
involve cross-site scripting
or SQL injection--injecting code into database queries.
Both of these involve an application, a web application like Udacity's servers,
reading in a string from the user and treating it as if it were part of another important language
like HTML or SQL, the language of database queries.
It turns out that there are a number of techniques available
for using context-free grammars to detect whether a string from a user is normal
or whether it's one of these attacks,
trying to take advantage of some special system structure.
And essentially, the way to do that is we pretend to parse the final string
given the context that the user has provided.
And if the user's changes--the string provided by the user--
end up influencing a large part of the parse tree, of the abstract syntax tree;
in essence, if the high watermark of the tainted information from the user
reaches very, very high, then we know it's a SQL code injection
or cross-site scripting attack, because those attacks are based on destroying
or deforming the essential structure of the output language.
So 1 possible use for them is in security.
Some of the best ways to figure out if something is SQL code injection
or cross-site scripting, let's say, before you actually run it and find out,
involve the use of grammars.
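The high-watermark idea can be sketched very roughly in code. This is a toy illustration, not a real defense--the tokenizer, the query template, and the `is_injection` check are all invented here: a benign value fills exactly one literal slot in the query's token stream, while an attack string breaks out of the literal and changes the stream's shape (and hence the shape of any parse tree built over it).

```python
import re

# Toy SQL tokenizer: string literals, identifiers, numbers, operators.
TOKEN = re.compile(r"'(?:[^']|'')*'|[A-Za-z_]\w*|\d+|<>|<=|>=|[=<>;*(),]|--")

def tokens(sql):
    return TOKEN.findall(sql)

def is_injection(template, user_input):
    # Tokenize the query once with a benign placeholder and once with
    # the real input. A benign value is a single literal, so the token
    # count is unchanged; an attack string adds structural tokens.
    benign = tokens(template.format(value="'x'"))
    actual = tokens(template.format(value="'" + user_input + "'"))
    return len(actual) != len(benign)

q = "SELECT name FROM users WHERE id = {value}"
is_injection(q, "alice")        # False: one literal, same shape
is_injection(q, "' OR '1'='1")  # True: escapes the literal, adds tokens
```

Real grammar-based detectors compare whole parse trees rather than token counts, but the principle is the same: untrusted input should fill in a leaf, not reshape the tree.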
Another is in the world of optimization.
A production compiler or interpreter is often very interested in getting the same results
but faster or using less memory or--and this is a big one these days--
using less power.
If you're running a program on your smartphone or some other mobile device,
you really want it to last as long as possible.
Later on in the class, around Unit 6, we'll actually cover some basic optimizations.
I'll teach some of them to you, and you'll get a chance to use them
in your JavaScript interpreter for the web browser.
But for now, just take my word for it.
In order to do optimizations, we have to figure out what the values and variables are.
For example, if I can figure out that X is always 0
and you have a line in your program that says Y gets Y + X,
I don't have to bother interpreting that line
because I know that adding 0 to something doesn't change its value.
That sort of reasoning is called a data flow analysis, and the resulting transformation is an optimization,
because if I can reason about how values flow through your program,
then I can make it faster, higher, stronger, use less power.
It's a really nice idea.
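That Y gets Y + X example can be sketched as a tiny constant-propagation pass. The statement format and the two statement kinds here are invented for illustration; a real compiler would work over a full intermediate representation.

```python
# Toy data flow sketch: track which variables are known constants and
# drop statements that provably do nothing (adding a known zero).

def optimize(stmts):
    consts = {}   # variable -> known constant value
    kept = []
    for var, expr in stmts:           # expr is ("const", n) or ("add", a, b)
        if expr[0] == "const":
            consts[var] = expr[1]
            kept.append((var, expr))
        else:                          # ("add", a, b) means var = a + b
            _, a, b = expr
            if a == var and consts.get(b) == 0:
                continue               # y = y + 0 is a no-op: drop it
            consts.pop(var, None)      # result no longer a known constant
            kept.append((var, expr))
    return kept

prog = [("x", ("const", 0)),
        ("y", ("add", "y", "x")),      # y = y + x, and x is always 0
        ("z", ("add", "z", "w"))]
optimize(prog)   # the y = y + x statement is removed
```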
It's going to turn out that the best known research idea for figuring out values
as they flow through various functions in your program
is related to context-free grammars or, more formally, context-free language reachability.
It turns out that interprocedural data flow analysis is, in a theoretical sense,
the same problem as context-free language reachability,
which in turn is the same problem as asking whether a context-free grammar can generate a given string.
It turns out all of you are experts at figuring out if we could generate a string
in the language of a context-free grammar,
and exactly that sort of knowledge is one of the things program optimizers use
to squeeze every last drop of goodness out of a program.
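That membership question--can this grammar generate this string?--can be decided with the classic CYK algorithm. The sketch below is generic, not from the lecture: the toy grammar is in Chomsky normal form and generates strings of the form a^n b^n.

```python
from itertools import product

# Toy CNF grammar for a^n b^n:  S -> A B | A C,  C -> S B,  A -> a,  B -> b
TERMINAL = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "C"): {"S"}, ("S", "B"): {"C"}}

def generates(s, start="S"):
    n = len(s)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive s[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):
        table[i][0] = set(TERMINAL.get(ch, ()))
    for span in range(2, n + 1):               # substring length
        for i in range(n - span + 1):
            for split in range(1, span):       # left part length
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for l, r in product(left, right):
                    table[i][span - 1] |= BINARY.get((l, r), set())
    return start in table[0][n - 1]

generates("aabb")   # True:  S => a S b => a a b b
generates("abab")   # False: not of the form a^n b^n
```

CFL reachability runs essentially this same dynamic program over paths in a graph instead of substrings of an input.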
A third possibility is in linguistics or computational linguistics.
There are a number of references to these underneath the lecture videos,
but it turns out that a lot of these notions of context-free grammars
or generative grammars or universal grammars actually came out of psychology
or linguistics or computational linguistics
to help us understand human and natural language.
And it's a really important use.
But let me skip past that one since it's too easy in some sense.
I'll give you 2 more.
One more--near and dear to my heart--is specification mining.
It turns out that often we want to figure out what a program should be doing
just by looking at its source code.
And this is super tricky.
It's like trying to learn the rules of English grammar
by reading a bunch of high school student essays
and looking at what they get right and get wrong in common.
As you can imagine, it's very difficult to get this perfectly
because just because a number of students say something
doesn't necessarily mean that they're following the rules.
They might all get it wrong the same way.
And if you've seen people post on forums,
maybe not everyone has perfect grammar.
So just taking samples might not be the right way to do it.
It turns out that the output of specification mining often takes the form of a formal grammar
or some sort of state machine: You should always call open
and then you should call close.
A lot of important security or software engineering policies
can be described in terms of some sort of grammar.
It turns out that learning grammars from examples is really, really difficult.
Both that and this notion I mentioned earlier of interprocedural data flow analysis
are the sorts of things that one might get into in a subsequent course after this one
but where you'd build on the knowledge you learned here.
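The "open then close" policy above is exactly the kind of state machine a mining tool emits, and checking a trace against it is simple. The states, events, and transition table here are invented for illustration.

```python
# Mined policy as a little state machine over API-call traces.
# States: "closed" (start and accept) and "open"; a missing transition
# means the trace violates the policy.
POLICY = {("closed", "open"): "open",
          ("open", "close"): "closed",
          ("open", "read"): "open"}     # reads allowed while open

def follows_policy(trace):
    state = "closed"
    for event in trace:
        state = POLICY.get((state, event))
        if state is None:
            return False                # e.g. read or close before open
    return state == "closed"            # every open must be closed

follows_policy(["open", "read", "close"])   # True
follows_policy(["read", "close"])           # False: read before open
```

The hard research problem the lecture alludes to is learning the `POLICY` table itself from imperfect example traces; checking against it, as here, is the easy direction.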
Here's my last example, and this one is perhaps the farthest from the norm,
but I think students may appreciate it.
It turns out that if you're using a cellular phone,
people might be able to intercept what you're saying if they just listen very hard.
So these days, modern cell phones or mobile phones might try to encrypt the data packets
that they send containing your vocal information so that, in theory,
eavesdroppers can't figure out what's going on.
However, if I've studied a lot of that computational linguistic information,
if I've studied the world, it turns out that in different spoken languages
or with different accents we have different patterns of speech and pauses.
And some of the initial implementations of phone encryption tried to be smart
and save power.
They would only send information when you actually said something.
So this meant that attackers could listen in and, using linguistic analysis--
even if they couldn't tell what you were saying, even if it was all scrambled--
just by doing a statistical analysis of how long you spoke and then how long you paused
and then how long you spoke again, reliably identify what language you're speaking
and, worse than that, regional accents. Say you're speaking English:
we can tell whether you're speaking with some sort of New York/Brooklyn accent or from the South
or from California,
based just on the patterns of how long you speak and how long the pauses are.
That has very unfortunate Orwellian implications for some sort of surveillance state.
Conveniently, it is more or less totally defeated by encrypting the silence as well--
using more power to get rid of that side channel
and prevent people from figuring out when you're speaking and when you're not.
But that's an example of an application that also uses this sort of linguistic analysis--
perhaps a little less related to context-free grammars but still in the same vein.
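To see why suppressing silence leaks information, here is a toy illustration; the talk/pause "signal" and the per-frame packet model are invented, not measured data.

```python
# 1 = a frame where the speaker is talking, 0 = a pause.
speech = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

# Variable-rate (silence-suppressing) encryption: a packet is sent only
# while talking, so packet timing mirrors the talk/pause rhythm exactly.
variable_rate = [t for t, talking in enumerate(speech) if talking]

# Constant-rate encryption: a padded, encrypted packet every frame; the
# stream looks the same no matter what was said, closing the side channel.
constant_rate = list(range(len(speech)))

variable_rate   # [0, 1, 2, 5, 9, 10] -- leaks the speech rhythm
constant_rate   # [0, 1, ..., 10]     -- independent of the speech
```

The statistical attack described above needs only the `variable_rate` timing stream, never the decrypted audio.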
All in all, context-free grammars, context-free languages
come up in significantly more of computer science than just parsing.