NaturalLanguageProcessing-Lecture06
Instructor (Christopher Manning):Hi again. Welcome back to 224n. Okay. So today's lecture is hopefully a little bit of a respite, where we can finish off some stuff and do a high-level introduction to information extraction, and you can, hopefully, feel relaxed. You handed in the first programming assignment and are getting a good start on your second assignment. And then on Wednesday, we kinda get into the next big area of technical content, which is doing discriminative sequence classifiers. So what I'm gonna do today is, first of all, finish off with a bit that I didn't say about machine translation, about using syntax-based models. And then I want to give a high-level introduction to information extraction and some of the motivations, and talk about a couple of historical, and still used, but straightforward methods to go about information extraction. And then next time, we'll go into the main technique that we'll cover, which is discriminative sequence classifiers. Okay. For about the first quarter of the class, I just want to finish off by saying a little bit more about what I didn't get to in machine translation.
So at the end last time, if you remember, we were talking about statistical phrase-based systems. And statistical phrase-based systems are kinda just the de facto state of the art as to what people build and use for machine translation systems. And the deal there was that we actually still use IBM-style alignment models to build good alignments. So we don't actually use them to translate any more, but we use them to learn alignments using EM, kind of the way we've discussed and you're doing for the assignment. We're then using an algorithm to merge their outputs. And there are various algorithms that are used, and I'm not gonna go into the details. But conceptually, this is just fixing the brokenness of the fact that the IBM models allow one-to-many alignments in one direction, but not in the other direction. So to some extent, this looks like a kind of an ad hoc procedure, and I think it's fair to say it is.
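To make the merging step concrete, here's a minimal sketch, in Python, of one common starting point (not any particular system's algorithm): intersecting the two directional alignments to get high-precision links, which the various merging heuristics then grow.

```python
# Minimal sketch of alignment symmetrization: run the IBM models in both
# directions, then intersect the two sets of alignment links. Real systems
# then add back selected links from the union of the two directions.

def intersect_alignments(e2f, f2e):
    """e2f, f2e: sets of (e_index, f_index) links from the two directions."""
    return e2f & f2e

# Toy example: each direction allows one-to-many only its own way.
e2f = {(0, 0), (1, 1), (1, 2)}
f2e = {(0, 0), (1, 1), (2, 2)}
print(intersect_alignments(e2f, f2e))  # {(0, 0), (1, 1)} - high-precision links
```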
But it's basically state of the art, what everyone does. And so then once you've got that final phrase table over here, you do the phrase extraction process. And the phrase extraction process, as I said last time, is that you have legal phrases, and legal phrases are things that are complete, in that a phrase pair has to contain all the alignment points for all the words in both phrases. So you can't have this as a phrase, because there's an alignment point that's outside it. And you can't have this as a phrase, because there are alignment points outside it. But nevertheless, there are lots of possible phrases you can extract. So this is a possible phrase. That's not a possible phrase. But this is a possible phrase. And then the whole thing's a possible phrase. And then this gets cut off, but maybe kind of all of this is a possible phrase, so you can recursively extract bigger and bigger possible phrases. And people do, normally, up to some fixed limit. And then the decoder runs, using these kinda multiword phrase chunks. And that's proven to be a very successful technique, essentially because these multiword phrase chunks can capture a lot of the word selection and fluency of translation in common idioms.
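As a sketch of that consistency check (assuming 0-indexed word positions, alignment links as (e, f) index pairs, and max_len as the fixed limit just mentioned):

```python
# Sketch of phrase-pair extraction: a phrase pair (e-span, f-span) is legal
# only if no alignment link leaves the box, i.e., every link touching the
# e-span or the f-span lies entirely inside both spans.

def consistent(links, e_lo, e_hi, f_lo, f_hi):
    """links: set of (e, f) index pairs; spans are inclusive."""
    touching = [(e, f) for (e, f) in links
                if e_lo <= e <= e_hi or f_lo <= f <= f_hi]
    return (len(touching) > 0 and
            all(e_lo <= e <= e_hi and f_lo <= f <= f_hi
                for (e, f) in touching))

def extract_phrases(links, e_len, f_len, max_len=7):
    """Enumerate all legal phrase pairs up to a fixed length limit."""
    pairs = []
    for e_lo in range(e_len):
        for e_hi in range(e_lo, min(e_lo + max_len, e_len)):
            for f_lo in range(f_len):
                for f_hi in range(f_lo, min(f_lo + max_len, f_len)):
                    if consistent(links, e_lo, e_hi, f_lo, f_hi):
                        pairs.append(((e_lo, e_hi), (f_lo, f_hi)))
    return pairs
```

The quadruple loop is exactly the "recursively extract bigger and bigger phrases, up to some fixed limit" idea.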
But still, these methods don't solve all the world's problems, and so I wanted to go on and say a little bit about syntax-based machine translation systems. So the general idea of what we're doing here is: this is
the Vauquois triangle that I showed earlier on. And the idea of this was you could be operating just at the word translation level, or you could be heading to deeper levels of linguistic analysis. And, although people had previously done a lot of work with syntax-based machine translation before statistical machine translation came to the fore, essentially, with the IBM models, things went right back to the beginning.
Although there are kind of a couple of little hacks, like Model 4 remembering the displacement so you didn't get penalized twice, which a little bit captured the idea of phrase structure, basically you're down at the level of doing word-for-word translation at the bottom of this triangle. And so recent work in machine translation has been starting to push back up this triangle. And so now these statistical phrase-based systems are the usual thing, and most people, but not quite everybody, believe that pretty soon the wave of the future is building syntax-based machine translation systems. And so I'm just going to show a little bit of one early system
that used syntax. Some of the reasons why you want a syntax-based system: you'd like to have output sentences which are grammatical. If you get far enough in your assignment and have your Model 2 working decently, you'll hopefully see that some things are kind of okay. It's finding a lot of the content words that it should be, and so on. But you're not really gonna get out sentences which are grammatical sentences that make it past your English teacher. So somehow it would be nice if we could actually kind of at least translate into sentences. They may not be perfect translations, but shouldn't they at least be grammatically well formed? And another place where statistical phrase-based systems still badly fall down is reordering and word order between languages. So that's where, when you have something like Arabic, which commonly has verb-initial sentences, or German, which commonly has verb-final sentences, there are all of these different word order possibilities. To some extent, statistical phrase-based systems can reorder phrases, but they're very weak models of doing that. And you'd hope that syntax-based models could do a lot better. Phrase-based systems are also normally bad at putting in the right function words. That's knowing that when you translate into English you have to put "the"s and "a"s in various places, where in other languages you have to put various kinds of case particles or verb endings, because they don't really understand any of the syntax. Okay. So here's the kind of motivating example
for trying to do something better. This is a Japanese sentence, and we're translating it to "He adores listening to music." First thing to notice is that there's a ton of reordering here. So we have the "he" at the beginning, the "kare," but then the next thing, "adores," is right at the end of the sentence, "daisuki." And then, more than that, there's kind of a lot of sort of syntax stuff going on.
So here in English we have "listening to music," whereas in the Japanese we have "ongaku o." So this is kind of an object marker. So you're not getting "to," you're kind of getting "listening music" in Japanese. So it's one of these syntax facts that you have to know about how things get translated. And then in both of them, effectively, you've got a nominalization. So "listening to music" is the English nominalization, and in Japanese, you're also getting this nominalization of "ongaku." But it's got a lot of its own syntactic structure, these little function words that are going together to put it together, and somehow it seems like, to actually be able to translate well, we want to understand some of that syntactic structure. And then to know general facts about word order. So in Japanese, you're generally getting the predicates at the end of the phrase. So we're getting the "adores" at the end of this phrase, and then inside the nominalization, we're getting the "listening" at the end, and the "music" before it. So that part's the opposite to English. So there's lots of syntax going on, which we might want to model. And so the
suggestion in this Yamada and Knight paper was that, since ultimately we're going to be translating Japanese to English, doing the usual Bayesian version still, we have an English-to-Japanese channel model. And so maybe what we could do, since we have parsing technology for English that's pretty good, is build the channel model around an English parser. And it's probably better to just immediately move to the picture. So this is the idea of what the channel model now looks like. The channel model starts with an English sentence, and then it does a sequence of operations that changes things around. So the first part of it is to treat this tree like a mobile that you can kind of change around the parts of. So at each level, you're allowed to do reordering operations. So I can swap around "to music" to "music to." And then I can swap around the phrase and the verb that used to be "listening to music," and swap them to "music to listening." And then I can recursively do the same one more time up the top, and get "adores" at the end and the object before it. So, sort of just by swapping things around, I can turn things into the right word order for Japanese. If you want to, in your spare time, you could think about whether that's always possible or whether there are counterexamples to that. Then the second phase of this is putting in these spurious words, the extra words that are turning up for no reason. So now we've got a syntax-based model. And Japanese has these case-marking particles that denote grammatical functions, like what's the subject or what's the object. And that's just the kind of thing that you can very straightforwardly do if you have a syntax tree. Because you could look at the syntax tree and say, oh, here's the subject, so it should have the subject, I guess the topic, case particle "ha" after it. And then down... actually, I should be careful. This is actually a topic-comment structure, so this is the topic and then this is actually the subject with another marker. But you can put in the appropriate case markers. You can put in extra bits of verbal morphology that you have to, with reference to the syntax. Yes?
Student:[Inaudible].
Instructor (Christopher Manning):So there certainly could be, yes. To be honest, in the model of Yamada and Knight, the chances of putting these words in didn't depend on the context. They depended solely on the parent category. Now, obviously, you could imagine better than that, but essentially, the model was: I'm going to be able to put in the appropriate grammatical markers in the appropriate syntactic positions. Yes?
Student:[Inaudible] a lot of languages, not just Japanese, have these types of words, where having them in there [inaudible].
Instructor (Christopher Manning):Well, I guess in a lot of languages, the corresponding things are bits of morphology that are endings on words, i.e., the cases that normally occur. So that's something like in German, where you have nominative case, accusative case. Or if you got tortured with Latin in your early youth, you saw nominative and accusative case. So I think more commonly the answer is that they're kind of endings on words, doing this with morphology, which you'd have to handle slightly differently. But in principle, you could do the same sort of thing. It's not that there aren't other languages that do things in this kind of way, with these sort of stand-alone particles. So you see that in things like the Austronesian languages. Okay. So we've done reordering, inserting extra bits of grammar, and then we just translate the words into Japanese words. And if we're really, really lucky, we can just read off the tree and have the right sentence. Great. Of course, it doesn't always work, but that's the principle. So when they first published
this, they published this as a channel model and showed some alignment results, and that seemed kind of cool. But I kind of wondered how they were ever going to reverse this to turn it into a real translation model in a decoder. But it turns out it's actually not that hard to do. Essentially, what you can do is write a grammar that incorporates all of this stuff into it, so that your grammar rules have extra annotations to say whether, in the Japanese version, things get swapped. And so this is still just being written as context-free grammar rules. You can add grammar rules that take a word in one language and turn it into a word in another language, as kind of funny terminal rewrite rules. And you can turn it all into a big phrase-structure, context-free grammar, and that's what they did, and built a decoder out of it. And so they did this in sort of small-scale experiments for Japanese-English, and the general result was that they could actually get a very significant increase, especially in human judgments of the acceptability of their translations, by doing this. Because I think this is particularly emphasizing the fact that they can mess up as well, but providing they kind of get it right, they've got much better control of the grammar and structure of their sentences, and so they sound a lot better to human beings. So they had three human beings judging 50 sentences, and this was just a three-way judgment: the human beings could say that it was good, so-so, or bad. And as you can see in these results, they were getting significantly better results than IBM Model 5 here. In particular, ten sentences were judged perfect, as opposed to zero. And so here's just a couple of pictures that are showing the alignments,
where the idea here is that the very dark lines are high-probability alignments, and the thinner lines are low-probability alignments, by the alignment model, or the channel model. And so here's IBM Model 3. And some of the time, this works perfectly well. So you are learning to translate it with the verb here at the end. That's the right thing to do. "Me," "watashi," that's doing well. "Kare," "he," that's doing well. Most of it is okay. But somehow it's completely lost the correct alignment of "revolvers." So "revolvers" should be going to this guy here. But instead, it's being aligned with function words. So this was the object particle, and this is a dative particle. And I think you'll probably learn, to your frustration, when you build your own Model 2, that these IBM models just really commonly do that kind of crazy stuff. I think it's partly in the sort of biases that are there. And unless you're careful with these kinds of function words... you don't get precisely the same function words for the cases that you're doing, but both French and English have lots of function words. They have kind of articles and auxiliary verbs and little introducing words, like "that." All those kinds of words are really common, and it can easily be that, in the model's mind, the easiest way to do these alignments is to sort of align things to function words, because they're just around a lot and you can kind of have them lined up.
Student:[Inaudible].
Instructor (Christopher Manning):The word order, yeah.
Student:[Inaudible].
Instructor (Christopher Manning):It did affect it; maybe I didn't say this clearly enough. So each of these steps here is a probability distribution, so each one of them is a little probabilistic model. So this one here, the translation one, is kinda like all the ones that we've seen, of the probability of f_j given e_i; they're just probabilities of word substitutions. But this reordering one is also a probability distribution. So what it's actually got is: okay, here's a rule, VB2 goes to VB2. What's the probability distribution over different orderings of those components? And so they could stay the same, or they could swap, or extra things can be inserted into them. And you have a probability distribution over those possibilities. So free word order can be modeled by effectively just having a flat probability distribution over reorderings.
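To illustrate (with made-up rules and numbers, not Yamada and Knight's actual parameters), the reorder step is just a distribution over permutations of each rule's children:

```python
# Hypothetical reorder table in the style of the Yamada and Knight channel
# model: for each rule, a distribution over orderings of its children.
reorder_prob = {
    ("VB", ("PRP", "VB1", "VB2")): {
        ("PRP", "VB1", "VB2"): 0.25,   # keep the English order
        ("PRP", "VB2", "VB1"): 0.60,   # verb moves to the end (Japanese-like)
        ("VB2", "PRP", "VB1"): 0.15,
    },
    ("TO", ("TO", "NN")): {
        ("TO", "NN"): 0.1,
        ("NN", "TO"): 0.9,             # "to music" -> "music to"
    },
}
# A free-word-order language would just get a flatter distribution here.
```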
Student:[Inaudible].
Instructor (Christopher Manning):Yes. Note that this is like a context-free grammar, in that the reordering is only being done relative to a single level of the tree. So, looking at something like this, if there are subtle facts about particular words preferring different reorderings, it's not capturing them.
Student:[Inaudible].
Instructor (Christopher Manning):So, short answer to that, no. I mean, I think to some extent this is part of why, at the moment, there are lots of people actively researching syntax-based models. And there are some promising results, especially for some languages, like Chinese or Japanese, with a lot of different things going on. But in other circumstances, statistical phrase-based models still do better. And I think part of the reason for that is, at the moment, there isn't kinda the perfect technology, and the different models are good for different things. So statistical phrase-based models do excellently for that case, because they'll see that "listening to music" is just being translated as this one word lots of times, if you've got enough data. They'll learn that little phrasal alignment rule, and translate it beautifully. Whereas this kind of model, although it's doing some clever stuff with syntax, in another sense, it's right back where the IBM models were, because at the end of the day, every leaf here is being translated independently, just like in Model 1, 2, or 3. So it isn't capturing the advantages of a statistical phrase-based system, and so, I guess, obviously, to do better is the goal. And you can kind of think of ideas to do better. You could say, can I take a whole phrase and translate it at once? And that's an obvious thing to think about. But one of the things that makes it difficult, and part of why statistical phrase-based systems have been very successful by ignoring syntax, is that a lot of their useful translation units don't actually correspond to syntactic units. So, like in this example, the useful translation unit is actually "listening to," because "listening to" gets translated in a certain way. It's not that "to music" is a good translation unit, because if you were talking about something different, where you're actually going to music tonight, the "to" would be translated as a Japanese particle which is really used for motion towards. But here, because of this English idiom of "listen to," they're not using that particle. They're just regarding it as a direct object. So that's kind of part of why this is tricky. Yeah?
Student:[Inaudible].
Instructor (Christopher Manning):So there certainly have been... I don't think I'm very expert on this field, I will say first off. I mean, there certainly is quite a lot of work on studying human translators. I guess, especially, there have been a lot of psychological studies of people doing simultaneous translation, because that's seen as a fairly special skill. But I mean, absolutely, I agree that human translators are recognizing chunks that they can handle. But I think it's fair to say that the chunks that they're recognizing, some of them are idiomatic. I think it's clear that humans store a ton of idiomatic chunks in their heads, like we all know all of those regular things we see in the newspaper. Ethnic cleansing and improvised explosive devices and all your other favorite newspaper terms. We all remember those little chunks, and then people start using them. So humans carry a huge number of chunks in their heads and will translate chunks. But I think it's also fair to say that the human being is also doing grammatical analysis in their head. So that if you're trying to translate something like a Chinese sentence, you really, really need to work out which parts are different clauses and where the verb of each clause is before you can start to make any sense of how to translate it. Okay. That's one more example that maybe kind of exhibits the same thing. Okay. So here's my final
machine translation slide. So where's machine translation at the moment? I mean, it's just still not the case that you can get machines to do high-quality automatic machine translation. It just doesn't exist. So for most of the translation industry, where the action and technology actually is, is producing aids for human translators. And that means that the machine doesn't translate the text at all. Because it turns out that if the machine does kind of a bad job at translating, it just takes the human being more time to fix up all the junk that it's produced than it would take the human being to do it right in the first place.
So all the technology, which commonly goes under this name of translation memories, is effectively building big databases of translation chunks, so that the system will know well how to translate chunks. So if it's some phrase like "move your mouse," and you're wondering how to translate your software manual into Icelandic, or some language like that, well, the translation memory will be able to tell you that "move your mouse" should be translated as this. And it actually then even sort of feeds into standardization, with large companies wanting their translations to be consistent. So just as the same terminology of icon, window, dialogue box is used in English, you don't want, in the translation, five different translators to have chosen different creative ways to translate "dialogue box" into Slovenian, or something.
So precisely by having these kinds of translation memories, you can standardize on a single way to translate things. The exception to that, which has effectively been the enormous growth area that's revived a lot of MT technologies, is that there's now this new concept of user-initiated translation. So I think the traditional model is that really the only thing that existed was author- or producer-generated translation, and if you're the author or producer, notwithstanding some of those appliance manuals you remember from the 1980s, by and large, you don't want to be producing manuals that are clearly horrible and make you look bad. And so you want a certain quality of translation. But if you're just a user looking at something that you can't understand at all, you're pretty happy to have something that kind of gives you the gist of it, even if the translations aren't perfect. And so that's led to this resurgence of medium- to low-quality machine translation that appears a lot on the Web.
And I think I sort of said a bit about some of the future things. So I'll stop there. Okay. So moving ahead now, for sort of the remainder of the course, I guess, what I'm wanting to do is start talking about different natural language processing topics, which start to move up from fairly surface-level processing to starting to do deeper processing, which can be of a syntactic kind, like parsing. But a lot of it can be thought of as starting to understand more of a text so you can do more with a text. And so the lowest levels of starting to understand some bits of a text that have been explored quite a lot are these tasks of named entity recognition and information extraction. So what I'm gonna do today is motivate those tasks and say a little about a couple of ways that they were approached early on. But let me motivate them in a slightly discursive way by talking about some negative examples. So we have this technology of natural language processing, and we want to do something with it. And there's the Web out there, so an obvious thing to say is, well, can't we use NLP to improve Web search? And if you think about that for two minutes, the answer is, of course you can. There are all kinds of places in which you could use at least low-level NLP technologies to do a better job with word sense disambiguation. The most overused example of word sense ambiguity on the World Wide Web is the word "jaguar," because it has all of these different meanings that are common in the computer world. So there's the big cat meaning. There's the version of Apple's Mac OS. There are Jaguar cars. And there are all sorts of miscellaneous other littler ones as well. I'd say there's a molecular geometry and solvation energy package that's called Jaguar. And you get those same kinds of word sense disambiguation problems with people. And this happens more and more as greater and greater numbers of people have blogs, home pages, etc.
So one name, like Michael Jordan, will refer to lots of people. And then you also have the reverse problem, with synonyms: if you search for "laptop," you don't find references to "notebook." So there are lots of places in which you'd think a little bit of NLP would help. And there are then slightly higher-level places in which NLP might help more, so people point to various semantic-search kinds of applications. So if you're wanting to do something like "clean espresso machine portafilter," if you kind of know that "espresso machine portafilter" is the object of "clean," and you want to make sure you search for that relationship, maybe you'll get more precise searches. And it sort of seems like there are some technologies that you can apply to some of these problems. So word sense disambiguation is certainly a technology that people have worked on a ton. And there are fairly good word sense disambiguation systems. Learning synonyms, or just listing synonyms, that's a technology that's been worked on quite a bit. And so at various times, there have been tons of companies that have, essentially, their mission statement being to bring NLP to Web search, saying traditional keyword search technology is hopelessly outdated. There were a whole bunch of them back around the bubble days around 2000, and there are some newer ones now. But it seems like, in practice, somehow this is kind of an idea that hasn't quite worked for people. And it's not that I think people should give up all hope of this, because in some sense there are places in which it should be useful, and I think in kind of little edges and corners of the search world, it is starting to be shown useful. But it turns out that some of these ideas kind of aren't as useful as an NLP person thinks, or aren't as easy to apply successfully as an NLP person thinks. So we can do things like correctly stemming words, getting singulars and plurals and things like that, in the kind of good linguistic way.
But it doesn't actually work much better in practice than doing it in crude ways. We can do word sense disambiguation fairly well, but it turns out it's not clear that we can do it well enough. So there's actually an early study, over a decade ago now, by Mark Sanderson, who's an IR person in England. Essentially, it was a user study of how accurate word sense disambiguation had to be before, on balance, users were happier with the system. And so it was attempting to do word sense disambiguation and give improved results based on it. And the crucial observation is that if you disambiguate wrongly, you lose big time. Whereas, if you don't do disambiguation and you show a smattering of results, a lot of the time people can just choose the one that they want. And so his result was that you had to have about 90 percent accurate word sense disambiguation before you could measure an increase in user utility from it. And the problem is, a lot of the time, somewhere between 80 and 90 percent is the kind of accuracy we can do word sense disambiguation with.
Okay. So, syntax. I'm very fond of syntax and parsing, and I'll talk about it for lots of lectures a few weeks from now. So surely having the structure of sentences, just as I was motivating for machine translation, should help for doing Web search. Maybe someday. I actually hold much less hope for that than for machine translation. I think syntax will be proved useful in machine translation much sooner than for Web search. It seems like, for most of the things that people want to do with Web search, statistical phrases, that same kind of technology that's been so successful for MT, where you're just taking any substring of words that you're finding distinctively often, has just been a very successful technology. So Web search engines have gotten a lot of value out of using various notions of statistical phrases, as well as when people overtly put double quotes around things. Because it turns out that, although people like you put double quotes around things, most regular people don't use the double-quotes operator. So you can get all the benefits, or nearly all the benefits, by recognizing covert phrases for them and treating things as phrases, even though they didn't type the double quotes. Or, more generally, various kinds of proximity weighting, where you're looking for search terms to appear near to each other. Okay. So in practice, if you kind of look at what's been driving a lot of the
progress in Web search, I couldn't honestly say that a lot of it has been high-level natural language processing; lots of other things, like anchor text, have been giving much more value. And there are kind of interesting ways to think about that. I mean, effectively, one of the big things anchor text is giving you is synonyms. Because people who are writing anchor text are writing alternative ways to express the topic of some page. So often they will use synonyms. So effectively, you're getting this very domain-specific, hand-written synonym or disambiguation technology, which is tending to be more useful. Okay.
So what should you do about that if you're a natural language processing person? I kind of feel like, if you want to be useful as a natural language person, it is a somewhat unwise thing to aim at to say, I'm going to build a better search engine. And probably what you should be trying to do is to come up with technologies that actually require more understanding. It turns out that doing document retrieval doesn't actually require very much understanding of what's on the page. If you chose other tasks, which actually require much more understanding of what's on the page, then it seems that there are many more opportunities to show that deeper NLP techniques actually deliver value. So this is a kind of an old slide now from Oren Etzioni, who's a CS prof at the University of Washington, who's done a lot of much more recent work on the KnowItAll project, which is a very interesting project for information acquisition from the Web. But this was a slide, I think it actually dates back to the very late '90s, that he made, where he wanted to suggest that things like search engines, they're the information herbivores, and what we should be doing is, on top of the information herbivores, building the information carnivores, which work at a much higher level of operation. And I think when you get to wanting to build information carnivores, maybe that's where NLP gets much more useful. Okay. And so I want to talk about a couple of examples of that. And one of the obvious places on the Web is trying to do much more in the way of product finding and comparison shopping in more automated fashions.
So it seems like a good higher-level task is: rather than me spending half an hour variously browsing around websites and doing Web searches and finding different pages and clicking through various ecommerce sites, why can't I just tell my computer to go and collect information for me on the prices and cost of shipping and handling, all of that, of different products? And actually, Oren Etzioni was one of the first people that tried to do this. He built the Jango shopbot, which was an automatic price comparison system. When these were first introduced, I think vendors hated them. But I think that's kind of changed now, because of the way that the Web has changed; these days it's more important for vendors to actually be found, rather than not found, than for them to worry about the comparison shopping angle. Here is a very old screenshot of Froogle. I will use this to illustrate
that information extraction isn't easy, but also to pose a question as to what the future is. So you can still go to Froogle. It doesn't look quite like this anymore. And you can type in your product name. I put in here "Lego fire engine." And it will give you some results. These results aren't very good. If you go and look at it today, the results are slightly better. But in the old days, if you kind of looked down at your results, you got to this green bar, where it says, "The results below were automatically extracted from web pages. Price and category information are uncertain." And below the green bar, they did information extraction. They went around Web pages and automatically tried to extract details about products. Now (with no internal information, just from what I can see on the Web) they just dropped this and got rid of it. And they just show the stuff up here. And what's the stuff up here?
The stuff up here is vendors give data to Google, in structured XML formats, and they just stick it up as the results, so you're doing the Web search over structured data, where you have product names and attributes and things like that. Well, that's a bit sad for an NLP person, because we'd be doing the part down here. But I presume why that was done was because the accuracy down here was just too low, and it wasn't perceived as being as good. So there's still plenty of work for NLP people to do to get good accuracy.
But I still think that there are actually very good prospects for doing automatic information extraction. Because the fact of the matter is that there's just tons of stuff where you have descriptive content, which isn't very structured, but you'd like to process it automatically. So you see that in a lot of the most popular sites. Something like Craigslist, they have a little bit of typing of the data, but nearly all of it is just written as unstructured text. Something like eBay has a little bit more typing of the data, but still, a lot of the details that you want to know are unstructured text. So the thing that's actually unclear is: one possibility in the world is everything will go the structured way, and the other possibility in the world is that a lot of stuff will stay in unstructured natural language text. And different people have different bets on that. So the much-hyped Semantic Web of Tim Berners-Lee is essentially a bet that people will find it worth their while to start turning things into highly structured data that makes it easy for computers to process. It doesn't seem to have really succeeded; I'm a little bit skeptical. My bet, and my livelihood, is that lots of stuff will stay down here, and someone will want to get information out of it at some point. So this is the question. Will human beings keep on writing their emails to their friends or their IMs or whatever other medium, saying, "Hey, Sue, can we have lunch tomorrow at noon?"
Or will they start filling out structured lunch-request forms and send them as structured data to their friends? So why is doing this information extraction automatically hard? It's kinda not hard for humans. We're very good at interpreting this kind of stuff, but it turns out that when you're a machine, there's a lot of stuff to notice here. So here we've got a product. What kind of a product is this? A car? Or a book? Yes, this is a book. Could you imagine a computer being fooled and not thinking it was a book? Yeah, because it has yo-yos and it says here "toys." But actually, it's a book. Of course, there are other good clues that it is a book, like "paperback." So working out what the person is selling, that's a categorization task. It's not necessarily tricky. Okay. Well, here's the title of the book. Nothing explicit says it's the title. It does kind of explicitly say "author" here in front of the authors, but for the title, you're just kinda meant to know, based on how it's placed and the fact that it's bold and so on. In particular, yeah, this could be a book title, but it's kinda hard to know if you're a computer. It's not so obvious what's the book title. And then we'd want to know how much this sells for. And among other things, note that even if you're clever enough to have your regular expression for U.S. dollars, there are two prices here that are in U.S. dollars. So you need to know enough about how to interpret the context, and where the buyer is, to be able to give them the right price in U.S. dollars. Okay. So this is the space of information
extraction systems. So the goal of information extraction systems is to say: we're not going to try and do full natural language understanding. What we have, instead, is a simple information need, and the simple information need you can think of as a database record. Or you can think of it as kind of attribute-value pairs. You've got several attributes for which you want to find values. So in the previous example, I want to know, perhaps, the category of the item, the name of the item, and the price of the item.
And that's all the information I need. All the rest is superfluous. I'm just getting out a simple record, and so I want to be able to run an information extraction system over a lot of pages of different kinds and fill in that database table as accurately as possible.
If you think about it, there are kind of tons of things that people would like to do that you can think of as basically information extraction problems. So maybe I want to collect company information from all of those forms they have to file with the SEC. So I want to find out, for a company, what's its earnings, what's its profit, what are all those other numbers that people report, or the book value or things like that. And I might want to get out different kinds of relations. So I might want to find out all the members of the board of that company, and things like that. There's been a lot of work on information extraction in the biomedical domain. So the sexiest part of the news has been doing genome sequencing. But if you just have a sequence of base pairs, that doesn't tell you anything about what they do. So if you actually want to know what they do, basically all of the information about what things do is contained in papers where people write natural language about experiments. So there's been lots of work in recent years on trying to mine from research papers information about what proteins enable some other protein to do something, or disable certain things going on, things like that. You're also increasingly starting to see this just in consumer operating systems.
So I guess here I've got Microsoft "Smart Tags," and you see exactly the same thing in the recent release of Mac OS X, where you see it in Apple Mail, where you get dates and things like that tagged. You see the same thing in Gmail. So people are starting to do various kinds of automatic recognition of that sort. Here's the very first information extraction project I ever worked on, which was back in 1999, a decade ago, in Australia. And this was for News Corporation, that well-known and much-loved media company. So, true to their name, News Corporation in Australia actually owns most of the newspapers, as opposed to most of the television stations. And so what they were wanting to do was take all of their real estate listings and be able to provide Web search over them. And at that time, at least, all of those real estate listings were just in plain text. But you know, still, if you look at something like Craigslist, most of what you see is in plain text. And I think this is just a nice example, actually, because with just plain text and plain text searching, you kind of can give very little value with something like real estate or other forms of classified ads. Whereas if you can actually do information extraction over these, and semantically interpret the information, there are all sorts of interesting things that you could do. So in this application, it extracted out addresses. It could use those to geocode to locations. It extracted out open house times. I think around here, every open house is from 1:00 to 4:00 p.m. on Saturday, so there's not very much information in open houses. But in Australia, all the places just had different random open times on the weekend, so it could be useful, if you're looking at a bunch of places, to fill in calendar entries of which order to do them in. You could get out the price; you could get out various other requirements, like how many bathrooms you wanted. And it could give a really nice application, in which you could do semantic-level searching over these listings, and then interpret the results to show them in various nice ways that gave value to the user. Whereas, in a domain like this, just trying to do plain text search works absolutely horribly. And just to say a few sentences on that. So by
and large, the three biggest things that people want to search for in houses are where they are, how much they cost, and how many bedrooms they have. Those are the three big starting points. Now, the one that works best for text search is searching for a location. But it turns out that even location search doesn't work very well, for all sorts of reasons. First of all, there's normally lots of stuff in the listings that's misleading. So normally, real estate agents want to promote themselves in the listing, so they put information in about themselves, including where their office is.
So it'll say something like "Coldwell Banker," which isn't where the house is, it's where their office is. People use creative phrases like "only 45 minutes from Parramatta," which doesn't mean it's in Parramatta. They might mention multiple properties in a single ad. And then there's the reverse direction. You kind of really want geographical information, because commonly you want to search in an area that's bigger than a particular town or suburb. But this is really the best end of it. If you go on to other things, it just kind of gets worse. Because for money, you don't want a textual match, you want a range. But you also want more interpretation than that. People get quite creative in these ads. "Was $155K, now $145K": so you want to use this price. "Offers in the high $700s": that means $700 and something thousand. Whereas, if you have something like "rents for $270.00," in Australia that means $270.00 a week. It doesn't mean $270,000.00. Bedrooms: there are all of these creative ways that you can abbreviate bedroom. So you want to be, again, recognizing a number in the context of something like abbreviated bedrooms.
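As an illustrative sketch of that kind of interpretation (the patterns and conventions here are guesses for illustration, not the actual system's rules):

```python
import re

# Illustrative price normalizer for classified ads: prefer the current price
# over a crossed-out one, expand "$155K"-style abbreviations, and flag small
# amounts as weekly rents (the Australian convention) rather than sale prices.
def normalize_price(text):
    m = re.search(r'now\s+\$(\d+)K', text, re.IGNORECASE)    # "Was $155K, now $145K"
    if m:
        return ('total', int(m.group(1)) * 1000)
    m = re.search(r'high\s+\$(\d)00s', text, re.IGNORECASE)  # "high $700s" = $7xx,000
    if m:
        return ('total', int(m.group(1)) * 100000)           # lower bound of the range
    m = re.search(r'\$(\d[\d,]*)', text)
    if m:
        amount = int(m.group(1).replace(',', ''))
        return ('per week', amount) if amount < 5000 else ('total', amount)
    return None

print(normalize_price("Was $155K, now $145K"))   # ('total', 145000)
print(normalize_price("rents for $270.00"))      # ('per week', 270)
```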
So finding the things to extract is a big problem. A second big problem is canonicalization of product information. So these pages are from CNET, where I searched, a couple of years ago, I guess, for an IBM X31 laptop, in the days that they still made them. And here's sort of some of my results list. And what you see is, boy, they have a lot of X31 laptops listed. And if you look very carefully at the fine points, there is one of them that's different. This one's a 1.3 gigahertz model. But this one, this one, and this one are all identical. And this one, you kind of can't tell, because it has too little information. But effectively, what they're doing is failing to merge together things that should be treated as identical. So this is the problem of canonicalization, which is a big problem in all of these domains: recognizing things that are actually the same. And because they fail to do canonicalization sufficiently well, you kind of get crazy effects, because the price range here is different from the price range there, where they should have merged that information together. That's hard to do well, because people in natural language just describe things differently, because humans are so good at interpreting it. So here's a digital camera website. And this is looking at their technical specs pages for different cameras. And what you can see is, for the resolution of the camera, they just use all kinds of different terms: image-capture device, image sensor, imaging sensor, total pixels. You're just kind of meant to know that somehow those are referring to the same thing. And what's kind of interesting is I took all of these off one manufacturer's website. It's not that different camera manufacturers use different terminology. Essentially, it sort of seems like it's just not worth the manufacturer having consistent terminology just for themselves, because, effectively, when humans are reading it, there isn't the cost. Because unless you're me, making an example for class, people just kind of look at the tech specs and they're happy, because they can see the resolution of the camera, and they don't actually have difficulties with these canonicalization issues. But our computers do. I'll skip that. Okay.
So the first problem that we're gonna start off with is the problem of named entity recognition. And we'll look at this a bit before we get back to high-level issues, like the canonicalization. So the idea of named entity recognition, which was kind of invented in the late '90s as a subtask of information extraction, was that we should have a separate component that starts with text and just picks out these things that we call named entities.
So "named entity" is kind of a term of art, but what's meant by named entity is something that is a name. It could be a person name, but it could also be a product name or a place. But there are these kinds of things with definite reference that have proper names, in some sense, that are called named entities. And then, having identified them, the second task we want to do is to classify them. So in this particular data here, there's a four-way classification at work: there are organizations, locations, persons, and miscellaneous. And if you have nothing else, kind of having person, organization, location are the first big three for information extraction. And so the idea here is that, well, kind of a lot of the key information in text comes with these named entities. And a lot of the higher-level things that you want to do with text often involve named entities. So a lot of the time
when people are asking questions, they're asking questions about named entities. So if they're saying, "Who starred in the 1993 cult film...?" (whatever cult films there were in 1993, I don't know), then the answer to that will be a person name, so you'll be looking for a person name as the answer. To give one more example, sometimes you can do quite well even with crude techniques that are nothing more than named entity recognition. So in the biomedical domain, a lot of the time people are wanting to study the task of protein interaction, which means that the presence of one protein affects another protein in some way. It might be up-regulating it or down-regulating it, but there's some effect between the two proteins. The kind of baseline method of doing protein-protein interaction is you simply look for sentences that mention two protein names, and the implicit assumption is that if two proteins have been mentioned in the same sentence, they must have something to do with each other. And so you simply just return pairs of proteins that occur in the same sentence, having done named entity recognition.
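A minimal sketch of that baseline, assuming some hypothetical NER step has already listed the protein names found in each sentence:

```python
# Baseline protein-protein "interaction" extraction: after NER, just emit
# every pair of distinct protein names mentioned in the same sentence.
from itertools import combinations

def cooccurrence_pairs(sentences):
    """sentences: list of per-sentence protein-name lists (hypothetical NER output)."""
    pairs = set()
    for proteins in sentences:
        for a, b in combinations(sorted(set(proteins)), 2):
            pairs.add((a, b))
    return pairs

print(cooccurrence_pairs([["p53", "MDM2"], ["BRCA1"]]))  # {('MDM2', 'p53')}
```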
And it actually turns out that that works not so badly. And so then, for all the kind of work on top of that, you essentially write your papers saying, "If I do this clever information extraction, syntactic parsing, and other stuff, I can get results that are 5 or 10 percent better than just looking for pairs in the same sentence." Okay. And so, essentially, these are the kinds of tasks that you'll get as a named entity recognition task. You're going through the words of a text. You might actually have extra information, like the parts of speech of the words, that we'll talk about later. And your goal is to identify entities. So here's an entity that's a person name. Here's an entity that's an organization name. Here's an entity which is another organization name. And a very common way to do this, which is what we'll do in the next class, is what are called sequence models, where, effectively, you're going through each token in turn and you're assigning a class to it. So I'm assigning a class to these tokens, and then you have one extra class that's kind of the background class, which I'm calling "O" for other, for all the words that aren't named entities.
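To make that concrete, here's a toy sketch of what per-token labels look like and how whole entities are read back off them (a real system would typically use a BIO-style encoding so that adjacent entities of the same type don't merge):

```python
# What a sequence model sees: one class per token, with "O" for background.
tokens = ["First", "Bank", "of", "Chicago", "announced", "earnings", "."]
labels = ["ORG",   "ORG",  "ORG", "ORG",     "O",         "O",        "O"]

def entities(tokens, labels):
    """Collapse per-token labels back into whole entities."""
    out, current, kind = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab == kind and lab != "O":
            current.append(tok)          # continue the current entity
        else:
            if current:
                out.append((kind, " ".join(current)))
            current, kind = ([tok], lab) if lab != "O" else ([], None)
    if current:
        out.append((kind, " ".join(current)))
    return out

print(entities(tokens, labels))  # [('ORG', 'First Bank of Chicago')]
```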
So I'll just spend a couple of minutes talking about evaluation of named entities. One way you could evaluate named entities is saying whether these individual tokens are right or wrong. But that's not the standard way, for a couple of reasons. Firstly, most of the text is background text that is other, in general. And it seems like you shouldn't really gain a lot of points for getting that right. And secondly, it sort of seems like you want to capture whole entities. So the standard way that people do things is to ask about whole entities, as to whether you're getting them out correctly. And so then evaluation is done in a similar manner to information retrieval, using the notions of precision and recall, which I imagine a bunch of you have seen, but let me quickly say. So you draw this kind of two-by-two contingency matrix. And so you have four cells here: the true positives, which are the things that were entities and the system returned them; the false positives, which are the things that the system said were a person name, but actually weren't; the false negatives, which are the things that are a person's name, but the system didn't return; and then this other box here is the true negatives. We don't worry about that for the evaluation. And we evaluate just based on these three cells. And so we work out precision, which is the proportion of the things that you returned which were correct things to return, and recall, which is the proportion of the things that you should have returned that you did return.
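In symbols, with tp, fp, and fn for the three cells just named:

```latex
\text{precision} = \frac{tp}{tp + fp},
\qquad
\text{recall} = \frac{tp}{tp + fn}
```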
Do those notions make sense? Can I rush ahead? Okay. And so that gives us two measures. And they're interestingly different measures. Quite a lot of the time we just want to combine them into one measure, and that's standardly done with something that everyone calls the F measure. And what the F measure is, is the harmonic mean, if you remember back to some high school class where you did harmonic means. So this is the way to see it as the harmonic mean. It's a weighted, a potentially weighted, harmonic mean. So you can kind of weight the two components of precision and recall, and then that's the harmonic mean formula. But normally people don't use that formula. Normally, people parameterize it differently, like this, and so you'll see references to beta equals three, or something, in the weighting of the harmonic mean. But the simplest case of the harmonic mean is when beta equals one, which corresponds to alpha equals one half, where you just weight precision and recall equally. And if you do that, and it's easiest to do the math there, what you end up with is that the F measure is two times the precision times the recall, divided by the precision plus the recall. And that's the measure we'll use later on in the third assignment. And the idea of the harmonic mean is that it's effectively a very conservative average. It means that to do well, you have to do well on both precision and recall. So the nature of the harmonic mean is that if the two values are very different, like if you have 90 percent precision and 10 percent recall, if you just did an arithmetic mean, you'd get 50 percent. Whereas, if you do a harmonic mean, you'll find that you get 18 percent. You're way down in the direction of the minimum.
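Written out, with P for precision and R for recall:

```latex
F = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}}
  = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}
  \quad\text{with}\quad \beta^2 = \frac{1-\alpha}{\alpha},
\qquad
F_{\beta=1} = \frac{2PR}{P + R}
% Worked check of the example above: P = 0.9, R = 0.1 gives
% F_1 = 2(0.9)(0.1)/(0.9 + 0.1) = 0.18, versus an arithmetic mean of 0.5.
```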
Okay. Precision and recall are widely used, very standard, and straightforward for tasks like information retrieval, where you just return documents. There's actually a subtlety in the way they work for information extraction, which is good to be aware of. In classic information retrieval there's only one grain size, and that's the document grain size. And so it's straightforward to draw that two-by-two contingency matrix. But when you have information extraction systems, you're picking out entities, which can be of different sizes. And it turns out that the most common mistake that systems make is that they get the boundaries wrong. So here the system has returned Bank of Chicago as an organization name, where what it should have returned is First Bank of Chicago as an organization name.
And you can kinda see how that's a very easy mistake to make, because the first word of a sentence is capitalized in English, so that's not much of a signal that it's a company name. And starting a sentence with "first," that's not unusual: first we did this, then we did that. So it's just regarding this as a common word, and then this is the name. So if you do an F measure evaluation of your named entity recognition system here, and it does something like this, you actually get dinged twice. You get dinged once for not returning First Bank of Chicago; that's a recall error. And then you get dinged a second time for returning Bank of Chicago, which is a precision error. And so, actually, you do better if you return nothing than if you return something that's almost right, because then you'd only be dinged for the recall error. You could conclude that that's kind of broken, and I think many people think, in a sense, it is kind of broken. And there are various more complex scoring systems that have been used, that have partial credit for things that are almost right. The problem with those partial credit systems is you then kind of have to have extra parameters and models as to how much credit to give for being some percentage right. If it's a six-word company name and you return one of those words, is that worth anything or nothing at all? And so a lot of the time, people don't worry
about it and just use the F measure. Okay. So if you're a computer scientist and have been around a bit, and you're thinking about information extraction, the first thing you should think about is regular expressions. You can write regular expressions for some of these things. So if you want to recognize prices for things, that's the kind of thing you can write a regular expression for. If you're wanting to recognize phone numbers, you can start writing regular expressions for that. And if you start thinking more creatively, there are a lot more things that you can write which are effectively regular expressions. In particular, for some of the higher-level things that you want to do, you might want to use parts of speech, so that you want to recognize proper nouns, or you want to recognize prepositions before them, or something like that. But anything that's a dictionary or lexicon is really just equivalent to a big regular expression. So if you have a word list, here's a long list of known digital camera manufacturers, that's just 100 percent equivalent to a regular expression which is name one, vertical bar, name two, vertical bar, name three, name four. All name lists are equivalent to big disjunction regular expressions.
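A few illustrative examples of all three in Python (the word list is hypothetical):

```python
import re

# Illustrative low-level extractors of the kinds just described.
price_re = re.compile(r'\$\d+(?:\.\d{2})?')              # e.g. "$249.99"
phone_re = re.compile(r'\(\d{3}\)\s*\d{3}-\d{4}')        # e.g. "(650) 555-1212"

# A lexicon is just a big disjunction: name1|name2|name3|...
camera_makers = ["Canon", "Nikon", "Olympus", "Pentax"]  # hypothetical word list
maker_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, camera_makers)) + r')\b')

text = "Canon PowerShot, $249.99, call (650) 555-1212"
print(price_re.findall(text), phone_re.findall(text), maker_re.findall(text))
# ['$249.99'] ['(650) 555-1212'] ['Canon']
```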
So the early work in named entity recognition and information extraction was, essentially, people hand-writing rules, which started at the low level with regular expressions, but then made use of various kinds of syntactic categories, like parts of speech, and even syntactic parsing, to work out patterns of words and then to build patterns for extraction. And so people would write patterns; I have a couple of better examples later of patterns to recognize various kinds of things. And so this was actually where information extraction technology began. Information extraction, like the Internet, was invented by DARPA, though it's not quite as famous. And so DARPA, in the, well, early to late 1990s, sponsored a whole series of workshops called the MUC workshops, the Message Understanding Conference workshops. And the idea of them was to try and get people to build systems that did information extraction tasks. And essentially, that was a task that no one had considered before then. And then it was really as that task progressed that people invented the even lower-level task of, maybe before doing full information extraction, we should build good systems that do named entity recognition. And it was during those years, at the start of the MUC competitions, that essentially all of the systems were hand-coded and grammar-based. And as the MUC competitions went on, statistical methods started to infiltrate and to show that they could do these things better. So here's a classic example of a MUC system. This was the system
FASTUS that was done at SRI, which is over in Menlo Park. And FASTUS was, perhaps, a
classic example of pattern matching system for doing regular expressions, so for doing
information extraction. So the whole of it was cascaded regular expressions. So this
is the sort of task that you wanted to do. You had the piece of text at the top and you
wanted to do information extraction, where you were filling out these kind of templates.
So here you were having things like a relationship between two companies, what the companies
were, and what the amount was. And then it had some structure here, so this table points
over to this table. And this is saying what the activity was, itˇs production activity.
Hereˇs the company. Thereˇs the product. And thereˇs the start date. So you have this
somewhat complex database design that you want to be extracting facts into. And there
are various bits of text that you want to match up. So effectively, what FASTUS did
was very complex cascaded finite automata, i.e., fancy regular expressions, where the
fanciness is that youˇre allowed to have recursion in your regular expression so that
you can write a base regular expression for a phone number, and then you can write more
complex regular expressions that can use phone number as a subroutine. And so if would have
regular expressions and lots of lexicons to be, actually, trying to parse up quite a bit
of syntax. But this is the kind of thing that was actually being built. So that you were
building noun groups up by using simple regular expressions, where a noun group could be a
proper noun, where you could stop or then go with a possessive. Or it could be an article
followed by a noun, where you could stop. Or you could have a preposition and another
noun. Youˇre building up various bits of syntactic structure by doing regular expressions,
in terms of lower level things of parts of speech, which could be recognized as a big
list and regular expression. Okay. And so these are the kinds of patterns that you can
write. And thereˇs some other work that I wonˇt talk about that then goes on to learn
these kinds of patterns.
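To make the cascading idea concrete, here is a minimal sketch in Python regular expressions, not FASTUS itself, whose grammars were vastly larger; the little lexicons are invented, and the sentence is just in the style of the classic MUC joint-venture texts. The point is that a lower-level pattern like a noun group gets a name and is then reused as a subroutine in a higher-level pattern:

```python
import re

# Level 0: word-class sub-patterns (tiny illustrative lexicons,
# standing in for much larger part-of-speech word lists).
ART    = r"(?:the|a|an)"
NOUN   = r"(?:company|venture|plant)"
PROPER = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"

# Level 1: a noun group built from the level-0 pieces --
# a proper noun (optionally possessive), or an article plus a noun.
NG = rf"(?:{PROPER}(?:'s)?|{ART}\s{NOUN})"

# Level 2: a higher-level pattern that reuses NG as a subroutine.
SET_UP = re.compile(rf"({NG})\s+set\s+up\s+({NG})")

m = SET_UP.search("Bridgestone Sports set up a venture in Taiwan.")
if m:
    print(m.group(1), "--set-up-->", m.group(2))
    # Bridgestone Sports --set-up--> a venture
```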
So if the relation that you want to extract is "where is an organization located," and you can match organizations and locations, you can then write patterns that are kind of higher-level regular expressions. So [org] in [location], or [org] followed by [location] and then a word like "division," "branch," or "headquarters." That will work fairly well. If you want to determine who holds various positions, you can have patterns like [person], [office] of [organization], which would match text like this. And so people, essentially, would, in the early days, handwrite these patterns to match common things, and then there was some other work on automatically learning these patterns.
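As a sketch of those two relation patterns in the same style, with trivial stub recognizers standing in for real [org], [location], [person], and [office] matchers, and with all the names invented:

```python
import re

# Stub matchers for illustration; a real system would plug in
# proper organization/location/person recognizers here.
ORG    = r"[A-Z][A-Za-z]+(?:\s[A-Z][A-Za-z]+)*"
LOC    = r"[A-Z][a-z]+"
PERSON = r"[A-Z][a-z]+\s[A-Z][a-z]+"
OFFICE = r"(?:president|CEO|chairman)"

# [org] in [location]
located_in = re.compile(rf"({ORG})\s+in\s+({LOC})")
# [person], [office] of [organization]
holds_office = re.compile(rf"({PERSON}),\s+({OFFICE})\s+of\s+({ORG})")

print(located_in.search("Analysts visited Apex Corp in Dallas.").groups())
# -> ('Apex Corp', 'Dallas')
print(holds_office.search("Jane Doe, CEO of Apex Corp, resigned.").groups())
# -> ('Jane Doe', 'CEO', 'Apex Corp')
```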
Okay. So then for the last few slides today, I just wanted to start to introduce the very beginnings of how one might do that in a more probabilistic way, and then come back to it next time. First, a question: has everyone seen naive Bayes classifiers, or are there people who have not seen naive Bayes classifiers in some context? Nobody? Has everyone seen them? I hope so. If not, you should definitely have a look at material on naive Bayes classifiers. But I will say it very quickly. So a naive Bayes classifier is going to be using Bayes' rule, like everything else we do, and what we want to do is find the class based on some attributes. We apply Bayes' rule and invert things around, and say we have a prior probability of classes, and then we're using the probability of the attributes given the class. And the naive bit comes in by saying, well, we can't possibly estimate this as a joint probability, so we make a conditional independence assumption, where we pretend each attribute is independent conditioned on the class, and so we just use estimates of this form.
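Written out, with c ranging over classes and x_1, ..., x_n the attributes, the model just described is:

\[
\hat{c} \;=\; \arg\max_{c} P(c \mid x_1,\dots,x_n)
\;=\; \arg\max_{c} P(c)\,P(x_1,\dots,x_n \mid c)
\;\approx\; \arg\max_{c} P(c)\prod_{i=1}^{n} P(x_i \mid c)
\]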
Okay. Now let me say the NLP-specific part. For NLP, what we usually do is say that the x_i are the words in a document. So we treat them as independent, but they are the actual word tokens that occur in the document. And if you remember back to our language model lectures, that naive Bayes model is neither more nor less than a class-conditional unigram language model. In particular, I won't dwell on this, but there's another way you can make a naive Bayes model for text, where you take the x variables to be word types. So you have a variable for house and a variable for garden and a variable for travel and things like that. You can do that, and if you paid attention in 228, it might seem the more natural way to do it. But it actually doesn't work as well for just about any NLP application. So think of it as a class-conditional unigram language model, and you won't go astray.
Yeah?
Student: Where c is class?
Instructor (Christopher Manning):Yeah, c is the class that we're gonna assign things to. And if you think of them as language models, you'll hopefully remember all of the stuff that we did before: you'll want to smooth them, and you'll probably want to work in log arithmetic so that things don't underflow.
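Putting those pieces together, here is a minimal sketch of such a classifier in Python, as a class-conditional unigram language model with add-one smoothing, scored in log space; the toy training data is invented:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (class_label, list_of_tokens) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for c, words in docs:
        class_counts[c] += 1
        word_counts[c].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(tokens, class_counts, word_counts, vocab):
    n_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / n_docs)   # log prior
        total = sum(word_counts[c].values())
        for w in tokens:
            # add-one smoothed class-conditional unigram probability
            p = (word_counts[c][w] + 1) / (total + len(vocab))
            score += math.log(p)
        if score > best_score:
            best, best_score = c, score
    return best

# Toy example (invented data):
docs = [("coa", "my new email address is".split()),
        ("coa", "please update your address book".split()),
        ("other", "meeting moved to friday".split())]
model = train(docs)
print(classify("new address".split(), *model))  # -> 'coa'
```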
So if you have one of these little classifiers, can you already do some kind of useful information extraction? And it turns out that you can. This was a paper by Nick Kushmerick in the '90s: what he wanted to do was build a system that detected change-of-address emails, so he could automatically update his address book. So here's a change-of-address email. And he built it as a two-stage text classifier. The first stage was binary text classification, kind of like spam or not spam, but his classification was: is this a change-of-address email or is it not? And then the second stage used a regular expression to detect candidate email addresses, and then it used a window around each regular expression match, where these windows are denoted in red. And each of those windows around an email address was then classified by a text classifier to say: is this the new email address in a change of address, or is it not? So what you want to get out is that this one isn't the new email address, and this one is. And the clues for that are in the context, things like "new email address is." Pretty good clues. And he did that and got quite good results. You can see the results there. They're not perfect, but they're quite good. So for some simple kinds of information extraction, you can just treat it as a text classification problem, use text classification technology, and you're done.
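Schematically, the second stage might look like the sketch below; the email regular expression and window size are illustrative guesses, not Kushmerick's exact choices, and each extracted window would be handed to a text classifier like the naive Bayes sketch above:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def candidate_windows(tokens, k=3):
    """Yield (email, context_window) for each candidate address,
    where the window is the k tokens on either side of the match."""
    for i, tok in enumerate(tokens):
        if EMAIL.fullmatch(tok):
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
            yield tok, window

tokens = "my new email address is jane@example.org not jane@old.example".split()
for email, window in candidate_windows(tokens):
    # each window would be fed to the stage-two text classifier,
    # e.g. classify(window, *model) from the sketch above
    print(email, window)
```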
But note crucially, and this motivates what I'll do next time, this only works in situations where the entities are obvious. So he could just use a regular expression to identify email addresses, and then you can do a text classification problem, looking at the context to say, is this the person's new email address? Whereas, typically, when we're doing named entity recognition and information extraction, we can't use a regular expression to identify the content. That's part of the problem that we want to solve probabilistically. And so that leads us to use more complex methods, which leads into the sequence models that I'll talk about next time. Okay. Done.
[End of Audio]
Duration: 76 minutes