NaturalLanguageProcessing-Lecture01
Instructor (Christopher Manning):
Hi, everyone. Welcome to the first class of Stanford's CS224N, which is an intensive introduction to natural language processing, concentrating primarily, but not exclusively, on using probabilistic methods for doing natural language processing.
So let me just say a teeny bit about the structure of the course, and some of the administration.
If you want more information on any of these things, the main thing to do is go to the
website, which is cs224n.stanford.edu.
So I'm the instructor, Christopher Manning, and there are two TAs this quarter, Paul Boundstark and Dejuan Chang. The lectures right now are Monday/Wednesday 11:00 to 12:15, being broadcast live. Most weeks there'll be a section on Fridays from 11:00 to 12:15. So for the handouts for today, there's the course syllabus, there's the first lecture, and then, most importantly, I'm handing out the first assignment already today, and I'll say a bit more about that as the class proceeds. But this would be a good time, over the weekend, to look at the first assignment and check that you know how to do all the kinds of things that you need to be able to do to get productively working on the assignment next week in particular.
So for this class there are three programming assignments that we specify for you, and then there's a final project, and most of the grade is that work. In addition to that, there are going to be just a few percent on weekly quizzes, just to check that people are vaguely keeping up with all the other topics. But we're seeing this as primarily a project-based learning kind of class, where a lot of the learning goes on in doing these projects.
And so for the first three projects, all three of those projects have a lot of support code for them to make it easy for you to do interesting things in the two weeks you get to do them in, and all of that support code is written in Java. In particular, it's Java 1.5, for generics and stuff like that. So basically, a requirement for this class is that you can handle doing the Java programming. And so hopefully you've either seen some Java before, or it won't be too difficult to kind of come up to speed. Java's obviously not that different from other languages like C++ or Python in the way most things work. And, you know, it's not that we're heavily using system libraries beyond a few basic things such as collections like ArrayLists.
For most other information, see the course webpage. The little note on the bottom was that I surveyed people on whether they actually want paper handouts of everything, or whether they're completely happy getting everything off the webpage. And it seems like about 2/3 of people are perfectly happy just to deal with stuff electronically. So what we're going to do is we're going to print just a few copies of handouts, but not enough for everyone that turns up.
Okay, so I started talking a little bit about the class. I mean, in some sense this class is sort of like an AI systems class, in that there's less of a pure building up of coherent theory from the ground up in a concerted way. And it's more of a class that's built around, "Okay, there are these problems that we want to deal with in understanding natural language, and what kind of methods can we use to deal with those problems? And how can we build systems that work on it effectively?"
So there's a lot of hands-on doing things in assignments, working out how to get things to work kinds of issues, and one of the things that you'll find doing that is that often practical issues, and working out how to define things right, and making sure that the data's being tokenized correctly, that a lot of these things can be just as important as theoretical niceties. And I think that's true of the field of NLP as a whole as well: if you look at papers in the field, papers tend to emphasize their key research idea that is cool and novel, but in practice the kind of systems people build for the experimental results of the papers, yes, they contain that idea, but they also contain a lot of other hard work getting all of the details right.
And so that means that in this class we're going to sort of assume that people have some background, and can exploit knowledge of a bit of linear algebra, a bit of probability and statistics, and that they have decent programming skills. Now obviously not everyone has exactly the same background and skills. Some people know a ton about probability and machine learning, others not so much. So we're kind of hoping that as you go along you can pick up things that you don't know or are rusty on, and everyone can do well.
And I think in practice that works out fairly well, because even people who've done a lot of machine learning often haven't seen so much of getting things to work in practical contexts. And so the class tries to do a mix between teaching the theory, and actually learning techniques that can be used in robust practical systems for natural language understanding.
Okay, so where does the idea of natural language understanding come from? The idea of natural language understanding is basically as old as people thinking about computers, because as soon as people started thinking about computers, and thinking about robots, they thought about wanting to communicate with them, and well, the obvious way for human beings to communicate is to use language. [Video playing]
I'm sorry, Dave. I'm afraid I can't do that.
Instructor (Christopher Manning):
Okay, so there's my little clip of HAL from 2001, but I mean, actually, the idea goes much further back than that. If you go to the earliest origins of science fiction literature, you can find in 1926, Metropolis, that you have False Maria who was a nonhuman robot, and well, not surprisingly, you could talk to False Maria. And that's just a very natural interface modality for human beings to think about.
So in general the goal of the field of NLP is to say that computers could be a ton more useful if they could do stuff for us, and as soon as you want computers to do stuff for us, well, then you notice that a lot of human communication is by means of natural language, and a lot of the information that is possessed by human beings, whether it's Amazon product catalogs, or research articles that are telling you about proteins, that that's information in natural language.
So computers could be a ton more useful if they could read our email, do our library research, chat to us; all of these things involve dealing with natural language. But it's precisely at that point that there's this problem that, well, computers are fazed by human languages. They're pretty good at dealing with machine languages that are made for them, but human languages, not so.
Now, maybe that's not the computer's fault. It's really the programmer's fault, and partly that's because natural languages are hard, as I'll discuss, and partly it's because a lot of computer science, and computer science thinking, is directed in a direction that's very different from how natural languages are. So computer scientists generally tend to want precision, or precisely specified, unambiguous APIs and meanings for things, and natural languages aren't like that.
So one reaction to that is to say, "Well, natural languages just aren't the right thing for computers." And so that's actually been a dominant strand in the computing industry. So what people do is find different ways of doing things, whether it's using XML, the idea of the semantic web, or designing GUIs with things like menus or drop-down boxes, where effectively what we're saying is, "Look. The computer just can't deal with the kind of stuff that humans produce, and how they naturally interact."
And so instead, what we're going to do is we're going to substitute different means of communication and interaction which are easy for computers. And well, to some extent that has been a very successful strategy, because things like GUIs and XML have been very successful, but the basis of that success is essentially accepting that you can't get computers to do clever things, but noticing that it's very easy to get human beings to adapt and do clever things in different ways. And so we're exploiting human cleverness rather than working out how to have computer cleverness.
And so eventually it seems like what we want to do is actually to work out some of the hard problems, and work out how we can get computers to be able to understand, and process, and produce human languages, and perhaps to learn them just in the way that two-year-old kids somehow manage to listen to this linguistic signal and learn languages. And so in this course we'll talk about some of these open research problems in natural language.
So what kind of things do people do in natural language processing? The goals of NLP can be extremely far reaching, so there's the idea that you should just be able to pick up any piece of text and understand it. You should be able to reason about the consequences of those texts. You should be able to have real-time spoken dialogues between you and your robot. At the other end, the goals can be very down-to-earth. So you might want to do context-sensitive spelling correction. That was picked out by Walt Mossberg in the Wall Street Journal as one of the three biggest innovations of Office 2007/2008. That's a very simple little bit of natural language processing. You might have other low-level goals such as working out how much different products sell for at different retailers on the web.
A funny thing about the field of natural language processing is if you look at how the field has moved, that in the '70s and the '80s there was a lot of interest in cognitive science, and people spent a lot of time working on those far-reaching goals of how can we do complete text understanding? Well, if you look at natural language processing in the past decade, really, a vast amount of the work has been doing these very low-level problems like finding prices of products, and doing context-sensitive spelling correction. You might think that that just means the field has been going backwards, or making no progress. I think actually it is a sign of progress, because back in the '70s and '80s natural language understanding wasn't actually practical for any real applications. Whereas now things are in a state where there are actually real, practical, deployed applications where people can use natural language understanding, and so that's encouraging people to work on that end of the spectrum where there are actually things that you can use.
Okay, but something I do want to emphasize about what we're going to focus on in this course is that we're going to focus on stuff that is in some sense going beneath the surface of the text, and involves some understanding of the text. So starting with either speech or text input, there's the most superficial level, right? So there's an audio signal and you can record it, and you can play it back. And when you're playing it back, you're playing back human speech, if that's what you've recorded. But you've done no kind of processing or understanding of any of that.
Similarly, for text format input, you can do this very superficial level of processing where you cut your words at white space, and then you maybe index them in an information retrieval system, like the very kind of base level of web search, and then you just kind of match words, and you give stuff back. And again, it seems like you're not really doing any understanding of natural language. And while that's in some sense the very lowest level of natural language technologies, I'm really going to be interested in this class in how we can start to do things which don't fully understand the text, but somehow go beneath the surface and start to recognize certain parts of the structure and meaning of human language text, and can do cleverer things because of that. And you start to see that happening even these days in even sort of effectively basic technologies like information retrieval and web search, where once upon a time you just cut out the key words, and you indexed those key words. Whereas now all the search engines are increasingly starting to do things like, how can we have different forms of the word count as equivalent? How can we do query expansion to find or turn [inaudible] related terms, and things of that sort.
Okay, well, what is it that's holding back the field of natural language understanding? Here's a quote that I've always rather liked. It's from Bill Gates, who's actually always been an extremely big supporter of natural language understanding as part of his general interest in using natural language interface technologies. So Bill Gates says, "Applications always become more demanding. Until the computer can speak to you in perfect English, and understand everything you say to it, and learn in the same way that an assistant would learn, until it has the power to do that, we need all the cycles. We need to be optimized to do the best we can. Right now, linguistics are right on the edge of what the processor can do. As we get another factor of 2, then speech will start to be on the edge of what it can do."
Well, that sounds like a nice story, and the story that's being told is, well, just give us a few more CPU cycles, and everything will work fine. But the problem is, if you then look up at the top of the slide, these are actually remarks that Bill Gates made in 1997. And if you can remember back to 1997, I think that was maybe when what people had was original Pentiums that ran at 166 megahertz, or maybe Pentium Pros at 200 megahertz had come out. I mean, either way you look at it, since that time computers have gotten an order of magnitude faster in terms of their CPUs, and we now have our dual-core and quad-core chips, and all of these things. So it's sort of fairly easy to be getting two orders of magnitude of performance compared to what we had then. But somehow, that hasn't solved all of the problems of natural language processing. Oftentimes it still seems like we're just on the edge of what computer technology can do. Natural language processing and speech recognition sort of work, but still don't work quite well enough to satisfy what people need out of them. And so I think one of the important things to realize from both this lecture and the class as a whole is just to really actually get some sense of why this problem is hard. I mean, some things we can do really well with computers these days. Right, the visual effects that you see in your Hollywood videos are completely amazing, and amazingly realistic, and lifelike. So why isn't our ability to do human language understanding equally as good? And I think one way to think about that is to think about the history of NLP, what kind of computers people had, and where things headed from that.
So if you look at the early history of NLP, NLP essentially started in the 1950s. It started just after World War II, in the beginning of the Cold War. And what NLP started off as is the field of machine translation: can you use computers to translate automatically from one language to another language? Something that's been noticed about the field of computing, actually, is that you can tell the really old parts of computing because the old parts of computer science are the ones that have machine in the name. So the Association for Computing Machinery is a really old organization, and you have finite state machines. They were things that were invented a long time ago, and machine translation was something that was studied a long time ago. Whereas something like computer vision, or graphics, there's no machine in the name, because really those were fields that didn't even begin until decades later.
So the beginning of NLP was essentially in this Cold War context where the Soviet Union and the United States were both convinced that the other had scientific advances that would allow them to dominate the world. And the problem was they wrote about them all in their own respective languages, and so both the Soviet Union and the United States started pouring huge amounts of money into this goal of machine translation, to be able to translate from one language to the other language.
And at the beginning, there was a lot of excitement, and I'll show you that in just a moment. But that excitement soon turned to disappointment. Maybe I should show you the excitement first.
[Video playing]
So this is a video from the 1950s talking about early work on machine translation.
Instructor (Christopher Manning):
So I'll stop there, and I'll talk about why it runs into trouble in a moment, in the context of natural language understanding. But I mean, in retrospect it's not very surprising that this early work in the 1950s didn't go a long way. When you think of the kind of computers that were available in those days, they had far less computing power than what you now have in your cell phone, or even your average pocket calculator. And so one part of it was that there was really just about no computing power. But the other part of it is that there was almost no understanding about how human languages worked. It was really only in the late 1950s and early 1960s that work started to be done on understanding the structure of human languages, in terms of how words are put together into sentences.
So if people in computer science know the name Noam Chomsky, they generally know it for two reasons. One reason is because of his political take on the world, which I'm not going to discuss here, and the other reason is because of the Chomsky Hierarchy. And so the Chomsky Hierarchy gets taught as an understanding of levels of complexity of formal languages, and that's related to things like building transducers, and compilers, and things like that. But Chomsky didn't actually develop the Chomsky Hierarchy for the purpose of making your life grim with formal languages and automata classes. The reason that the Chomsky Hierarchy was developed was so that he could try and get a handle on, and present arguments about, what the complexity of the structure of human languages was. He was wanting to argue that the structure of human languages was clearly beyond that of context-free languages, whereas most of the models of language that were actually being built in the '30s, '40s, and early '50s were actually modeling language as finite state, and were therefore [inaudible] or completely inadequate for the complexity of human languages.
And so there was really just neither the science nor the technology for this work to be able to go far. So if you look at what was actually produced, what there was was essentially word lookup and substitution programs. So you'd take your sentence in Russian, you'd look up the words in the dictionary, and replace them with the first word that you saw in the dictionary as a possible translation equivalent. And as we'll see when we start talking more about machine translation later today, that doesn't work very well. There's a lot more going on in languages that makes them difficult.
And so after a while people noticed that the problem seemed intractable, and so there was this famous report that was issued by the U.S. government in the 1960s, the ALPAC report, which essentially cancelled all work on machine translation research in the United States, I think rightly arguing that, given the technology and science that existed at the time, this was just a hopeless endeavor, and essentially recommending that what should instead be done is the promotion of basic science, so that people actually had a much better understanding of how human languages worked.
Okay, so let me go on a little into talking about this question of why natural language understanding is difficult, and essentially the point that I want to bring home is that natural language understanding is in the space of what people sometimes call inverse problems. And the idea of inverse problems is that there's stuff that you see on the surface, and then you have to construct the stuff that underlies it, and doing that is always the hard problem. So the analogy is to go back to my example with movies, where there are two twin problems. There's computer graphics, where you want to generate good graphics of a scene, and there's computer vision, where you actually want to be able to look at a scene in a photo, and understand what's in it as a three-dimensional model in your head.
So computer graphics is the direct problem: you have a model of what you want on the screen, and then you realize that in graphics. And direct problems are the easy problems. Computer vision, that's the inverse problem, where you're just seeing the surface picture and you want to reconstruct what objects there are in that scene. And that's still a really difficult problem. And in just the same way, natural language understanding is an inverse problem: you're going from a speech signal, or a sequence of words, and you're trying to reconstruct the structure of the sentence and the meaning of the sentence that underlies it. And the reason that that's difficult is that natural languages are extremely, extremely ambiguous. So here's a fairly straightforward sentence. My example here is a few years old, but every year the Fed is either raising or lowering rates. So we have this sentence, "Fed raises interest rates half a percent in effort to control inflation."
Well, you know, that's just a sort of obvious sentence. That's the kind of one that, if you look at the financial section of your newspaper, you read every year. It's not hard to understand. For a human being it doesn't seem like there's any problem there. It doesn't seem like there's any ambiguity at all if you just read it casually. But if you actually start looking at the sentence more carefully, there are tons and tons of ambiguities. So there's the word "rates." That can be either a noun or a verb, so it can be, "She rates highly," or "Our water rates are high." Noun or verb. The word "interest," that can be a noun or a verb. You can say, "Japanese movies interest me," verb. Or you can say, "The interest rate is 8 percent," noun. "Raises," that can be a noun or a verb as well. So it can be "Fed raises," as a verb, or "The raises we received were small." That can be a noun. So you notice this fundamental fact that in English a ton of words can be nouns or verbs. I mean, very generally you can take nouns and turn them into verbs. So you have a noun like butter, and you can turn that into "butter your bread." You have a noun that's a protocol, like SSH, and then you can immediately turn that into a verb, and say, "I SSH'd the file to you."
So that kind of already introduces a ton of ambiguity. And so what does that mean? Well, what that means is that there are a ton of different syntactic structures that we can build because of that noun/verb ambiguity. So if you combine the two facts that raises, interest, and rates can be nouns and verbs, and the fact that you can make noun compounds in English by just sticking nouns together, you get a lot of ambiguity. So noun compounding is things like computer network, where you just stick two nouns next to each other to build a bigger noun. Very productive in the computer industry. That's why when you read the manuals you have this sort of personal computer power supply enclosure; that's just all compound nouns where you're sticking nouns together. So here's the structure that we're meant to be getting, where we make interest rates the compound noun in Fed raises interest rates. But combining those two facts together, you could instead make Fed raises a noun phrase as a compound noun, with interest as the verb. Or you could have Fed raises interest as the noun phrase, and rates as the verb. And not only are these sentences syntactically well-formed, you can tell a sort of semantic story for them as well. It's not that you can't come up with a semantic translation for these.
So this is the normal one, where the Fed is raising interest rates. For this one, I could tell a story that there was an article in yesterday's paper that the Federal Reserve Board had decided to award everyone on the Board, i.e. themselves, a large pay raise, and then you could have a headline that says, "Fed Raises Interest Rates," meaning that the market rates were going down, or up, because people thought policy at the Fed was lax, or the Fed was corrupt, or something like that. And then once you have that idea, you could go to this one, and now we've got something like the Nielsen ratings of television, and one of the stations was showing a news item about these Fed raises. And so Fed raises interest is then the level of interest in raises at the Fed, and that rates only half a percent viewership; almost no one watched it. So it's a bit of a stretch, but it's not that you can't tell a semantic story about how those fit together as well. So those are all the kinds of ambiguities that our computer systems will find. Somehow as human beings, even as we just read or hear the words, we've got all of this kind of highly contextually based plausibility deciding going on in our heads at the same time, so that most of the time we just never notice those ambiguities; we just kind of go straight to the most plausible interpretation and stick with that.
So that's one kind of ambiguity. There are lots of other kinds of ambiguity, so just to mention a few more. At the word level, as well as having these syntactic ambiguities of noun or verb, you also get semantic ambiguities, so-called word sense ambiguities. So the word Fed here refers to the Federal Reserve Board, but the word Fed is also used for an FBI agent, so "A Fed trailed me on my way home." It can also of course be the verb that's the past tense of the verb to feed. Interest, that's a word with tons of meanings, so it can mean something that you like: "My main interest is collecting mushrooms." It can be that you have a stake in a company: "He has a 5 percent interest in Vulcan Holdings." It can be a rate that you're paid, the 8 percent interest rate. Lots and lots of meanings for the word interest.
But then there are also ambiguities above the word level. So towards the end of the sentence you have these phrases "in effort" and "to control inflation." So "in effort" is a prepositional phrase, and then you have this infinitival phrase "to control inflation." Whenever you have these kinds of prepositional phrases, or infinitival phrases, you have ambiguities as to what they modify. So is "in effort" going to modify this noun phrase of half a percent? Or is it going to be modifying the verb "raises"? And we'll see some more ambiguities of that kind later.
But to give us slightly more of a sense of what kinds of ambiguities occur, let me show some funny examples. These are all examples of real newspaper headlines which have ambiguities. So: Ban on Nude Dancing on Governor's Desk. What's this one? Well, this one is precisely one of these prepositional phrase attachment ambiguities. So we have the prepositional phrase "on Governor's Desk," and there are two ways that you can interpret that. You can take "on Governor's Desk" as modifying "Nude Dancing." So then you have "Nude Dancing on Governor's Desk" as a bigger noun phrase, and there's a ban on that. So that's the funny reading. Then the reading that you're meant to get is that there's a ban on nude dancing, and that ban is on the governor's desk. So what you effectively then have is that this prepositional phrase "on Governor's Desk" is actually modifying the word "Ban." And so that's referred to in a lot of the literature as where the prepositional phrase attaches, meaning, what does it modify?
This one's dated a few years, and so I guess this one comes from prior to 1991: Iraqi Head Seeks Arms. So this one is then playing on word sense ambiguity. So "Arms" can be real arms of animals, but here it's talking about weapons. And "Head" can be a real head, but here it's referring to a head of state. Both of those usages are effectively illustrating the way in which a lot of new word senses come into being. Word senses often come from what are referred to as sense extensions, which occur fairly productively, where more or less metaphorically people will be extending the meaning of one word into a further domain. So you have the idea of a head on a person that controls the person, and so you then extend that idea to a state or a country, and then you have a head of the country who is the person who controls that. And that keeps on happening all the time. So you notice it all through recent developments in the computer industry. So when we have the web, and then you spider the web, and all those things, it's all metaphorical extensions that are going on.
Juvenile Court to Try Shooting Defendant. This is then again an ambiguity of syntactic structure, in terms of what's modifying what. So one possibility is that you have shooting as a modifier of defendant, so you have a shooting defendant, and then you put together the rest of the sentence so that "to try" is then the main verb, and you're trying the shooting defendant. But the alternative is that you have shooting as a verb, and defendant as the object of that verb. So you're shooting the defendant, and that's something the court's going to try.
And then we can continue on from there, and there are a bunch of other examples. Teacher Strikes Idle Kids; Stolen Painting Found by Tree. This is more a sort of semantic interpretation ambiguity of "by tree," as to whether that's being regarded as the agent of the finding, or the location of the finding. Local High School Dropouts Cut in Half; Red Tape Holds Up New Bridges; Clinton Wins on Budget, But More Lies Ahead; Hospitals are Sued by Seven Foot Doctors; Kids Make Nutritious Snacks; Minister Accused of Having Eight Wives in Jail.
So these are the sort of funny examples where there are real ambiguities that human beings notice, and in a sense I went looking for these in a place where you can find them, because there's a fact about the way that headlines are constructed that makes these kinds of funny ambiguities more likely to arise: in headlines you often leave out a lot of the function words, and so that tends to increase the amount of ambiguity that's going on. So in some sense this is atypical, but I think the example that you more want to keep in your head is that example of Fed Raises Interest Rates Half a Percent in Effort to Control Inflation. I mean, that example's not a funny example. When human beings read it they don't see any ambiguities at all. And yet still in that example there are tons of ambiguities present. And I'll show just one more example that relates to where there are semantic ambiguities. So in the old days in the GRE, and still in the LSAT, they have these logic puzzle questions. And I think, for the kind of people who have become computer science majors, probably some time in high school or something you've seen these kinds of logic puzzles.
So the nature of the questions is that there's a scenario and then you're meant to answer a question. So six sculptures are to be exhibited in three rooms, and then there are some conditions. Sculptures C and E may not be exhibited in the same room. Sculptures D and G must be exhibited in the same room. If sculptures E and F are exhibited in the same room, no other sculpture may be exhibited in that room. And then there's a question that you're meant to answer. Now because these are [inaudible] problems that determine whether people get into graduate school, the people who build these at ETS, their whole desire is to make these questions completely unambiguous, so that people don't sue ETS saying that they couldn't get into grad school successfully. But it turns out that with human languages, you just can't make them unambiguous, because just by nature natural languages are ambiguous.
So you take a sentence like this last one: at least one sculpture must be exhibited in each room, and no more than three sculptures may be exhibited in any room. These phrases like at least one sculpture, in each room, no more than three, any room, those are what are referred to as quantifiers. And it's just a fact about English that whenever you have multiple quantifiers you have scope ambiguities. So there are different ways to interpret things. So if we take this last bit, no more than three sculptures may be exhibited in any room, the way you're meant to interpret that is, for any room, it has at most three sculptures in it. But it's also possible to interpret things differently, with the opposite order of quantifiers. So you could have the interpretation that there's a special set of sculptures, which numbers no more than three, maybe they're ones that the museum owns or something, or aren't precious, and these ones are capable of being exhibited in any room in the museum. Whereas for all the other sculptures, somehow there are restrictions on which rooms you're allowed to show them in. But that's again a perfectly good semantic translation according to how natural languages work.
Okay, I have one more example of that, but maybe I should start to move ahead so I get to say a bit of stuff about machine translation before we run out of time. But the lesson to take away is that natural languages are highly ambiguous at all levels. The reason that they work for human beings is because natural language understanding essentially depends on making complex and subtle use of context, both the context of the rest of the sentence, and the context of prior discourse and things around you in the world. So for humans it's obvious what people are saying.
But that means that it's hard for computers to do things, because somehow there are a lot of different factors that influence how things are interpreted. And so that suggests this approach where we want to somehow make use of probabilistic factors and reasoning to try and do a good job of simulating how human beings can understand natural language. So [inaudible] doing natural language understanding is extremely, extremely difficult. It's sometimes referred to as one of the AI-complete tasks. But it's also important to counter that with the other extreme, which is that it turns out there are some other things that you can do with NLP which turn out to be surprisingly easy and to work rather well. And so a lot of the time we have a vast amount of text available to us, and we can do a surprisingly good job for many tasks by just looking through a large amount of text and counting things, and then predicting based on counting. So that's the kind of thing that's done with context-sensitive spelling correction. You just count up how often different spellings occur in different contexts, and you say that's the way it should be spelled when a gap appears in a similar context. And ideas like that just work incredibly successfully, and are incredibly useful to people. So somehow we want to be in this middle ground where we're doing things that are sometimes not even that hard, but give useful results.
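To make that counting idea concrete, here is a minimal sketch. It is a toy example of my own, not any real system's code: it picks between confusable spellings by checking which one is seen more often with the same neighboring words in a small corpus.

```python
from collections import Counter

# Tiny toy corpus (invented for illustration).
corpus = (
    "they went to their house . there is a house over there . "
    "their dog is over there ."
).split()

# Count (previous word, candidate word, next word) triples in the corpus.
context_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))

def correct(prev, candidates, nxt):
    """Return whichever candidate spelling is seen most often in this context."""
    return max(candidates, key=lambda w: context_counts[(prev, w, nxt)])

print(correct("to", ["their", "there"], "house"))  # -> their
print(correct("over", ["their", "there"], "."))    # -> there
```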
Okay, and so that's led into this direction of how we can do a fairly robust and reliable job of combining at least an approximation of knowledge about language, in terms of how language is used in other contexts and what kind of context we're in at the moment, and put that information together to come up with the right interpretation of sentences in a straightforward way. And the answer that's been getting a lot of traction in the last decade and a half or so in natural language processing is to use probabilistic models. That's essentially what the whole of probability theory is for: the idea is, well, we have uncertain knowledge about the world, but we want to combine all the evidence that we have, which may in turn be uncertain [inaudible], and combine it together to give the best prediction or understanding we can of what is intended, and what was said, et cetera. That's exactly what probability theory is for, and it works rather nicely for a lot of natural language problems. Now, it doesn't immediately look like natural languages look like the things you normally see in probability theory, because when you see probability theory you normally see numbers, and normal distributions, and things like that. But we're going to apply just exactly the same ideas to language stuff, so that we have the French word [inaudible], which might be translated as house, and we want to just learn the fact that that's a high-probability translation. Whereas when we have [inaudible] general, we don't want to translate that as the general avocado, but we want to say that we've got enough context around the word avocat that we can see that translating it as avocado would be a very bad idea.
Lately, using probabilistic models has become quite trendy and widely used in a lot of areas of artificial intelligence. You see it in vision, you see it in machine learning, you see it in various forms of uncertain knowledge representation. It's sort of just slightly ironic historically that the use of probabilities in natural language understanding didn't come from those recent developments in artificial intelligence. Where this strand of work came from is actually electrical engineering. So in the greater field of natural language there's text or natural language understanding, and there's speech recognition. And those two sub-areas haven't always had as much communication between each other as they should have had. And departmentally, speech has always been mainly based in electrical engineering departments.
But what happened was that at the same time that all the people in computer science, and AI in particular, were convinced of the use of symbolic methods, and doing logical approaches to artificial intelligence, and things like that, the speech people, often in electrical engineering departments, were separated off enough that the kind of things they were studying were still probabilistic methods: how to use real numbers and integrals for doing signal processing, and all of those kinds of things. So the natural way for electrical engineering and speech people to think about speech recognition was in terms of probabilistic models of reconstructing words from an uncertain audio signal. And some of those people then started thinking about how they could start applying some of those ideas further up the food chain for doing other kinds of natural language processing. So maybe we could use those same probabilistic ideas to work out the part of speech of words. And essentially this work really came out of two places. It came out of AT&T labs (Bell Labs in those days), and IBM Research on the East Coast. There were groups there that started spreading from speech into using probabilistic techniques for natural language understanding, and essentially that led to natural language processing being revolutionized by probabilistic methods earlier than, and quite separately from, any of the work that happened in using probabilistic methods in other parts of AI.
Okay, so in the rest of this course what we're going to do is look at a bunch of different problems in natural language processing, and how they can be approached. So the very first problem we're going to start off with is that first problem of NLP, machine translation, and looking at how statistical machine translation systems are built. And I'm going to use that as a backdrop for introducing some of the other key technologies we need for using probabilistic models in NLP, such as building language models, and algorithms for estimation, and smoothing, and things like that.
I'll then go on and start talking about sequence models, for tasks like information extraction and part-of-speech tagging, and then move on to doing natural language parsing, looking at context-free grammars and probabilistic equivalents of those. And then after that we'll go further into doing semantic interpretation of text, and particular applications. There's only time in the quarter to look at a few applications, so there are lots of other things that I'm just going to have to leave out, but I'll try and do a few things in enough detail that you have a good understanding of the main tools that people are using these days for natural language processing.
Okay, so the first problem that we're going to be starting with is this problem of machine translation. So the idea is we're going to start with this piece of text here, which most of us aren't very good at reading [inaudible] there were these characters. [Inaudible] at the beginning, but my Chinese isn't very good, and we'd like to turn it into this text which we can just understand and say, "Here's the U.S. island of Guam is maintaining." Many people have thought of this as kind of the classic acid test for natural language processing: that if we could have a system that could fluently translate text, that means that we have completely solved the problem of natural language understanding.
I think in modern times people are actually a little bit less convinced of that, because although one approach to doing machine translation is to do full text understanding, and presumably that's essentially what a human translator does most of the time, really, if you look at just the inputs and outputs of machine translation, well, both the input and the output are strings. So really, machine translation is just a string transduction problem, and it's quite possible that you could do that quite well without really understanding all of the text. And indeed quite a lot of the work on statistical machine translation has taken exactly that approach.
But nevertheless, regardless of whether you see it as the fundamental test of natural language processing, it's clearly an immensely important practical problem. You see it a little bit less in the United States than in some other places, since English is so dominant in much of the United States, but if you look at other places like the European Union, the European Union spends billions of dollars each year translating stuff. Even in the United States, if you're a company that does business internationally, you have to pay to translate stuff. The U.N. spends a huge amount of money translating stuff. And so here's just a cute little example that I got the other week from Scott Klemmer, which shows how even kind of modern, new-age web companies can't escape from the cost of translating stuff. So Scott was talking to some of the people at Google about Google SketchUp, and they pointed out this fact that they only release a new version of Google SketchUp every 18 months. Why do they do that? It's not because they're not working on the code, and couldn't regularly release new versions of the code. The problem is that they can't get the translations of the user manual done more regularly than that. So it's really doing translation that is killing a lot of the sort of productivity for being able to be nimble.
Okay, here's a few examples just to give you a sense of the kind of stuff that happens in translation. So there's another piece of Chinese text, and the reference translation is: according to the data provided today by the Ministry of Foreign Trade and Economic Cooperation, as of November this year, China has actually utilized $46.959 billion U.S. dollars of foreign capital, et cetera.
The main thing that we're going to talk about, and that you guys are going to implement some of for the second assignment, is these models: the canonical early models of machine translation by statistical means were proposed by IBM, and they built five models of increasing complexity called IBM Models 1, 2, 3, 4, and 5, not very original names. And for the assignment we're going to get you guys to build Models 1 and 2, because you can implement Models 1 and 2 fairly straightforwardly, where things get much more complex if you go up to Model 3. But here's some output from IBM Model 4: The Ministry of Foreign Trade and Economic Cooperation including foreign direct investment $40.007 billion U.S. dollars today provide data include that year to November, China actually using.
Well, there are some snatches of that, in particular some noun phrases, that the system just did fine on, right? The Ministry of Foreign Trade and Economic Cooperation, that's fine. Then the $40.007 billion U.S. dollars, somehow the number came out slightly different, but maybe that's the exchange rate. But that's kind of an okay bit, but clearly if you then look at other parts of the syntactic structure it's just completely garbled, because the syntactic structure of Chinese is rather different to English. So when you have these bits like, U.S. dollars today provide data include that year to November, China actually using, right, the [inaudible] it seems like most of the words are there, and you're kind of looking up words and sticking the translation in, just like people were doing in the 1950s. Here it's being done with somewhat better use of context, so you're choosing the right words out of ambiguous translations. But clearly the syntax is just kind of completely garbled, and it's not making any sense.
So when you do the assignment, don't expect your translations to be perfect, because really they'll probably be worse than that one. But then at the bottom, towards the end, is some of the more recent work of trying syntax-based machine translation models, which then do a better job at recognizing, and therefore correctly translating, the structure of sentences. And so this is work from [inaudible], and their translation is: that today's available data of the Ministry of Foreign Trade and Economic Cooperation shows that China's actual utilization of November of this year will include $40.07 billion U.S. dollars for the foreign direct investment. I mean, it's not quite perfect. It's a little bit awkward in a couple of places, but you can see how the syntactic structure of the translation is just a ton better, because it's been correctly recognized in the source. And it's actually starting to be quite a passable translation.
Okay, so this is a very quick history of machine translation research. I've already mentioned the work in the 1950s and '60s, essentially just direct word-for-word replacement. There was then the kind of shutdown of MT funding and research in the U.S. following the ALPAC report in 1966. There then started to be a resurgence in other parts of the world, maybe parts of the world that needed translation more. So in Europe and Japan, these were the days of the Japanese Fifth Generation Project, when they were going to take over the world with AI. But that sort of started to slowly lead, in the late '80s and the early '90s, to a gradual resurgence of machine translation work in the U.S. But what really turned the world around was when these IBM machine translation models were produced, and the approach of statistical machine translation started to be hotly pursued.
So over the time that I've been doing this job, things have just been completely turned around, where in the late 1990s almost no one worked in machine translation, and most of the interesting, exciting research work was being done in the other areas of natural language processing, like parsing, information extraction, and various other topics. Whereas really in the last five years, machine translation, and in particular statistical machine translation, has just become the hot area of the field. So these days all of the young graduate students, or assistant professors when they get jobs, it seems like fully half of them, what they want to be doing is statistical machine translation. And so there's a ton of new and interesting work leading to much better statistical machine translation systems.
Okay, so what happened between ALPAC, when it was deemed impossible, and now? Well, a lot of things have happened. The need for machine translation certainly didn't go away. In fact, the need for machine translation has just increased and increased, for various kinds of economic and political reasons. So the greater internationalization of trade, and multinational companies, the greater unification that occurs through things like the European Union, or Hong Kong becoming part of China. Everywhere there's sort of more and more need for translation.
The technology has gotten a ton better, and that's made things a lot more possible. The other big thing that's made things a lot more possible is not just the computers, but actually having the data. In the early days there just wasn't online digital data in large amounts to do things with, whereas now we have a ton of such data, the most obvious part of which is the World Wide Web, but just more generally all of the data that's available in digital form. There are now just literally billions of words of data in major languages that anyone can get their hands on. Of course, now we also know more about linguistic structure, and that helps.
The final thing that has helped progress is this item that I've got down there as change in expectations. The change of expectations part is saying that something interesting has happened: there have become new opportunities for kind of so-so translation. In the early days, essentially the only use for translation was for producers of documents. So people had their user manual and they wanted to translate it into Spanish, German, French, Italian, et cetera. And by and large those people had to pay human beings, because machine translation was too bad for them to be able to release on the market with their product, because frankly it would look bad.
Notwithstanding those consumer electronics products where, even when human beings translate them, commonly the translation is still really bad, and makes the product look bad. But the thing that's changed now is that we now have much more of a system, through things like the World Wide Web, where there are users that can originate their own translations. So I might want to do something like go and look at the Al Jazeera webpage and wonder what it's saying, and my Arabic isn't very good, so I want it translated for me. And I'll be happy providing I can understand the gist of what the page says. I don't really mind if the translation isn't perfect. It'll be a lot better than having nothing. And that's precisely why there's now this facility that you find on all the major search engines, like Google and Yahoo and Microsoft Live Search, where they can do these low-quality translations for you. And that same facility is being used by users in lots of other places as well. So everything [inaudible] doing online chat, and playing games, and things like that, you can help out billions of people whose primary language isn't English especially by providing these kinds of facilities.
Okay, so overall there are different approaches to machine translation, based on how deep you go in attempting to translate. And this was presented a very, very long time ago by someone called Vauquois, and it is called the Vauquois triangle. The idea here is we've got the source text, and we've got the target text. And we're able to do some amount of processing, and then at some point we're going to try and convert across to the other language. So one possibility is that we do extremely minimal processing; we presumably want to cut the text into words, we might want to normalize the words a little, and then go straight at the word level across to the target language. And so that's then called direct translation.
That was what was done in the 1950s, and when statistical machine translation started in the 1990s, actually, that was again what people did. It's just that they were doing it with statistics to try and do it better. But most people have believed that ultimately, to get better translation, you're actually going to want to understand more of the source text, to use that information to make a better translation. So you might want to do syntactic analysis that produces a syntactic structure for the sentence. And then you could do syntactic transfer to a syntactic tree in the other language, and then do generation from there to the target text. And that's the kind of area that's sort of really the research area these days: how to do syntax-based statistical machine translation.
But you could want to, and there was work in the 1980s especially that did, go even deeper than that. You might want to go up to the level of having semantic analysis of a particular language, and then do semantic transfer. And then people have had the idea, and again, this was sort of worked on actively in the 1980s, that maybe we could go even deeper than that. We could kind of go beyond the source language, whatever it is, the Chinese semantic structure: why can't we define a universal semantic structure which could be used for all languages? That is then referred to as the interlingua. And so it's that interlingua-based machine translation where you translate from one language into the interlingua, and then from the interlingua back to the other language.
There's an enormous appeal to that approach, because if you could actually do it, then you could handle all of the languages of the world without very much work. Because if you were down here, or down here, and you want to translate between a lot of languages, the problem is that if there are n languages that you wanted to cover, you have to build n-squared systems. And so that's the kind of problem that Google still has today, right? You can pull down the menu of what languages you can do translations from, and they're building systems for individual language pairs. So you can translate from Chinese to English, or English to Chinese, and French to English, and English to French. But if you want to translate from Chinese to French, it doesn't provide that. So you're in this sort of bad n-squared space.
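Just to put rough numbers on that, here is a back-of-the-envelope illustration of my own (the language count is hypothetical, not a figure from the lecture):

```python
# Pairwise transfer needs a system for every ordered language pair,
# while an interlingua only needs analysis into it and generation out
# of it for each language.
n = 20  # hypothetical number of languages to cover

pairwise_systems = n * (n - 1)   # roughly "n squared": 380 systems for n = 20
interlingua_systems = 2 * n      # order n: 40 systems for n = 20

print(pairwise_systems, interlingua_systems)
```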
Whereas if you could actually do this generation up to an interlingua, then you'd only have to build order-n translation systems, and you could translate between every language pair. So that seems very appealing, but on the other hand it's been a very hard idea to work out. And it's perhaps a false idea, because the problem is that individual languages have all kinds of distinctions, and ways of looking at the world, that are all their own and aren't reproduced in other languages. So you know, there are particular things that we make distinctions about in English. We distinguish peas and beans as separate words, whereas most languages don't actually distinguish peas and beans, right? They could have words for different kinds of beans, the same as we have words for fava beans and lima beans, but they're just different kinds of beans. And every language does that, right? This is kind of like the Eskimos having a lot of words for snow. And the problem is, if you're actually trying to design an interlingua that works for every language, that means you have to put into the interlingua the complexities of every individual language, and that makes it a very hard goal to get to.
Okay, so we're not going to be heading in that direction. We're going to be looking at how statistical machine translation can be done at essentially the word-level, direct translation level, and then talking a little bit about how to do it at the syntax level.
And so, how do you do machine translation? The essential answer to how to do machine translation by statistical methods is this picture. So this is a picture of the Rosetta Stone, which is the famous basis for how the decoding of hieroglyphics happened. And well, how did that happen? How that happened was because someone had a hunch that this stone contained the same text in three languages: in Greek, in Demotic Coptic, and in Hieroglyphs. And this assumption, this correct assumption, that this stone had the same text in three languages gave enough information to start to then work out how you could decode hieroglyphs.
And so this is exactly what we're going to use. We're not going to be decoding
in exactly the same sense, but we're going to use the same source of information. We're
going to take text that we know is parallel. There are lots of sources of parallel data these
days in the world, because people like the European Union translate stuff, or the Hong
Kong Parliament produces proceedings in Chinese and English. And we're going to exploit this
parallel text to induce statistical models of machine translation. So here's a list
of just some of the places where you can get parallel text. There are lots of others as
well; for example, lots of magazines appear in multiple languages.
And so what weˇre going to do is have our computer look at lots of this parallel text,
and say, ¨Hmm, every time I see a particular word in the Spanish text, the translation
of this sentence has particular words in the English text, and so that will be my translation.〃
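Just to make that intuition concrete, here is a minimal sketch, in Java (the language the course assignments use), of the kind of co-occurrence counting this amounts to. The class and method names are made up for illustration; this is not the assignment support code.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * A minimal sketch of the co-occurrence intuition: for each foreign word,
 * count how often each English word appears in the translation of a
 * sentence containing it.
 */
public class CooccurrenceCounter {

  // counts.get(f).get(e) = number of sentence pairs in which f and e co-occur
  private final Map<String, Map<String, Integer>> counts =
      new HashMap<String, Map<String, Integer>>();

  public void observe(String[] foreignSentence, String[] englishSentence) {
    for (String f : foreignSentence) {
      Map<String, Integer> row = counts.get(f);
      if (row == null) {
        row = new HashMap<String, Integer>();
        counts.put(f, row);
      }
      for (String e : englishSentence) {
        Integer c = row.get(e);
        row.put(e, c == null ? 1 : c + 1);
      }
    }
  }

  // The English word that most often co-occurs with f is a crude first
  // guess at its translation.
  public String bestGuess(String f) {
    Map<String, Integer> row = counts.get(f);
    if (row == null) return null;
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> entry : row.entrySet()) {
      if (entry.getValue() > bestCount) {
        bestCount = entry.getValue();
        best = entry.getKey();
      }
    }
    return best;
  }
}
```

Picking the single most frequent co-occurring word is only a crude first guess; the EM-based alignment algorithms described a bit later do something smarter with essentially the same counts.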
This is actually in a sense also an old idea, because I mentioned already that in the early
years after World War II, a ton of money was thrown into machine translation research.
The person who actually initiated that work was Warren Weaver, who was a prominent person
in the U.S. government in the years after World War II.
So Warren Weaver was essentially the person who initiated this research in machine translation,
and his thinking came precisely from the direction of knowing about
code breaking. Weaver got interested in where that idea could be exploited, because
he had noted what computers were used for during World
War II: essentially, they were used for working out the flight paths of bombs, and for code
breaking.
And so, thinking about machine translation, he got interested in whether computers could do translation
as code breaking. And he wrote to a professor that he knew, who was a foreign languages professor,
about this idea. He wrote, "Also knowing nothing official about, but having guessed
and inferred considerable about, powerful new mechanized methods in cryptography,
methods which I believe succeed even when one does not know what language has been coded,
one naturally wonders if the problem of translation could conceivably be treated as a problem
in cryptography.
When I look at an article in Russian, I say, 'This is really written in English, but
it has been coded in some strange symbols. I will now proceed to decode.'" So that
seemed to be the idea in Warren Weaver's head.
Now in practice, in terms of what was actually built in the 1950s and '60s, really no one
picked up on this idea at all. People started hand-building translation dictionaries,
and hand-building analysis rules and grammars, and hand-building big systems, and it wasn't
approached in this manner at all. So really it was only in the late
'80s and '90s that statistical work picked up this idea again, saying, "Well, let's start
with this parallel text and view language translation as a decoding process. And let's
see if we can just automatically get our computers to build models of the translation process."
And so in a couple of lectures, I'm going to go through the details of how the IBM models
started to do that. In the very next lecture, I'll first of all talk about language models,
which are a key ingredient we'll need along the way. But for the last five minutes of
this lecture, what I'd like to do is sort of illustrate this process by just showing
how it works at a human level, working with a small simulated parallel text.
And so this is following a kind of nice example that was done by Kevin Knight, who
is at ISI at the University of Southern California; ISI has been one of the most
prominent centers where people have worked on statistical machine translation in the
last decade.
So the Kevin Knight scenario is that what we want to do is translate from Centauri
into Arcturan. We start off with this sentence in Centauri [foreign language],
and somehow we want to learn how to translate it into Arcturan.
Well, fortunately for us, we've also managed to find a little bit of parallel text:
some example sentences in Centauri for which we know their translation in Arcturan. So
here's our parallel corpus. We've got these 12 pairs of sentences in the two languages.
Now obviously, when we ask our computers to do this task, if we're
building large systems, we're commonly giving them millions of words of parallel
text, and even for a [inaudible] system we might begin with tens or hundreds of thousands of
words of parallel text. But here I'll just illustrate it with this teeny, teeny text.
Nevertheless, although we'll do it here as human reasoning, what we're
doing is essentially exactly what we'll want to build an algorithm to simulate.
So what weˇre going to build for algorithms is what are called EM based alignment algorithms.
So Iˇll talk later about the EM algorithm. It's a general probabilistic estimation algorithm
which tries to learn how to do translations. It learns alignments which are parings between
translated words.
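As a preview of what such an algorithm can look like, here is a heavily simplified sketch, in Java, of an EM loop in the style of IBM Model 1, which the upcoming lectures cover in detail. It assumes every target word aligns to exactly one source word, ignores the NULL word and fertility, and uses invented class and method names; treat it as an illustration of the estimation idea, not a reference implementation.

```java
import java.util.*;

/** A toy EM loop in the style of IBM Model 1: estimate t(e|f) from sentence pairs. */
public class ToyModel1 {

  /** pairs.get(i)[0] is the foreign sentence, pairs.get(i)[1] the English sentence. */
  public static Map<String, Map<String, Double>> train(List<String[][]> pairs, int iterations) {
    // t.get(f).get(e) = P(e | f), initialized uniformly over co-occurring words
    Map<String, Map<String, Double>> t = new HashMap<String, Map<String, Double>>();
    for (String[][] pair : pairs) {
      for (String f : pair[0]) {
        if (!t.containsKey(f)) t.put(f, new HashMap<String, Double>());
        for (String e : pair[1]) t.get(f).put(e, 1.0);
      }
    }
    normalize(t);

    for (int iter = 0; iter < iterations; iter++) {
      // E-step: collect fractional (expected) counts of f aligning to e
      Map<String, Map<String, Double>> counts = new HashMap<String, Map<String, Double>>();
      for (String[][] pair : pairs) {
        for (String e : pair[1]) {
          // total probability that some foreign word in this sentence produced e
          double total = 0.0;
          for (String f : pair[0]) total += t.get(f).get(e);
          for (String f : pair[0]) {
            double fraction = t.get(f).get(e) / total;
            if (!counts.containsKey(f)) counts.put(f, new HashMap<String, Double>());
            Double old = counts.get(f).get(e);
            counts.get(f).put(e, (old == null ? 0.0 : old) + fraction);
          }
        }
      }
      // M-step: renormalize the expected counts into probabilities
      normalize(counts);
      t = counts;
    }
    return t;
  }

  private static void normalize(Map<String, Map<String, Double>> table) {
    for (Map<String, Double> row : table.values()) {
      double sum = 0.0;
      for (double v : row.values()) sum += v;
      for (Map.Entry<String, Double> entry : row.entrySet()) {
        entry.setValue(entry.getValue() / sum);
      }
    }
  }
}
```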
So how might we do it in this context? Well, an obvious thing to do is to start with the
first word in the Centauri sentence. Suppose we look at that word, furock. Well, what would
furock mean? Essentially, our source of evidence is to say, "Well, can we find
furock in our parallel corpus? Oh yes, furock appears twice in our parallel corpus."
Well, if we don't know anything else, a reasonable thing to assume is that maybe one of the
words in those translations translates the word furock. And presumably that's true here. So a good heuristic would
be to look for a word that appears in both of the translations, and there's precisely
one such word, this word ja.
Another heuristic we might have used is to guess that word order
between the two languages has some similarities. So the fact that furock is second in both
of these sentences, and ja is second in their translations too, is at least weak confirmatory
evidence. And so, tentatively, we could say, "Well, we can translate furock as ja."
Okay, so then we can go on to the second word, crok. That's kind of a tricky one, because
crok only appears once in our parallel corpus, and, well, if we really believed word
order was always the same we could just count positions, but that seems like weak evidence since
we know word order varies a lot between languages. So that one seems too hard for now. Maybe
we should just leave it for a moment and say it could be any of the words in the translation,
and go on and look at other words.
Okay, so hehawk looks rather more hopeful, because we can find hehawk in three sentences.
And so again, we can do this process of saying, "Okay, which word appears in their translations?
Well, if we look at these three sentences, it seems like arrat appears in all of them."
Now, in practice, a lot of the time we won't be this lucky, because a word in Centauri
will sometimes be translated by one word in Arcturan, and sometimes by a different word.
And then it's a bit harder, but we'll at least look for words that frequently occur
in the translations of the sentences it appears in.
But here we seem to be in good business, and we can go with that alignment. And so we can
move on to yurock. We can do the same kind of reasoning and say, "Well, for yurock it
seems quite plausible that we can align it with maut." And that then gets us on to this
word clock. Clock is again a tricky one, because clock only appears once in the
corpus. So we might think that we're still in the situation of, "Well,
it only appears once. It could be any of those words in the other sentence. We can't make
any progress."
But actually thatˇs no longer true because since weˇve been kind of working ***
this problem, we can start to make some progress now because although if you just look at this
you could think it could be anything. Well, weˇve been looking at various words, and
so we can start to work out by some kind of process of elimination what might be going
on.
So that if we realize that well, actually weˇve worked out good alignments for other
things, that these are good alignments. Well, now it looks like, well, clock is the only
word left over and ba hasnˇt been aligned to anything, so maybe what we should do is
translate clock as ba. And so thatˇs this been giving us a process of elimination
where exploiting of some words we know we can guess that the remaining words go to untranslated
words. And the EM algorithm, one of the features of it is it provides an implementation of
this sort of process of elimination idea.
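To see that the EM sketch above really does implement something like this elimination, you could run it on a few made-up sentence pairs in the spirit of the toy corpus. These particular pairs are invented for illustration; they are not the twelve pairs on the slide.

```java
import java.util.*;

/** Tiny demonstration of the process-of-elimination effect in the toy EM sketch. */
public class ToyModel1Demo {
  public static void main(String[] args) {
    // Hypothetical sentence pairs, loosely in the spirit of the toy corpus
    List<String[][]> pairs = new ArrayList<String[][]>();
    pairs.add(new String[][] { {"furock", "hehawk"}, {"ja", "arrat"} });
    pairs.add(new String[][] { {"furock", "yurock"}, {"ja", "maut"} });
    pairs.add(new String[][] { {"hehawk", "clock"},  {"arrat", "ba"} });

    Map<String, Map<String, Double>> t = ToyModel1.train(pairs, 20);
    // Because furock keeps co-occurring with ja, and hehawk with arrat,
    // the expected counts for clock get pushed onto ba over the iterations.
    System.out.println(t.get("clock"));
  }
}
```

The word clock co-occurs with both arrat and ba, so by itself it looks ambiguous; but as the model becomes confident that hehawk accounts for arrat, the fractional counts in the E-step shift clock's probability mass onto ba, which is exactly the elimination reasoning done by hand above.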
There are other sources of information we can also use to help us along in some
cases. So if we look at this next sentence down here, we've got two words that we haven't
aligned. One was the crok that we wanted to work out at the beginning, and then we
have zanszenot. So even using the process-of-elimination idea, there are still two words
here and one word there, and it might not seem quite obvious what to do. But a possible
source of information is to notice zanszenot and zanszena: well, they look very similar
to each other, and often words that look very similar in different languages actually
are related to each other. Those words are referred to as cognates.
Now, for some languages which are related historically, or have had a lot to do with each
other for various historical reasons, there are tons of cognates. So French and
English have tons of cognates, and similarly for the other related European languages. And even
for languages which aren't actually closely related in their language families, for
historical and cultural reasons there may be a lot of borrowed words that give cognates.
So, for example, you'll find lots of Arabic words through vast stretches of Africa where
the languages aren't Arabic, because various forms of science and culture were borrowed,
and so you see these borrowed words. That even works in modern times with languages
like Chinese and Japanese, which have been borrowing lots of words from English for various
technological things. So cognates can often be quite helpful.
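The lecture doesn't spell out how a program would spot cognates, but one simple heuristic, offered here purely as an assumption, is to flag word pairs whose edit distance is small relative to their length, which would catch a pair like zanszenot and zanszena:

```java
/**
 * One simple, assumed heuristic for flagging cognate candidates:
 * word pairs with a small Levenshtein edit distance relative to
 * their length, e.g. "zanszenot" vs. "zanszena".
 */
public class CognateHeuristic {

  /** Standard Levenshtein edit distance by dynamic programming. */
  public static int editDistance(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                           d[i - 1][j - 1] + cost);
      }
    }
    return d[a.length()][b.length()];
  }

  public static boolean looksLikeCognate(String a, String b) {
    int longer = Math.max(a.length(), b.length());
    // Flag the pair if at most about a quarter of the characters differ
    // (the threshold is arbitrary, just for illustration).
    return longer > 0 && editDistance(a, b) <= longer / 4;
  }

  public static void main(String[] args) {
    System.out.println(looksLikeCognate("zanszenot", "zanszena")); // prints true
  }
}
```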
Well, what does that leave as the translation of this mysterious word crok? Now we're left with no unaligned
words on the other side. That doesn't mean we know the exact answer. One possibility is that really we
should take a pair of these words and translate them as a single word in the other language. There
are lots of cases in which you have something that can be expressed as two words in one
language and one word in the other language. It can even happen within the same language that
you just have an alternative. So you can either say, "I went into the room," or you can
say, "I entered the room." So "went into" and "entered" are two possible
paraphrases: one is two words, one is one word. So we have to allow for those kinds of
many-to-one alignments in various ways.
But the other possibility is that this is just a word that isn't translated. In
the statistical MT literature, those are referred to as zero-fertility words: the idea is that
this word just doesn't produce anything, that there is nothing that it becomes in the other
language. And so that's a possible scenario.
And so this is the kind of process by which we can use parallel text to learn
how to translate. And the cute thing about this example, for me to close on, is that
this example is really Spanish/English, except that Kevin Knight took all of the words
and changed them into these sort of made-up Centauri and Arcturan words. But you actually have
exactly the same translation problem, where what we want to translate is, "Clients do
not sell pharmaceuticals in Europe." And this is our parallel corpus. And what you
can see in the alignment is exactly the kinds of things that we were talking about.
So here we get the word order mismatches. In Spanish you have the adjective appearing
after the noun, so you get these crossed alignments that we were seeing here. And then here, this
was the example with the cognates and the zero fertilities. So here is the cognate;
I'm not sure exactly what it is, I presume it's some kind of drug, and so it's similar in both languages.
And what was our zero-fertility word? The zero-fertility word was "do." This is
just the kind of thing that happens commonly with languages: you get these function
words. So in English you're putting in this auxiliary verb "do" to go along with the
negation, "do not sell," whereas in Spanish there just is no equivalent. You just have
"no" and "sell," and that's all there is. So these kinds of function words often appear
as zero-fertility words.
Okay, so thatˇs where Iˇll stop for today, and then next time weˇll first of all take
a little detour into [inaudible] language modeling which is a kind of core technology
that weˇll need to do machine translation, and then on Wednesday get back to machine
translation proper.
[End of Audio]
Duration: 75 minutes