MALE SPEAKER: --from Everest. He got his PhD in Cambridge
with Stephen Hawking and then went to MIT and to Harvard.
After, he actually went west and studied a few years at the
Smith-Kettlewell Eye Research Institute in San Francisco.
And recently, he's been at UCLA for a number of years.
He's a professor in two departments, psychology and
statistics.
And soon, also in CS.
He wants to be a professor in as many
departments as possible.
He gets bored, otherwise.
And also, he once pointed out that he'd never been a
professor in a department where he had a degree.
[INAUDIBLE]
And today, he's going to talk about some of the work he's
doing in object recognition, detection, text detection, in
particular.
And I think we've resolved the issues.
So Alan, [UNINTELLIGIBLE].
ALAN YUILLE: OK, well, thank you very much
for having me here.
So I'm here to talk about work that's being done at the
Center for Image and Vision Sciences at UCLA, which I
direct together with professor Song-Chun Zhu.
And some of the work I describe will be mine.
Some will be joint, and some will be stuff that Song-Chun is
doing himself or with others.
So yeah, just for people who may not work in vision,
usually you put up a default slide saying vision is hard.
Because people often think it's really easy.
The reason it seems easy is that our brains evolved to do it.
Our intelligence is basically in our cortex and at least 50%
of our cortex seems to be doing vision
in one way or another.
And so, if you think about it in intelligence terms, vision,
just looking at a room and interpreting it, is arguably a
far harder task than solving the most difficult mathematics
problem or building the most complex software system.
I was once hissed at in the Harvard mathematics department for
saying this, but I think it's still objectively true.
If you take away the vision part of your brain and the
part that does motor control and a little bit that does
language, there's not really very much left.
So intelligence is often what's going on at the level
of vision and perception.
OK, so vision.
Why is it hard?
Well, vision requires decoding the image and parsing it into
components such as objects.
And the difficulties are due to the fact that images are
very complex and also extremely ambiguous.
One way of pointing this out was the observation that if
you look at all the possible images which
are just 10 by 10--
10 pixels in the x direction, 10 pixels in the y direction
and count out how many images there are, you'll see there
are far more of those than all the images that have been seen
by all humans over the whole period of
evolutionary history.
Even allowing for the billions of humans who have ever existed,
seeing maybe 30 images a second, and living, on average,
a 60-, 70-, or 80-year life.
No one has seen all the possible 10 by 10 images.
It's a very high dimensional space.
It's a very complex place.
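To make that counting argument concrete, here is a rough back-of-the-envelope sketch. The specific numbers used below (binary pixels, 100 billion humans ever, 16 waking hours a day) are just illustrative assumptions, not figures from the talk:

```python
# Rough back-of-the-envelope version of the 10x10 image argument.
# Assumptions (illustrative only): binary pixels, ~100 billion humans ever,
# ~30 images per second, ~70-year lifespan, 16 waking hours per day.

num_images_10x10 = 2 ** (10 * 10)           # distinct 10x10 binary images
humans = 100e9
seconds_awake = 70 * 365 * 16 * 3600        # waking seconds in ~70 years
images_seen = humans * 30 * seconds_awake   # upper bound on images ever viewed

print(f"possible 10x10 binary images: {num_images_10x10:.3e}")   # ~1.3e30
print(f"images seen by all humans:    {images_seen:.3e}")        # ~4.4e21
```

Even with binary pixels and generous viewing assumptions, the space of tiny images dwarfs everything humanity has ever seen; with 256 gray levels the gap is astronomically larger.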
Here is a quick example to test your visual ability.
So these squares, this one and this one,
which one is brighter?
Has anyone seen this example before?
AUDIENCE: They're the same, obviously.
ALAN YUILLE: Very sophisticated
people here, right.
AUDIENCE: It's on Wikipedia.
Everyone's seen it.
ALAN YUILLE: Everyone's seen it by now.
I still find some people who don't, but
anyway, they're the same.
But they look very different.
And the reason appears to be that when you look at an
image like this, your brain is not just registering the
intensity that it is actually receiving here directly.
It's doing some complicated inference.
It's figuring out that this thing seems to be in the shadow
of this, and so its intensity is actually different from what
you perceive it to be.
So it's doing an inverse process.
OK, so here is a typical image, illustrating, I think,
Everest from the Tibetan side, illustrating some of the
degrees of complexity that you get in the visual scene.
And the work that we're doing, one part of it is related to
what we called image parsing in a paper that we published a
couple of years back, which would give you the basic
flavor of what we want to do.
The idea being that an image, take this one, can be
thought of as being composed of a number
of different patterns.
Patterns for the person, patterns for their--
a hierarchical representation of patterns.
Here is a person, here is a face, here is a body.
Here is a sports field.
The sports field is made up of certain components.
The spectators are also made up of certain components,
which can be texture of different types and sometimes
larger elementary objects.
So think of the visual world as being made up of a large
number of these patterns organized in some sort of
hierarchical method like this.
And then the idea of interpreting an image, in our
terminology parsing it, is taking the image and
decomposing it into patterns.
So you can take this picture two ways.
One way is you go up here, which means you take these
patterns and you stick them together to make the image.
The other way is you start out with the image, you decode it
by taking it apart and getting this representation for it.
And once you've done that, you have solved the problem of
detecting objects in an image, recognizing them, and
understanding the entire thing.
And so our approach was to formulate this in terms of
generating the image by a probabilistic grammar, the
grammar allowing you to have a fairly abstract level of
knowledge representation, and probabilistic so that things
are not purely deterministic.
Here is a very simple conceptual picture of this.
The image would be a scene.
It would contain a face.
It would contain some text.
And it would contain a certain amount of background.
And so you could generate an image by coming up with a
scene node, a certain probability of there being a
face somewhere in the scene, a
probability of text, et cetera.
The model would enable you to generate these, and this would
be synthetic samples of text, synthetic samples of faces.
So that would be the generation process starting
from this abstract probabilistic model for this
scene, and then generating it.
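As a minimal sketch of that generative story, here is a toy grammar in which a scene node stochastically spawns a face node, a text node, and background. The probabilities and attributes below are made-up placeholders, not the model from the paper:

```python
import random

# Toy probabilistic scene grammar (illustrative assumptions only).
# A scene node spawns a face and/or text with some probability; leaves carry
# simple appearance parameters that a renderer would turn into image regions.

P_FACE = 0.6   # assumed probability that the scene contains a face
P_TEXT = 0.4   # assumed probability that the scene contains text

def sample_scene():
    parse = {"node": "scene", "children": [{"node": "background"}]}
    if random.random() < P_FACE:
        parse["children"].append({
            "node": "face",
            "position": (random.uniform(0, 1), random.uniform(0, 1)),
            "scale": random.uniform(0.05, 0.3),
        })
    if random.random() < P_TEXT:
        parse["children"].append({
            "node": "text",
            "string_length": random.randint(1, 10),
            "position": (random.uniform(0, 1), random.uniform(0, 1)),
        })
    return parse  # a parse tree; interpretation runs this process in reverse

print(sample_scene())
```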
Now, the task of interpreting it would go backwards.
It would have certain dynamics.
You'd have a grammar.
You'd have a grammar representation of the scene in
terms of these nodes and the elements in it.
You could do various operations.
Like here, you'd start out by interpreting the image in
terms of text and background without realizing there's
anyone in the scene.
And then over here, you'd do a move, a translation on the
graph, which involves creating a node structure for the face
and then explaining that part of the image in terms of the
face model.
And other types of transitions can be
done on this also.
I was not planning to go into details of this type of thing.
If people want details, give me feedback, and
the more questions people ask, the more I can adapt to the
audience and make sure I'm on the right page.
Here is a rather more complicated diagram of how a
system of that sort would work.
And basically, it would work in two sets of stages.
There's sort of a bottom-up component and a top-down one.
The bottom-up component would be a system which would be
particularly targeted to try and detect or
make proposals about the presence of certain important
things in the image, such as text or faces or any other
types of objects.
And so, for people involved in machine learning, you would
have a series of tests,
discriminative probability tests.
For example, AdaBoost, or methods of that sort, are
often very good procedures for that, which you would train on
data, and which would look at a certain part of the image
and say with certain probability we think there
probably is a face there or there probably is text or
whatever other types of objects you want to have.
These would make proposals into this sort of hierarchical
parsing representation, which would allow you to create
models of faces, destroy them, move them around, and so on,
up to a high level where there would be more generative
models of faces or text, which would sort of interpret what
the low level is telling them.
Also, the high-level models could start making predictions
about the implications of certain things in the scene,
or making predictions that, OK, you found a face here, so
there's a certain probability that there ought to be a face
over there, based on the high-level representation.
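Schematically, that bottom-up/top-down interplay might look like the following sketch; the detector and generative-model objects here are hypothetical placeholders, not the real system. Discriminative detectors make cheap proposals, and a generative model keeps a proposal only if it explains the region better than the background model does:

```python
# Sketch of the bottom-up / top-down loop (illustrative only; the detector and
# model objects are hypothetical placeholders, not the real system).

def parse_image(image, detectors, generative_models, accept_margin=0.0):
    parse_graph = []

    # Bottom up: discriminative detectors (e.g. AdaBoost-style cascades)
    # cheaply propose likely object regions with confidence scores.
    proposals = []
    for det in detectors:
        proposals += det.propose(image)        # [(label, region, score), ...]

    # Top down: the generative model for each label tries to explain the
    # region; keep the proposal only if it beats the background explanation.
    for label, region, score in sorted(proposals, key=lambda p: -p[2]):
        obj_ll = generative_models[label].log_likelihood(image, region)
        bg_ll = generative_models["background"].log_likelihood(image, region)
        if obj_ll - bg_ll > accept_margin:
            parse_graph.append({"label": label, "region": region, "ll": obj_ll})
    return parse_graph
```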
AUDIENCE: Is there a degree of pixel ownership,
probabilistically between the low-level components?
ALAN YUILLE: The low-level things wouldn't necessarily
own pixels. The high-level models--
the generative models--
should impose that.
The low-level, you could have competition.
Something jumps up saying, I think this is a face.
And something else is saying, I think you're wrong.
I think it's a tree.
But the high-level stuff, generative, is the thing that
really controls it and makes everything fit
together and so on.
Here are some examples of whether-- oh, yeah?
AUDIENCE: [INAUDIBLE]?
ALAN YUILLE: In this version, it only has
one, but that's not--
yeah, this is a version from two or three years back that
was implemented by Zhuowen Tu and Alex Chen and so on.
So there's nothing in principle to stop it from
having more.
And some of the versions that Professor Zhu is implementing
would have far more nodes of that sort.
Here is an aspect which would illustrate the issue of the
[UNINTELLIGIBLE] of the bottom-up and top-down.
So here, based on the filters we had several years ago,
would be our estimates of what a face is and what a text in
the image is.
So it's getting these faces OK, but it also thinks this
thing is a face.
And if you look closely at it--
take out the context--
it actually doesn't look that different.
There's a bit here that looks like an eye, another eye.
This could be a nose.
That could be a mouth.
So it's a plausible error that the system could make.
And given the complexity of images, you're often going to
find things of this sort.
If you had a model of a tree or a particular tree thing,
that would compete.
That would say, hey, this is far more likely to be a tree.
You shouldn't have a face there.
But at the bottom level of these cues, which are working
semi-independently, it's a legitimate candidate for a face.
And over here, there's another face here, et cetera.
Over here, text here, text here.
It thinks this is text.
Well, it looks consistent.
This could be a bunch of ones, and so on.
So these sort of discriminative, bottom-up
approaches are not always consistent, and they may, in
certain cases, be wrong.
So here, when you have the high-level generative
models, you start resolving the ambiguities.
These things are now detected as faces.
But this area here, this is a part of the general
region for the tree.
It's no longer described as a face itself.
And similarly, this area here, which was considered to be
possible text, is now described by the high-level
generative model as being a type of texture, because that
fits it better.
Over here, the nine, you may not have noticed, this nine
was not picked up by the lower-level text detection,
but it has been found here and is now interpreted as a nine.
So this is giving you this basic
strategy of the approach.
Images would be composed of patterns.
You would have various bottom-up cues, which you
would typically learn by machine learning approaches.
They would activate hypotheses.
These hypotheses would be tested by generative models.
And the generative models, in turn, would try and impose a
uniform interpretation of the whole image, which makes
everything consistent and happy.
Now I was going to show you this, which I think I should
show on the--
how do I get out of the slide presentation?
AUDIENCE: Probably Escape.
ALAN YUILLE: Escape?
OK, so I may need help with--
AUDIENCE: [INAUDIBLE]?
ALAN YUILLE: --guy here.
And now, let's see if I can put up a demonstration here.
So I'm seeing this here.
AUDIENCE: [INAUDIBLE]?
ALAN YUILLE: Drag it over to the left?
AUDIENCE: [INAUDIBLE].
ALAN YUILLE: OK, let me just go back here.
Go back to the beginning and make it full screen.
[SIDE CONVERSATION]
ALAN YUILLE: OK, so this is a great low-tech talk.
Here is an example of what one can do now, though this is
some time further on.
And so, if you don't know vision, this may not be very
impressive.
If you do know vision, you should be pretty
impressed by this.
So on the left-hand side, there's video taken just by
[UNINTELLIGIBLE]
simple video camera.
You go out into the street, you show these, run it around.
And on the right-hand side, it's in practically real time.
It's detecting text.
It is binarizing it.
AUDIENCE: [INAUDIBLE]?
ALAN YUILLE: Well, in some cases it's--
it's not keeping any memory of what is there from
frame to frame.
So there were certain aspects--
as the camera moves around, text is coming into view and into
focus, and other text is going out of focus, so some of the
text will appear in one frame and then will disappear again
in another frame.
Nevertheless, I think that this is
something that is not--
I don't know of anything else that can do this other than a
method of this type.
AUDIENCE: What kind of processing power does it take
to run this sort of thing?
ALAN YUILLE: It's not a lot, actually.
It's very--
Daniel, you do know the--
MALE SPEAKER: Yeah, it was run on, I think, a Pentium 3 or 4.
But it's only processing a few frames--
maybe 10 frames per second or something.
AUDIENCE: 10?
MALE SPEAKER: Yeah, maybe.
It's a relatively low-resolution frame.
But that's on a single [INAUDIBLE].
ALAN YUILLE: And so, what this is relying on is having data
sets of images, learning from that data methods for what is
distinctive about text, implementing them, and then
proceeding on to do binarization.
And from here on, you would go on to do recognition.
So the systems we have, this would be one of the more
practical things so far.
A bit of history or memory of what existed in previous
frames would remove the flicker.
But it will make certain mistakes in places which,
again, you'll see they'll flicker on and then they will
disappear later on.
AUDIENCE: So the algorithm that's doing this, really, is
just single-frame analysis.
There's no temporal correlation.
ALAN YUILLE: There's no temporal correlation.
So yeah, that's the flickering.
AUDIENCE: So there could actually be improved accuracy
if you took into account temporal correlation to get
rid of some of the other--
ALAN YUILLE: Oh yeah, definitely.
The temporal correlation would make things better,
definitely.
Everything here is based on cues, and each frame is sort of
producing its own cues; you could combine the
information from both of them.
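One simple way to picture the temporal smoothing being discussed, sketched here as a plain voting scheme that is only an illustration and not part of the actual system, is to report a text box only when it overlaps detections in most of the last few frames:

```python
from collections import deque

# Simple temporal smoothing of per-frame text detections (illustration only).
# A detection box is reported only if it overlaps detections in most of the
# last few frames, which suppresses one-frame flicker.

def iou(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

class TemporalFilter:
    def __init__(self, window=5, min_votes=3, iou_thresh=0.5):
        self.history = deque(maxlen=window)
        self.min_votes = min_votes
        self.iou_thresh = iou_thresh

    def update(self, boxes):
        """boxes: per-frame detections as (x0, y0, x1, y1) tuples."""
        self.history.append(boxes)
        stable = []
        for box in boxes:
            votes = sum(any(iou(box, old) > self.iou_thresh for old in frame)
                        for frame in self.history)
            if votes >= self.min_votes:
                stable.append(box)
        return stable
```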
AUDIENCE: So some of this text, it seems to change forms
as it flickers around.
Is it doing different things in that case?
ALAN YUILLE: No.
AUDIENCE: We got off the example.
Sometimes it comes out very readable and sometimes it's
very confused.
I'm just trying to understand exactly what we're looking at.
ALAN YUILLE: It's the nature of the text.
Certain of the text is different from others.
And some of that also relates to the binarization.
There are two stages in here.
One stage is actually doing the detection.
Then the binarization is making a certain amount of
errors in itself.
The detection itself, I think, is working extremely well.
The binarization, well, we're working on improving the
binarization.
And there are certain rules we have which are good, and then
there are certain cases where there were failure modes,
which we are then isolating and working on.
AUDIENCE: I think there's also some motion blur, interlace
artifact, that our eyes are taking out of the original
image, but which, because of the way it's composed in the
single-frame, no-memory, causes the--
MALE SPEAKER: Right.
[SIDE CONVERSATION]
ALAN YUILLE: So that was the work on-- basically, we've been
concentrating on finding faces and, well, particularly on
finding text.
Now the question is really, how do you go beyond this?
Because really, the idea is you don't just want to find
text and faces and things in images.
You want to find everything.
And everything we had before relied on having a certain
amount of training data.
To get the text thing to work, we had to have large numbers
of examples of text in real images so that we could train,
so that we could distinguish
between the text that actually was text and the parts of
images that could be anything else but looked like text.
And the same thing for faces.
And so that involves, really, having enormous numbers of
images because you really have to see what there is in the
environment that actually could correspond to text.
And if you think about it, vision is a fairly strange
subject in the fact that no one has ever really
characterized what happens in all possible images.
In speech, there are a certain number of words that
people utter.
There are phonemes and so on.
There's a fairly basic understanding of what the
basic vocabulary is, what the basic inputs are.
If you're an astronomer, you study stars.
And you know there are stars.
You know there are galaxies.
And you know there are dust clouds and a few other things.
You know what the basic elements of your domain are.
But for vision, there's not really a large amount of
knowledge of that.
On one level, you know you're working with images, with
pixels, with intensity values.
But there's very little understanding of the whole
complexities of what actually goes on inside them.
And so what my colleague Song-Chun Zhu is doing at Lotus Hill,
and a certain amount at UCLA, is developing this really
fairly ambitious project, which is more or less to take
a very large number of images and try to map them and to
understand, well, what really goes on in them.
Essentially, while the work I described so far was based on
the idea of parsing images into certain components, he's
taking it further, on one hand, by trying to parse
enormous numbers of scenes into visual components.
On the other hand, at the moment, it's being done more
or less interactively.
So it first started out a year or so ago with, I think, 20
Chinese art students sitting in front of images and hand
parsing them.
And I'll show you some examples later on.
And then it's moved more into an interactive approach so
that you can spare the art students some of their time by
putting in vision algorithms which can find certain
structures, and then all the art students have to do is
validate whether they're correct or not, or make changes.
And so, certain of the work we did on text was made possible
by having the amounts of data that we got out of
this process.
And so, once you have these representations, you learn
some very big things by having them.
For one, you get an idea of the structure that images
form, in the form of these graphical structures.
You get an idea of what can happen in all possible images.
And then, after you've done that, you can use the
representations for learning.
And then you can also use them for benchmarking.
You can see how well the algorithms actually perform
and where they don't.
If you're in the computer vision community, you know one
problem with computer vision is that, while it's easy to
come up with an algorithm that works nicely on a few images
and you can publish a paper on it, actually getting that
algorithm to generalize and to work in very large data sets
is a completely different business.
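As a hedged illustration of what benchmarking against such hand-parsed data can look like, one could score a detector's boxes against annotated ground truth with generic precision and recall. The function below is a standard sketch, not the Lotus Hill evaluation protocol:

```python
# Generic precision/recall scoring of detections against annotated ground
# truth (an illustrative sketch, not the actual Lotus Hill benchmark).

def box_iou(a, b):
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate(detections, ground_truth, iou_thresh=0.5):
    """detections, ground_truth: lists of (x0, y0, x1, y1) boxes for one image."""
    matched, tp = set(), 0
    for det in detections:
        for i, gt in enumerate(ground_truth):
            if i not in matched and box_iou(det, gt) > iou_thresh:
                matched.add(i)   # each ground-truth box may be matched once
                tp += 1
                break
    precision = tp / len(detections) if detections else 1.0
    recall = tp / len(ground_truth) if ground_truth else 1.0
    return precision, recall
```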
So here is an example of a typical Chinese image.
I guess he gets some of his funding from China, obviously.
And so here is an image.
This is obviously outside the Forbidden City, I think,
looking from Tiananmen Square.
And this would be trying to take that scene, segment it.
At this level, sky, building, flag, et cetera.
Streetlights, portraits, and so on.
Then mapping these down with line drawings, line
illustrations, of all these types of structures.
Here is, yeah, the layering: what structures are behind
which other structures, et cetera.
And here is the text--
this was stuff we were using, for example.
Labeling these as being text, these areas as being Chinese
characters, and there should be some here on
the top of the bus.
AUDIENCE: Is that part of the automated--
ALAN YUILLE: Yeah, that's in here.
It's the user interface annotation tool.
So this would be the Chinese art students.
The algorithms, this would be the Chinese graduate students.
And then tying it in with the knowledge database and so on.
Because the representations of the image would be based on a
representation that Zhu has defined, based on a sort of
and/or graph type version of the grammar.
And now he's initially guessing what that sort of
grammar representation ought to be.
So they have one version of the grammar.
And then, based on seeing how well they can describe certain
structures, they have to modify it, and so on.
So it's, in a sense, learning a good knowledge
representation for the data.
And the automation, well, more of it comes in over time.
Of course, in a sense, if you can automate the whole
process, then you've solved the whole vision problem and
there's nothing else to do.
But they're some way away from that yet.
But still, it's interesting to find out which computer vision
algorithms are actually useful for something like this--
which is hard--
and where they have to work reliably, and
which ones are not.
So here is just an example of a sort of representation.
So here is a small element of a scene.
So a boy with a backpack.
You would draw it at one level.
Then you would represent it in this hierarchical manner, in
terms of these parts that contain the hand or the
zipper and so on.
The boy would be represented in terms of the face, the hair,
the ears, the parts, and so on.
And all these would be encoded within this hierarchical data
representation structure that Song-Chun
Zhu's people are doing.
Here is, I think, another illustration of what one does
with, I think, a chair scene.
So you're labeling the parts of the chair.
You're putting the--
here is the seat of the chair, the cushion, the back, the
light, and so on.
I think the window and so on are in the background.
So from these you can both represent and you could learn
methods for detecting chairs.
You could also find the probabilistic relationships
between certain structures happening in the image, like
the chair and the table and so on.
This has been talked of in computer vision.
It's something that sort of comes and goes.
I remember even when I started vision 20 years ago, there
were schema being developed by [? Alan Wiseman and Hanson, ?]
which talked about these issues, but it was practically
impossible to do anything like that then.
There was no data.
There was no real-world images that you could really work on.
And there was no possibility that you could actually even
think of designing algorithms that would work on those types
of processes.
Street scene segmentation.
Google Earth images, which I guess he's got from here, I
trust, with your permission.
We'll take these images, label all the cars and so on.
Label all the buildings and get these representations and
then proceed to use these domains to build image parsing
systems based on the principles that I've been
describing.
Database.
I think the size of this is big, certainly by the sizes
used in computer vision.
I remember five years ago, databases of 100
images were quite large.
After that, Berkeley had a database of segmented images
which was about 2,000, which was considered big.
Well here, the total number of images is something like
500,000 so far, which Song-Chun says are
hand-annotated.
I haven't checked them all myself, so I'm not quite sure
of the performance criteria.
But still, it is an amazing amount.
[UNINTELLIGIBLE] broken up by looking at outdoor scenes, indoor
scenes, images of activities, aerial images, and
also coding of various animals, objects, and so on.
And this is by no means the end point.
The institute has been running now for about a year and a
half, I think, and 500,000, I think, is a big number.
But Song-Chun is shooting for a lot more--
orders of magnitude, videos being included.
AUDIENCE: [INAUDIBLE] a big proportion of that is videos,
so I guess part of what makes this [INAUDIBLE] is that you
can use [INAUDIBLE].
ALAN YUILLE: Yeah, I'm not sure exactly on that, how much
of that is done by exploiting that property.
But apparently, there were a number of videos of Chairman
Mao's speeches which he had.
AUDIENCE: Do you know how much time it takes to label one
image, on average?
ALAN YUILLE: I'm not sure, exactly.
I spent time out there.
There are highly motivated people working eight, nine
hours a day, just doing these things.
I'm glad I'm not doing it myself, but Song-Chun, I'm
sure, could give you the figures of doing that.
It's also going to change as soon as he [UNINTELLIGIBLE]
putting the annotation in there, and you start just
having to prune things out which are bad.
[INAUDIBLE], have you got any idea of time?
MALE SPEAKER: No.
ALAN YUILLE: [INAUDIBLE].
And I think, also, with this process there are time issues
like, how long does it take to train the students to do this
type of thing, et cetera, and the efficiencies
which come with time.
So that's part of how one could take the
image parsing further.
One needs to have this data set.
You need to have these types of representations.
You need to have this for training.
You need to do it for validation.
You need to have it for learning all
your types of models.
Here are just a variety of cats in here.
Because, partly, I like cats, so I picked one
example slide there.
And then, certainly, the representations also go into
3-D scene labeling, so I just put in a slide of that form.
Any questions on the data sets?
Because I'm now moving on to, I guess, the
final part of the talk.
AUDIENCE: From the groups and hierarchy that you have in the
various domains, are there generalized catch-alls?
For instance, under land mammal, is there the notion
that you can actually build a model that represents a
probabilistic match to a land mammal, as opposed to only
having the specific models for big cat, dog, and you'll
always map one of those?
ALAN YUILLE: I'm not sure of the exact status.
You would like to be able to have as generic a model
as you can, obviously, for grounds of interpretation.
And to the extent that you can find similarities, you'd
have a grammar that could generate a kangaroo with one
input and generate something else with another input.
That's what you'd like to do.
And frankly, I think that's quite practical, given the
fact that all mammals have very similar structures.
There's been work in the past. There's work I did with
Song-Chun Zhu a long time ago, when he was a graduate student,
on a form representation system, which involved
representing things by skeletons and so on.
And a lot of other people have worked on that sort of
skeleton-type representation structure.
And so I think it's not impossible to do that.
I should also say that, in terms of detection as well--
so, for the earlier system that we had published, we
built something which would detect faces bottom up.
We built something that would detect text bottom up.
But you don't really want to build something that detects
the 8,000, 10,000 possible objects in the scene
individually.
You want something that exploits the
similarities between them and will output
a number of these possible structures.
There's been a bit of work done by Bill Freeman at MIT
related to that [UNINTELLIGIBLE].
[? Wiseman ?] has also considered that issue, but you
want to find commonality and build up with ideas of
compositionality.
Putting certain parts together, parts that hopefully
reoccur, and can be detected individually.
So, any more?
Yeah?
AUDIENCE: Sort of a similar question.
Some of the objects were referred to in aggregate, like
in the Maps image, a set of cars was all one [INAUDIBLE].
How do you discriminate between a car and what
constitutes an aggregate?
ALAN YUILLE: Actually, there were some slides which I took
out on varieties of cars.
For that, there was at least a generic car, and then a series
of small specific types of cars.
And quite how that functioned--
what was your question, more specifically?
I mean, both aspects are being addressed--
AUDIENCE: So in the Google Maps figure, there were single
cars labeled on the road.
There were also groups of aligned cars.
Is that generally a big problem?
Each of those, obviously, is made up of single cars.
ALAN YUILLE: Oh, right.
Well, then you'd have the hierarchical representation
where at the bottom-level nodes, you'd have the
individual cars, and then you would have those grouped into
regularities such as rows of cars.
And then there would be a node higher up,
which would be parking--
well, not a complete parking lot, but some public
structure, and so on, and then a node higher up still,
which would be the whole parking lot.
So the hierarchical modeling is intended to take care of
those issues.
AUDIENCE: So are you using the same recognizer for the map
shots as you would for a ground-level scene shot?
Or are they separate?
ALAN YUILLE: They would be separate.
Yeah.
The cues [UNINTELLIGIBLE], I think, the viewpoints suggest
two different--
for that to be done.
So now, for the last part of the talk, I want to get into
some other work with another person called Zhu, but this is
a graduate student, Zhu Long.
So not to be confused with Song-Chun.
I'm told the Chinese character is actually different, even
though the Anglicized version, Zhu, is the same.
So here is an attempt to learn probabilistic grammars in an
unsupervised way, just from input images.
So here, we're not relying on the types of detailed
segmentations and annotations that Song-Chun's group is
producing because, well, they weren't around when we were
starting this work.
And in any case, one would like to see how far one can go
without having to have hordes of Chinese graduate students
do that type of work for you.
And so we are addressing work in the Caltech 101 dataset
that, I guess, a number of people may know.
So here, this was set up at Caltech by Fei-Fei Li working
with Pietro Perona.
And she got 101 different categories.
I think, apparently, Pietro told her to get more than 100,
so she went up to 101 and stopped there.
And so this data set, though it's simple compared with what
Song-Chun is producing, and there are certain criticisms
of it, is still considered a good,
state-of-the-art benchmark--
it's something that people are starting to use and compare
results on.
And so here are certain of the objects you have in it.
This is chair, cougar, et cetera, and below, certain of
our recognition results on it.
The important thing, perhaps, for us isn't so much the
quality of the performance, which is fairly good, but it's
more the concepts behind it, at least for
this part of the talk.
So you think of having a probabilistic model or grammar
that could generate these objects.
And so, the way these things are done is you're given a
series of images.
And in each image, there is an object.
And it's a face, but you do not know where
it is in the image.
And you don't know what the background is, and the
background is fairly complicated.
So I think Pietro Perona calls this unsupervised learning.
That's not quite fair.
It's more semi-supervised.
But still, you don't have the detailed knowledge of where
the boundary of the face is, or the particular position or the
size or the scale.
So the idea is that you try and learn the grammar
incrementally from these sets of images.
And so the first thing you do is you assume that the image
is just purely background.
It's the default model.
It's like it's random.
That corresponds, schematically, to this graph
structure here.
This is just a background generating
image, and that's it.
I should say this work here, initially, is done just on
feature points extracted from the image, and later on, it's
generalized.
So you can think of these as feature points in the image
to start out with.
You take the image, you run an interest point detector on it,
you end up with 40 feature points, something like that,
and you want to explain them.
And they have attributes like appearance properties and
so on, and you're using methods like Kadir-Brady interest
points and SIFT descriptors and so on.
So you start out with this default model.
That's a rather boring model, but it's what you need
to start out with.
So then you start seeing, can you find more structure, more
regularities in the image, which are more likely than the
data just being generated by this independent model.
So here, you start doing it in terms of combinations of
triplets of features.
Triplets are quite useful because, from triplets of
features, you can get properties that are invariant to
the orientation and to the scale, so you don't need to
have those things fixed.
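The reason triplets are handy is that three points determine a local frame, so geometric quantities measured relative to that frame do not change under translation, rotation, or uniform scaling. Here is a minimal illustration, not the actual descriptors used:

```python
import math

# From a triplet of feature points you can compute quantities that do not
# change under translation, rotation, or uniform scaling (illustrative sketch).

def triplet_invariants(p1, p2, p3):
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    d12, d23, d31 = dist(p1, p2), dist(p2, p3), dist(p3, p1)
    perimeter = d12 + d23 + d31
    # Ratios of side lengths are scale invariant, and they are unaffected by
    # rotating or translating the whole triplet.
    return (d12 / perimeter, d23 / perimeter, d31 / perimeter)

print(triplet_invariants((0, 0), (2, 0), (0, 2)))
print(triplet_invariants((1, 1), (1, 5), (5, 1)))  # same triangle, moved and scaled
```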
Then you see whether this model will explain
it better than this.
Then you grow this model to a more complex one, see, can you
explain it better, and so on.
And you keep on adding new features, adding more elements
of this graph, to this grammar.
While your ability to explain the data goes up, you'll carry
on adding extra features in here.
So you'll grow your grammar over time with data.
People are probably familiar with AdaBoost; in the
AdaBoost algorithm, you do feature selection, you use a
certain number of weak classifiers to do a task, and
then there's a procedure which decides what's the best new
weak classifier to add into the system.
Well, this is a little like that, except it's rather more
complicated, because here you're learning how to increase the
structure of a very general probabilistic model, rather
than just a classifier like AdaBoost.
So [UNINTELLIGIBLE] with those things here, and the details
of this are--
well, I can describe them later to people who are
interested, but this gives you, I hope, the basic idea.
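At a very high level, the structure-growing procedure can be sketched as a greedy search: keep proposing new grammar elements (for example, feature triplets) and accept one only if it improves a penalized score on the training images. The sketch below is a paraphrase of that idea with hypothetical method names (with_added, propose_candidates, score), not the published algorithm:

```python
# Greedy structure learning, sketched at a high level (a paraphrase of the
# general idea; with_added, propose_candidates, and score are hypothetical).

def grow_grammar(images, grammar, propose_candidates, score,
                 max_rounds=50, min_gain=0.0):
    """score(grammar, images) -> penalized log-likelihood of the data."""
    best_score = score(grammar, images)
    for _ in range(max_rounds):
        best_gain, best_candidate = min_gain, None
        for candidate in propose_candidates(grammar, images):
            trial = grammar.with_added(candidate)          # hypothetical API
            gain = score(trial, images) - best_score
            if gain > best_gain:
                best_gain, best_candidate = gain, candidate
        if best_candidate is None:
            break                       # no candidate improves the score; stop
        grammar = grammar.with_added(best_candidate)
        best_score += best_gain
    return grammar
```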
From that, you could learn a grammar of the object.
And this grammar is a fairly simple one at the moment
because it's based on feature points.
It would be comparable to the types of models, somewhat,
that are done by Perona and Fergus on the constellation
models, if people are, perhaps, familiar with that
literature.
But there's a difference here.
Everything here is learned completely unsupervised.
And the grammars we're getting here--
the models that Pietro and the other people get have fixed
numbers of points.
Here, with the grammar, you're able to have different numbers
of points depending on different
aspects of the object.
So the motorbike can be decomposed into this type of
aspect appearance.
Here is another form of appearance.
Here is another.
And depending on the amount of viewpoint variation, you could
have other aspects being developed as well.
Another advantage of this process is, once you've
learned it, how you can do the inference.
The form of the grammar is set up so that doing the
inference is very fast. So for example, once you've learned
the structure, performing the inference to find out whether
there is a motorbike in the image or not takes about one
second, which by the speeds of what alternative methods
can do is very fast, moving towards the practicality of
real-time performance.
Three tasks.
This is just to bring out the main different tasks.
Once you've learned this probabilistic model, how will
you do inference?
That would involve detecting the object in the image,
parsing it, finding its boundary.
The second would be learning the parameters of the model
when the structure of the model is fixed.
And the third is structure learning, where you allow the
grammar to grow based on the data.
So there are three different tasks related here.
The inference is the part that's really fast. The
parameter learning part is reasonably quick.
Structure learning is a bit slower, on the order of
several hours, but that's something that you do offline,
in any case.
And here, just to say some of the algorithms involved:
you're using EM, you're using dynamic
programming, saddle point methods.
There's a certain range of techniques put in
there to do those tasks.
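For the parameter-learning task, the EM loop would, in outline, alternate between soft-assigning feature points to model parts and re-estimating the part parameters. The following is a generic sketch under a simple fixed-variance Gaussian assumption, much simpler than the actual model:

```python
import numpy as np

# Generic EM outline for a simple part-based model (illustrative only; the
# actual model, features, and E-step in the talk are far more elaborate).

def em_fit(observations, init_means, n_iters=20, var=1.0):
    """observations: (N, D) feature vectors; init_means: (K, D) initial parts."""
    means = np.array(init_means, dtype=float)
    for _ in range(n_iters):
        # E-step: soft-assign each observation to each part (hidden variable).
        sq = ((observations[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        resp = np.exp(-sq / (2 * var))
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate each part as the responsibility-weighted mean.
        means = (resp.T @ observations) / (resp.sum(axis=0)[:, None] + 1e-12)
    return means
```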
So here, coming up with, again, more examples.
Roosters, pianos, et cetera.
Faces, airplanes, and so on.
Now, these methods are effective--
here's invariance to rotation and scale.
I'll skim over that because we're getting short on time,
but the form of representation enables it to do
that quite well.
Here is something that I think is conceptually interesting:
if you don't say what is in the image, it will start learning
the models individually as different
branches of the grammar.
So before, you would give an image in which there would be a
face somewhere.
You'd have the information that there was a face in there.
Now we're saying, and we're making it weaker, that there's
something in the image which is a face, or it could be a
plane, or it could be a motorbike.
And then it will learn the grammar, and one part of the
grammar will correspond to the plane, one to the face, one to
the motorbike.
So it will start learning that there are different types of
objects, automatically, without you being taught it.
Now these results here, I think they're good.
And that's comparable to, or maybe better than, the current
state of the art for representations which are
based on representing objects by interest points.
But interest points are a sparse
representation of the object.
And you are only getting a limited amount of
information from them.
You're getting the ability to do certain tasks; certain
classification tasks seem to be possible based on interest
points, because a grand piano and a face may look very
different just based on the interest points.
But if you want to get more fine-scale resolution or if
you want to actually find the boundaries of the objects, you
have to go to a richer set of vocabularies.
So the next stage is to say, OK, we don't just represent
the image in terms of the feature points.
You represent it in terms of interest points, then you add
extra features: it could be a mask for the whole shape
of the object, it could be edgelets.
And as we can move forward with this, we can start
putting in extra features.
So eventually, one would hope to get enough features so that
you could really represent the object completely.
The interest points are very good places to start because
they sort of have invariant properties, there are small
numbers of them, they can get you started.
They can tell you roughly how big the object is.
They can tell you the orientation, et cetera.
Once you have those, these other features can be applied.
So over here, we start out with the basic objects, just
using interest points.
We find the interest points, and those interest points start
telling you roughly where the object should be.
So here is a box based around the interest points of this
star; this thing here is one based on the interest
points found here.
So locate a certain area from that.
Then from this, automatically, you can start by hypothesizing
a model for the shape of the object
given, again, 100 examples.
And you can learn that by doing inference, where that
would be using certain information from the inside of
the object and from the outside the object.
Again, purely unsupervised.
And so from that, you end up being able to learn masks of
the object, which are reasonably good.
Though [UNINTELLIGIBLE] screw up in some places,
like with the chair.
They're not quite getting the legs of the chair very
accurately.
But nevertheless, for most of these objects, they're getting
a pretty good model of the shape of it.
And so you're going here from the basic, very sparse
interest points to a probabilistic
model of the shape.
And then you're adding in edges and so on.
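One way to picture the step from interest-point boxes to a probabilistic shape mask is to make a crude foreground guess inside each box, warp the guesses to a common frame, and average them over the training examples. This is only a schematic illustration of the idea, not the learning procedure used:

```python
import numpy as np

# Schematic bootstrapping from interest-point boxes to a probabilistic shape
# mask (an illustration of the idea, not the actual learning procedure).

def learn_shape_prior(images, boxes, out_size=(64, 64)):
    """images: list of 2-D grayscale arrays; boxes: matching (x0, y0, x1, y1)."""
    h, w = out_size
    accum = np.zeros(out_size, dtype=float)
    for img, (x0, y0, x1, y1) in zip(images, boxes):
        crop = img[y0:y1, x0:x1].astype(float)
        # Crude foreground guess: pixels that differ strongly from the crop's
        # border, which is mostly background inside a loose bounding box.
        border = np.concatenate([crop[0], crop[-1], crop[:, 0], crop[:, -1]])
        fg = (np.abs(crop - border.mean()) > border.std() + 1e-6).astype(float)
        # Resize to a common frame by nearest-neighbour index sampling.
        ys = np.linspace(0, crop.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, crop.shape[1] - 1, w).astype(int)
        accum += fg[np.ix_(ys, xs)]
    return accum / len(images)   # per-pixel probability of being object
```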
So I'm hoping that with this method, by adding more
features, you can sort of bootstrap your way up until
you have a model that generates not just the
interest points and the mask and the features but could
generate, maybe, the appearance of
the objects as well.
And so, with this, you can not only do things like
classification, which is what people do on the Caltech
database and get numbers for, which are as good as or better
than the other approaches, but you can also get
methods for detection, for the positions of the
boundaries, and so on.
And I think this would address one of the concerns with the
Caltech thing, which is that, though there are large numbers
of objects in the dataset, they're not really
representative of the domain.
And so, if you compare it, say, to the work on text
detection, which we've done a lot on, we had to get examples
of text, and then we had to get lots and lots of examples of
things that were not text in images--
thousands, tens of thousands, even more non-text things, to
find out anything that you could confuse with text.
And so, if you take the Caltech 101 dataset, and you
take the interest point models we have or the models that
other people have, sure, they'll work quite well in the
Caltech 101 dataset, but there's going to be a lot of
other things out there in the world that they're going to
mix up with these objects because they have too limited,
too restricted representation of the data.
So hopefully, with what we're doing here though, as you add
on more features, as you use the interest points just to
get started, you add on more and more features, and that
allows you to discriminate better and better between
these objects and everything out there.
And then also, it enables you to find the boundaries and do
every other thing that you want with it--
finding the boundaries, doing the parsing, saying, hey,
that's not just a grand piano, but here are the legs, here
are the keys, et cetera.
So that would be one goal.
So I guess I'm finishing more or less on time.
So here would be the conclusion slide.
So generally, the group--
our UCLA group, the Center for Image and Vision Science,
myself, and Song-Chun-- we're interested in formulating
problem tasks like image parsing, using probabilistic
grammars for objects [UNINTELLIGIBLE]
patterns, with high-level models of structure and low-level
models acting in a bottom-up, top-down method: low-level
cues activate the high-level models, which then check,
confirm, and make everything
consistent.
And I think where we've gone furthest on this so far with
something that's pretty good is the automatic detection and
binarization of text.
We're then moving on to these other things.
So Song-Chun Zhu's genome project of images with all
these art students and extracting these data sets and
getting these hierarchical representations by hand and
using them for training and for testing, et cetera.
And then this other work I'm describing at the end, the
issues of how far you can go with learning these grammars
in an unsupervised way without having to get Chinese art
students to give you clues part of the way.
So I think, hopefully, these aspects come together and one
can get very powerful methods.
The talk has been mainly about computer vision.
I should say I'm also very interested in how the brain
does these types of things.
And earlier this week, I was at a meeting at NSF in
Washington where issues of how you would model the brain from
all different perspectives--
from physics, computer science, statistics, and so
on-- were coming out.
And so I guess I was also arguing there for this as a
type of model that you could have. We're trying to build
here a system that is as sophisticated, more or less,
as a human visual system.
And you can try and see whether certain of the aspects
of this relate to the properties that we know about
the human brain and how it's organized, and whether
something like this could be tested, validated,
confirmed, or whether I would find out that the brain is doing
something different, better, or possibly something worse.
Who knows?
So I think it ties in here, and one could also start using it
as a theoretical neuroscience model capable of
some tests.
Anyway, thank you very much for your attention.
AUDIENCE: Are your probabilistic grammars and
parsing techniques all built on specific points or specific
features or something like that, or do they use
[INAUDIBLE] patches, or--
ALAN YUILLE: They could use most things.
It's only the one at the end that we're starting off with
the interest points.
The things at the beginning for the faces, that could be
patches, that could be any type of description that is--
yeah, anything that one can put in there.
So there's no restriction to interest points.
You really want to be able to explain the whole image.
AUDIENCE: What's the nature of the parsing algorithm that
works with these?
[INAUDIBLE]
patch as a thing to do distance matching on a parser
something like that?
ALAN YUILLE: No, well, if you were trying to recognize a
face, you'd have a generative model for what a face would
look like, which you would learn by
training data by having--
for the face, it would be a bit like the active appearance
models, so you'd have some sort of spatial warp, some
models of the intensity and so on.
The patches there could come in as possible cues, if you
could statistically say that certain patches were highly
likely to be present where there was a face, then that
would be a useful feature that you could put into a bottom-up
AdaBoost detector.
But that would be there merely to drive the activation of the
top-level generative model to validate and to confirm it.
AUDIENCE: So I had a question around, I'll call it, viewpoint
sensitivity.
Although you had invariance to rotation and scale,
that was for a given viewpoint on
the original image.
I was looking through the Caltech 256 database just now,
browsing the images, and it seemed as if, very often, the
image of the object always respected what might be called
the principal axis of the object.
But you never saw a book face-on like this and then
edge-on like this, and expected it to
recognize both as the book.
So is there a precision recall sensitivity issue, in terms of
the perturbation from this trained viewpoint?
ALAN YUILLE: Well, there isn't, but the system I
described should be able to deal with that by having a
grammar, one aspect being for what the book would look like
front-on, another for what it would look like from side-on.
So at the top of the graph--
AUDIENCE: So given the right training data--
ALAN YUILLE: Given the right training data, yeah.
It's not clearly shown here--
no, there's nothing here which really quite demonstrates it.
But here, this would be like an Or graph:
it could look like this or this or this.
So if you had enough training data, it could happen
automatically.
There is an issue, though, whether that is the ideal
thing to do, or whether there are similarities between the
representation here and the representation at other angles
that, by going to a full 3-D representation, you
could do away with.
So at the moment, given enough training data, this sort of
thing should work, should deal with that.
But there could be a smarter, better representation--
AUDIENCE: Does it represent the underlying physics in
terms of the light field?
ALAN YUILLE: Yeah, it keeps on coming up.
There's an issue about-- certain of the machine
learning techniques are very data intensive.
You rely a lot on them.
There's knowledge of how humans learn.
Humans seem to learn very quickly, from one or two
examples, and then they generalize.
And babies do it enormously well.
There's interesting work being done by Josh Tenenbaum and
people at MIT on that, not directly on vision, but on the
issue of how humans learn concepts.
And it does seem there that, in his models, you've got a
more structured representation, which means
that it gives you more of the structure and allows you to
generalize more easily from a small number of examples.
And so I think that's an area that I'm partly trying to push
in myself at the moment, since having enough data works
and we know how to do it.
But still, it would be certainly more elegant and
more effective in the long term to get at
those invariant issues.
And it's an area of importance at the moment.
AUDIENCE: So I guess I'm just curious how close this is to
being production quality.
If I wanted to, say, find text in Google Images, what are the
issues standing in the way of doing that in a useful way?
ALAN YUILLE: I think finding text would work pretty
well, I would say, since we've tested it on
large numbers of things.
There's finding the text, there's the binarization, and
then there's reading the text.
The finding of the text, I think, goes pretty well.
You can bank on that.
Binarization, that's working pretty well, but there are
certain failure modes that one detects.
And again, with more datasets and more labeled images, you
find ways around those.
The reading of text--
we've taken the outputs of these things, put them
into OCR systems like ABBYY or things like that, and it's
surprising how often those systems fail, even when we
give them binarization that looks very good.
So at some level, I think the limitation of those things is
the text reading systems, which, after all, are
designed to work on documents like these.
They're not designed to work on real-world images.
And I think the limitations are those.
If you want to find the text, if you want to binarize it and
put it into a system, then I think these
things work pretty well.
For the other aspects, other things I think we have not
spent so much time on or developed so far.
And partly, it has been an aspect of having the data.
Text detection stuff, we started working on five, six,
seven, eight, years back.
We started getting images.
We started with small numbers.
We tested, we expanded, et cetera, et cetera.
With these other things, it's only in the last year and a
half that Song-Chun has got these people to get these
types of images and to go forward.
Also, it gets into this issue of variability.
Text, the amount of variability is not as much as
there would be for a deformable object like a panda
or something like that.
And the more deformable it is, unless you have a good
knowledge representation system that is able to deal
with that, the more data you're
going to need to do it.
AUDIENCE: So how hard would it be to generalize, to stylize
things like logos and stuff, where they take normal text
and they intentionally make it unique?
ALAN YUILLE: Those can be frustrating sometimes.
On those, we do surprisingly well, except with certain
particular cases where there are some slight problems. For
example, the text stuff is based on
black and white images.
If you have color images where the text is green on red and
the intensity is almost exactly the same, then the
system is not able to do that.
I think, again, with a bit of training or a bit of putting
in certain cues for those things, I don't think there's
going to be too much of a problem for those, as far as
the detection stage and maybe the binarization stage.
But certainly, yeah, those have been frustrating in the
past.
AUDIENCE: So in your grammar--
so ideally, you have the whole branch.
So it should be one type of, let's say face.
So could it be possible for one branch, a human face mixed
with a, maybe, cougar face--
ALAN YUILLE: Well, if you showed the system cougar
images mixed together with face images and just learned
it, it would be quite--
we haven't done that.
Maybe we should try it and see what happens.
The other issue is, at the level of interest points,
cougar faces and human faces are a bit similar.
They're different enough so that when you learn the model,
they are able to distinguish between them.
But in the stages where you're trying to learn it without
telling the difference between the cougar and so on, it might
not work just with the interest points.
I think, when you start putting in the other stuff,
it's going to go a lot better.
But with interest points alone, you might get mixed up
until you've gone far enough to learn the full model.
OK, thank you.