>> Our next speaker is a UC Berkeley representative, Go Bears! That's where I hail from. Trevor
Darrell is on the faculty of the UC Berkeley EECS Department and leads the Computer Vision
group at the International Computer Science Institute in Berkeley. Before that, from 1999
to 2008, he served on the faculty of the MIT EECS Department, where he led the CSAIL Vision Interfaces group. Since he joined us at [INDISTINCT], he's been working on visual perception, machine learning, and multimodal interfaces, and he's also doing some entrepreneurial activity
in Berkeley on the side. Please help me welcome Trevor Darrell.
>> DARRELL: My results from the last month: this is my one-month-old daughter and her older brother. Her name is Lenea. So that's been our happiness at home. And
I think today I'd like to be maybe a bit of the reconciler after this 3D versus 2D debate
that we've seen unfold, and tell you a little bit about the work that we're doing in Berkeley and our perspective on using machine learning to really bridge the gap between the sort of robotic vision perspective that we've heard and the more traditional computer vision perspective.
So, one thing to say is even though we have seen a lot of progress in computer vision,
there's still a long way to go. I don't think we're close yet to having broad category level
recognition even in relatively simple indoor environments. So maybe we're close, maybe
a year or two away but it's still going to require some significant advances. And maybe
it will come from bridging the gaps that we've seen here today, right? So, it's almost a
bit of a set up with just the discussion that we had which is nice. Because I do think that
there are divergent perspectives in terms of how vision people traditionally would see
these problems, and robotics people would see these problems. So, what's the vision
perspective? Well, at least from the morning, it's really a machine learning philosophy.
And that can often be boiled down to who has the largest dataset wins, right? I mean, algorithms
matter but data really does matter. There's a downside of that. And we see in the flipside
when we look at how roboticist approach the problem, it leads to the least common denominator
in terms of the features that one uses. You don't really know what color means on the
webs. You probably not going to use it. There's not a lot of 3D data on the web, yet. So,
we'll probably not going to use it. But there's a lot of data and we can learn categories
that would be features. What's the robotic point of view? Well, it's a sensing paradigm.
Sensors are cheap, and they're getting cheaper. Why not just throw on multi-spectral imaging, four different Kinects, have everything on there, 3D models; it's great. Whoever has the most sensors wins, right? Well, it's sort of the flipside of that coin. With so many sensors, you really can only get training data in the particular environment where you built your rig. It's very hard to generalize, at least with conventional methods. So we
have strong features, but really an instance recognition focus so far. So, where do we
go from here? So, which one is right? I'd like a show of hands. So, how many people
go for computer vision? How many people go for robotic vision? No one wants to commit
here, which is probably the right answer. Neither one is entirely right; both are probably right.
And the philosophy that I'd like to advocate really combines many of the different themes
we've seen today including a machine learning foundation which is, let's learn to use both
of them, right? We're really not going to simply adopt the 3D paradigm or 2D paradigm.
And so, here's the kind of thing I mean by that, right? We'd like to learn how to leverage robotic and vision perspectives together. For example, develop rich, low-level features that can handle the kind of complicated sensing that happens in the real world. Exploit machine learning to learn mappings between modalities, so we may have a certain set of modalities at training time and a different set of modalities at test time, and still, without any manual tuning or engineering, get something done. And finally, learn the shift between domains. We may be trying to run a robot in this environment, recognizing objects that are found on the fly, but using data from an environment completely external to this environment, with different types of imaging conditions and whatnot. And we'd like to do better at that than a naïve approach would. So, there's a sampling of different
projects that I could tell you about today, including hallucination and learning of these 2.1D representations. I'm not really going to have time to talk about those two. I'm instead going to focus on these three. The first is learning low- and mid-level features using sort of hierarchical, probabilistic sparse coding, actually echoing many of the themes that Andrew presented earlier today. Then I'll talk about our work on domain adaptation, how we learn in one domain and test in another. And then I'll briefly also sing the praises of the Kinect and related 3D sensors and mention our effort to collect a new dataset for 3D category recognition. So, let's dive right in. So, first, let me tell
you about this work which is a probabilistic model for recursive factorized image features.
This is a CVPR11 paper by Sergey Karayev, Mario Fritz, Sanja Fidler, and myself. And
it actually follows much of the same philosophy that we saw this morning. So I really can go both sides here: I can tell the Willow Garage story, or I can tell the it's-all-just-semi-supervised-learning-or-sparse-coding story. We'd like to have a distributed coding of local image features, especially the kinds of representations that are useful for vision. Historically, they've been coded using relatively naïve vector quantization approaches to find these visual words. And as we saw this morning, approaches based on additive models, especially sparse coding, have been quite successful. We've also looked at these ourselves in terms of a probabilistic topic model approach to an additive decomposition of, for example, SIFT descriptors, so you can see how you might take a local descriptor and factor it into constituent parts that might capture the fact that there are multiple different things happening locally in an image or an image patch. So, we're also
on the [INDISTINCT] hierarchical models. Hierarchies are very important. I think everyone had a
slide like this today. It's interesting. You shouldn't have one of [INDISTINCT]. Yes, good.
So, it probably doesn't require much motivation to say that hierarchies have strong inspiration from biology, that if you look at psychophysics there must be top-down influences in perception, and also that hierarchies enable effective sharing of visual features. And there's been a long tradition of models in computational neuroscience and in computer vision that have chipped away at this idea of learning a representation, and learning it in a hierarchical fashion. I won't have time to go into all of these models, but one limitation of many of the existing methods is that they do so only in a feed-forward fashion: they approach learning and inference only from the bottom up. So, you have a representation here where--I apologize if you can't see my pointer--you learn a set of sparse codes, or in our case a probabilistic topic model, over image patches. And then, to explore a hierarchy or a recursion on this representation, you take the stacked output of those activations and use them as inputs to another layer. First you learn a representation for the first layer, and then you attack the entire problem again and learn a representation for the second layer.
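(As a rough illustration of that stacked, feed-forward pipeline, here is a minimal sketch. It assumes an off-the-shelf topic model from scikit-learn standing in for the per-layer learning and uses illustrative pooling parameters; it is not the actual rLDA implementation, which infers both layers jointly with a Gibbs sampler.)

```python
# Minimal sketch of the feed-forward (stacked) baseline: fit a layer-1 topic
# model over patch descriptors, freeze it, then fit a layer-2 model over
# pooled layer-1 activations. Illustrative only; not the paper's joint model.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fit_feedforward_hierarchy(patch_hists, n_topics1=64, n_topics2=32, pool=4):
    """patch_hists: (n_patches, n_bins) non-negative histogram features."""
    layer1 = LatentDirichletAllocation(n_components=n_topics1, random_state=0)
    z1 = layer1.fit_transform(patch_hists)            # layer-1 activations

    # Stack activations of `pool` neighboring patches as layer-2 input.
    n = (len(z1) // pool) * pool
    stacked = z1[:n].reshape(-1, pool * n_topics1) * 100.0  # count-like scale

    layer2 = LatentDirichletAllocation(n_components=n_topics2, random_state=0)
    z2 = layer2.fit_transform(stacked)                # layer-2 activations
    return layer1, layer2, z2
```

The joint model described next avoids exactly this freeze-then-stack step.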
Instead, what we've explored in our recent work is: what is the benefit of jointly optimizing across the hierarchy, so that you don't, in fact, rely on a fixed representation from the bottom layer when you're learning the top layer, but can optimize all aspects of the representation jointly? So, this is our goal and our result in our recent CVPR11 paper: a distributed coding of local features in a hierarchical model that allows full inference and full recursion. We derive our methods based on probabilistic topic models using Latent Dirichlet Allocation, and we call it, alternately, recursive LDA or sometimes hierarchical LDA. I won't go into this in detail; many of you have seen these models before, and they've been used in vision in many places over the years. Most famously, they're used
to describe topics which are composed of quantized SIFT descriptors, or visual words. That's not how we're using them here. We're actually building topics over the descriptor itself, not topics over quantized descriptors. So we have a representation, for example SIFT, and we form topics over the individual cells in the [INDISTINCT] histogram. So our topics can be thought of as finding structures that are additively combined, or transparently combined. And we can find these topics, which is similar to what we've seen before in sparse coding, but now they're evaluated over local SIFT descriptors, or they're forming the constituent basis of histogram-based descriptors.
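(To make that concrete, here is a minimal sketch assuming SIFT descriptors stored as 128-dimensional gradient histograms; the scikit-learn call and the topic count are illustrative stand-ins, not the paper's actual inference.)

```python
# Sketch: fit topics directly over SIFT descriptor bins, so each descriptor
# becomes an additive mixture of learned "structure" topics rather than a
# single quantized visual word.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def topics_over_sift(sift_descriptors, n_topics=32):
    """sift_descriptors: (n_desc, 128) non-negative SIFT histograms."""
    X = np.asarray(sift_descriptors, dtype=np.float64)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(X)     # per-descriptor topic mixtures
    phi = lda.components_            # topics: distributions over the 128 bins
    return theta, phi
```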
And here's a visualization of average images that might give rise to each topic. So our baseline is just the feed-forward method. We can express here an instance of the recursive model in its full glory, and here's a visualization of how you might jointly be estimating topics over a patch and then topics on top of topics over a patch. We've approached this with a Gibbs sampler for inference and use a few schemes for efficient initialization. And so far in our tests, we're able to show that this recursive model, which jointly optimizes both layers--we'll go beyond two layers eventually, but so far that's what we've explored--is better than just feed-forward initialization, and it's better than just a single flat model, even when you control for dimensionality. And it compares very favorably to the other published hierarchical models; I think, given all that we've considered today, it compares very well even against the best of the best, such as the work that Kyle and others have shown earlier today and the saliency methods from UCSD. So, these are the kinds of visualizations we got. So, the first part
of the story is how we can learn good models from the bottom-up. We can learn them in a
hierarchical fashion from data, and we find that when we do so, we get better performance by jointly optimizing the representation rather than just doing it feed-forward. And there are many different future directions we are considering, including pushing the hierarchy all the way up to the object level, looking at a spatiotemporal volume, training discriminatively, and considering non-parametric models. Can you still hear me if I walk away from the podium?
>> [INDISTINCT]
>> DARRELL: Oh, good. So, that was the first theme today. The second theme is domain adaptation, which is what happens when you train on one thing but you test on another. What you see is not always what you get in vision. And this is the work of Kate Saenko and Brian Kulis, who [INDISTINCT] in my lab, also with Mario Fritz. And Kate is jointly appointed at Harvard and at Berkeley--now that's a commute. So, people are very good
at a wide range of visual category problems. You can recognize these mice in many different domains; even if I show you the example of Mickey Mouse, you can then find the matching mouse in a real image. But machines often suffer even with very simple transformations.
So, even if you just go from one type of image sensor to another image sensor, you can find
yourself in a very different feature space. And you can also find very different domains
just by looking at objects at different scales, or in different poses, illuminations, or backgrounds, or when you're moving from one domain which is more artistic to another domain which is more surveillance, for example, or across seasons. So, we'd like to overcome this problem of training in one domain and testing in another domain. In most object recognition paradigms, we assume that we have essentially the same distribution of training and testing data. So we assume that we have a lot of labeled data, we get, you know, a nice image of some instance, and we just go back and compare what it looks like using some fancy machine learning algorithm, and that's fine. This is what most, if not all, existing computer vision research has been exploring recently. But in the real world you go out to take a
picture of this cup which is taken in a completely different domain. The temperature just changed
about 10 degrees. I apologize for the video tool [INDISTINCT]. So, we'd like to be able
to recognize the cup on the bottom which is taken from a completely different domain.
And the sad truth is, the existing methods don't work right out of the box. In fact,
if you just try and test in one domain, such as objects in an office, but train on images that were collected outside of that domain, such as Amazon or the web--you know, if you train on Amazon and test on Amazon, life is pretty good, but if you then try and test on webcam images, all hell breaks loose and you lose anything close to good performance.
So, what's the solution to this? Well, one solution is just to ask everyone to label
everything in every environment. That's maybe the conventional approach. But it's awfully
imposing and that's not what we'd like to do. It's too expensive. [INDISTINCT]. You
could try and engineer your features, such that in that feature space everything is invariant
to everything that could go wrong. And that's actually probably the conventional approach: think of what the representations we use today in fact do. They are hand engineered to try and be invariant to illumination and the kinds of pose variation that shouldn't matter. But they don't seem to solve the whole problem. So, there's obviously some part of the problem
that we can't hand engineer and we should probably turn to machine learning to solve
that part of the problem. So, that's our idea. Try and learn some notion of domain shift
from the data, right? So, here we have an example where we have instances of different
categories in different domains-- for example here, three different categories but they're
in different domains. So, for example, I have circles, both taken in a green domain and
in the blue domain, and squares in the blue domain and squares in the green domain. And here, for the stars, maybe I actually have labeled data in both domains--or rather, for the stars I have no labeled data in both domains, but for the others I do. So, I can actually do something with the stars: I can take advantage of the fact that I know that these circles and these squares have corresponding elements. So I know that there were squares over here in the blue domain and over here in the green domain, and circles over here in the green domain and over here in the blue domain. So, it looks like there's some sort of domain shift that's going on in the
space. And much as I can optimize a representation for specific categories, here I can optimize a representation for a specific domain shift. So, I know I don't really want to change things in this direction, but it does seem like changing things in this direction is useful for whatever transformation is happening between the feature spaces of these two domains. So that's the crux of our idea. We'd like to have some sort of feature space transformation
that maps from a raw space to a space in which domain shift is minimized. So we can frame this as a transformation learning problem. Given pairs of similar examples, much like in traditional metric learning, we can say: let's find a transformation of the space such that the distance between examples that are in fact the same but in different domains is small using our learned representation, but with the raw representation would be large. So, how do we learn W? Well, this is where we put our machine learning hats on and take advantage of a variety of interesting approaches that have been applied for metric learning and other forms of transformation learning; the work of Ng here at Google has inspired us in the past. But to our knowledge, no one had really applied this to domain transformation, where you try and solve an optimization problem that balances the regularization against constraints formed from those similarities.
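(Here is a rough sketch of what learning such a W could look like, assuming matched cross-domain pairs and a plain gradient-descent solver; the actual papers solve a regularized, constraint-based metric-learning problem--symmetric in the ECCV paper, asymmetric and kernelizable in the CVPR paper--so treat this only as an illustration of the idea.)

```python
# Minimal sketch: learn a linear map W that pulls corresponding cross-domain
# examples together while regularizing toward the identity. A simplified
# stand-in for the constrained metric-learning optimization in the papers.
import numpy as np

def learn_domain_transform(X_src, X_tgt, iters=500, lr=1e-2, reg=1e-1):
    """X_src, X_tgt: (n_pairs, d) features of matched source/target examples."""
    d = X_src.shape[1]
    W = np.eye(d)
    for _ in range(iters):
        diff = X_src @ W.T - X_tgt              # residual after mapping source
        grad = diff.T @ X_src / len(X_src)      # grad of 0.5*||X_src W^T - X_tgt||^2
        grad += reg * (W - np.eye(d))           # stay close to the identity
        W -= lr * grad
    return W

# At test time, map source-domain (e.g., web) features through W before doing
# nearest-neighbor or SVM classification against target-domain data, including
# for categories that had no target-domain labels at all.
```

The same kind of constraint can also be built from category-level similarity rather than exact instance pairs, which is what comes next.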
So, there are two ways you can think of forming these constraints. One is based on category, where you say: these are two cups, I'm going to form these links between these cups; these are not cups, this is a cleaner. The important thing is that if you're going to learn a domain transformation, you actually don't want to do the traditional metric learning thing, which is to have intra-class similarity constraints and dissimilarity constraints across different classes. Here, we only want to learn the domain shift; you don't want to learn anything related to a specific class. So much of the art of this idea is in how you form those constraints. You can also form them across instances if you happen to know that this is exactly the same corresponding object. That's the best case for this domain shift idea. Okay. And there are ways to learn these transformations efficiently, even with a kernelized version of the transformations, and we've looked at both the symmetric transformation--that was an ECCV paper--and the asymmetric case in our CVPR paper. So, there is a stream of related work on domain transformation and modeling domain shift, especially in the natural language and text [INDISTINCT] communities. Some of
which have been applied to vision. There are sort of classic techniques that look at preprocessing the data, how you can create source- and target-domain specific versions. There are ways to adapt SVM parameters for video domain transfer, and there are also ways to re-weight the training data to match the test data distribution. None of these can handle the specific case that we were looking at, which is when you have training data for some number of categories in both domains, but you have a new category where you have no training data at all. So I come into this room and I want to recognize these chairs. I have no labeled data for this chair, but I do have labeled data for a few other objects that may both be on the web and be in this domain, and I use that to learn a transformation that can then help me learn these chairs. So, we've done some
work evaluating this on a dataset that we collected for domain transformation. So here, for example, is a keyboard category that we collected from Amazon, from a high-resolution camera, and from a low-resolution camera. We've shown how the methods we've developed improve over various baselines. We've also shown that the asymmetric method improves over the symmetric method. And we have a few really egregious examples which are, you know, somewhat unfair, which is that if you take completely different feature spaces, baseline techniques will give you garbage. So, if you simply change the level of quantization in a normal visual word model and then try to find your nearest neighbors, of course you find garbage. You find that this pad matches all these other different types of objects. But our method can easily learn that transformation--and, I mean, you could engineer that transformation as well, but you can also learn it directly. Okay, so we're excited about this idea of
domain transformation. We think it's an important one in computer vision and we've shown a variety
of cool things. Okay. Any questions about this so far? So, last but not least, I want
to go and, you know, throw my hat back in with the 3D folks and say that I think the time is right again to really take a serious look at 3D for many of the vision problems that recently have been tackled mostly with 2D. Looking back at the first slides that I presented at the beginning of the talk, the reason I think vision research has looked at 2D features is that people wanted category-level variation, and wanted to look at things that could be found on the web. And if you're going to match images from the web, you really have to rely on some of the techniques that were developed for wide baseline matching, and not really on 3D object descriptors. But every several years there's a new revolutionary 3D sensor, right? Those of us who've been in the community for quite a while can remember these. Real-time stereo hardware is still exciting; my friends at the [INDISTINCT] company continue to pioneer in that realm. There were the time-of-flight sensors that came out three or four or five years ago. And LIDAR--you know, the success of robotic vision made everyone think that the LIDAR sensor was going to be the impressive thing, until recently. But now, the Kinect has come on the scene and provides images that really do have a quality and, I guess, a ubiquity that seems quite exciting. And so, we think the time is right now to attack category
level vision--the kinds of vision challenges that we've seen in PASCAL and Caltech101 and so on and so forth--in contrast to the instance-based challenges using this type of sensor. We looked in the literature and we couldn't find many existing 3D category-level datasets. They all had limited scene complexity, a limited number of categories, limited pose variation, and many had an explicitly instance focus. So, we've begun a collection effort that will be open-sourced to the community, in the spirit of the LabelMe dataset, that has complicated real-world scenes where we've collected registered depth and intensity data. Here's an example of the size distribution of the categories that we have. Here's an example from the chair category. So you can see the Kinect does give you a lot of signal there. And I think the vision community had avoided tackling the problem of category-level representations with exclusively 3D descriptors, and now is the time to come back and get to it. One of the most obvious things when you look at 3D
data is the distribution of size priors. If you compare the distribution of category sizes in 2D in a scene, of course you get a huge range of variation; if you look at the 3D variation, it's in many cases much tighter. So that's the first thing one can exploit with 3D. And the other obvious thing one can do, which several groups have proposed recently, is to directly use an orientation histogram descriptor on the depth signal. We considered both of these as baseline methods on this dataset, and it's a mixed bag.
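(For concreteness, here is a rough sketch of those two baselines; the focal length, bin count, and helper names are illustrative assumptions, not the dataset's actual evaluation code.)

```python
# Sketch of the two depth baselines: (1) a physical-size prior computed from
# depth, and (2) a HOG-style orientation histogram on the depth image.
import numpy as np

def physical_size(depth_m, mask, focal_px):
    """Approximate metric extent of a detection from its depth and pixel size."""
    ys, xs = np.nonzero(mask)
    z = np.median(depth_m[mask])                  # robust depth of the object
    width_m = (xs.max() - xs.min()) * z / focal_px
    height_m = (ys.max() - ys.min()) * z / focal_px
    return width_m, height_m                       # compare against a size prior

def depth_orientation_histogram(depth_m, n_bins=9):
    """HOG-like descriptor computed on the depth channel instead of intensity."""
    gy, gx = np.gradient(depth_m)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi               # unsigned orientation
    bins = np.floor(ang / np.pi * n_bins).astype(int).clip(0, n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-8)
```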
I think the simple things are not necessarily going to immediately be what wins. Surprisingly, even with this really beautiful, quote-unquote clean data, you have problems. There are some objects--monitors, glasses--that still don't necessarily show up in the Kinect data, and for flat objects you may not find a Kinect signature. So if you just use depth alone, you're not necessarily going to have a signature from those objects, but you can often improve the search complexity by pruning. So this is a project we've just started, and we are interested in collaborating with other groups who are Kinect hackers. And so please contact us if you'd like to join
forces. Okay, so those are the three themes that I wanted to talk about today: how to take a hierarchical approach to learning low-level and mid-level image features, how to learn transformations between domains so you can train in one environment and test in another, and our first look at 3D category-level recognition. I think we're going to see an evolution of visual object recognition research moving towards 3D and robotics domains, where you have the whole problem being considered at the same time, including interaction with people. I think having a PR2 that can recognize everything in the environment and have a conversation with a person about the objects is really an exciting topic. I didn't have time to mention any of the ideas we have on how language and vision might work together, but that is one for the future. And so I'd like to end with just these themes that I think are important for the coming year and coming years: how we can effectively bridge category-level and instance-level learning; fully integrate the richness of scene and task context that these robotic domains are going to provide us; explicitly model end-user interaction; and leverage data from multiple sources. Let me put in a plug for two things
that really do work right now. These are two local companies: one looking at optimal fusion of crowdsourcing and computer vision, in the case of IQ Engines, and one pursuing an old interest of mine that I think is coming back now with the Kinect--how we can use vision for interfaces and gestures that can control lightweight user interfaces. I do advise both of these companies, so it's a bit of a plug. So I'd like to thank everyone who helped with these projects, including my students and post-docs who are listed here; they deserve all the credit and I take all of the criticism. Thanks very much.
>> You're fast. So you have lots of time for
questions still, before the refreshments arrive.
>> DARRELL: [INDISTINCT] I could put back in. They all have equations attached, so you might prefer to just ask me a few questions.
>> Well most of the [INDISTINCT]
>> DARRELL: Well, I think we used transparency for our early work on additive probabilistic models because we thought it was a fun and
entertaining example. But it's certainly not true that all objects are transparent; it is true, though, that these additive models are useful for general objects, and that's the result that I showed earlier, confirmed for Caltech 101 or any kind of object--and it's also similar to what Andrew and others showed with [INDISTINCT] sparse coding. Even for non-transparent objects, objects that you think of as being traditionally a single process often aren't, perhaps because of background effects or occlusion effects or because of un-modeled illumination effects. So I think the set of objects that are truly modeled by a sort of constant-albedo patch that's, you know, homogeneous over a rectangle, is pretty small. I should have repeated the question. Yes, the question was--well, you can figure out the question from the answer, right? This is Jeopardy for you all. One of you guys--I don't know who was first.
>> So, your first presented work--you picked up this hierarchical model, and since you're already working on learning the hierarchical [INDISTINCT] simultaneously [INDISTINCT], why don't you work directly from the pixels? Learn everything from the pixels--that would be cool, right?
>> DARRELL: Yes, that would be cool, and I think that's--although
that is what many people already had done. So we wanted to, sort of, attack a part of this space that had not yet really been addressed, which is applying these probabilistic topic models as sparse coding models directly to SIFT and seeing what we can get out of that. But, time permitting, I think we would like to go back and have a model that does both, in which different layers are tuned to different underlying noise models of the raw pixels. And you should be able to get the whole thing out of a three-layer model.
>> When you learn your domain transformation matrix W, do you actually need to put in feature correspondences? And if you don't, why not?
>> DARRELL: We don't put in feature correspondences; we do put in instance correspondences. So we say: here, given a particular feature representation--let's say we're just doing a traditional bag-of-words model at this point--here's what a cup looks like in my particular representation, and here's what the same cup looks like in a different domain. Or even just, "here are cups" in one domain and "here are cups" in another domain. So, we're learning a transformation on the entire feature space, not per individual feature as you were thinking of. Any other questions? Sure.
>> So at the very beginning you mentioned
at training time, you probably can use [INDISTINCT] when modalities are missing, right? But in your following work you didn't mention that in particular, or [INDISTINCT].
>> DARRELL: That was one of the topics that I didn't cover today in the interest of time. This was the CVPR10 paper we had, where we do hallucinate modalities, in essence. If you have a certain set of modalities that are present in your test data but not present in your training data, you can essentially apply semi-supervised learning: learn the model that maps from one modality to another from this pile of unlabeled data that you have at test time, and--you know, it's counterintuitive, but it actually helps--go back and hallucinate more training data (at least it helped in our experiments) and use that to build a model there. I think there are a variety of ways one can approach that problem, and we approached it from a supervised learning regression paradigm using [INDISTINCT] processes. But there are other approaches.
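(A minimal sketch of that idea, with ridge regression standing in as an illustrative substitute for the regression model used in the paper; the variable names and the choice of regressor are assumptions.)

```python
# Sketch: use unlabeled test-time data that has both modalities to learn a
# mapping from modality A to modality B, then hallucinate B for the labeled
# training set that only has A.
import numpy as np
from sklearn.linear_model import Ridge

def hallucinate_modality(unlabeled_a, unlabeled_b, labeled_a):
    """unlabeled_a, unlabeled_b: paired (n, d_a) / (n, d_b) features, no labels.
    labeled_a: (m, d_a) training features observed only in modality A."""
    mapper = Ridge(alpha=1.0).fit(unlabeled_a, unlabeled_b)
    fake_b = mapper.predict(labeled_a)               # hallucinated modality B
    return np.hstack([labeled_a, fake_b])            # augmented training features
```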
>> I was interested [INDISTINCT] your application and the graphical models. I know some people working with text where they claim that methods based on [INDISTINCT] principle...
>> DARRELL: I haven't, and that's a topic that many folks have investigated in vision in the last few years, and I think often it's frustratingly hard to build good co-occurrence models at the level of visual words. Certainly, there have been recent efforts that have started to show success, and I think [INDISTINCT], if she's still in the room, has a model that does touch on that. So, I might defer to her or anyone else, but I haven't done it and I don't have a good summary of the recent literature. Does it work? Didn't you do feature co-occurrences in a topic model?
>> [INDISTINCT] ...
>> There are many ways to do that. Any last questions? With that, maybe we should return some time to those of us who have to drive back to Berkeley.
>> All right, let's thank the speaker again.