>> Our next speaker is a UC Berkeley representative, Go Bears! That's where I hail from. Trevor
Darrell is on the faculty of the UC Berkeley EECS Department and leads the Computer Vision
group at the International Computer Science Institute in Berkeley. Before that, from 1999
to 2008, he served on the faculty of the MIT EECS Department, where he led the CSAIL Vision Interfaces group. Since he joined us at [INDISTINCT], he's been working on visual perception, machine learning, and multimodal interfaces, and he's also doing some entrepreneurial activity
in Berkeley on the side. Please help me welcome Trevor Darrell.
>> DARRELL: My results from the last month: this is my one-month-old daughter and her older brother. Her name is Lenea. So that's been our happiness at home. And
I think today I'd like to be maybe a bit of the reconciler after this 3D versus 2D debate
that we've seen unfold, and tell you a little bit about the work that we're doing in Berkeley and our perspective on using machine learning to really bridge the gap between the sort of robotic vision perspective that we've heard and the more traditional computer vision perspective.
So, one thing to say is even though we have seen a lot of progress in computer vision,
there's still a long way to go. I don't think we're close yet to having broad category level
recognition even in relatively simple indoor environments. So maybe we're close, maybe
a year or two away but it's still going to require some significant advances. And maybe
it will come from bridging the gaps that we've seen here today, right? So, it's almost a
bit of a set up with just the discussion that we had which is nice. Because I do think that
there are divergent perspectives in terms of how vision people traditionally would see
these problems, and robotics people would see these problems. So, what's the vision
perspective? Well, at least from the morning, it's really a machine learning philosophy.
And that can often be boiled down to who has the largest dataset wins, right? I mean, algorithms
matter but data really does matter. There's a downside of that. And we see in the flipside
when we look at how roboticist approach the problem, it leads to the least common denominator
in terms of the features that one uses. You don't really know what color means on the
webs. You probably not going to use it. There's not a lot of 3D data on the web, yet. So,
we'll probably not going to use it. But there's a lot of data and we can learn categories
that would be features. What's the robotic point of view? Well, it's a sensing paradigm.
Sensors are cheap, and they're getting cheaper. Why not just throw on multi-spectral imaging, four different Kinects, have everything on there, 3D models; it's great. Whoever has the most sensors wins, right? Well, it's sort of the flipside of that coin. With so many sensors, you really can only get training data in the particular environment where you built your rig. It's very hard to generalize, at least with conventional methods. So we
have strong features, but really an instance recognition focus so far. So, where do we
go from here? So, which one is right? I'd like a show of hands. So, how many people
go for computer vision? How many people go for robotic vision? No one wants to commit
here, which is probably the right answer. Neither one is entirely right; both are probably right.
And the philosophy that I'd like to advocate really combines many of the different themes
we've seen today including a machine learning foundation which is, let's learn to use both
of them, right? We're really not going to simply adopt the 3D paradigm or 2D paradigm.
And so, here's the kind of thing I mean by that, right? We'd like to learn how to leverage robotic and vision perspectives together. For example, develop rich, low-level features that can handle the kind of complicated sensing that happens in the real world. Exploit machine learning to learn mappings between modalities, so we may have a certain set of modalities at training time and a different set of modalities at test time, and still, without any manual tuning or engineering, get something done. And finally, learn the shift between domains. We may be trying to run a robot in this environment, recognizing objects that are found on the fly, but using data from an environment completely external to this environment, with different types of imaging conditions and whatnot. And we'd like to do better at that than a naïve approach would. So, there's a sampling of different
projects that I could tell you about today, including hallucination and learning of these 2.1D representations. I'm not really going to have time to talk about those two. I'm instead going to focus on these three. The first is learning low- and mid-level features using sort of hierarchical, probabilistic sparse coding, actually echoing many of the themes that Andrew presented earlier today. Then I'll talk about our work on domain adaptation, how we learn in one domain and test in another. And then I'll briefly also sing the praises of the Kinect and related 3D sensors and mention our effort to collect a new dataset for 3D category recognition. So, let's dive right in. So, first, let me tell
you about this work which is a probabilistic model for recursive factorized image features.
This is a CVPR11 paper by Sergey Karayev, Mario Fritz, Sanja Fidler, and myself. And
it actually follows much of the same philosophy that we saw this morning. So I really can go both sides here: I can tell the Willow Garage story, or I can tell the it's-all-just-semi-supervised-learning-or-sparse-coding story. We'd like to have a distributed coding of local image features, especially the kinds of representations that are useful for vision. Historically, they've been coded using relatively naïve vector quantization approaches to find these visual words. And as we saw this morning, approaches based on additive models, especially sparse coding, have been quite successful. We've also looked at these ourselves in terms of a probabilistic topic model approach to an additive decomposition of, for example, SIFT descriptors, so you can see how you might take a local descriptor and factor it into constituent parts that might capture the fact that there are multiple different things happening locally in an image or an image patch. So, we're also
on the [INDISTINCT] hierarchical models. Hierarchies are very important. I think everyone had a
slide like this today. It's interesting. You shouldn't have one of [INDISTINCT]. Yes, good.
So, it probably doesn't require much motivation to say that hierarchies have strong inspiration from biology, that if you look at psychophysics there must be top-down influences in perception, and also that hierarchies enable effective sharing of visual features. And there's been a long tradition of models in computational neuroscience and in computer vision that have chipped away at this idea of learning a representation, and learning it in a hierarchical fashion. I won't have time to go into all of these models, but one limitation of many of the existing methods is that they do so only in a feed-forward fashion: they approach learning and inference only from the bottom up. So, you have a representation here where--I apologize if you can't see my pointer--you learn a set of sparse codes, or in our case a probabilistic topic model, over image patches. And then, to explore a hierarchy or a recursion on this representation, you take the stacked output of those activations and use them as inputs to another layer. First you learn a representation for the first layer, and then you attack the entire problem again and learn a representation for the second layer.
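(As a rough illustration of that stacked, feed-forward pipeline, here is a minimal sketch. It assumes an off-the-shelf topic model from scikit-learn standing in for the per-layer learning and uses illustrative pooling parameters; it is not the actual rLDA implementation, which infers both layers jointly with a Gibbs sampler.)

```python
# Minimal sketch of the feed-forward (stacked) baseline: fit a layer-1 topic
# model over patch descriptors, freeze it, then fit a layer-2 model over
# pooled layer-1 activations. Illustrative only; not the paper's joint model.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fit_feedforward_hierarchy(patch_hists, n_topics1=64, n_topics2=32, pool=4):
    """patch_hists: (n_patches, n_bins) non-negative histogram features."""
    layer1 = LatentDirichletAllocation(n_components=n_topics1, random_state=0)
    z1 = layer1.fit_transform(patch_hists)            # layer-1 activations

    # Stack activations of `pool` neighboring patches as layer-2 input.
    n = (len(z1) // pool) * pool
    stacked = z1[:n].reshape(-1, pool * n_topics1) * 100.0  # count-like scale

    layer2 = LatentDirichletAllocation(n_components=n_topics2, random_state=0)
    z2 = layer2.fit_transform(stacked)                # layer-2 activations
    return layer1, layer2, z2
```

The joint model described next avoids exactly this freeze-then-stack step.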
Instead, what we've explored in our recent work is: what is the benefit of jointly optimizing across the hierarchy, so that you don't, in fact, rely on a fixed representation from the bottom layer when you're learning the top layer, but can optimize all aspects of the representation jointly? So, this is our goal and our result in our recent CVPR11 paper: a distributed coding of local features in a hierarchical model that allows full inference and full recursion. We derive our methods based on probabilistic topic models using Latent Dirichlet Allocation, and we call it, alternately, recursive LDA or sometimes hierarchical LDA. I won't go into this in detail; many of you have seen these models before, and they've been used in vision in many places over the years. Most famously, they're used
to describe topics which are composed of quantized SIFT descriptors, or visual words. That's not how we're using them here. We're actually building topics over the descriptor itself, not topics over quantized descriptors. So we have a representation, for example SIFT, and we form topics over the individual cells in the [INDISTINCT] histogram. So our topics can be thought of as finding structures that are additively combined, or transparently combined. And we can find these topics, which is similar to what we've seen before in sparse coding, but now they're evaluated over local SIFT descriptors, or they're forming the constituent basis of histogram-based descriptors.
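(To make that concrete, here is a minimal sketch assuming SIFT descriptors stored as 128-dimensional gradient histograms; the scikit-learn call and the topic count are illustrative stand-ins, not the paper's actual inference.)

```python
# Sketch: fit topics directly over SIFT descriptor bins, so each descriptor
# becomes an additive mixture of learned "structure" topics rather than a
# single quantized visual word.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def topics_over_sift(sift_descriptors, n_topics=32):
    """sift_descriptors: (n_desc, 128) non-negative SIFT histograms."""
    X = np.asarray(sift_descriptors, dtype=np.float64)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(X)     # per-descriptor topic mixtures
    phi = lda.components_            # topics: distributions over the 128 bins
    return theta, phi
```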
And here's a visualization of average images that might give rise to each topic. So our baseline is just the feed-forward method. We can express here an instance of the recursive model in its full glory, and here's a visualization of how you might jointly be estimating topics over a patch and then topics on top of topics over a patch. We've approached this with a Gibbs sampler for inference and use a few schemes for efficient initialization. And so far in our tests, we're able to show that this recursive model, which jointly optimizes both layers--we'll go beyond two layers eventually, but so far that's what we've explored--is better than just feed-forward initialization, and it's better than just a single flat model, even when you control for dimensionality. And it compares very favorably to the other published hierarchical models; I think, given all that we've considered today, it compares very well even against the best of the best, such as the work that Kyle and others have shown earlier today and the saliency methods from UCSD. So, these are the kinds of visualizations we got. So, the first part
of the story is how we can learn good models from the bottom-up. We can learn them in a
hierarchical fashion from data, and we find that when we do so, we get better performance by jointly optimizing the representation rather than just doing it feed-forward. And there are many different future directions we are considering, including pushing the hierarchy all the way up to the object level, looking at a spatiotemporal volume, training discriminatively, and considering non-parametric models. Can you still hear me if I walk away from the podium?
>> [INDISTINCT]
>> DARRELL: Oh, good. So, that was the first theme today. The second theme is domain adaptation, which is what happens when you train on one thing but you test on another. What you see is not always what you get in vision. And this is the work of Kate Saenko and Brian Kulis, who [INDISTINCT] in my lab, also with Mario Fritz. And Kate is jointly appointed at Harvard and at Berkeley--now that's a commute. So, people are very good
at a wide range of visual category problems. You can recognize these mice in many different domains; even if I show you the example of Mickey Mouse, you can then find the matching mouse in a real image. But machines often suffer even with very simple transformations.
So, even if you just go from one type of image sensor to another image sensor, you can find
yourself in a very different feature space. And you can also find very different domains
just by looking at objects at different scales, or in different poses, illuminations, or backgrounds, or when you're moving from one domain which is more artistic to another domain which is more surveillance, for example, or across seasons. So, we'd like to overcome this problem of training in one domain and testing in another domain. In most object recognition paradigms, we assume that we have essentially the same distribution of training and testing data. So we assume that we have a lot of labeled data, we get, you know, a nice image of some instance, and we just go back and compare what it looks like using some fancy machine learning algorithm, and that's fine. This is what most, if not all, existing computer vision research has been exploring recently. But in the real world you go out to take a
picture of this cup which is taken in a completely different domain. The temperature just changed
about 10 degrees. I apologize for the video tool [INDISTINCT]. So, we'd like to be able
to recognize the cup on the bottom which is taken from a completely different domain.
And the sad truth is, the existing methods don't work right out of the box. In fact,
if you just try and test in one domain, such as objects in an office, but train on images that were collected outside of that domain, such as Amazon or the web--you know, if you train on Amazon and test on Amazon, life is pretty good, but if you then try and test on webcam images, all hell breaks loose and you lose anything close to good performance.
So, what's the solution to this? Well, one solution is just to ask everyone to label
everything in every environment. That's maybe the conventional approach. But it's awfully
imposing and that's not what we'd like to do. It's too expensive. [INDISTINCT]. You
could try and engineer your features, such that in that feature space everything is invariant
to everything that could go wrong. And that's actually probably the conventional approach: think of what the representations we use today in fact do. They are hand engineered to try and be invariant to illumination and the kinds of pose variation that shouldn't matter. But they don't seem to solve the whole problem. So, there's obviously some part of the problem
that we can't hand engineer and we should probably turn to machine learning to solve
that part of the problem. So, that's our idea. Try and learn some notion of domain shift
from the data, right? So, here we have an example where we have instances of different
categories in different domains-- for example here, three different categories but they're
in different domains. So, for example, I have circles, both taken in a green domain and
in the blue domain, and squares in the blue domain and squares in the green domain. And here, for the stars, maybe I actually have labeled data in both domains--or rather, for the stars I have no labeled data in both domains, but for the others I do. So, I can actually do something with the stars: I can take advantage of the fact that I know that these circles and these squares have corresponding elements. So I know that there were squares over here in the blue domain and over here in the green domain, and circles over here in the green domain and over here in the blue domain. So, it looks like there's some sort of domain shift that's going on in the
space. And much as I can optimize a representation for specific categories, here I can optimize a representation for a specific domain shift. So, I know I don't really want to change things in this direction, but it does seem like changing things in this direction is useful for whatever transformation is happening between the feature spaces of these two domains. So that's the crux of our idea. We'd like to have some sort of feature space transformation
that maps from a raw space to a space in which domain shift is minimized. So we can frame this as a transformation learning problem. Given pairs of similar examples, much like in traditional metric learning, we can say: let's find a transformation of the space such that the distance between examples that are in fact the same but in different domains is small using our learned representation, but with the raw representation would be large. So, how do we learn W? Well, this is where we put our machine learning hats on and take advantage of a variety of interesting approaches that have been applied for metric learning and other forms of transformation learning; the work of Ng here at Google has inspired us in the past. But to our knowledge, no one had really applied this to domain transformation, where you try and solve an optimization problem that balances the regularization against constraints formed from those similarities.
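(Here is a rough sketch of what learning such a W could look like, assuming matched cross-domain pairs and a plain gradient-descent solver; the actual papers solve a regularized, constraint-based metric-learning problem--symmetric in the ECCV paper, asymmetric and kernelizable in the CVPR paper--so treat this only as an illustration of the idea.)

```python
# Minimal sketch: learn a linear map W that pulls corresponding cross-domain
# examples together while regularizing toward the identity. A simplified
# stand-in for the constrained metric-learning optimization in the papers.
import numpy as np

def learn_domain_transform(X_src, X_tgt, iters=500, lr=1e-2, reg=1e-1):
    """X_src, X_tgt: (n_pairs, d) features of matched source/target examples."""
    d = X_src.shape[1]
    W = np.eye(d)
    for _ in range(iters):
        diff = X_src @ W.T - X_tgt              # residual after mapping source
        grad = diff.T @ X_src / len(X_src)      # grad of 0.5*||X_src W^T - X_tgt||^2
        grad += reg * (W - np.eye(d))           # stay close to the identity
        W -= lr * grad
    return W

# At test time, map source-domain (e.g., web) features through W before doing
# nearest-neighbor or SVM classification against target-domain data, including
# for categories that had no target-domain labels at all.
```

The same kind of constraint can also be built from category-level similarity rather than exact instance pairs, which is what comes next.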
So, there are two ways you can think of forming these constraints. One is based on category, where you say: these are two cups, I'm going to form these links between these cups; these are not cups, this is a cleaner. The important thing is that if you're going to learn a domain transformation, you actually don't want to do the traditional metric learning thing, which is to have intra-class similarity constraints and dissimilarity constraints across different classes. Here, we only want to learn the domain shift; you don't want to learn anything related to a specific class. So much of the art of this idea is in how you form those constraints. You can also form them across instances if you happen to know that this is exactly the same corresponding object. That's the best case for this domain shift idea. Okay. And there are ways to learn these transformations efficiently, even with a kernelized version of the transformations, and we've looked at both the symmetric transformation--that was an ECCV paper--and the asymmetric case in our CVPR paper. So, there is a stream of related work on domain transformation and modeling domain shift, especially in the natural language and text [INDISTINCT] communities. Some of
which have been applied to vision. There are sort of classic techniques that look at preprocessing the data, how you can create source- and target-domain specific versions. There are ways to adapt SVM parameters for video domain transfer, and there are also ways to re-weight the training data to match the test data distribution. None of these can handle the specific case that we were looking at, which is when you have training data for some number of categories in both domains, but you have a new category where you have no training data at all. So I come into this room and I want to recognize these chairs. I have no labeled data for this chair, but I do have labeled data for a few other objects that may both be on the web and be in this domain, and I use that to learn a transformation that can then help me learn these chairs. So, we've done some
work evaluating this on a dataset that we collected for domain transformation. So here, for example, is a keyboard category that we collected from Amazon, from a high-resolution camera, and from a low-resolution camera. We've shown how the methods we've developed improve over various baselines. We've also shown that the asymmetric method improves over the symmetric method. And we have a few really egregious examples which are, you know, somewhat unfair, which is that if you take completely different feature spaces, baseline techniques will give you garbage. So, if you simply change the level of quantization in a normal visual word model and then try to find your nearest neighbors, of course you find garbage. You find that this pad matches all these other different types of objects. But our method can easily learn that transformation--and, I mean, you could engineer that transformation as well, but you can also learn it directly. Okay, so we're excited about this idea of
domain transformation. We think it's an important one in computer vision and we've shown a variety
of cool things. Okay. Any questions about this so far? So, last but not least, I want
to go and, you know, throw my hat back in with the 3D folks and say that I think the time is right again to really take a serious look at 3D for many of the vision problems that recently have been tackled mostly with 2D. Looking back at the first slides that I presented at the beginning of the talk, the reason I think vision research has looked at 2D features is that people wanted category-level variation, and wanted to look at things that could be found on the web. And if you're going to match images from the web, you really have to rely on some of the techniques that were developed for wide baseline matching, and not really on 3D object descriptors. But every several years there's a new revolutionary 3D sensor, right? Those of us who've been in the community for quite a while can remember these. Real-time stereo hardware is still exciting; my friends at the [INDISTINCT] company continue to pioneer in that realm. There were the time-of-flight sensors that came out three or four or five years ago. And LIDAR--you know, the success of robotic vision made everyone think that the LIDAR sensor was going to be the impressive thing, until recently. But now, the Kinect has come on the scene and provides images that really do have a quality and, I guess, a ubiquity that seems quite exciting. And so, we think the time is right now to attack category
level vision--the kinds of vision challenges that we've seen in PASCAL and Caltech101 and so on and so forth--in contrast to the instance-based challenges using this type of sensor. We looked in the literature and we couldn't find many existing 3D category-level datasets. They all had limited scene complexity, a limited number of categories, limited pose variation, and many had an explicitly instance focus. So, we've begun a collection effort that will be open-sourced to the community, in the spirit of the LabelMe dataset, that has complicated real-world scenes where we've collected registered depth and intensity data. Here's an example of the size distribution of the categories that we have. Here's an example from the chair category. So you can see the Kinect does give you a lot of signal there. And I think the vision community had avoided tackling the problem of category-level representations with exclusively 3D descriptors, and now is the time to come back and get to it. One of the most obvious things when you look at 3D
data is the distribution of size priors. If you compare the distribution of category sizes in 2D in a scene, of course you get a huge range of variation; if you look at the 3D variation, it's in many cases much tighter. So that's the first thing one can exploit with 3D. And the other obvious thing one can do, which several groups have proposed recently, is to directly use an orientation histogram descriptor on the depth signal. We considered both of these as baseline methods on this dataset, and it's a mixed bag.
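(For concreteness, here is a rough sketch of those two baselines; the focal length, bin count, and helper names are illustrative assumptions, not the dataset's actual evaluation code.)

```python
# Sketch of the two depth baselines: (1) a physical-size prior computed from
# depth, and (2) a HOG-style orientation histogram on the depth image.
import numpy as np

def physical_size(depth_m, mask, focal_px):
    """Approximate metric extent of a detection from its depth and pixel size."""
    ys, xs = np.nonzero(mask)
    z = np.median(depth_m[mask])                  # robust depth of the object
    width_m = (xs.max() - xs.min()) * z / focal_px
    height_m = (ys.max() - ys.min()) * z / focal_px
    return width_m, height_m                       # compare against a size prior

def depth_orientation_histogram(depth_m, n_bins=9):
    """HOG-like descriptor computed on the depth channel instead of intensity."""
    gy, gx = np.gradient(depth_m)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi               # unsigned orientation
    bins = np.floor(ang / np.pi * n_bins).astype(int).clip(0, n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-8)
```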
I think the simple things are not necessarily going to immediately be what wins. Surprisingly, even with this really beautiful, quote-unquote clean data, you have problems. There are some objects--monitors, glasses--that still don't necessarily show up in the Kinect data, and for flat objects you may not find a Kinect signature. So if you just use depth alone, you're not necessarily going to have a signature from those objects, but you can often improve the search complexity by pruning. So this is a project we've just started, and we are interested in collaborating with other groups who are Kinect hackers. And so please contact us if you'd like to join
forces. Okay, so those are the three themes that I wanted to talk about today: how to take a hierarchical approach to learning low-level and mid-level image features, how to learn transformations between domains so you can train in one environment and test in another, and our first look at 3D category-level recognition. I think we're going to see an evolution of visual object recognition research moving towards 3D and robotics domains, where you have the whole problem being considered at the same time, including interaction with people. I think having a PR2 that can recognize everything in the environment and have a conversation with a person about the objects is really an exciting topic. I didn't have time to mention any of the ideas we have on how language and vision might work together, but that is one for the future. And so I'd like to end with just these themes that I think are important for the coming year and coming years: how we can effectively bridge category-level and instance-level learning; fully integrate the richness of scene and task context that these robotic domains are going to provide us; explicitly model end-user interaction; and leverage data from multiple sources. Let me put in a plug for two things
that really do work right now. These are two local companies: one looking at optimal fusion of crowdsourcing and computer vision, in the case of IQ Engines, and one pursuing an old interest of mine that I think is coming back now with the Kinect--how we can use vision for interfaces and gestures that can control lightweight user interfaces. I do advise both of these companies, so it's a bit of a plug. So I'd like to thank everyone who helped with these projects, including my students and post-docs who are listed here; they deserve all the credit and I take all of the criticism. Thanks very much.
>> You're fast. So you have lots of time for
questions still, before the refreshments arrive.
>> DARRELL: [INDISTINCT] I could put back in. They all have equations attached, so you might prefer to just ask me a few questions.
>> Well most of the [INDISTINCT]
>> DARRELL: Well, I think we used transparency for our early work on additive probabilistic models because we thought it was a fun and
entertaining example. But it's certainly not true that all objects are transparent; it is true, though, that these additive models are useful for general objects, and that's the result that I showed earlier, confirmed for Caltech 101 or any kind of object--and it's also similar to what Andrew and others showed with [INDISTINCT] sparse coding. Even for non-transparent objects, objects that you think of as being traditionally a single process often aren't, perhaps because of background effects or occlusion effects or because of un-modeled illumination effects. So I think the set of objects that are truly modeled by a sort of constant-albedo patch that's, you know, homogeneous over a rectangle, is pretty small. I should have repeated the question. Yes, the question was--well, you can figure out the question from the answer, right? This is Jeopardy for you all. One of you guys--I don't know who was first.
>> So, your first presented work--you picked up this hierarchical model, and since you're already working on learning the hierarchical [INDISTINCT] simultaneously [INDISTINCT], why don't you work directly from the pixels? Learn everything from the pixels--that would be cool, right?
>> DARRELL: Yes, that would be cool, and I think that's--although
that is what many people already had done. So we wanted to, sort of, attack a part of this space that had not yet really been addressed, which is applying these probabilistic topic models as sparse coding models directly to SIFT and seeing what we can get out of that. But, time permitting, I think we would like to go back and have a model that does both, in which different layers are tuned to different underlying noise models of the raw pixels. And you should be able to get the whole thing out of a three-layer model.
>> When you learn your domain transformation matrix W, do you actually need to put in feature correspondences? And if you don't, why not?
>> DARRELL: We don't put in feature correspondences; we do put in instance correspondences. So we say: here, given a particular feature representation--let's say we're just doing a traditional bag-of-words model at this point--here's what a cup looks like in my particular representation, and here's what the same cup looks like in a different domain. Or even just, "here are cups" in one domain and "here are cups" in another domain. So, we're learning a transformation on the entire feature space, not per individual feature as you were thinking of. Any other questions? Sure.
>> So at the very beginning you mentioned
at training time, you probably can use [INDISTINCT] when modalities are missing, right? But in your following work you didn't mention that in particular, or [INDISTINCT].
>> DARRELL: That was one of the topics that I didn't cover today in the interest of time. This was the CVPR10 paper we had, where we do hallucinate modalities, in essence. If you have a certain set of modalities that are present in your test data but not present in your training data, you can essentially apply semi-supervised learning: learn the model that maps from one modality to another from this pile of unlabeled data that you have at test time, and--you know, it's counterintuitive, but it actually helps--go back and hallucinate more training data (at least it helped in our experiments) and use that to build a model there. I think there are a variety of ways one can approach that problem, and we approached it from a supervised learning regression paradigm using [INDISTINCT] processes. But there are other approaches.
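(A minimal sketch of that idea, with ridge regression standing in as an illustrative substitute for the regression model used in the paper; the variable names and the choice of regressor are assumptions.)

```python
# Sketch: use unlabeled test-time data that has both modalities to learn a
# mapping from modality A to modality B, then hallucinate B for the labeled
# training set that only has A.
import numpy as np
from sklearn.linear_model import Ridge

def hallucinate_modality(unlabeled_a, unlabeled_b, labeled_a):
    """unlabeled_a, unlabeled_b: paired (n, d_a) / (n, d_b) features, no labels.
    labeled_a: (m, d_a) training features observed only in modality A."""
    mapper = Ridge(alpha=1.0).fit(unlabeled_a, unlabeled_b)
    fake_b = mapper.predict(labeled_a)               # hallucinated modality B
    return np.hstack([labeled_a, fake_b])            # augmented training features
```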
>> I was interested [INDISTINCT] your application and the graphical models. I know some people working with text where they claim that methods based on [INDISTINCT] principle...
>> DARRELL: I haven't, and that's a topic that many folks have investigated in vision in the last few years, and I think often it's frustratingly hard to build good co-occurrence models at the level of visual words. Certainly, there have been recent efforts that have started to show success, and I think [INDISTINCT], if she's still in the room, has a model that does touch on that. So, I might defer to her or anyone else, but I haven't done it and I don't have a good summary of the recent literature. Does it work? Didn't you do feature co-occurrences in a topic model?
>> [INDISTINCT] ...
>> There are many ways to do that. Any last questions? With that, maybe we should return some time to those of us who have to drive back to Berkeley.
>> All right, let's thank the speaker again.