MALE SPEAKER: --from Everest. He got his PhD in Cambridge
with Stephen Hawking and then went to MIT and to Harvard.
After, he actually went west and studied a few years at the
Smith-Kettlewell Eye Research Institute in San Francisco.
And recently, he's been at UCLA for a number of years.
He's a professor in two departments, psychology and
statistics.
And soon, also in CS.
He wants to be a professor in as many
departments as possible.
He gets bored, otherwise.
And also, he once pointed out that he'd never been a
professor in a department where he had a degree.
[INAUDIBLE]
And today, he's going to talk about some of the work he's
doing in object recognition, detection, text detection, in
particular.
And I think we've resolved the issues.
So Alan, [UNINTELLIGIBLE].
ALAN YUILLE: OK, well, thank you very much
for having me here.
So I'm here to talk about work that's being done at the
Center for Image and Vision Sciences at UCLA, which I
direct together with professor Song-Chun Zhu.
And some of the work I describe will be mine.
Some will be joint, and some will be stuff that Song-Chun is
doing himself or with others.
So yeah, just for people who may not work in vision,
usually you put up a default slide saying vision is hard.
Because people often think it's really easy.
The reason it seems easy is that our brains evolved to do it.
Our intelligence is basically in our cortex and at least 50%
of our cortex seems to be doing vision
in one way or another.
And so, if you think about it in intelligence terms, vision,
just looking at a room and interpreting it, is arguably a
far harder task than solving the most difficult mathematics
problem or building the most complex software system.
I was once hissed at in the Harvard mathematics department for
saying this, but I think it's still objectively true.
If you take away the vision part of your brain and the
part that does motor control and a little bit that does
language, there's not really very much left.
So intelligence is often what's going on at the level
of vision and perception.
OK, so vision.
Why is it hard?
Well, vision requires decoding the image and parsing it into
components such as objects.
And the difficulties are due to the fact that images are
very complex and also extremely ambiguous.
One way of pointing this out was the observation that if
you look at all the possible images which
are just 10 by 10--
10 pixels in the x direction, 10 pixels in the y direction
and count out how many images there are, you'll see there
are far more of those than all the images that have been seen
by all humans over the whole period of
evolutionary history.
Even allowing for the billions of humans who have ever existed,
seeing maybe 30 images a second, and living, on average,
a 60-, 70-, or 80-year life.
No one has seen all the possible 10 by 10 images.
It's a very high dimensional space.
It's a very complex place.
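To make that counting argument concrete, here is a rough back-of-the-envelope sketch. The specific numbers used below (binary pixels, 100 billion humans ever, 16 waking hours a day) are just illustrative assumptions, not figures from the talk:

```python
# Rough back-of-the-envelope version of the 10x10 image argument.
# Assumptions (illustrative only): binary pixels, ~100 billion humans ever,
# ~30 images per second, ~70-year lifespan, 16 waking hours per day.

num_images_10x10 = 2 ** (10 * 10)           # distinct 10x10 binary images
humans = 100e9
seconds_awake = 70 * 365 * 16 * 3600        # waking seconds in ~70 years
images_seen = humans * 30 * seconds_awake   # upper bound on images ever viewed

print(f"possible 10x10 binary images: {num_images_10x10:.3e}")   # ~1.3e30
print(f"images seen by all humans:    {images_seen:.3e}")        # ~4.4e21
```

Even with binary pixels and generous viewing assumptions, the space of tiny images dwarfs everything humanity has ever seen; with 256 gray levels the gap is astronomically larger.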
Here is a quick example to test your visual ability.
So these squares, this one and this one,
which one is brighter?
Has anyone seen this example before?
AUDIENCE: They're the same, obviously.
ALAN YUILLE: Very sophisticated
people here, right.
AUDIENCE: It's on Wikipedia.
Everyone's seen it.
ALAN YUILLE: Everyone's seen it by now.
I still find some people who don't, but
anyway, they're the same.
But they look very different.
And the reason appears to be that when you look at an
image like this, your brain is not just registering the
intensity that it is actually receiving here directly.
It's doing some complicated inference.
It's figuring out that this thing seems to be in the shadow
of this, and so its intensity is actually different from what
you perceive it to be.
So it's doing an inverse process.
OK, so here is a typical image, illustrating, I think,
Everest from the Tibetan side, illustrating some of the
degrees of complexity that you get in the visual scene.
And the work that we're doing, one part of it is related to
what we called image parsing in a paper that we published a
couple of years back, which would give you the basic
flavor of what we want to do.
The idea being that an image, take this one, can be
thought of as being composed of a number
of different patterns.
Patterns for the person, patterns for their--
a hierarchical representation of patterns.
Here is a person, here is a face, here is a body.
Here is a sports field.
The sports field is made up of certain components.
The spectators are also made up of certain components,
which can be texture of different types and sometimes
larger elementary objects.
So think of the visual world as being made up of a large
number of these patterns organized in some sort of
hierarchical method like this.
And then the idea of interpreting an image, in our
terminology parsing it, is taking the image and
decomposing it into patterns.
So you can take this picture two ways.
One way is you go up here, which means you take these
patterns and you stick them together to make the image.
The other way is you start out with the image, you decode it
by taking it apart and getting this representation for it.
And once you've done that, you have solved the problem of
detecting objects in an image, recognizing them, and
understanding the entire thing.
And so our approach was to formulate this in terms of
generating the image by a probabilistic grammar, the
grammar allowing you to have a fairly abstract level of
knowledge representation, and probabilistic so that things
are not purely deterministic.
Here is a very simple conceptual picture of this.
The image would be a scene.
It would contain a face.
It would contain some text.
And it would contain a certain amount of background.
And so you could generate an image by coming up with a
scene node, a certain probability of there being a
face somewhere in the scene, a
probability of text, et cetera.
The model would enable you to generate these, and this would
be synthetic samples of text, synthetic samples of faces.
So that would be the generation process starting
from this abstract probabilistic model for this
scene, and then generating it.
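As a minimal sketch of that generative story, here is a toy grammar in which a scene node stochastically spawns a face node, a text node, and background. The probabilities and attributes below are made-up placeholders, not the model from the paper:

```python
import random

# Toy probabilistic scene grammar (illustrative assumptions only).
# A scene node spawns a face and/or text with some probability; leaves carry
# simple appearance parameters that a renderer would turn into image regions.

P_FACE = 0.6   # assumed probability that the scene contains a face
P_TEXT = 0.4   # assumed probability that the scene contains text

def sample_scene():
    parse = {"node": "scene", "children": [{"node": "background"}]}
    if random.random() < P_FACE:
        parse["children"].append({
            "node": "face",
            "position": (random.uniform(0, 1), random.uniform(0, 1)),
            "scale": random.uniform(0.05, 0.3),
        })
    if random.random() < P_TEXT:
        parse["children"].append({
            "node": "text",
            "string_length": random.randint(1, 10),
            "position": (random.uniform(0, 1), random.uniform(0, 1)),
        })
    return parse  # a parse tree; interpretation runs this process in reverse

print(sample_scene())
```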
Now, the task of interpreting it would go backwards.
It would have certain dynamics.
You'd have a grammar.
You'd have a grammar representation of the scene in
terms of these nodes and the elements in it.
You could do various operations.
Like here, you'd start out by interpreting the image in
terms of text and background without realizing there's
anyone in the scene.
And then over here, you'd do a move, a translation on the
graph, which involves creating a node structure for the face
and then explaining that part of the image in terms of the
face model.
And other types of transitions can be
done on this also.
I was not planning to go into details of this type of thing.
If people want details, give me feedback, and
the more questions people ask, the more I can adapt to the
audience and make sure I'm on the right page.
Here is a rather more complicated diagram of how a
system of that sort would work.
And basically, it would work in two sets of stages.
There's sort of a bottom-up component and a top-down one.
The bottom-up component would be a system which would be
particularly targeted to try and detect or
make proposals about the presence of certain important
things in the image, such as text or faces or any other
types of objects.
And so, for people involved in machine learning, you would
have a series of tests,
discriminative probability tests.
For example, AdaBoost, or methods of that sort, are
often very good procedures for that, which you would train on
data, and which would look at a certain part of the image
and say with certain probability we think there
probably is a face there or there probably is text or
whatever other types of objects you want to have.
These would make proposals into this sort of hierarchical
parsing representation, which would allow you to create
models of faces, destroy them, move them around, and so on,
up to a high level where there would be more generative
models of faces or text, which would sort of interpret what
the low level is telling them.
Also, the high-level models could start making predictions
about the implications of certain things in the scene,
or making predictions that, OK, you found a face here, so
there's a certain probability that there ought to be a face
over there, based on the high-level representation.
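Schematically, that bottom-up/top-down interplay might look like the following sketch; the detector and generative-model objects here are hypothetical placeholders, not the real system. Discriminative detectors make cheap proposals, and a generative model keeps a proposal only if it explains the region better than the background model does:

```python
# Sketch of the bottom-up / top-down loop (illustrative only; the detector and
# model objects are hypothetical placeholders, not the real system).

def parse_image(image, detectors, generative_models, accept_margin=0.0):
    parse_graph = []

    # Bottom up: discriminative detectors (e.g. AdaBoost-style cascades)
    # cheaply propose likely object regions with confidence scores.
    proposals = []
    for det in detectors:
        proposals += det.propose(image)        # [(label, region, score), ...]

    # Top down: the generative model for each label tries to explain the
    # region; keep the proposal only if it beats the background explanation.
    for label, region, score in sorted(proposals, key=lambda p: -p[2]):
        obj_ll = generative_models[label].log_likelihood(image, region)
        bg_ll = generative_models["background"].log_likelihood(image, region)
        if obj_ll - bg_ll > accept_margin:
            parse_graph.append({"label": label, "region": region, "ll": obj_ll})
    return parse_graph
```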
AUDIENCE: Is there a degree of pixel ownership,
probabilistically between the low-level components?
ALAN YUILLE: The low-level things wouldn't necessarily
own pixels. The high-level models--
the generative models--
should impose that.
The low-level, you could have competition.
Something jumps up saying, I think this is a face.
And something else is saying, I think you're wrong.
I think it's a tree.
But the high-level stuff, generative, is the thing that
really controls it and makes everything fit
together and so on.
Here are some examples of whether-- oh, yeah?
AUDIENCE: [INAUDIBLE]?
ALAN YUILLE: In this version, it only has
one, but that's not--
yeah, this is a version from two or three years back that
was implemented by Zhuowen Tu and Alex Chen and so on.
So there's nothing in principle to stop it from
having more.
And some of the versions that Professor Zhu is implementing
would have far more nodes of that sort.
Here is an aspect which would illustrate the issue of the
[UNINTELLIGIBLE] of the bottom-up and top-down.
So here, based on the filters we had several years ago,
would be our estimates of what a face is and what a text in
the image is.
So it's getting these faces OK, but it also thinks this
thing is a face.
And if you look closely at it--
take out the context--
it actually doesn't look that different.
There's a bit here that looks like an eye, another eye.
This could be a nose.
That could be a mouth.
So it's a plausible error that the system could make.
And given the complexity of images, you're often going to
find things of this sort.
If you had a model of a tree or a particular tree thing,
that would compete.
That would say, hey, this is far more likely to be a tree.
You shouldn't have a face there.
But at the bottom level of these cues, which are working
semi-independently, it's a legitimate candidate for a face.
And over here, there's another face here, et cetera.
Over here, text here, text here.
It thinks this is text.
Well, it looks consistent.
This could be a bunch of ones, and so on.
So these sort of discriminative, bottom-up
approaches are not always consistent, and they may, in
certain cases, be wrong.
So here, when you have the high-level generative
models, you start resolving the ambiguities.
These things are now detected as faces.
But this area here, this is a part of the general
region for the tree.
It's no longer described as a face itself.
And similarly, this area here, which was considered to be
possible text, is now described by the high-level
generative model as being a type of texture, because that
fits it better.
Over here, the nine, you may not have noticed, this nine
was not picked up by the lower-level text detection,
but it has been found here and is now interpreted as a nine.
So this is giving you this basic
strategy of the approach.
Images would be composed of patterns.
You would have various bottom-up cues, which you
would typically learn by machine learning approaches.
They would activate hypotheses.
These hypotheses would be tested by generative models.
And the generative models, in turn, would try and impose a
uniform interpretation of the whole image, which makes
everything consistent and happy.
Now I was going to show you this, which I think I should
show on the--
how do I get out of the slide presentation?
AUDIENCE: Probably Escape.
ALAN YUILLE: Escape?
OK, so I may need help with--
AUDIENCE: [INAUDIBLE]?
ALAN YUILLE: --guy here.
And now, let's see if I can put up a demonstration here.
So I'm seeing this here.
AUDIENCE: [INAUDIBLE]?
ALAN YUILLE: Drag it over to the left?
AUDIENCE: [INAUDIBLE].
ALAN YUILLE: OK, let me just go back here.
Go back to the beginning and make it full screen.
[SIDE CONVERSATION]
ALAN YUILLE: OK, so this is a great low-tech talk.
Here is an example of what one can do now, though this is
some time further on.
And so, if you don't know vision, this may not be very
impressive.
If you do know vision, you should be pretty
impressed by this.
So on the left-hand side, there's video taken just by
[UNINTELLIGIBLE]
simple video camera.
You go out into the street, you show these, run it around.
And on the right-hand side, it's in practically real time.
It's detecting text.
It is binarizing it.
AUDIENCE: [INAUDIBLE]?
ALAN YUILLE: Well, in some cases it's--
it's not keeping any memory of what is there from
frame to frame.
So there were certain aspects--
as the camera moves around, text is coming into view and into
focus, and other text is going out of focus, so some of the
text will appear in one frame and then will disappear again
in another frame.
Nevertheless, I think that this is
something that is not--
I don't know of anything else that can do this other than a
method of this type.
AUDIENCE: What kind of processing power does it take
to run this sort of thing?
ALAN YUILLE: It's not a lot, actually.
It's very--
Daniel, you do know the--
MALE SPEAKER: Yeah, it was run on, I think, a Pentium 3 or 4.
But it's only processing a few frames--
maybe 10 frames per second or something.
AUDIENCE: 10?
MALE SPEAKER: Yeah, maybe.
It's a relatively low-resolution frame.
But that's on a single [INAUDIBLE].
ALAN YUILLE: And so, what this is relying on is having data
sets of images, learning from that data methods for what is
distinctive about text, implementing them, and then
proceeding on to do binarization.
And from here on, you would go on to do recognition.
So the systems we have, this would be one of the more
practical things so far.
A bit of history or memory of what existed in previous
frames would remove the flicker.
But it will make certain mistakes in places which,
again, you'll see they'll flicker on and then they will
disappear later on.
AUDIENCE: So the algorithm that's doing this, really, is
just single-frame analysis.
There's no temporal correlation.
ALAN YUILLE: There's no temporal correlation.
So yeah, that's the flickering.
AUDIENCE: So there could actually be improved accuracy
if you took into account temporal correlation to get
rid of some of the other--
ALAN YUILLE: Oh yeah, definitely.
The temporal correlation would make things better,
definitely.
Everything here is based on cues, and each frame is sort of
producing its own cues; you could combine the
information from both of them.
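One simple way to picture the temporal smoothing being discussed, sketched here as a plain voting scheme that is only an illustration and not part of the actual system, is to report a text box only when it overlaps detections in most of the last few frames:

```python
from collections import deque

# Simple temporal smoothing of per-frame text detections (illustration only).
# A detection box is reported only if it overlaps detections in most of the
# last few frames, which suppresses one-frame flicker.

def iou(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

class TemporalFilter:
    def __init__(self, window=5, min_votes=3, iou_thresh=0.5):
        self.history = deque(maxlen=window)
        self.min_votes = min_votes
        self.iou_thresh = iou_thresh

    def update(self, boxes):
        """boxes: per-frame detections as (x0, y0, x1, y1) tuples."""
        self.history.append(boxes)
        stable = []
        for box in boxes:
            votes = sum(any(iou(box, old) > self.iou_thresh for old in frame)
                        for frame in self.history)
            if votes >= self.min_votes:
                stable.append(box)
        return stable
```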
AUDIENCE: So some of this text, it seems to change forms
as it flickers around.
Is it doing different things in that case?
ALAN YUILLE: No.
AUDIENCE: We got off the example.
Sometimes it comes out very readable and sometimes it's
very confused.
I'm just trying to understand exactly what we're looking at.
ALAN YUILLE: It's the nature of the text.
Certain of the text is different from others.
And some of that also relates to the binarization.
There are two stages in here.
One stage is actually doing the detection.
Then the binarization is making a certain amount of
errors in itself.
The detection itself, I think, is working extremely well.
The binarization, well, we're working on improving the
binarization.
And there are certain rules we have which are good, and then
there are certain cases where there were failure modes,
which we are then isolating and working on.
AUDIENCE: I think there's also some motion blur, interlace
artifact, that our eyes are taking out of the original
image, but which, because of the way it's composed in the
single-frame, no-memory, causes the--
MALE SPEAKER: Right.
[SIDE CONVERSATION]
ALAN YUILLE: So that was the work on-- basically, we've been
concentrating on finding faces and, well, particularly on
finding text.
Now the question is really, how do you go beyond this?
Because really, the idea is you don't just want to find
text and faces and things in images.
You want to find everything.
And everything we had before relied on having a certain
amount of training data.
To get the text thing to work, we had to have large numbers
of examples of text in real images so that we could train,
so that we could distinguish
between the text that actually was text and the parts of
images that could be anything else but looked like text.
And the same thing for faces.
And so that involves, really, having enormous numbers of
images because you really have to see what there is in the
environment that actually could correspond to text.
And if you think about it, vision is a fairly strange
subject in the fact that no one has ever really
characterized what happens in all possible images.
In speech, there are a certain number of words that
people utter.
There are phonemes and so on.
There's a fairly basic understanding of what the
basic vocabulary is, what the basic inputs are.
If you're an astronomer, you study stars.
And you know there are stars.
You know there are galaxies.
And you know there are dust clouds and a few other things.
You know what the basic elements of your domain are.
But for vision, there's not really a large amount of
knowledge of that.
On one level, you know you're working with images, with
pixels, with intensity values.
But there's very little understanding of the whole
complexities of what actually goes on inside them.
And so what my colleague Song-Chun Zhu is doing at Lotus Hill,
and a certain amount at UCLA, is developing this really
fairly ambitious project, which is more or less to take
a very large number of images and try to map them and to
understand, well, what really goes on in them.
Essentially, while the work I described so far was based on
the idea of parsing images into certain components, he's
taking it further, on one hand, by trying to parse
enormous numbers of scenes into visual components.
On the other hand, at the moment, it's being done more
or less interactively.
So it first started out a year or so ago with, I think, 20
Chinese art students sitting in front of images and hand
parsing them.
And I'll show you some examples later on.
And then it's moved more into an interactive approach so
that you can spare the art students some of their time by
putting in vision algorithms which can find certain
structures, and then all the art students have to do is
validate whether they're correct or not, or make changes.
And so, certain of the work we did on text was made possible
by having the amounts of data that we got out of
this process.
And so, once you have these representations, you learn
some very big things by having them.
For one, you get an idea of the structure that images
form, in the form of these graphical structures.
You get an idea of what can happen in all possible images.
And then, after you've done that, you can use the
representations for learning.
And then you can also use them for benchmarking.
You can see how well the algorithms actually perform
and where they don't.
If you're in the computer vision community, you know one
problem with computer vision is that, while it's easy to
come up with an algorithm that works nicely on a few images
and you can publish a paper on it, actually getting that
algorithm to generalize and to work in very large data sets
is a completely different business.
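As a hedged illustration of what benchmarking against such hand-parsed data can look like, one could score a detector's boxes against annotated ground truth with generic precision and recall. The function below is a standard sketch, not the Lotus Hill evaluation protocol:

```python
# Generic precision/recall scoring of detections against annotated ground
# truth (an illustrative sketch, not the actual Lotus Hill benchmark).

def box_iou(a, b):
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate(detections, ground_truth, iou_thresh=0.5):
    """detections, ground_truth: lists of (x0, y0, x1, y1) boxes for one image."""
    matched, tp = set(), 0
    for det in detections:
        for i, gt in enumerate(ground_truth):
            if i not in matched and box_iou(det, gt) > iou_thresh:
                matched.add(i)   # each ground-truth box may be matched once
                tp += 1
                break
    precision = tp / len(detections) if detections else 1.0
    recall = tp / len(ground_truth) if ground_truth else 1.0
    return precision, recall
```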
So here is an example of a typical Chinese image.
I guess he gets some of his funding from China, obviously.
And so here is an image.
This is obviously outside the Forbidden City, I think,
looking from Tiananmen Square.
And this would be trying to take that scene, segment it.
At this level, sky, building, flag, et cetera.
Streetlights, portraits, and so on.
Then mapping these down with line drawings, line
illustrations, of all these types of structures.
Here is, yeah, the layering: what structures are behind
which other structures, et cetera.
And here is the text--
this was stuff we were using, for example.
Labeling these as being text, these areas as being Chinese
characters, and there should be some here on
the top of the bus.
AUDIENCE: Is that part of the automated--
ALAN YUILLE: Yeah, that's in here.
It's the user interface annotation tool.
So this would be the Chinese art students.
The algorithms, this would be the Chinese graduate students.
And then tying it in with the knowledge database and so on.
Because the representations of the image would be based on a
representation that Zhu has defined, based on a sort of
and/or graph type version of the grammar.
And now he's initially guessing what that sort of
grammar representation ought to be.
So they have one version of the grammar.
And then, based on seeing how well they can describe certain
structures, they have to modify it, and so on.
So it's, in a sense, learning a good knowledge
representation for the data.
And the automation, well, more of it comes in over time.
Of course, in a sense, if you can automate the whole
process, then you've solved the whole vision problem and
there's nothing else to do.
But they're some way away from that yet.
But still, it's interesting to find out which computer vision
algorithms are actually useful for something like this--
which is hard--
and where they have to work reliably, and
which ones are not.
So here is just an example of a sort of representation.
So here is a small element of a scene.
So a boy with a backpack.
You would draw it at one level.
Then you would represent it in this hierarchical manner, in
terms of these parts that contain the hand or the
zipper and so on.
The boy would be represented in terms of the face, the hair,
the ears, the parts, and so on.
And all these would be encoded within this hierarchical data
representation structure that Song-Chun
Zhu's people are doing.
Here is, I think, another illustration of what one does
with, I think, a chair scene.
So you're labeling the parts of the chair.
You're putting the--
here is the seat of the chair, the cushion, the back, the
light, and so on.
I think the window and so on are in the background.
So from these you can both represent and you could learn
methods for detecting chairs.
You could also find the probabilistic relationships
between certain structures happening in the image, like
the chair and the table and so on.
This has been talked of in computer vision.
It's something that sort of comes and goes.
I remember even when I started vision 20 years ago, there
were schema being developed by [? Alan Wiseman and Hanson, ?]
which talked about these issues, but it was practically
impossible to do anything like that then.
There was no data.
There was no real-world images that you could really work on.
And there was no possibility that you could actually even
think of designing algorithms that would work on those types
of processes.
Street scene segmentation.
Google Earth images, which I guess he's got from here, I
trust, with your permission.
We'll take these images, label all the cars and so on.
Label all the buildings and get these representations and
then proceed to use these domains to build image parsing
systems based on the principles that I've been
describing.
Database.
I think the size of this is big, certainly by the sizes
used in computer vision.
I remember five years ago, databases of 100
images were quite large.
After that, Berkeley had a database of segmented images
which was about 2,000, which was considered big.
Well here, the total number of images is something like
500,000 so far, which Song-Chun says are
hand-annotated.
I haven't checked them all myself, so I'm not quite sure
of the performance criteria.
But still, it is an amazing amount.
[UNINTELLIGIBLE] broken up by looking at outdoor scenes, indoor
scenes, images of activities, aerial images, and
also coding of various animals, objects, and so on.
And this is by no means the end point.
The institute has been running now for about a year and a
half, I think, and 500,000, I think, is a big number.
But Song-Chun is shooting for a lot more--
orders of magnitude, videos being included.
AUDIENCE: [INAUDIBLE] a big proportion of that is videos,
so I guess part of what makes this [INAUDIBLE] is that you
can use [INAUDIBLE].
ALAN YUILLE: Yeah, I'm not sure exactly on that, how much
of that is done by exploiting that property.
But apparently, there were a number of videos of Chairman
Mao's speeches which he had.
AUDIENCE: Do you know how much time it takes to label one
image, on average?
ALAN YUILLE: I'm not sure, exactly.
I spent time out there.
There are highly motivated people working eight, nine
hours a day, just doing these things.
I'm glad I'm not doing it myself, but Song-Chun, I'm
sure, could give you the figures of doing that.
It's also going to change as soon as he [UNINTELLIGIBLE]
putting the annotation in there, and you start just
having to prune things out which are bad.
[INAUDIBLE], have you got any idea of time?
MALE SPEAKER: No.
ALAN YUILLE: [INAUDIBLE].
And I think, also, with this process there are time issues
like, how long does it take to train the students to do this
type of thing, et cetera, and the efficiencies
which come with time.
So that's part of how one could take the
image parsing further.
One needs to have this data set.
You need to have these types of representations.
You need to have this for training.
You need to do it for validation.
You need to have it for learning all
your types of models.
Here are just a variety of cats in here.
Because, partly, I like cats, so I picked one
example slide there.
And then, certainly, the representations also go into
3-D scene labeling, so I just put in a slide of that form.
Any questions on the data sets?
Because I'm now moving on to, I guess, the
final part of the talk.
AUDIENCE: From the groups and hierarchy that you have in the
various domains, are there generalized catch-alls?
For instance, under land mammal, is there the notion
that you can actually build a model that represents a
probabilistic match to a land mammal, as opposed to only
having the specific models for big cat, dog, and you'll
always map one of those?
ALAN YUILLE: I'm not sure of the exact status.
You would like to be able to have as generic a model
as you can, obviously, for grounds of interpretation.
And to the extent that you can find similarities, you'd
have a grammar that could generate a kangaroo with one
input and generate something else with another input.
That's what you'd like to do.
And frankly, I think that's quite practical, given the
fact that all mammals have very similar structures.
There's been work in the past. There's work I did with
Song-Chun Zhu a long time ago, when he was a graduate student,
on a form representation system, which involved
representing things by skeletons and so on.
And a lot of other people have worked on that sort of
skeleton-type representation structure.
And so I think it's not impossible to do that.
I should also say that, in terms of detection as well--
so, for the earlier system that we had published, we
built something which would detect faces bottom up.
We built something that would detect text bottom up.
But you don't really want to build something that detects
the 8,000, 10,000 possible objects in the scene
individually.
You want something that exploits the
similarities between them and will output
a number of these possible structures.
There's been a bit of work done by Bill Freeman at MIT
related to that [UNINTELLIGIBLE].
[? Wiseman ?] has also considered that issue, but you
want to find commonality and build up with ideas of
compositionality.
Putting certain parts together, parts that hopefully
reoccur, and can be detected individually.
So, any more?
Yeah?
AUDIENCE: Sort of a similar question.
Some of the objects were referred to in aggregate, like
in the Maps image, a set of cars was all one [INAUDIBLE].
How do you discriminate between a car and what
constitutes an aggregate?
ALAN YUILLE: Actually, there were some slides which I took
out on varieties of cars.
For that, there was at least a generic car, and then a series
of small specific types of cars.
And quite how that functioned--
what was your question, more specifically?
I mean, both aspects are being addressed--
AUDIENCE: So in the Google Maps figure, there were single
cars labeled on the road.
There were also groups of aligned cars.
Is that generally a big problem?
Each of those, obviously, is made up of single cars.
ALAN YUILLE: Oh, right.
Well, then you'd have the hierarchical representation
where at the bottom-level nodes, you'd have the
individual cars, and then you would have those grouped into
regularities such as rows of cars.
And then there would be a node higher up,
which would be parking--
well, not a complete parking lot, but some public
structure, and so on, and then a node higher up still,
which would be the whole parking lot.
So the hierarchical modeling is intended to take care of
those issues.
AUDIENCE: So are you using the same recognizer for the map
shots as you would for a ground-level scene shot?
Or are they separate?
ALAN YUILLE: They would be separate.
Yeah.
The cues [UNINTELLIGIBLE], I think, the viewpoints suggest
two different--
for that to be done.
So now, for the last part of the talk, I want to get into
some other work with another person called Zhu, but this is
a graduate student, Zhu Long.
So not to be confused with Song-Chun.
I'm told the Chinese character is actually different, even
though the Anglicized version, Zhu, is the same.
So here is an attempt to learn probabilistic grammars in an
unsupervised way, just from input images.
So here, we're not relying on the types of detailed
segmentations and annotations that Song-Chun's group is
producing because, well, they weren't around when we were
starting this work.
And in any case, one would like to see how far one can go
without having to have hordes of Chinese graduate students
do that type of work for you.
And so we are addressing work in the Caltech 101 dataset
that, I guess, a number of people may know.
So here, this was set up at Caltech by Fei-Fei Li working
with Pietro Perona.
And she got 101 different categories.
I think, apparently, Pietro told her to get more than 100,
so she went up to 101 and stopped there.
And so this data set, though it's simple compared with what
Song-Chun is producing, and there are certain criticisms
of it, is still considered a good,
state-of-the-art benchmark--
it's something that people are starting to use and compare
results on.
And so here are certain of the objects you have in it.
This is chair, cougar, et cetera, and below, certain of
our recognition results on it.
The important thing, perhaps, for us isn't so much the
quality of the performance, which is fairly good, but it's
more the concepts behind it, at least for
this part of the talk.
So you think of having a probabilistic model or grammar
that could generate these objects.
And so, the way these things are done is you're given a
series of images.
And in each image, there is an object.
And it's a face, but you do not know where
it is in the image.
And you don't know what the background is, and the
background is fairly complicated.
So I think Pietro Perona calls this unsupervised learning.
That's not quite fair.
It's more semi-supervised.
But still, you don't have the detailed knowledge of where
the boundary of the face is, or the particular position or the
size or the scale.
So the idea is that you try and learn the grammar
incrementally from these sets of images.
And so the first thing you do is you assume that the image
is just purely background.
It's the default model.
It's like it's random.
That corresponds, schematically, to this graph
structure here.
This is just a background generating
image, and that's it.
I should say this work here, initially, is done just on
feature points extracted from the image, and later on, it's
generalized.
So you can think of these as feature points in the image
to start out with.
You take the image, you run an interest point detector on it,
you end up with 40 feature points, something like that,
and you want to explain them.
And they have attributes like appearance properties and
so on, and you're using methods like Kadir-Brady interest
points and SIFT descriptors and so on.
So you start out with this default model.
That's a rather boring model, but it's what you need
to start out with.
So then you start seeing, can you find more structure, more
regularities in the image, which are more likely than the
data just being generated by this independent model.
So here, you start doing it in terms of combinations of
triplets of features.
Triplets are quite useful because, from triplets of
features, you can get properties that are invariant to
the orientation and to the scale, so you don't need to
have those things fixed.
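The reason triplets are handy is that three points determine a local frame, so geometric quantities measured relative to that frame do not change under translation, rotation, or uniform scaling. Here is a minimal illustration, not the actual descriptors used:

```python
import math

# From a triplet of feature points you can compute quantities that do not
# change under translation, rotation, or uniform scaling (illustrative sketch).

def triplet_invariants(p1, p2, p3):
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    d12, d23, d31 = dist(p1, p2), dist(p2, p3), dist(p3, p1)
    perimeter = d12 + d23 + d31
    # Ratios of side lengths are scale invariant, and they are unaffected by
    # rotating or translating the whole triplet.
    return (d12 / perimeter, d23 / perimeter, d31 / perimeter)

print(triplet_invariants((0, 0), (2, 0), (0, 2)))
print(triplet_invariants((1, 1), (1, 5), (5, 1)))  # same triangle, moved and scaled
```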
Then you see whether this model will explain
it better than this.
Then you grow this model to a more complex one, see, can you
explain it better, and so on.
And you keep on adding new features, adding more elements
of this graph, to this grammar.
While your ability to explain the data goes up, you'll carry
on adding extra features in here.
So you'll grow your grammar over time with data.
People are probably familiar with AdaBoost; in the
AdaBoost algorithm, you do feature selection, you use a
certain number of weak classifiers to do a task, and
then there's a procedure which decides what's the best new
weak classifier to add into the system.
Well, this is a little like that, except it's rather more
complicated, because here you're learning how to increase the
structure of a very general probabilistic model, rather
than just a classifier like AdaBoost.
So [UNINTELLIGIBLE] with those things here, and the details
of this are--
well, I can describe them later to people who are
interested, but this gives you, I hope, the basic idea.
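At a very high level, the structure-growing procedure can be sketched as a greedy search: keep proposing new grammar elements (for example, feature triplets) and accept one only if it improves a penalized score on the training images. The sketch below is a paraphrase of that idea with hypothetical method names (with_added, propose_candidates, score), not the published algorithm:

```python
# Greedy structure learning, sketched at a high level (a paraphrase of the
# general idea; with_added, propose_candidates, and score are hypothetical).

def grow_grammar(images, grammar, propose_candidates, score,
                 max_rounds=50, min_gain=0.0):
    """score(grammar, images) -> penalized log-likelihood of the data."""
    best_score = score(grammar, images)
    for _ in range(max_rounds):
        best_gain, best_candidate = min_gain, None
        for candidate in propose_candidates(grammar, images):
            trial = grammar.with_added(candidate)          # hypothetical API
            gain = score(trial, images) - best_score
            if gain > best_gain:
                best_gain, best_candidate = gain, candidate
        if best_candidate is None:
            break                       # no candidate improves the score; stop
        grammar = grammar.with_added(best_candidate)
        best_score += best_gain
    return grammar
```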
From that, you could learn a grammar of the object.
And this grammar is a fairly simple one at the moment
because it's based on feature points.
It would be comparable to the types of models, somewhat,
that are done by Perona and Fergus on the constellation
models, if people are, perhaps, familiar with that
literature.
But there's a difference here.
Everything here is learned completely unsupervised.
And the grammars we're getting here--
the models that Pietro and the other people get have fixed
numbers of points.
Here, with the grammar, you're able to have different numbers
of points depending on different
aspects of the object.
So the motorbike can be decomposed into this type of
aspect appearance.
Here is another form of appearance.
Here is another.
And depending on the amount of viewpoint variation, you could
have other aspects being developed as well.
Another advantage of this process is, once you've
learned it, how you can do the inference.
The form of the grammar is set up so that doing the
inference is very fast. So for example, once you've learned
the structure, performing the inference to find out whether
there is a motorbike in the image or not takes about one
second, which by the speeds of what alternative methods
can do is very fast, moving towards the practicality of
real-time performance.
Three tasks.
This is just to bring out the main different tasks.
Once you've learned this probabilistic model, how will
you do inference?
That would involve detecting the object in the image,
parsing it, finding its boundary.
The second would be learning the parameters of the model
when the structure of the model is fixed.
And the third is structure learning, where you allow the
grammar to grow based on the data.
So there are three different tasks related here.
The inference is the part that's really fast. The
parameter learning part is reasonably quick.
Structure learning is a bit slower, on the order of
several hours, but that's something that you do offline,
in any case.
And here, just to say some of the algorithms involved:
you're using EM, you're using dynamic
programming, saddle point methods.
There's a certain range of techniques put in
there to do those tasks.
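For the parameter-learning task, the EM loop would, in outline, alternate between soft-assigning feature points to model parts and re-estimating the part parameters. The following is a generic sketch under a simple fixed-variance Gaussian assumption, much simpler than the actual model:

```python
import numpy as np

# Generic EM outline for a simple part-based model (illustrative only; the
# actual model, features, and E-step in the talk are far more elaborate).

def em_fit(observations, init_means, n_iters=20, var=1.0):
    """observations: (N, D) feature vectors; init_means: (K, D) initial parts."""
    means = np.array(init_means, dtype=float)
    for _ in range(n_iters):
        # E-step: soft-assign each observation to each part (hidden variable).
        sq = ((observations[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        resp = np.exp(-sq / (2 * var))
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate each part as the responsibility-weighted mean.
        means = (resp.T @ observations) / (resp.sum(axis=0)[:, None] + 1e-12)
    return means
```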
So here, coming up with, again, more examples.
Roosters, pianos, et cetera.
Faces, airplanes, and so on.
Now, these methods are effective--
here's invariance to rotation and scale.
I'll skim over that because we're getting short on time,
but the form of representation enables it to do
that quite well.
Here is something that I think is conceptually interesting:
if you don't say what is in the image, it will start learning
the models individually as different
branches of the grammar.
So before, you would give an image in which there would be a
face somewhere.
You'd have the information that there was a face in there.
Now we're saying, and we're making it weaker, that there's
something in the image which is a face, or it could be a
plane, or it could be a motorbike.
And then it will learn the grammar, and one part of the
grammar will correspond to the plane, one to the face, one to
the motorbike.
So it will start learning that there are different types of
objects, automatically, without you being taught it.
Now these results here, I think they're good.
And that's comparable to, or maybe better than, the current
state of the art for representations which are
based on representing objects by interest points.
But interest points are a sparse
representation of the object.
And you are only getting a limited amount of
information from them.
You're getting the ability to do certain tasks; certain
classification tasks seem to be possible based on interest
points, because a grand piano and a face may look very
different just based on the interest points.
But if you want to get more fine-scale resolution or if
you want to actually find the boundaries of the objects, you
have to go to a richer set of vocabularies.
So the next stage is to say, OK, we don't just represent
the image in terms of the feature points.
You represent it in terms of interest points, then you add
extra features: it could be a mask for the whole shape
of the object, it could be edgelets.
And as we can move forward with this, we can start
putting in extra features.
So eventually, one would hope to get enough features so that
you could really represent the object completely.
The interest points are very good places to start because
they sort of have invariant properties, there are small
numbers of them, they can get you started.
They can tell you roughly how big the object is.
They can tell you the orientation, et cetera.
Once you have those, these other features can be applied.
So over here, we start out with the basic objects, just
using interest points.
We find the interest points, and those interest points start
telling you roughly where the object should be.
So here is a box based around the interest points of this
star; this thing here is one based on the interest
points found here.
So locate a certain area from that.
Then from this, automatically, you can start by hypothesizing
a model for the shape of the object
given, again, 100 examples.
And you can learn that by doing inference, where that
would be using certain information from the inside of
the object and from the outside the object.
Again, purely unsupervised.
And so from that, you end up being able to learn masks of
the object, which are reasonably good.
Though [UNINTELLIGIBLE] screw up in some places,
like with the chair.
They're not quite getting the legs of the chair very
accurately.
But nevertheless, for most of these objects, they're getting
a pretty good model of the shape of it.
And so you're going here from the basic, very sparse
interest points to a probabilistic
model of the shape.
And then you're adding in edges and so on.
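One way to picture the step from interest-point boxes to a probabilistic shape mask is to make a crude foreground guess inside each box, warp the guesses to a common frame, and average them over the training examples. This is only a schematic illustration of the idea, not the learning procedure used:

```python
import numpy as np

# Schematic bootstrapping from interest-point boxes to a probabilistic shape
# mask (an illustration of the idea, not the actual learning procedure).

def learn_shape_prior(images, boxes, out_size=(64, 64)):
    """images: list of 2-D grayscale arrays; boxes: matching (x0, y0, x1, y1)."""
    h, w = out_size
    accum = np.zeros(out_size, dtype=float)
    for img, (x0, y0, x1, y1) in zip(images, boxes):
        crop = img[y0:y1, x0:x1].astype(float)
        # Crude foreground guess: pixels that differ strongly from the crop's
        # border, which is mostly background inside a loose bounding box.
        border = np.concatenate([crop[0], crop[-1], crop[:, 0], crop[:, -1]])
        fg = (np.abs(crop - border.mean()) > border.std() + 1e-6).astype(float)
        # Resize to a common frame by nearest-neighbour index sampling.
        ys = np.linspace(0, crop.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, crop.shape[1] - 1, w).astype(int)
        accum += fg[np.ix_(ys, xs)]
    return accum / len(images)   # per-pixel probability of being object
```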
So I'm hoping that with this method, by adding more
features, you can sort of bootstrap your way up until
you have a model that generates not just the
interest points and the mask and the features but could
generate, maybe, the appearance of
the objects as well.
And so, with this, you can not only do things like
classification, which is what people do on the Caltech
database and get numbers for, which are as good as or better
than the other approaches, but you can also get
methods for detection, for the positions of the
boundaries, and so on.
And I think this would address one of the concerns with the
Caltech thing, which is that, though there are large numbers
of objects in the dataset, they're not really
representative of the domain.
And so, if you compare it, say, to the work on text
detection, which we've done a lot on, we had to get examples
of text, and then we had to get lots and lots of examples of
things that were not text in images--
thousands, tens of thousands, even more non-text things, to
find out anything that you could confuse with text.
And so, if you take the Caltech 101 dataset, and you
take the interest point models we have or the models that
other people have, sure, they'll work quite well in the
Caltech 101 dataset, but there's going to be a lot of
other things out there in the world that they're going to
mix up with these objects because they have too limited,
too restricted representation of the data.
So hopefully, with what we're doing here though, as you add
on more features, as you use the interest points just to
get started, you add on more and more features, and that
allows you to discriminate better and better between
these objects and everything out there.
And then also, it enables you to find the boundaries and do
every other thing that you want with it--
finding the boundaries, doing the parsing, saying, hey,
that's not just a grand piano, but here are the legs, here
are the keys, et cetera.
So that would be one goal.
So I guess I'm finishing more or less on time.
So here would be the conclusion slide.
So generally, the group--
our UCLA group, the Center for Image and Vision Science,
myself, and Song-Chun-- we're interested in formulating
problem tasks like image parsing, using probabilistic
grammars for objects [UNINTELLIGIBLE]
patterns, with high-level models of structure and low-level
models acting in a bottom-up, top-down method: low-level
cues activate the high-level models, which then check,
confirm, and make everything
consistent.
And I think where we've gone furthest on this so far with
something that's pretty good is the automatic detection and
binarization of text.
We're then moving on to these other things.
So Song-Chun Zhu's genome project of images with all
these art students and extracting these data sets and
getting these hierarchical representations by hand and
using them for training and for testing, et cetera.
And then this other work I'm describing at the end, the
issues of how far you can go with learning these grammars
in an unsupervised way without having to get Chinese art
students to give you clues part of the way.
So I think, hopefully, these aspects come together and one
can get very powerful methods.
The talk has been mainly about computer vision.
I should say I'm also very interested in how the brain
does these types of things.
And earlier this week, I was at a meeting at NSF in
Washington where issues of how you would model the brain from
all different perspectives--
from physics, computer science, statistics, and so
on-- were coming out.
And so I guess I was also arguing there for this as a
type of model that you could have. We're trying to build
here a system that is as sophisticated, more or less,
as a human visual system.
And you can try and see whether certain of the aspects
of this relate to the properties that we know about
the human brain and how it's organized, and whether
something like this could be tested, validated,
confirmed, or whether I would find out that the brain is doing
something different, better, or possibly something worse.
Who knows?
So I think it ties in here, and one could also start using it
as a theoretical neuroscience model capable of
some tests.
Anyway, thank you very much for your attention.
AUDIENCE: Are your probabilistic grammars and
parsing techniques all built on specific points or specific
features or something like that, or do they use
[INAUDIBLE] patches, or--
ALAN YUILLE: They could use most things.
It's only the one at the end that we're starting off with
the interest points.
The things at the beginning for the faces, that could be
patches, that could be any type of description that is--
yeah, anything that one can put in there.
So there's no restriction to interest points.
You really want to be able to explain the whole image.
AUDIENCE: What's the nature of the parsing algorithm that
works with these?
[INAUDIBLE]
patch as a thing to do distance matching on a parser
something like that?
ALAN YUILLE: No, well, if you were trying to recognize a
face, you'd have a generative model for what a face would
look like, which you would learn by
training data by having--
for the face, it would be a bit like the active appearance
models, so you'd have some sort of spatial warp, some
models of the intensity and so on.
The patches there could come in as possible cues, if you
could statistically say that certain patches were highly
likely to be present where there was a face, then that
would be a useful feature that you could put into a bottom-up
AdaBoost detector.
But that would be there merely to drive the activation of the
top-level generative model to validate and to confirm it.
AUDIENCE: So I had a question around, I'll call it, viewpoint
sensitivity.
Although you had invariance to rotation and scale,
that was for a given viewpoint on
the original image.
I was looking through the Caltech 256 database just now,
browsing the images, and it seemed as if, very often, the
image of the object always respected what might be called
the principal axis of the object.
But you never saw a book face-on like this and then
edge-on like this, and expected it to
recognize both as the book.
So is there a precision recall sensitivity issue, in terms of
the perturbation from this trained viewpoint?
ALAN YUILLE: Well, there isn't, but the system I
described should be able to deal with that by having a
grammar, one aspect being for what the book would look like
front-on, another for what it would look like from side-on.
So at the top of the graph--
AUDIENCE: So given the right training data--
ALAN YUILLE: Given the right training data, yeah.
It's not clearly shown here--
no, there's nothing here which really quite demonstrates it.
But here, this would be like an Or graph:
it could look like this or this or this.
So if you had enough training data, it could happen
automatically.
There is an issue, though, whether that is the ideal
thing to do, or whether there are similarities between the
representation here and the representation at other angles
that, by going to a full 3-D representation, you
could do away with.
So at the moment, given enough training data, this sort of
thing should work, should deal with that.
But there could be a smarter, better representation--
AUDIENCE: Does it represent the underlying physics in
terms of the light field?
ALAN YUILLE: Yeah, it keeps on coming up.
There's an issue about-- certain of the machine
learning techniques are very data intensive.
You rely a lot on them.
There's knowledge of how humans learn.
Humans seem to learn very quickly, from one or two
examples, and then they generalize.
And babies do it enormously well.
There's interesting work being done by Josh Tenenbaum and
people at MIT on that, not directly on vision, but on the
issue of how humans learn concepts.
And it does seem there that, in his models, you've got a
more structured representation, which means
that it gives you more of the structure and allows you to
generalize more easily from a small number of examples.
And so I think that's an area that I'm partly trying to push
in myself at the moment, since having enough data works
and we know how to do it.
But still, it would be certainly more elegant and
more effective in the long term to get at
those invariant issues.
And it's an area of importance at the moment.
AUDIENCE: So I guess I'm just curious how close this is to
being production quality.
If I wanted to, say, find text in Google Images, what are the
issues standing in the way of doing that in a useful way?
ALAN YUILLE: I think finding text would work pretty
well, I would say, since we've tested it on
large numbers of things.
There's finding the text, there's the binarization, and
then there's reading the text.
The finding of the text, I think, goes pretty well.
You can bank on that.
Binarization, that's working pretty well, but there are
certain failure modes that one detects.
And again, with more datasets and more labeled images, you
find ways around those.
The reading of text--
we've taken the outputs of these things, put them
into OCR systems like ABBYY or things like that, and it's
surprising how often those systems fail, even when we
give them binarization that looks very good.
So at some level, I think the limitation of those things is
the text reading systems, which, after all, are
designed to work on documents like these.
They're not designed to work on real-world images.
And I think the limitations are those.
If you want to find the text, if you want to binarize it and
put it into a system, then I think these
things work pretty well.
For the other aspects, other things I think we have not
spent so much time on or developed so far.
And partly, it has been an aspect of having the data.
Text detection stuff, we started working on five, six,
seven, eight, years back.
We started getting images.
We started with small numbers.
We tested, we expanded, et cetera, et cetera.
With these other things, it's only in the last year and a
half that Song-Chun has got these people to get these
types of images and to go forward.
Also, it gets into this issue of variability.
Text, the amount of variability is not as much as
there would be for a deformable object like a panda
or something like that.
And the more deformable it is, unless you have a good
knowledge representation system that is able to deal
with that, the more data you're
going to need to do it.
AUDIENCE: So how hard would it be to generalize, to stylize
things like logos and stuff, where they take normal text
and they intentionally make it unique?
ALAN YUILLE: Those can be frustrating sometimes.
On those, we do surprisingly well, except with certain
particular cases where there are some slight problems. For
example, the text stuff is based on
black and white images.
If you have color images where the text is green on red and
the intensity is almost exactly the same, then the
system is not able to do that.
I think, again, with a bit of training or a bit of putting
in certain cues for those things, I don't think there's
going to be too much of a problem for those, as far as
the detection stage and maybe the binarization stage.
But certainly, yeah, those have been frustrating in the
past.
AUDIENCE: So in your grammar--
so ideally, you have the whole branch.
So it should be one type of, let's say face.
So could it be possible for one branch, a human face mixed
with a, maybe, cougar face--
ALAN YUILLE: Well, if you showed the system cougar
images mixed together with face images and just learned
it, it would be quite--
we haven't done that.
Maybe we should try it and see what happens.
The other issue is, at the level of interest points,
cougar faces and human faces are a bit similar.
They're different enough so that when you learn the model,
they are able to distinguish between them.
But in the stages where you're trying to learn it without
telling the difference between the cougar and so on, it might
not work just with the interest points.
I think, when you start putting in the other stuff,
it's going to go a lot better.
But with interest points alone, you might get mixed up
until you've gone far enough to learn the full model.
OK, thank you.