11 3 Kd Trees 2907

Now we're going to look at KD trees, which is an extension of BSTs that allow us to do efficient processing of sets of points in space and it's very flexible and very useful in lots of applications. So now we're going to extend the API to talk about two dimensional keys. So that's just, you can think of two dimensional keys as, points in the two dimensional geometric space. So we're going to talk about insertion, and search. We won't talk about deletion and then range search and range count. So, we want to be able to insert and delete, points. You can think of a two dimensional key as a point in two dimensional space. And we want to be able to find all keys that lie within a two dimensional range. That's a rectangle. As I mentioned at the beginning or count the number of keys that lie in a two dimensional range. So again, the geometric interpretation is the keys or points in the plane. And we have a, a, a range, 2D range is a, a rectangle, or is oriented, to align with the horizontal, vertical axes. And we want to be able to find or count the points in a given rectangle. In this one has many, many applications and we'll talk about some of them later on. And even if it's not points in the plane, just databases you might ask for all the people with incomes between 1,000,000 and 10,000,000 who are between 40 and 50 years of age. And this kind of algorithm and data structure would be useful for that kind of situation too. So how are we going to solve this problem, implement this API? We build a data structure containing points that can efficiently support range searching and range counting. Well one easy way to do it is to just think about dividing space into a grid of squares. So we'll pick a parameter M and divide space into an M by M grid of squares and then process all the points and make a list of points that are contained in each square. We can use a two dimensional array to directly index, relevant squares. And for insert, you just, take X, Y. Figure out which square it belongs to. Simply divide by, both coor dinates by N, and look into the two dimensional array. And just add the point to the list for the corresponding square. And then range searches, only find the squares that intersect the query, and process the points, in that square. And depending on the value of the parameter M, you have a space/time tradeoff. The amount of space required is M^2 for the grid + N. You have to have a linked list element, or whatever for each point. And then the time, though, gets divided by M^2. Your number of points and were spread out over the N squared, different squares. And so on average you examine N over M^2 points per square. So you don't want to make M too big that would be too much space, you don't want to make M too small that would be too much time. So what we want to choose is the square size that would best balance these two needs and then it is easy to see that what you should choose is M to be about n square root of N. So then your space is within a constant factor of N and your time is constant. So if the points are randomly distributed, then this is ideal. It takes us, linear time to initialize the data structure. And to insert and search, it take constant time per point in the range. And this is absolutely a fine method that, is not that difficult to implement, in the case that the points are evenly distributed. Unfortunately, it's usually the case that, in geometric data, that the points are not evenly distributed. There's a well known phenomenon known as clustering that says that, the, the points aren't going to be evenly distributed all over the whole thing. In the case of the, the grid implementation, they might all fall on the same square. And so, the average list length is short. This is like what we encountered with hashing. If you take, all the points in one square, and zero and all the rest of them. Your average is still, N over M squared. But. They are all in that long list and you're going to have a slow algorithm if it's, if it's based on this. So we need a data structure that more gracefully adapts to the distribution of the data. And again it, it's well known that most geometric data has this kind of problem. So for example here's some data which cities in the US. It's got 13,000 points, but if you try to use the grid implementation for this, you'd find that half the squares would be empty. And half the points are in just ten percent of the squares. So, the clustering in the data is going to make the implementation inefficient. We need to adapt to the data. And this is very, very typical, in geometric data. Particularly, in higher dimensional data, as we'll see in a minute. So, people have developed all different kinds of, methods for, adapting in this way. And what we're going to look at is one of the most widely used. Which is basically to use a tree to represent a recursive subdivision of the plane, of two dimensional space. It's going to be recursive. It's going to be based on the points the way in which we divide into half planes. And its one of many different algorithms that, have been, studied for this, but again its a simple one and widely used. So, for example if you played the game Doom or use the flight simulator, that, these types of graphical simulations and animations are made possible only through the use of space partitioning trees, like 2d trees and quad trees and also in all different types of scientific data processing these things are extremely important whenever you're processing. Geometric data, doing some kind of geometric search. Where is the closest thing? How am I going to find the closest thing efficiently? What things are nearby, and so forth? So, rest assured, these types of algorithms, lie at the heart of, any program that you use that, is involving a lot of geometric data. So, those are just, two examples. So let's look at how it looks now. So, a 2D tree, is, again, it's going to be a data structure based on a bunch of points that's going to facilitate, efficient data processing of these points. So, just as we do for, symbol tables, where we take, keys. Now we're going to ta ke points, and we're going to build a data structure based on these points. And the idea is to, build a tree that corresponds to recursively partitioning the plane. So arbitrarily our first point we're going to divide the plane into two parts based on a vertical line through that point. So now, in the tree on the right there, all the points that fall to the left of the first point are going to be on the left, and all the points that fall to the right. That first point, you're going to be on the right. But then we get to the next point, we'll switch and we'll partition on a horizontal line. So now, our second point, in the tree, the left sub-tree corresponds to everybody below that horizontal line, and the right sub-tree corresponds to everybody above it. Similar if our third point comes on the left again we'll partition according to the horizontal line through that point on the left. So if we go left and then left that means all the points to the left of one and above three, so the square in the upper left is represented. By, that node in the tree. And, again. Now, when we go one level below, we switch again to vertical. Alternate between horizontal and vertical partitioning, of the plane. So it's a regular binary search tree. But it's got this interpretation based on the geometric data, where we switch which key we use for, the comparison, the X coordinate or the Y coordinate, at each level. And that corresponds to this partitioning of the plane. So now five comes in, that's to the left of four because it was partitioned at a vertical and five's going to partion on a horizontal. This is simple, and completely well defined partion of the plane corresponding to a binary tree. Now the ninth point well it's to the left of eight, above to and to the left of eight and then corresponds to horizontal partitioning, tenth point is to the right of one, it's below two and we go to the left and it's to the right of seven so we go to the right. So that's a way to build a Binary tree corresponding to a partitioning of the pla ne. And it's really the same as the binary search tree. It's just that we alternate which coordinate we use as the key. At the even levels, we think of a vertical line. And the left subtree is all the points to the left, and the right subtree is all the points to the right. On odd levels, we use a horizontal line, in the left subtrees all points below. In the right subtrees, all points above. And the, and the code for this is, you know, the same as for binary search trees. It's simply, which, coordinate we use for the comparison. That's the only difference. So that's 2D tree implementation. So now what about solving a problem like this rain search problem for a 2D tree. So now we have a query like this green rectangle and what we want to find is all the points in the data structure that fall within that rectangle. Well, we're going to use, the 2D tree represents our points and we are going to use the structure and definition of that tree, to go ahead and help us find the points that are in the rectangle. If, if the root node lies in the rectangle then we're done. We can return that, that point but we have to look on both sides to look for more, but if the rectangle lies to the left of the root node then we have to look at the left and so forth. So let's look at how this works in the demo. All right, so, we're going to try to find all the points that are contained in that green query rectangle. So first thing is, to check if our rectangle contains the node of the root. In this case it doesn't. So since, it's to the left of the splitting line of the root we only have to search in the left sub-tree. Now, we search the left sub-tree and we're going to check if it contains.3. It does not contain.3, but now which, sub-trees do we search? In this case, now the rectangle intersects a splitting line, so we have to search both subtrees, both above and below. So, first we search the left subtree that's the one below does it contain .4? No. It's to the left so we're going to have to search the left sub-tree of .4. And so we search the left sub-tree and we check if it contains point five and it does, that's the one that we return. It, it also intersects the splitting lines, we have to search both the sub-trees, in this case they're both empty. So we're done with that but now we have to go back and we have to search the other sub-tree of point three and that's the above, so now we check this .6 in the rectangle. In this case, it's not. And it's still a left sway if it's to search the left sub tree a .6 and that one's empty and now we return and we're done. So we don't always go down just one branch if our splitting line hits a rectangle we have to go down both branches but still this is a very efficient algorithm, particularly think about the rectangle being small, it's going to be not that different than a regular search in a binary search tree. Alright. So what about the analysis of how long is this going to take? Well again, a typical case. A rectangle's small that we're only going to examine, really, a path of the three, maybe a couple of other nodes along the path. And the running time will be proportional to the number of points returned plus lg N. With geometric data the worst case can be bad. So, like all the points could be arranged in a circle. All, all different types of problems that might occur in, with some difficulties. It's possible to prove, that even if the tree is balanced, you can get a worst case proportional to square root of that. So analysis of 2D trees that the under scope. But, for many practical applications they are easily implement and worth using. Let's look at another using 2D trees to solve another problem, a so called nearest neighbor search. So now, instead of a rectangle, we have a query point. And our goal is to find the closest point to that point. So in this case our query point is, over here in green. And our algorithm's going to want to return to 0.5. That's the closest one to the query point. So let's see how that looks in a demo. So again, we start at the root. And wh at do we want to do? Well, we're going to check. And I, whenever we're at a node, it represents a point so we're going to check that point and we'll compute the distance from that point to our query point. And, if that distance is less than the best found so far, then we'll keep that as the champion. So the first point, that's the closest we've found so far to the query point. So we'll say, number one is the distance. And we'll only worry about the possibility of finding something closer, than that. And so just using that distance we recursively search, any part of the tree that could contain a closer point. And so that's well it continued to do so in this case the query point is to the left of the splitting line and will always go towards the query point first and so in this case we have to search both to see if there might possibly be a closer point than one over on the right if you come like straight across, there might be a closer point. We're going to have a look at both as far as we know now but we'll go towards. The query point and see if we can find something closer. So in that case now we go to .3. Compute the distance of that point to the query point. It's closer so we update three to be our new champion. So now we are going to look in parts of the tree that could give us a point that is closer to our query point then three. So already that would mean when we search the point one we wont search the right sub tree because there could be no point on the right sub-tree right of the splitting line. So lets closer to query point than three and so that idea getting closer and closer to the query point is going to cut out different parts of the tree as we process so, but anyway starting at point three as far as we know that we may have to look at both sub trees, so sometimes when we look at both sub-trees but as we get closer and closer we only look at one so lets look at point three now. So, again, go towards the query point. So we'll, go to the top first, and that takes us to six. Six is not any closer than three was. So that's not going to, update our champion. And so we'll search 6's left sub-tree which is empty which is right sub-tree and the nearest neighbor can't, we don't have to go down the right sub-tree of six because you can't have a point in that rectangle that's closer to the query point than three. So now we can return from that, And now we have to look at the bottom sub tree associated with three. And so that takes us to four, And that one is, not closer. So we still have three as our current champion. So now, we'll search the left subtree of four first because that query point is to the left of that splitting line. And that takes us to five and five is our new champion. So that's the closest point that we know about. Could there be a node that's closer to five, to our right query point than five in the right subtree of four? Oh. We have to go above. Sorry to look at the top sub-tree associated with five, and we find that it's empty. And now we're back at four. Do we have to search the right sub-tree of four? No, because there can't be a closer point, than five in the right sub-tree of four. So we're done with four, and we return to, come to three, and now we search the, suppose to search and return from there we are now at one, suppose to search the right subtree one next but we can term that nearest neighbor couldn't be in there. So, then we are done and we found that the nearest neighbor, is five. And this is going to be, very efficient because as we get closer and closer, the query point, we are cutting out all the subtrees that are away, and again in practice, the running time of this algorithm, is going to be close to logarithmic. So in, in typical cases that the running time of nearest neighbor search in a 2D tree is going to be proportion to logarithmic. It is possible to concoct cases where you are going to have to examine all the points for example if they're all arranged in a circle and your query points to the center or something of that sort. But for typical data it's very efficient. Now we're going to look at an application where we simulate a phenomenon in nature. And this is what kind of patterns do things like starlings and geese or cranes or, or fish or fireflies? How do they flock together. And we'll look at a simulation that corresponds to that. And then, when the moment is right, they behave in a way, that should be impossible. [music] And it happens everyday, right through the winter. Just a couple of miles from my doorstep. Help you desire. >> So to, there's a simple model developed by Craig Reynolds awhile ago for simulating this situation called the boid. And the idea is to use three simple rules to you get something very close to this complex flocking behavior. So, you have col, collision avoidance where you always try to point away from the K nearest boids. You have centering where you try to point near the center of mass of the K nearest boids, and velocity matching where you update your. Philosophy to the average of the k nearest boids. And that algorithm works like this. So as that example shows, 2D trees are extremely effective in quickly processing huge amounts of geometric data, and what's more, it expands to more dimensions. With a very simple modification we can take it to D tree and created data structure known as a KD tree, which even works for K dimensions and the idea is even if there is K dimension, what we will do is recursively partition one dimension at a time, so that's called a KD tree and we use the same ideas for two D trees, but instead of cycling through just horizontal vertical, we cycled through, however many dimensions there are, so its where in three space, we use a plane and do above and below and then simply cycle through the dimensions. At level I, we put on the left points whose I-th coordinates are less than P and on the right, we put the points to whose I-th coordinates are greater than P and at level in cycle three of the dimensions at the level I might k we just use that dimension of the point to do the comparison. The implementation is simple, ex cept for the comparison. And we get the same kind of partitioning for three dimensional data, so we could do boids in three dimensions or for databases with large number of dimensions. You could do even much higher dimensional data. And find nearest neighbors and do range searching extremely efficiently. It's a very efficient and simple data structure for processing K dimensional data that's very widely used and the whole idea is that data clusters, particularly, in high dimensions. And also one point to make for this class is that, this algorithm was discovered by an undergraduate in an algorithms class, so that's, John Bentley, who discovered this while an undergraduate at Stanford. And, so it's a simple idea that, but, experts scientists where struggling with, dealing with huge amounts of geometric data, and, Bentley found this way, to process it efficiently that's been widely used ever since. And in, in particular just as another example consider the idea of N body simulation which is a classic problem in physics. Where you've got N particles mutually affected by gravity and basically the computation is based on computing the interacting force for each pair of particles. And so then there'd be mutual gravitational pull. [inaudible] and this is what happens with a large number of particles in a certain simulation and people understand properties in the universe by coming up with, doing these kinds of calculations and comparing against what's observed in space. Now but the thing is for each pair of particles, so if you have M particles and you have to do it for each pair, that's M^2 so the progress of scientific investigation is going to be affected by how quickly, you can do this calculation for a large number of particles. There's a lot of particles out in the universe. And, you can't do a quadratic calculation for large N. So, another, undergraduate in an algorithms class discovered, this idea, for N body simulation. And that's, Andrew Appel. And his idea was that if some part. Particle is way away from som e cluster of particles, we can treat that cluster as a single aggregate particle. And not do the individual force calculation between our particle and every one of those in the aggregate. But use the center of mass. And you get a very accurate approximation to the N body doing that. And the algorithm that he used is based on 3D trees. With the N particles as nodes and storing the center of a mass of a sub-tree in each node. And then to compute the total force, traversing the tree of all the information that you need, to, complete the N body calculation. But in time, much closer to N lg N than to N^2. And that, idea that, you didn't need to take time proportional to N^2 but with a, a geometric algorithm, like a 3D tree, you could get the time to N lg N. That enabled, all sorts of new scientific investigation in this example of the use of algorithms to enable new