Lec - 22 search trees

We have been discussing data structuring in the previous class and we found out that a large number of problems are pure data structuring problems, the most problems are mixture of algorithm design and data structuring. And we will continue our study of data structuring and to find out what are the efficient ways of solving a large number of problems. Data structuring has got two parts, one is the organization of the data elements that is what ever is the data how you are going to store the data, in what form and exactly in what structure and format. And secondly these data will be operated by a number of procedures. So the procedures are the functions which operate on the data must be efficient. Therefore data structuring inherently consists of two parts, one is the organization of the data elements and the other is efficient procedures for data manipulation. We saw a number of examples which were pure data structuring problems. For example we discussed complex numbers with operations add, subtract, read and write. We discussed the problem of representing a set of numbers with the operations insert a number, delete a number, find out whether it is a member of a set, is a set of sets. So insert a number in a set x, delete a number from a set, find out whether a number is a element of a set, make a union of two sets, make an intersection of two sets. So this problem is a pure data structuring problem. Similarly on vectors you can have several operations, on matrices you can have operations. We also discussed some other problems like the problem of making an appointment list, a merit list of students with add a student, delete a student, select the first or the best, select the last or the worst then printing the merit list. And in general you can have a big problem like a student information system where subjects can be added, deleted or updated, student information can be added, deleted or updated, registration for students in particular subjects can be updated. The grades students obtained in the subjects is the register, sgpa cgpa calculation, grade card preparation, merit list preparation. So these problems inherently require storage of some data and some functions or procedures which manipulate this data. So these are pure data structuring problems which we are currently discussing. So let us come to one problem which we saw in the previous class. We have seen some of them, though we have not seen all of them, we have not seen a problem like this but we saw how to make a simple problem, solve a simple problem like this using a linked list were to make an appointment, you would put it at the end of the list and to call the next person you would remove it from the beginning of the list then you would store this information in a linked list. We also saw how you will store complex numbers, how we would solve the problem of vectors and even discussed briefly this. We went into details with this and we saw that the complexity of each operation, the order of complexity of each operation has to be checked and we were interested in minimizing the worst case complexity. So we will just quickly review this problem of sets of numbers before we proceed on. So in sets of numbers the operations were insert an element into a set, delete an element from a set, whether an element is a member of a set, union of two sets, to form a third set, intersection of two sets, to form a third set. And the options that we saw, we discussed that we will have to implement it as a linked list. An array will not do because this is a dynamic structure where these elements due to insertion and deletion may vary from one to anything and we do not know the total size, therefore having a static structure was not possible. So we require a dynamic structure with mallocing of it and therefore we implement it as a linked list which conceptually looked like this. At any point of time the node structure was a data field and a link field and the linked list at any point of time there is something like this say for example 5 3 9 1 and you had a pointer to this node structure which indicated what is the start of the list and may be we had a pointer to indicate what is the end of the list. So this was the conceptual linked list structure. I hope you all remember what we were doing here, the c declarations etc, how to malloc 1, all this we have already discussed. Now in this linked list data structure, we saw for this particular operation we had two options. Now we are giving them names, the options were actually on the basis of how we store the list. The first operation was insert in the beginning of the list. The insert operation was implemented by inserting in the beginning of the list; the delete operation was implemented by scanning the list, searching where the element was and then deleting that element. Member were also scanning the list and finding out whether it existed, union of two was done by starting with two lists and in one list you hold the pointer to the element and in the other list you find out whether it exists or not. If it exists then you delete both of them and store it in the third list, otherwise you delete it only from the one list. And you continue this till you finish off both the lists, that is once you finish both list the other list can be just looked at the end. Should I repeat the whole thing again or is it okay? Okay. Student: Sir why do we disturb the two lists. No, we don’t need to disturb the two lists or whether we need to, we will come to it. Let’s take an example, okay. So let us work out the simple unordered linked list. Is this visible, is this color visible? I will change the color. In the unordered linked list the insert operation was insert in front, that was simple. The member operation: scan the full list, the delete operation was actually first do member and if it is present then delete. This part is okay, I don’t need to go into details. Let’s come to union. What was the union operation? We are giving two lists, so in our format the linked list structure of the two lists, any list is like this say 5 7 1 and they are just put in front because 1 came first then 7 then 5 so they are put like this. Now suppose we have 5, in one list one list 5 7 1 and in another list suppose we have 3 5 2 and 7. Now we are doing the union of these two. So how do we do a union? We start with this list, so with this list the first element if you take and we have put, making the union in the third set. So we start here and put this element in the third set and we progress this pointer but before this progressing the pointer here, we must ensure we cannot just put this followed by this because they are not disjoint. Since they are not disjoint, we cannot take this one and just follow it up with this one. So now there are 2 or 3 things we can do. One is we can first put this whole list up here all right and then we start with this one whereas take this option, it will be easier. So we take this one and put all of them then we start with the second list, we see whether this is present here or not. So we scan this, it is not present so we put it here then we come here, we scan this it is present here, so we don’t put it here. We proceed, again we scan this, it is present here no, put it here. Again we scan change the pointer, we scan this it is present so we don’t put it. So this is how we end up in doing the union. Can anybody tell me how you will do the intersection? How will we do the intersection? We take this list and put it here, take this one see if it is not there fine. Take this one, if it is there delete it sorry sorry take this out, see if it is already there. Okay let’s do it in another way. We will start from this one and see if it is here, if it is not present here we don’t put it. Take this one see if it is here, if it is here we put it. Take this one see if it is here, if it is not here we don’t put it. Take this one, we put it. So now we have defined our algorithms where the storage of the data is how we are storing the data? As a node like this and conceptually we are storing it in as in a sequence as and how they come. Now we have to analyze how efficient it is. This one takes constant term, so we say it is order one. This one is proportional to the size of the current list, so it is order n. This is also of order n. This you would, if you are intelligent you would put the larger value first and then start with the smaller one but on an average the smaller one can be half in the worst case. In the worst case the smaller one can be half the size and we would possibly have to scan all of them, so it would be n by 2 square. For each one we have to scan all the elements of the other one. It is n by 2 into n by 2, so this will come to order n square. Similarly this will come to order n square. If all of them occur with equal probability which let us assume is the case then order n square is the worst case complexity that we have obtained. What do we do in an ordered linked list? In an ordered linked list, we make certain modifications, we insert we always maintain the list in an ordered fashion. That is suppose first element entered is 5, the second element entered is 3 then this 3 will be put before 5. If the next element entered is 4 then it will be put here. If the next element entered is 7 it will be made, if the next number is 6 it will become and this structure obviously as you can imagine is of this form. So insert now puts it in the proper position, all right. Insert puts it in a proper position, so insert in proper place. Delete, you scan and once you find it, you delete it, member is similar, union and intersection. Can you tell me how you would do union here? Let us take the same set of elements 5 7 1, so 5 7 1 would now be stored as 1 5 7 and 3 5 2 7 would now be stored as 2 3 5 7. To the user it is a set, to the user it is just a set with these operations. This is how we are implementing it. We are implementing it as an ordered list but to the user it is a set. So how would we do a union? If you recall the previous discussion, in the previous review do it like you did in merge routine in merge sort which start with these two because when we do union also, we would also have to implement it in an ordered fashion, the final result would be ordered. So we can modify our merge routine of merge sort and just implement it. The smaller of the two is first put in, 1 and this is pointer is shifted and these two, 2, this is shifted. These two, 3, this is shifted, both are equal put 1 and shift both right that’s the only difference with merge sort. Both are same, put it here and shift both. Now here we have reached the end of both the lists, we might not have reached the end of both the list, we have reached the end of one list and not the end of other, we will just part it up at the end just like we do in merge sort. So insertion takes how much time? Order n. Delete: order n. Member: order n. Union takes how much time? Order n1 plus n2, the total size is n, so this is also of order n. And how would you do intersection? In the same way, you would see 1 and 2 and smaller one without writing you would push. Here without writing you would push. So let’s just quickly revise how we will do this one. You start here and here, 1 and 2, say they are different so the smaller one is pushed out, nothing is done. They are unequal, the smaller one is pushed. They are unequal the smaller one is pushed, they are equal this is written and both are pushed. They are equal, this is written and both are pushed. And if you reach the end of one list, you are over; you don’t need to continue with the rest. So this would also be order n. Now, see what we have done. Order one has been made order n that these two were order n, these two were n square have been reduced to n and if they are of equal probability, this is a better data structure in the worst case because you know this way you have equal probability but if you have a variation in probability and relationship that this occurs much much less than this. Then may be something else would have come out, you would have to do a deeper analysis. So in this data structure, for this problem among the two options may be we would choose this option. The question is are there other option or is this the best option. So we will just see whether there are other options and we will simplify the problem and we will remove this union and intersection and try to solve the problem in a simpler fashion. Then we will reintroduce these two and see whether we can get a better data structure. So the problem that we have in hand now is this. Design a data structure to represent and store a set of numbers with the following operations. Insert x, now in one set not in a set of sets because union and intersection are not there, we have simplified it to one set of numbers, inserting the z, delete from the z, print the largest in the set and member, whether x is a member of the set. So these three are same, print the largest, it does not affect our routine because in an unwanted list finding out the largest would have taken order n times but in an ordered list, finding out the largest will take constant time because the end pointer is there and you can immediately print the largest element. So we have got now several options here, the first option is unordered list, the second is ordered list. In an unordered list insert is constant, order n, order n. So order one, order n order n and order n and in an ordered list, it is order n, order n, order one and order n. Competitive, both are competitive and you would have just argue between the probable times insert comes and print, max comes and they are both equivalent in nature. Now let us see whether we can do something more interesting and I will try to show you by example how something more interesting can be done. Suppose the first number that comes to us is 15. In an unordered list, in an unordered list we would store 15 like this both unordered and ordered like this. So when the first number comes that’s all we do. Now suppose another number comes 3, these are insert operations I am giving you sequence of insert operations. So in an unordered list, in an ordered list and in something else what we would do? In an ordered list, we would write 3 15, in an ordered list 15. This is one after another like this and in unordered, no, no, let us see we are putting the smaller earlier and in the unordered list we are just inserting in the front. So first 15 comes then 3 comes, so 15 is put here then 3 is put. And I am trying to show you how you could do something else and in this something else we would put 15 here and this has got two pointers. This points to all elements which are less than it; this points to all elements which are greater than it. In the ordered list there is only one pointer pointing to all elements which are greater than it. In an unordered list this pointer had just got no meaning, it was just one after another but in an ordered list, this pointer had the meaning that the elements which are this side are all greater than the given element. Now I put in two, elements suppose I have x here. All those which are less than it will go this side; all those which are greater than it will go this side. So I would put this one here, right. Now suppose another element comes up 20, then in an unordered list 20 would come here. In an ordered list 20 would come here and here 20 would come here that is in your implemented structure, it would be… is this visible, right. Now suppose the member operation comes, here for three elements in the worst case for any member operation you would have to search for all the three, here also you would have to search all the three. But on an average here we would possibly terminate possibly earlier because a smaller element can come with a higher probability and all. Here all elements can be at the middle on an average but here some elements will be earlier, some elements here but here you would not have to make more than two comparisons. Here you would make at least 3 in the worst case. For any, suppose 15 could have a 3 here, if 20 came here you would have to make 3 here but here you would never make more than 2. How? You take, you read the first element. If the given element is greater you would only look this side, if it is smaller you would look only this side. So let’s put in another element, let’s put in 1. In the unordered list, it would now look 1 20 3 15. In the ordered list it would look like this and in this structure it would look like this. So, here 4 comparisons for finding. See deletion also requires finding, so the complexity of member and deletion are identical. Here you would require 4 and 4 in the worst case but here 3 in the worst case, right. Let us put in another element, say 25, 25 would come here, 25 would come here and 25 would come here. How would you insert? You would first find where to insert. If you have to insert 25 in this, say 15, 25 is greater than 15 so you have to insert here. You come to 20, 25 is greater than 20 so you have to insert here. So there is no element here, this pointer is null, so you insert here we will come to details of this. But I am trying to just give you a conceptual structure. Algorithms for insertion, deletion will come but we are trying to see what the structure looks like. Now here we have to make 5 comparisons in the worst case but here no more than 3 states. Now suppose 19 comes then 19 is stored here, here 19 would be stored somewhere here, right. In the ordered list 1, 3, 15, 19, 20, 25 and here 19 would be stored here. Now 6, 6 still 3. Now suppose the next element is say 9, the 9 would be put somewhere here, so the ordered list now looks 1, 3, 9, 15, 19, 20, 25, right. This is the sequence here 9 would be… You can see I have picked up an example where this looks always better. I am trying to show a good case for this but we will come to bad cases also but let us see the good cases first. So this is how you would store it. So compared to an unordered list and an ordered list, this one has got two pointers. One, unordered list the pointer has no meaning it just says it is there as another element that’s all. In an ordered list all the elements, this points to all the elements which are greater and in this structure it points to… So it is a two linked list but it is not linking in a chain, these are les, so the one which has got say this pointer has got elements less, this full size will be less than 15. All elements here will be greater than 15, similarly all elements here will be less than 15, all elements here will be greater than 3 but less than 15. Since this looks something like this, this is called a tree and since it has got a key value with elements to the left and the right less than unequal and it helps to quickly search elements, it is called a search tree. And since it has got two pointers, it is called a binary search tree, all right. So now we will investigate how good this structure is whether there is any strength of using this structure in our implementation of insert, delete, find, print max etc etc and we will see whether this structure is really good or whether this structure is not as good as this, in what situation is it worse, better. So we have got what is called a binary search tree, for example which we worked out will look something like this say I have just may be this was 19. All elements which are this side is less than equal to, all elements this side are greater than equal to. This is true for any node and the node structure diagrammatically is data and two link fields one is called the left and one is called the right, this points to the same node structure, this points to the same node structure. Now to find out an element, what do you do on this tree? What are the, how to find out an element? Given a tree, you start at the root; this is called the root of the tree. If you look it backwards, some people look at it like a tree but I don’t know why it is called a tree I mean just because it has some branches possible. This is called the root of the tree and to work out the simplest member relation, my routine is like this. I check whether the element at the root is equal to the element I want to find out, all right. First I check whether the pointer is null or not, if the pointer is null not available. So first thing you do is if pointer to that node is null, return no. Next is if p pointer data is equal to x, return yes. Can you tell me the rest of the two steps else if p pointer dot data is less than x then what do you do. If p pointer dot data is less than x, go to the right. That is the element which is less, which is more than this. So go to the right means what? What would you do, go to the right? You would call this on this node now, you can just recursively call it on this node, you can recursively call it on this p pointer dot right. You would move here, let us say will come to functions later on. You would move here and do the same thing. If it was greater that is p pointer dot data is greater than x then you would work out on p pointer dot left and continue and do the same thing. Now you can see one thing that here you would at least leave out some elements and in the best case you would leave out half the elements. That is half the elements this side would not be seen and only the other half you would work on, that is if you checked on this one, you would now be checking only on this side. Does this remind you of anything else? Suppose these were put on in an array 1 3 15 18 20 and 25, you would check in the middle, 1 3 sorry 9, you would check in the middle and if it is less you would look only this side otherwise you would look only this side. So it is, you are looking at it only less than side it is just like binary search. How would you insert an element? You would do a find to see if it exists, while doing the find you would reach a point where it either exists then you would not do anything, you would not insert it because it’s a set you wouldn’t duplicate. And if you say suppose you have given 8 then you would come here, come here, come here and see it is not there because this pointer is null, you just simply insert it here because this is where you are supposed to find it and since it is not there, this is where you insert it. Are you getting the point? Delete is slightly more involved, we will come to delete later on. Maximum, which is the maximum element? Right, right, right, go on moving to the right till you reach the end that is automatically the maximum value. Now what is the complexity of finding out, what is the worst case complexity of say find because member is the most crucial operation? What is the worst case complexity of the member? Now here what is the in a binary search tree, in terms of the tree can you tell me what is the worst case of the member operation? n by 2, log n, tell me in terms of the tree n, tell me in terms of the tree, give me a more accurate calculation. It is the length of the longest path from the root to the node. The worst case complexity of find is the length of the longest path from root to leaf. Now this length of the longest path will vary depending on the way data is entered. Suppose I give you 15 and then 12 and then 11 and then 10 and then 3 then it will just be like your ordered linked list and the length of the longest path will be n. So in the worst case it may be this, so all our operations insert is the length, proportional to the length of the longest path. We will see how delete will occur and it will be also be proportional to the length of the longest path. Find will be proportional to the length of the longest path etc etc etc. So, all of them will be proportional to the length of the longest path. On the other hand in the worst case it is order n. So now the next is how could we minimize the length of the longest path that would be, that is you would like to define a search tree in which you will be able to minimize the length of the longest path, so that if you could do that efficiently then we would get an algorithm which is better than ordered or unordered list. So we will see that in the next lecture.