1. Other Collections

So far the only collection type we've dealt with was the list. In this session we are going to make a tour of other kinds of collections which differ from lists, both in the functionality and in their performance profile. One thing will stay the same, however, all the collections that we are going to study in depth are going to be immutable. We're going to start in this session by looking at different kinds of sequences. We've seen that lists are linear. Access to the first element is much faster than access to the middle or end of a list. Scala library also defines an alternative sequence implementation called a vector. This one has a much more evenly balanced access pad on the list. Vectors are essentially represented as very, very shallow trees. To see how that works in detail, let's make a little drawing. So, a vector of up to 32 elements is just an array, where the elements are stored in sequence. Here, only draw four for simplicity, but in practice, it would go up to 32. Now, if a vector becomes larger then 32 elements, its representation changes. What you do then is you would have a vector is 32. Two arrays of 32 elements. So again, I only always abbreviate to four. And once that is exhausted. So once we have 32 times 32. So that would be two to the tenth, 1024 elements full, then the representation changes again, and it would then become a vector of Pointers to pointers to arrays of 32 elements, so everything would become one level deeper. You see what the principle is. So a vector with three levels then would, would go up to two to the fifteenth elements. A vector of four level two to the twentieth. Five level would give you two to the twenty fifth and six levels would give you two to the thirtieth. That is a billion elements and that's about a far as it can go. So let's analyze how much time would it take to retrieve an element at some index in that vector. We've seen for lists, it real, very much depends on what the index is. Fast for zero, slow, linearly slow for indices towards the end of the list. For vector, vectors are much better behaved here, because, to get an index of vector of length 32, it's a single index access. If the vector has size up to about 1000, then it's just two accesses, so generally, the number of accesses. Is our, the depth of the vector. And we'll see that, that depth. Goes very slowly. A depth of six gives you a billion elements. So, generally, the formula would be that the depth of the vector is locked to the bases of 32. Of N, where N is the size of the vector. So we've seen that log to the base of 32 is a function that grows very, very slowly. That's why vectors have a pretty decent random access. Performance profile, much, much better than lists. Another advantage of lectures is that they are fairly good for bulk operations that traverse a sequence. So such bulk operations could be for instance a map that applies a function to every element or a fold that reduces aviation elements with an operator. For a vector then you can do that in chunks of 32 and that happens to be coincide fairly closely to a size of a cash line in modern processes. So it means that all the 32 aviation elements. Will be in a single cache line and that accesses will be fairly fast. For lists, on the other hand, you have this recursive structure where essentially every list element is in a console. There's just one element in the pointer to the rest and you have no guarantee that these con cells are anywhere near to each other. They might be in different cache lines, in different pages. So the locality for list accesses could be much worse than locality for vector accesses. So you could ask, if vectors are so much better, why, keep lists at all? But it turns out that if your operations fit nicely into, the model that you take the head of the recursive data structure. That's a constant time operation for lists. Whereas, for vectors you have to go down potentially several lev-, layers. And then to take the tail to, process the rest. Again a constant type operation for lists. Whereas, for vectors, it would be much more complicated. In that case, definitely if your access patterns have these recursive structures, lists are better. If your access patterns are. Typically bulk operations, such as Map, or Forte, or Filter, then a vector would be preferable. Fortunately it's easy to change between vectors and lists in your program because the two are quite analogous. So we create vectors just like we create lists only we would be write vector where we had written lists. And we can apply all the same operations of lists also to vectors, map, fault, head, tail and so on except for. Cons, because cons in a list, that's the primitive thing that builds a list and let, lets us parametric against the list. Instead of a cons, vectors have operations Plus column, which adds a new element to the left of the list, and column plus, which adds an element to the right of the list. So you see these here. X plus colon XS creates a new vector with leading element X, followed by all elements of XS. And XS colon plus X squared is a new vector with trailing element X, proceeded by all elements of XS. So note that the colon always points to where the collection is, where the sequence is. So let's see what it would take to append an element to a vector. Again, vectors like lists, are immutable. I can't touch the existing vector. I have to create a new data structure. So what I would do is, I would take the last array here and create a new one. Which contains, the given element. If the vector is completley full I had, I would have to create a new level, but I would assume there's still space left, in, in the arrays. So that gives me a new array of 32 elements, which I then have to combine, somehow, with the original vector. I can't change the pointer from the original, root to this array, because that of course would change, the old vector. So what I do instead is I cree, cre, create another copy here. That points to this new element and it also points to the other elements that my previous copy pointed to. And that then would replace this one here. And finally I need to create another root which points to my new copy, and to the other immediate descendants of the root, and. That, finally would complete the construction. So the new vector now is in red. Where as the blue one wasn't touched at all. So if you analyze the complexity of that, then we see that we have to create a new object, 32 bit or, 32 element array, for every level of the vector, where we did the change here. So in our case here, three of these arrays would be created. Not as efficient as changing a thing in place, but we get something in return. We get really two copies of the vector, that are both completely functional, and they don't, they are not in each other's ways. So the complexity again is here if you analyse it, again log 32N. But now, it's object creation. So we create as many objects of, with 32 as we have levels in the tree. So vectors and lists are two implementations of a concept of sequence. Which is represented, in fact, as a base class of list and vector.So if you do a diagram of the collection classes. Then what we would have here, here we have class list. Here we have class vector. And the two are sub-classes of a class called sequence. Now other collections in the Scala library as well. So besides sequences we would also have sets which very much resemble the sets that you had looked at in your various homeworks with a bit more functionality. And we'll also see, a structure called a map so these will be covered in future sessions. Sequences sets and maps are all instances of a common base class called Iterable. So that's the start of the hierarchy of Scalar collections. In fact there are also some other things in the Scalar library that look like a sequence. Arrays and strings both support the same operations as sequences, and they can be implicitly converted to sequences where needed. So to demonstrate that, let's go in the worksheet. I've defined an array. Again, analogous. I have just written array instead of list. And if I do that, then the worksheet would respond that I have defined an array of int. And, here are the elements of the array. What I can then do. I can, for instance, apply a map function just like I did apply a map for lists. And I would get the array with every, element doubled. Another sequence like structure is a string. So let's define one. And we can then apply for instance, an operation like filter, which gives us All the uppercase characters in the string. So this would give us just HW. So you see that the usual operations on sequences like map, or filter, or fold, or head, or tail, or take while, and so on. They all also work for a raise and strings, which is quite handy sometimes. So going back to our diagram, we would then have string. And array as further sequence-like structures. I draw a dotted line because they're not really subclasses of sequence. They cannot be that because both string and the ray come from the Java universe. And, of course, a Java class that doesn't know that at some future time somebody would define a class called Scalar of sequence. Another simple and useful kind of sequence is the range. A range simply represents a sequence of evenly spaced integers. The three common operators to construct ranges. I can write one, two, five and that would give the range of elements one two, three, four, five. I can also use one until five until operator is exclusive in the upper bound so the sequence will only go from one to four. And I can also vary the step value by the buy operator. So, I could write one through ten by three, and that would give me the range of one, four, seven, ten Or the step could also be negative. So six to one pi minus two would give me the sequence six, four, two. Of course, ranges are not represented like a rays or vectors as sequences of elements. There's a much more compact representation. All we need to store for a range is the lower bound, the upper bound, and the step value, and these three values are just stored as fields in a single range object. So coming back to my diagram then, I would have one more implementation of sequence called a range. So now that we have sequences, its time to look at some more operations that exist uniformly for all sequences including lists and vectors and ranges. The first operation is, exists so, excess exists with a private KP gives us true if there is an element in the sequence excess, such that the private KP of its host. Otherwise, it would give us false. The dual exists for all, so that one would return true, if P holds for all elements in the sequence excess. So if you look at the worksheet for instance S exists, C is upper. Would return true because in fact there are true upper case characters in the string. Whereas if we ask whether for all elements, character is an upper case character we would expect to see, false. Because there're also lower case characters, in the string. Another useful operation on sequences is one that takes two sequences and returns a single sequence of pairs. Pairs of corresponding elements of the two sequences. That operation is called Zip. Like the zipper that takes two single strands and combines them into a strand of pairs. So to try that out, lets create one sequence. Val pairs, equals. Let's create, lets say, a list of 123 zip well let's take our string S. What would we get here? How we would get a list of integers and characters that contains the three elements 1H, 2E, 3L. So we have taken corresponding elements from the two sequences and put them in to pairs of result list. The dual of zip is unzip so, if we do a pair, a pairs that unzip. But what we will see now is a pair of two lists. The first list contains the first half of the pairs that we've seen, one, two, three and the second list contains the characters from the second half of the pairs. Good. So we've seen exist to unzip The next useful function is called flat map, it takes a collection excess, and a function F that maps each element of excess to a collection by itself. And we would then concatenate all the results, collections into one large collection. So let's see that an action on the worksheet we could apply foreign flat map over a string, so it takes a character in the string, and it would give us back lets say a period, followed by the character. So each character in the string gets mapped to a list. The flat map would then concatenate all of the list in the result collection. Let's see how that would work. So what we end up with is, in fact, another string that now has a period in front of every character of the original string. The last group of operations I want to cover are some utilities for ordered or numeric collections. So if you have a collection of numbers, we can take the sum of the product of that collection. And if you have a collection of ordered elements, we can take the maximum or the minimum. A quick test here. So excess.sum. Would give us 50, and XS.max. Would give us 44 as expected. So let's apply these new operations to some examples. The first thing I want to do is I want to list all combinations of numbers x and y where x is draw from one interval let's say from one to n and y is drawn from another. So what I would do is I would cycle through the list one to n and I would do a flat map with what would I have to do here. Well it would be, I have to go now through the second list one to n. And I need a map, so for every element in one to N, I return a pair that consisting of X and Y, where X is drawn from the first range and Y is drawn on the second range. So the next example to look at is scalar product. So, the scalar product of two vectors is sum of the product of corresponding elements xi and yi of the two vectors. And we can take the mathematical definition and map it directly to code, so what we do is we first zip up the excess into y s, that brings corresponding elements together into pairs. And then we have a map that performs the multiplication on each pair. So we have a pair, xy, we pull out the first path, we pull out the second path, and we multiply it. And finally, we need to take the sum of the results of the multiplications. There's actually another way we can write that using a pattern matching function value. So instead of pulling out the elements of the pair with the selectors one and two. I can also do, use pattern matching on a pair. So here's how that would be written. I have, again the excess zip ys. But then I map with a function that reads in braces, simply case xy. So it's a pattern match against the pair. Which will always succeed by the way, because I know that I get a pair. And then simply return X times Y. So that shorthand is. So that generally is a short hand for a match expression. So the function value that consist of one or more cases in braces is actually exactly the same as the function value that takes the parameter. And then matches on the parameter with the cases as they are given. Of course the first version is shorter and also clearer than the second one. So here's an exercise for you. You know that a number is prime if the only divisors of the number are one and the number itself. What's a good high level way to write a test whether number is a prime number. For once I want you to value conciseness over efficiency, so I want to express the test in the most abstract and mathematical manner possible. Don't worry about its efficiency. So you would have a test is prime it takes an int and returns a boolean and how would we define it. So to answer that quiz, simply look at the mathematical definition and translate it directly in to code. So if a number is prime if the only divisors of the number are one and the number itself. That means for all other numbers between one and N that number is not a divisor of N. So lets just express that here so all other numbers between one and N that would be the range two until N. And then there's the for all. That should hold for all of these numbers, aah, give it a name what's call it D. So what should hold is that end modular D leaves a rest which is different from zero. So that's a very high level and short definition of what it means for a number to be a prime number.