Tip:
Highlight text to annotate it
X
So far the only collection type we've dealt with was the list.
In this session we are going to make a tour of other kinds of collections which
differ from lists, both in the functionality and in their performance
profile. One thing will stay the same, however, all
the collections that we are going to study in depth are going to be immutable.
We're going to start in this session by looking at different kinds of sequences.
We've seen that lists are linear. Access to the first element is much faster
than access to the middle or end of a list.
Scala library also defines an alternative sequence implementation called a vector.
This one has a much more evenly balanced access pad on the list.
Vectors are essentially represented as very, very shallow trees.
To see how that works in detail, let's make a little drawing.
So, a vector of up to 32 elements is just an array, where the elements are stored in
sequence. Here, only draw four for simplicity, but
in practice, it would go up to 32. Now, if a vector becomes larger then 32
elements, its representation changes. What you do then is you would have a
vector is 32. Two arrays of 32 elements.
So again, I only always abbreviate to four.
And once that is exhausted. So once we have 32 times 32.
So that would be two to the tenth, 1024 elements full, then the representation
changes again, and it would then become a vector of Pointers to pointers to arrays
of 32 elements, so everything would become one level deeper.
You see what the principle is. So a vector with three levels then would,
would go up to two to the fifteenth elements.
A vector of four level two to the twentieth.
Five level would give you two to the twenty fifth and six levels would give you
two to the thirtieth. That is a billion elements and that's
about a far as it can go. So let's analyze how much time would it
take to retrieve an element at some index in that vector.
We've seen for lists, it real, very much depends on what the index is.
Fast for zero, slow, linearly slow for indices towards the end of the list.
For vector, vectors are much better behaved here, because, to get an index of
vector of length 32, it's a single index access.
If the vector has size up to about 1000, then it's just two accesses, so generally,
the number of accesses. Is our, the depth of the vector.
And we'll see that, that depth. Goes very slowly.
A depth of six gives you a billion elements.
So, generally, the formula would be that the depth of the vector is locked to the
bases of 32. Of N, where N is the size of the vector.
So we've seen that log to the base of 32 is a function that grows very, very
slowly. That's why vectors have a pretty decent
random access. Performance profile, much, much better
than lists. Another advantage of lectures is that they
are fairly good for bulk operations that traverse a sequence.
So such bulk operations could be for instance a map that applies a function to
every element or a fold that reduces aviation elements with an operator.
For a vector then you can do that in chunks of 32 and that happens to be
coincide fairly closely to a size of a cash line in modern processes.
So it means that all the 32 aviation elements.
Will be in a single cache line and that accesses will be fairly fast.
For lists, on the other hand, you have this recursive structure where essentially
every list element is in a console. There's just one element in the pointer to
the rest and you have no guarantee that these con cells are anywhere near to each
other. They might be in different cache lines, in
different pages. So the locality for list accesses could be
much worse than locality for vector accesses.
So you could ask, if vectors are so much better, why, keep lists at all?
But it turns out that if your operations fit nicely into, the model that you take
the head of the recursive data structure. That's a constant time operation for
lists. Whereas, for vectors you have to go down
potentially several lev-, layers. And then to take the tail to, process the
rest. Again a constant type operation for lists.
Whereas, for vectors, it would be much more complicated.
In that case, definitely if your access patterns have these recursive structures,
lists are better. If your access patterns are.
Typically bulk operations, such as Map, or Forte, or Filter, then a vector would be
preferable. Fortunately it's easy to change between
vectors and lists in your program because the two are quite analogous.
So we create vectors just like we create lists only we would be write vector where
we had written lists. And we can apply all the same operations
of lists also to vectors, map, fault, head, tail and so on except for.
Cons, because cons in a list, that's the primitive thing that builds a list and
let, lets us parametric against the list. Instead of a cons, vectors have operations
Plus column, which adds a new element to the left of the list, and column plus,
which adds an element to the right of the list.
So you see these here. X plus colon XS creates a new vector with
leading element X, followed by all elements of XS.
And XS colon plus X squared is a new vector with trailing element X, proceeded
by all elements of XS. So note that the colon always points to
where the collection is, where the sequence is.
So let's see what it would take to append an element to a vector.
Again, vectors like lists, are immutable. I can't touch the existing vector.
I have to create a new data structure. So what I would do is, I would take the
last array here and create a new one. Which contains, the given element.
If the vector is completley full I had, I would have to create a new level, but I
would assume there's still space left, in, in the arrays.
So that gives me a new array of 32 elements, which I then have to combine,
somehow, with the original vector. I can't change the pointer from the
original, root to this array, because that of course would change, the old vector.
So what I do instead is I cree, cre, create another copy here.
That points to this new element and it also points to the other elements that my
previous copy pointed to. And that then would replace this one here.
And finally I need to create another root which points to my new copy, and to the
other immediate descendants of the root, and.
That, finally would complete the construction.
So the new vector now is in red. Where as the blue one wasn't touched at
all. So if you analyze the complexity of that,
then we see that we have to create a new object, 32 bit or, 32 element array, for
every level of the vector, where we did the change here.
So in our case here, three of these arrays would be created.
Not as efficient as changing a thing in place, but we get something in return.
We get really two copies of the vector, that are both completely functional, and
they don't, they are not in each other's ways.
So the complexity again is here if you analyse it, again log 32N.
But now, it's object creation. So we create as many objects of, with 32
as we have levels in the tree. So vectors and lists are two
implementations of a concept of sequence. Which is represented, in fact, as a base
class of list and vector.So if you do a diagram of the collection classes.
Then what we would have here, here we have class list.
Here we have class vector. And the two are sub-classes of a class
called sequence. Now other collections in the Scala library
as well. So besides sequences we would also have
sets which very much resemble the sets that you had looked at in your various
homeworks with a bit more functionality. And we'll also see, a structure called a
map so these will be covered in future sessions.
Sequences sets and maps are all instances of a common base class called Iterable.
So that's the start of the hierarchy of Scalar collections.
In fact there are also some other things in the Scalar library that look like a
sequence. Arrays and strings both support the same
operations as sequences, and they can be implicitly converted to sequences where
needed. So to demonstrate that, let's go in the
worksheet. I've defined an array.
Again, analogous. I have just written array instead of list.
And if I do that, then the worksheet would respond that I have defined an array of
int. And, here are the elements of the array.
What I can then do. I can, for instance, apply a map function
just like I did apply a map for lists. And I would get the array with every,
element doubled. Another sequence like structure is a
string. So let's define one.
And we can then apply for instance, an operation like filter, which gives us All
the uppercase characters in the string. So this would give us just HW.
So you see that the usual operations on sequences like map, or filter, or fold, or
head, or tail, or take while, and so on. They all also work for a raise and
strings, which is quite handy sometimes. So going back to our diagram, we would
then have string. And array as further sequence-like
structures. I draw a dotted line because they're not
really subclasses of sequence. They cannot be that because both string
and the ray come from the Java universe. And, of course, a Java class that doesn't
know that at some future time somebody would define a class called Scalar of
sequence. Another simple and useful kind of sequence
is the range. A range simply represents a sequence of
evenly spaced integers. The three common operators to construct
ranges. I can write one, two, five and that would
give the range of elements one two, three, four, five.
I can also use one until five until operator is exclusive in the upper bound
so the sequence will only go from one to four.
And I can also vary the step value by the buy operator.
So, I could write one through ten by three, and that would give me the range of
one, four, seven, ten Or the step could also be negative.
So six to one pi minus two would give me the sequence six, four, two.
Of course, ranges are not represented like a rays or vectors as sequences of
elements. There's a much more compact
representation. All we need to store for a range is the
lower bound, the upper bound, and the step value, and these three values are just
stored as fields in a single range object. So coming back to my diagram then, I would
have one more implementation of sequence called a range.
So now that we have sequences, its time to look at some more operations that exist
uniformly for all sequences including lists and vectors and ranges.
The first operation is, exists so, excess exists with a private KP gives us true if
there is an element in the sequence excess, such that the private KP of its
host. Otherwise, it would give us false.
The dual exists for all, so that one would return true, if P holds for all elements
in the sequence excess. So if you look at the worksheet for
instance S exists, C is upper.
Would return true because in fact there are true upper case characters in the
string. Whereas if we ask whether for all
elements, character is an upper case character we would expect to see, false.
Because there're also lower case characters, in the string.
Another useful operation on sequences is one that takes two sequences and returns a
single sequence of pairs. Pairs of corresponding elements of the two
sequences. That operation is called Zip.
Like the zipper that takes two single strands and combines them into a strand of
pairs. So to try that out, lets create one
sequence. Val pairs, equals.
Let's create, lets say, a list of 123 zip well let's take our string S.
What would we get here? How we would get a list of integers and
characters that contains the three elements 1H, 2E, 3L.
So we have taken corresponding elements from the two sequences and put them in to
pairs of result list. The dual of zip is unzip so, if we do a
pair, a pairs that unzip. But what we will see now is a pair of two
lists. The first list contains the first half of
the pairs that we've seen, one, two, three and the second list contains the
characters from the second half of the pairs.
Good. So we've seen exist to unzip The next
useful function is called flat map, it takes a collection excess, and a function
F that maps each element of excess to a collection by itself.
And we would then concatenate all the results, collections into one large
collection. So let's see that an action on the
worksheet we could apply foreign flat map over a string, so it takes a character in
the string, and it would give us back lets say a period, followed by the character.
So each character in the string gets mapped to a list.
The flat map would then concatenate all of the list in the result collection.
Let's see how that would work. So what we end up with is, in fact,
another string that now has a period in front of every character of the original
string. The last group of operations I want to
cover are some utilities for ordered or numeric collections.
So if you have a collection of numbers, we can take the sum of the product of that
collection. And if you have a collection of ordered
elements, we can take the maximum or the minimum.
A quick test here. So excess.sum.
Would give us 50, and XS.max. Would give us 44 as expected.
So let's apply these new operations to some examples.
The first thing I want to do is I want to list all combinations of numbers x and y
where x is draw from one interval let's say from one to n and y is drawn from
another. So what I would do is I would cycle
through the list one to n and I would do a flat map with what would I have to do
here. Well it would be, I have to go now through
the second list one to n. And I need a map, so for every element in
one to N, I return a pair that consisting of X and Y, where X is drawn from the
first range and Y is drawn on the second range.
So the next example to look at is scalar product.
So, the scalar product of two vectors is sum of the product of corresponding
elements xi and yi of the two vectors. And we can take the mathematical
definition and map it directly to code, so what we do is we first zip up the excess
into y s, that brings corresponding elements together into pairs.
And then we have a map that performs the multiplication on each pair.
So we have a pair, xy, we pull out the first path, we pull out the second path,
and we multiply it. And finally, we need to take the sum of
the results of the multiplications. There's actually another way we can write
that using a pattern matching function value.
So instead of pulling out the elements of the pair with the selectors one and two.
I can also do, use pattern matching on a pair.
So here's how that would be written. I have, again the excess zip ys.
But then I map with a function that reads in braces, simply case xy.
So it's a pattern match against the pair. Which will always succeed by the way,
because I know that I get a pair. And then simply return X times Y.
So that shorthand is. So that generally is a short hand for a
match expression. So the function value that consist of one
or more cases in braces is actually exactly the same as the function value
that takes the parameter. And then matches on the parameter with the
cases as they are given. Of course the first version is shorter and
also clearer than the second one. So here's an exercise for you.
You know that a number is prime if the only divisors of the number are one and
the number itself. What's a good high level way to write a
test whether number is a prime number. For once I want you to value conciseness
over efficiency, so I want to express the test in the most abstract and mathematical
manner possible. Don't worry about its efficiency.
So you would have a test is prime it takes an int and returns a boolean and how would
we define it. So to answer that quiz, simply look at the
mathematical definition and translate it directly in to code.
So if a number is prime if the only divisors of the number are one and the
number itself. That means for all other numbers between
one and N that number is not a divisor of N.
So lets just express that here so all other numbers between one and N that would
be the range two until N. And then there's the for all.
That should hold for all of these numbers, aah, give it a name what's call it D.
So what should hold is that end modular D leaves a rest which is different from
zero. So that's a very high level and short
definition of what it means for a number to be a prime number.