Tip:
Highlight text to annotate it
X
For expressions look somewhat like an embedded query language.
This session is going to explore the details of the relationship between for
expressions and database queries. The for notation that you've seen in the
last session is essentially equivalent to common operations on query languages for
databases. Languages such as SQL or XQuery, say.
To see this, suppose that we have a database of books, which is represented as
a list of books. Each book we model as a case class with a
title, which is a string, and the authors, which would be a list of string.
Here's a mini database that we would, will fit comfortably in memory.
So it's just a list of five books and each book has title and authors.
What I've done here is actually a, I've used name parameters, which you can do in
Scala. So, because book had title and authors, I
could have written just book and then a string and a list.
But you can also write book where title equals the string, and authors equals the
list, and often that's clearer. So I've, I've picked that here, so to make
it clear what was the title and what were the authors.
So now that we have this database, let's run some queries over it.
First query would be, find the titles of books whose author's name is Bird.
So what, a query like that could be written like you see here.
We let b range over books, we let a range over the authors of the book b, and we
demand that a starts with Bird and a comma.
So that would be, the author's last name is Bird.
So that would be the author's last name is Bird.
And, for each of these books we want to produce the title of book.
Or, to find all books which have the word program in the title one way we could
write that is, again, we let b range over the books, and we'd ask whether b.title
has the word Program in it a way to achieve that would be to use Java's
indexOf function, which produces the index of the substring if it appears and -one if
it doesn't. For all the books where this condition is
true, we yield again the title. So let's do one of these queries on the
work sheet. I have here the first kind of query, where
I demand now all the authors whose name starts with Bloch. And if I run that,
then, what I would get back here is the two books in my list that Joshua Bloch has
written, Effective Java and Java Puzzlers. A slightly more involved query is this one
here. We want to find the names of all authors
who have written at least two books that are present in the database.
So way to do this would be to, actually we have two iterators ranging over the books
by database. So we let b1 range over books, and b2
range over books, and we demand that b1 and b2 are different.
Now we have pairs of different books, and we let a1 and a2 range over the authors of
these pairs. And if we find a match, so if we find an
author that appears in the authors lists of both b1 and b2, then we find, have
found an author that has published at least two books, so we give it back as a
result. If we do that then I've, I've put the
query that you've seen on the slide up here.
Let's see what the result is. Well, we get Joshua Bloch, that's fine,
but we actually get him twice. So why do solutions show up twice?
The reason is that we have two generators that both go over books, so each pair of
book will show up twice once with the argument swapped.
So, for instance here with the books, we would have one pair that would read
Effective Java, That was one of the books Josh wrote,
And the other which read Java Puzzlers. And then we would have another pair where
the two were swapped. So that's why we get the, the same couple
of books in two pairs and why the solutions show up twice.
How can we avoid this? Well, one easy way to avoid that would be
to say, instead of just demanding that the two books are different, we demand that
the title of the first book must be lexicographically smaller than the title
of the second book. So that would mean that in our previous
one we would get Effective Java and Java Puzzlers as before, but we wouldn't get
the pair in reversed orders, because in lexicographical order, Effective Java
comes before Java Puzzlers. Let's see what happens if we do that
change. So we say b1.title < b2.title.
And what we get ready is indeed just a single author,
But, are we done yet? A question for you. What happens if an
author has published three books? Is the author still printed once, as we
desire, or is it printed twice, or maybe three times, or maybe the author is not
printed at all? Make your choice.
To find out what the solution is, let's add a third book that's also published by
the same author, Effective Java two and see what would
happen. Well, we see that the same author is
printed three times. Why is that?
And obviously the problem now is that even with this added condition, we have three
possible pairs of books. So if we have a book B1, B2, B3 all
published by the same author, then you have three possible pairs of two books out
of these three, and for each of the three possibility the same author will be
printed, So the author is printed three times.
What can we do about this? How can we avoid printing the author
several times? Well, one solution would be to remove
duplicate authors who are in the result list twice or several times.
There's a function for this, it's called distinct,
It works on all sequences, and we will simply remove duplicate elements from the
sequence. Keep the first one, remove the other ones.
So, one thing we could do is we could take the query that we've seen here, put it in
braces or parentheses, and call distinct on the result set and that would do the
trick. On the other hand,
Maybe this these problems are a sign that we have started off with the wrong data
structure. Remember,
That we have written a database as a list of books.
In actual databases, actually the order in which the rows the books appear shouldn't
matter. So databases are much more sets of rows
than lists of rows and sets have the advantage that duplicates are eliminated
by design. So, let's try this.
Let's make books a set of rows and then, yes, indeed you will see that the result
set consists again of a set of just the single author like what we wanted.
Okay. Good.