ALFRED FULLER: So, my name is Alfred Fuller, and I'm going to
talk about Next Gen Queries.
So who am I?
I'm the software engineer on the Datastore App Engine Team.
I work mostly with the Query Planner.
I helped remove the thousand-entity limit
for queries, I added --
ALFRED FULLER: Thank you.
I added cursors and prefetching, and I've done a lot
of work in the Python and Java SDKs around the datastore, so
I'm familiar with both.
So if you want to follow along with the wave notes,
you all know the drill.
The bitly link is NextGen/Queries/Wave.
I would also highly recommend looking at some past talks.
I'm gonna briefly go over some concepts that were gone over in
detail, especially in Building Scalable, Complex Apps on App
Engine by Brett Slatkin last year, and also there's Under
the Covers of the Google App Engine Datastore by
Ryan Barrett, which is excellent as well.
So the first thing I'm going to do is overview some of
the features we're adding.
This is the largest set of features we've added to
the datastore query engine since its launch, so these
are fairly substantial.
I'm going to go over some background of how we run these
queries and then talk about the technology we use to improve
upon what we already have, and mainly this is Zigzag Merge
Join and what we call MultiQuery.
And then I wanna talk about some -- I'll give you a few
examples, which are probably not very practical, but they're
very interesting, so I'll let you guys put them to practical
use, and then talk about some corollaries that came out of
some of the same technologies that went into adding
these features.
So the current system is incredibly scalable.
It's also schema-less so it's incredibly flexible.
It does the scaling automatically, and with that,
we are able to perform many query operations such as
equality filters, inequality filters, sorts, and through the
magic of composite indexes, we're able to do many of
these on the same query.
These usually require a composite index whenever you're
trying to do a combination of these filters and you have to
plan for those ahead of time, but you can do a lot of
different things with these.
They're very flexible.
And everything's combined with a logical AND operator, except
in the case when there are no sorts and they're all equality
filters, then it's free rein.
You can do whatever you want, although you might want to add
composite indexes if you want to make it faster and
give some read-and-write performance tradeoffs.
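The capabilities just described -- equality filters, a single inequality, and sort orders, all combined with logical AND -- can be modeled with a toy in-memory evaluator. This is only a sketch of the visible behavior, with made-up entities and property names; it is not the real datastore, which serves these queries from indexes rather than by filtering in memory.

```python
# Toy model of the current query capabilities: equality filters,
# at most one inequality property, and a sort order, all ANDed.
def run_query(entities, eq=None, ineq=None, order=None):
    results = [e for e in entities
               if all(e.get(k) == v for k, v in (eq or {}).items())]
    if ineq:                       # single-inequality-property limit
        prop, op, val = ineq
        ops = {'<': lambda a, b: a < b, '>': lambda a, b: a > b}
        results = [e for e in results if prop in e and ops[op](e[prop], val)]
    if order:
        prop, direction = order
        results.sort(key=lambda e: e[prop], reverse=(direction == 'DESC'))
    return results

photos = [
    {'owner': 'alfred', 'date': 3},
    {'owner': 'alfred', 'date': 1},
    {'owner': 'ryan',   'date': 2},
]
# SELECT FROM Photo WHERE owner = 'alfred' ORDER BY date DESC
recent = run_query(photos, eq={'owner': 'alfred'}, order=('date', 'DESC'))
```

In the real datastore, combining an equality filter with a sort order like this is exactly the case that requires a composite index planned ahead of time.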
So how are we improving this?
First thing we're going to do is we're going to add support
for arbitrary query logic.
So AND, OR, NOT subexpressions, so you can use the whole
gamut, everything that you can do, you know, with your
SQL queries and stuff.
I know that we announced that today -- or yesterday.
This is definitely different from what you
expect in SQL.
There are still limitations, especially the
single-inequality filter limitation.
But we're not leaving the datastore behind, especially
since it's what can automatically scale, you
know, with no supervision, which is why we like it.
We're also adding first-class support for domain-specific
queries, such as Geo-Queries, Date Range queries, and this
can actually be extended to do multiple-inequality filters on
numeric properties, or anything that can be normalized
to zero and one.
The main thing, the thing that makes this all possible, is
improvements to the composite indexes.
So we're gonna reduce the requirements on
composite indexes.
We're going to eliminate the exploding index problem.
I don't know, how many of you run into the
exploding index problem?
[APPLAUSE]
ALFRED FULLER: And we're gonna have fewer indexes serve a lot
more queries so you can really balance this, and this makes
composite index selection kind of a dark art.
Not that it wasn't already, but even more so.
And we really want to provide tools to help you do this, but
right now, I think you're gonna have to rely on this talk, so
you can refer to it later on YouTube.
And the composite index selection really provides the
wrench to decide between space and speed, between
write-and-read latency and costs, where you want to spend
this based on what your application does.
And first and foremost, although last in this overview,
none of these result sets were pulled into memory at any
time during the making of these features.
So these are scalable, they do use low resources, and we're
very happy about that.
Quick disclaimer: So, unfortunately, this is Next Gen
Queries, so it's not in the current release or
the next release.
We're rolling out these features in pieces.
The infrastructure's there, but we have -- it takes time
to expose them to users.
That being said, you can feel free to use all your Bluetooth
devices and cell phones because this is only a presentation.
Although some of the back-end changes are available today,
and I'll go over exactly what those are, and those are the
ones that are likely to come out first, so the exploding
index problem should be solved as soon as I can get to it.
So throughout this presentation, I'm gonna be
talking -- or using the model Photo, which has an owner, the
person who posted the picture; a tag, which is a multivalued
property, or a list of user-assigned tags; people,
which can be either user-assigned or from facial
recognition; location, which is where the photo was
taken; and the date.
And this is what I normally look like over there when I'm
not trying to be presentable.
I was told I was the only one with a tie today, so
maybe I went overboard.
Not the pug, not the pug.
That's my wife's cruel joke again, our dog.
So for some background, what we use to solve
every query are indexes.
These contain the index data, which is a set of ordered
values that, when they are placed in the index, are
sorted lexicographically.
So it's a sorted collection of the index data, and so this is
exactly -- or an example of what the -- silent
mode, please.
That was my wife.
I think she heard the comment about the dog.
So this is the built-in property index.
So it has the kind, the name of the property, the value of the
property, and the key of the entity that it's referring to.
This is what a composite index looks like.
It has the index ID of the composite index that it's
using, a composite value, and then the key.
So queries are converted into scans over these indexes
through the Query Planner, and this is done by splitting the
index data into two different pieces: a prefix that's held
constant and a postfix that's different for each result.
So all the equality filters go into the prefix, the
sort orders go into the postfix, and inequality filters
are just range restrictions on that postfix, which is
why we only allow one.
And so for an easy example: SELECT FROM Photo WHERE tag
equals "family." We decide to split the index data
between the value and key.
We apply a post -- or prefix constraints of kind equals
photo, name equals tag, value equals "family," and
this assures that we only see index rows that refer to photos
that have these properties.
Then we let the key vary.
We use this key to look up the actual entity, and we return
the photo to you as a result set.
So, as you can see, since this is lexicographically sorted,
these photos are actually coming out in postfix ordering
in this case, which is key ascending.
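The scan just walked through can be sketched with a sorted list standing in for the built-in property index. The row layout matches the one described above -- kind, property name, value, entity key -- but the keys and values are illustrative assumptions:

```python
# Each built-in index row is (kind, property name, value, entity key),
# and the index keeps its rows lexicographically sorted.
index = sorted([
    ('Photo', 'tag', 'family',  'key2'),
    ('Photo', 'tag', 'outside', 'key2'),
    ('Photo', 'tag', 'family',  'key5'),
    ('Photo', 'tag', 'beach',   'key9'),
])

def scan(index, prefix):
    """Hold the prefix constant; return the varying postfixes in order."""
    n = len(prefix)
    return [row[n:] for row in index if row[:n] == prefix]

# SELECT FROM Photo WHERE tag = "family": the postfix is just the key,
# so results come back in key ascending order.
postfixes = scan(index, ('Photo', 'tag', 'family'))
```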
So if we wanted to change that ordering, we wanted to SELECT
FROM Photo ORDER BY date DESCENDING, we would choose a
slightly different separation of these index data.
We subdivide it between the name and the value.
We apply the prefix constraints of kind equals photo, name
equals date, which this is an implied existence filter, so if
it doesn't have a date, it won't show up in this
index, and then we let the postfix vary.
And you can see that if we do the encoding correctly, we
encode this so that date descending becomes the
lexicographic order.
When we pull off the key, we can look at the photo, and we
get everything with the postfix ordering of date descending,
key ascending, which is how we serve this query.
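One standard trick for making a descending sort lexicographic -- shown here as an assumption for illustration, since the talk doesn't spell out the datastore's actual byte encoding -- is to complement each byte of the ascending encoding:

```python
# Complementing each byte reverses the lexicographic order, so a
# byte-wise ascending scan of the encoded values yields dates descending.
def encode_desc(value, width=4):
    return bytes(255 - b for b in value.to_bytes(width, 'big'))

def decode_desc(encoded):
    return int.from_bytes(bytes(255 - b for b in encoded), 'big')

dates = [20100519, 20090101, 20100101]
in_index_order = sorted(encode_desc(d) for d in dates)
recovered = [decode_desc(e) for e in in_index_order]
```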
We get a little bit more complicated.
SELECT FROM Photo WHERE tag equals "family" AND
tag equals "outside," ORDER BY date.
Now this requires a composite index so the first thing we
have to do is we specify a composite index on
tag, tag, and date.
We choose to split this index data in the middle of the
composite value between the last tag and date.
We set our prefix constraints tag equals "family"
and tag equals "outside," and we
let the postfix vary.
So we have a postfix ordering of date descending, key
ascending, and we get the photos in this order.
Are you still with me?
Good.
Awesome.
So we expand on this in two ways, one of which is
with Zigzag Merge Join.
So currently, this is how it operates.
Zigzag Merge Join can efficiently find postfixes
common to multiple scans.
It produces intermediate false results because it does two
scans or multiple scans to find the postfixes that are in
common, but it knows how to skip past sections of large,
not matching results so it tends to scale in the size of
the result set instead of the size of each
individual subscan.
So say we have photos that are tagged with dogs, photos that
are tagged with cats, and we would want to find photos that
show dogs and cats living together.
Well, Zigzag Merge Join can efficiently find when that
correlates with mass hysteria.
So here's an example of Zigzag Merge Join.
SELECT FROM Photos WHERE tag equals "family" AND
tag equals "outside." Notice these are all
equality filters here and there's no sort order, so we can
service this query with Zigzag Merge Join with
only the built-in indexes.
So we're gonna run two queries, or two index scans, I mean, on
the property index, one -- OK, we'll split these index data in
the same place for each query between value and key.
We'll apply a prefix constraint of kind equals photo, name
equals tag, value equals "family" for the
first scan, and we'll apply a prefix constraint of kind
equals photo, name equals tag, value equals "outside."
So the first scan will give us all photos that are
tagged with family.
The second scan will give us all photos that are
tagged with outside.
Then we apply a postfix constraint using Zigzag Merge
Join to say that the first scan's key must equal the
second scan's key for there to be a match.
So they become one.
We run the scan through the magic of Zigzag Merge Join,
which is gone over in detail in Brett Slatkin's talk, and we
have our photos that match this query in key ascending order.
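The merge itself can be sketched as a skip-based intersection of sorted scans. This is a simplified in-memory model of the idea, not the production implementation, and the keys are made up:

```python
import bisect

# Zigzag merge join sketch: intersect sorted scans by repeatedly
# seeking every scan to the current candidate postfix, so large runs
# of non-matching rows are jumped over rather than enumerated.
def zigzag_intersect(*scans):
    if any(not s for s in scans):
        return []
    results = []
    candidate = scans[0][0]
    while True:
        advanced = False
        for scan in scans:
            i = bisect.bisect_left(scan, candidate)  # seek >= candidate
            if i == len(scan):
                return results                       # a scan is exhausted
            if scan[i] > candidate:
                candidate = scan[i]                  # zigzag: jump ahead
                advanced = True
        if not advanced:                             # all scans agree
            results.append(candidate)
            i = bisect.bisect_right(scans[0], candidate)
            if i == len(scans[0]):
                return results
            candidate = scans[0][i]

family  = ['key1', 'key3', 'key5']           # keys tagged "family"
outside = ['key2', 'key3', 'key5', 'key7']   # keys tagged "outside"
both = zigzag_intersect(family, outside)
```

Because each seek can skip a whole run of non-matching rows, the work tends to scale with the size of the result set rather than the size of each subscan, which is the property the talk emphasizes.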
MultiQuery is the second technology we use to
expand upon this.
MultiQuery combines multiple-query result sets
using a priority queue based on the orders of the query itself,
and I've nearly optimized it to avoid the need for this
priority queue when possible, and it currently just
supports IN and NOT EQUAL.
This is how we implemented these two.
So if you look at a query where tag IN "family" or
"friends," ORDER BY date DESCENDING, this query is
supposed to return any photos that are either tagged
family or friends.
We split into two different queries where one has all the
family and the other one has all the ones that are tagged
with friends, and each result -- a result from either one of
these queries is a correct result and is returned to the
user, and we maintain the ordering because both of these
subqueries have the same ordering that was on the
original query, and so we only need to see the first result,
the top result on each query to merge these result sets
together appropriately.
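That merge step can be sketched with a heap merge over the subquery streams. The sort keys here encode date descending as negated dates, and the dates and keys are illustrative assumptions:

```python
import heapq

# Each subquery's results arrive already sorted on the shared postfix,
# so only the head of each stream needs to be examined to merge them.
family  = [(-20100518, 'key2'), (-20100101, 'key7')]  # tag = "family"
friends = [(-20100519, 'key4'), (-20100101, 'key7')]  # tag = "friends"

merged, seen = [], set()
for _sort_key, entity_key in heapq.merge(family, friends):
    if entity_key not in seen:    # de-dupe: a photo can match both tags
        seen.add(entity_key)
        merged.append(entity_key)
```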
The NOT EQUAL, on the other hand -- I've also recently
expanded this, at least in Java, sorry, to support
multiple NOT EQUAL expressions as long as they're
on the same property.
But this is actually expanded into inequality filters.
And so WHERE tag is less than "beach," WHERE tag is
greater than "beach" AND less than
"coworkers," and WHERE tag is greater than
"coworkers." Now, hopefully, you see a
problem with this.
This actually does not do what you think it does.
It doesn't return photos that are not tagged with beach and
are not tagged with coworkers.
It returns -- these are multivalued properties, and if
you look at this closely, it returns photos that have a tag
other than coworkers or beach.
So any other tag, and that result will be
in your result set.
This does the intuitive thing when it's the single-valued
property, but when it's a multivalued property, it
has an unintuitive result.
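A tiny model of that pitfall, with illustrative tag values:

```python
# The datastore expands tag != "beach" into tag < "beach" OR
# tag > "beach", and a multivalued property matches a filter if ANY
# of its values satisfies it.
def matches_not_equal(tags, excluded):
    return any(t < excluded or t > excluded for t in tags)

# Intuition says a photo tagged "beach" should be excluded, but
# because it also carries another tag, the expansion matches it.
surprising = matches_not_equal(['beach', 'sunset'], 'beach')   # True
single     = matches_not_equal(['beach'], 'beach')             # False
```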
So the Next Gen features.
Was that clear?
Is everybody caught up?
OK, good.
So the first thing I did, or we did, was add sort
to Zigzag Merge Join.
So right now, Zigzag Merge Join doesn't work when
you have a sort order.
Well, it's pretty simple to expand Merge Join from looking
at just keys to looking at the entire postfix.
And so we've done that.
And so here's an example of the same query I showed you earlier
that required the index on tag, tag and date.
But now we can use Zigzag Merge Join.
We first create an
index on tag and date.
We do two scans instead of one.
We divide it between the tag and date, as I previously
showed you, and then we apply our prefix constraints.
The first scan is for "family." The
second scan is for "outside." We apply
Zigzag Merge Join, which now can merge on both the date
and the key, so basically the entire postfix,
and then we let that vary.
We pull off the key.
We get the photos in date descending and key ascending.
Now this is an essential difference: because tag is a
multivalued property, tag, tag and date is an exploding
index, while tag and date is not.
Another thing we've added, will add -- sorry, it's not
out yet -- is OR and NOT.
So this is how AND works.
The postfix must equal.
And OR, for this Venn diagram, we want everything that's in
either of these circles, so we just don't have a constraint.
Everything that's in either scan matches.
But Zigzag can also be very efficient here in which we
never look at the same key twice.
So it scales on the size of the blue region, not on the size
of the individual circles.
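That OR behavior can be sketched as a sorted-merge union that visits each key once. Again this is a simplified model with made-up keys, not the production scan:

```python
# Union of two sorted scans: every key in either scan is emitted, and
# a key present in both is looked at only once, so the work scales
# with the size of the union (the blue region), not the circles.
def zigzag_union(a, b):
    i = j = 0
    out = []
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i]); i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j]); j += 1
        else:                        # same key in both scans: emit once
            out.append(a[i]); i += 1; j += 1
    return out

either = zigzag_union(['key1', 'key3'], ['key2', 'key3', 'key5'])
```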
NOT, on the other hand, is a little more complicated.
We say one scan's postfix does not equal a postfix
of another scan's.
This is equivalent to set subtraction.
So if I want to do NOT just some subexpression, I would
take everything NOT in that subexpression.
Everything AND NOT(a).
Note, though, that this is not converted into an
inequality expression.
It is not a greater than or a less than.
This is still an equality.
So it can be put in the prefix, thus you can have a sort order
while doing NOT in this case.
And it returns intuitive results for multivalued
properties, because this is now set subtraction so it will
do exactly what you think it will do.
If the photo is tagged with "coworker" and
"outside," it won't show up in the result set even
if it has another tag on it.
So here's an example of actually how we
do this efficiently.
So a and NOT(b), if a is a result set that contains 2, 3
and 5, and b is a result set that contains 1, 3 and 4, we
expect the result to be a set of 2 and 5.
So the algorithm pulls the first value off of a, which is
2, looks for a value in b that is greater than or equal to 2,
finds 3, so 2 matches because 3 is greater than 2; pulls the
next result off of a, finds 3.
We already pulled 3 off of b, so we know that 3 doesn't match.
Then it pulls the next result off of a, which is 5.
We look for a result in b that is greater than or equal to 5.
We find none.
Five matches.
We pull the next result off of a, we find none.
That's the end of our result set.
So we've only looked at four out of the six keys to produce
two results, but each one of these keys that we skipped over
in b can be a large region of nonmatching results.
So this does tend to scale in a size of the result set, and
it becomes very efficient, especially when combined with
many more merge constraints.
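The walk just described can be sketched in code as a skip-based set subtraction over sorted scans:

```python
import bisect

# a AND NOT(b) over sorted scans: for each candidate from a, seek b
# forward to the first value >= candidate; whole runs of non-matching
# b values are skipped in a single jump.
def zigzag_subtract(a, b):
    out, j = [], 0
    for value in a:
        j = bisect.bisect_left(b, value, j)   # seek b >= value
        if j == len(b) or b[j] != value:      # no match in b: keep it
            out.append(value)
    return out

# The example from the talk: a = {2, 3, 5}, b = {1, 3, 4}.
result = zigzag_subtract([2, 3, 5], [1, 3, 4])   # [2, 5]
```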
So we can go crazy with this.
You can even go -- this is actually not that crazy,
now that I look at it.
So ((a AND b) OR c) AND NOT(d).
This, when all combined together, runs the
same algorithm.
It's pushed through doing the same process at the same time
so it does scale in the size of the blue area, not the area of
any of the sub -- of the other combinations or of the result
sets from any of those.
So this does scale and you can get it as complicated as you
want, and it will still tend to scale in the size
of the blue area.
So another thing that we're doing is we're adding
OR to MultiQuery.
So WHERE tag IN "family" or
"friends" OR person IN some list
of people, ORDER BY date DESCENDING, we'll split out,
and you can see it's actually almost a trivial change where
we just split, or we start saying person equals.
As long as they're sorted by the same sort properties, the
postfix is the same, thus we can order them properly when
we merge them using the priority queue.
Wow, that was awesome.
So if you look at this, you notice that this actually has
no false positive results.
In fact, MultiQuery never has a false positive result.
Every result that MultiQuery returns is a result.
But it does have a lot of duplicates, or it could have a
lot of duplicates, depending on the data, because each one of
these subqueries could produce the exact same results, and we
will have to de-dupe it for you so that you never see
the same result twice.
So the performance of this OR versus Zigzag, or MultiQuery
versus Zigzag in general, depends a lot on the data
because it's a trade-off between the duplicate
values or the intermediate false positive results.
Another thing that we've added or are going to add -- sorry --
are going to add to MultiQuery is support for a
domain-specific query such as Geo-Queries and Date
Range queries.
We do this by adding a query splitter that produces multiple
parallel query components.
And the splitting logic can be anything, right?
So in the case of Geo and Date Range, this is done using
a space-filling curve.
So these queries are split out into regions to scan along this
space-filling curve, and you can choose the number of
regions you want to scan, which is the number of queries you
want to run in parallel versus the accuracy of those regions.
So there's a performance trade-off there, and when you
choose something besides the most accurate, you get false
positive results in this case, so you have to do some pruning.
So we have entity filters.
We always apply a de-duping filter because that's almost
always needed or always needed with multiqueries and so
there's a special filter for the Geo and Date Range Queries
that prunes these fuzzy results.
So you only see the ones that exactly match your query,
but it's still done efficiently through the
space-filling curve.
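The space-filling-curve idea can be sketched with a simple Z-order (Morton) curve. The choice of curve and bit width here is an illustrative assumption; the talk doesn't specify which curve or encoding the datastore uses:

```python
# Interleave the bits of x and y into a single sortable value; nearby
# points tend to land in nearby ranges of the curve, so a 2-D region
# becomes a small set of 1-D index scans, with some fuzzy edges that
# an entity filter then prunes.
def morton(x, y, bits=8):
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return code

# The four cells of the 2x2 square at the origin are contiguous
# on the curve, so one range scan covers that region exactly.
cells = sorted(morton(x, y) for x in (0, 1) for y in (0, 1))
```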
So here's some examples of how the Next Gen queries improve
your life as a developer.
So SELECT FROM Photo WHERE tag equals "family" AND
tag equals "outside," ORDER BY date DESCENDING.
I use this a lot.
This just means find recent photos of my family
taken outside.
So for the current gen, it's an index on tag, tag and date,
which is exploding and doesn't scale when you have
a lot of tags.
There's a 5,000 index row limit on the datastore, and your
users might not put in, you know, the square root of 5,000
tags, but, you know, I wouldn't trust that.
You know, users do crazy things sometimes, and maybe there's
enough stuff to tag in their photos.
So this is dangerous.
This is a dangerous index to do.
You might have errors that don't show up in testing, that
don't show up when you're exploring the possibilities
of using the datastore.
But on the Next Gen, you only require the index on tag and
date so it does not explode anymore.
This query SELECT photos WHERE tag IN "family" or
"friends" and tag equals "outside,"
and tag does not equal "beach." "Find
all photos of my friends and family taken outside but not on
the beach." I didn't specify a sort order because I
wanted to compare this to the current generation, and in the
current generation, you couldn't do this query if you
had a sort order because the inequality expression of the
NOT EQUAL turns into inequality filters, which require that
it's sorted on tag, plus you get unintuitive results because
tag is a multivalued property.
But in the Next Gen, it requires no additional indexes.
You can run this all only on the built-in indexes, and you
get the intuitive results.
If I add the sort order and convert this to use OR instead of IN
and subexpressions, so NOT instead of NOT EQUAL --
NOT and then the term.
So, as I said, this is not possible in the current
generation, but if you were to do this in Next Gen, it only
requires that index on tag and date still.
And this is -- one of the many features that I think is
incredibly useful because if you have an index on tag and
date, you can have any number of tag equality filters in your
query in any combination of OR and AND, AND NOT, and it
will still be able to satisfy that query.
So if you have a search box and someone's typing in what tags
they want to find -- I wanna see these tags, and not those
tags -- before, you'd be restricted on what queries
you have access to.
If you had an index on tag, tag, tag and date, you could do
three, but you couldn't do two.
You'd need an index on tag and date.
And then if you had that one, it wouldn't support four.
So the only way to dynamically allow a search box where a user
could enter this and it would be converted efficiently into a
query is to use the Zigzag Merge Join, which currently
doesn't allow you to sort.
And now that it does, we can solve it efficiently.
So we're gonna go really crazy.
Tag IN "family" and "friends," or people
IN some list of people that are your family and friends AND tag
equals "outside" AND NOT tag equals
"beach" OR location IN the coastal region, ORDER BY
date DESCENDING.
I do not recommend you run this query, but you could do it.
This is the exact same meaning of the previous query, but it
does not rely on proper tagging, because you have facial
recognition for people who are in your friends and family
contact lists, and you have a region that you think is near
the beach, thus I assume those are equivalent, although I
wouldn't trust that either.
But notice, though, that the NOT is applied to an entire
subexpression, so this is literally how you can work with
the new Zigzag framework, is it can be applied to any
subexpression, and it will do it efficiently.
This does require clever use of Geo encoding that I plan to
support, or we plan to support.
The team's not very big.
So normally, a location query, or a
geospatial query, would be an inequality filter on the
space-filling curve, but you can, if you encode it correctly
and are clever about it, make it an equality expression, so
you can actually have that sort, which allows you to
prioritize -- like if you have a map, you could prioritize what
you see in that map using this technique as well.
So in the current generation, it's definitely not possible.
In the next generation, you can do it with an index on tag and
date; tag and people; geolocation and date; and
that's all you need.
And those, if you look at them, can service as a whole slew of
queries, whereas before, when you were specifying a composite
index, there are very few that you could actually service
with these queries, or with these composite indexes.
So when do we use what?
When do we decide to use the MultiQuery OR, or the Zigzag
OR? These all seem to support similar features.
Well, it all depends on what composite indexes
are given to us, but we do have a preference.
So Zigzag tends to produce results that scale in the size
of the result set, but can, in pathological cases, scale
in the size of the largest subscan, while MultiQuery
produces duplicate correct results but is guaranteed to
scale in the size of the result set because there are no
false positive results.
Every result you pop off is a result that is
in the result set.
Although there are duplicates, we do have a limit, which
makes the number of duplicates you can see a constant
value -- which is kind of cheating and bad, because it
doesn't scale very well.
So in those cases, we're going to use Zigzag anyway.
So the actual performance does depend on the shape of your
data, but when possible, we will always prefer to run a
full-fledged query, or a MultiQuery, when we have the
composite indexes there, but there are many cases in which
this just does not work.
So if we look at the capabilities of these two
methods, Zigzag, if you ever talk to me afterwards, is the
answer to everything.
It does a lot of stuff, and it is very, very flexible,
especially in this case, whereas MultiQuery has
that unintuitive NOT.
It can only do that on a single property.
It has a limited number of filters when you do AND because
you have that composite index for each number of
those AND filters.
It restricts the number of queries you can run in parallel,
so that OR is a little restrictive, but it's the only
thing that you can really do for domain-specific queries.
So we'll continue -- so we pick what to use based on what is
given to us and what the constraints are for each of these methods.
So specifically here, almost every query can be converted
into a MultiQuery that will potentially run faster
than the Zigzag except when we have the NOT.
It's a dividing line where if you want the intuitive NOT or
you want the NOT on multiple properties, it'll have to at
some point do the Zigzag between multiple indexes.
So let's look at how an actual query will be planned.
So if we have only the built-in indexes, if we're going to run
this query, it will have to be done using Zigzag Merge Join.
So we convert this into an AND operation, we split out the IN
to an OR, we pull out the NOT, pull out that OR, combine them
all together, and this is actually the class hierarchy
that we use, and this runs as a single Zigzag run.
But if you look at this, we're actually running
a lot of subscans.
So in this case, we're running five subscans, and each one of
these are very broad, right?
So there's a lot of potential for false-positive
results in these.
So this may not be efficient if you have a pathological case, a
specific query, a specific user-data combination.
So you might want to optimize this, depending on what kind
of restrictions you want to place on your users.
So if we add the index, which is exploding -- but if we wanted
to restrict the users to the square root of 5,000 tags,
we can do this with MultiQuery, and in this case,
we'll split out that IN into two different parallel query
components, tag equals "family" or tag
equals "friends." Those are run in
separate queries.
We split out the NOT into their own separate query components,
and then we run this.
We'll run these queries, and you can see that we're running
more queries in this case.
There's six of them, but they're very specific.
They only hit the positive results so they will
tend to be a lot faster.
Happily, there's middle ground now.
Before there was no middle ground, and we're gonna
provide a middle ground.
So in this case, you can just add an index on tag and tag,
and if we see this, Zigzag will actually combine the OR and
that one tag equals "outside" into a
single OR constraint.
So we're only running four queries now and two of them
are much more specific.
They scan a much smaller region, have a lot
smaller potential for false-positive results.
So this allows you to control the space versus time or
write versus read latency.
By the selection of these indexes, you can choose where
to optimize your queries.
So say we have these two queries, and if we had no
composite indexes, they would be running on three different
scans for each of these and doing Zigzag to merge them, and
your only choice today is to add these indexes
to speed them up.
Every time you add an index, you increase the space that
you're taking up with these indexes, you're reducing the
time it takes to run these queries, but you're also
increasing the cost of the write, because each index
requires you to put in an index row, and when you do that, it
takes up your CPU quota for your API stuff, and that's very
precious in App Engine, but you are reducing your read time.
So already available, you can add this
index instead.
So this is the parts that were common to both of those queries
before so they will run on these indexes instead.
So today, if you added the common index, Zigzag
is smart enough to use that composite index.
And so you really have the tools to tweak your indexes and
really make them work for you which, if you can do it
successfully, will probably make you very valuable
to your company.
So some corollaries to this work, just some side notes.
So I don't know if any of you have seen the SearchableModel
class that Ryan wrote, but it was pretty useful unless you
wanted to sort, and datastore is meant to scale.
It's meant to have really large data sets, and those aren't
actually very useful unless you can sort, unless you can
pull out the most important results from all the noise.
So now SearchableModel becomes useful because you can actually
sort, and you can have a user enter a string that has any
length, and you can service that query, and it might take
longer if they enter like 25 words, but it will still complete.
You won't get an index-not-found error or
a need-an-index error, and it will run.
Another thing you can notice is that multivalued properties
don't really sort well, like it's not very
intuitive to sort.
There's almost never a reason to put them in a sort order.
So given that, and given the fact that you can Zigzag
everything, you only require each property at most once in
any prefix -- although it is recommended that you put more
than that to optimize for specific queries -- that
means there are no more exploding indexes.
There's no need to have a multivalued property repeated
more than once in an index, and
thus it will not explode.
And so another thing, if you've been paying attention and I
didn't go too fast, and you like my pretty slides that are
animated, you'll notice that the only thing that's different
for each of these results that was pushed to the users as a
result set was the postfix.
The prefix stayed constant.
And right now for cursors, what we're actually returning in the
cursor is the index row so we know where to start the scan
the next time you resume.
So we can look at this, and we can say we only need to
store the postfix so we can make these much smaller.
If you think about it for MultiQuery, the postfixes all
have to be the same because that determines the ordering,
and all the subqueries have to use the same ordering to be
able to merge efficiently.
So these cursors can work for multiqueries as well, although
you will face de-duping issues across multiple requests
if you do this.
But another thing we can do is that the only difference
between a date descending index and a date ascending index is
how the index data is encoded.
So we can store raw properties in cursors instead of the
encoded values, and that means that you can run a cursor on an
index that has the ordering reversed, thus allowing you to
scroll backwards and forwards in a given result set.
However, this does have a little catch, a gotcha.
Since the final ordering of every query is key ascending,
you have to explicitly put in a key descending ordering.
Otherwise, it won't work properly.
But this does allow you to page endlessly forwards
and backwards in the same result set.
So that's it.
Wow, that was way under time.
I was supposed to talk slow, and I hope I did, but you guys
seem pretty free up there.
I asked a couple of times.
Oh yeah, I'm unplugged.
That's right.
ALFRED FULLER: I think my wife picked it out.
[LAUGHTER]
She didn't really like the ones I had.
Apparently, she likes to dress up her family
members, including the pug.
So I have mainly been working on reliability issues right
now, and I really want to get these out for you.
All the infrastructure is there so if you were -- if I really
wanted to run these queries right now, I could, but
exposing them to you and building all the pipeline that
needs to take it from the APIs, all the way through to the
back-end datastore, it's actually a fair amount of work,
and I'll get to it as soon as I can.
I really want to do this because -- it's a small team.
But the Zigzag operating on every query so you can use
SearchableModel will be out really, really soon.
So that part will be there, but improving -- like there are
many ways to enter the query so GQL won't be that hard to
upgrade, but the programmatic interface for Python and stuff
will have to be completely changed to support other
things, you know, besides AND, so that'll take a little work.
But I really want to get it to you guys.
So the query splitter and the entity filters, they're in the
Java SDK right now, and we're going to put them in the Python
one and we want to finalize the API there, and I want to expose
those to users so you can have your queries work as if
they were our queries.
So the prototype I have right now for the GeoQueries, let's
just say, point in region.
Like you literally can say that, and exposing these will
let you do the exact same thing for your domain-specific
queries, so that you can have like an SQL-like statement that
will just work, so I definitely
do want to expose those.
They aren't right now because we haven't finalized the API.
Do you have a question?
AUDIENCE: Yeah, so I was just wondering, so like on the Dev
App server, it recommends indexes to create based
on your run time.
Is that now going to change because certain indexes aren't
going to be needed for--
ALFRED FULLER: Yes.
AUDIENCE: --some of the queries that the Dev App server's
going to generate?
ALFRED FULLER: Yes, it will change.
AUDIENCE: And also do we need to manually delete some of
our composite indexes that were previously created?
ALFRED FULLER: Yes.
So it will require refactoring of your indexes, and the Dev App
server will recommend different indexes.
Notably what it will do is it will never recommend an index
with a repeated property in it.
So if you have the same property repeated twice in the
index, we just won't recommend that, but other than that, it
will always recommend the fastest index, which is the
very explicit one, right?
But it does -- we'll have to make it so that it sees the
indexes you already have and doesn't recommend it
when you don't need it.
AUDIENCE: Yes, now that the SearchableModel class is more
usable, are we able to essentially do what
LIKE queries do?
ALFRED FULLER: There's a large domain of full-text search
queries that you can use with it, but it's not going
to be like queries.
And it's pretty limited.
We are definitely working on a better solution
for full-text search.
We really want to get that out to you guys, but for right now,
SearchableModel is like the only option -- I'm not saying you should
use it for sure, but you can customize it and do a similar
model for your type of data that meets your specific needs.
But no, LIKE is not really working right now.
So there is a place for SQL for sure, and it's just, to get
it to this scale and make it scale automatically, there are
still those restrictions.
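One LIKE-style query that does work on a byte-ordered index is a prefix match, emulated with two inequality filters on the same property; a sketch of the standard trick (the sentinel character and helper name are illustrative, not an official API):

```python
def prefix_filter(values, prefix):
    """Emulate LIKE 'prefix%' as:
    value >= prefix AND value < prefix + '\\ufffd'.

    '\\ufffd' serves as a high sentinel so every string starting with
    the prefix falls inside the half-open range.
    """
    lo, hi = prefix, prefix + "\ufffd"
    return [v for v in sorted(values) if lo <= v < hi]

names = ["alfa", "alfred", "bob", "alice"]
prefix_filter(names, "alf")  # -> ["alfa", "alfred"]
```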
AUDIENCE: Did you say in cursors that we
can now go backwards?
Is that what you were saying?
Like the current cursors?
ALFRED FULLER: Yes.
The current cursors right now can only go forward, which
means they're very useful when you're chaining task queue tasks
and stuff like that, but if you're paging through results,
it becomes much less useful, and I always wanted
them to be able to page both forwards and backwards
in a result set.
And to do that, it does require that you have an index that is
basically the reverse of whatever index you're going
over, but if we encode the raw values, we can actually find the
spot for that cursor in the reverse index, and
thus you can scroll forward in that index, which is
essentially backwards in the other one.
We can also do things like here's an entity, tell me where
it is in this index, or just start on this entity, and we
can actually convert that entity into a cursor, and
so you can do things like that as well.
AUDIENCE: That's for the future, right?
Is that for Python and Java?
ALFRED FULLER: Yeah.
So that's actually on a lower level than the APIs, so if I
write it for one, it's for both.
AUDIENCE: Yeah, you mentioned point-in-region queries?
Geospatial queries?
ALFRED FULLER: Yes.
AUDIENCE: That sounds cool.
Is the region like a bounding box, or is it like a polygon?
ALFRED FULLER: So in the prototype that I have running,
you can specify a circle, a bounding box, or a
polyregion, polylines.
AUDIENCE: Polylines.
ALFRED FULLER: And
then you can -- should I not say this stuff?
I don't know.
Probably not?
Oh, well, in the prototype I have running, which is not
necessarily the final prototype, you can also do --
if you have IN, you can plug multiple regions in there and
it'll do a union, and you can AND that with another IN, and
it will do an intersection of the regions, so you could do
any sort of combination of these regions as well.
AUDIENCE: Could you explain what you meant when you said
the only difference between an ascending and descending
index is the encoding?
ALFRED FULLER: Yeah, sure.
So the indexes we have are sorted lexicographically,
by-the-byte encoding, right?
So an integer value, depending on whether it's big-endian or
little-endian, you know, it kind of will mess with that.
So we have special encodings, right, so that if you have like
an integer, it will encode it such that one is first,
two is second in a byte.
That's why they're fixed length, right?
And so when you do a descending encoding, it sorts it in the
opposite direction, so it's like 255 minus all of those bytes.
So they're encoded differently.
And right now, cursors store the encoded value and not the
raw value, which is why we can't decode them differently
on the back end.
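A tiny illustration of that encoding idea (simplified to unsigned 4-byte integers; not the datastore's exact wire format): the ascending index stores a fixed-length, byte-comparable encoding, and the descending index stores the complement, 255 minus each byte, so a plain byte sort runs in the opposite direction.

```python
def encode_asc(n):
    # Fixed-length big-endian bytes: byte-wise comparison matches
    # numeric order for non-negative integers.
    return n.to_bytes(4, "big")

def encode_desc(n):
    # 255 minus every byte flips the sort order.
    return bytes(255 - b for b in encode_asc(n))

values = [3, 1, 4, 2]
asc = sorted(values, key=encode_asc)    # [1, 2, 3, 4]
desc = sorted(values, key=encode_desc)  # [4, 3, 2, 1]
```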
And we do have a plan.
We do have plans to support full-text search.
The main problem is we have many plans.
Well, actually, the problem is that we have multiple solutions
for this internally, and none of them really fit App
Engine quite right.
And they don't mesh exactly, and getting them to mesh, and
making sure that that mesh is solid and that the user will
have a good experience, is actually very difficult.
So that's what we're working on right now is we're testing
these solutions and making sure that they will work correctly
for you guys before we release them.
So that's why we don't have it yet.
ALFRED FULLER: Oh, yeah?
The gear?
[INAUDIBLE]
Yeah, for some reason, the encoding of this stuff is kinda
weird on here like that one.
[INAUDIBLE]
ALFRED FULLER: Which one?
Where?
[INAUDIBLE]
ALFRED FULLER: Presentation view.
[INAUDIBLE]
ALFRED FULLER: As soon as possible for that one if I can
parse that one correctly.
[READING QUESTION FROM SCREEN]
ALFRED FULLER: Are there any short-term plans to allow
filtering by more than one inequality filter?
So theoretically, the space-filling curve, like the
Hilbert curve, can be expanded into any dimensionality space,
although the higher the dimensionality, the less
efficient your query is going to be, and the harder the
encoding of these indexes.
I mean, you'll need more bytes to get more accuracy
for that encoding.
But theoretically, you can take the Hilbert curve and use that
to apply multiple inequality filters to
any property that can be normalized to like a zero-to-one
range, which does not include strings, right, or --
theoretically it doesn't include real numbers, but, you
know, for floating point numbers it's definitely
possible because there is a max.
So for right now, I plan on adding support for two because
it actually maps very well to the size of an integer field,
64 bits, and in the future, I will try to add more as well.
So this actually is really useful for date range problems
where you have a start and end date, you know, and you want to
find everything that overlaps this other range for the start
and end date, which is actually a very common problem.
So those will be supported by the two-filter case, so we can do those.
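For reference, the Hilbert-curve mapping the answer alludes to can be sketched with the classic bit-manipulation formulation (an illustration of the idea, not the datastore's actual encoding): two bounded dimensions collapse into one curve distance, and a 2-D range becomes a small set of 1-D index ranges.

```python
def xy2d(n, x, y):
    """Map a point on an n-by-n grid (n a power of two) to its
    distance along the Hilbert curve. Nearby cells get nearby
    distances, which is what makes range scans over the curve
    reasonably efficient."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so each level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Every cell gets a unique distance, and consecutive distances are
# always grid-adjacent cells:
cells = {(x, y): xy2d(8, x, y) for x in range(8) for y in range(8)}
```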
[READING QUESTION FROM SCREEN]
ALFRED FULLER: Is it possible to add MapReduce queries like
in -- so the MapReduce framework should use cursors,
and cursors work with all these queries, so you should be able
to use MapReduce over any query.
If you saw the talk earlier, if you didn't, it was very
good, by Mike [? Atasky. ?]
And I would recommend you watch it on YouTube if you haven't.
I think it's much closer to the arrows,
you know, the NOT EQUALS, when it's like less than,
greater than, because that's pretty much what it is.
And before all this work, it wasn't really possible to do
anything but what it does.
And so it was kind of assumed that we should just put it in
and since that's the only thing that's possible, then it
wouldn't ever -- well, it would definitely cause confusion, but
people testing their work would see that, and it would
happen like that.
And it's going to be difficult pushing into the future
to decide when to switch them around.
I think that way, if you specify NOT, if you specify NOT
explicitly, we're going to use Zigzag every time, or if we
can't do your query, we're going to use Zigzag, and in
other cases, we're going to assume they're single-value
properties and do that because that would be the most
efficient way to do it.
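A sketch of that expansion (illustrative only): NOT EQUALS over a sorted, single-valued index becomes two range scans, less-than and greater-than, whose results concatenate in order.

```python
def not_equal(rows, value):
    """Expand "prop != value" into two range subqueries over a
    sorted index and concatenate the results."""
    less = [r for r in rows if r < value]      # prop < value
    greater = [r for r in rows if r > value]   # prop > value
    return less + greater                      # each scan is already sorted

sorted_index = [1, 2, 2, 3, 4]
not_equal(sorted_index, 2)  # -> [1, 3, 4]
```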
Well, this is all going to have to be -- apparently, there's
differences between GQL and the programmatic query framework,
and personally, I recommend you use the programmatic query
framework because it's not string parsing and stuff.
Why do that? Because Python is all about the string
parsing and lookups.
But we're going to have to rework all this to support the
Next Gen so we definitely want parity between the two, future
parity between this and Java, and Java has JDO and JPA, which
have all their own query languages, which are much
richer than GQL, and we'll definitely try also to support
that natively, all these features natively
in there as well.
So none of this is really published.
In fact, I don't think our internal documentation for it
is very good either, which Ryan probably hates me for, because
he's always encouraged me to do that, and especially if a bus
crashed and I was in it, I think you guys would
be all out of luck.
So I think this is the documentation, like
you should watch the YouTube video for now.
[INAUDIBLE]
ALFRED FULLER: Yeah, we are definitely looking for
people to add to the team.
I don't know what that is.
OLAP?
[INAUDIBLE]
ALFRED FULLER: Okay, cool.
Well, thank you all for coming.
If you have any more specific questions or anything like
that, please come and see me.
I am happy to answer any of your questions.
Thank you very much.