ALFRED FULLER: So, my name is Alfred Fuller, and I'm going to
talk about Next Gen Queries.
So who am I?
I'm the software engineer on the Datastore App Engine Team.
I work mostly with the Query Planner.
I helped remove the thousand-entity limit
for queries, I added --
ALFRED FULLER: Thank you.
I added cursors and prefetching, and I've done a lot
of work in the Python and Java SDKs around the datastore, so
I'm familiar with both.
So if you want to follow along with the wave notes,
you all know the drill.
The bitly link is NextGen/Queries/Wave.
I would also highly recommend looking at some past talks.
I'm gonna briefly go over some concepts that were gone over in
detail, especially in Building Scalable, Complex Apps on App
Engine by Brett Slatkin last year, and also there's Under
the Covers of the Google App Engine Datastore by
Ryan Barrett, which is excellent as well.
So the first thing I'm going to do is overview some of
the features we're adding.
This is the largest set of features we've added to
the datastore query engine since its launch, so these
are fairly substantial.
I'm going to go over some background of how we run these
queries and then talk about the technology we use to improve
upon what we already have, and mainly this is Zigzag Merge
Join and what we call MultiQuery.
And then I wanna talk about some -- I'll give you a few
examples, which are probably not very practical, but they're
very interesting, so I'll let you guys put them to practical
use, and then talk about some corollaries that came out of
some of the same technologies that went into adding
these features.
So the current system is incredibly scalable.
It's also schema-less so it's incredibly flexible.
It does the scaling automatically, and with that,
we are able to perform many query operations such as
equality filters, inequality filters, sorts, and through the
magic of composite indexes, we're able to do many of
these on the same query.
These usually require a composite index whenever you're
trying to do a combination of these filters and you have to
plan for those ahead of time, but you can do a lot of
different things with these.
They're very flexible.
And everything's combined with a logical AND operator, except
in the case when there are no sorts and they're all equality
filters, then it's free rein.
You can do whatever you want, although you might want to add
composite indexes if you want to make it faster and
give some read-and-write performance tradeoffs.
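The capabilities just described -- equality filters, a single inequality, and sort orders, all combined with logical AND -- can be modeled with a toy in-memory evaluator. This is only a sketch of the visible behavior, with made-up entities and property names; it is not the real datastore, which serves these queries from indexes rather than by filtering in memory.

```python
# Toy model of the current query capabilities: equality filters,
# at most one inequality property, and a sort order, all ANDed.
def run_query(entities, eq=None, ineq=None, order=None):
    results = [e for e in entities
               if all(e.get(k) == v for k, v in (eq or {}).items())]
    if ineq:                       # single-inequality-property limit
        prop, op, val = ineq
        ops = {'<': lambda a, b: a < b, '>': lambda a, b: a > b}
        results = [e for e in results if prop in e and ops[op](e[prop], val)]
    if order:
        prop, direction = order
        results.sort(key=lambda e: e[prop], reverse=(direction == 'DESC'))
    return results

photos = [
    {'owner': 'alfred', 'date': 3},
    {'owner': 'alfred', 'date': 1},
    {'owner': 'ryan',   'date': 2},
]
# SELECT FROM Photo WHERE owner = 'alfred' ORDER BY date DESC
recent = run_query(photos, eq={'owner': 'alfred'}, order=('date', 'DESC'))
```

In the real datastore, combining an equality filter with a sort order like this is exactly the case that requires a composite index planned ahead of time.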
So how are we improving this?
First thing we're going to do is we're going to add support
for arbitrary query logic.
So AND, OR, NOT subexpressions, so you can use the whole
gamut, everything that you can do, you know, with your
SQL queries and stuff.
I know that we announced that today -- or yesterday.
This is definitely different from what you
expect in SQL.
There are still limitations, especially the
single-inequality filter limitation.
But we're not leaving the datastore behind, especially
since it's what can automatically scale, you
know, with no supervision, which is why we like it.
We're also adding first-class support for domain-specific
queries, such as Geo-Queries, Date Range queries, and this
can actually be extended to do multiple-inequality filters on
numeric properties, or anything that can be normalized
to zero and one.
The main thing, the thing that makes this all possible, is
improvements to the composite indexes.
So we're gonna reduce the requirements on
composite indexes.
We're going to eliminate the exploding index problem.
I don't know, how many of you run into the
exploding index problem?
[APPLAUSE]
ALFRED FULLER: And we're gonna have fewer indexes serve a lot
more queries so you can really balance this, and this makes
composite index selection kind of a dark art.
Not that it wasn't already, but even more so.
And we really want to provide tools to help you do this, but
right now, I think you're gonna have to rely on this talk, so
you can refer to it later on YouTube.
And the composite index selection really provides the
wrench to decide between space and speed, between
write-and-read latency and costs, where you want to spend
this based on what your application does.
And first and foremost, although last in this overview,
none of these result sets were pulled into memory at any
time during the making of these features.
So these are scalable, they do use low resources, and we're
very happy about that.
Quick disclaimer: So, unfortunately, this is Next Gen
Queries, so it's not in the current release or
the next release.
We're rolling out these features in pieces.
The infrastructure's there, but we have -- it takes time
to expose them to users.
That being said, you can feel free to use all your Bluetooth
devices and cell phones because this is only a presentation.
Although some of the back-end changes are available today,
and I'll go over exactly what those are, and those are the
ones that are likely to come out first, so the exploding
index problem should be solved as soon as I can get to it.
So throughout this presentation, I'm gonna be
talking -- or using the model Photo, which has an owner, the
person who posted the picture; a tag, which is a multivalued
property, or a list of user-assigned tags; people,
which can be either user-assigned or from facial
recognition; location, which is where the photo was
taken; and the date.
And this is what I normally look like over there when I'm
not trying to be presentable.
I was told I was the only one with a tie today, so
maybe I went overboard.
Not the pug, not the pug.
That's my wife's cruel joke again, our dog.
So for some background, what we use to solve
every query are indexes.
These contain the index data, which is a set of ordered
values that, when they are placed in the index, are
sorted lexicographically.
So it's a sorted collection of the index data, and so this is
exactly -- or an example of what the -- silent
mode, please.
That was my wife.
I think she heard the comment about the dog.
So this is the built-in property index.
So it has the kind, the name of the property, the value of the
property, and the key of the entity that it's referring to.
This is what a composite index looks like.
It has the index ID of the composite index that it's
using, a composite value, and then the key.
So queries are converted into scans over these indexes
through the Query Planner, and this is done by splitting the
index data into two different pieces: a prefix that's held
constant and a postfix that's different for each result.
So all the equality filters go into the prefix, the
sort orders go into the postfix, and inequality filters
are just range restrictions on that postfix, which is
why we only allow one.
And so for an easy example: SELECT FROM Photo WHERE tag
equals "family." We decide to split the index data
between the value and key.
We apply a post -- or prefix constraints of kind equals
photo, name equals tag, value equals "family," and
this assures that we only see index rows that refer to photos
that have these properties.
Then we let the key vary.
We use this key to look up the actual entity, and we return
the photo to you as a result set.
So, as you can see, since this is lexicographically sorted,
these photos are actually coming out in postfix ordering
in this case, which is key ascending.
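The scan just walked through can be sketched with a sorted list standing in for the built-in property index. The row layout matches the one described above -- kind, property name, value, entity key -- but the keys and values are illustrative assumptions:

```python
# Each built-in index row is (kind, property name, value, entity key),
# and the index keeps its rows lexicographically sorted.
index = sorted([
    ('Photo', 'tag', 'family',  'key2'),
    ('Photo', 'tag', 'outside', 'key2'),
    ('Photo', 'tag', 'family',  'key5'),
    ('Photo', 'tag', 'beach',   'key9'),
])

def scan(index, prefix):
    """Hold the prefix constant; return the varying postfixes in order."""
    n = len(prefix)
    return [row[n:] for row in index if row[:n] == prefix]

# SELECT FROM Photo WHERE tag = "family": the postfix is just the key,
# so results come back in key ascending order.
postfixes = scan(index, ('Photo', 'tag', 'family'))
```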
So if we wanted to change that ordering, we wanted to SELECT
FROM Photo ORDER BY date DESCENDING, we would choose a
slightly different separation of these index data.
We subdivide it between the name and the value.
We apply the prefix constraints of kind equals photo, name
equals date, which this is an implied existence filter, so if
it doesn't have a date, it won't show up in this
index, and then we let the postfix vary.
And you can see that if we do the encoding correctly, we
encode this so that date descending becomes the
lexicographic order.
When we pull off the key, we can look at the photo, and we
get everything with the postfix ordering of date descending,
key ascending, which is how we serve this query.
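One standard trick for making a descending sort lexicographic -- shown here as an assumption for illustration, since the talk doesn't spell out the datastore's actual byte encoding -- is to complement each byte of the ascending encoding:

```python
# Complementing each byte reverses the lexicographic order, so a
# byte-wise ascending scan of the encoded values yields dates descending.
def encode_desc(value, width=4):
    return bytes(255 - b for b in value.to_bytes(width, 'big'))

def decode_desc(encoded):
    return int.from_bytes(bytes(255 - b for b in encoded), 'big')

dates = [20100519, 20090101, 20100101]
in_index_order = sorted(encode_desc(d) for d in dates)
recovered = [decode_desc(e) for e in in_index_order]
```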
We get a little bit more complicated.
SELECT FROM Photo WHERE tag equals "family" AND
tag equals "outside," ORDER BY date.
Now this requires a composite index so the first thing we
have to do is we specify a composite index on
tag, tag, and date.
We choose to split this index data in the middle of the
composite value between the last tag and date.
We set our prefix constraints tag equals "family"
and tag equals "outside," and we
let the postfix vary.
So we have a postfix ordering of date descending, key
ascending, and we get the photos in this order.
Are you still with me?
Good.
Awesome.
So we expand on this in two ways, one of which is
with Zigzag Merge Join.
So currently, this is how it operates.
Zigzag Merge Join can efficiently find postfixes
common to multiple scans.
It produces intermediate false results because it does two
scans or multiple scans to find the postfixes that are in
common, but it knows how to skip past sections of large,
not matching results so it tends to scale in the size of
the result set instead of the size of each
individual subscan.
So say we have photos that are tagged with dogs, photos that
are tagged with cats, and we would want to find photos that
show dogs and cats living together.
Well, Zigzag Merge Join can efficiently find when that
correlates with mass hysteria.
So here's an example of Zigzag Merge Join.
SELECT FROM Photos WHERE tag equals "family" AND
tag equals "outside." Notice these are all
equality filters here and there's no sort order, so we can
service this query with Zigzag Merge Join with
only the built-in indexes.
So we're gonna run two queries, or two index scans, I mean, on
the property index, one -- OK, we'll split these index data in
the same place for each query between value and key.
We'll apply a prefix constraint of kind equals photo, name
equals tag, value equals "family" for the
first scan, and we'll apply a prefix constraint of kind
equals photo, name equals tag, value equals "outside."
So the first scan will give us all photos that are
tagged with family.
The second scan will give us all photos that are
tagged with outside.
Then we apply a postfix constraint using Zigzag Merge
Join to say that the first scan's key must equal the
second scan's key for there to be a match.
So they become one.
We run the scan through the magic of Zigzag Merge Join,
which is gone over in detail in Brett Slatkin's talk, and we
have our photos that match this query in key ascending order.
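The merge itself can be sketched as a skip-based intersection of sorted scans. This is a simplified in-memory model of the idea, not the production implementation, and the keys are made up:

```python
import bisect

# Zigzag merge join sketch: intersect sorted scans by repeatedly
# seeking every scan to the current candidate postfix, so large runs
# of non-matching rows are jumped over rather than enumerated.
def zigzag_intersect(*scans):
    if any(not s for s in scans):
        return []
    results = []
    candidate = scans[0][0]
    while True:
        advanced = False
        for scan in scans:
            i = bisect.bisect_left(scan, candidate)  # seek >= candidate
            if i == len(scan):
                return results                       # a scan is exhausted
            if scan[i] > candidate:
                candidate = scan[i]                  # zigzag: jump ahead
                advanced = True
        if not advanced:                             # all scans agree
            results.append(candidate)
            i = bisect.bisect_right(scans[0], candidate)
            if i == len(scans[0]):
                return results
            candidate = scans[0][i]

family  = ['key1', 'key3', 'key5']           # keys tagged "family"
outside = ['key2', 'key3', 'key5', 'key7']   # keys tagged "outside"
both = zigzag_intersect(family, outside)
```

Because each seek can skip a whole run of non-matching rows, the work tends to scale with the size of the result set rather than the size of each subscan, which is the property the talk emphasizes.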
MultiQuery is the second technology we use to
expand upon this.
MultiQuery combines multiple-query result sets
using a priority queue based on the orders of the query itself,
and I've nearly optimized it to avoid the need for this
priority queue when possible, and it currently just
supports IN and NOT EQUAL.
This is how we implemented these two.
So if you look at a query where tag IN "family" or
"friends," ORDER BY date DESCENDING, this query is
supposed to return any photos that are either tagged
family or friends.
We split into two different queries where one has all the
family and the other one has all the ones that are tagged
with friends, and each result -- a result from either one of
these queries is a correct result and is returned to the
user, and we maintain the ordering because both of these
subqueries have the same ordering that was on the
original query, and so we only need to see the first result,
the top result on each query to merge these result sets
together appropriately.
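That merge step can be sketched with a heap merge over the subquery streams. The sort keys here encode date descending as negated dates, and the dates and keys are illustrative assumptions:

```python
import heapq

# Each subquery's results arrive already sorted on the shared postfix,
# so only the head of each stream needs to be examined to merge them.
family  = [(-20100518, 'key2'), (-20100101, 'key7')]  # tag = "family"
friends = [(-20100519, 'key4'), (-20100101, 'key7')]  # tag = "friends"

merged, seen = [], set()
for _sort_key, entity_key in heapq.merge(family, friends):
    if entity_key not in seen:    # de-dupe: a photo can match both tags
        seen.add(entity_key)
        merged.append(entity_key)
```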
The NOT EQUAL, on the other hand -- I've also recently
expanded this, at least in Java, sorry, to support
multiple NOT EQUAL expressions as long as they're
on the same property.
But this is actually expanded into inequality filters.
And so WHERE tag is less than "beach," WHERE tag is
greater than "beach" AND less than
"coworkers," and WHERE tag is greater than
"coworkers." Now, hopefully, you see a
problem with this.
This actually does not do what you think it does.
It doesn't return photos that are not tagged with beach and
are not tagged with coworkers.
It returns -- these are multivalued properties, and if
you look at this closely, it returns photos that have a tag
other than coworkers or beach.
So any other tag, and that result will be
in your result set.
This does the intuitive thing when it's the single-valued
property, but when it's a multivalued property, it
has an unintuitive result.
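A tiny model of that pitfall, with illustrative tag values:

```python
# The datastore expands tag != "beach" into tag < "beach" OR
# tag > "beach", and a multivalued property matches a filter if ANY
# of its values satisfies it.
def matches_not_equal(tags, excluded):
    return any(t < excluded or t > excluded for t in tags)

# Intuition says a photo tagged "beach" should be excluded, but
# because it also carries another tag, the expansion matches it.
surprising = matches_not_equal(['beach', 'sunset'], 'beach')   # True
single     = matches_not_equal(['beach'], 'beach')             # False
```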
So the Next Gen features.
Was that clear?
Is everybody caught up?
OK, good.
So the first thing I did, or we did, was add sort
to Zigzag Merge Join.
So right now, Zigzag Merge Join doesn't work when
you have a sort order.
Well, it's pretty simple to expand Merge Join from looking
at just keys to looking at the entire postfix.
And so we've done that.
And so here's an example of the same query I showed you earlier
that required the index on tag, tag and date.
But now we can use Zigzag Merge Join.
We first create an
index on tag and date.
We do two scans instead of one.
We divide it between the tag and date, as I previously
showed you, and then we apply our prefix constraints.
The first scan is for "family." The
second scan is for "outside." We apply
Zigzag Merge Join, which now can merge on both the date
and the key, so basically the entire postfix,
and then we let that vary.
We pull off the key.
We get the photos in date descending and key ascending.
Now this is an essential difference: because tag is a
multivalued property, tag, tag and date is an exploding
index, while tag and date is not.
Another thing we've added, will add -- sorry, it's not
out yet -- is OR and NOT.
So this is how AND works.
The postfix must equal.
And OR, for this Venn diagram, we want everything that's in
either of these circles, so we just don't have a constraint.
Everything that's in either scan matches.
But Zigzag can also be very efficient here in which we
never look at the same key twice.
So it scales on the size of the blue region, not on the size
of the individual circles.
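That OR behavior can be sketched as a sorted-merge union that visits each key once. Again this is a simplified model with made-up keys, not the production scan:

```python
# Union of two sorted scans: every key in either scan is emitted, and
# a key present in both is looked at only once, so the work scales
# with the size of the union (the blue region), not the circles.
def zigzag_union(a, b):
    i = j = 0
    out = []
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i]); i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j]); j += 1
        else:                        # same key in both scans: emit once
            out.append(a[i]); i += 1; j += 1
    return out

either = zigzag_union(['key1', 'key3'], ['key2', 'key3', 'key5'])
```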
NOT, on the other hand, is a little more complicated.
We say one scan's postfix does not equal a postfix
of another scan's.
This is equivalent to set subtraction.
So if I want to do NOT just some subexpression, I would
take everything NOT in that subexpression.
Everything AND NOT(a).
Note, though, that this is not converted into an
inequality expression.
It is not a greater than or a less than.
This is still an equality.
So it can be put in the prefix, thus you can have a sort order
while doing NOT in this case.
And it returns intuitive results for multivalued
properties, because this is now set subtraction so it will
do exactly what you think it will do.
If the photo is tagged with "coworker" and
"outside," it won't show up in the result set even
if it has another tag on it.
So here's an example of actually how we
do this efficiently.
So a and NOT(b), if a is a result set that contains 2, 3
and 5, and b is a result set that contains 1, 3 and 4, we
expect the result to be a set of 2 and 5.
So the algorithm pulls the first value off of a, which is
2, looks for a value in b that is greater than or equal to 2,
finds 3, so 2 matches because 3 is greater than 2; pulls the
next result off of a, finds 3.
We already pulled 3 off of b, so we know that 3 doesn't match.
Then it pulls the next result off of a, which is 5.
We look for a result in b that is greater than or equal to 5.
We find none.
Five matches.
We pull the next result off of a, we find none.
That's the end of our result set.
So we've only looked at four out of the six keys to produce
two results, but each one of these keys that we skipped over
in b can be a large region of nonmatching results.
So this does tend to scale in a size of the result set, and
it becomes very efficient, especially when combined with
many more merge constraints.
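The walk just described can be sketched in code as a skip-based set subtraction over sorted scans:

```python
import bisect

# a AND NOT(b) over sorted scans: for each candidate from a, seek b
# forward to the first value >= candidate; whole runs of non-matching
# b values are skipped in a single jump.
def zigzag_subtract(a, b):
    out, j = [], 0
    for value in a:
        j = bisect.bisect_left(b, value, j)   # seek b >= value
        if j == len(b) or b[j] != value:      # no match in b: keep it
            out.append(value)
    return out

# The example from the talk: a = {2, 3, 5}, b = {1, 3, 4}.
result = zigzag_subtract([2, 3, 5], [1, 3, 4])   # [2, 5]
```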
So we can go crazy with this.
You can even go -- this is actually not that crazy,
now that I look at it.
So ((a AND b) OR c) AND NOT(d).
This, when all combined together, runs the
same algorithm.
It's pushed through doing the same process at the same time
so it does scale in the size of the blue area, not the area of
any of the sub -- of the other combinations or of the result
sets from any of those.
So this does scale and you can get it as complicated as you
want, and it will still tend to scale in the size
of the blue area.
So another thing that we're doing is we're adding
OR to MultiQuery.
So WHERE tag IN "family" or
"friends" OR person IN some list
of people, ORDER BY date DESCENDING, we'll split out,
and you can see it's actually almost a trivial change where
we just split, or we start saying person equals.
As long as they're sorted by the same sort properties, the
postfix is the same, thus we can order them properly when
we merge them using the priority queue.
Wow, that was awesome.
So if you look at this, you notice that this actually has
no false positive results.
In fact, MultiQuery never has a false positive result.
Every result that MultiQuery returns is a result.
But it does have a lot of duplicates, or it could have a
lot of duplicates, depending on the data, because each one of
these subqueries could produce the exact same results, and we
will have to de-dupe it for you so that you never see
the same result twice.
So the performance of this OR versus Zigzag, or MultiQuery
versus Zigzag in general, depends a lot on the data
because it's a trade-off between the duplicate
values or the intermediate false positive results.
Another thing that we've added or are going to add -- sorry --
are going to add to MultiQuery is support for a
domain-specific query such as Geo-Queries and Date
Range queries.
We do this by adding a query splitter that produces multiple
parallel query components.
And the splitting logic can be anything, right?
So in the case of Geo and Date Range, this is done using
a space-filling curve.
So these queries are split out into regions to scan along this
space-filling curve, and you can choose the number of
regions you want to scan, which is the number of queries you
want to run in parallel versus the accuracy of those regions.
So there's a performance trade-off there, and when you
choose something besides the most accurate, you get false
positive results in this case, so you have to do some pruning.
So we have entity filters.
We always apply a de-duping filter because that's almost
always needed or always needed with multiqueries and so
there's a special filter for the Geo and Date Range Queries
that prunes these fuzzy results.
So you only see the ones that exactly match your query,
but it's still done efficiently through the
space-filling curve.
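The space-filling-curve idea can be sketched with a simple Z-order (Morton) curve. The choice of curve and bit width here is an illustrative assumption; the talk doesn't specify which curve or encoding the datastore uses:

```python
# Interleave the bits of x and y into a single sortable value; nearby
# points tend to land in nearby ranges of the curve, so a 2-D region
# becomes a small set of 1-D index scans, with some fuzzy edges that
# an entity filter then prunes.
def morton(x, y, bits=8):
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return code

# The four cells of the 2x2 square at the origin are contiguous
# on the curve, so one range scan covers that region exactly.
cells = sorted(morton(x, y) for x in (0, 1) for y in (0, 1))
```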
So here's some examples of how the Next Gen queries improve
your life as a developer.
So SELECT FROM Photo WHERE tag equals "family" AND
tag equals "outside," ORDER BY date DESCENDING.
I use this a lot.
This just means find recent photos of my family
taken outside.
So for the current gen, it's an index on tag, tag and date,
which is exploding and doesn't scale when you have
a lot of tags.
There's a 5,000 index row limit on the datastore, and your
users might not put in, you know, the square root of 5,000
tags, but, you know, I wouldn't trust that.
You know, users do crazy things sometimes, and maybe there's
enough stuff to tag in their photos.
So this is dangerous.
This is a dangerous index to do.
You might have errors that don't show up in testing, that
don't show up when you're exploring the possibilities
of using the datastore.
But on the Next Gen, you only require the index on tag and
date so it does not explode anymore.
This query SELECT photos WHERE tag IN "family" or
"friends" and tag equals "outside,"
and tag does not equal "beach." "Find
all photos of my friends and family taken outside but not on
the beach." I didn't specify a sort order because I
wanted to compare this to the current generation, and in the
current generation, you couldn't do this query if you
had a sort order because the inequality expression of the
NOT EQUAL turns into inequality filters, which require that
it's sorted on tag, plus you get unintuitive results because
tag is a multivalued property.
But in the Next Gen, it requires no additional indexes.
You can run this all only on the built-in indexes, and you
get the intuitive results.
If I add the sort order and convert this to use OR instead of IN
and subexpressions, so NOT instead of NOT EQUAL --
NOT and then the term.
So, as I said, this is not possible in the current
generation, but if you were to do this in Next Gen, it only
requires that index on tag and date still.
And this is -- one of the many features that I think is
incredibly useful because if you have an index on tag and
date, you can have any number of tag equality filters in your
query in any combination of OR and AND, AND NOT, and it
will still be able to satisfy that query.
So if you have a search box and someone's typing in what tags
they want to find -- I wanna see these tags, and not those
tags -- before, you'd be restricted on what queries
you have access to.
If you had an index on tag, tag, tag and date, you could do
three, but you couldn't do two.
You'd need an index on tag and date.
And then if you had that one, it wouldn't support four.
So the only way to dynamically allow a search box where a user
could enter this and it would be converted efficiently into a
query is to use the Zigzag Merge Join, which currently
doesn't allow you to sort.
And now that it does, we can solve it efficiently.
So we're gonna go really crazy.
Tag IN "family" and "friends," or people
IN some list of people that are your family and friends AND tag
equals "outside" AND NOT tag equals
"beach" OR location IN the coastal region, ORDER BY
date DESCENDING.
I do not recommend you run this query, but you could do it.
This is the exact same meaning of the previous query, but it
does not rely on proper tagging, because you have facial
recognition for people who are in your friends and family
contact lists, and you have a region that you think is near
the beach, thus I assume those are equivalent, although I
wouldn't trust that either.
But notice, though, that the NOT is applied to an entire
subexpression, so this is literally how you can work with
the new Zigzag framework, is it can be applied to any
subexpression, and it will do it efficiently.
This does require clever use of Geo encoding that I plan to
support, or we plan to support.
The team's not very big.
So normally, a location query, or a
geospatial query, would be an inequality filter on the
space-filling curve, but you can, if you encode it correctly
and are clever about it, make it an equality expression, so
you can actually have that sort, which allows you to
prioritize -- like if you have a map, you could prioritize what
you see in that map using this technique as well.
So in the current generation, it's definitely not possible.
In the next generation, you can do it with an index on tag and
date; tag and people; geolocation and date; and
that's all you need.
And those, if you look at them, can service as a whole slew of
queries, whereas before, when you were specifying a composite
index, there are very few that you could actually service
with these queries, or with these composite indexes.
So when do we use what?
When do we decide to use the MultiQuery OR, or the Zigzag
OR? These all seem to support similar features.
Well, it all depends on what composite indexes
are given to us, but we do have a preference.
So Zigzag tends to produce results that scale in the size
of the result set, but can, in pathological cases, scale
in the size of the largest subscan, while MultiQuery
produces duplicate correct results but is guaranteed to
scale in the size of the result set because there are no
false positive results.
Every result you pop off is a result that is
in the result set.
Although there are duplicates, we do have a limit, which
makes the number of duplicates you can see a constant
value -- which is kind of cheating and bad, because it
doesn't scale very well.
So in those cases, we're going to use Zigzag anyway.
So the actual performance does depend on the shape of your
data, but when possible, we will always prefer to run a
full-fledged query, or a MultiQuery, when we have the
composite indexes there, but there are many cases in which
this just does not work.
So if we look at the capabilities of these two
methods, Zigzag, if you ever talk to me afterwards, is the
answer to everything.
It does a lot of stuff, and it is very, very flexible,
especially in this case, whereas MultiQuery has
that unintuitive NOT.
It can only do that on a single property.
It has a limited number of filters when you do AND because
you have that composite index for each number of
those AND filters.
It restricts the number of queries you can run in parallel,
so that OR is a little restrictive, but it's the only
thing that you can really do for domain-specific queries.
So we'll continue -- so we pick what to use based on what is
given to us and what the constraints are for each of these methods.
So specifically here, almost every query can be converted
into a MultiQuery that will potentially run faster
than the Zigzag except when we have the NOT.
It's a dividing line where if you want the intuitive NOT or
you want the NOT on multiple properties, it'll have to at
some point do the Zigzag between multiple indexes.
So let's look at how an actual query will be planned.
So if we have only the built-in indexes, if we're going to run
this query, it will have to be done using Zigzag Merge Join.
So we convert this into an AND operation, we split out the IN
to an OR, we pull out the NOT, pull out that OR, combine them
all together, and this is actually the class hierarchy
that we use, and this runs as a single Zigzag run.
But if you look at this, we're actually running
a lot of subscans.
So in this case, we're running five subscans, and each one of
these are very broad, right?
So there's a lot of potential for false-positive
results in these.
So this may not be efficient if you have a pathological case, a
specific query, a specific user-data combination.
So you might want to optimize this, depending on what kind
of restrictions you want to place on your users.
So if we add the index, which is exploding -- but if we wanted
to restrict the users to the square root of 5,000 tags,
we can do this with MultiQuery, and in this case,
we'll split out that IN into two different parallel query
components, tag equals "family" or tag
equals "friends." Those are run in
separate queries.
We split out the NOT into their own separate query components,
and then we run this.
We'll run these queries, and you can see that we're running
more queries in this case.
There's six of them, but they're very specific.
They only hit the positive results so they will
tend to be a lot faster.
Happily, there's middle ground now.
Before there was no middle ground, and we're gonna
provide a middle ground.
So in this case, you can just add an index on tag and tag,
and if we see this, Zigzag will actually combine the OR and
that one tag equals "outside" into a
single OR constraint.
So we're only running four queries now and two of them
are much more specific.
They scan a much smaller region, have a lot
smaller potential for false-positive results.
So this allows you to control the space versus time or
write versus read latency.
By the selection of these indexes, you can choose where
to optimize your queries.
So say we have these two queries, and if we had no
composite indexes, they would be running on three different
scans for each of these and doing Zigzag to merge them, and
your only choice today is to add these indexes
to speed them up.
Every time you add an index, you increase the space that
you're taking up with these indexes, you're reducing the
time it takes to run these queries, but you're also
increasing the cost of the write, because each index
requires you to put in an index row, and when you do that, it
takes up your CPU quota for your API stuff, and that's very
precious in App Engine, but you are reducing your read time.
So already available, you can add this
index instead.
So this is the parts that were common to both of those queries
before so they will run on these indexes instead.
So today, if you added the common index, Zigzag
is smart enough to use that composite index.
And so you really have the tools to tweak your indexes and
really make them work for you which, if you can do it
successfully, will probably make you very valuable
to your company.
So some corollaries to this work, just some side notes.
So I don't know if any of you have seen the SearchableModel
class that Ryan wrote, but it was pretty useful unless you
wanted to sort, and datastore is meant to scale.
It's meant to have really large data sets, and those aren't
actually very useful unless you can sort, unless you can
pull out the most important results from all the noise.
So now SearchableModel becomes useful because you can actually
sort, and you can have a user enter a string that has any
length, and you can service that query, and it might take
longer if they enter like 25 words, but it will still complete.
You won't get an index-not-found error or
a need-an-index error, and it will run.
Another thing you can notice is that multivalued properties
don't really sort well, like it's not very
intuitive to sort.
There's almost never a reason to put them in a sort order.
So given that, and given the fact that you can Zigzag
everything, you only require each property at most once in
any prefix -- although it is recommended that you put more
than that to optimize for specific queries -- that
means there are no more exploding indexes.
There's no need to have a multivalued property repeated
more than once in an index, and
thus it will not explode.
And so another thing, if you've been paying attention and I
didn't go too fast, and you like my pretty slides that are
animated, you'll notice that the only thing that's different
for each of these results that was pushed to the users as a
result set was the postfix.
The prefix stayed constant.
And right now for cursors, what we're actually returning in the
cursor is the index row so we know where to start the scan
the next time you resume.
So we can look at this, and we can say we only need to
store the postfix so we can make these much smaller.
If you think about it for MultiQuery, the postfixes all
have to be the same because that determines the ordering,
and all the subqueries have to use the same ordering to be
able to merge efficiently.
So these cursors can work for multiqueries as well, although
you will face de-duping issues across multiple requests
if you do this.
But another thing we can do is that the only difference
between a date descending index and a date ascending index is
how the index data is encoded.
So we can store raw properties in cursors instead of the
encoded values, and that means that you can run a cursor on an
index that has the ordering reversed, thus allowing you to
scroll backwards and forwards in a given result set.
However, this does have a little catch, a gotcha.
Since the final ordering of every query is key ascending,
you have to explicitly put in a key descending ordering.
Otherwise, it won't work properly.
But this does allow you to page endlessly forwards
and backwards in the same result set.
So that's it.
Wow, that was way under time.
I was supposed to talk slow, and I hope I did, but you guys
seem pretty free up there.
I asked a couple of times.
Oh yeah, I'm unplugged.
That's right.
ALFRED FULLER: I think my wife picked it out.
[LAUGHTER]
She didn't really like the ones I had.
Apparently, she likes to dress up her family
members, including the pug.
So I have mainly been working on reliability issues right
now, and I really want to get these out for you.
All the infrastructure is there so if you were -- if I really
wanted to run these queries right now, I could, but
exposing them to you and building all the pipeline that
needs to take it from the APIs, all the way through to the
back-end datastore, it's actually a fair amount of work,
and I'll get to it as soon as I can.
I really want to do this because -- it's a small team.
But the Zigzag operating on every query so you can use
SearchableModel will be out really, really soon.
So that part will be there, but improving -- like there are
many ways to enter the query so GQL won't be that hard to
upgrade, but the programmatic interface for Python and stuff
will have to be completely changed to support other
things, you know, besides AND, so that'll take a little work.
But I really want to get it to you guys.
So the query splitter and the entity filters, they're in the
Java SDK right now, and we're going to put them in the Python
one and we want to finalize the API there, and I want to expose
those to users so you can have your queries work as if
they were our queries.
So the prototype I have right now for the GeoQueries, let's
just say, point in region.
Like you literally can say that, and exposing these will
let you do the exact same thing for your domain-specific
queries, so that you can have like an SQL-like statement that
will just work, so I definitely
do want to expose those.
They aren't right now because we haven't finalized the API.
Do you have a question?
AUDIENCE: Yeah, so I was just wondering, so like on the Dev
App server, it recommends indexes to create based
on your run time.
Is that now going to change because certain indexes aren't
going to be needed for--
ALFRED FULLER: Yes.
AUDIENCE: --some of the queries that the Dev App server's
going to generate?
ALFRED FULLER: Yes, it will change.
AUDIENCE: And also do we need to manually delete some of
our composite indexes that were previously created?
ALFRED FULLER: Yes.
So it will require refactoring of your indexes, and the Dev App
server will recommend different indexes.
Notably what it will do is it will never recommend an index
with a repeated property in it.
So if you have the same property repeated twice in the
index, we just won't recommend that, but other than that, it
will always recommend the fastest index, which is the
very explicit one, right?
But it does -- we'll have to make it so that it sees the
indexes you already have and doesn't recommend it
when you don't need it.
AUDIENCE: Yes, now that the SearchableModel class is more
usable, are we able to essentially do what
LIKE queries do?
ALFRED FULLER: There's a large domain of full-text search
queries that you can use with it, but it's not going
to be like queries.
And it's pretty limited.
We are definitely working on a better solution
for full-text search.
We really want to get that out to you guys, but for right now,
SearchableModel is like the only option -- I'm not saying you should
use it for sure, but you can customize it and do a similar
model for your type of data that meets your specific needs.
But no, LIKE is not really working right now.
So there is a place for SQL for sure, and it's just, to get
it to this scale and make it scale automatically, there are
still those restrictions.
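One LIKE-style query that does work on a byte-ordered index is a prefix match, emulated with two inequality filters on the same property; a sketch of the standard trick (the sentinel character and helper name are illustrative, not an official API):

```python
def prefix_filter(values, prefix):
    """Emulate LIKE 'prefix%' as:
    value >= prefix AND value < prefix + '\\ufffd'.

    '\\ufffd' serves as a high sentinel so every string starting with
    the prefix falls inside the half-open range.
    """
    lo, hi = prefix, prefix + "\ufffd"
    return [v for v in sorted(values) if lo <= v < hi]

names = ["alfa", "alfred", "bob", "alice"]
prefix_filter(names, "alf")  # -> ["alfa", "alfred"]
```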
AUDIENCE: Did you say in cursors that we
can now go backwards?
Is that what you were saying?
Like the current cursors?
ALFRED FULLER: Yes.
The current cursors right now can only go forward, which
means they're very useful when you're chaining task queue tasks
and stuff like that, but if you're paging through results,
it becomes much less useful, and I always wanted
them to be able to page both forwards and backwards
in a result set.
And to do that, it does require that you have an index that is
basically the reverse of whatever index you're going
over, but if we encode the raw values, we can actually find the
spot for that cursor in the reverse index, and
thus you can scroll forward in that index, which is
essentially backwards in the other one.
We can also do things like here's an entity, tell me where
it is in this index, or just start on this entity, and we
can actually convert that entity into a cursor, and
so you can do things like that as well.
AUDIENCE: That's for the future, right?
Is that for Python and Java?
ALFRED FULLER: Yeah.
So that's actually on a lower level than the APIs, so if I
write it for one, it's for both.
AUDIENCE: Yeah, you mentioned point-in-region queries?
Geospatial queries?
ALFRED FULLER: Yes.
AUDIENCE: That sounds cool.
Is the region like a bounding box, or is it like a polygon?
ALFRED FULLER: So in the prototype that I have running,
you can specify a circle, a bounding box, or a
polyregion, polylines.
AUDIENCE: Polylines.
ALFRED FULLER: And
then you can -- should I not say this stuff?
I don't know.
Probably not?
Oh, well, in the prototype I have running, which is not
necessarily the final prototype, you can also do --
if you have IN, you can plug multiple regions in there and
it'll do a union, and you can AND that with another IN, and
it will do an intersection of the regions, so you could do
any sort of combination of these regions as well.
AUDIENCE: Could you explain what you meant when you said
the only difference between an ascending and descending
index is the encoding?
ALFRED FULLER: Yeah, sure.
So the indexes we have are sorted lexicographically,
by-the-byte encoding, right?
So an integer value, depending on whether it's big-endian or
little-endian, you know, it kind of will mess with that.
So we have special encodings, right, so that if you have like
an integer, it will encode it such that one is first,
two is second in a byte.
That's why they're fixed length, right?
And so when you do a descending encoding, it sorts it in the
opposite direction, so it's like 255 minus all of those bytes.
So they're encoded differently.
And right now, cursors store the encoded value and not the
raw value, which is why we can't decode them differently
on the back end.
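A tiny illustration of that encoding idea (simplified to unsigned 4-byte integers; not the datastore's exact wire format): the ascending index stores a fixed-length, byte-comparable encoding, and the descending index stores the complement, 255 minus each byte, so a plain byte sort runs in the opposite direction.

```python
def encode_asc(n):
    # Fixed-length big-endian bytes: byte-wise comparison matches
    # numeric order for non-negative integers.
    return n.to_bytes(4, "big")

def encode_desc(n):
    # 255 minus every byte flips the sort order.
    return bytes(255 - b for b in encode_asc(n))

values = [3, 1, 4, 2]
asc = sorted(values, key=encode_asc)    # [1, 2, 3, 4]
desc = sorted(values, key=encode_desc)  # [4, 3, 2, 1]
```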
And we do have a plan.
We do have plans to support full-text search.
The main problem is we have many plans.
Well, actually, the problem is that we have multiple solutions
for this internally, and none of them really fit App
Engine quite right.
And they don't mesh exactly, and getting them to mesh, and
making sure that that mesh is solid and that the user will
have a good experience, is actually very difficult.
So that's what we're working on right now is we're testing
these solutions and making sure that they will work correctly
for you guys before we release them.
So that's why we don't have it yet.
ALFRED FULLER: Oh, yeah?
The gear?
[INAUDIBLE]
Yeah, for some reason, the encoding of this stuff is kinda
weird on here like that one.
[INAUDIBLE]
ALFRED FULLER: Which one?
Where?
[INAUDIBLE]
ALFRED FULLER: Presentation view.
[INAUDIBLE]
ALFRED FULLER: As soon as possible for that one if I can
parse that one correctly.
[READING QUESTION FROM SCREEN]
ALFRED FULLER: Are there any short-term plans to allow
filtering by more than one inequality filter?
So theoretically, the space-filling curve, like the
Hilbert curve, can be expanded into any dimensionality space,
although the higher the dimensionality, the less
efficient your query is going to be, and the harder the
encoding of these indexes.
I mean, you'll need more bytes to get more accuracy
for that encoding.
But theoretically, you can take the Hilbert curve and use that
to apply multiple inequality filters to
any property that can be normalized to like a zero-to-one
range, which does not include strings, right, or --
theoretically it doesn't include real numbers, but, you
know, for floating point numbers it's definitely
possible because there is a max.
So for right now, I plan on adding support for two because
it actually maps very well to the size of an integer field,
64 bits, and in the future, I will try to add more as well.
So this actually is really useful for date range problems
where you have a start and end date, you know, and you want to
find everything that overlaps this other range for the start
and end date, which is actually a very common problem.
So those will be supported by the two-filter case, so we can do those.
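For reference, the Hilbert-curve mapping the answer alludes to can be sketched with the classic bit-manipulation formulation (an illustration of the idea, not the datastore's actual encoding): two bounded dimensions collapse into one curve distance, and a 2-D range becomes a small set of 1-D index ranges.

```python
def xy2d(n, x, y):
    """Map a point on an n-by-n grid (n a power of two) to its
    distance along the Hilbert curve. Nearby cells get nearby
    distances, which is what makes range scans over the curve
    reasonably efficient."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so each level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Every cell gets a unique distance, and consecutive distances are
# always grid-adjacent cells:
cells = {(x, y): xy2d(8, x, y) for x in range(8) for y in range(8)}
```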
[READING QUESTION FROM SCREEN]
ALFRED FULLER: Is it possible to add MapReduce queries like
in -- so the MapReduce framework should use cursors,
and cursors work with all these queries, so you should be able
to use MapReduce over any query.
If you saw the talk earlier, if you didn't, it was very
good, by Mike [? Atasky. ?]
And I would recommend you watch it on YouTube if you haven't.
I think it's much closer to the arrows,
you know, the NOT EQUALS, when it's like less than,
greater than, because that's pretty much what it is.
And before all this work, it wasn't really possible to do
anything but what it does.
And so it was kind of assumed that we should just put it in
and since that's the only thing that's possible, then it
wouldn't ever -- well, it would definitely cause confusion, but
people testing their work would see that, and it would
happen like that.
And it's going to be difficult pushing into the future
to decide when to switch them around.
I think that way, if you specify NOT, if you specify NOT
explicitly, we're going to use Zigzag every time, or if we
can't do your query, we're going to use Zigzag, and in
other cases, we're going to assume they're single-value
properties and do that because that would be the most
efficient way to do it.
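A sketch of that expansion (illustrative only): NOT EQUALS over a sorted, single-valued index becomes two range scans, less-than and greater-than, whose results concatenate in order.

```python
def not_equal(rows, value):
    """Expand "prop != value" into two range subqueries over a
    sorted index and concatenate the results."""
    less = [r for r in rows if r < value]      # prop < value
    greater = [r for r in rows if r > value]   # prop > value
    return less + greater                      # each scan is already sorted

sorted_index = [1, 2, 2, 3, 4]
not_equal(sorted_index, 2)  # -> [1, 3, 4]
```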
Well, this is all going to have to be -- apparently, there's
differences between GQL and the programmatic query framework,
and personally, I recommend you use the programmatic query
framework because it's not string parsing and stuff.
Why do that? Because Python is all about the string
parsing and lookups.
But we're going to have to rework all this to support the
Next Gen so we definitely want parity between the two, future
parity between this and Java, and Java has JDO and JPA, which
have all their own query languages, which are much
richer than GQL, and we'll definitely try also to support
that natively, all these features natively
in there as well.
So none of this is really published.
In fact, I don't think our internal documentation for it
is very good either, which Ryan probably hates me for, because
he's always encouraged me to do that, and especially if a bus
crashed and I was in it, I think you guys would
be all out of luck.
So I think this is the documentation, like
you should watch the YouTube video for now.
[INAUDIBLE]
ALFRED FULLER: Yeah, we are definitely looking for
people to add to the team.
I don't know what that is.
OLAP?
[INAUDIBLE]
ALFRED FULLER: Okay, cool.
Well, thank you all for coming.
If you have any more specific questions or anything like
that, please come and see me.
I am happy to answer any of your questions.
Thank you very much.