ROBERT PUFKY: I'm Robert.
And today I have with me Zach and James.
And today we're going to talk about cloud support and how we
moved our internal support stack into the cloud.
Now I actually want to talk to you guys first and say, how
many of you guys have run traditional IT stacks in the
past or are running them currently?
Just a show of hands.
Wow, OK, a lot of people.
So our experience in moving our stack to the cloud should
be pretty relatable to you.
And for you guys that are already running on App Engine
or are already running on cloud, maybe you'll get some
good tips and tricks about developing on App Engine.
So before we talk about this, I want to talk a little bit
about the background of support at Google and kind of
talk about how we do things.
So we can talk about the problem analysis of where we
were before, the applications that we developed, and where
we ended up.
So the first thing I'm going to talk about is Techstops.
And Techstops are really a central point at Google in a
lot of the offices where anyone can go at any time for
help with any kind of problem.
It could be a technical issue.
It could be software.
It could be hardware issues.
It could really be anything.
And what you expect to find here is you come in, and you
have fun and friendly techs that can answer your questions
quickly, get you on your way, and make you more
productive.
These techs are actually called field techs.
And in smaller offices, they equate to something like an IT
site director or something like that because they also do
a lot of the site work at smaller offices.
We also have a program called the ITR program, or the
Internal Technology Residency program, which is a two year
program where we pull students from colleges that have just
recently graduated.
We put them through a two-year process where we show them how
Google does IT.
And we put them on an ops rotation.
An ops rotation is like a three-month rotation with an
operations team where they can go and learn what
operations teams do.
My team is the Support Engineering Development team.
And both Zach and James did ops rotations with me.
Additionally, they do office rotations, where they go and
work for three months in another office and get an idea
of what all our offices are like.
Now to get back to support at Google: really, our core
goal in support is to make sure that users'
problems are fixed the first time, they're fixed quickly,
and users are happy and productive.
And this is really key.
It leads to a lot of really high customer
satisfaction scores.
But that's not really what we're focusing on.
We're focusing on users being really productive.
And that's where we get into this conversation we're going
to talk about today.
And it's our frustration with the traditional IT stacks that
we were running before.
And these are great, actually.
A lot of IT professionals come in.
They've written applications on these stacks before.
They write quick applications here and there, even build up
some large-scale applications.
And that's great.
But we've found some problems with it.
We actually experienced a lot of application development,
and then we had to scale it to meet our demands.
We have offices all over the world, on every continent
except for Antarctica.
So when we launched an application, we literally had
to make sure that everyone in the world could use it when we
launched it.
And with traditional IT stacks, we really had to focus
not only on the application development but on how we could
scale that to the world.
And this was a really big problem.
Because our team is generally small, we didn't have DBAs and
sysops and SREs.
SREs are Site Reliability Engineers; they essentially keep
stuff up and running.
And we came to a juncture where we had to ask: should
we actually go and hire these people to help scale this
infrastructure for us?
Or should we actually take a look at the core problems in
the group and see if we can come up with a different way,
a better way to solve these issues?
So what we did is we went in, and we did a problem analysis
on our group.
And we tried to figure out what was wrong in our
development group as well as what was wrong in the support
organization and how we could fix this.
So we pondered on these issues a little bit.
And we found the first thing, and probably the biggest
thing, is that we found that maintenance versus innovation
was a huge issue.
We were spending a lot of our time scaling our
infrastructure, keeping it up and running.
Spending a lot of time just making sure it was running as
opposed to innovating in our space and really fixing the
core issues that our team was there to solve.
We also found that a lack of tech information, or
information presented to techs in the Techstops, led to a lot
more escalations in pages and bugs.
And what that really meant is that the end user was not
getting their problem fixed as fast as possible.
So we wanted to make sure that our techs were getting the
information that they needed so they could be empowered to
make the right decisions and get people
on their way faster.
We also looked at the approvals
and how we did approvals.
We found that if a tech is fixing a user's computer, and
their hardware is busted, why should they wait for a manager
two levels away to approve a hardware swap?
Why can't the tech just make the determination that the
hardware's busted and swap it out?
What if the tech is working at a Techstop, and the location
has changed, or the phone number has changed?
Why should they go through an approval mechanism to get
those changes made?
Why can't they just go and make these changes
immediately?
We also looked at the automation for our group.
And what we really wanted to get done here is to remove a
lot of the mundane task work from the Techstops so they can
focus on interesting problems.
We did a little bit of this with traditional IT stacks.
But we really wanted to do this automation work and then
not have to think about it and keep it up and
running later on.
We also looked into self-service.
Now if a Googler takes an average of 5 to 10 minutes to
walk to a Techstop, that means 10 to 20 minutes of that
Googler's day are lost just walking to and from the
Techstop, without actually even getting their problem fixed.
So what we wanted to take a look at is if we could provide
information to our end users so they can fix their own
problems, maybe in two or three minutes, as opposed to
spending all that time walking to and from the Techstop.
And again, this is something that we need to do at a global
scale so that when we launch an application or a
self-service application, everyone can use it.
The last thing was slow applications.
And again, we've touched on this before, so I'm not going
to hammer away at it.
But this is really scaling applications and making sure
they're fast and responsive to people that are using them.
The last thing we looked at, and I'll give you a minute to
look at this graph, is our application history and what
it was like from inception 'til death.
And what you'll see here is a histogram.
And this is basically a histogram of the applications
that we developed over time.
And what we did was we put a trend line over this data.
And we found that there's actually a really nice normal
or bell curve about the average lifespan of our
applications that we were developing.
And this was a key insight into our problem, because we
were fundamentally designing our
applications wrong.
We were originally aiming to build them to be 10- to
15-year applications, to be self-healing, to be
self-replicating, to fail over to other DCs, et cetera, et
cetera, DCs being Data Centers.
And it really turned out that our application lifespan was
only two to four years.
So we only really wanted to focus on getting an
application out, making sure it scaled to the world.
It did what it needed to do.
We could add features when we needed to, and we could turn
it down without actually needing to work on all the
maintenance stuff.
We looked into why applications were lasting only
two to four years, and we came up with a couple interesting
points as well.
First was technology changes.
Some of our systems interacted with tech stacks, and these
tech stacks got deprecated.
So therefore there was no point in having those
applications.
Second, we looked at internal process changes.
We're constantly evolving, and our internal
processes change a lot.
So some applications were deprecated by processes.
And the third, and probably the most important, was the
security and scaling issues that we were having with these
applications.
Again, our team was just a small team of developers.
And we were having a lot of issues scaling our
applications because we needed to get dedicated
resources to do it.
We also looked into the security aspect and found
that in traditional stacks, you have to worry about OS
security, how many users have access to the system, the
security of your serving stack, as well as the
security of your code.
And then what happens when a data center fails, and you
need to fail over to another data center?
We really wanted to remove all of that from the equation and
really focus on writing the application itself.
So with these problems identified, we set forth to
migrate a lot of our applications
to something better.
And I'm going to introduce Zach, who's going to talk a
little bit about Techstop information, which is one of
these applications that we developed using these problems
that we identified and helped solve.
Zach?
ZACH SZAFRAN: My name's Zach Szafran.
I'm an IT resident, and I work in those Techstops that Bob
was just telling you about.
So recently, I had the chance to work with Bob and some of
his support engineers on their cloud-based applications.
Techstop Info is one of those projects I got to work on.
In the next couple slides, I'm going to talk about it and how
we implemented some of its features.
So originally, we had this version-controlled YAML file.
This was the authoritative source for all of our Techstop
information.
Any application that needed this type of Techstop info
would actually have to sync down their own copy
of this YAML file.
So if you take a minute and just look over this example
that we have here, you might see some human errors that
we've included.
These are actual human errors that you could
find in our YAML file.
So maybe by show of hands, does anyone see any obvious
problems with this?
It's kind of hard to tell.
So the first, less obvious problem is that our
indentation is incorrect.
If we ran strict parsing across this YAML file, it
would fail.
The more obvious issue is that our
phone number was incorrect.
And things like this did make it into our YAML file.
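The slide itself isn't reproduced in the transcript, but a hypothetical entry with the two errors just described (a mis-indented key and a bad phone number) might look something like this; the field names here are made up for illustration, not the actual schema:

```yaml
# Hypothetical Techstop entry -- field names are illustrative only.
techstops:
  nyc-floor-2:
    manager: bob
     phone: "+1 212 555 01234"   # extra leading space fails strict parsing;
                                 # the number itself has one digit too many
    location: Building 9, Floor 2
```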
So let's say a technician found this phone number in our
YAML file, and he wanted to actually fix it.
What kind of process would they have to go
through to get this done?
So in order to make this change, they'd have to go to
our repo, find the file, and sync it down
to the local machine.
Once they have the file, they would make their change, fix
the phone number, whatever they need to do.
And then they would save it.
At which point, they would submit it for review, where it
would be checked for spelling errors, syntax,
style guide, et cetera.
And then the reviewer would give the approval and say,
looks good to me.
After this, it's up to the technician to submit this file
back up to the repository, where any application that
depends on the YAML format would have to sync it down.
At which point, they could use it or display it
however they want.
So to make crowdsourcing this a little easier and avoid
that tedious process for making small changes, we set
up this App Engine-based replacement.
We provided a UI for making all these changes.
And in addition to that, we include audit logs for all the
changes that are made.
So now I'm going to demo Techstop Info a little bit,
just so you have an idea of what it's about and what some
of its features are.
So here, a technician could come to our website.
And they're immediately presented a list of Techstops.
So let's say the technician coming here actually wanted to
fix a phone number.
And they know the Techstops managed by Bob.
Using a Gmail-like search query, we can kind of filter
this down a little bit.
But that wasn't really specific enough.
So let's say we also know the Techstop that Bob manages is
on floor two.
And since that's the only result of our query, we're
taken directly into the Techstop, where phone numbers
could be changed or updated.
Also, an important note would be the audit log that
we have down here.
This basically says James was the last person to edit this
Techstop info.
And if we wanted to dig down a little deeper into this, we
could get a full audit trail and actually see all the
changes that were made to this Techstop, who made them, and
what the changes were.
So now that I've shown you Techstop Info and the
Gmail-like search querying, I'm going to explain a little
bit about how we implemented that.
Really, all we ended up doing is filtering based on our
Datastore model's attributes.
So to do this, we set up this GetTechstopList method.
And really, all it does is it takes a query from the UI and
decides what to do with it.
So the first check we do is for a colon in the query.
And that says it is a Gmail-like query.
So we pass it along to the GetListByAttributeQuery
method, which I'll talk about in the next slide.
And then if it isn't a Gmail-like query, we just take
what was entered in the UI, and we use it as a standard
term for a Datastore search.
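The dispatch just described might be sketched like this; a plain list of dicts stands in for Datastore, and the helper name follows the talk while everything else is assumed:

```python
# Sketch of GetTechstopList's dispatch. A list of dicts replaces Datastore.
TECHSTOPS = [
    {"manager": "bob", "floor": "2", "name": "NYC-2"},
    {"manager": "bob", "floor": "4", "name": "NYC-4"},
    {"manager": "alice", "floor": "2", "name": "MTV-2"},
]

def GetListByAttributeQuery(query):
    """Placeholder for the attribute-query parser shown later in the talk."""
    attr, _, value = query.partition(":")
    return [t for t in TECHSTOPS if t.get(attr) == value]

def GetTechstopList(query):
    # A colon marks a Gmail-like "attribute:value" query.
    if ":" in query:
        return GetListByAttributeQuery(query)
    # Otherwise, treat the input as a plain search term (here: a substring
    # match on the name, standing in for a standard Datastore search).
    return [t for t in TECHSTOPS if query in t["name"]]
```

So `GetTechstopList("manager:bob")` takes the attribute branch, while `GetTechstopList("MTV")` falls through to the plain term search.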
So the GetListByAttributeQuery method is what actually parses
a Gmail-like query.
Really, all it does is take this query, split it up into
attribute names and values, and then we start looping
through them all.
So we check our Techstop model to ensure
that it has this attribute.
And if it doesn't, we inform the user that
their query was invalid.
But if it does have it, what we'll do is we'll take our
current list of Techstops, and we'll start pruning out the
ones that don't have that attribute or value.
What we're left with at the end of this is a list of
Techstops that match the entire query, meaning every
attribute and value, and then we return them.
So the IntersectResultsWithAttribute
method does pretty much what it says.
It takes our current list of Techstops, and it prunes the
ones that don't have that attribute or value.
So originally, when we run this method, we don't have any
current results, no Techstops.
So what we do is just a standard Datastore search with
a little bit of filtering involved.
Once we get those current results up, it actually will
start passing it in to here.
And we'll start looping through each of the Techstops
in that result set.
What we're doing is checking to make sure that they have
that attribute or value.
And if they don't, we just don't include it
in our return results.
These three functions here are really what allow us to
implement the search querying.
And it allows our application to quickly provide relevant
information to the techs that are visiting our website.
I mentioned a little earlier in the demo that we have these
audit logs for keeping track of all the changes that happen
to this Techstop info.
We use the SaveTechstop method here for actually writing the
Techstops to the Datastore.
So inside, we have this private transaction method.
And you can see that it's actually putting the Techstop
in the Datastore.
In return, we get a Techstop key, which we can then use to
build this audit entry with.
The audit entry says who made the change, what attributes
were changed, and what their new values are.
And then we write that audit entry to the Datastore, too.
We take that private transaction method, and we run
it in the db.run_in_transaction.
And that ensures that both the Techstop entity and our audit
entity get written to the Datastore.
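The transactional save just described might be shaped like this; App Engine's `db` API isn't available here, so an in-memory dict and a pass-through `run_in_transaction` stand in for it, and all names and key formats are assumptions:

```python
import datetime

# In-memory stand-ins for Datastore and db.run_in_transaction. The real
# code uses google.appengine.ext.db and real entity keys.
DATASTORE = {}

def run_in_transaction(func, *args):
    # The real API retries and rolls back on failure; here we just call the
    # function so the transactional shape stays visible.
    return func(*args)

def SaveTechstop(techstop, changes, editor):
    def _txn():
        # Put the Techstop and get its key back.
        key = "techstop/%s" % techstop["name"]
        DATASTORE[key] = techstop
        # Use that key to build the audit entry: who made the change, what
        # attributes were changed, and their new values.
        audit = {"techstop": key, "editor": editor, "changes": changes,
                 "when": datetime.datetime.utcnow().isoformat()}
        DATASTORE["audit/%s/%s" % (key, audit["when"])] = audit
        return key
    # Running both puts in one transaction ensures the Techstop entity and
    # its audit entity are written together, or not at all.
    return run_in_transaction(_txn)
```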
I also mentioned that we have a corporate database.
What we're doing is we're taking our Datastore with all
of its Techstop information, and we're syncing it out to
this corporate database.
Really, the database is meant for storing all this
org-related information, so applications that depend on
that can easily access it.
With all of our Techstop info in here, all the website has
to do is authenticate to the corporate database, and all
this information's made available to them.
We don't have to administer API access or give direct
App Engine access to our application.
A really good example of the types of services that consume
our Techstop info from the corporate database would be an
internal overlay we have for Google Maps.
This displays all of our corporate locations.
And that just happens to include all of our Techstops.
So at this point, a technician can come to our new
application we have set up, make some changes to a Techstop,
like the name, phone number, or GPS coordinates.
And then upon saving that, it's automatically updated in
this Maps overlay.
And another pretty good example is the fact that all
of our end user documentation, as well as call centers, have
access to our new application.
So they really don't have to worry about a phone number
being wrong or a room number being old.
As long as it's updated in our application, it's made
immediately accessible to them.
So if you take a minute and just look
over this graph here.
What we're showing are all the edits to all Techstop data
that have ever happened.
In the blue, we have all the edits to our YAML file.
And in the red, we have all the audit logs currently in
our data store.
I'm going to switch this to a log scale, just to make it a
little easier to see.
So again in the blue, our YAML edits.
And what this is displaying is about 107 edits over the
period of three years.
In the red, again, are the audit logs.
And what those are displaying are about 1,072 audit logs in
the period of five months.
That's about a 72-fold increase in overall edits to
our Techstop data.
And this should paint a pretty clear picture about what can
happen when you give complete control over the data to the
people that use it and depend on it the most.
There's a couple spikes in this graph that I'd
like to point out.
This very last blue spike was actually caused by us.
Those human errors I talked about earlier needed to be
corrected in order to import all this data into our new application.
So we had to go through and manually correct all this, and
that's what caused that very last spike there.
But after doing that and importing all this into our
application, we were able to release it.
At which point, all of our technicians kind of flooded to
the site to check it out.
And they actually corrected any old, invalid, or missing
data for our Techstops.
And that's what caused this spike here.
And then this last spike was actually caused by us again.
But unlike the YAML spike that I talked about, we didn't
actually have to manually make all these edits.
We were able to send out an email to all of our
technicians saying, we're making a global change,
informing them of this.
And they were actually able to make this change for us.
We essentially were able to crowdsource a global
modification without us having to do very much.
So now that I've talked about Techstop Info and some of its
features, I'm going to pass this off to James, who's going
to talk about another support application called Unified
Travel Manager.
JAMES MEADOR: Thanks, Zach.
Hi, everyone.
My name is James Meador.
And I'm also an IT resident at Google in the Internal
Technology Residency program.
For my operations rotation in this program, I was fortunate
enough to get to spend some time with Bob and the Support
Engineering team that he leads.
The majority of my time on rotation was spent
designing and developing this application called Unified
Travel Manager.
Now the design requirements for this application called
for an application that had a modular framework and not only
eased future development, but encouraged it as well.
Before we get into some of the technical details about
Unified Travel Manager, or UTM, as we frequently call it,
I want to give you a little background information about
how we handle travel at Google.
Now in a typical IT support organization, if a technician
decides to take vacation or gets sick or takes leave, the
responsibilities will be offloaded to another member of
their team or another team entirely.
This can be a particularly large problem for a smaller
office that's run by one or two technicians.
At Google, we try to mitigate the effects of this by using
what we call travel requests.
Now, a travel request is something that a technician
can open if they have a planned absence.
And then another technician can fill that travel request.
The same concept kind of applies for events.
So like Google I/O, we have a need for
technical support here.
And what we can do is we can open a travel request, and
then multiple technicians can fill that travel request.
This can pose some concerns, though.
What if more than one Googler or technician wants to fill
one of these travel requests?
What if a manager thinks that a different technician is
better suited to handle a higher-profile event, like
Google I/O?
This is where we got the idea for the first iteration of
Unified Travel Manager.
The first version was focused on the application process to
a travel request.
So if Bob knew he was headed to the Bahamas for a week, he
could open a travel application that both Zach and
myself could apply to.
Bob or his manager would then be able to select which one of
us was going to be lucky enough to cover his position.
Now the first version of UTM was great.
But after the problem analysis that Bob talked about, we
wanted to move this application to a
no-maintenance cloud solution.
And we really feel that Google App Engine was the best fit
for this application.
The three things in UTM that we want to talk to you about
are the modular framework and design, a distinction between
application administrators and application developers, and
some of the caching strategies that we used, along with an
implementation of what we call NDB, or Next Database, which
is a new Datastore API.
Before we do that, however, I want to give you a quick demo
of UTM so you can kind of see what it looks like and how
the modules work.
Now what you can see here is the admin panel for UTM.
On the left, we've got a manager module
and a travel module.
In the middle, we've got read access groups and
write access groups.
And you can see that technicians and support
engineering are two groups that have read access to the
travel module.
On the right, we've got two on/off switches here.
And these allow us to enable or disable these
modules on the fly.
I'm going to go ahead and open the travel module here.
So what you can see here is a list of
the open travel requests.
Now I'm going to go ahead and create a new one.
Now I'll go ahead and set the destination here to the
Moscone Center and set the start date and end date to the
duration of Google I/O. I'll set the priority to high
because this is kind of important.
I'll also change the needed bodies to three because we
need three technicians.
And I'll specify that this is for Google I/O. I'll go ahead
and save this travel request and head
back out to the overview.
Now if we refresh the page, you can kind of see that it's
not showing up right now.
But what we can do is we head back into the admin panel and
flip this module on and off.
Go ahead and turn the module off, refresh the page.
It disappears from the nav bar.
I'll go ahead and turn the module back on.
The other thing we were going to show you in this
application is the manager module that you saw in the
module list.
Now at Google, we handle travel differently, like we
talked about.
And part of this is we have what we call travel managers.
Now a travel manager is somebody that's responsible
for the travel coming in and going out of a
certain site or region.
And these travel managers have access to what we call the
manager module in our application.
The manager module allows these travel managers to
specify who they manage the travel for.
So in the UI, they can type in the name of a user and then
drag that user to one of the different groups on the site.
They can also specify a group or team, like technicians or
support engineering that we showed you.
And then that'll send off an Ajax request to the back end,
which will then return a list of all the
users in that group.
So now that we've shown you about half of our demo of the
application, we want to go through some of the modular
design of UTM.
Now this is our class structure for the modular
overview of the application.
On the left, we've got webapp.RequestHandler.
And in the middle, we've got ParentRequestHandler.
And on the right, you can see ModuleRequestHandler.
Now each of these are subclasses of each other.
So Parent Request Handler is a subclass of the
webapp Request Handler.
And the Module Request Handler is then, again, a subclass of
that Parent Request Handler.
We think of our relationship between the Parent Request
Handler and the Module Request Handler a lot like the
relationship between a motherboard and
an expansion card.
The motherboard, or our Parent Request Handler, is
responsible for most of the basic routing and other
routine tasks and utilities that are associated with
running the application.
The Module Request Handler, or the expansion card, is focused
on adding features and functionality, the bells and
whistles kinds of things.
Now on top of the class structure, what you see is the
path that an HTTP request will take through the application.
We'll go into more detail about this in a little bit.
But the general gist is that an HTTP request flows all the
way from the webapp Request Handler down to the Module
Request Handler, and then back out to the user through the
webapp Request Handler.
Now to do that, however, we have to know which modules are
responsible for processing which requests.
Now this mapping of URLs to methods happens in the
constructors for the Parent Request Handler and the Module
Request Handler.
Let's take a look at what these look like.
Now I really like tacos.
So we're going to take and build an application here
called IOLunchHandler.
This is going to take care of the ordering of lunch for
Google I/O.
Now this is our Parent Request Handler.
And it's inheriting from webapp.RequestHandler.
This is the constructor for IO Lunch Handler.
Now we'll go ahead and call the constructor for the super.
And we'll initialize two dictionaries, method and
template mappings.
We'll then call InitSubHandler.
We don't want all of our modules to override the
constructor of the Parent Request Handler because in our
live application, we've got some additional functionality
and processing that happens here.
So this is a module called TacoHandler.
It's inheriting from IO Lunch Handler.
And what we'll do is we'll define this Init Sub Handler
method that we just called.
In this case, we'll call a method MapURLToMethod.
Now what this does is it tells the Parent Request Handler
that we're going to process all requests to /tacos with
the DisplayTacos method.
Now remember the Display Tacos name because we're going to
talk about that in a little bit.
Let's look at what the Map URL To Method function looks like.
This method lives on the Parent Request Handler.
And here's the definition of it.
We'll use the URL as the key in these two dictionaries that
we created in the constructor.
And we'll assign a definition of the method in the method
mappings dictionary.
And the template's file name is a string in the template
mappings dictionary.
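Put together, the constructor chain and mapping method just described might look like this; `webapp` is stubbed out with a bare class, the class and method names follow the talk, and the template filename is an assumption:

```python
# Self-contained sketch of the Parent/Module Request Handler constructors.
class RequestHandler(object):
    """Stand-in for webapp.RequestHandler."""
    def __init__(self):
        pass

class IOLunchHandler(RequestHandler):  # the Parent Request Handler
    def __init__(self):
        super(IOLunchHandler, self).__init__()
        # URL -> method and URL -> template filename dictionaries.
        self.method_mappings = {}
        self.template_mappings = {}
        # Modules hook in here instead of overriding the constructor.
        self.InitSubHandler()

    def InitSubHandler(self):
        pass  # overridden by modules

    def MapURLToMethod(self, url, method, template):
        # The URL is the key in both dictionaries: the bound method that
        # processes it, and the template file used to render the response.
        self.method_mappings[url] = method
        self.template_mappings[url] = template

class TacoHandler(IOLunchHandler):     # a Module Request Handler
    def InitSubHandler(self):
        # All requests to /tacos get processed by DisplayTacos.
        self.MapURLToMethod("/tacos", self.DisplayTacos, "tacos.html")

    def DisplayTacos(self):
        return {"taco_types": ["chicken", "beef"]}
```

Constructing a `TacoHandler` runs the parent constructor, which calls `InitSubHandler`, which in turn populates both dictionaries.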
If we have a lot of modules in our application, they can be a
little difficult to keep track of.
So we put together a class called ModuleManager that's
going to help with this.
Module Manager functions a lot like a dictionary in Python,
but with some additional functionality.
We'll use this handler's dictionary here to keep track.
One of the utility methods on the Module Manager is
something called AddHandler.
Now what this is going to do is it's going to take the
definition of a handler.
And it's going to use the handler's name as the key in
that handler's dictionary.
We'll also create an instance of the handler here and store
that as the value in the dictionary.
The reason we're creating an instance is because we want to
make sure that the Init Sub Handler method is called so
that all of the URL to method mapping can take place.
Another utility method on the Module Manager is called
GetURLMappings.
Now what this does is it iterates through the
dictionary's keys.
And it creates a list of tuples
based on the URL and the handler that's responsible for
processing requests destined for that URL.
Now that we have this class defined, this is how we go
about using it.
We'll go ahead and create an instance of the
Module Manager class.
And then we'll add the Taco Handler to the class.
Now that we've got some data in that dictionary, we can use
the Get URL Mappings method here and pass the results of
that to webapp.WSGIApplication.
This is what tells App Engine which URLs map to which
handlers, so it knows where each
request will get routed.
The next thing we want to do is talk about how an HTTP
request goes through our application.
Now that we know which handlers and which modules are
responsible for processing which
requests, we can do this.
Now whenever App Engine sees an HTTP Get request come into
your application, it's going to fire off the Get method on
the subclass of webapp Request Handler.
In our case, that's going to be the Get method on the
Parent Request Handler.
So let's take a look at what that looks like.
This is the I/O Lunch Handler again, our
Parent Request Handler.
And the definition of the Get method is right here.
Now we'll use those dictionaries
that we just populated.
And we'll get the path for the HTTP request
from the request object.
We'll then look up the Get method that's responsible for
processing this HTTP request and store that here.
We'll do the same for the template's file name.
Now we'll call that Get method.
We'll store the results of this Get method in
template_params.
So we're going to jump out of the Parent Request Handler and
into the Module Request Handler while we
run this Get method.
So we just left the Parent Request Handler's call module
method portion, and we're headed into the module method
portion of the Module Request Handler.
So here's the Taco Handler module we showed you a little
earlier and the Display Tacos method that I mentioned.
Now we'll create a dictionary here with some pretty basic
taco types in it, chicken and beef, for this example.
And we'll return that dictionary back to the Parent
Request Handler for processing.
We're then stepping out of the module and back into the
Parent Request Handler for some template rendering.
Let's take a look at what this looks like.
This is still happening in that Get method.
So what we'll do is we'll call App Engine's template.render.
We'll pass in the file name and the dictionary with the
parameters to render to that template.
And then we'll render that response to the output stream.
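The full Get round trip just walked through, parent handler to module and back out through template rendering, might be sketched like this; the "renderer" is a stand-in for App Engine's `template.render`, and the output stream is just a list:

```python
# Sketch of the Get flow: look up the module method and template by request
# path, run the method, render the template, write the response.
def render(template_name, params):
    """Stand-in for App Engine's template.render."""
    return "%s rendered with %s" % (template_name, sorted(params))

class IOLunchHandler(object):          # the Parent Request Handler
    def __init__(self):
        self.method_mappings = {}
        self.template_mappings = {}
        self.output = []               # stand-in for the output stream
        self.InitSubHandler()

    def InitSubHandler(self):
        pass

    def get(self, path):
        # Look up which module method and template own this path.
        method = self.method_mappings[path]
        template = self.template_mappings[path]
        # Jump into the module; it hands back template parameters.
        template_params = method()
        # Back in the parent handler: render and write the response.
        self.output.append(render(template, template_params))

class TacoHandler(IOLunchHandler):     # a Module Request Handler
    def InitSubHandler(self):
        self.method_mappings["/tacos"] = self.DisplayTacos
        self.template_mappings["/tacos"] = "tacos.html"

    def DisplayTacos(self):
        return {"taco_types": ["chicken", "beef"]}
```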
Now this design lets us do some interesting things.
The first that we're going to talk about is the ability to
administratively disable modules.
Now we showed you briefly in the demo how we had those
on/off switches that allow us to toggle on
or off these modules.
A lot of this processing and validation happens in the
constructor for the Parent Request Handler.
So here's the Taco Handler module again.
And we're going to assign a module ID of "tacos". We use
this ID whenever we store some basic data about our modules
in Datastore.
Next, what you see is a method called ValidateSubHandler.
Now this method is on the Parent Request Handler.
And it's responsible for making sure that all of the
modules are implementing the module design correctly.
The first thing we'll do is we'll make sure that the
module has defined an ID.
If it hasn't, then we'll raise an attribute error.
If we do have a module ID, then we'll fetch our module
entity from Datastore based on that module ID.
If Datastore doesn't return anything, or it does return a
module, but the module's disabled, then we'll raise a
NotConfiguredError.
This is just a custom error in our application that lets us
know whenever a module's accessed when it
shouldn't have been.
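The validation flow described above can be sketched like this. The names are our guesses at the talk's code, and a plain dict stands in for the module entities in Datastore:

```python
# Sketch of the module-validation flow described above. A dict stands in
# for Datastore; the real code fetches a Module entity by its ID.

class NotConfiguredError(Exception):
    """Raised when a disabled or unknown module is accessed."""

# Stand-in for the module entities stored in Datastore.
MODULE_DATASTORE = {'tacos': {'enabled': True},
                    'lunch': {'enabled': False}}

def validate_sub_handler(module):
    # Every module must define an ID.
    module_id = getattr(module, 'ID', None)
    if module_id is None:
        raise AttributeError('module must define an ID')
    # Fetch the module entity; missing or disabled modules are rejected.
    entity = MODULE_DATASTORE.get(module_id)
    if entity is None or not entity['enabled']:
        raise NotConfiguredError('module %r is not enabled' % module_id)
    return entity

class TacoHandler(object):
    ID = 'tacos'

class LunchHandler(object):
    ID = 'lunch'

validate_sub_handler(TacoHandler())    # passes: configured and enabled
# validate_sub_handler(LunchHandler()) # would raise NotConfiguredError
```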
Another thing we can do with this design is implement
granular access control at the module level.
Now App Engine provides you with 10 application
administrators and application developers in the console that
you can configure.
But we wanted to expand on that.
We were already implementing our own user
authentication model.
So this was actually kind of simple.
Let's show you what we did.
Now what you can see here is a user and a module model.
Now the user model has a list of groups because the user can
be a member of multiple groups.
For instance, Zach is a member of both technicians and
support engineering.
And a module has an ACL right here.
You can see read groups, write groups, and admin groups.
Now if a technician needed access to a specific module,
then the technicians group would be on one of these ACLs,
let's say, for instance, the admin group.
But how do we intersect the user's groups with the
module's ACL?
Well, we use a decorator.
So in our decorators.py file, we have a decorator called
RequiresModuleAdmin.
Now the method that we're decorating that's passed in is
the handler_method argument of this method.
IsModuleAdmin is the closure
method inside this decorator.
Now what it's going to do is it's going to use App Engine's
users API to fetch the currently logged
in user's user name.
We'll then take that user name and fetch our
user entity from Datastore.
We'll also get a list of the admin groups
for the current module.
If we have a user entity in our Datastore for the
currently logged in user, then we'll intersect that user's
groups with the admin groups for the module.
If there is an intersection there, then we'll go ahead and
call the handler method that we're decorating with the
arguments that were passed in.
If there's no intersection, then we'll simply error out
with a 403.
Now putting this functionality in a decorator makes it really
simple to use and implement throughout all of our modules.
What we can do is any method that requires the currently
logged in user to have module admin rights, we can just put
the decorator on top of the method definition.
In this situation, it's just a ChangeLunchItems method.
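Here is a stand-in sketch of that decorator and the group intersection it performs. The real code uses App Engine's users API and Datastore entities; here plain dicts stand in for both, and the names are our approximations:

```python
import functools

# Sketch of the ACL decorator described above. Dicts stand in for the
# user and module entities in Datastore, and CURRENT_USER stands in for
# App Engine's users.get_current_user().

USERS = {'zach': {'groups': {'technicians', 'support-engineering'}}}
MODULES = {'lunch': {'admin_groups': {'technicians'}}}

CURRENT_USER = 'zach'

class ForbiddenError(Exception):
    """Stand-in for returning a 403 response."""

def requires_module_admin(handler_method):
    @functools.wraps(handler_method)
    def is_module_admin(self, *args, **kwargs):
        user = USERS.get(CURRENT_USER)
        admin_groups = MODULES[self.ID]['admin_groups']
        # Intersect the user's groups with the module's admin groups.
        if user and user['groups'] & admin_groups:
            return handler_method(self, *args, **kwargs)
        raise ForbiddenError('403: %s is not a module admin' % CURRENT_USER)
    return is_module_admin

class LunchHandler(object):
    ID = 'lunch'

    @requires_module_admin
    def change_lunch_items(self):
        return 'lunch items changed'

print(LunchHandler().change_lunch_items())
```

Because the check lives in one decorator, adding access control to a new module method is a one-line change.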
So now that we've talked about the modular overview and
design of UTM and a distinction between
application administrators and developers, we want to talk a
little bit about our caching strategies that we used in
this application.
Now who here likes a slow website?
That's unfortunate.
So slow websites waste people's time.
If you load a page, and it takes more than five seconds
to load, you've probably already moved on
to something else.
Now this was one of the main reasons we moved our
application into the cloud.
We wanted to make sure that page load times were quick.
We wanted to make sure that users didn't have to wait
forever whenever they were trying to access the data that
they needed.
And one of the things that we did was we implemented a
memcache to store the results of all of our
expensive method calls.
We also did this through a decorator.
You'll see the definition of a cache decorator here with
function passed in as the argument.
Now CachedFunction is the method here that's going to do
most of our heavy lifting.
We'll build a key based on the function's name that we're
decorating and a list of the ordered arguments and keyword
arguments that were passed in to the method.
Once we have this key, we'll try to fetch
a value from memcache.
If memcache has anything stored, then we'll simply
return that back out to the caller.
But if we don't have anything in memcache, we'll need to
call the method that we're decorating, store those
results in memcache, and then return
that back to the caller.
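The caching decorator just described can be sketched like this, with a plain dict standing in for App Engine's memcache (the real code calls memcache.get and memcache.set; the names here are our approximations):

```python
import functools

# Sketch of the caching decorator described above. CACHE stands in for
# App Engine's memcache service.

CACHE = {}

def cached(function):
    @functools.wraps(function)
    def cached_function(*args, **kwargs):
        # Build a key from the function name and its ordered arguments.
        key = '%s:%r:%r' % (function.__name__, args, sorted(kwargs.items()))
        if key in CACHE:                    # cache hit: return stored value
            return CACHE[key]
        result = function(*args, **kwargs)  # miss: run the expensive call
        CACHE[key] = result                 # ...and store it for next time
        return result
    return cached_function

CALLS = []

@cached
def expensive_lookup(user):
    CALLS.append(user)          # track how often the real work runs
    return 'results for %s' % user

expensive_lookup('zach')   # computed and cached
expensive_lookup('zach')   # served from the cache; CALLS has one entry
```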
Another thing that we did was we implemented something
called NDB.
I mentioned earlier that this is a new API on top of App
Engine's Datastore.
Now one of the features that NDB provides is something
called structured properties.
Structured properties allow you to retain some of the
organizational benefits that a normalized database gives you,
but in a non-relational database like
App Engine's Datastore.
Let's take a look at what we did.
So what you can see here is a travel application model
that's based on ndb.Model instead of DB.
We'll create a user property here just for the
sake of this example.
We'll also define a travel request model here.
Now we talked about how UTM is focused on the application
process for these travel requests.
This is actually an abbreviated example of exactly
what we do in our application.
So we have an approved_applications property
on the travel request itself.
We'll define this as an ndb.StructuredProperty that's
based on the travel application.
We'll tell NDB that this is a repeated property by setting
the repeated argument to True.
Now what this does is it allows us to store data about
the application on the request object.
This is how we'd go about putting one of these requests
and applications in Datastore.
We'll create a travel application with the user
property set to the currently logged in user.
We'll then create an empty travel request.
We'll set the approved applications property on this
travel request to a list of length one with the
application in that list.
We'll then put the travel request into Datastore.
You'll note, however, that we never actually store the
travel application itself.
It's an ndb.Model, but we never called the .put method.
The reason for this is the travel application is stored
on the travel request itself.
Whenever you put a travel request into Datastore, it's
going to store the travel application
information with it.
Whenever you fetch the travel request from Datastore, NDB's
going to reconstruct that travel application
model on the fly.
So if you were to have any utility methods defined on the
NDB model, in this case, travel application, all those
methods would just work.
There wouldn't be any problem with getting those methods to
map correctly.
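A pure-Python stand-in for the pattern just described looks like this. The real code uses ndb.Model, ndb.StructuredProperty(TravelApplication, repeated=True), and .put(); this sketch only shows the shape of the data and why the parent is the only thing that gets put:

```python
# Stand-in for the ndb.StructuredProperty pattern described above.
# DATASTORE stands in for App Engine's Datastore.

DATASTORE = {}

class TravelApplication(object):
    def __init__(self, user):
        self.user = user

    def summary(self):
        # A utility method that still works after reconstruction.
        return 'application for %s' % self.user

class TravelRequest(object):
    def __init__(self, approved_applications=None):
        # The applications live *on* the request, like a repeated
        # StructuredProperty: no separate child entities to fetch.
        self.approved_applications = approved_applications or []

    def put(self, key):
        # Putting the request stores the embedded applications with it.
        DATASTORE[key] = {
            'approved_applications':
                [{'user': a.user} for a in self.approved_applications]}

def get_travel_request(key):
    # Fetching reconstructs the TravelApplication models on the fly.
    raw = DATASTORE[key]
    apps = [TravelApplication(d['user'])
            for d in raw['approved_applications']]
    return TravelRequest(apps)

application = TravelApplication(user='zach')   # never put directly
request = TravelRequest(approved_applications=[application])
request.put('req-1')                           # one put stores both
fetched = get_travel_request('req-1')
print(fetched.approved_applications[0].summary())
```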
Now this is a method called GetTravelRequests.
This is a method that's in our application that fetches all
of the travel requests from Datastore.
It performs some processing on those requests.
And then it returns the results of that processing
back to the caller.
This is one of the methods that does a lot of processing
and Datastore operations, so we wanted to use the caching
decorator on this method.
We don't want the expensive processing to run every time we hit the method.
So we'll check memcache to see if we have
something stored for it.
If memcache has anything, we'll just return that back to
the caller.
But if it doesn't, we'll go through this processing, store
the results, and serve them from the cache the next time we
call the function.
The other thing I want to show you is a feature called hooks
that NDB provides.
These aren't available in the original DB API, but NDB
provides them for you.
Now this is a travel request model.
And we're going to define what we call a pre_put_hook.
Now NDB's going to fire off this method before you perform
any Datastore Puts on a travel request in this situation.
For this example, we'll just call memcache.flush_all, but
we could really do anything we want here.
Zach showed you how we used audit trails in Techstop Info.
We could actually use that same logic here.
Now these hooks are available for all of your Datastore
operations, your Puts, your Gets, and your Deletes.
And you can specify the hook to run before or after those
operations.
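The hook mechanism just described can be sketched like this. On a real ndb.Model you define a method named _pre_put_hook and NDB calls it for you; here a stand-in put method fires it explicitly, and the memcache object is a fake:

```python
# Sketch of NDB's pre-put hook described above. FLUSHED records whether
# the hook ran; FakeMemcache stands in for App Engine's memcache API.

FLUSHED = []

class FakeMemcache(object):
    @staticmethod
    def flush_all():
        FLUSHED.append(True)

memcache = FakeMemcache()

class TravelRequest(object):
    def _pre_put_hook(self):
        # Fired before every put; a good place to invalidate caches or,
        # as mentioned above, write an audit-trail entry.
        memcache.flush_all()

    def put(self):
        self._pre_put_hook()   # NDB calls this for you on the real API
        # ...store the entity...

TravelRequest().put()
assert FLUSHED  # the hook ran before the put
```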
Now implementing all of these caching strategies led to some
pretty astounding results.
Now what you can see here is worst case cold cache load
times for a reporting page in our application.
Now this reporting page fetched a lot of travel
requests from Datastore.
So it did some processing on this as well, which is why the
DB load time is at 21 seconds.
Now after implementing NDB structured properties, we got
this down to about 1 and 1/2 seconds.
The reason for this is all of the travel requests in NDB now
store all of the travel applications on the travel
request object itself.
So with DB implementation, the relationship between travel
requests and travel applications was that of a
parent/child relationship.
So any time we'd fetch a travel request from Datastore,
we'd have to go and fetch the travel applications as well.
With the NDB implementation, this was a lot quicker, thanks
to structured properties.
Now we will note that if you were to switch directly from
DB to using NDB without implementing things like
structured properties or any other caching, you might
notice a slight performance decrease simply because NDB
has a little bit more overhead for the new
features that it adds.
Now that we've talked about some of the caching strategies
for UTM, and we've talked about the modular design of
the application, I'm going to pass it back off to Bob for
some results and conclusions.
ROBERT PUFKY: Thanks, James.
You know, they say live demos never work, and I think we
just proved that.
So I want to just wrap this up and talk about the results we
got from moving our application stack to the cloud
and the benefits that our team got from it, as well as our
organization.
And the first thing that I want to talk about is Platform
as a Service, and how it actually really works for us.
It's really a paradigm shift in application development for
us in the sense that we're focused on actually developing
and solving fundamental issues and innovating in our space,
as opposed to developing and maintaining stuff at the same
time.
We're focusing on the people at hand and the problems and
not focusing on keeping stuff up and running.
It's also no longer part of your core tasks to keep these
services running.
You don't have to worry about stuff going down.
We also found out as we started moving more and more
to the cloud that we had a lot of really intrinsic benefits
to this as well.
There was a definite hump when we got about partway through
our cloud migration that we spent a lot less time late
nights and weekends at work.
And this was obviously maintenance, putting out fires
like we were doing in the past.
All that was gone.
So we actually had a really large morale
increase on our team.
It also gave us a better work-life balance.
People would come in, and they'd know they would work
Monday through Friday, nine to five.
And they could go camping on the weekend and not worry
about carrying a pager or about something going down and
needing a failover.
And this was pretty huge.
And really, at the end of the day, guys, it was about our
end users and making sure that they were up and running and
being productive as possible.
And the cloud platform, App Engine, allowed us to do this
as developers by letting us focus on the problems at hand,
instantly scale to the world, and enable our techs to have
the information at hand to make empowered decisions to get
people out of the Techstop faster and be more productive.
Additionally, our users, Googlers, are happier as well
because they can actually get their work done.
They don't need to wait for problems to be solved or
information to be had.
So that's it with the presentation.
We have about a minute 30 left, I think, before we're
switching sessions.
So thank you guys very much for attending.
I really appreciate it.
And--
[APPLAUSE]
ROBERT PUFKY: If you have any questions that we can't answer
during the Q and A, since it's going to be so short, just
come and find me, and we can talk about it.