Tip:
Highlight text to annotate it
X
CLANCY CHILDS: Hello.
Thank you for attending.
In this session, we're going to be speaking about two
Google products, Google AdSense and Google Analytics,
which are now integrating with Google BigQuery, Google's big
data analysis and query engine.
My name's Clancy Childs.
I'm a product manager for Google Analytics.
LOUIS COLLARD: I'm Louis Collard.
I'm a software engineer on AdSense.
DUNCAN MCKIE: And I'm Duncan Mckie.
I work for Gumtree.com.
CLANCY CHILDS: A key design principle for all Google
products is to make sure that they are easy to use by the
average user.
This means that for reporting products, like Google
Analytics and AdSense, we really focus on making sure
that users can get to the data very quickly, with very low
latency, and don't require a lot of deep domain knowledge
about the data that's contained
within those reports.
Really, the idea is that we want these users to quickly
get to the data and get some insights out of it and then go
and take action.
However, we also know that there's a lot of users at our
clients-- a lot of our clients have in their teams data
scientists, data-savvy analysts and developers, who
really want a much more granular access to the data
within these products.
So for instance, some of the use cases that we see here are
the developers that want a SQL-like query interface so
they can actually get into the granular data and really make
complex queries on the data within there to sort of solve
specific problems or optimize.
They also tend to do things like use some of the business
class, the enterprise class visualization tools, things
like Tableau, which some of you may be familiar with.
What they look for is getting this data into Tableau and
doing things like building enterprise-wide dashboards so
that a company can see at a glance how their business is
doing using data from different data sets.
And a lot of times, what we've heard is that these companies
want to get their AdSense and Analytics data into those
visualization tools as well.
These users also tend to really rely on their own data
warehouses, so in many cases, what they would like to do is
join up data from different data sources
into a single place.
So for instance, an online hotel booking site may have a
database that contains all of their profit margins for each
online booking.
And what they would like to do is join that with data from,
let's say, Google Analytics and be able to say not only
what are my most profitable transactions, but actually
what were the online user behaviors that led to those
transactions?
So they want to be able to join this data all together in
the data warehouses.
Likewise, they may want to use this granular data as an input
into other business applications
that they may have.
For instance, an online retailer may have a
personalization engine application that they've built
for their home page.
And they want to take data from Google Analytics and from
their CRM system, join that data together, and when a user
lands on their home page, have that application display to
that user a personalized view of particular products that
that user may be interested in purchasing.
So all of these different use cases really require a very
deep access to the data contained within these
products for these data scientist users.
LOUIS COLLARD: Thanks, Clancy.
So this is where BigQuery comes in.
It's Google's big data analysis engine.
And it's been designed from the beginning with these kinds
of use cases in mind.
So we decided the best thing to do would be to leave our
existing interfaces as they are for standard users and
make our data available in BigQuery so that data
scientists can get the access they need and take advantage
of BigQuery's existing functionality.
So there's a few features I'd like to call out which we
think make this a really good fit.
So firstly scale, you can query billions of rows of data
interactively.
So you can type an SQL-like statement into your browser
and watch the results come back almost straightaway.
You can also query from the command line or from the API.
And as Clancy mentioned, you're probably going to want
to join the Google data we're making available with some of
your own data.
And Big Query gives you a couple of options here.
You can either upload your data into BigQuery and do all
your analysis in the cloud, which can be pretty
cost-effective as it saves you having to own and maintain an
equivalent hardware and software stack.
Or if you are already using some third-party data
warehousing and analysis solution, then BigQuery
already has good integration with a number of them.
So if you're using one of those, you can just import the
data from BigQuery into your existing solution.
If you are going to do that, you can still take advantage
of BigQuery to reduce the volume of data that you have
to import, either just by being selective about the rows
that you import or by performing some
pre-aggregation in BigQuery.
So I'm going to talk now about the AdSense data
we're making available.
Hopefully, this looks familiar to some of you.
It's the same data which is available in
the UI and the API.
And we've got this split into five tables.
The basic DailyReport table exposes all of the common
dimensions.
And then the AdUnit, CustomChannel, and UrlChannel
dimensions are each available in their own respective
tables, which all expose all of the common
dimensions as well.
And lastly, we've got the Domain table, which exposes
Date, Ad Client, Product, and Domain.
So all of these tables expose the same five raw AdSense
metrics that you see here.
You'll notice that CPC, CTR, and RPM are
missing from that list.
That's because those are all derived from these metrics,
which is something you can do easily as part of your query.
So compared to the UI, this gives you quite a bit of extra
flexibility, in that you can request any combination of the
dimensions and metrics that are available in those tables.
And you can run much larger reports than
you can in the UI.
The data is still retrieved live, so if you're watching
for updates throughout the day, you'll see those
reflected in BigQuery.
And the data will match what you see in
the UI at all times.
And obviously, it's still free to query this data, so you
won't be charged by BigQuery for querying these tables.
I've got here a very simple query, just to retrieve your
top channels by earnings, just to give you an idea of what it
would look like, really.
You can see in blue where it's selecting and grouping by
custom_channel_id.
And then in green, we're retrieving and
ordering by the earnings.
You probably already know your top five custom channels, so
more interesting might be channels that have grown the
most month-on-month.
This is something that's not readily available in the UI,
but in BigQuery is quite easy to retrieve.
We've got here a query which is joining the results of two
subqueries.
In blue, we're retrieving a customer
channel report for January.
And then in green-- this is truncated so it fits--
the same thing, but for February.
Then the outer query in red is joining the results of those
two subqueries on custom_channel_name, then
calculating the delta on the number of clicks
and ordering by that.
So if I switch to the BigQuery browser tool, you can see I've
got the full query here.
If I just press Run on that, we should pretty quickly get
back an answer.
And here we've got some helpfully named custom
channels and the number of additional clicks they got in
February compared to January.
Lastly, we mentioned third party integration.
I'm going to show you a few of the kinds of things that you
can do using Tableau, which is one of the products that
already integrates with BigQuery.
On the left-hand side here, we've got the dimensions and
metrics that are available in the BigQuery table.
And we're looking here at a graph of earnings from
November to end of January.
So you'll notice there's a fairly strong weekly
seasonality to the data.
And then we've got a much bigger dip here, which turns
out to be Christmas.
Interesting, nothing particularly actionable at
this point.
So let's dig a bit deeper.
If we add, say, platform, then we can very quickly see an
interesting trend pop up after Christmas, which is that the
orange and red sections, which are mobile and tablet devices,
show quite a noticeable uplift in revenue after Christmas.
So that would seem to indicate people got some new devices
for Christmas, so that kind of makes sense.
So we could decide to stop here and go away and optimize
our site for mobile.
In this example, we actually have different sites for
different countries.
So it's probably worth checking if all the countries
are showing the same trends.
So if I just exclude desktop for now, seeing as we're not
interested in that at the moment, and then if I merge
these two together, seeing as they're behaving roughly the
same, change this to a line, and then add Country, then we
see another interesting trend.
So our two main earners here are Australia in brown and the
UK in gray.
And it looks like it's been a pretty good
Christmas for the UK.
The uplift in earnings there is a lot more noticeable than
for Australia.
So at this point, we can go away and optimize our UK site
for mobile first, knowing that that will
have the biggest impact.
We're pleased to announce that there's a beta version of this
integration available for all AdSense publishers from today.
All you need to do to get started is sign up for a
BigQuery project using the same Google login as you use
for your AdSense accounts.
And that's it.
Then you can go and start querying the tables.
We've got step-by-step instructions on all of this on
the AdSense Developer site.
As I said, it is a beta, so there probably
will be some bugs.
If you find them, please let us know so we can fix them.
And the best place to do that is on the forum.
Thanks very much.
I'm going to pass back to Clancy.
CLANCY CHILDS: Cool.
[APPLAUSE]
CLANCY CHILDS: Thanks a lot, Louis.
Similar to the AdSense integration with BigQuery,
we're here today to announce what we're doing with Google
Analytics and BigQuery.
So what are we announcing?
We are targeting in September of this year a public launch
where Google Analytics Premium clients will be able to have
their data from Google Analytics Premium imported
into BigQuery.
What we're doing today is we're also opening up a
developer group for developers that may be interested in
working with this BigQuery integration.
So I'll give you in a few minutes a link that will take
you to a form where you can register your interest in this
particular integration.
What are we actually exporting?
It's a daily export of all of the data within a Google
Analytics profile into BigQuery, so
it's a daily export.
And the schema, which we'll be sharing on the developers
group shortly, takes the shape of every row in BigQuery in
this integration corresponds to a session or visit in the
Google Analytics profile.
And so it will have information on there like any
sort of visitor or visit-level custom dimensions or
variables, things like traffic source.
All of that visit-level information is in each row.
But also within each of those rows is a nested sequence of
all the hits that happened in that visit.
So this would mean that, as it's in a sequence, in the
order that they occurred, every page view, event,
e-commerce transaction, and if you're using the mobile SDKs
for Google Analytics, things like
exceptions and screen views.
So you really actually get the hit-level data, as well as the
session-level data in each of these rows in
the BigQuery export.
Where possible, we are using the same dimension and metric
names that we use for the Core Reporting
API in Google Analytics.
We're using those as the column names in BigQuery.
So if you already use the Core Reporting API for Google
Analytics, you will have a certain amount of familiarity
with the dimension names and metric names in BigQuery.
One important note is that at launch, this is going to
contain first-party data.
That means any of the data that's actually collected by
either the Google Analytics measurement protocol or any of
the tracking code or the SDKs for mobile apps.
There's some data in Google Analytics, for instance, GOIP
lookup, so the geography data, that's actually post-processed
into Google Analytics.
And that data might not be available
at launch in September.
However, we're going to continually look to expand the
number of dimensions and metrics that we can pass
through in this export into BigQuery.
Starting today, if you're interested in learning more
about this integration, you can go to this short link or
scan the QR code.
There will be a form there.
And that form, if you just explain to us what you're
hoping to do, we're just trying to collect some data
about exactly how people are looking forward to using this
integration.
And we'll add you to the group, where we will be
sharing things like the schema when it's ready to be shared,
things like a sample data set, and all that sort of
information especially as we come up to launch on this.
Please join.
And we'll be posting everything there shortly.
But I can talk about all the things we're going to do.
What makes a lot more sense is for me to invite
Duncan McKie up.
Duncan is a senior--
[INAUDIBLE].
DUNCAN MCKIE: I like to make a ***.
CLANCY CHILDS: So Duncan is a senior business analyst for
eBay Classifieds Group.
EBay Classified--
are you done?
DUNCAN MCKIE: I think I'm done.
I think I'm done.
Sorry guys.
CLANCY CHILDS: EBay Classifieds Group is a Google
Analytics premium client.
They're also one of our larger AdSense publishers as well.
So we've given Duncan early access to both of these
integrations.
And he's been able to run a bunch of queries and do a
bunch of really interesting things with this.
And he's going to take us through now some of the demos
that he's prepared.
DUNCAN MCKIE: Thank you, Clancy.
Sorry, everybody, for the ***.
Before we get started, can I just get a
quick show of hands.
How many people managed to attend the Big Data Mashup
session earlier?
OK.
So we've got quite a few in the room.
How many people managed to attend the cross-platform GA
optimization earlier?
OK.
So it's good to see there's not too many people here.
We'll have lots of new content for you, then.
OK.
So I'll start off just talking about who I am.
I work for Gumtree.com, which is part of
eBay Classifieds Group.
And we had probably around 42 million
users in the last month.
That's across 27 countries across the globe.
And unlike other e-commerce websites, although we connect
our buyers and our sellers together online initially, a
lot of those actually go on to transact offline.
So for us, we like to think of ourselves as a local company
even though we are global.
Now, every time one of our users interacts with us, be
that on a website, on one of our apps, or even through one
of our sales teams, that generates data.
We have petabytes of the stuff that's in MySQL, Postgres,
Salesforce.
It all gets pulled together into a massive Teradata
warehouse, which I think at the last count was round about
10 petabytes.
And we also have an awful lot of external data as well,
things like Analytics and AdSense that we'll talk about
today, and a lot of other stuff, like DFP, Paypal, loads
of different sources of data.
Now we also have a lot of different brands.
We have a lot of different sites across the globe.
And that means that each of our sites and countries
operates independently.
And we have a lot of these different stores of data.
So all of our challenges are kind of
magnified by about 30 times.
We have all of our data in different locations in
different formats.
And the real challenge here is how do we pull that together
and make it useful?
How do we turn data into insight?
Now for some of our pro users, these guys who know how to
connect to the different systems, they know how to go
in and get this data themselves, it's fine.
It's a little bit of a challenge, but it's possible.
However, for our non-power users, people like our
marketeers, our project managers, the people who need
the information to be able to make data-driven decisions,
it's not so easy.
We really want to be able to pull all of this together and
put it in one place so it's accessible for them and as
useful as possible.
Now how does BigQuery fit into this?
Well, I'd like to talk about how we use Analytics and
AdSense and pull these into BigQuery, and we can use that
as a central location for some of our users to get the data.
I'd like to take you through a couple of different examples.
The first one is this.
We all hate it when this happens, right?
So picture the scene.
Imagine me as a hipster.
I've just seen this awesome new blue single-speed bike
that I really want to buy and make all the other hipster
kids jealous.
And I'm just about to reply to Clancy, who's selling it, when
I see one of these arrows.
And the worst case is I think, I'll reply tomorrow.
It'll be fine.
And then the next day, I see some other kid riding around
the neighborhood on my bike.
The worst case scenario is that makes me never come back
to Gumtree.
And these kind of issues actually cost us users.
It's a bad user experience.
And we want to stop it as much as possible.
Unfortunately, it's not always simple to do that in the
current Google Analytics user interface, although there are
some excellent navigational tools that help us page
backwards and forwards through pageviews.
Some of our errors are actually caused by things that
are not pageviews, things like AJAX requests, things like
events that we're tracking.
Now that we can actually get the hit-level data in
BigQuery, we can page backwards and forwards through
these user journeys and really start to examine them So let's
talk through how we might do that.
Here you can see a query that's just requesting some of
our Analytics data.
And in red, you'll see that we've identified our errors
using a custom variable.
And we're doing a REGEXP_MATCH on that at the bottom.
Also, you can see in green, we're pulling out the hit
number for where the error occurs.
And the information about the session is in black.
The next step then is to do a self-JOIN back on this data.
And I'd like to show you a few things about this.
First of all, in blue you can see that we're JOINing on the
session data.
Slightly on top of that, you can see that we're using the
word EACH here.
Now the reason why we're doing that is because this is a very
big data set.
We're talking millions of rows.
If it's just a smaller data set, you
don't need to use EACH.
The final thing I'd like to show you
is the use of FLATTEN.
Now, because this Analytics data is in a nested JSON
format, if we want to be able to JOIN on these repeated
fields, we need to flatten it out and turn it into a virtual
flat format.
The final step then is to do one more self-JOIN.
We can use the hit data that we've pulled out and the visit
data and then pull out all of the meta-information about
this, things like the hour when it occurred, the page
name of the hit that directly preceded our error, and the
number of issues that we're seeing.
And if we can just switch over into BigQuery, we can give
that a run.
Now this is scanning through approximately 14 million rows
just for one of these derived tables.
We've done that three times, so that's 42 million rows.
And that takes 10 seconds.
It's blindingly fast.
So having a table of results is useful.
But the great thing is being able to plug this into one of
our BI tools, something like Tableau, which we're seeing a
lot of love for at the moment, and being able to track this
over time and automate the reporting process.
From the table, we can just see that Search has given us
1,775 errors.
But if we actually map this over time, we get a much more
interesting picture, and we can see how these errors have
evolved and which of the different pages have caused
the most errors.
Now we're showing an example of using errors here.
But the great thing about having all of our hits in
BigQuery is that we can investigate any sequence.
So if you were wanting to look at a conversion funnel, for
example, started with a pageview, and then the user
goes on to fill in a form and then later makes a purchase,
we can do that.
It's not currently possible in the user
interface in Google Analytics.
We can only do a pageview funnel.
So being able to get that kind of visibility is really,
really useful.
And the even better thing about it is we can look at
that historically.
We don't have to apply this in advance.
The next use case I'd like to talk about is how we segment
our onsite activity at Gumtree.
Now whenever an ad is posted, it has a number of different
attributes.
Some of those are simple, such as the category and location.
Some of them are a little bit more complicated, such as
whether it was paid for.
Most of these attributes, they're fairly static.
They're directly related to the user experience, and they
don't change too much over time.
However, our business needs can be a lot more changing, a
lot more dynamic, be that for strategic purposes, for HR
reasons, many other reasons.
To illustrate this, let me give you a quick example.
Let me tell you about my colleague Ron.
Now Ron's a marketing analyst.
He's a great guy.
I have to say that in case he's listening.
He's a great guy.
And when we started a couple of months ago, he came up to
me and said, Duncan, why isn't all of your traffic organized
by ISBA regions?
Ron, what the heck is an ISBA region?
I've never heard of that before.
So he explains to me it's a non-overlapping television
broadcasting region.
And apparently, it's very important in allowing us to
assess the online impact of our offline marketing.
Now, if I wanted to query that in the user interface, what
I'd have to do is create an advanced segment and a custom
filter for every single region.
And then I'd have to query every one of those
individually every time Ron wanted this data.
So you may ask, why don't we just stick
another tag on the site?
Well, we're trying to put down the amount of data that we're
sending backwards and forwards to our mobile users.
And if we just aggregated it up to these regions that he
wants, that might change in time.
And we're going to lose some of the granularity that we
really want.
So now that we have BigQuery, how do we deal with this?
Well, I just say to Ron, make me a lookup table.
And he'll get Excel out, he'll type in a load of values, and
he'll make me a lookup table, which I can then upload into
BigQuery in CSV format.
And we tried to be a little forward-thinking here, so
instead of just uploading all of the lookup details for our
ISBA regions, we'll just upload all of our metadata,
because we've got plenty of space to play with, right?
So once we've done that, we need to query
our Analytics data.
So here we're using a custom variable.
And we're going to call that location_level_2, because
that's the level that we want to look at.
And we're just pulling out, again, the session data and a
little bit of aggregation just relating to our
pageviews and bounces.
These are the metrics that we can sum up with, so we can
aggregate them at this stage.
The next step, then, is to aggregate our lookup table so
that we're only pulling out the data that we need.
So again, we've got location_level_2.
But we're also taking location_level_1, which is our
TV regions that we're looking at.
We've also applied the filter, so we're only getting back the
data that we're really interested in.
Once we've got these two queries together, we just do a
simple JOIN.
And you can see in red at the bottom, we're JOINing on
location_level_2 from both tables and then pulling out
the TV region from our lookup table.
And in blue, you can see that it's only at this stage, the
final stage, that we're actually aggregating that.
So if we just switch over to BigQuery again,
and we'll run that.
Again, this is running through millions of sessions.
And it only takes 4 and 1/2 seconds.
So if we switch back to our presentation, you can see the
results that we've got back here.
We've managed to pull back all of our regions that we're
interested in in one query.
It was very quick, very easy to run.
And we've saved ourselves a lot of time here.
The great thing about this as well is we can connect this to
our BI tools.
And as we're only aggregating at the final point, our users
can then use an interactive dashboard and cut and slice
this data as they want.
And it will never be sampled.
The final use case I'd like to talk about actually brings the
AdSense data and the Analytics data together.
Now for those of you who aren't familiar with A/B
testing, it's just the practice of releasing two or
more variants of a particular development at the same time.
And then we evaluate their performance according to a
particular metric, a particular goal.
This helps us to actually improve our implementation
incrementally.
Now when we're doing that for AdSense, we're not just
interested in the metrics that relate to AdSense.
We're not just interested in clickthrough rates or revenue
or whatever else.
We're interested in how these ads actually impact on our
overall user experience, whether it impacts on a
downstream conversion, for example, or it affects user
engagement.
As a result, some of our optimization metrics can be a
little bit complicated.
And they have to be derived from various sources.
Usually this means that we have to go into the different
user interfaces, query things separately, and then put them
together in Excel.
It's a little bit long-winded.
And I don't like doing that.
I don't like doing anything that I can't automate.
Now that we have BigQuery, we can just automate
the heck out of it.
It's great.
So we'll just write a quick query on our
AdSense data here.
You can see that we're using custom_channel_name as our
AdSense group.
And we've just applied a little tag onto all of our
users in AdSense so that we can identify them from the
DailyCustomChannelReport.
We've also done that in Google Analytics.
And we've used a custom variable again this time.
The final step then, we just do a very simple query that
JOINs together on those values of our AdSense test groups.
So if we just switch over again, jump back into BigQuery
and give that a run, how long do you think
it's going to take?
5.7 seconds.
That's brilliant.
So if we load up the results from that, now here we have
our three test groups.
And if I was to just look at the AdSense metrics, I'd see
that the click-through rate for testGroup3 is slightly
higher than the others.
And I may stop there and just pick this group.
However, if we start to combine our AdSense metrics
with the Analytics metrics, we can see that actually,
testGroup2 has a higher number of matched ad
requests per visitor.
That means that these guys are actually being shown more ads,
and they're possibly more engaged.
But it could be that actually, these guys are just coming
back more often, and in a particular session, they're
seeing less.
So if we also look at our purely Google Analytics
metrics, our pageviews per visit, we see that testGroup3
also has a higher value of pageviews per visit, which
would suggest that these guys are more engaged.
And then, by being able to look at the bigger picture and
look at all of our metrics together, we'd actually picked
testGroup2, which would be different from the group that
we might have picked otherwise.
Now this makes it really easy to get all of the information
that we need in one place.
And the super-awesome thing about this is if we use the
same Custom Dimension for every single AdSense test that
we do, we only ever need to set it up once, and we can
just sit back and be lazy and be happy.
OK.
Thanks for listening, everyone.
I'll hand you back to Clancy.
CLANCY CHILDS: Great.
Thanks a lot, Duncan.
I really appreciate that.
[APPLAUSE]
CLANCY CHILDS: So that shows us the power of what this tool
can do when put in the capable hands of a data
scientist like Duncan.
And this was actually--
he didn't have much time to prepare for this.
So it's amazing that just with this short amount of time,
he's able to start running really interesting queries
that can actually really help optimize some of the things
that they're working on at Gumtree.
So thank you very much for showing us that.
The next steps for everyone are if you would like to get
started with BigQuery and you haven't already,
bigquery.cloud.google.com.
Set up an account.
There's a lot of a public data samples there already, so if
you just want to start playing with data in BigQuery, it's
very easy to do.
For Google Analytics and BigQuery, please register on
this form at this URL.
Or if you want to come by the Ad Sandbox on the second
floor, we have this information available as well,
the URL and the one-sheeter.
And again, we're launching this in September for everyone
in Google Analytics Premium.
For the AdSense and BigQuery integration that Louis
demonstrated, the beta is available now.
And there's instructions at this URL, where you can go and
actually start accessing AdSense data in BigQuery.
So we're really excited about this.
This is really a step forward for getting more granular data
from both of these products into the hands of data
scientists who are using the BigQuery tool.
We're looking forward to see all the really cool things
that everybody's going to come up with.
And we look forward to your feedback.
Thank you very much.
We really appreciate it.
We'll open it up for questions.
DUNCAN MCKIE: Awesome.
CLANCY CHILDS: It was that explanatory?
DUNCAN MCKIE: Cool.
CLANCY CHILDS: OK.
Or we could do the whole thing again.
OK.
Yeah.
Do you mind taking the mike so people--
AUDIENCE: I guess lots of us are thinking sort of what data
sources are next, then.
Is there a schedule for them or a priority for them?
CLANCY CHILDS: You mean other Google data sources?
AUDIENCE: Mm-hmm.
CLANCY CHILDS: We can only really comment on the ones
that we work on, on AdSense and Analytics.
But I think it's pretty clear to see that BigQuery is
Google's big data query and analysis tool, so I think it's
reasonable to assume that other products will be looking
forward to getting their data in there to get it into the
hands of data scientists.
AUDIENCE: Are there any limits to the size, the number of
rows in the data sets?
CLANCY CHILDS: In BigQuery?
AUDIENCE: Yeah.
CLANCY CHILDS: There are some limits.
They're very high.
I don't have offhand the numbers.
I know that you can get things like trillions of rows of data
into tables, and that in some cases, if you try to export a
lot of data at once, you might hit a limit.
So sometimes you have to just make sure that you're not
doing a select star across terabytes of data, which might
not be able to output.
But for specific questions on BigQuery and the other things,
the Cloud Sandbox on the second floor might be able to
give more detail.
It's also all posted online.
Sorry I don't have the exact answer for you.
Great.
So again, these are the places go to get more information.
And if you do have a second, there's a QR code in the back
for rating the session.
We'd always appreciate any feedback you have about the
session, so if you can fill out there.
And we'll be down in the Ad Sandbox as well if you have
any other questions you want to ask us in person.
Thank you very much for your time.
We really appreciate it.
[APPLAUSE]