>> TANCRED: ...the kind of data experiment we did, looking at a lot of our data in relation
to our gambling customers right around the time of the World Cup. Basically to see what
we could see. I run Product Management at Quova. Tobias, I'll let you introduce yourself.
>> SPECKBACHER: All right. So, I'm Tobias Speckbacher. I'm the VP of Emerging Technologies
at Quova, which really means that I get to work with lots of different companies these
days that are pre-product or on their first product line, to see how they fit into our infrastructure
and to make things work there. I've been with Quova for 10 years. I've had multiple roles
there, so I've run through pretty much all the technical positions that we have there, from
research to operations. And recently, I moved into this role. And that's about it.
>> TANCRED: Yeah. So, a little bit about Quova. Quova provides information about IP addresses
and we provide geographic and network information. And what our customers do with that information
is basically provide richer, more engaging, more relevant experiences for their users.
So, whether it's geo-targeting and other kinds of targeting for search and other kinds of
advertising, or financial services and e-commerce companies helping to mitigate the risk of
fraud. You have video-on-demand and sports companies who stream live video content and
other rich content. One of the reasons they're able to do that with copyrighted content is
because solutions like ours allow them to comply with the regulations and the other
contracts they have that restrict them from streaming content in other places. Major League
Baseball is an example in the U.S. where legislation actually prevents them from streaming live
games in markets where they've sold the rights to broadcasters. So the reason they can stream
live games is because they can tell where you are if you're in a home market and restrict
that content. And that's how gaming customers use us as well. Online gambling obviously
has different restrictions in different places around the world. The
reason you can gamble online where it is legal is because online gambling companies can tell
whether you're somewhere where it's legal or not. Yes, so, as I said
we took--we have a number of gambling customers mostly in the U.K., but all over the world.
And we took some of their data and looked at it in relation to right around the time
of the World Cup as I said. So, a little bit about--Tobias will talk a little bit about
the methodology in our data.
>> SPECKBACHER: Okay. So, the way we get the
data is we have what we call a closed feedback system: basically, as customers use
our data, we get individual transaction data back from them, which we use for accounting
purposes, but also to focus our research efforts. So, if you have the Internet, the
IPv4 space, basically it's 4.2 billion addresses. Not all of those are assigned, so there's about
[INDISTINCT] used by actual users; a lot of that is infrastructure space. And the majority of that traffic comes
from a subset of that. So, we use that feedback data as a significant sample to target the
areas that are important for our customers to focus our research on. The other thing
that we do is as we get that data back, we release data--we release our IP data every
week. As we get that data back, we join individual IP addresses back onto all the dimensions
that we have available on that specific IP address or the network at large and we store
that. So, we can basically perform dimensional analysis across all the feedback data that
we receive. And that's about 30 billion queries per month. That again is a subset of the queries
that are actually performed against our data. The actual number is probably, you know, way
north of 100 billion a month, because some customers have higher performance requirements
and choose to implement it in a way that doesn't allow them to give feedback data
back to us. What else is there?
>> TANCRED: So, some of the--some of the information
that we--that we assign to an IP address includes geographic information from continent down
to postal code. And then the network characteristics we assign are things like the carrier or ISP,
the organization that is responsible for the content of the network, the domain, the speed
of the connection, how the connection is routed through the Internet, whether that IP
address is associated with anonymizing activity, and things like that. And we have this data
going back pretty much since we started, about 10 years worth of data. You can imagine there's
a lot of data there. One of the reasons we haven't
looked at it before in the way that we've started to look at it is that dealing with all
this data is kind of onerous; there's a lot of data to deal with. And so, we'll talk a
little bit about the technologies that we used to actually aggregate some of the data,
so it's easier to query against and also just mine it. A little bit about gambling
though, so online gambling has kind of a storied history. I mentioned the reason you
can gamble online is because these companies can now tell whether you're somewhere where
it's legal. Back in 2006, you saw stories especially about European companies and executives
of European companies being indicted in the U.S. because they were breaking U.S. laws
by allowing U.S. citizens to gamble from the U.S. So,
being able to tell where users are coming from is critical to industries like gambling.
And gambling in general, online gambling is a growth market. It represents 8%
of the total market, or did last year, which is significant in terms of the market.
And it's also growing. It's growing 13% per year and is projected to reach about $36 billion
by 2012, which is a large market. And this is all according to H2, which is a
gambling industry analyst. And because of the legality, it's mainly in Europe
and Asia that you see online gambling. That's not to say there isn't any gambling in North
America though and in the U.S. In the U.S., gambling is traditionally legislated by states,
different states have different laws. You can actually gamble online. You can do things
like you can bet on horse races in certain states. And you can play Poker online for
money in some cases. But what's happening in the U.S., there's legislation now being
passed to allow certain kinds of gambling across the U.S. It will still be regulated
by states. And one of the reasons that's happening is because there's other laws being passed
to allow that gambling to be taxed. And of course, once you've, you know, it is a significant
market. Once you start taxing it, of course, it represents a significant revenue stream
for the government. So it--that's one reason why you're going to see that. And what you're
seeing now is some of those companies, those same companies that were in trouble in 2006
are coming to the U.S. and they're either setting up shop or buying some of the existing
gambling organizations in the U.S. So--and gambling is interesting because it has--because
it's worldwide it has a lot of the aspects that make IP address geolocation interesting.
You need to localize the language to the--your customer. You need to know where your customers
are coming from so you can market to them. And you need to--you need to restrict the
access. And there's also a lot of fraud involved, especially during these big events, what you
see is online gambling houses, especially smaller ones will be blackmailed by fraudsters
who'll say, you know, "I've set up a system that can take down your gambling site and
I'm going to do that, you know, during the World Cup unless you, you know, unless you
pay me X amount of money." And so it's really important to them, you know, that you--some
of these sites have been destroyed because they've ignored these threats. Some of them
just pay out, but it's really important for them to be able to understand what threats
are real and also help prevent those. So, it has a nice broad application for IP Geo.
So, a little bit about how we went about this. We worked with a design company called Stamen
Design. And they do a lot with really interesting visualizations and they do a lot with geography.
They did the maps for the last Olympics, and I think they're doing the 2012 Olympics in London
as well. You can see some of the projects they've done here: Crimespotting in Oakland
and San Francisco. It's a project where you can go and see real-time crime statistics
for those cities. They're responsible for [INDISTINCT] labs where you can see different
visualizations of big stories, log-on or logging-in and wireless visualization. But they're a
fantastic design company, they do great work. And we knew that working with them we would
see--we would see the data in ways that we hadn't imagined we could see the data and
see things that--that we wouldn't see otherwise. One of the things--one of the ways that they
were able to work with a large dataset is through the use of Solr, which you can talk
a little bit about.
>> SPECKBACHER: So, Solr is an Apache project
and it's built on top of the Lucene engine. Solr was developed by CNET in 2004, and after it
was developed by CNET, it was donated to the Apache Foundation. What makes Solr
interesting for a project like this is that it allows you to rapidly dive into the data.
It's very fast to ingest data and index it, and it provides faceted search and date
faceting. So, faceting basically correlates to a GROUP BY operation that you
can run in some [INDISTINCT]. So, we've used that to explore the data with Stamen. And
we'll present some interesting visualizations that were built on Solr, and we used some
innovative newer graphing concepts for those visualizations.
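
The comparison between faceting and a GROUP BY can be made concrete with a small sketch. Everything here is illustrative: the records, the field names `country` and `connection`, and the helper names are assumptions and do not come from Quova's data or schema. The first helper counts rows per value in plain Python, which is what a facet count amounts to; the second only builds the query string for a facet-only Solr request using the standard `facet` and `facet.field` parameters, and does not contact a server.

```python
from collections import Counter
from urllib.parse import urlencode

# Hypothetical feedback records: one dict per IP lookup, already joined to its dimensions.
records = [
    {"country": "GB", "connection": "dsl"},
    {"country": "GB", "connection": "mobile"},
    {"country": "DE", "connection": "dsl"},
    {"country": "SG", "connection": "cable"},
    {"country": "GB", "connection": "dsl"},
]


def facet_counts(rows, field):
    """Count rows per distinct value of `field`.

    Equivalent in spirit to: SELECT field, COUNT(*) FROM rows GROUP BY field.
    """
    return Counter(row[field] for row in rows)


def solr_facet_params(field):
    """Build the query string for a facet-only Solr request (rows=0 returns no documents)."""
    return urlencode({"q": "*:*", "rows": 0, "facet": "true", "facet.field": field})


if __name__ == "__main__":
    print(facet_counts(records, "country").most_common())
    print(solr_facet_params("country"))
```

Swapping `"country"` for `"connection"` gives the other breakdown without touching the data, which is the kind of quick slicing described above.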
>> TANCRED: So, there are two kinds of graphs that Stamen used with the data. The first
is a Horizon Graph which I'll talk about in a little more detail, and the second is a
stream graph, which you might be a little
bit more familiar with. I'll talk about horizon
graphs first. Horizon graphs were introduced in 2008 in a paper by these folks. Stephen
Few is a design blogger and consultant. His site is Perceptual Edge. And he
wrote a paper talking specifically about the use of Horizon Graphs by Panopticon, which is a commercial business intelligence
company. A lot of the images you see are from Stephen's paper. But it's
a really interesting way to see data that you would normally look at--temporal data
you might normally look at in a line graph in a compressed form where you can start comparing
things and seeing things differently. So, you have a traditional line graph and this
is a very good way to look at data over time and you can see variations in data, peaks
and valleys. It's pretty intuitive what these mean. But it's hard to compare one line graph
to another. You can start overlaying line graphs, you can start putting them beside
each other, but it gets very busy, very quickly. And you can see that this is--this is 50 stocks
over about a year in 2006 all with different line graphs. And it's impossible to really
see what's going on with these line graphs, to really compare what's going on with them.
So, Horizon Graphs allow you to see the same data but in a much more compressed form. And the
way you do that is you draw a zero line in the graph; ideally, somewhere in the middle
of the graph depending on your graph and you color the space between the zero line and
the line. You color the space above the line in one color, the space below the line in
another. And what you have is anywhere if you look at the red spaces, anywhere above
the line, you have empty white space. And so, you could leverage that white space by
essentially flipping the graph up. So, now you've cut the graph in half. You can still
see the peaks and valleys through the color and you can compress it further. So, this
graph is--it has six bands of color and you can see the darker color on top. If you look
at those parts of darker color, those polygons fit in the polygon below in every case. And
what you can do is basically compress them down. So, what you wind up with is a graph
that takes up less than a fifth of the space but it still gives you a very good sense of the
data. So, you can see by the intensity of the color and the color itself whether the
data is positive or negative and how--where the peaks and valleys are. So obviously, where
the color's more intense, the peaks and valleys are higher and lower. So, if you look
at that same graph of 50 stocks with Horizon Graphs, you get a much richer picture of the
data. You can see how individual stocks have performed, which
ones have done well and which ones haven't, and you can start to see trends temporally.
So, you can see these stocks are all performing negatively in this timeframe and these are
performing positively. And that maybe gives you some indication of where you might want
to look deeper into the data. The other good thing about this is this--all these line graphs
are--it doesn't really matter--it's all relative. So, you're seeing relative peaks and valleys
instead of absolute numbers. So, you know, you might have one stock that
trades at a very low price and another stock that trades at a very high price, but you'll
see the same trends because the data is all relative. So that's Horizon Graphs. So, if
you'll look--so, you know, we're dealing with countries all around the world. These are
the line graphs of the countries. You can start to see--well, first of all, you can't
see many countries on one page. You can start to see maybe some trends in terms of where
the peaks and valleys are, but it's hard to kind of see them. So, this is actually a single
color Horizon Graph, but you--this is Internet traffic to gambling sites from different countries
around the world. And immediately you start to see--and this is--just in about a week
before the World Cup. Immediately, you start to see, like if you look at the right edge
of each of these columns, you see a lot of activity there, which correlates with, you
know, the day before and the day of the World Cup. And you still see individually where
you have a lot of activity. Like in Germany, there's always a lot of activity versus Guinea,
where there's not a lot of activity until the World Cup. So--and you have many more
countries here on this graph than you did before. So, it's a really powerful way to
see data, temporal data, when you're looking at lots of elements. So, this was really neat.
And it does show some trends. It really gets interesting when we start looking at the stream
graphs though, so I'll let Tobias talk about the stream graphs.
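
The horizon graph construction described in this turn (color above and below a zero line, flip the negative part up, then slice into bands and collapse them) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code Stamen or Panopticon used: the function name `horizon_layers` and the sample series are made up, and the band height is simply the largest magnitude divided by the number of bands.

```python
def horizon_layers(values, bands):
    """Split a series into `bands` layers per sign, each clipped to one band height.

    Returns (band_height, positive_layers, negative_layers). Layer k holds the
    part of each value's magnitude that falls inside band k, so every entry is
    between 0 and band_height. Drawing all layers over the same baseline, with
    darker shades for higher k and a second hue for the negative layers, gives
    the collapsed horizon graph.
    """
    peak = max(abs(v) for v in values)
    band_height = peak / bands if peak else 1.0

    def clip(magnitude, k):
        # Portion of `magnitude` that lies within band k.
        return min(max(magnitude - k * band_height, 0.0), band_height)

    positive = [[clip(max(v, 0.0), k) for v in values] for k in range(bands)]
    negative = [[clip(max(-v, 0.0), k) for v in values] for k in range(bands)]
    return band_height, positive, negative


if __name__ == "__main__":
    # Hypothetical series measured relative to a zero line.
    height, pos, neg = horizon_layers([1.0, 3.0, -2.0, 6.0, -6.0], bands=3)
    print(height)
    for k, layer in enumerate(pos):
        print("positive band", k, layer)
    for k, layer in enumerate(neg):
        print("negative band", k, layer)
```

With three bands per sign this yields the six bands of color mentioned above, and the chart needs only one band height of vertical space instead of the full range of the data.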
>> SPECKBACHER: All right. So, stream graphs are a type of stacked graph, a complex layer
graph. It was developed by Lee Byron, and he developed it out of a personal interest
in visualizing his listening habits on Last.fm--Last.fm [INDISTINCT] lots
of different data about which music you listen to, how often you do that. So, he tried to
do that with line graphs and different standard visualization techniques and none of these
really brought a clear picture to the table. So, he developed the stream graph concept,
which really excels when you're trying to present lots of data to a mass audience. It's
not a highly statistical representation
of the data, but it gives you ideas of trends and how the different layers behave independently.
In 2008, the New York Times published a stream graph that showed the movie ticket sales
performance of 7,500 movies over the past 21 years. And so this was kind of the first
publication of stream graph that was very popular. And it evoked different kinds of
emotions. So, probably more technical people didn't feel that good about it because it
doesn't really give you a good quantitative image of what's going on. And less technical
people really like the representation because it is very aesthetic and it lets you visually
explore the data much, much better than a more accurate representation of the absolute
numbers. So, here's an example.
>> TANCRED: We'll help you.
>> SPECKBACHER: We'll get them.
>> TANCRED: Basically--and this is actually
what we'll walk through. What the stream graphs do is they let you start seeing trends and
then depending on your system, you can start drilling down into the data either with more
stream graphs, which is what we'll do or other data. So, this graph is worldwide Internet
traffic to some of our gambling customers from the fifth through the 13th of June of this year. And, of course,
the World Cup started on the 12th. So, what you see
is a pretty regular pattern of Internet traffic. It's heavily dominated by European countries
and the U.K., mostly because a lot of our gambling customers are in the UK. But also
they have a pretty good gambling culture there, online gambling culture anyway. And you see
there's a lot of activity during the day. It drops off at night, comes back during the
day. You see more activity on Saturday than on the other days of the week,
but it's pretty regular until the day before the World Cup, where you see it spike and then
continue to stay high. So, this is interesting. It is dominated by the U.K. and Europe. So,
what we're going to do is drill down into different continents and different countries
and then eventually different network characteristics of the data to see other trends. And you can
see little examples of little anomalies in here, but once you start drilling down they
become a little bit more apparent. So, if we look at just Europe, it pretty much looks
the same. You start to see little weird things, like up here you see this little chokepoint
but it pretty much looks the same. So, let's take a look at everything but the U.K., since
it was so heavily weighted from--with the U.K. So, now, it starts to look a little bit
different. You start to see less of a--the rhythm is still there, but it's less extreme.
So, you see more activity throughout the day. You also, on the first graph, you could see
this little blip, but this becomes a lot more apparent here. Friday morning, there's something
going on. And you see that's this red band in the middle, which is associated with the
U.S. So, there's something going on there. But you also see different countries behaving
differently. So, the blue up here, right above, is the Netherlands and they have a very regular
rhythm of activity during the day and not much at night versus some place like Denmark,
which is down here, which has pretty regular activity throughout the day. And then you
also have like this green up here is Singapore, where there's not a lot of activity at all
in the week before the World Cup and then it really just blows up. So, if we look at
Asia--yeah?
>> I'm sorry, but what's technically the buildup
with Vietnam? I don't understand [INDISTINCT]
>> TANCRED: That's a good question, I'm glad
you asked, because it's very important to understand it. So, it's like
a stacked graph, so the thicker the band of color, the more traffic, the more queries. And what this
data represents is IP address queries from these companies. It doesn't necessarily mean
that people are gambling, so someone could be coming from the U.S. and hit the site and
be denied.
>> So my question basically is, what is zero
and why is it different from a graph that is less [INDISTINCT] stacked graph?
>> SPECKBACHER: So typically when you stack graphs, you have a couple of issues. So first
of all, if you use lots of time series, the series that don't contribute that much data kind
of disappear in the graph visually. The other issue is, if you have two series of
equal vertical height but with different slopes, one of the two tends to disappear visually.
So, this methodology really is meant to visually pull those out so they stand apart
rather than disappear. So it's not so much that I need to know the exact slope; I want
to know what the movement of the individual layers is.
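
To make the "what is zero" question concrete, here is a minimal sketch of how a stacked graph's baseline can be moved. It is an assumption-laden illustration, not the algorithm used for these slides: it implements only the simplest alternative to a zero baseline, a symmetric one that starts the stack at minus half the total, whereas the stream graph paper describes more refined choices. The function name `stacked_bounds` and the sample layers are hypothetical.

```python
def stacked_bounds(layers, baseline="zero"):
    """Return a (bottom, top) pair of lists for each layer in a stack.

    baseline="zero" gives an ordinary stacked area graph sitting on the x-axis.
    baseline="symmetric" centers the whole stack around the x-axis by starting
    it at minus half of the total at every time step.
    """
    n_points = len(layers[0])
    totals = [sum(layer[t] for layer in layers) for t in range(n_points)]
    if baseline == "symmetric":
        bottom = [-total / 2 for total in totals]
    else:
        bottom = [0.0] * n_points

    bounds = []
    for layer in layers:
        top = [b + v for b, v in zip(bottom, layer)]
        bounds.append((bottom, top))
        bottom = top  # the next layer sits on top of this one
    return bounds


if __name__ == "__main__":
    # Two hypothetical layers over two time steps.
    layers = [[1, 2], [3, 2]]
    for name in ("zero", "symmetric"):
        print(name, stacked_bounds(layers, baseline=name))
```

Each layer's thickness (top minus bottom) is identical under both baselines; only the vertical placement changes, which is why band thickness, not absolute height, is the thing to read off a stream graph.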
>> [INDISTINCT]
>> SPECKBACHER: Right.
>> Like how did you choose [INDISTINCT]
>> SPECKBACHER: It's actually an algorithm
that you...
>> Oh.
>> SPECKBACHER: Yes. So, yeah, so it's a detailed--there's detailed documentation in the paper that was
linked on the previous slide, so.
>> TANCRED: Yeah. And you'll see--you'll see
kind of how it differs from a stacked area graph when we look at the U.K. specifically.
And it's a nice example of how a stream graph kind of changes, how it's different from a
stacked area graph in some ways. Does that help at all? I mean basically, what you're
seeing here--what you're looking for are trends and in some cases it gives you some answers,
but in more cases it just raises additional questions that you may or may not be
able to answer with a stream graph. So we're looking at Asia. So Asia looks a little bit
similar to Europe, except that you don't have that big spike on Saturday, because it's--because
for the customers that we're seeing in this traffic, Asia isn't as much of a gambling
culture traditionally, but you do see them coming to these gambling sites during the
World Cup, before and during the World Cup. So, and again, you get a much better view
here of the impact of Singapore and their big traffic, which is represented in the middle
here where it just kind of explodes. So this gives you an idea of gambling patterns in
Asia. If we look at the US, where you saw that kind of weird spike, well this is North
America, but this is instead of by country, we did it by organization because it actually
gets very interesting. So if you look at the--so the immediate thing that you might notice
here is that regular rhythm is gone. It's a pretty straight graph for the most part.
You have these blips which I'll talk about in a second, but even in the bands, there
isn't a regular pulse of activity. So when you look at the actual organizations, it's
hard for you to read that, but red is Google, so either your counterparts in Mountain View
are staying up all night gambling everyday or there's something else going on. You start
looking at the other organizations like Microsoft and Yahoo! and you realize what these are
[INDISTINCT], that are indexing the site. So that all of a sudden makes sense, where
before, you might have seen a lot of traffic from North America to these sites and not been
able to explain it, because really you go to a site once, you get denied, and then you don't
try again. This is much more understandable. These kinds of anomalies are weird. This one
on this side was Comcast Cable in Centerville, California. And so there was just a bunch
of activity on Saturday. I don't know why. I don't know--I mean, we can look at it further
and we can say, "Okay, which sites were they going to? What IP addresses were they? Does--is
it many IP addresses or single IP addresses?" But it's something to look into. It can be
completely legitimate or it could be illegitimate. It could be someone probing the site before
an attack. It could be someone probing the site for legitimate reasons. It could be the
site itself running some tests. You see the same thing here. This one's in
Phoenix from a publishing company. Again, very odd to see that level of traffic the
day before the World Cup, but it could be, again, legitimate or illegitimate. And certainly
it's strange. You also see that chokepoint that I mentioned earlier, much more pronounced
here. And you see that on other graphs that could be an attack, maybe the servers went
down because of an attack or maybe they went down because they crashed or maybe they've--maybe
some of these sites took their service down for maintenance. It happens to be--I
mean, it's a bad maintenance window in that it is in the middle of the World Cup, but if something
bad was happening and they had to take the site down, then it makes sense probably to
do it when traffic was low anyway. So that's probably what it is. But it's interesting
looking at these graphs and kind of coming up with theories for this. And then as a customer
of the data, you would be looking at this. As an industry, it [INDISTINCT] about what's
happening in the industry. Yeah?
>> [INDISTINCT]
>> TANCRED: If you can--it's not just relative, you can get information about how many total
queries is this and then you can start figuring out what the traffic numbers actually are.
What I would do if I actually wanted to know what those numbers are, I'd query the data
directly for that timeframe and find out what the group's in. I don't for this category
graph, but we could come up with them. Yeah?
>> [INDISTINCT]
>> TANCRED: So that's--that's the way that the graph works. It tries to--and maybe you
can explain it better, Tobias, but it tries to kind of equalize the data. And you'll see
this in some other graphs where there's less data, that the graph shifts more. Where there's
more data, it's better at equalizing.
>> [INDISTINCT]
>> TANCRED: Yeah. Right, right. And I don't
know exactly what the graphing software's
doing there but it's basically an artifact of the graph.
>> SPECKBACHER: This one?
>> TANCRED: Yeah. So this is everything but
Europe, Asia and North America. So, again, you see this kind of shift because there's
less data overall so the weighting is less. But you start to see interesting things again,
like which countries outside of those three main markets are good markets for gambling
and gaming. And so, here you have South America in green and in gray, we got two grays, oh,
Australia. And you see, again, South America has a good rhythm. Australia, they stay up
later or they're gambling at different times, but it's more of an equal band until you get
to Wednesday. Interestingly, Australia started betting really early. If you look at other
countries, I was looking at other countries like Malawi. And when I was looking at Malawi,
I was just looking between Friday and Friday, and it was just basically flat except for
a spike somewhere on Monday or Tuesday. And I thought, well, like, "I guess they didn't
have a team in the World Cup so they weren't interested in it," until I looked at--because
every other country started betting on Friday, and then I looked at Saturday and then there
was a huge spike. So it's just interesting to see the different mentality of different
countries. And I don't think Nigeria played until the 13th so it could be that they were
betting on African teams. I don't know. But it's interesting to come up with hypothesis
about this. So now, we'll look at three different countries in Europe, starting with the UK
because it represented so much data. This is just a very interesting stream graph because
it--you basically have a stream graph, and if you take away London, you have a stacked
area graph on top of it, because London basically creates the zero line. But this essentially
matches the European data in terms of its pulse and again everything I talked about
with betting on the weekend and the chokepoint and things like that. So if we take away London,
it'd be interesting to see if the U.K. is sort of homogeneous in the way it gambles,
and the graph essentially looks the same. You start to see a little bit more detail
in terms of what other cities in the U.K. are gambling online but it basically looks
the same. So, let's look at something that looks different. So here's Germany. This kind
of has this rhythm but it's also a little bit all over the place. You have, you know,
Monday morning people come into work and they stop betting, but then they sort of get over
their guilt and they go online and continue betting. Germany's first game is on the 12th,
and so, you see a big spike here. But it's pretty consistent; they're online all the
time betting, unlike the U.K. And you also have this huge area that kind of looks like
London did in the U.K. except this is Karlsruhe which is not any place I've heard of. So,
it's a little bit harder to explain until you start looking a little bit deeper into
the data. And this is actually 1&1 Internet AG. They're an Internet provider. They have
a big hosting facility in Karlsruhe. And so, you know, we're locating their traffic where
their datacenter is because that's the last point we see. And so, in our data, this would
be represented with the routing type of regional proxy so, you know, we know what country it's
in, but we can't necessarily tell you what city it's in. But at least we can tell you
it's Germany. And so, now that makes a little bit more sense. So that's Germany. We'll look
at Denmark next, which also looks really crazy. There's really no pattern here. You have this
huge red and this huge blue. Definitely, you see a lot of activity during the World Cup.
And so, that big red most likely represents consumer traffic. It's strange that this blue
is really active here and really active in the middle of the week before the World Cup.
And then, kind of dies out completely. When you look at the organization behind this,
that blue is basically a website that reports odds for games and it refers traffic to the
gambling houses. So, for whatever reason, there's a lot of people online checking the
odds of different matches, whether it's World Cup or not and going to betting sites and
placing bets. The red is similar to what you saw in Germany in Karlsruhe, it seems to be
a hosting provider, although, it also has--provides VPN services. I don't know why there's a big
spike there. Maybe there was some other major sporting event that people were betting on.
But certainly, if, you know, if I want to learn more about the Denmark market and how
it works, this is something that, you know, I would start looking into: why there
might be a big spike and then a complete drop in activity and what's going on. So I mentioned
that we have this geographic data; we also looked at the data in terms of the network characteristics.
Tobias will cover the next few graphs, and they show how people are connecting and
routing to get to these gambling sites.
>> SPECKBACHER: Right. So what we see here
is a stream graph representing the connection types. Meaning, what we do is we categorize
network blocks by how they are connected to the Internet. So you have the DSL and cable
down here in red and yellow which, you know, you would expect to be dominating.
There's a pretty healthy amount of mobile betting going on here that's represented
as purple on this graph. And we have this green band that shows this uniform traffic
coming through here on fixed connections, so that again is most likely the U.S.
traffic that we saw earlier that originated from the large search providers, and we can
see that as fixed connections here.
>> [INDISTINCT]
>> TANCRED: Yes, you want to...
>> SPECKBACHER: Yes. All right. So, as I said
there was a pretty healthy amount of mobile betting going on. And that's--and now we're
segmenting the data by mobile providers. And since most of the traffic came from the U.K.,
we see T-Mobile U.K. and Hutchison 3G as, I think, the dominant providers here. It's kind
of interesting, you know, to slice data like that, to understand
which providers users are with; you can use that for marketing or to target ads. But just
the fact that you actually are able to identify that it's coming from a mobile
carrier helps you in a sense because you know the user's mobile, so whatever IP geo-location
tells you is probably something that you should not rely 100% on but you can use confidence
factors and other data points that we give our customers to understand these circumstances.
So, there was also a segment of dial-up users. And that was actually kind of surprising because
there was a decent percentage of...
>> TANCRED: Yeah.
>> SPECKBACHER: ...of the overall traffic. And again, the U.K. dominated the traffic
there. There was some U.S. traffic
there.
>> TANCRED: Japan.
>> SPECKBACHER: Yeah, Japan.
>> TANCRED: Tanzania.
>> SPECKBACHER: And then, there's, you know,
lots of developing countries on there, which apparently still use modems. Anonymizers.
So when you're operating a gambling site, you want to make sure that your customers
are not circumventing your IP geo-location solution. And typically, they'll try to do
that by connecting through a proxy server that provide its--that provides a certain level
of anonymity. If you're trying to gamble with a U.K. provider, what better proxy to use
than the one in the U.K. and that's basically what we see here.
>> TANCRED: Maybe, I can say a word about... >> SPECKBACHER: Yeah.
>> TANCRED: ...anonymizer in the data. So the way that Quova identifies anonymizers,
they identify anonymizers by specific IP address and activity we see. We also--because we
provide our data as network blocks, we also identify network blocks that have anonymizing
activity in them. So, a lot of this activity is probably not anonymizer activity but is
in a network block where we've seen anonymizer activity. Certainly, so I wouldn't expect
that every transaction that you see here is associated with someone using a proxy. But
you can see at, you know, the graph certainly gets wider as it moves to the right, which
is what you'd expect during a big event that you'd see more anonymizer activity at these
sites. And as Tobias said, more in the U.K. because they're trying to reach sites that
are in the U.K. >> SPECKBACHER: Right. So, basically what
we flag is bad neighborhoods so like for crimespotting data, if you look at it, this is the network
block that had some suspicious activity going on in the past or recently. So you should
be cautious in dealing with that type of traffic. And so now, we segmented the anonymizer populations
by carriers and it's not very surprising that most of these anonymizers are actually with
hosting providers. So, they're probably not systems that are actively being used by actual
users, unless this is having betting with some customers. Yes?
>> TANCRED: And this can be compromised machines or hosts that people have setup specifically
for this? >> SPECKBACHER: Yeah. So, someone might get
[INDISTINCT] set up with or the other possibility is just that boxes get rooted and [INDISTINCT].
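The distinction drawn above, between a specific IP address seen acting as an anonymizer and a "bad neighborhood" network block that merely contains such activity, can be sketched with Python's standard `ipaddress` module. The flagged addresses and blocks below are made-up documentation ranges, and `anonymizer_status` is a hypothetical helper, not part of any Quova product.

```python
import ipaddress

# Hypothetical flagged data, using reserved documentation address ranges.
FLAGGED_IPS = {ipaddress.ip_address("192.0.2.15")}
FLAGGED_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def anonymizer_status(ip_text: str) -> str:
    """Classify an address as a known anonymizer, a member of a flagged
    network block, or clean."""
    ip = ipaddress.ip_address(ip_text)
    if ip in FLAGGED_IPS:
        return "known_anonymizer"
    if any(ip in block for block in FLAGGED_BLOCKS):
        # Activity was seen somewhere in this block, not necessarily at this IP.
        return "flagged_block"
    return "clean"
```

A result of `"flagged_block"` is a reason for caution rather than proof that the transaction came through a proxy, which matches the caveat given in the talk.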
>> TANCRED: And the significance of this information is that when you're trying to prevent fraud,
when you're looking at traffic coming into your sites, the more things you can correlate
with, the better your prediction capabilities are. So if you can correlate--if you'd know
that certain carriers or certain organizations or certain countries for certain connection
types correlate better with known fraud, then knowing all that data when--if the traffic
is coming in, lets you treat those connections differently than you would otherwise. And
that's what the financial institutions do, that's what e-commerce sites do, that's what
gambling houses do. And that's why it's important to have this information. So, you know, it
was a pretty brief look at a very small part of our data. We're just starting looking at
this data. We're just starting at looking at different ways to visualize the data. What
we'd like to do is make a lot of this information public because the more people looking at
it the more interesting things we'll find in the data. As people start looking at the
data, I expect that, you know, we'll see more trends in the data and that we can start to
use a lot of these user's data to do things like predict events, predict and prevent fraud,
look at marketing trends. And there are certainly going to be a lot of assumptions that people
have about traffic to different markets from different places that can be either confirmed
or disproved with this data. So, we're excited about this. We're going to continue looking
at it, like I said, hopefully, we'll make this data public pretty soon. And that's it,
any questions? Thank you. We were so interesting that we distracted you.
>> Yeah I am. So this is about [INDISTINCT]. >> TANCRED: Yeah.
>> [INDISTINCT] >> TANCRED: Right.
>> [INDISTINCT] >> TANCRED: Well, I mean, what the laws typically
state is that you're using, you know, industry best practices.
>> Oh [INDISTINCT]. >> TANCRED: And, yeah, and it's not--and there
are certainly ways to get location data that are not industry best practices. So if you're
trying to--if you're, you know, selling restricted goods to different countries around the world
where those goods aren't supposed to be sold... >> Right.
>> TANCRED: ...then, using things like user reported data wouldn't be sufficient. You
have to use some other kind of data or, you know, even GPS now you see spoofing there,
so, yeah. >> [INDISTINCT]
>> TANCRED: Yes, in our experience. >> [INDISTINCT]
>> TANCRED: Yeah, sure. >> [INDISTINCT]
>> TANCRED: Yes. So, let me ask--I'll repeat the question because I don't know if the questions
are coming through in the recording but the question is, "When we put this data out to the
public, do we know what kind of visualizations and graphs we'll allow, whether they'll
be static or dynamic and things like that?" You want to take that?
>> SPECKBACHER: Sure. So, certainly, our goal is to enable lots of people to explore the
data. So, static graphs are not going to be very suitable for that. Obviously, we'll have
to provide some level of pre-aggregation to protect the innocent customers. But, you know,
we can provide dimensionally aggregated data and let people slice and dice those datasets
however they want. So that's the plan. >> TANCRED: And I would expect that we're
going to probably provide some interesting visualizations like this and maybe some more
traditional ones that let people get a little bit more statistical and specific with the
data. >> [INDISTINCT]
>> TANCRED: Right. >> ...it's getting anonymized and what have
you going with it
all connected on. But you've obviously shown that these kind of meet the new graph types.
>> TANCRED: Right. >> Are you implying something in the space
where people will actually be able to navigate these graph types?
>> TANCRED: That's our plan. I would expect we're not going to--well, at least in the
first instance, the first exploration will be through different graph types rather than
just access the data directly. Although, we might, depending on how we can aggregate and
anonymize the data to make the data directly available.
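The pre-aggregation mentioned in this answer could look roughly like the sketch below: raw records are reduced to counts over a chosen set of dimensions, and the customer identifier is never carried into the output. Dropping groups below a minimum count is an added assumption about how small, potentially identifying groups might be handled; the talk does not say how the data will actually be protected. The record fields and the `aggregate` function are hypothetical.

```python
from collections import Counter


def aggregate(records, dimensions, min_count=5):
    """Count records per combination of `dimensions`, keeping only groups
    with at least `min_count` records (an assumed suppression rule)."""
    counts = Counter(tuple(r[d] for d in dimensions) for r in records)
    return {key: n for key, n in counts.items() if n >= min_count}


# Hypothetical raw records; the "customer" field never reaches the output.
records = (
    [{"customer": "A", "country": "UK", "connection": "dsl"}] * 6
    + [{"customer": "B", "country": "TZ", "connection": "dialup"}] * 2
)

by_country_and_connection = aggregate(records, ("country", "connection"))
by_country = aggregate(records, ("country",), min_count=1)
```

Slicing by a different set of dimensions is then just another call to `aggregate` with a different `dimensions` tuple.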
>> And my second question is with the stream graphs, have you done any kind of cross-dimensional
analysis back where you all are actually using it to find support correlation and trends
into the dimensions with different methods? >> TANCRED: It's interesting, we've like--we've
done that with multiple stream graphs. Like I was talking about, looking at a specific
city that shows weird activity and then looking at different dimensions of that but that's
by running different--well, yeah, running different stream graphs. And it's actually
been very interesting for us to see certain things about our data that weren't completely
evident to us before. But I don't know what the stream graph's capabilities are to look
at multiple dimensions in the same graph, if that's what you're asking.
>> [INDISTINCT] >> TANCRED: Yeah. I mean, what we wound up
doing a lot, I mean, Tobias and I spent, we basically spent a long time just creating
interesting graphs. And you wind up creating graphs on specific metrics and excluding specific
things to get to the answer you're looking for. You know, so you look at interesting
things like routing types against cities, against carriers and organizations until some
things start to make sense. Like that chokepoint that we saw early Saturday morning, I think
it was. If it exists across every routing type and across every customer that we're
looking at and in every country, then it indicates something maybe industry-wide. If it only
exists for one of the customers then it's something specific to that customer. And so,
that's the kind of exploration you want to do.
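That reasoning can be written down as a small check: look at the same time step in every segment (customer, routing type, or country) and see how many of them show the drop. This is only a sketch under assumed definitions; treating a "drop" as falling below half the previous value, and the names `segments_with_dip` and `classify_dip`, are illustrative choices rather than anything described in the talk.

```python
def segments_with_dip(series_by_segment, index, drop_ratio=0.5):
    """Return segments whose count at `index` fell below
    `drop_ratio` times the count at the previous time step."""
    dipped = []
    for segment, counts in series_by_segment.items():
        previous, current = counts[index - 1], counts[index]
        if previous > 0 and current < drop_ratio * previous:
            dipped.append(segment)
    return dipped


def classify_dip(series_by_segment, index, drop_ratio=0.5):
    """Label a dip as industry-wide, specific to one segment, partial, or absent."""
    dipped = segments_with_dip(series_by_segment, index, drop_ratio)
    if not dipped:
        return "no dip"
    if len(dipped) == len(series_by_segment):
        return "industry-wide"
    if len(dipped) == 1:
        return "specific to " + dipped[0]
    return "partial"
```

Here `series_by_segment` maps a segment name to a list of counts per time step, and `index` must be at least 1 so that a previous step exists.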
>> Thanks very much. >> TANCRED: Sure.
>> SPECKBACHER: So there's actually a JavaScript library that you can use to create this. It's
called Protovis. >> TANCRED: You know, I know everyone's wondering
about the stream graph on my shirt, so I'll answer that question, yeah, yeah. I didn't
plan on wearing the shirt. I brought it and Tobias pointed out that there's
essentially a stream graph on it so I realized I had to wear the shirt. So it's not entirely
intentional. >> SPECKBACHER: It was a designer's idea data,
I think. >> TANCRED: Yeah. It's a--yeah, I'll let you
decide. Thank you.