FEMALE SPEAKER: So I'm happy to introduce Mayur Thakur.
He's from the same undergrad college, IIT Delhi.
He did his PhD at the University of Rochester and is right now
an assistant professor at the University of Missouri-Rolla.
His research interests are in network and graph algorithms,
[INAUDIBLE]
analysis, [INAUDIBLE] and modeling, and [INAUDIBLE]
which is dear to us so [INAUDIBLE].
He has also held positions at Northeastern University, Los
Alamos Labs, Equinox Corporation, MSR,
Phillips, as well--
[INAUDIBLE]
AUDIENCE: What was that list again?
What was the list again?
FEMALE SPEAKER: Northeastern University, Los Alamos Labs,
Equinox Corporation, MSR, Phillips--
MAYUR THAKUR: [INAUDIBLE].
FEMALE SPEAKER: All right, he'll talk about
"Net and the City".
MAYUR THAKUR: Thanks, [INAUDIBLE], for the
introduction.
It's nice to be here.
First time in California, actually, so this is very
strange because I've been in the US for eight years now.
The talk is going to be on "Net and the City".
This is ongoing work with collaborators at Virginia
Tech, Northeastern, and Los Alamos, actually.
I am going to first motivate why we want to look at city-nets, what those are, and then basically describe how we build our model of city-nets.
Then I'll present some analysis and what the
structure of the city-net looks like, and then at the
end I'll present some analysis on attacks that we have.
So, up front, let me just describe, or define, what a city-net is.
[SIDE CONVERSATION]
MAYUR THAKUR: So, when I refer to a city-net, what I mean is
exactly this.
I mean a set of routers, end hosts, and links that are located in a particular metropolitan area.
So think of New York and think of its boundaries.
It's not defined well, but let's say I define a
boundary for you.
And then think of every router, end host, and link
that is located inside of that boundary.
That's what I mean by the city-net.
OK? That's what I mean by a city-net.
So what we want to do is to uncover the structure for
large metropolitan areas like New York City, like Chicago,
and we actually have done that.
The reason for doing that is the following.
A lot of people have studied the general structure, the structure of the internet at large. And also, there's been a lot of study of individual sub-networks: for example, ISP structure has been studied, autonomous system structure has been studied.
What we are doing is looking at the local structure, the
geographically local structure of the internet.
Why do we want to do that?
Well, the internet as infrastructure has grown with
humans, and humans have grown the most in cities.
So what we want to do is look at how the infrastructure has
grown in dense areas.
Dense both with respect to human population and in terms
of importance, because cities are where the businesses are.
Cities are where the universities are.
Cities are where most of the people are.
In terms of computer networks, what this will do is
facilitate network simulation studies.
You'll see some aspects of that when we study the vulnerability or the robustness of networks.
What we want to do is to develop a methodology that
anyone having access to a few nodes and stuff can create a
structure, can create a map of city-nets, and then use that
for simulations.
We plan to release models for the research community, so
what will come out of this research, we want to package
it and release it to the network community.
Because the structure of city-nets is not currently known.
OK, one thing that I wanted-- yes.
AUDIENCE: [INAUDIBLE]
data?
Exactly what--
MAYUR THAKUR: We can't release all the data because some of
the data is sensitive.
So what we'll do is we'll anonymize and release it.
Is that what you're asking?
AUDIENCE: Is that a mathematical model or is it--
MAYUR THAKUR: It's actually data from real world.
But we also plan to do research and from that real
world data make mathematical models.
Both.
But we can't release all the data that we have because some of it is actually sensitive. But we can anonymize and release it.
For example, not have any IP addresses.
Just give them integers.
OK.
One important distinction between our model--
I'll describe our model before I start--
our model of city-nets, or networks in general for now,
but let's go with city-nets--
is a path view.
So we don't look at a network as a graph, which is what you would typically think.
What we look at, when I say city-nets, a model of city-nets, is a set of nodes and a set of paths on those nodes. And then you can think of attaching traffic rates and stuff on that, but this is the crucial difference.
It's a path view as opposed to a graph view.
Now why do we want to do that?
Basically two reasons.
Routing on the internet is policy based. Just because there is a path on the internet doesn't mean any packet will travel that path.
So we believe that the path view is the more correct view.
Also, it has been shown that paths on the internet have this dominant path phenomenon, where in a steady state, all traffic between this point and some other point is dominated by a single path, even though there are multiple paths it could take.
It's changing slightly because of new technology and stuff,
but still, internet shows dominant path.
So again, when I say a model of a city-net, what I mean is
a set of nodes and a set of paths over those nodes.
So I'll describe all that today in a later slide, but
roughly speaking, this is the boundary of the city, these
are nodes right outside the city, so that's what I call
the border nodes B. These are the set of routers also called
internal nodes.
So these route traffic, and this set of
destinations are end hosts.
OK?
That's the basic model.
OK, what this shows here is that you can have a small cut,
but we'll come to that.
But the basic idea is path instead of graph.
AUDIENCE: You're thinking only of
physical nodes, not wireless?
MAYUR THAKUR: No, at this point it's just--
and also if you can think of what level--
is it level two, level three, we're looking at
the IP router level.
OK?
People have done work at the AS level--so what is the AS level of connectivity?
Or people have done it at even lower levels, for example, as
the physical connectivity.
We are looking at the IP level.
AUDIENCE: Are you going to get to it later or is it a good
time to ask about the typical length of the paths?
MAYUR THAKUR: I'm going to get to that.
Good question, [INAUDIBLE].
Yes, so for length there are two: the unweighted length and the weighted length. Weight is hard, so we have not done that. We do have some models, where weight is basically
traffic or capacity.
And it's hard, but we have some models for that, we'll
come to that.
So, the problem statement is we want to design a robust
methodology.
The idea, in addition to the data, we want to develop a
methodology for inferring realistic path network views
of city-nets.
What do I mean?
I've described what path view is.
What do I mean by realistic?
What I mean by realistic is that first of all, it should
be inferred from real data.
And secondly, even though it's impossible to have all
routers, there are just too many of them and all end hosts
in your model, it should be realistic from a point of view
of whatever metrics you're interested in.
Whatever metrics common simulation studies are interested in, the model should have those properties: for those metrics, it should match the real world.
So our view, as I just mentioned earlier, is the router level view: think of each node as a router or an end host. In a lot of different studies, each node
is an autonomous system.
Or you can think of each node as an ISP.
Or you can think of each node as--
so when I say router, I mean an IP router at the IP level,
not at the physical level.
Because logical connectivity could be different from
physical connectivity.
There has been work, in particular work by [INAUDIBLE] and I think someone at Google, actually, at the physical connectivity.
But we're not doing that.
Another thing that we are not interested in--and you can see that from this--is transit traffic. If you look, all paths start from outside and end inside.
There could be a path that starts from outside, goes
inside, and comes out.
We are not interested in that.
We'll not model that.
Why?
Because we're interested in how the city is connected to
the outside world and vice versa.
How the outside world is connected to the city.
So we are ignoring transit paths.
OK?
Other thing is, we actually are interested in forward
connectivity and backward connectivity.
However, 50% of paths have been shown to be symmetric,
and because of the way we do [INAUDIBLE] traceroutes, the
backward connectivity's harder to do.
So even though I'm claiming I will be sort of implying that
it works for both forward and backward, actually I'm going
to caveat it by saying that it's only forward paths.
But if you ask me to hazard a guess, I would say that the
forward and backward would not be much
qualitatively different.
OK?
But our models are not based on backward paths at all.
This is forward paths.
Obviously the issue here is that a path from A to B might be much different from a path from B to A. But typically, 50% of paths are in fact symmetric, and lots more share many other properties.
OK, some related work, there's a ton of activity, I'm missing
a lot of stuff here.
But I'm just able to give you a flavor of the kind of work
that has been done.
Different views of the internet have been looked at.
And this is for general internet.
Ours is only for the cities.
So there's the router level view, you have the autonomous
system view, or the ISP view.
OK?
Also, related to our work, in fact a very crucial part of
our work is knowing which IP address falls in which city.
So we are using work that other people have done on
geolocation.
Given an IP address, how do you locate it geographically? What are its latitude and longitude?
People use different kinds of triangulation techniques, both
in open research community and there are whole companies, and
we are using some data from these companies, even though
we could, in principle, do all of the [INAUDIBLE] work and
then use that data.
But we are not.
We are using it as a black box.
Geographic properties of routing--
AUDIENCE: Whose data did you use?
MAYUR THAKUR: Yeah, I'll come to that--
Akamai, Digital [INAUDIBLE]--so we use three sources, and they are all leaders in that.
Geographic properties of routing.
So if a packet has to be sent from Mountain View to New
York, how does it travel geographically?
Does it go to Chicago first?
That has been studied, and it has been shown that it's not always optimal, but a lot of times it is optimal in terms of distance.
Again, that is for the general internet; ours is for cities. We look upon our work as a part of the bigger study of understanding the internet and how it has grown, which is--
there's a ton of work in that [INAUDIBLE]
this is the first work that hinted that the internet
graphs, our internet has this power law structure, which is
a reason for efficient routing, but also leads to
vulnerability.
So we look upon our work as a part of the bigger internet
visual tomography.
There's a typo there.
We are not doing any visualization, though.
And again, even beyond internet, there's work on
complex networks.
Social networks, internet, so on and so forth.
On power networks, electrical power networks, there has been some related work, but it's not part of this talk.
OK.
So what is the basic methodology that we have?
So our goal again is to find a path model.
So I want to give a set of nodes and a set of paths over
those nodes for a given city.
So let's assume a city like Chicago, for example.
Wherever I need a specific city in this talk, I'll show Chicago's example.
OK?
So I want to give you that.
I want to give you a path model for Chicago.
So how do we do it?
The first thing we do is find the set of all IP addresses that lie within Chicago.
This we do using datasets.
So we are actually not doing research here; we're just using other people's research.
Now after that, since we can't run traceroutes to all of them--our basic methodology is traceroutes--we sample intelligently so that, per traceroute, the number of nodes and edges discovered is improved.
Then we actually traceroute from a lot of different vantage points, close to 44 vantage points, to every IP in that sample set that we have. We clean the data, then we build a path model.
We call it the BID model, and you sort of saw the glimpse of
it, and I'll come back to it.
But essentially that's what we do.
You have a sample set inside a city, you have all these
sources, you run traceroute from all these sources to the
sample set, and you get your path-based model, and we do
analysis after that.
OK.
So the first step was basically the IP
to geographic mapping.
As I mentioned in an earlier slide, it's an
active area of research.
We are not doing research in this.
We just take data from--
this is actually proprietary data, but if we really wanted,
we could spend effort and build our own system.
People have done that in the research community as well.
These are all well known leaders in their field.
What we do is take these three data.
The data doesn't match because of obvious reasons.
It's not exact.
We don't know where the city ends and the suburbs start.
One thing we want to note is that we want to consider the
metropolitan area.
Because we don't want to lose out on home
customers, for example.
Blocks related to home customers.
That's just one.
So we want to look at not just downtown, but the metropolitan
area as well.
Again, that definition's also not clear.
But basically, what these data give us is, for each IP, a name, which is basically the name of the city.
Actually, they do it by blocks.
But then you have to merge these data.
How do you merge these data to get one clean data?
OK.
The basic tool that we have, as other studies have used too, is traceroutes. What traceroute does is, basically, you start from a point and it gives you the route from that point to any other point on the internet.
So this is an example of an actual
traceroute from our dataset.
So you wanted to find out the forward path to this IP
address, and this gives you that.
We are not using the time information, but you could use the time information for weighting, for example. We're not doing that as of now.
This is basically an internal bookkeeping
that we have. OK?
So traceroute is a utility that's been used a lot in network studies. It uses the "time-to-live" attribute of IP packets.
What you do is you say, OK, that IP packet should expire
in, say, five steps.
So at the fifth hop, it expires, and then, just by the
way the protocols work, hopefully the router will send
you a message saying, oh, your packet expired.
So you know what the fifth router is, essentially.
They're not really quite [INAUDIBLE], but that's why traceroutes are a lot of times not successful. In fact, the success rate is only about 10% or so.
OK?
But that's what we have to live with.
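[To illustrate the TTL mechanic just described, here is a minimal Python sketch assuming the scapy library and root privileges; the actual measurement utility used in this work is custom and not shown:]

```python
from scapy.all import IP, ICMP, sr1

def probe_path(dst, max_hops=30, timeout=2):
    """Send ICMP probes with increasing TTL and record which router answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        reply = sr1(IP(dst=dst, ttl=ttl) / ICMP(), timeout=timeout, verbose=0)
        if reply is None:
            hops.append(None)       # the hop silently dropped the probe
        elif reply.type == 11:      # ICMP time-exceeded: hop discovered
            hops.append(reply.src)
        else:                       # echo reply: reached the destination
            hops.append(reply.src)
            break
    return hops
```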
So what we have is-- yeah?
AUDIENCE: In the traceroutes I seem to remember there's two
different ways of doing it with ICMP and--
MAYUR THAKUR: Yeah.
I don't mention that but we do all sorts of optimization.
That's the point of this slide, actually.
There's lots of different ways to do it.
We choose ICMP because UDP raises more flags.
So what you have to be careful about traceroutes is that,
first of all, they can take a long, long time.
So without those optimizations, which I'm not detailing here, 100,000 traceroutes, which is a typical number for a city, will take about two to six weeks, depending on how many successes you have. So we optimize our traceroutes. For example, we don't send three packets, which is what traceroute normally does.
Remember, we are not interested in very fine structure, because it's more important to get more structure than to get every single small piece of it.
So we do these optimizations, part of that
is using ICMP packets.
And that gets it down to a reasonable number: one day for 100,000 traceroutes, essentially.
So the utility that we have is basically a utility for
collecting traceroutes, and it has tunable options, including
[INAUDIBLE].
But the way we do it is to always use ICMP. It works across machines, because sources in different
locations have different operating systems. It does
have error recovery, so if you stop at a certain point, you
can start it up without losing much data at all.
So I'm not going to go into optimization, but some effort
was spent there in trying to make sure that this was
reasonable.
Because two to six weeks is just not reasonable.
OK.
So what we did was, we chose 12 of the 25 largest cities in
terms of population within the US.
And we chose cities such that they were
distributed across the US.
And it just so happened that there is no big city here.
Otherwise you see that it's sort of well distributed.
So you have stuff there, stuff here, stuff
there, and stuff here.
We have 12 cities in our study; the reason for looking at multiple cities is that we also want to look at across-city similarities or differences.
OK?
So those are the destination cities; those are the cities in our study.
The data that we use is the following, the data
sets that we use.
We have Skitter, which is basically CAIDA's net
measurement tool.
And the data that we used, that's not all the data that
we have, but the data that we use had traceroutes from 20
vantage points.
What Skitter does is basically it has a set of IP addresses
that periodically it traceroutes to.
And some of those were useful for us, so we took that; not all of their data is used here. They use 20 vantage points.
Planetlab is just a network of computers.
You can get slices on it.
We used 14 such vantage points across the world, most of them in the US, and ran traceroutes to our sample set.
In addition, we used our own vantage points, 10 in number--basically we asked our friends, can you lend us a computer for seven days, ten days, whatever? And we ran traceroutes from those.
OK?
So overall, it's 44 vantage points.
One crucial thing is that recent studies have shown that
if you don't run these traceroute-based studies from
multiple vantage points, lots of [INAUDIBLE]
vantage points, you could get misleading results.
So that's why a lot of effort was spent in that.
One obvious thing to consider, and we did consider it, was Rocketfuel, which basically maps autonomous systems, or actually ISPs.
So they have mapped, for example, Verizon.
However, when you look at that data, the amount of it that we
could use wasn't enough, because they're using very few
traceroutes.
So we decided not to use them.
Just to give you an idea of the amount of data that we
have for a city, and Chicago in particular--
so, bigger cities have more data, that's why Chicago's
almost twice the average--
is about 850,000 traceroutes for an average city, of which, as I said, not all complete.
Only about 100,000 traceroutes did complete.
However, we don't just throw away all non-completed
traceroutes.
The number of traceroutes we use for an average city is
over 200,000.
That's a lot of data.
For Chicago, it's more than half a million.
OK.
The first step, I said, was to find a set of IP addresses to reach, so that we could traceroute to them. Now, one thing we found was that if you just randomly sample--well, a lot of the IP address space is just wasted, it's just not used at all.
If you randomly sample, is that good or do you need some
other sampling?
What we found out is that we actually
need some other sampling.
So what we did was the following.
Let me describe the result and then I'll say how we did it.
What we did was we took the IP address space, the IP
addresses within a city, and then we
broke it up into blocks--
and I'll describe how the blocks were made--
and then what this shows is the number of IP addresses versus population, and the number of blocks versus population.
And you can see a much better correlation of the number of
blocks than the number of IP addresses.
What does it tell us?
Basically what it tells us is that the number of blocks is a closer description of the complexity of a city-net than the number of IP addresses.
So what we did was we did block sampling.
Instead of sampling randomly from IP addresses, we had these blocks and we sampled--uniformly, randomly--from those blocks.
And again, you might ask, well, is this correlated with
the population?
How do you know that it gives you more complexity?
To tell you the truth, we don't.
However, if you look at the performance in terms of discovering elements of the city-net, nodes and edges--forget what is written here--what we see is that if you uniformly randomly sample from IP addresses versus our block sampling, we get a 120% improvement.
OK?
For the same number of traceroutes.
So if you do 10,000 traceroutes with uniform sampling versus block sampling, ours is 120% better across all types of nodes and edges we looked at.
So these are interior edges, meaning connections between routers inside Chicago. These are the total number of edges. And these are I-D edges, meaning connections between routers and end hosts.
So across every metric that we looked at we got huge
improvements.
This tells us that it is at least much better than the most trivial way. There could be better approaches for sampling.
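[A minimal sketch of the block sampling just described, assuming the blocks are given as a dict from block id to its member IPs; whether the study sampled with or without replacement is not stated:]

```python
import random

def block_sample(blocks, n):
    """Pick n target IPs: sample blocks uniformly at random (with
    replacement), then one IP uniformly from each chosen block."""
    chosen = random.choices(list(blocks), k=n)   # uniform over blocks
    return [random.choice(blocks[b]) for b in chosen]
```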
What we did for the sampling part was we took the data that we have from the geo sources; the idea there was to get clusters.
This is basically a clustering problem.
We want to cluster IP addresses such that all IP
addresses that have similar routing behavior are
clustered in one.
Because if you do traceroute to one of them, you know the
result for the others as well.
OK?
The geolocation data that we had already had some block
information.
In fact, the [INAUDIBLE]
for example, and when you do this routing, it already has
this block information.
But you still have the problem of not aggregating too much.
Because if you aggregate too much, then you lose the
information.
Because remember what we would be doing is doing block
sampling here.
So we had to find a nice middle ground.
So that's what we did for the block sampling part.
I've sort of described the BID model, but just to formally
state what the BID model is, we have a set of nodes, and
they are divided into three sets.
Partitioned into three sets.
The border nodes are nodes that are just
outside of the city.
The interior nodes represent the routers inside the city.
The destination nodes represent the end
hosts within the city.
OK?
Every path that we have starts at a border node, ends at a destination node, and must have at least one router in between. Only some paths we found didn't have this structure. But most paths do conform to this structure.
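[A small sketch of the BID model and its path constraint as just stated; the data shapes are assumptions:]

```python
from dataclasses import dataclass

@dataclass
class BIDModel:
    """Node sets B (border), I (interior), D (destination), plus the
    observed paths over them."""
    B: set
    I: set
    D: set
    paths: list  # each path is a list of node ids

    def conforms(self, path):
        """Starts at a border node, ends at a destination, and has at
        least one interior node in between."""
        return (path[0] in self.B
                and path[-1] in self.D
                and any(n in self.I for n in path[1:-1]))
```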
Note that we are ignoring what happens beyond this, because we are interested in looking at the connectivity of the city to the outside world, or vice versa.
That's what--
AUDIENCE: --how much of the data is obfuscated from you by
[INAUDIBLE]?
There are lots of little subnets in my house for
example, where you can't tell what's inside because--
MAYUR THAKUR: Yeah.
All those are difficulties now.
AUDIENCE: And so you sort of expect that lots of end points
would be like that with small numbers, but I'm curious if
there are any big numbers that are hidden from view.
Like major subnets and corporations or ISPs.
MAYUR THAKUR: If you had to ask me I would say yes, it's a
possibility, but if you ask me for an estimate, I
would not know that.
We can only know--
one thing we did do was not throw away incomplete results.
Even with those, you would probably get incomplete paths that, after a certain point, just give up. They can't go beyond.
So what we did, we didn't throw them away; we did the best we could do and attached a dummy node there. If Microsoft has a [INAUDIBLE] blocking everything, we don't know which destination node corresponds to that, but there's something that possibly corresponds to it. And if it's a big chunk, it will have a higher rate.
We'll see that later on.
But yeah.
That's the obvious problem.
There's so many things that are hidden, and any traceroute
study will have this problem, essentially.
You could do some--
OK, I should mention that there are some other optimizations that we didn't do that we could do.
For example, people do source routing, where you ask to
forward a packet from here to here via
this node, for example.
We didn't do that.
That could increase the amount of nodes we see.
However, we are tracerouting in so many places, if you ask
me to guess, I don't think it matters in our case.
I would also mention that typical traceroute studies are
not done from these many sources except Rocketfuel.
But Rocketfuel was tracerouting to a small set.
Again, the border nodes connect city-nets to the
outside world.
Interior nodes are routers inside the city, and
destination nodes are end hosts.
And we are only interested in outside to inside traffic.
OK.
And again, what we did was, we had the traceroutes, and to go to the BID model, what we do is just cut them right at the border, essentially.
Also, as an optimization, we only keep traceroutes that go at least two hops into a city. Otherwise you can get a path that has one "B" node for [INAUDIBLE] a destination node.
We didn't want to do that.
We didn't lose too many nodes because of that, though.
Now we still have the problem that, if I give you a path, I haven't labeled each node as an interior or a destination node.
OK?
Because what can happen is that a node is a destination in one trace, and it is a non-destination node, an interior node, in another.
But it's a fairly easy thing to do, if you think about it. Suppose x is a node in our traces. A border trace is simply a trace that is cut at the border.
So x is a node, and if we have ever seen it forwarding a packet, we know that it's a router, because end hosts will not forward packets typically.
So we label it as an I node.
On the other hand, if it's a dead end, which means that
it's the last node on an unfinished traceroute.
What that means is that someone in the internet
thought that it could forward a packet.
Right?
Because assuming that someone was right,
it's probably a router.
So we label it as an I node.
Everything else, we label as a destination node.
OK?
So that's our simple heuristic.
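[The labeling heuristic just described, sketched in Python under assumed data shapes (each trace is its node list plus a completed flag):]

```python
def label_nodes(border_traces):
    """Label every node I (router) or D (end host)."""
    labels = {}
    for nodes, completed in border_traces:
        for n in nodes[:-1]:
            labels[n] = "I"               # seen forwarding => router
        last = nodes[-1]
        if not completed:
            labels[last] = "I"            # dead end on unfinished trace => router
        else:
            labels.setdefault(last, "D")  # everything else defaults to D
    return labels
```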
Again, there are inaccuracies in all these steps.
But that's just the nature of the game.
However, because of all this inaccuracy, data validation
becomes an important step.
So as steps to validate our data, we ask the following questions.
Is our data enough?
Have our models converged?
So we did some statistical tests for convergence, and I'll describe two of them. Let's look at how many network elements we are discovering per traceroute. Actually, it's not per traceroute, it's per source.
So what we did was, for one city in this case, we took all the traceroutes that we have, divided them randomly into 10 buckets, and asked, OK, what is the marginal return for each one? And you can see that at that point it's very insignificant.
OK?
So that gives us that it's at least as far as traceroutes
are concerned, we are converging.
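[A sketch of this bucket-style convergence test; the exact statistic used in the study is not shown, so this just measures the marginal discovery of nodes and edges:]

```python
import random

def marginal_discovery(traces, buckets=10):
    """Shuffle traces into buckets and report how many new network
    elements (nodes and edges) each successive bucket contributes."""
    traces = traces[:]
    random.shuffle(traces)
    seen, gains = set(), []
    size = len(traces) // buckets
    for i in range(buckets):
        new = set()
        for path in traces[i * size:(i + 1) * size]:
            new.update(path)                   # nodes
            new.update(zip(path, path[1:]))    # edges
        gains.append(len(new - seen))
        seen |= new
    return gains  # a flattening tail suggests the model has converged
```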
The other one is sort of non-standard--this one was standard--and that is the factorial test, in which we are basically asking the following question.
There it was simply the number of nodes, the number of elements, the number of edges. Here the question is different: look at the indegree distribution, let's say, in this block. Look at the indegree distribution--has that curve, that metric, converged?
So what this plot shows--this is the solid line--is the indegree for four sources for a particular city, I think it's Chicago. For a particular source [INAUDIBLE] four sources.
And these, the dots here, above and below, are two
standard deviations above and below for three sources.
So think of three sources and the plot that you get, and now go to four sources and the plot that you get; what this shows is that the four-source plot is within a band of two standard deviations above and below the three-source plot.
So in going from three sources to four sources, we are sort of converging.
OK?
The standard deviations here, for example, are very small
compared to the mean.
Each point here, by the way, represents multiple runs.
So the way you compute this point is--
that's the reason it's called a factorial test--
you take your set of sources, you take [INAUDIBLE]
destinations, divide them into sets.
Divide them into sets here, and then take subsets.
You need lots of subsets so that you get statistically
significant results.
Similarly for the outdegree: in this case it's only one standard deviation above and below, and it's still converging.
This gives us confidence that, at least as far as we could go using this methodology, the results can be believed.
Any question so far before I go on to the results section?
OK.
So now that we have constructed these models for city-nets--again, our model is a set of nodes and paths that have been derived from actual traceroute experiments from 44 different sources--what can I say about the structure of city-nets? How are they different from the internet at large? And one question that we are asking here is about robustness.
And I'm going to describe a little bit of all of them.
As expected, the number of nodes increases linearly with
the number of blocks in the city, sort of linearly.
It's well correlated.
I'm sorry, these are very small, but this is the number of border nodes, this is the number of interior nodes--B, I--and this is the number of destination nodes. Again, the correlation with the number of IP addresses is not that good at all; the correlation with blocks is much better.
One interesting thing is, I think you asked about the
length of these paths.
The lengths of these paths lie in a pretty small range. There are no paths shorter than two.
Because we are imposing the structure that you must have
at least one interior node.
Actually three.
Because I'm counting the number of hops,
the number of nodes.
OK?
So between three and about six, that's the number of hops
for almost 99% of the paths.
And the different plots are for different cities.
There's one curve per city.
There are not 12 of them because we chose five or six or seven of them for this plot.
But essentially, one way to think
about this is the following.
If there were lots of long paths, what would that tell us?
Maybe one way to think about this is that there's some ISP
that is not doing a good job, essentially, because if you're
a customer of that ISP, you have a long path.
And people will probably shift from that ISP to something
that performs better.
So the lengths are not that different.
OK.
So now let's look at one difference between the graph
view and the path view.
This is the total degree of edges here, and this is the total degree rank. What I mean by the total degree of an edge is the following: you take an edge, and you take the degree of one endpoint and add it to the degree of the other endpoint. That's what I mean by total degree. This other plot is just for the nodes; the total degree of a node means the sum of the indegree and the outdegree. That plot is coming later on.
So now you see--as has been observed earlier, this is not the first time this has been observed--this is a log-log plot, so this is roughly a straight line, again, for different cities. So it's a power law.
However, if you go to the pathdegree--the number of paths going through a particular node, and this goes for nodes and edges as well--it's definitely qualitatively different, because you see this hump here.
AUDIENCE: What is pathdegree?
MAYUR THAKUR: Oh.
The pathdegree is the number of paths going through a node.
This is a number of edges.
So the pathdegree is way different because--in particular, what is happening is that there are fewer large nodes, because the mass is distributed towards the smaller nodes.
And we'll see a consequence of that.
In a sense, what you have is smaller cuts in city-nets, even fractionally, because you have a smaller number of large nodes. Because the mass in the tail is small compared to, for example, the total [INAUDIBLE].
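[Pathdegree, as defined here, can be computed directly from the path set; a small sketch under assumed data shapes:]

```python
from collections import Counter

def path_degrees(paths):
    """Pathdegree: the number of paths passing through each node / edge,
    as opposed to the plain graph degree."""
    node_pd, edge_pd = Counter(), Counter()
    for path in paths:
        for n in set(path):                  # count each node once per path
            node_pd[n] += 1
        for e in set(zip(path, path[1:])):   # likewise for each edge
            edge_pd[e] += 1
    return node_pd, edge_pd
```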
So the indegree, outdegree, and total degree distributions are power laws; I only showed you the total degree distribution.
In case I didn't explain this, what this is, basically, is you arrange the nodes in decreasing order of their degree, and you plot that rank, or the log of it, against the degree. And it is well known that for the internet this is a power law. So that's why you see an almost linear relation.
Yes.
AUDIENCE: People have suggested a model for [INAUDIBLE] for the power law, this copy-paste model.
I was wondering if there are any ideas of such models for
the distribution of the pathdegrees that you see.
MAYUR THAKUR: We don't have one [INAUDIBLE]. For power laws, for example, they have the rich-get-richer model. To come up with something like that-- but this is definitely less heavy-tailed than a power law.
Yeah.
Something [INAUDIBLE]
we have--
I don't have a result to show you.
AUDIENCE: [INAUDIBLE]
for the close cousins of power law like the [INAUDIBLE].
MAYUR THAKUR: Lognormal we have not tried, but that's, again, stuff for the future.
However, at this point it is more qualitative than
quantitative for reasons that I'll also explain later on.
But yeah, that's, again, something that you would want to do: fit some of these curves.
So what is clear is that it's qualitatively very different.
And we'll see some consequence of that
shortly, next few slides.
And here is something you would have already guessed-- I know why this is taking so long: lots of points, this is a scatter plot.
So now, let's look at whether, if you take an edge or a node and look at its degree, that is a good indication of the number of paths going through it.
But if you look at the total edge degree, this is the edge
pathdegree.
What it means is, the total edge degree is the sum of the
degrees of--
so consider an edge u, v, so the total edge degree of e is
just the degree of u plus the degree of v. OK?
And the pathdegree of an edge is just the number of paths
going through it.
So what this means is just because an edge has lots of
connections doesn't mean a lot of paths are going through it.
It might be true for some, but it's not necessarily true.
You could have an edge that has only a few neighbors, yet
you have a lot of paths going through it.
So that's what this means.
Because there's just no strong correlation at all.
And you can replace this with total node degree and still
it's qualitatively the same.
AUDIENCE: Do you have a finer picture of that based on whether the edge is B-I or I-I or any of those?
MAYUR THAKUR: I might, but not in this presentation.
I might.
There are lots of--
I want to finish on time, I'm already over time.
Just uncovering different structures.
And I know it's a bit unstructured for that, but
again, another property that we have is--let's do some definitions--when I say last router, I mean the I node, the router, immediately preceding the destination.
OK?
And when we say a block--think of a block of IP addresses--we say it's homogeneous if all paths to all the nodes in the block have the same last router. Which basically means that the block is homogeneous as far as routing is concerned: it's the same route from everywhere.
Finding this is not trivial, but you can randomly sample
and find it, and we did do that.
A cluster router is basically a last router that serves a homogeneous block.
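[A sketch of checking homogeneity and collecting cluster routers, assuming paths are grouped by the destination block they enter:]

```python
def last_router(path):
    return path[-2]   # the node immediately preceding the destination

def cluster_routers(paths_by_block):
    """Return the last routers of homogeneous blocks: blocks where every
    observed path ends at the same last router."""
    out = set()
    for block, paths in paths_by_block.items():
        lasts = {last_router(p) for p in paths}
        if len(lasts) == 1:       # homogeneous block
            out.add(lasts.pop())
    return out
```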
So what we have found is that cities have a
lot of cluster routers.
So 55% of Austin's routers, for example, are cluster routers. And for the other cities, the number is roughly in that range.
So what that means is that the last-hop router corresponds to [INAUDIBLE] a number of routers. The cluster routers are significant.
Also, that pretty much means that the number of homogeneous blocks is significant. Lots and lots of blocks are homogeneous, which, first of all, lends credibility to our block sampling, because we basically need one path for each block, but also uncovers a structure that we like to call the apartment hypothesis.
Maybe what is happening is you have these apartments where lots of people live--again, this is an analogy I'm giving you--and then there's a common router, and basically what the ISP is doing is bringing a path to that. And then from there you have connections to all of them.
As you see, almost 80% of city blocks in pretty much every city are homogeneous.
Again, this was found experimentally.
Just a short note on the robustness analysis.
And this is a consequence of the path distribution being
less heavy-tailed.
What we did was we took the city-nets, and the question we asked was: how many nodes do I need to delete so that roughly 80% of the traffic is deleted?
We did it for two things.
One is by the number of paths, and the other is block-uniform weighting, where each block is assigned a traffic weight of one, and then that weight is distributed equally among all the paths coming into that block.
And the results are pretty similar, that's why I'm
showing only one.
So the topmost one is when I do targeted node deletion. So I know all the information about all the paths, and I'm trying to delete the one that gives me the most, so it's like a [INAUDIBLE] heuristic, essentially. And these are just heuristics, because it's [INAUDIBLE] to find out exactly.
OK?
So this is using targeted nodes, where I'm using the number of paths going through a node as the heuristic. This is the number of paths going through an edge as the heuristic, and this is the degree of a node as the heuristic.
So you see that if you use the degree, since it's not that well correlated with the number of paths, there's a significant gap. It's only 50%, roughly.
So think of an attacker, if you're just looking at the
graph view.
You are probably not going to do as well for the attacker as
if you were looking at the path.
Again, it argues for the path view.
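[A sketch of the targeted-node heuristic just discussed: greedily delete the node with the highest pathdegree until the target fraction of paths is disrupted. Data shapes are assumptions:]

```python
from collections import Counter

def greedy_attack(paths, fraction=0.8):
    """Return the nodes deleted, in order, until `fraction` of the
    paths is disrupted."""
    alive = [set(p) for p in paths]
    target = fraction * len(paths)
    deleted, killed = [], 0
    while killed < target and alive:
        pd = Counter(n for p in alive for n in p)   # pathdegrees
        top, _ = pd.most_common(1)[0]
        deleted.append(top)
        survivors = [p for p in alive if top not in p]
        killed += len(alive) - len(survivors)
        alive = survivors
    return deleted
```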
These are just random deletions, for comparison. We just pick the nodes randomly, and the internet, because of its power law structure, is known to be resilient in that case.
AUDIENCE: [INAUDIBLE]
you mean adversarily chosen?
MAYUR THAKUR: Yes.
AUDIENCE: Factor of 10 to negative 3 there?
MAYUR THAKUR: Yes, oh yes.
This is a good point.
It's my punchline.
AUDIENCE: So it's a half percent [INAUDIBLE]?
MAYUR THAKUR: This is half a percent.
And that deletes 80% of the traffic.
For the internet, if you look at the [INAUDIBLE] graphs and stuff like that, this is an order of magnitude larger.
So you have to take the same curve, except
you might have to--
I forget the exact number but it's at least one order of
magnitude larger.
AUDIENCE: Do you know the geographic distribution of
nodes and how it would be impacted?
So I know there's a couple buildings in downtown San Jose
that if you parked a truck bomb outside them you could
[INAUDIBLE].
MAYUR THAKUR: No. See, that is very hard to do.
That's the first idea we had.
[INAUDIBLE]
plot it.
The problem is, the way geolocation works, the granularity is not that fine. So it'll tell you, OK, it's within 10 miles or so.
AUDIENCE: So you didn't have more accurate information for
any results?
MAYUR THAKUR: No.
And no one has that, to my knowledge.
But that's an interesting thing to do, because if you
can do that, you can do quite a few things.
Yeah, we didn't have that.
All we knew was that this was in the city and this was
outside the city.
AUDIENCE: You used to be able to separate the east coast
from the west coast for civilian voice traffic by only
cutting four places in the US.
Those were the early 90s.
They fixed it.
MAYUR THAKUR: OK, just some more definitions.
Again, looking at vulnerability structure.
We define the waist of a city.
The structure that we have sort of goes like this: there are few paths coming in, or few edges coming in, and then it branches out. Think of it as you have [INAUDIBLE] coming in, and then as you go toward your home, it branches out into small streets. That's the mental model that we have. So this is what we call the waist, and then you have the hip there.
And the waist of the city is just the smallest number of routers that carry 80% of the traffic. The hip of a city is the smallest number of end hosts that account for 80% of the traffic.
And what we have is basically a small
waist, large hip structure.
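[The hip is straightforward to compute, since each path has exactly one destination; a minimal sketch follows. The waist is the analogous quantity over interior routers, but because a path crosses several routers, the exact minimum is a set-cover-style problem, and the greedy pathdegree-deletion sketch shown earlier is one way to approximate it. Data shapes here are assumptions:]

```python
from collections import Counter

def hip(paths, fraction=0.8):
    """Smallest number of end hosts accounting for `fraction` of paths:
    a sorted prefix sum over per-destination path counts."""
    counts = sorted(Counter(p[-1] for p in paths).values(), reverse=True)
    need, total, k = fraction * len(paths), 0, 0
    for c in counts:
        if total >= need:
            break
        total += c
        k += 1
    return k
```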
Again, this is with the uniform block weighting that I mentioned earlier. It's not important--the results are pretty consistent, actually, even without it.
So just to give you an idea of the waist-to-hip ratios: these are not fractions, these are percentages.
This is half a percent, essentially.
Really, really small except for a few.
If you look at it, these are small cities.
So what is happening is, two things are happening. There are fewer hosts here compared to the number of routers. And also, I figure there's some error that we have.
Related to that, we have what we call a DoSimeter index--basically a denial-of-service attack meter. Which basically says, OK: you delete k adversarially chosen nodes, with k fixed at 10 nodes. Where will it impact the most?
So think of it as how many people is it going to impact?
So it's the ratio of the population to the waist, essentially. Because remember, the waist is the smallest number of routers that carry 80% of the traffic.
And we see that there is a correlation between the size
and the population, and the size and the DoSimeter here.
So where larger cities have smaller DoSimeters.
Small is bad, because it means that it is--
wait, hold on.
So these are larger cities, and--
small is good, sorry.
Large is bad.
Large is bad, which means that with a few nodes you can
disrupt a lot of population.
But larger cities are actually worse.
There's a downtrend here.
This is an outlier.
I think it was Washington or so.
However, note that all of this is--
AUDIENCE: [INAUDIBLE].
MAYUR THAKUR: A related one that you might think of is the ratio of the incoming to the outgoing, in a certain sense: the ratio of I-D edges to B-I edges. We are claiming that this is much larger than that.
These are sort of similar ones.
Again, this is representative of the degree of
disaggregation when the paths come into the city.
One interesting thing that we caught is that this is not--I mean, this is from the data. We are plotting the actual data.
So this is very surprising, but what we found is that if you look at the indegree/outdegree total distribution, for these values you can actually prove that the edge skew, which is the ratio of I-D to B-I edges--the ratio of your local streets to the highways--is log N, which is why this is a straight line, because this is a log scale.
OK?
So this you can prove.
It's not that non-trivial.
It's pretty trivial to prove, actually.
OK.
So, I was going to talk a little bit about the
theoretical part of our work, but I'll just end with the one
result that we have.
So we have shown that city-nets are vulnerable, in a certain restricted sense. Note that we don't have traffic data, so we are only looking at what we have. So the obvious question is: can we detect large-scale attacks?
So let me just define one reasonable large-scale attack, which is the following.
We say that an attack is a (k, epsilon) attack on a path network P if the adversary can remove up to k nodes or edges--k network elements--and must cause a disruption to at least an epsilon fraction of the paths. So if it is 10 and 0.2, by removing 10 elements he must disrupt at least 20% of the paths.
So this is the kind of attacks that we have been looking at.
It's inspired from work by Kleinberg, who looked at the
same thing for graphs, not for paths.
So one application might be that you can install monitors
on chosen end points, a small number of monitors, so that
any large-scale disruption shows up.
So what we can prove is that for every path network, and for each k and epsilon, there exist detection sets of size polynomial in k and epsilon, independent of the size of the path network--independent of the number of nodes and independent of the number of edges. And in fact a simple randomized algorithm can find them.
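[A sketch of the flavor of this randomized construction: sample paths and monitor their end hosts. The sample-size formula below uses a generic epsilon-net-style bound with VC-dimension d = k log c and assumed constants; the paper's exact set system and constants are not shown in the talk:]

```python
import math
import random

def detection_set(paths, k, eps, c=3):
    """Sample paths (i.e., place monitors on their end hosts) so that a
    (k, eps)-attack disrupts a monitored path with high probability.
    The constants here are assumptions, not the paper's."""
    d = max(1, round(k * math.log2(c)))                  # VC-dimension k log c
    m = math.ceil((8 * d / eps) * math.log2(16 / eps))   # epsilon-net-style size
    m = min(m, len(paths))
    return random.sample(paths, m)   # size depends only on k, eps, c
```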
It's based on VC-dimension and a very strong result by Haussler and Welzl, which I'm not going to define, but that was the proof. Basically what we get is that the VC-dimension of a set system that we define--and again, I'm not defining what VC-dimension is, for lack of time--is k log c. And Haussler and Welzl give us that the detection set size is polynomial in that and epsilon prime.
So what you have is that it scales up very well with your
path network.
It doesn't depend on the size of the path network at all; it only depends on your k and epsilon.
One crucial thing that we need to know is: what is c? I didn't define what c is. It's a technical condition that we need on our path network, which is the following.
Think of two paths.
They merge, and they deviate, and they merge again.
c bounds the number of times they can do that. So it can't be arbitrary; it can't depend on n, otherwise we don't get our result. So that's the confluence coefficient.
However, because internet paths are destination based, we have found that you never see more than three confluences.
Just to conclude, we've studied city-nets as important
subnetworks of the internet.
The crucial difference between our work and other work is that we have a path view, and we have a methodology to make realistic maps of city-nets. We have taken steps to ensure statistical validity, as much as we can.
The cities show structure that is different from the structure of the internet at large, in particular the small-waist, large-hip structure. And they seem an order of magnitude more vulnerable than the internet at large.
So if you want to attack a smaller portion
you can do it easily.
In fractional terms as well, right?
Do remember that the detection sets are fractions.
So that's where my talk ends.
Sorry for going overtime.
Any questions, I am ready to take.