FEMALE SPEAKER: So I'm happy to introduce Mayur Thakur.
He's from the same undergrad college, IIT Delhi.
He did his PhD at the University of Rochester and is right now
an assistant professor at the University of Missouri-Rolla.
His research interests are in network and graph algorithms,
[INAUDIBLE]
analysis, [INAUDIBLE] and modeling, and [INAUDIBLE]
which is dear to us so [INAUDIBLE].
He has also held positions at Northeastern University, Los
Alamos Labs, Equinox Corporation, MSR,
Phillips, as well--
[INAUDIBLE]
AUDIENCE: What was that list again?
What was the list again?
FEMALE SPEAKER: Northeastern University, Los Alamos Labs,
Equinox Corporation, MSR, Phillips--
MAYUR THAKUR: [INAUDIBLE].
FEMALE SPEAKER: All right, he'll talk about
"Net and the City".
MAYUR THAKUR: Thanks, [INAUDIBLE], for the
introduction.
It's nice to be here.
First time in California, actually, so this is very
strange because I've been in the US for eight years now.
The talk is going to be on "Net and the City".
This is ongoing work with collaborators at Virginia
Tech, Northeastern, and Los Alamos, actually.
I am going to first motivate why we want to look at city-nets, what those are, and then basically describe how we build our model of city-nets.
Then I'll present some analysis and what the
structure of the city-net looks like, and then at the
end I'll present some analysis on attacks that we have.
So, up front, let me just describe, or define, what a city-net is.
[SIDE CONVERSATION]
MAYUR THAKUR: So, when I refer to a city-net, what I mean is
exactly this.
I mean a set of routers, end hosts, and links that are located in a particular metropolitan area.
So think of New York and think of its boundaries.
It's not defined well, but let's say I define a
boundary for you.
And then think of every router, end host, and link
that is located inside of that boundary.
That's what I mean by the city-net.
OK? That's what I mean by a city-net.
So what we want to do is to uncover the structure for
large metropolitan areas like New York City, like Chicago,
and we actually have done that.
The reason for doing that is the following.
A lot of people have studied the general structure, the structure of the internet at large. And also, there's been a lot of study of individual sub-networks: for example, ISP structure has been studied, autonomous system structure has been studied.
What we are doing is looking at the local structure, the
geographically local structure of the internet.
Why do we want to do that?
Well, the internet as infrastructure has grown with
humans, and humans have grown the most in cities.
So what we want to do is look at how the infrastructure has
grown in dense areas.
Dense both with respect to human population and in terms
of importance, because cities are where the businesses are.
Cities are where the universities are.
Cities are where most of the people are.
In terms of computer networks, what this will do is
facilitate network simulation studies.
You'll see some aspects of that when we study the vulnerability or the robustness of networks.
What we want to do is to develop a methodology that
anyone having access to a few nodes and stuff can create a
structure, can create a map of city-nets, and then use that
for simulations.
We plan to release models for the research community, so
what will come out of this research, we want to package
it and release it to the network community.
Because the structure of city-nets is not currently known.
OK, one thing that I wanted-- yes.
AUDIENCE: [INAUDIBLE]
data?
Exactly what--
MAYUR THAKUR: We can't release all the data because some of
the data is sensitive.
So what we'll do is we'll anonymize and release it.
Is that what you're asking?
AUDIENCE: Is that a mathematical model or is it--
MAYUR THAKUR: It's actually data from real world.
But we also plan to do research and from that real
world data make mathematical models.
Both.
But we can't release all the data that we have because some of it is actually sensitive. But we can anonymize and release it.
For example, not have any IP addresses.
Just give them integers.
OK.
One important distinction between our model--
I'll describe our model before I start--
our model of city-nets, or networks in general for now,
but let's go with city-nets--
is a path view.
So we don't look at a network as a graph, which is what you would typically think.
What we look at, when I say city-nets, a model of city-nets, is a set of nodes and a set of paths on those nodes. And then you can think of attaching traffic rates and stuff on that, but this is the crucial difference.
It's a path view as opposed to a graph view.
Now why do we want to do that?
Basically two reasons.
Routing on the internet is policy based. Just because there is a path on the internet doesn't mean any packet will travel that path.
So we believe that the path view is the more correct view.
Also, it has been shown that paths on the internet have this dominant path phenomenon, where in a steady state, all traffic between this point and some other point is dominated by a single path, even though there are multiple paths it could take.
It's changing slightly because of new technology and stuff,
but still, internet shows dominant path.
So again, when I say a model of a city-net, what I mean is
a set of nodes and a set of paths over those nodes.
So I'll describe all that today in a later slide, but
roughly speaking, this is the boundary of the city, these
are nodes right outside the city, so that's what I call
the border nodes B. These are the set of routers also called
internal nodes.
So these route traffic, and this set of
destinations are end hosts.
OK?
That's the basic model.
OK, what this shows here is that you can have a small cut,
but we'll come to that.
But the basic idea is path instead of graph.
AUDIENCE: You're thinking only of
physical nodes, not wireless?
MAYUR THAKUR: No, at this point it's just--
and also if you can think of what level--
is it level two, level three, we're looking at
the IP router level.
OK?
People have done work at the AS level--so what is the AS level of connectivity?
Or people have done it at even lower levels, for example, as
the physical connectivity.
We are looking at the IP level.
AUDIENCE: Are you going to get to it later or is it a good
time to ask about the typical length of the paths?
MAYUR THAKUR: I'm going to get to that.
Good question, [INAUDIBLE].
Yes, so for length there are two: the unweighted length and the weighted length. Weight is hard, so we have not done that. We do have some models, where weight is basically
traffic or capacity.
And it's hard, but we have some models for that, we'll
come to that.
So, the problem statement is we want to design a robust
methodology.
The idea, in addition to the data, we want to develop a
methodology for inferring realistic path network views
of city-nets.
What do I mean?
I've described what path view is.
What do I mean by realistic?
What I mean by realistic is that first of all, it should
be inferred from real data.
And secondly, even though it's impossible to have all
routers, there are just too many of them and all end hosts
in your model, it should be realistic from a point of view
of whatever metrics you're interested in.
Whatever metrics common simulation studies are interested in, the model should have those properties: for those metrics, it should match the real world.
So our view, as I just mentioned earlier, is the router level view: think of each node as a router or an end host. In a lot of different studies, each node
is an autonomous system.
Or you can think of each node as an ISP.
Or you can think of each node as--
so when I say router, I mean an IP router at the IP level,
not at the physical level.
Because logical connectivity could be different from
physical connectivity.
There has been work, in particular work by [INAUDIBLE] and I think someone at Google, actually, at the physical connectivity.
But we're not doing that.
Another thing that we are not interested in--and you can see that from this--is transit traffic. If you look, all paths start from outside and end inside.
There could be a path that starts from outside, goes
inside, and comes out.
We are not interested in that.
We'll not model that.
Why?
Because we're interested in how the city is connected to
the outside world and vice versa.
How the outside world is connected to the city.
So we are ignoring transit paths.
OK?
Other thing is, we actually are interested in forward
connectivity and backward connectivity.
However, 50% of paths have been shown to be symmetric,
and because of the way we do [INAUDIBLE] traceroutes, the
backward connectivity's harder to do.
So even though I'm claiming I will be sort of implying that
it works for both forward and backward, actually I'm going
to caveat it by saying that it's only forward paths.
But if you ask me to hazard a guess, I would say that the
forward and backward would not be much
qualitatively different.
OK?
But our models are not based on backward paths at all.
This is forward paths.
Obviously the issue here is that a path from A to B might be much different from a path from B to A. But typically, 50% of paths are in fact symmetric, and lots more share many other properties.
OK, some related work, there's a ton of activity, I'm missing
a lot of stuff here.
But I'm just able to give you a flavor of the kind of work
that has been done.
Different views of the internet have been looked at.
And this is for general internet.
Ours is only for the cities.
So there's the router level view, you have the autonomous
system view, or the ISP view.
OK?
Also, related to our work, in fact a very crucial part of
our work is knowing which IP address falls in which city.
So we are using work that other people have done on
geolocation.
Given an IP address, how do you locate it geographically? What are its latitude and longitude?
People use different kinds of triangulation techniques, both
in open research community and there are whole companies, and
we are using some data from these companies, even though
we could, in principle, do all of the [INAUDIBLE] work and
then use that data.
But we are not.
We are using it as a black box.
Geographic properties of routing--
AUDIENCE: Whose data did you use?
MAYUR THAKUR: Yeah, I'll come to that--
Akamai, Digital [INAUDIBLE]--so we use three sources, and they are all leaders in that.
Geographic properties of routing.
So if a packet has to be sent from Mountain View to New
York, how does it travel geographically?
Does it go to Chicago first?
That has been studied, and it has been shown that it's not always optimal, but a lot of times it is optimal in terms of distance.
Again, that is for the general internet; ours is for cities. We look upon our work as a part of the bigger study of understanding the internet and how it has grown, which is--
there's a ton of work in that [INAUDIBLE]
this is the first work that hinted that the internet
graphs, our internet has this power law structure, which is
a reason for efficient routing, but also leads to
vulnerability.
So we look upon our work as a part of the bigger internet
visual tomography.
There's a typo there.
We are not doing any visualization, though.
And again, even beyond internet, there's work on
complex networks.
Social networks, internet, so on and so forth.
On power networks, electrical power networks, there has been some related work, but it's not part of this talk.
OK.
So what is the basic methodology that we have?
So our goal again is to find a path model.
So I want to give a set of nodes and a set of paths over
those nodes for a given city.
So let's assume a city like Chicago, for example.
Wherever I need a specific city in this talk, I'll show Chicago's example.
OK?
So I want to give you that.
I want to give you a path model for Chicago.
So how do we do it?
The first thing we do is find the set of all IP addresses that lie within Chicago.
This we do using datasets.
So we are actually not doing research here; we're just using other people's research.
Now after that, since we can't run traceroutes to all of them--our basic methodology is traceroutes--we sample intelligently so that, per traceroute, the number of nodes and edges discovered is improved.
Then we actually traceroute from a lot of different vantage points, close to 44 vantage points, to every IP in that sample set that we have. We clean the data, then we build a path model.
We call it the BID model, and you sort of saw the glimpse of
it, and I'll come back to it.
But essentially that's what we do.
You have a sample set inside a city, you have all these
sources, you run traceroute from all these sources to the
sample set, and you get your path-based model, and we do
analysis after that.
OK.
So the first step was basically the IP
to geographic mapping.
As I mentioned in an earlier slide, it's an
active area of research.
We are not doing research in this.
We just take data from--
this is actually proprietary data, but if we really wanted,
we could spend effort and build our own system.
People have done that in the research community as well.
These are all well known leaders in their field.
What we do is take these three data.
The data doesn't match because of obvious reasons.
It's not exact.
We don't know where the city ends and the suburbs start.
One thing we want to note is that we want to consider the
metropolitan area.
Because we don't want to lose out on home
customers, for example.
Blocks related to home customers.
That's just one.
So we want to look at not just downtown, but the metropolitan
area as well.
Again, that definition's also not clear.
But basically, what these data give us is, for each IP, a name, which is basically the name of the city.
Actually, they do it by blocks.
But then you have to merge these data.
How do you merge these data to get one clean data?
OK.
The basic tool that we have, as other studies have used too, is traceroutes. What traceroute does is, basically, you start from a point and it gives you the route from that point to any other point on the internet.
So this is an example of an actual
traceroute from our dataset.
So you wanted to find out the forward path to this IP
address, and this gives you that.
We are not using the time information, but you could use the time information for weighting, for example. We're not doing that as of now.
This is basically an internal bookkeeping
that we have. OK?
So traceroute is a utility that's been used a lot in network studies. It uses the "time-to-live" attribute of IP packets.
What you do is you say, OK, that IP packet should expire
in, say, five steps.
So at the fifth hop, it expires, and then, just by the
way the protocols work, hopefully the router will send
you a message saying, oh, your packet expired.
So you know what the fifth router is, essentially.
They're not really quite [INAUDIBLE], but that's why traceroutes are a lot of times not successful. In fact, the success rate is only about 10% or so.
OK?
But that's what we have to live with.
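[To illustrate the TTL mechanic just described, here is a minimal Python sketch assuming the scapy library and root privileges; the actual measurement utility used in this work is custom and not shown:]

```python
from scapy.all import IP, ICMP, sr1

def probe_path(dst, max_hops=30, timeout=2):
    """Send ICMP probes with increasing TTL and record which router answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        reply = sr1(IP(dst=dst, ttl=ttl) / ICMP(), timeout=timeout, verbose=0)
        if reply is None:
            hops.append(None)       # the hop silently dropped the probe
        elif reply.type == 11:      # ICMP time-exceeded: hop discovered
            hops.append(reply.src)
        else:                       # echo reply: reached the destination
            hops.append(reply.src)
            break
    return hops
```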
So what we have is-- yeah?
AUDIENCE: In the traceroutes I seem to remember there's two
different ways of doing it with ICMP and--
MAYUR THAKUR: Yeah.
I don't mention that but we do all sorts of optimization.
That's the point of this slide, actually.
There's lots of different ways to do it.
We choose ICMP because UDP raises more flags.
So what you have to be careful about traceroutes is that,
first of all, they can take a long, long time.
So without those optimizations, which I'm not detailing here, 100,000 traceroutes, which is a typical number for a city, will take about two to six weeks, depending on how many successes you have. So we optimize our traceroutes. For example, we don't send three packets, which is what traceroute normally does.
Remember, we are not interested in very fine structure, because it's more important to get more structure than to get every single small piece of it.
So we do these optimizations, part of that
is using ICMP packets.
And that gets it down to a reasonable number: one day for 100,000 traceroutes, essentially.
So the utility that we have is basically a utility for
collecting traceroutes, and it has tunable options, including
[INAUDIBLE].
But the way we do it is to always use ICMP. It works across machines, because sources in different
locations have different operating systems. It does
have error recovery, so if you stop at a certain point, you
can start it up without losing much data at all.
So I'm not going to go into optimization, but some effort
was spent there in trying to make sure that this was
reasonable.
Because two to six weeks is just not reasonable.
OK.
So what we did was, we chose 12 of the 25 largest cities in
terms of population within the US.
And we chose cities such that they were
distributed across the US.
And it just so happened that there is no big city here.
Otherwise you see that it's sort of well distributed.
So you have stuff there, stuff here, stuff
there, and stuff here.
We have 12 cities in our study; the reason for looking at multiple cities is that we also want to look at across-city similarities or differences.
OK?
So those are the destination cities; those are the cities in our study.
The data that we use is the following, the data
sets that we use.
We have Skitter, which is basically CAIDA's net
measurement tool.
And the data that we used, that's not all the data that
we have, but the data that we use had traceroutes from 20
vantage points.
What Skitter does is basically it has a set of IP addresses
that periodically it traceroutes to.
And some of those were useful for us, so we took that; not all of their data is used here. They use 20 vantage points.
Planetlab is just a network of computers.
You can get slices on it.
We used 14 such vantage points across the world, most of them in the US, and ran traceroutes to our sample set.
In addition, we used our own vantage points, 10 in number--basically we asked our friends, can you lend us a computer for seven days, ten days, whatever? And we ran traceroutes from those.
OK?
So overall, it's 44 vantage points.
One crucial thing is that recent studies have shown that
if you don't run these traceroute-based studies from
multiple vantage points, lots of [INAUDIBLE]
vantage points, you could get misleading results.
So that's why a lot of effort was spent in that.
One obvious thing to consider, and we did consider it, was Rocketfuel, which basically maps autonomous systems, or actually ISPs.
So they have mapped, for example, Verizon.
However, when you look at that data, the amount of it that we
could use wasn't enough, because they're using very few
traceroutes.
So we decided not to use them.
Just to give you an idea of the amount of data that we
have for a city, and Chicago in particular--
so, bigger cities have more data, that's why Chicago's
almost twice the average--
is about 850,000 traceroutes for an average city, of which, as I said, not all complete.
Only about 100,000 traceroutes did complete.
However, we don't just throw away all non-completed
traceroutes.
The number of traceroutes we use for an average city is
over 200,000.
That's a lot of data.
For Chicago, it's more than half a million.
OK.
The first step, I said, was to find a set of IP addresses to reach, so that we could traceroute to them. Now, one thing we found was that if you just randomly sample--well, a lot of the IP address space is just wasted, it's just not used at all.
If you randomly sample, is that good or do you need some
other sampling?
What we found out is that we actually
need some other sampling.
So what we did was the following.
Let me describe the result and then I'll say how we did it.
What we did was we took the IP address space, the IP
addresses within a city, and then we
broke it up into blocks--
and I'll describe how the blocks were made--
and then what this shows is the number of IP addresses versus population, and the number of blocks versus population.
And you can see a much better correlation of the number of
blocks than the number of IP addresses.
What does it tell us?
Basically what it tells us is that the number of blocks is a closer description of the complexity of a city-net than the number of IP addresses.
So what we did was we did block sampling.
Instead of sampling randomly from IP addresses, we had these blocks and we sampled--uniformly, randomly--from those blocks.
And again, you might ask, well, is this correlated with
the population?
How do you know that it gives you more complexity?
To tell you the truth, we don't.
However, if you look at the performance in terms of discovering elements of the city-net, nodes and edges--forget what is written here--what we see is that if you uniformly randomly sample from IP addresses versus our block sampling, we get a 120% improvement.
OK?
For the same number of traceroutes.
So if you do 10,000 traceroutes with uniform sampling versus block sampling, ours is 120% better across all types of nodes and edges we looked at.
So these are interior edges, meaning connections between routers inside Chicago. These are the total number of edges. And these are I-D edges, meaning connections between routers and end hosts.
So across every metric that we looked at we got huge
improvements.
This tells us that it is at least much better than the most trivial way. There could be better approaches for sampling.
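[A minimal sketch of the block sampling just described, assuming the blocks are given as a dict from block id to its member IPs; whether the study sampled with or without replacement is not stated:]

```python
import random

def block_sample(blocks, n):
    """Pick n target IPs: sample blocks uniformly at random (with
    replacement), then one IP uniformly from each chosen block."""
    chosen = random.choices(list(blocks), k=n)   # uniform over blocks
    return [random.choice(blocks[b]) for b in chosen]
```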
What we did for the sampling part was we took the data that we have from the geo sources; the idea there was to get clusters.
This is basically a clustering problem.
We want to cluster IP addresses such that all IP
addresses that have similar routing behavior are
clustered in one.
Because if you do traceroute to one of them, you know the
result for the others as well.
OK?
The geolocation data that we had already had some block
information.
In fact, the [INAUDIBLE]
for example, and when you do this routing, it already has
this block information.
But you still have the problem of not aggregating too much.
Because if you aggregate too much, then you lose the
information.
Because remember what we would be doing is doing block
sampling here.
So we had to find a nice middle ground.
So that's what we did for the block sampling part.
I've sort of described the BID model, but just to formally
state what the BID model is, we have a set of nodes, and
they are divided into three sets.
Partitioned into three sets.
The border nodes are nodes that are just
outside of the city.
The interior nodes represent the routers inside the city.
The destination nodes represent the end
hosts within the city.
OK?
Every path that we have starts at a border node, ends at a destination node, and must have at least one router in between. Only some paths we found didn't have this structure. But most paths do conform to this structure.
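[A small sketch of the BID model and its path constraint as just stated; the data shapes are assumptions:]

```python
from dataclasses import dataclass

@dataclass
class BIDModel:
    """Node sets B (border), I (interior), D (destination), plus the
    observed paths over them."""
    B: set
    I: set
    D: set
    paths: list  # each path is a list of node ids

    def conforms(self, path):
        """Starts at a border node, ends at a destination, and has at
        least one interior node in between."""
        return (path[0] in self.B
                and path[-1] in self.D
                and any(n in self.I for n in path[1:-1]))
```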
Note that we are ignoring what happens beyond this, because we are interested in looking at the connectivity of the city to the outside world, or vice versa.
That's what--
AUDIENCE: --how much of the data is obfuscated from you by
[INAUDIBLE]?
There are lots of little subnets in my house for
example, where you can't tell what's inside because--
MAYUR THAKUR: Yeah.
All those are difficulties now.
AUDIENCE: And so you sort of expect that lots of end points
would be like that with small numbers, but I'm curious if
there are any big numbers that are hidden from view.
Like major subnets and corporations or ISPs.
MAYUR THAKUR: If you had to ask me I would say yes, it's a
possibility, but if you ask me for an estimate, I
would not know that.
We can only know--
one thing we did do was not throw away incomplete results.
Even with those, you would probably get incomplete paths that, after a certain point, just give up. They can't go beyond.
So what we did, we didn't throw them away; we did the best we could do and attached a dummy node there. If Microsoft has a [INAUDIBLE] blocking everything, we don't know which destination node corresponds to that, but there's something that possibly corresponds to it. And if it's a big chunk, it will have a higher rate.
We'll see that later on.
But yeah.
That's the obvious problem.
There's so many things that are hidden, and any traceroute
study will have this problem, essentially.
You could do some--
OK, I should mention that there are some other optimizations that we didn't do that we could do.
For example, people do source routing, where you ask to
forward a packet from here to here via
this node, for example.
We didn't do that.
That could increase the amount of nodes we see.
However, we are tracerouting in so many places, if you ask
me to guess, I don't think it matters in our case.
I would also mention that typical traceroute studies are
not done from these many sources except Rocketfuel.
But Rocketfuel was tracerouting to a small set.
Again, the border nodes connect city-nets to the
outside world.
Interior nodes are routers inside the city, and
destination nodes are end hosts.
And we are only interested in outside to inside traffic.
OK.
And again, what we did was, we had the traceroutes, and to go to the BID model, what we do is just cut them right at the border, essentially.
Also, as an optimization, we only keep traceroutes that go at least two hops into a city. Otherwise you can get a path that has one "B" node for [INAUDIBLE] a destination node.
We didn't want to do that.
We didn't lose too many nodes because of that, though.
Now we still have the problem that, if I give you a path, I haven't labeled each node as an interior or a destination node.
OK?
Because what can happen is that a node is a destination in one trace, and it is a non-destination node, an interior node, in another.
But it's a fairly easy thing to do, if you think about it. Suppose x is a node in our traces. A border trace is simply a trace that is cut at the border.
So x is a node, and if we have ever seen it forwarding a packet, we know that it's a router, because end hosts will not forward packets typically.
So we label it as an I node.
On the other hand, if it's a dead end, which means that
it's the last node on an unfinished traceroute.
What that means is that someone in the internet
thought that it could forward a packet.
Right?
Because assuming that someone was right,
it's probably a router.
So we label it as an I node.
Everything else, we label as a destination node.
OK?
So that's our simple heuristic.
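[The labeling heuristic just described, sketched in Python under assumed data shapes (each trace is its node list plus a completed flag):]

```python
def label_nodes(border_traces):
    """Label every node I (router) or D (end host)."""
    labels = {}
    for nodes, completed in border_traces:
        for n in nodes[:-1]:
            labels[n] = "I"               # seen forwarding => router
        last = nodes[-1]
        if not completed:
            labels[last] = "I"            # dead end on unfinished trace => router
        else:
            labels.setdefault(last, "D")  # everything else defaults to D
    return labels
```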
Again, there are inaccuracies in all these steps.
But that's just the nature of the game.
However, because of all this inaccuracy, data validation
becomes an important step.
So as steps to validate our data, we ask the following questions.
Is our data enough?
Have our models converged?
So we did some statistical tests for convergence, and I'll describe two of them. Let's look at how many network elements we are discovering per traceroute. Actually, it's not per traceroute, it's per source.
So what we did was, for one city in this case, we took all the traceroutes that we have, divided them randomly into 10 buckets, and asked, OK, what is the marginal return for each one? And you can see that at that point it's very insignificant.
OK?
So that gives us that it's at least as far as traceroutes
are concerned, we are converging.
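[A sketch of this bucket-style convergence test; the exact statistic used in the study is not shown, so this just measures the marginal discovery of nodes and edges:]

```python
import random

def marginal_discovery(traces, buckets=10):
    """Shuffle traces into buckets and report how many new network
    elements (nodes and edges) each successive bucket contributes."""
    traces = traces[:]
    random.shuffle(traces)
    seen, gains = set(), []
    size = len(traces) // buckets
    for i in range(buckets):
        new = set()
        for path in traces[i * size:(i + 1) * size]:
            new.update(path)                   # nodes
            new.update(zip(path, path[1:]))    # edges
        gains.append(len(new - seen))
        seen |= new
    return gains  # a flattening tail suggests the model has converged
```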
The other one is sort of non-standard--this one was standard--and that is the factorial test, in which we are basically asking the following question.
There it was simply the number of nodes, the number of elements, the number of edges. Here the question is different: look at the indegree distribution, let's say, in this block. Look at the indegree distribution--has that curve, that metric, converged?
So what this plot shows--this is the solid line--is the indegree for four sources for a particular city, I think it's Chicago. For a particular source [INAUDIBLE] four sources.
And these, the dots here, above and below, are two
standard deviations above and below for three sources.
So think of three sources and the plot that you get, and now go to four sources and the plot that you get; what this shows is that the four-source plot is within a band of two standard deviations above and below the three-source plot.
So in going from three sources to four sources, we are sort of converging.
OK?
The standard deviations here, for example, are very small
compared to the mean.
Each point here, by the way, represents multiple runs.
So the way you compute this point is--
that's the reason it's called a factorial test--
you take your set of sources, you take [INAUDIBLE]
destinations, divide them into sets.
Divide them into sets here, and then take subsets.
You need lots of subsets so that you get statistically
significant results.
Similarly for the outdegree: in this case it's only one standard deviation above and below, and it's still converging.
This gives us confidence that, at least as far as we could go using this methodology, the results can be believed.
Any question so far before I go on to the results section?
OK.
So now that we have constructed these models for city-nets--again, our model is a set of nodes and paths that have been derived from actual traceroute experiments from 44 different sources--what can I say about the structure of city-nets? How are they different from the internet at large? And one question that we are asking here is about robustness.
And I'm going to describe a little bit of all of them.
As expected, the number of nodes increases linearly with
the number of blocks in the city, sort of linearly.
It's well correlated.
I'm sorry, these are very small, but this is the number of border nodes, this is the number of interior nodes--B, I--and this is the number of destination nodes. Again, the correlation with the number of IP addresses is not that good at all; the correlation with blocks is much better.
One interesting thing is, I think you asked about the
length of these paths.
The lengths of these paths lie in a pretty small range. There are no paths shorter than two.
Because we are imposing the structure that you must have
at least one interior node.
Actually three.
Because I'm counting the number of hops,
the number of nodes.
OK?
So between three and about six, that's the number of hops
for almost 99% of the paths.
And the different plots are for different cities.
There's one curve per city.
There are not 12 of them because we chose five or six or seven of them for this plot.
But essentially, one way to think
about this is the following.
If there were lots of long paths, what would that tell us?
Maybe one way to think about this is that there's some ISP
that is not doing a good job, essentially, because if you're
a customer of that ISP, you have a long path.
And people will probably shift from that ISP to something
that performs better.
So the lengths are not that different.
OK.
So now let's look at one difference between the graph
view and the path view.
This is the total degree of edges here, and this is the total degree rank. What I mean by the total degree of an edge is the following: you take an edge, and you take the degree of one endpoint and add it to the degree of the other endpoint. That's what I mean by total degree. This other plot is just for the nodes; the total degree of a node means the sum of the indegree and the outdegree. That plot is coming later on.
So now you see--as has been observed earlier, this is not the first time this has been observed--this is a log-log plot, so this is roughly a straight line, again, for different cities. So it's a power law.
However, if you go to the pathdegree--the number of paths going through a particular node, and this goes for nodes and edges as well--it's definitely qualitatively different, because you see this hump here.
AUDIENCE: What is pathdegree?
MAYUR THAKUR: Oh.
The pathdegree is the number of paths going through a node.
This is a number of edges.
So the pathdegree is way different because--in particular, what is happening is that there are fewer large nodes, because the mass is distributed towards the smaller nodes.
And we'll see a consequence of that.
In a sense, what you have is smaller cuts in city-nets, even fractionally, because you have a smaller number of large nodes. Because the mass in the tail is small compared to, for example, the total [INAUDIBLE].
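[Pathdegree, as defined here, can be computed directly from the path set; a small sketch under assumed data shapes:]

```python
from collections import Counter

def path_degrees(paths):
    """Pathdegree: the number of paths passing through each node / edge,
    as opposed to the plain graph degree."""
    node_pd, edge_pd = Counter(), Counter()
    for path in paths:
        for n in set(path):                  # count each node once per path
            node_pd[n] += 1
        for e in set(zip(path, path[1:])):   # likewise for each edge
            edge_pd[e] += 1
    return node_pd, edge_pd
```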
So the indegree, outdegree, and total degree distributions are power laws; I only showed you the total degree distribution.
In case I didn't explain this, what this is, basically, is you arrange the nodes in decreasing order of their degree, and you plot that rank, or the log of it, against the degree. And it is well known that for the internet this is a power law. So that's why you see an almost linear relation.
Yes.
AUDIENCE: People have suggested a model for [INAUDIBLE] for the power law, this copy-paste model.
I was wondering if there are any ideas of such models for
the distribution of the pathdegrees that you see.
MAYUR THAKUR: We don't have one [INAUDIBLE]. For power laws, for example, they have the rich-get-richer model. To come up with something like that-- but this is definitely less heavy-tailed than a power law.
Yeah.
Something [INAUDIBLE]
we have--
I don't have a result to show you.
AUDIENCE: [INAUDIBLE]
for the close cousins of power law like the [INAUDIBLE].
MAYUR THAKUR: Lognormal we have not tried, but that's, again, stuff for the future.
However, at this point it is more qualitative than
quantitative for reasons that I'll also explain later on.
But yeah, that's, again, something that you would want to do: fit some of these curves.
So what is clear is that it's qualitatively very different.
And we'll see some consequence of that
shortly, next few slides.
And here is something you would have already guessed-- I know why this is taking so long: lots of points, this is a scatter plot.
So now, let's look at whether, if you take an edge or a node and look at its degree, that is a good indication of the number of paths going through it.
But if you look at the total edge degree, this is the edge
pathdegree.
What it means is, the total edge degree is the sum of the
degrees of--
so consider an edge u, v, so the total edge degree of e is
just the degree of u plus the degree of v. OK?
And the pathdegree of an edge is just the number of paths
going through it.
So what this means is just because an edge has lots of
connections doesn't mean a lot of paths are going through it.
It might be true for some, but it's not necessarily true.
You could have an edge that has only a few neighbors, yet
you have a lot of paths going through it.
So that's what this means.
Because there's just no strong correlation at all.
And you can replace this with total node degree and still
it's qualitatively the same.
AUDIENCE: Do you have a finer picture of that based on whether the edge is B-I or I-I or any of those?
MAYUR THAKUR: I might, but not in this presentation.
I might.
There are lots of--
I want to finish on time, I'm already over time.
Just uncovering different structures.
And I know it's a bit unstructured for that, but
again, another property that we have is--let's do some definitions--when I say last router, I mean the I node, the router, immediately preceding the destination.
OK?
And when we say a block--think of a block of IP addresses--we say it's homogeneous if all paths to all the nodes in the block have the same last router. Which basically means that the block is homogeneous as far as routing is concerned: it's the same route from everywhere.
Finding this is not trivial, but you can randomly sample
and find it, and we did do that.
A cluster router is basically a last router that serves a homogeneous block.
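[A sketch of checking homogeneity and collecting cluster routers, assuming paths are grouped by the destination block they enter:]

```python
def last_router(path):
    return path[-2]   # the node immediately preceding the destination

def cluster_routers(paths_by_block):
    """Return the last routers of homogeneous blocks: blocks where every
    observed path ends at the same last router."""
    out = set()
    for block, paths in paths_by_block.items():
        lasts = {last_router(p) for p in paths}
        if len(lasts) == 1:       # homogeneous block
            out.add(lasts.pop())
    return out
```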
So what we have found is that cities have a
lot of cluster routers.
So 55% of Austin's routers, for example, are cluster routers. And for the other cities, the number is roughly in that range.
So what that means is that the last-hop router corresponds to [INAUDIBLE] a number of routers. The cluster routers are significant.
Also, that pretty much means that the number of homogeneous blocks is significant. Lots and lots of blocks are homogeneous, which, first of all, lends credibility to our block sampling, because we basically need one path for each block, but also uncovers a structure that we like to call the apartment hypothesis.
Maybe what is happening is you have these apartments where lots of people live--again, this is an analogy I'm giving you--and then there's a common router, and basically what the ISP is doing is bringing a path to that. And then from there you have connections to all of them.
As you see, almost 80% of city blocks in pretty much every city are homogeneous.
Again, this was found experimentally.
Just a short note on the robustness analysis.
And this is a consequence of the path distribution being
less heavy-tailed.
What we did was we took the city-nets, and the question we asked was: how many nodes do I need to delete so that roughly 80% of the traffic is deleted?
We did it for two things.
One is by the number of paths, and the other is block-uniform weighting, where each block is assigned a traffic weight of one, and then that weight is distributed equally among all the paths coming into that block.
And the results are pretty similar, that's why I'm
showing only one.
So the topmost one is when I do targeted node deletion. So I know all the information about all the paths, and I'm trying to delete the one that gives me the most, so it's like a [INAUDIBLE] heuristic, essentially. And these are just heuristics, because it's [INAUDIBLE] to find out exactly.
OK?
So this is using targeted nodes, where I'm using the number of paths going through a node as the heuristic. This is the number of paths going through an edge as the heuristic, and this is the degree of a node as the heuristic.
So you see that if you use the degree, since it's not that well correlated with the number of paths, there's a significant gap. It's only 50%, roughly.
So think of an attacker, if you're just looking at the
graph view.
You are probably not going to do as well for the attacker as
if you were looking at the path.
Again, it argues for the path view.
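[A sketch of the targeted-node heuristic just discussed: greedily delete the node with the highest pathdegree until the target fraction of paths is disrupted. Data shapes are assumptions:]

```python
from collections import Counter

def greedy_attack(paths, fraction=0.8):
    """Return the nodes deleted, in order, until `fraction` of the
    paths is disrupted."""
    alive = [set(p) for p in paths]
    target = fraction * len(paths)
    deleted, killed = [], 0
    while killed < target and alive:
        pd = Counter(n for p in alive for n in p)   # pathdegrees
        top, _ = pd.most_common(1)[0]
        deleted.append(top)
        survivors = [p for p in alive if top not in p]
        killed += len(alive) - len(survivors)
        alive = survivors
    return deleted
```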
These are just random deletions, for comparison. We just pick the nodes randomly, and the internet, because of its power law structure, is known to be resilient in that case.
AUDIENCE: [INAUDIBLE]
you mean adversarily chosen?
MAYUR THAKUR: Yes.
AUDIENCE: Factor of 10 to negative 3 there?
MAYUR THAKUR: Yes, oh yes.
This is a good point.
It's my punchline.
AUDIENCE: So it's a half percent [INAUDIBLE]?
MAYUR THAKUR: This is half a percent.
And that deletes 80% of the traffic.
For the internet, if you look at the [INAUDIBLE] graphs and stuff like that, this is an order of magnitude larger.
So you have to take the same curve, except
you might have to--
I forget the exact number but it's at least one order of
magnitude larger.
AUDIENCE: Do you know the geographic distribution of
nodes and how it would be impacted?
So I know there's a couple buildings in downtown San Jose
that if you parked a truck bomb outside them you could
[INAUDIBLE].
MAYUR THAKUR: No. See, that is very hard to do.
That's the first idea we had.
[INAUDIBLE]
plot it.
The problem is, the way geolocation works, the granularity is not that fine. So it'll tell you, OK, it's within 10 miles or so.
AUDIENCE: So you didn't have more accurate information for
any results?
MAYUR THAKUR: No.
And no one has that, to my knowledge.
But that's an interesting thing to do, because if you
can do that, you can do quite a few things.
Yeah, we didn't have that.
All we knew was that this was in the city and this was
outside the city.
AUDIENCE: You used to be able to separate the east coast
from the west coast for civilian voice traffic by only
cutting four places in the US.
Those were the early 90s.
They fixed it.
MAYUR THAKUR: OK, just some more definitions.
Again, looking at vulnerability structure.
We define the waist of a city.
The structure that we have sort of goes like this: there are few paths coming in, or few edges coming in, and then it branches out. Think of it as you have [INAUDIBLE] coming in, and then as you go toward your home, it branches out into small streets. That's the mental model that we have. So this is what we call the waist, and then you have the hip there.
And the waist of the city is just the smallest number of routers that carry 80% of the traffic. The hip of a city is the smallest number of end hosts that account for 80% of the traffic.
And what we have is basically a small
waist, large hip structure.
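[The hip is straightforward to compute, since each path has exactly one destination; a minimal sketch follows. The waist is the analogous quantity over interior routers, but because a path crosses several routers, the exact minimum is a set-cover-style problem, and the greedy pathdegree-deletion sketch shown earlier is one way to approximate it. Data shapes here are assumptions:]

```python
from collections import Counter

def hip(paths, fraction=0.8):
    """Smallest number of end hosts accounting for `fraction` of paths:
    a sorted prefix sum over per-destination path counts."""
    counts = sorted(Counter(p[-1] for p in paths).values(), reverse=True)
    need, total, k = fraction * len(paths), 0, 0
    for c in counts:
        if total >= need:
            break
        total += c
        k += 1
    return k
```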
Again, this is with the uniform block weighting that I mentioned earlier. It's not important--the results are pretty consistent, actually, even without it.
So just to give you an idea of the waist-to-hip ratios: these are not fractions, these are percentages.
This is half a percent, essentially.
Really, really small except for a few.
If you look at it, these are small cities.
So what is happening is, two things are happening. There are fewer hosts here compared to the number of routers. And also, I figure there's some error that we have.
Related to that, we have what we call a DoSimeter index--basically a denial-of-service attack meter. Which basically says, OK: you delete k adversarially chosen nodes, with k fixed at 10 nodes. Where will it impact the most?
So think of it as how many people is it going to impact?
So it's the ratio of the population to the waist, essentially. Because remember, the waist is the smallest number of routers that carry 80% of the traffic.
And we see that there is a correlation between the size
and the population, and the size and the DoSimeter here.
So where larger cities have smaller DoSimeters.
Small is bad, because it means that it is--
wait, hold on.
So these are larger cities, and--
small is good, sorry.
Large is bad.
Large is bad, which means that with a few nodes you can
disrupt a lot of population.
But larger cities are actually worse.
There's a downtrend here.
This is an outlier.
I think it was Washington or so.
However, note that all of this is--
AUDIENCE: [INAUDIBLE].
MAYUR THAKUR: A related one that you might think of is the ratio of the incoming to the outgoing, in a certain sense: the ratio of I-D edges to B-I edges. We are claiming that this is much larger than that.
These are sort of similar ones.
Again, this is representative of the degree of
disaggregation when the paths come into the city.
One interesting thing that we caught is that this is not--I mean, this is from the data. We are plotting the actual data.
So this is very surprising, but what we found is that if you look at the indegree/outdegree total distribution, for these values you can actually prove that the edge skew, which is the ratio of I-D to B-I edges--the ratio of your local streets to the highways--is log N, which is why this is a straight line, because this is a log scale.
OK?
So this you can prove.
It's not that non-trivial.
It's pretty trivial to prove, actually.
OK.
So, I was going to talk a little bit about the
theoretical part of our work, but I'll just end with the one
result that we have.
So we have shown that city-nets are vulnerable, in a certain restricted sense. Note that we don't have traffic data, so we are only looking at what we have. So the obvious question is: can we detect large-scale attacks?
So let me just define one reasonable large-scale attack, which is the following.
We say that an attack is a (k, epsilon) attack on a path network P if the adversary can remove up to k nodes or edges--k network elements--and must cause a disruption to at least an epsilon fraction of the paths. So if it is 10 and 0.2, by removing 10 elements he must disrupt at least 20% of the paths.
So this is the kind of attacks that we have been looking at.
It's inspired from work by Kleinberg, who looked at the
same thing for graphs, not for paths.
So one application might be that you can install monitors
on chosen end points, a small number of monitors, so that
any large-scale disruption shows up.
So what we can prove is that for every path network, and for each k and epsilon, there exist detection sets of size polynomial in k and epsilon, independent of the size of the path network--independent of the number of nodes and independent of the number of edges. And in fact a simple randomized algorithm can find them.
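[A sketch of the flavor of this randomized construction: sample paths and monitor their end hosts. The sample-size formula below uses a generic epsilon-net-style bound with VC-dimension d = k log c and assumed constants; the paper's exact set system and constants are not shown in the talk:]

```python
import math
import random

def detection_set(paths, k, eps, c=3):
    """Sample paths (i.e., place monitors on their end hosts) so that a
    (k, eps)-attack disrupts a monitored path with high probability.
    The constants here are assumptions, not the paper's."""
    d = max(1, round(k * math.log2(c)))                  # VC-dimension k log c
    m = math.ceil((8 * d / eps) * math.log2(16 / eps))   # epsilon-net-style size
    m = min(m, len(paths))
    return random.sample(paths, m)   # size depends only on k, eps, c
```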
It's based on VC-dimension and a very strong result by Haussler and Welzl, which I'm not going to define, but that was the proof. Basically what we get is that the VC-dimension of a set system that we define--and again, I'm not defining what VC-dimension is, for lack of time--is k log c. And Haussler and Welzl give us that the detection set size is polynomial in that and epsilon prime.
So what you have is that it scales up very well with your
path network.
It doesn't depend on the size of the path network at all; it only depends on your k and epsilon.
One crucial thing that we need to know is: what is c? I didn't define what c is. It's a technical condition that we need on our path network, which is the following.
Think of two paths.
They merge, and they deviate, and they merge again.
c bounds the number of times they can do that. So it can't be arbitrary; it can't depend on n, otherwise we don't get our result. So that's the confluence coefficient.
However, because internet paths are destination based, we have found that you never see more than three confluences.
Just to conclude, we've studied city-nets as important
subnetworks of the internet.
The crucial difference between our work and other work is that we have a path view, and we have a methodology to make realistic maps of city-nets. We have taken steps to ensure statistical validity, as much as we can.
The cities show structure that is different from the structure of the internet at large, in particular the small-waist, large-hip structure. And they seem an order of magnitude more vulnerable than the internet at large.
So if you want to attack a smaller portion
you can do it easily.
In fractional terms as well, right?
Do remember that the detection sets are fractions.
So that's where my talk ends.
Sorry for going overtime.
Any questions, I am ready to take.