5 2 5b Clustering And Motifs 2045

Next, we're going to talk about local structure. Local structure though is tied to this small world phenomenon in this interesting way. Two people meet at a party. Within a minute, they're going through, you know, who they know and, sure enough, they find that they know someone in common. Now, what they are likely to conclude is oh, what a small world? This is a small world after all. But what they're doing is they are closing triads, that is they know, say, A and C know B in common and now, A and C know each other as well. So, This triadic closure is sometimes also termed transitivity or clustering, and what we want to know is how much of that is going on in the network. Because social networks, for example, are highly clustered because a friend of a friend is likely to also be a friend. But in general, the networks do not need to necessarily have this property and we still want to know, to what extent is clustering present. So, one way to measure it is with the global clustering coefficient which is going to hit the number of closed triangles against all connected triplets. So, if you have a close triangle, it actually generates three separate connected triplets. And hence, the factor of three in the enumerator as well. If the entire graph is a one connected clique with all edges present, the clustering coefficient is one and if there are no triangles in the entire graph, then this global clustering coefficient is zero and we'll see in a little bit how to evaluate values in between. However, when clustering was first calculated by Watts and Strogatz in their seminal paper which we'll be talking about in the model section, they looked at a local clustering coefficient. That is a calculated clustering coefficient where for each vertex and then they averaged overall vertices in the graph. When you take the local clustering coefficient, it's as if an individual looks at all of their friends, and says, out of all the possible friendships that could exist between my friends, how many are actually there? And this will, the denominator here is going to differ by whether this is a directed network, in which case if you have n friends, the number of potential connections between them is n times n minus one or if it's undirected, say, triple link friendship is treated as undirected, then the number of possible pairings of your friends who could be who could be friends with each other is n times n minus one divided by two. And as I mentioned, if you want to get the clustering coefficient for the entire graph, you just average over all the nodes. So, let's see how we would calculate for one vertex i. Here it is, and it has four neighbors or four friends. Three of the friends all know each other, but the fourth friend, they're keeping somehow hidden away. They haven't introduced him or her to the rest of their friends, and so all of those edges are absent, that, that fourth friend is not connected to any of the other three friends. So, to calculate the clustering coefficient. We looked at the number of neighbors, and we know that the max number of connection is n times n minus one divided by two, in this case, six. But only three of the connections are present. And so, the cluster coefficient is just three over six or a half. So, I'd like you to try this with this vertex i which has three friends. Can you calculate the clustering coefficient? Okay. Hopefully, you calculated the cluster coefficient as follows, you have three neighbors which also means three potential edges. Cuz three times two divided by two is three and two of these edges are actually there giving you a clustering coefficient of two-thirds. Let's look at this from the perspective of an individual tie. What does it mean for that tie to be part of one or more closed triads? Typically, when a tie is part of sets triads, it tends to also be a relationship between individuals who would have some sort of affinity and are likely in frequent contact. What you see actually is that if two individuals are closed, so A and B are closed and b, B and C are closed, You don't often see that A and C have no tie at all, right? So, there's something about closed ties and closing triads. And you can develop or you can just use this measure of edge embeddedness which is going to look at the number of nodes that A and C share in common divided by the total number of neighbors with either A or B. And the higher that quantity, the more embedded the tie is. The more, the more it's part of lots of closed triads. Now, you might say, well, you know, I looked at my Facebook graph and some of those ties seems pretty embedded but are necessarily closed. And I think again, it's really important to think about what your network is or such ties to be formed on Facebook, they may just have been, you know, everyone at the same high school adding each other on Facebook, right? So, it doesn't, it, it signifies a short context but not necessarily a closeness. On the other hand, if you were to filter the ties, to only be individuals who regularly interact on Facebook or regularly interact in real life and he still saw a lot of triads, you would you could conclude that, you know, this is really an embedded tied that's part of, of closely knit clique. Just another example of this. So, there was an older study that asks school children to name their first through eighth choice of friends at the school. We know how this turns out right form the dining table. Data dining table, dining partners, dining table partners data set but there is this interesting analysis that was then done that said, okay, let's go back to that thought experiment, right before we had an individual, and their 500 friends, and their you know, and 500 of each of those friends' friends, etc. Here, they just looked at expansion by two. And in one case although this, this graphic isn't going to match each of side of the Arall and van Alstyne paper but hopefully, illustrates the point. So, when they use the, the seventh and eighth choice, so these weren't their closest friends, the number of people end hubs out was far greater than when they use the first and second choice and the first and second choice looked something like this. It, my best friend's friend is also likely to be my friend. But if I look at my seventh or eighth friend, and their seventh or eighth choice, I'm likely to move pretty far away. So, there's this notion that, it's not just, You know, the, the distance, but that the strength of the tie and the embeddedness of the time matters as well. And the question is, you know, is it good to be embedded? Well, perhaps, when you're in the middle school or high school, being part of the clique is a good thing. It makes you feel secure, you have a gang that you can hang with, you know, it's good. But as we saw in homework three, with the Arall and van Alstyne paper, being a broker has some advantages as well. For example, the amount of novel information that you receive, or looking at Ron Byrd's papers, you're more likely to be generating good ideas than to be recognized for them, if you're a broker. So, it really depends, right? With, with all network analysis, it depends. Finally, there's a study that says, you know, strong ties, weak ties let's not treat them as binary, we just got this awesome data set. This is JP Onnela and his collaborators well, just got this data set that has, you know, a huge call graph who called whom when so you can get a frequency of interaction on this graph and they looked at how things might diffuse, depending on the strength of the time. So, first of all, they found a skew in the degree distribution, but they also found a skew in the strength of the tie. That is, how often two individuals communicate. Many individuals, you know, many pairs communicate infrequently, and very few communicate extremely frequently. They also looked at the embeddedness of the tie. And I'll just to a seam run on this graph. They found that, in general, there's a positive relationship between how embedded the tie is and how frequently the two individuals communicate. So, it's, it's just about a validation, right? Embedded ties tend to be closer ties at least when you're talking about phone communication. Next, they wanted to know, well. Anyway, so which, which is it? Is it strong ties that are crucial for information diffusion? Because there's a peakly talk with often, or is it weak ties, because they can go further in the network. It turned out it was neither, it was the intermediate ties, the people you talk to every so often, not so rarely but who weren't part of your very local clique who tended to be most instrumental for information diffusion. And they did another thing which was to randomize the weights of the edges. So, in the real world network, what you have is that this strong ties that tend to be embedded, but then, they just change this, not change but they reassigned the strengths of the tie at random to any edge in the network and what they found, So potentially now weak ties became strong ties, etc., without changing the overall configuration of who talks to whom. And that caused information to move much more rapidly. So they did confirm that this kind of localized clustered structure is impeding network flow in a sense, you know, not, no, no judgment about whether that's good or bad. You can do even better if you look at the directionality of the links within a closed triad or open triad, or even slightly larger structures, because that will capture more the dynamics of the interactions that are actually you know, kind of driving the network, or driving the edges within the network. So, you can take the network such as this one. And deconstruct it into different triads and then, count, for example, how many times did you see a feed forward loop. Well, it looks like there's one here and it looks like there is one here again. So, what is that mean? What type of a network is this, how did it compare to other networks or to random networks? So, in addition to, you know, where, where do we have the, here's a feed forward loop, you have a number of other three node motifs. You kind of enumerate all possible ones, and so the feed forward loop occurs often in neural networks, and it seems to be used to neutralize biological noise because x activates y, activates z, but x also directly activates z. You can you know, you can encounter other things. So, a single input module might be found in gene control networks, where x might be activating a number of other genes. You can up it to four nodes, why stop at three? Well, I'm going to tell you next what makes you want to do that but here is a four node motif that has parallel tops and this you might find in neural network and in food webs. And so, the lion who eats the zebra and the antelope, that's an antelope, who are in turn, each munching some greenery. And there are many more four node motifs than there are three node ones, and it's more computationally expensive to get them, but it may be worth your while, if you have not a huge network to deal with. The reason why it's computationally expensive is that not only do you want to get the, the counts of how many times did the feed forward loop occur, but, what you would really like to know is, is that unusual? I mean, so, so it occurred five times. Okay, is that high or is that low? And the way that you can figure that out is you generate the equivalent random graphs. Equivalent could mean that you just take the number of nodes and you generate a bunch of [UNKNOWN] random graphs or you may think that some of the motifs are result of the skew degree distribution. In which case, you want to keep the degree distribution the same, or you can do is break the edges, so each node has the same number of edge stubs, and then rewire them at random, and see, even controlling for the degree distribution, is the number of motifs that you counted, unusual or not. And the way that you're specifically going to do this, is you're going to calculate the Z score. So the, in your [UNKNOWN] random graph or equivalent random graph, you're going to get some counts and they are going to be normally distributed for the number of times that a certain motif occurs and then, you're going to see where in this distribution, your actual count lies. If it's up here, it means that it, it's overrepresented in your real network and especially once you get Z scores above two, this is what becomes the statistically significant at the, you know, alpha 0.05 level, meaning that there's only five percent chance but it occurred by tens that you saw such a motif. Similarly if you are all the way out here with a very negative Z scores, it means that somehow, this motif does not occur in your network as much as it would if everything was just wired randomly. You can download software, free software, that will run this analysis for you, and it has this nice output, or you can actually also use igraph. It has, I think there it's called Triadics Census but it's the, it's very much the same functionality. What this research group, [UNKNOWN] research group, did next was to show that a lot of different networks fall into these super families based on a motif profile and they, they designed this kind of measure that looked at the Z score for each of the individual thirteen three node motifs and then it just looked at, you know, whether the Z score follows one pattern or whether it follows another and, for example, language networks that have edges between words that occur adjacently in text. So, to be or not to be that's a little four cycle in there. Across different languages, the motif profile is very similar. So, your task is to find, you know, based on their triad census profiles, which two kinds of networks exhibit similar structure? Hopefully what you found, they actually are plotted together is that, the world wide web and social networks tend to have similar frequencies of different motifs, not to say that all web social but perhaps some of the dynamic driving social networks, for example, triadic closure, friends introducing one another may also be at work in the web where if one page who likes another and might start copying some of its links again closing triads. So, to understand this further, can you tell which of the following triads is underrepresented in social networks? Hopefully, you've seen that it's this one, triad number six, and this, if you remember from the beginning of the presentation, is the forbidden triad. That is A and B have a mutual tie, and B and C have a mutual tie, but somehow, A and C don't have a tie to each other. And you might imagine that poor B is, really stressed out, because B gets a long with A, and B gets along with C, but somehow A and C do not get along with each other in turn. So, just a recap on motifs and clustering. So, given a particular structure, you can search in network. For example, you can count the complete triad so you can summarize that as a cluster coefficient or you can, you know, create this motif profile. The advantage is that these motifs can represent certain functions or certain trends and, and just, you can validate by seeing whether those are present in a network, or you can discover them. You know, hey, this motif is very frequent in this network. Let me figure out, you know, what is causing it. And I can't understand some of the underlying dynamics, which is a real boost. So, for example, if you have, What you think is a good model of how your network is being generated, and then you run the motif profile on your model simulated data, and you run it on your actual network, and they don't agree, it means that something's missing in your model. And the motif profile is very rich, and can, discerning these differences. One disadvantage of using motifs is that again, it's, it's just like with the cake cliques and cliques and some cake course, you can look for these motifs, but you may not be able to tell whether they're actually part of some sort of larger interesting structure. So, because you're looking with a microscope, you may actually miss the big