4 2 4b Heuristics for Finding Communities 1351

Now that I've hopefully convinced you that finding communities in a network is a useful and noble endeavor, let's see if we can figure out what a community really is. So, what makes a community? Perhaps, it's mutuality of ties, so that everyone within a community has edges to each, to one another. Maybe we don't want to be so stringent. Not everyone has to have an edge, But everyone should know at least K other individuals within a community. That's what makes a community. Maybe we're more interested in information flow. So, for what we may want is that everyone is reachable from anyone else in a short number of hops, and we determine what that number is. And finally, maybe we don't want to set that it's, you know, K other individuals that you know, But just that you know a certain proportion of individuals within the community. Okay. So, before we get started, I think the very important point to note is that, part of the reason why we see this community structure is because it arises out of a bipartite structure. A bipartite network is one where you have two different kinds of nodes, say, these are all individuals and these are all events. And so, then, you link all individuals who went to the same event together. So, for example, here. The first four individuals went to this first event and that puts them in the same clique. For example, if the edge signifies, did the two individuals attend the same event? And this could then be embedded in you know, communication frequency, etc. So, yes. They probably all communicated because they went to the same event. So, when we go about this task of discovering communities. What we may actually be finding are the different contexts. Whether it's events or affiliations in clubs. For example, for boards of directors, do the directors sit on the same board? Well then, they're linked in the, in the director network. And so, it's something to keep in mind, right? That this one mode representation of who talks to whom, in this case, for social networks. Is it really an embedding of sorts of a more multidimensional network that exists out there. And so, we're recovering some of it by doing community finding. Let's look at the first definition, that everyone should know everyone else within the same community, This is the definition of a Clique. It's a fully completely connected sub graph. And when we look at the cliques, they can overlap. So here, we have two cliques, Each with three members that happened to overlap within the cleek. Everyone knows everyone else, It's just a closed triad. Here we have a clique of size four. Again, each of these four individuals knows the other three. Lets go back to that community structure we saw in the opinion formation example and see whether we can figure out how common cliques are within networks or community structure as opposed to the equivalent Erdos-Renyi random graph. So, we can find it. Okay, great. So, here we are. And all you're going to be doing, I guess not that. Okay, we're going to set up here and we're going to, you can layout the network. And then, we are going to highlight the maximal clique. So, here is the Erdos-Renyi version. So we can set up some more, Highlight the maximal clique, Or we can see the familiar community structure, And again look for a maximal clique. And you can click this multiple times if they're the same multiple cliques of the same size. Okay. So, that's your quiz question and moving on. So, how meaningful are these cliques? For one, they're not really robust. So, let's say, two people are not friends, or didn't talk at the party, or something like that, That invalidates that whole clique. It's no longer a clique and maybe you then miss the fact that it's, that it's really a community. The second part is that, maybe it's the clique itself is not that interesting because everyone's connected to everyone else. You don't have a densely connected core and then more peripheral individuals, All you have is that density connected core. And there's, there's no point in running any centrality measures on this because the, everyone has the same degreea any other centrality measure you would like to apply. And, finally, how cliques overlap may be more interesting than simply the fact that they exist. And in the last segment of this week's lectures, we'll talk about a clique percolation algorithm that takes advantage of exactly this fact. So, let's be a little bit less stringent. K cores, say that, unlike cliques, you don't need to know everyone in your K core, You just need to know K of them. So, looking at this network then, can you tell me what is the K for the core circled in red? Or depending on your quiz question, [LAUGH] what is the K for the core circled in blue? Indeed, you should've found that this is a three core. So, every node on this side is connected to at least three other individuals within that same three core. And the one on the left is a four core because every one's connected to at least four others. What makes K cores also not that robust is this example down here, where you have a node that really should rightly belong with all the other nodes in this four core, But it has only two edges. And so, it can't be part of a four core. It would need at least four edges to do that. And so, the next core that it can join is a two core that then envelopes both of these communities. And so we would, Incorrectly, you know, really leave this one out if we were looking for kind of meaningful communities. Of course, if you want to stick to the definition for core robust, then yes, that node should be left out. But is that really what you want to get out of your community finding exercise? As I mentioned before, sometimes you want to just look for potential for information to flow. In this case, we may just be interested in a set of nodes such that, for example, any node is reachable from any other node within two, two hops. And so, in this case, these are two, two cliques because everyone can reach everyone else within just two hops. And, there's some problems with this. The first is that, even though two nodes, For example, these two, may be reachable in two hops, That other node may be outside of this two clique, right? And so, the actual diameter of this two clique is greater. Actually, it's this one. It's the diameter three, Right? The longest path between any two nodes is called a diameter. And here, it's actually larger than the n of the clique. And, it's also not so great that we have left this node out. So, these n-cliques can also be somewhat funny animals. The final definition will examine our p-cliques. This is where, instead of saying you have to know K of the other individuals, you have to at least the proportion P of the individuals with, within the cluster. So, all of these methods that we talked about really focus on networks that don't have any sort of distinction between the edges. So, all edges are weighted equally. This is kind of the minimum information scenario and you're actually better off if you know how frequently individuals communicate. Because very likely, those within the same community are talking more often. And so, if you're lucky, you can actually just use this information directly. For example, You could eliminate highs that are not reciprocal. So, going back to the dining table partners' data set, you could just say, well, if two girls did not reciprocally name each other as someone they wanted to sit with at dinner, well, then let's toss out that edge. And that will reduce the network, and you'll be able to see more, more coherent communities. Another thing you can do is keep the edges above a certain threshold and throw out the rest. So, That's what I'm going to attempt to do now in [UNKNOWN]. these are political blogs from a long, long time ago I was studying these. And, let me just try and fit all of this on my screen. Great. So, I sort of artificially removed the weights and any sort of partition just so you can see that, You know, and, of course, if this wasn't laid out, you may have trouble seeing that there are two communities in here. In any case, this looks very densely connected and it's not that clear that you know, these kind of clique algorithms, they would probably lump everything together because it is just such a dense network. All of these political blogs were mentioning each other, left and right. But, let's see if we can tell left and right apart. So what I have done is, sort of artificially removed the weight. So, what I'ld like to do is move the mentions into weight. So, I want to take mentions, and copy data from mentions into weight. Okay. Hopefully, that worked. [laugh] So, if we go back to overview. Great. So now, we can see that, in fact, the, what, what we perceived to be the two communities. And actually, let me color them now so you have the conservatives and the liberals. And you, in fact, do see the, the conservatives tend to talk to each other more often and the liberals tend to talk to each other more often. And now, we are ready to apply filter so I'm going to do that, and we can increase the filter. So, for example, maybe they have to have mentioned each other at least four times. And this already thins out the network a lot, but still, you have most of the liberals connected, most of the conservatives connected. If we increase this a little bit more, a little more, a little more. [laugh] Come on. Okay. For the most part, with the exception of, let's see who these people are, Talking Points Memo and Wiz *** Blog who mention each other a lot. For the most part, you see two separate communities and I didn't apply any sort of community detection algorithm, No cliques, no K cores, no nothing. All I did was threshold the weights. But in the next segment, we'll be talking about what you can do when you don't know the number of communities there are, whether the, they will be cliquish or not. How can you automatically, in a, in a systematic principled way discover community structure, especially in large networks. So that's what we'll do next.