Tip:
Highlight text to annotate it
X
Now that I've hopefully convinced you that finding communities in a network is a
useful and noble endeavor, let's see if we can figure out what a community really is.
So, what makes a community? Perhaps, it's mutuality of ties, so that
everyone within a community has edges to each, to one another.
Maybe we don't want to be so stringent. Not everyone has to have an edge,
But everyone should know at least K other individuals within a community.
That's what makes a community. Maybe we're more interested in information
flow. So, for what we may want is that everyone
is reachable from anyone else in a short number of hops, and we determine what that
number is. And finally, maybe we don't want to set
that it's, you know, K other individuals that you know,
But just that you know a certain proportion of individuals within the
community. Okay. So, before we get started, I think
the very important point to note is that, part of the reason why we see this
community structure is because it arises out of a bipartite structure.
A bipartite network is one where you have two different kinds of nodes, say, these
are all individuals and these are all events.
And so, then, you link all individuals who went to the same event together.
So, for example, here. The first four individuals went to this
first event and that puts them in the same clique.
For example, if the edge signifies, did the two individuals attend the same event?
And this could then be embedded in you know, communication frequency, etc.
So, yes. They probably all communicated because they went to the same event.
So, when we go about this task of discovering communities.
What we may actually be finding are the different contexts.
Whether it's events or affiliations in clubs.
For example, for boards of directors, do the directors sit on the same board?
Well then, they're linked in the, in the director network.
And so, it's something to keep in mind, right?
That this one mode representation of who talks to whom, in this case, for social
networks. Is it really an embedding of sorts of a
more multidimensional network that exists out there.
And so, we're recovering some of it by doing community finding.
Let's look at the first definition, that everyone should know everyone else within
the same community, This is the definition of a Clique.
It's a fully completely connected sub graph.
And when we look at the cliques, they can overlap.
So here, we have two cliques, Each with three members that happened to
overlap within the cleek. Everyone knows everyone else,
It's just a closed triad. Here we have a clique of size four.
Again, each of these four individuals knows the other three.
Lets go back to that community structure we saw in the opinion formation example
and see whether we can figure out how common cliques are within networks or
community structure as opposed to the equivalent Erdos-Renyi random graph.
So, we can find it. Okay, great.
So, here we are. And all you're going to be doing,
I guess not that. Okay, we're going to set up here and we're
going to, you can layout the network. And then, we are going to highlight the
maximal clique. So, here is the Erdos-Renyi version.
So we can set up some more, Highlight the maximal clique,
Or we can see the familiar community structure,
And again look for a maximal clique. And you can click this multiple times if
they're the same multiple cliques of the same size.
Okay. So, that's your quiz question and moving
on. So, how meaningful are these cliques?
For one, they're not really robust. So, let's say, two people are not friends,
or didn't talk at the party, or something like that,
That invalidates that whole clique. It's no longer a clique and maybe you then
miss the fact that it's, that it's really a community.
The second part is that, maybe it's the clique itself is not that interesting
because everyone's connected to everyone else.
You don't have a densely connected core and then more peripheral individuals,
All you have is that density connected core.
And there's, there's no point in running any centrality measures on this because
the, everyone has the same degreea any other centrality measure you would like to
apply. And, finally, how cliques overlap may be
more interesting than simply the fact that they exist. And in the last segment of
this week's lectures, we'll talk about a clique percolation algorithm that takes
advantage of exactly this fact. So, let's be a little bit less stringent.
K cores, say that, unlike cliques, you don't need to know everyone in your K
core, You just need to know K of them.
So, looking at this network then, can you tell me what is the K for the core circled
in red? Or depending on your quiz question,
[LAUGH] what is the K for the core circled in blue?
Indeed, you should've found that this is a three core.
So, every node on this side is connected to at least three other individuals within
that same three core. And the one on the left is a four core
because every one's connected to at least four others.
What makes K cores also not that robust is this example down here, where you have a
node that really should rightly belong with all the other nodes in this four
core, But it has only two edges. And so, it
can't be part of a four core. It would need at least four edges to do
that. And so, the next core that it can join is
a two core that then envelopes both of these communities.
And so we would, Incorrectly, you know, really leave this
one out if we were looking for kind of meaningful communities.
Of course, if you want to stick to the definition for core robust, then yes, that
node should be left out. But is that really what you want to get
out of your community finding exercise? As I mentioned before, sometimes you want
to just look for potential for information to flow.
In this case, we may just be interested in a set of nodes such that, for example, any
node is reachable from any other node within two, two hops.
And so, in this case, these are two, two cliques because everyone can reach
everyone else within just two hops. And, there's some problems with this.
The first is that, even though two nodes, For example, these two, may be reachable
in two hops, That other node may be outside of this two
clique, right? And so, the actual diameter of this two
clique is greater. Actually, it's this one.
It's the diameter three, Right?
The longest path between any two nodes is called a diameter.
And here, it's actually larger than the n of the clique.
And, it's also not so great that we have left this node out.
So, these n-cliques can also be somewhat funny animals.
The final definition will examine our p-cliques. This is where, instead of
saying you have to know K of the other individuals, you have to at least the
proportion P of the individuals with, within the cluster.
So, all of these methods that we talked about really focus on networks that don't
have any sort of distinction between the edges.
So, all edges are weighted equally. This is kind of the minimum information
scenario and you're actually better off if you know how frequently individuals
communicate. Because very likely, those within the same community are talking more
often. And so, if you're lucky, you can actually
just use this information directly. For example,
You could eliminate highs that are not reciprocal.
So, going back to the dining table partners' data set, you could just say,
well, if two girls did not reciprocally name each other as someone they wanted to
sit with at dinner, well, then let's toss out that edge.
And that will reduce the network, and you'll be able to see more, more coherent
communities. Another thing you can do is keep the edges
above a certain threshold and throw out the rest.
So, That's what I'm going to attempt to do now
in [UNKNOWN]. these are political blogs from a long, long time ago I was studying
these. And, let me just try and fit all of this on my screen.
Great. So, I sort of artificially removed the
weights and any sort of partition just so you can see that,
You know, and, of course, if this wasn't laid out, you may have trouble seeing that
there are two communities in here. In any case, this looks very densely
connected and it's not that clear that you know, these kind of clique algorithms,
they would probably lump everything together because it is just such a dense
network. All of these political blogs were
mentioning each other, left and right. But, let's see if we can tell left and
right apart. So what I have done is, sort of
artificially removed the weight. So, what I'ld like to do is move the
mentions into weight. So, I want to take mentions, and copy data
from mentions into weight. Okay.
Hopefully, that worked. [laugh] So, if we go back to overview.
Great. So now, we can see that, in fact, the,
what, what we perceived to be the two communities.
And actually, let me color them now so you have the conservatives and the liberals.
And you, in fact, do see the, the conservatives tend to talk to each other
more often and the liberals tend to talk to each other more often.
And now, we are ready to apply filter so I'm going to do that, and we can increase
the filter. So, for example, maybe they have to have
mentioned each other at least four times. And this already thins out the network a
lot, but still, you have most of the liberals connected, most of the
conservatives connected. If we increase this a little bit more, a
little more, a little more. [laugh] Come on.
Okay. For the most part, with the exception of,
let's see who these people are, Talking Points Memo and Wiz *** Blog who mention
each other a lot. For the most part, you see two separate
communities and I didn't apply any sort of community detection algorithm,
No cliques, no K cores, no nothing. All I did was threshold the weights.
But in the next segment, we'll be talking about what you can do when you don't know
the number of communities there are, whether the, they will be cliquish or not.
How can you automatically, in a, in a systematic principled way discover
community structure, especially in large networks.
So that's what we'll do next.