Tip:
Highlight text to annotate it
X
Next, we're going to talk about local structure.
Local structure though is tied to this small world phenomenon in this interesting
way. Two people meet at a party.
Within a minute, they're going through, you know, who they know and, sure enough,
they find that they know someone in common.
Now, what they are likely to conclude is oh, what a small world?
This is a small world after all. But what they're doing is they are closing
triads, that is they know, say, A and C know B in common and now, A and C know
each other as well. So,
This triadic closure is sometimes also termed transitivity or clustering, and
what we want to know is how much of that is going on in the network.
Because social networks, for example, are highly clustered because a friend of a
friend is likely to also be a friend. But in general, the networks do not need
to necessarily have this property and we still want to know, to what extent is
clustering present. So, one way to measure it is with the
global clustering coefficient which is going to hit the number of closed
triangles against all connected triplets. So, if you have a close triangle, it
actually generates three separate connected triplets.
And hence, the factor of three in the enumerator as well.
If the entire graph is a one connected clique with all edges present, the
clustering coefficient is one and if there are no triangles in the entire graph, then
this global clustering coefficient is zero and we'll see in a little bit how to
evaluate values in between. However, when clustering was first
calculated by Watts and Strogatz in their seminal paper which we'll be talking about
in the model section, they looked at a local clustering coefficient.
That is a calculated clustering coefficient where for each vertex and then
they averaged overall vertices in the graph.
When you take the local clustering coefficient, it's as if an individual
looks at all of their friends, and says, out of all the possible friendships that
could exist between my friends, how many are actually there?
And this will, the denominator here is going to differ by whether this is a
directed network, in which case if you have n friends, the number of potential
connections between them is n times n minus one or if it's undirected, say,
triple link friendship is treated as undirected, then the number of possible
pairings of your friends who could be who could be friends with each other is n
times n minus one divided by two. And as I mentioned, if you want to get the
clustering coefficient for the entire graph, you just average over all the
nodes. So, let's see how we would calculate for
one vertex i. Here it is, and it has four neighbors or
four friends. Three of the friends all know each other,
but the fourth friend, they're keeping somehow hidden away.
They haven't introduced him or her to the rest of their friends, and so all of those
edges are absent, that, that fourth friend is not connected to any of the other three
friends. So, to calculate the clustering
coefficient. We looked at the number of neighbors, and
we know that the max number of connection is n times n minus one divided by two, in
this case, six. But only three of the connections are
present. And so, the cluster coefficient is just
three over six or a half. So, I'd like you to try this with this
vertex i which has three friends. Can you calculate the clustering
coefficient? Okay.
Hopefully, you calculated the cluster coefficient as follows, you have three
neighbors which also means three potential edges.
Cuz three times two divided by two is three and two of these edges are actually
there giving you a clustering coefficient of two-thirds.
Let's look at this from the perspective of an individual tie.
What does it mean for that tie to be part of one or more closed triads?
Typically, when a tie is part of sets triads, it tends to also be a relationship
between individuals who would have some sort of affinity and are likely in
frequent contact. What you see actually is that if two
individuals are closed, so A and B are closed and b, B and C are closed,
You don't often see that A and C have no tie at all, right?
So, there's something about closed ties and closing triads.
And you can develop or you can just use this measure of edge embeddedness which is
going to look at the number of nodes that A and C share in common divided by the
total number of neighbors with either A or B.
And the higher that quantity, the more embedded the tie is.
The more, the more it's part of lots of closed triads.
Now, you might say, well, you know, I looked at my Facebook graph and some of
those ties seems pretty embedded but are necessarily closed.
And I think again, it's really important to think about what your network is or
such ties to be formed on Facebook, they may just have been, you know, everyone at
the same high school adding each other on Facebook, right?
So, it doesn't, it, it signifies a short context but not necessarily a closeness.
On the other hand, if you were to filter the ties, to only be individuals who
regularly interact on Facebook or regularly interact in real life and he
still saw a lot of triads, you would you could conclude that, you know, this is
really an embedded tied that's part of, of closely knit clique.
Just another example of this. So, there was an older study that asks
school children to name their first through eighth choice of friends at the
school. We know how this turns out right form the
dining table. Data dining table, dining partners, dining
table partners data set but there is this interesting analysis that was then done
that said, okay, let's go back to that thought experiment, right before we had an
individual, and their 500 friends, and their you know, and 500 of each of those
friends' friends, etc. Here, they just looked at expansion by
two. And in one case although this, this
graphic isn't going to match each of side of the Arall and van Alstyne paper but
hopefully, illustrates the point. So, when they use the, the seventh and
eighth choice, so these weren't their closest friends, the number of people end
hubs out was far greater than when they use the first and second choice and the
first and second choice looked something like this.
It, my best friend's friend is also likely to be my friend.
But if I look at my seventh or eighth friend, and their seventh or eighth
choice, I'm likely to move pretty far away.
So, there's this notion that, it's not just,
You know, the, the distance, but that the strength of the tie and the embeddedness
of the time matters as well. And the question is, you know, is it good
to be embedded? Well, perhaps, when you're in the middle
school or high school, being part of the clique is a good thing.
It makes you feel secure, you have a gang that you can hang with, you know, it's
good. But as we saw in homework three, with the
Arall and van Alstyne paper, being a broker has some advantages as well.
For example, the amount of novel information that you receive, or looking
at Ron Byrd's papers, you're more likely to be generating good ideas than to be
recognized for them, if you're a broker. So, it really depends, right?
With, with all network analysis, it depends.
Finally, there's a study that says, you know, strong ties, weak ties let's not
treat them as binary, we just got this awesome data set.
This is JP Onnela and his collaborators well, just got this data set that has, you
know, a huge call graph who called whom when so you can get a frequency of
interaction on this graph and they looked at how things might diffuse, depending on
the strength of the time. So, first of all, they found a skew in the
degree distribution, but they also found a skew in the strength of the tie.
That is, how often two individuals communicate.
Many individuals, you know, many pairs communicate infrequently, and very few
communicate extremely frequently. They also looked at the embeddedness of
the tie. And I'll just to a seam run on this graph.
They found that, in general, there's a positive relationship between how embedded
the tie is and how frequently the two individuals communicate.
So, it's, it's just about a validation, right?
Embedded ties tend to be closer ties at least when you're talking about phone
communication. Next, they wanted to know, well.
Anyway, so which, which is it? Is it strong ties that are crucial for
information diffusion? Because there's a peakly talk with often,
or is it weak ties, because they can go further in the network.
It turned out it was neither, it was the intermediate ties, the people you talk to
every so often, not so rarely but who weren't part of your very local clique who
tended to be most instrumental for information diffusion.
And they did another thing which was to randomize the weights of the edges.
So, in the real world network, what you have is that this strong ties that tend to
be embedded, but then, they just change this, not change but they reassigned the
strengths of the tie at random to any edge in the network and what they found,
So potentially now weak ties became strong ties, etc., without changing the overall
configuration of who talks to whom. And that caused information to move much
more rapidly. So they did confirm that this kind of
localized clustered structure is impeding network flow in a sense, you know, not,
no, no judgment about whether that's good or bad.
You can do even better if you look at the directionality of the links within a
closed triad or open triad, or even slightly larger structures, because that
will capture more the dynamics of the interactions that are actually you know,
kind of driving the network, or driving the edges within the network.
So, you can take the network such as this one.
And deconstruct it into different triads and then, count, for example, how many
times did you see a feed forward loop. Well, it looks like there's one here and
it looks like there is one here again. So, what is that mean?
What type of a network is this, how did it compare to other networks or to random
networks? So, in addition to, you know, where, where
do we have the, here's a feed forward loop, you have a number of other three
node motifs. You kind of enumerate all possible ones,
and so the feed forward loop occurs often in neural networks, and it seems to be
used to neutralize biological noise because x activates y, activates z, but x
also directly activates z. You can you know, you can encounter other
things. So, a single input module might be found
in gene control networks, where x might be activating a number of other genes.
You can up it to four nodes, why stop at three?
Well, I'm going to tell you next what makes you want to do that but here is a
four node motif that has parallel tops and this you might find in neural network and
in food webs. And so, the lion who eats the zebra and
the antelope, that's an antelope, who are in turn, each munching some greenery.
And there are many more four node motifs than there are three node ones, and it's
more computationally expensive to get them, but it may be worth your while, if
you have not a huge network to deal with. The reason why it's computationally
expensive is that not only do you want to get the, the counts of how many times did
the feed forward loop occur, but, what you would really like to know is, is that
unusual? I mean, so, so it occurred five times.
Okay, is that high or is that low? And the way that you can figure that out
is you generate the equivalent random graphs.
Equivalent could mean that you just take the number of nodes and you generate a
bunch of [UNKNOWN] random graphs or you may think that some of the motifs are
result of the skew degree distribution. In which case, you want to keep the degree
distribution the same, or you can do is break the edges, so each node has the same
number of edge stubs, and then rewire them at random, and see, even controlling for
the degree distribution, is the number of motifs that you counted, unusual or not.
And the way that you're specifically going to do this, is you're going to calculate
the Z score. So the, in your [UNKNOWN] random graph or
equivalent random graph, you're going to get some counts and they are going to be
normally distributed for the number of times that a certain motif occurs and
then, you're going to see where in this distribution, your actual count lies.
If it's up here, it means that it, it's overrepresented in your real network and
especially once you get Z scores above two, this is what becomes the
statistically significant at the, you know, alpha 0.05 level, meaning that
there's only five percent chance but it occurred by tens that you saw such a
motif. Similarly if you are all the way out here
with a very negative Z scores, it means that somehow, this motif does not occur in
your network as much as it would if everything was just wired randomly.
You can download software, free software, that will run this analysis for you, and
it has this nice output, or you can actually also use igraph.
It has, I think there it's called Triadics Census but it's the, it's very much the
same functionality. What this research group, [UNKNOWN]
research group, did next was to show that a lot of different networks fall into
these super families based on a motif profile and they, they designed this kind
of measure that looked at the Z score for each of the individual thirteen three node
motifs and then it just looked at, you know, whether the Z score follows one
pattern or whether it follows another and, for example, language networks that have
edges between words that occur adjacently in text.
So, to be or not to be that's a little four cycle in there.
Across different languages, the motif profile is very similar.
So, your task is to find, you know, based on their triad census profiles, which two
kinds of networks exhibit similar structure?
Hopefully what you found, they actually are plotted together is that, the world
wide web and social networks tend to have similar frequencies of different motifs,
not to say that all web social but perhaps some of the dynamic driving social
networks, for example, triadic closure, friends introducing one another may also
be at work in the web where if one page who likes another and might start copying
some of its links again closing triads. So, to understand this further, can you
tell which of the following triads is underrepresented in social networks?
Hopefully, you've seen that it's this one, triad number six, and this, if you
remember from the beginning of the presentation, is the forbidden triad.
That is A and B have a mutual tie, and B and C have a mutual tie, but somehow, A
and C don't have a tie to each other. And you might imagine that poor B is,
really stressed out, because B gets a long with A, and B gets along with C, but
somehow A and C do not get along with each other in turn.
So, just a recap on motifs and clustering. So, given a particular structure, you can
search in network. For example, you can count the complete
triad so you can summarize that as a cluster coefficient or you can, you know,
create this motif profile. The advantage is that these motifs can
represent certain functions or certain trends and, and just, you can validate by
seeing whether those are present in a network, or you can discover them.
You know, hey, this motif is very frequent in this network.
Let me figure out, you know, what is causing it.
And I can't understand some of the underlying dynamics, which is a real
boost. So, for example, if you have,
What you think is a good model of how your network is being generated, and then you
run the motif profile on your model simulated data, and you run it on your
actual network, and they don't agree, it means that something's missing in your
model. And the motif profile is very rich, and
can, discerning these differences. One disadvantage of using motifs is that
again, it's, it's just like with the cake cliques and cliques and some cake course,
you can look for these motifs, but you may not be able to tell whether they're
actually part of some sort of larger interesting structure.
So, because you're looking with a microscope, you may actually miss the big