Semiparametric Two-Graph Testing

import numpy as np

from graspy.inference import SemiparametricTest
from graspy.embed import AdjacencySpectralEmbed
from graspy.simulations import sbm, rdpg
from graspy.utils import symmetrize
from graspy.plot import heatmap, pairplot

%matplotlib inline

Generate a stochastic block model graph to model as a random dot product graph

To start, we generate a binary stochastic block model graph (SBM). An SBM is composed of ‘communities’ or ‘blocks,’ where a node’s block membership in a graph determines its probability of connection to the other nodes in the graph.

n_components = 4 # the number of embedding dimensions for ASE
P = np.array([[0.9, 0.11, 0.13, 0.2],
              [0, 0.7, 0.1, 0.1],
              [0, 0, 0.8, 0.1],
              [0, 0, 0, 0.85]])

P = symmetrize(P)
csize = [50] * 4
A = sbm(csize, P)
X = AdjacencySpectralEmbed(n_components=n_components).fit_transform(A)
heatmap(A, title='4-block SBM adjacency matrix')
pairplot(X, title='4-block adjacency spectral embedding')
<seaborn.axisgrid.PairGrid at 0x254c28370b8>

In the adjacency matrix above, there is a clearly defined block structrure corresponding to the 4 communities in the graph that we established. On the right, we see the adjacency spectral embedding (ASE) of this graph. ASE(A) recovers an estimate of the latent positions of $A $. Latent positions refer to the idea of a random dot product graph (RDPG) which can be modeled as follows:

For an adjacency matrix \(A \in \mathbb{R}^{n x n}\), the probability of an edge existing between node \(i\) and node \(j\) (aka whether or not \(A_{ij}\) is a 1) is determined by the matrix \(P \in \mathbb{R}^{n x n}\)

\(P = XX^T\), where $X \in `:nbsphinx-math:mathbb{R}`^{n x d} $ and is referred to as the latent positions of the graph. \(X\) is referred to as the latent positions of the graph because each node \(n_i\) is modeled as having a hidden, usually unobserved location in \(\mathbb{R}^d\) (we’ll call it \(x_i\)). The probability of an edge existing between \(n_i\) and \(n_j\) is equal to the dot product \(x_i \cdot x_j\)

ASE is one way to obtain an estimate of the latent positions of a graph, \(\hat{X}\)

In the above embedding, we see 4 clusters of nodes corresponding to the 4 blocks that we prescribed. ASE recovers the fact that all of the nodes in a block have similar latent positions. So, RDPGs can also model an SBM graph.

Sample new RDPGs from this latent position

Given the estimate of X, we now sample two new RDPGs from the same latent position above

A1 = rdpg(X,
A2 = rdpg(X,

Xhat1 = AdjacencySpectralEmbed(n_components=n_components).fit_transform(A1)
Xhat2 = AdjacencySpectralEmbed(n_components=n_components).fit_transform(A2)

heatmap(A1, title='Sampled RDPG 1 adjacency matrix')
heatmap(A2, title='Sampled RDPG 2 adjacency matrix')
pairplot(Xhat1, title='Sampled RDPG 1 adjacency spectral embedding')
pairplot(Xhat2, title='Sampled RDPG 2 adjacency spectral embedding')
<seaborn.axisgrid.PairGrid at 0x254c828d860>

Qualitatively, both of the simulated RDPGs above match the behavior we would expect, with 4 clear blocks and the corresponding 4 clusters in the embedded space. But, can we say they were generated from the same latent positions?

Semiparametric test where null is true

Now, we want to know whether the above two graphs were generated from the same latent position. We know that they were, so the test should predict that the differences between Sampled RDPG 1 and 2 (up to a rotation, see below) are no greater than those differences observed by chance

In other words, we are testing

\[H_0: X_1 = X_2 R\]

\[H_a: X_1 \neq X_2 R\]

and want to see that the p-value for the semiparametric test is high (fail to reject the null)

Here, R is an orthogonal rotation matrix found from solving the orthogonal procrustes problem (Note: this constraint can be relaxed for other versions of semipar)

Note that the may take several minutes

spt = SemiparametricTest(n_bootstraps=200, n_components=n_components), A2)
print('p = {}'.format(spt.p_value_))
p = 0.8325

We see that the corresponding p-value is high, indicating that the observed differences between latent positions of Sampled RDPG 1 and 2 are likely due to chance

Semiparametric test where the null is false

Now, we distort the latent position of one of the sampled graphs by adding noise. The semiparametric test should have a low p-value, indicating that we should reject the null hypothesis

A3 = rdpg(X,
A4 = rdpg(X + np.random.normal(0.1, 0.1, size=(X.shape)),

Xhat3 = AdjacencySpectralEmbed(n_components=n_components).fit_transform(A3)
Xhat4 = AdjacencySpectralEmbed(n_components=n_components).fit_transform(A4)

heatmap(A3, title='Sampled RDPG 3 adjacency matrix')
heatmap(A4, title='Sampled RDPG 4 (distorted) adjacency matrix')
pairplot(Xhat3, title='Sampled RDPG 3 adjacency spectral embedding')
pairplot(Xhat4, title='Sampled RDPG 4 (distorted) adjacency spectral embedding')
<seaborn.axisgrid.PairGrid at 0x254dba1e5f8>
spt = SemiparametricTest(n_bootstraps=200, n_components=n_components), A4)
print('p = {}'.format(spt.p_value_))
p = 0.0025