Aligning

Sign flips

class graspologic.align.SignFlips(criterion='median')[source]

Flips the signs of all entries in one dataset, X along some of the dimensions. In particular, it does so in a way that brings this dataset to the same orthant as the second dataset, Y, according to some criterion, computed along each dimension. The two critera currently available are the median and the maximum (in magnitude) value along each dimension.

This module can also be used to bring the dataset to the first orthant (i.e. with all criteras being positive) by providing the identity matrix as the second dataset.

Parameters:

criterion : string, {'median' (default), 'max'}, optional

String describing the criterion used to choose whether to flip signs. Two options are currently supported:

  • 'median'
    Uses the median along each dimension
  • 'max'
    Uses the maximum (in magintude) along each dimension
Attributes:

Q_ : array, size (d, d)

Final orthogonal matrix, used to modify X.

fit(self, X, Y)[source]

Uses the two datasets to learn the matrix Q_ that aligns the first dataset with the second.

In sign flips, Q_ is an diagonal orthogonal matrices (i.e. a matrix with 1 or -1 in each entry on diagonal and 0 everywhere else) picked such that all dimensions of X @ Q_ and Y are in the same orthant using some critera (median or max magnitude).

Parameters:

X : np.ndarray, shape (n, d)

Dataset to be mapped to Y, must have same number of dimensions (axis 1) as Y.

Y : np.ndarray, shape (m, d)

Target dataset, must have same number of dimensions (axis 1) as X.

Returns:
self : returns an instance of self
fit_transform(self, X, Y)

Uses the two datasets to learn the matrix Q_ that aligns the first dataset with the second. Then, transforms the first dataset X using the learned matrix Q_.

Parameters:

X : np.ndarray, shape (n, d)

Dataset to be mapped to Y, must have same number of dimensions (axis 1) as Y.

Y : np.ndarray, shape (m, d)

Target dataset, must have same number of dimensions (axis 1) as X.

Returns:

X_prime : np.ndarray, shape (n, d)

First dataset of vectors, aligned to second. Equal to X @ Q_.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : object

Estimator instance.

transform(self, X)

Transforms the dataset X using the learned matrix Q_. This may be the same as the first dataset as in fit(), or a new dataset. For example, additional samples from the same dataset.

Parameters:

X : np.ndarray, shape(m, d)

Dataset to be transformed, must have same number of dimensions (axis 1) as X and Y that were passed to fit.

Returns:

X_prime : np.ndarray, shape (n, d)

First dataset of vectors, aligned to second. Equal to X @ Q_.

Orthogonal Procrustes

class graspologic.align.OrthogonalProcrustes[source]

Computes the matrix solution of the classical orthogonal Procrustes [R70b5933f300f-1] problem, which is that given two matrices X and Y of equal shape (n, d), find an orthogonal matrix that most closely maps X to Y. Subsequently, uses that matrix to transform either the original X, or a different dataset in the same space.

Note that when used to match two datasets, this method unlike SeedlessProcrustes, not only requires that the datasets have the same number of entries, but also that there is some correspondence between the entries. In graph embeddings, this usually corresponds to the assumption that the vertex \(i\) in graph X has the same latent position as the vertex \(i\) in graph Y.

Attributes:

Q_ : array, size (d, d)

Final orthogonal matrix, used to modify X.

score_ : float

Final value of the objective function: \(|| X Q - Y ||_F\) Lower means the datasets have been matched together better.

Notes

Formally, minimizes \(|| X Q - Y ||_F\), which has a closed form solution, whenever \(Q\) is constrained to be an orthogonal matrix, that is a matrix that satisfies \(Q^T Q = Q Q^T = I\). For the more details, including the proof of the closed-form solution see [R70b5933f300f-1].

Implementation-wise, this class is a wrapper of the scipy.linalg.orthogonal_procrustes(), which itself uses an algorithm described in find the optimal solution algorithm [R70b5933f300f-2].

References

[R70b5933f300f-1](1, 2) https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem
[R70b5933f300f-2]Peter H. Schonemann, "A generalized solution of the orthogonal Procrustes problem", Psychometrica -- Vol. 31, No. 1, March, 1996.
fit(self, X, Y)[source]

Uses the two datasets to learn the matrix Q_ that aligns the first dataset with the second.

Parameters:

X : np.ndarray, shape (n, d)

Dataset to be mapped to Y, must have the same shape as Y.

Y : np.ndarray, shape (m, d)

Target dataset, must have the same shape as X.

Returns:
self : returns an instance of self
fit_transform(self, X, Y)[source]

Uses the two datasets to learn the matrix Q_ that aligns the first dataset with the second. Then, transforms the first dataset X using the learned matrix Q_.

Parameters:

X : np.ndarray, shape (n, d)

Dataset to be mapped to Y, must have the same shape as Y.

Y : np.ndarray, shape (m, d)

Target dataset, must have the same shape as X.

Returns:

X_prime : np.ndarray, shape (n, d)

First dataset of vectors, aligned to second. Equal to X @ Q_.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : object

Estimator instance.

transform(self, X)

Transforms the dataset X using the learned matrix Q_. This may be the same as the first dataset as in fit(), or a new dataset. For example, additional samples from the same dataset.

Parameters:

X : np.ndarray, shape(m, d)

Dataset to be transformed, must have same number of dimensions (axis 1) as X and Y that were passed to fit.

Returns:

X_prime : np.ndarray, shape (n, d)

First dataset of vectors, aligned to second. Equal to X @ Q_.

Seedless Procrustes

class graspologic.align.SeedlessProcrustes(optimal_transport_lambda=0.1, optimal_transport_eps=0.01, optimal_transport_num_reps=1000, iterative_num_reps=100, init='2d', initial_Q=None, initial_P=None)[source]

Matches two datasets using an orthogonal matrix. Unlike OrthogonalProcrustes, this does not require a matching between entries. It can even be used in the settings where the two datasets do not have the same number of entries.

In graph setting, it is used to align the embeddings of two different graphs, when it requires some simultaneous inference task and no 1-1 matching between the vertices of the two graphs can be established, for example, inside of the test for the equivalence of the latent distributions (see: LatentDistributionTest).

Parameters:

optimal_transport_lambda : float (default=0.1), optional

Regularization term of the Sinkhorn optimal transport algorithm.

optimal_transport_eps : float (default=0.01), optional

Tolerance parameter for the each Sinkhorn optimal transport algorithm. I.e. tolerance for each "E-step".

optimal_transport_num_reps : int (default=1000), optional

Number of repetitions in each iteration of the iterative optimal transport problem. I.e. maximum number of repetitions in each "E-step".

iterative_num_reps : int (default=100), optional

Number of reps in each iteration of the iterative optimal transport problem. I.e. maxumum number of total iterations the whole "EM" algorithm.

init : string, {'2d' (default), 'sign_flips', 'custom'}, optional

  • '2d'
    Uses \(2^d\) different restarts, where \(d\) is the dimension of the datasets. In particular, tries all matrices that are simultaneously diagonal and orthogonal. In other words, these are diagonal matrices with all entries on the diagonal being either +1 or -1. This is motivated by the fact that spectral graph embeddings have two types of orthogonal non-identifiability, one of which is captured by the orthogonal diagonal matrices. The final result is picked based on the final values of the objective function. For more on this, see [R60780f5e411a-2].
  • 'sign_flips'
    Initial alignment done by making the median value in each dimension have the same sign. The motivation is similar to that in '2d', except this is a heuristic that can save time, but can sometimes yield suboptimal results.
  • 'custom'
    Expects either an initial guess for Q_ or an initial guess for P_, but not both. See initial_Q and initial_P, respectively. If neither is provided, initializes initial_Q to an identity with an appropriate number of dimensions.

initial_Q : np.ndarray, shape (d, d) or None, optional (default=None)

An initial guess for the alignment matrix, Q_, if such exists. Only one of initial_Q, initial_P can be provided at the same time, and only if init argument is set to 'custom'. If None, and initial_P is also None - initializes initial_Q to identity matrix. Must be an orthogonal matrix, if provided.

initial_P : np.ndarray, shape (n, m) or None, optional (default=None)

Initial guess for the optimal transport matrix, P_, if such exists. Only one of initial_Q, initial_P can be provided at the same time, and only if init argument is set to 'custom'. If None, and initial_Q is also None - initializes initial_Q to identity matrix. Must be a soft assignment matrix if provided (rows sum up to 1/n, cols sum up to 1/m.)

Attributes:

Q_ : array, size (d, d)

Final orthogonal matrix, used to modify X.

P_ : array, size (n, m) where n and m are the sizes of two datasets

Final matrix of optimal transports, represent soft matching weights from points in one dataset to the other, normalized such that all rows sum to 1/n and all columns sum to 1/m.

score_ : float

Final value of the objective function: \(|| X Q - P Y ||_F\) Lower means the datasets have been matched together better.

selected_initial_Q_ : array, size (d, d)

Initial orthogonal matrix which was used as the initialization. If init was set to '2d' or 'sign_flips', then it is the adaptively selected matrix. If init was set to 'custom', and initial_Q was provided, then equal to that. If it was not provided, but initial_P was, then it is the matrix after the first procrustes performed. If neither was provided, then it is the identity matrix.

Notes

In essence, the goal of this procedure is to simultaneously obtain a, not necessarily 1-to-1, correspondence between the vertices of the two data sets, and an orthogonal alignment between two datasets. If the two datasets are represented with matrices \(X \in M_{n, d}\) and \(Y \in M_{m, d}\), then the correspondence is a matrix \(P \in M_{n, m}\) that is soft assignment matrix (that is, its rows sum to \(1/n\), and columns sum to \(1/m\)) and the orthogonal alignment is an orthogonal matrix \(Q \in M_{d, d}\) (an orthogonal matrix is any matrix that satisfies \(Q^T Q = Q Q^T = I\)). The global objective function is \(|| X Q - P Y ||_F\).

Note that both \(X\) and \(PY\) are matrices in \(M_{n, d}\). Thus, if one knew \(P\), it would be simple to obtain an estimate for \(Q\), using the regular orthogonal procrustes. On the other hand, if \(Q\) was known, then \(XQ\) and \(Y\) could be thought of distributions over a finite number of masses, each with weight \(1/n\) or \(1/m\), respectively. These distributions could be "matched" via solving an optimal transport problem.

However, both \(Q\) and \(P\) are simultaneously unknown here. So the algorithm performs a sequence of alternating steps, obtaining iteratively improving estimates of \(Q\) and \(P\), similarly to an expectation-maximization (EM) procedure. It is not known whether this procedure is formally an EM, but the analogy can be drawn as follows: after obtaining an initial guess of of \(\hat{Q}_{0}\), obtaining an assignment matrix \(\hat{P}_{i+1} | \hat{Q}_{i}\) ("E-step") is done by solving an optimal transport problem via Sinkhorn algorithm, whereas obtaining an orthogonal alignment matrix \(Q_{i+1} | P_{i}\) ("M-step") is done via regular orthogonal procurstes. These alternating steps are performed until iterative_num_reps is reached.

For more on how the initial guess can be performed, see init.

References

[R60780f5e411a-1]Agterberg, J. # TODO Cite the Seedless Procrustes preprint whenever available.
[R60780f5e411a-2]Agterberg, J., Tang, M., Priebe., C. E. (2020). "On Two Distinct Sources of Nonidentifiability in Latent Position Random Graph Models" arXiv:2003.14250
fit(self, X, Y)[source]

Uses the two datasets to learn the matrix self.Q_ that aligns the first dataset with the second.

Parameters:

X : np.ndarray, shape (n, d)

Dataset to be mapped to Y, must have same number of dimensions (axis 1) as Y.

Y : np.ndarray, shape (m, d)

Target dataset, must have same number of dimensions (axis 1) as X.

Returns:
self : returns an instance of self
fit_transform(self, X, Y)

Uses the two datasets to learn the matrix Q_ that aligns the first dataset with the second. Then, transforms the first dataset X using the learned matrix Q_.

Parameters:

X : np.ndarray, shape (n, d)

Dataset to be mapped to Y, must have same number of dimensions (axis 1) as Y.

Y : np.ndarray, shape (m, d)

Target dataset, must have same number of dimensions (axis 1) as X.

Returns:

X_prime : np.ndarray, shape (n, d)

First dataset of vectors, aligned to second. Equal to X @ Q_.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : object

Estimator instance.

transform(self, X)

Transforms the dataset X using the learned matrix Q_. This may be the same as the first dataset as in fit(), or a new dataset. For example, additional samples from the same dataset.

Parameters:

X : np.ndarray, shape(m, d)

Dataset to be transformed, must have same number of dimensions (axis 1) as X and Y that were passed to fit.

Returns:

X_prime : np.ndarray, shape (n, d)

First dataset of vectors, aligned to second. Equal to X @ Q_.