Clustering

K-Means Clustering

class graspy.cluster.KMeansCluster(max_clusters=2, random_state=None)[source]

KMeans Cluster.

It fits a model for every number of clusters from 2 to max_clusters. The best model is the one with the highest silhouette score.
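The silhouette score lies in [-1, 1], and values closer to 1 indicate tighter, better-separated clusters. The sketch below is a minimal pure-Python illustration of silhouette-based selection, not graspy's implementation: `silhouette` is a hypothetical helper for 1-D data, and the hard-coded labelings stand in for the fitted models with 2 and 3 clusters.

```python
from statistics import mean


def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points (hypothetical helper).

    Requires at least two distinct labels.
    """
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances to the other members of the point's own cluster.
        own = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = mean(own)
        # Mean distance to the nearest other cluster.
        b = min(
            mean([abs(p - q) for q, l in zip(points, labels) if l == other])
            for other in set(labels) - {lab}
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


points = [0.0, 1.0, 10.0, 11.0]
# Stand-ins for the labelings produced by models with 2 and 3 clusters.
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
scores = {k: silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)  # keep the highest-scoring candidate
```

Here the two-cluster labeling scores higher, because splitting the second pair into singletons contributes zero-valued coefficients.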

Parameters:
max_clusters : int, defaults to 2.

The maximum number of clusters to consider.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Attributes:
n_clusters_ : int

Optimal number of clusters. If y is given, it is based on the largest ARI. Otherwise, it is based on the highest silhouette score.

model_ : KMeans object

KMeans object fitted with the optimal number of clusters.

silhouette_ : list

List of silhouette scores, one for each number of clusters in range(2, max_clusters + 1).

ari_ : list

Only computed when y is given. List of ARI values, one for each number of clusters in range(2, max_clusters + 1).
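The adjusted Rand index (ARI) compares two labelings while ignoring how the labels are named. It equals 1 for identical partitions and is close to 0 (possibly negative) for unrelated ones. The following is a minimal pure-Python sketch of the standard pair-counting formula; `adjusted_rand_index` is a hypothetical helper written for illustration, not the function graspy calls.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings (hypothetical helper)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs that share a cluster in both labelings, in y only, in pred only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call shows that swapping label names does not change the score; the second shows that a labeling which cuts across the true groups scores below zero.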

fit(self, X, y=None)[source]

Fits KMeans models to the data.

Parameters:
X : array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y : array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
self

fit_predict(self, X, y=None)

Fit the models and predict clusters based on best model.

Parameters:
X : array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y : array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
labels : array, shape (n_samples,)

Component labels.

ari : float

Adjusted Rand index. Only returned if y is given.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

predict(self, X, y=None)

Predict clusters based on best model.

Parameters:
X : array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y : array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
labels : array, shape (n_samples,)

Component labels.

ari : float

Adjusted Rand index. Only returned if y is given.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Returns:
self
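The `<component>__<parameter>` naming convention can be illustrated with a short sketch. `group_nested_params` is a hypothetical helper, and the step name `cluster` is an assumed name for a component inside some enclosing object; this is not the code that set_params runs, only a picture of how such keys are split.

```python
def group_nested_params(params):
    """Split keys of the form <component>__<parameter> (hypothetical helper)."""
    top_level, nested = {}, {}
    for key, value in params.items():
        # Split at the first double underscore only.
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(name, {})[sub_key] = value
        else:
            top_level[name] = value
    return top_level, nested


top_level, nested = group_nested_params(
    {"verbose": True, "cluster__max_clusters": 5, "cluster__random_state": 0}
)
```

Keys without a double underscore apply to the object itself, while the remaining keys are grouped by component and forwarded with the prefix removed.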

Gaussian Mixture Models Clustering

class graspy.cluster.GaussianCluster(min_components=2, max_components=None, covariance_type='full', random_state=None)[source]

Gaussian Mixture Model (GMM)

Representation of a Gaussian mixture model probability distribution. This class estimates the parameters of a Gaussian mixture distribution. It fits a model for each candidate number of components and covariance structure. The best model is the one with the lowest BIC score.
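The Bayesian information criterion (BIC) trades goodness of fit against model size: BIC = p ln(n) - 2 ln(L), where p is the number of free parameters, n the number of samples, and L the maximized likelihood. The sketch below applies that formula to made-up numbers. The log-likelihoods and parameter counts are hypothetical values chosen for illustration, not output from any fitted model.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


n_samples = 100
# Hypothetical fits: n_components -> (total log-likelihood, free parameters).
candidates = {2: (-400.0, 11), 3: (-320.0, 17), 4: (-318.0, 23)}
bic_scores = {k: bic(ll, p, n_samples) for k, (ll, p) in candidates.items()}
best_n_components = min(bic_scores, key=bic_scores.get)
```

Going from 3 to 4 components barely improves the likelihood, so the penalty for the extra parameters dominates and the three-component candidate has the lowest BIC.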

Parameters:
min_components : int, default=2.

The minimum number of mixture components to consider (unless max_components=None, in which case this is the maximum number of components to consider). If max_components is not None, min_components must be less than or equal to max_components.

max_components : int or None, default=None.

The maximum number of mixture components to consider. Must be greater than or equal to min_components.
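The interaction between the two bounds can be sketched as follows. `component_range` is a hypothetical helper, and the lower bound of 1 used when max_components is None is an assumption; the parameter descriptions only state that min_components then acts as the upper bound.

```python
def component_range(min_components=2, max_components=None):
    """Candidate numbers of components (hypothetical helper)."""
    if max_components is None:
        # Assumption: the search starts at one component in this case.
        lower, upper = 1, min_components
    else:
        if min_components > max_components:
            raise ValueError("min_components must be <= max_components")
        lower, upper = min_components, max_components
    return list(range(lower, upper + 1))


default_range = component_range()        # max_components left as None
explicit_range = component_range(2, 5)   # both bounds given
```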

covariance_type : {'full' (default), 'tied', 'diag', 'spherical'}, optional

String or list/array describing the type of covariance parameters to use. If a string, it must be one of:

  • 'full'
    each component has its own general covariance matrix
  • 'tied'
    all components share the same general covariance matrix
  • 'diag'
    each component has its own diagonal covariance matrix
  • 'spherical'
    each component has its own single variance
  • 'all'
    considers all covariance structures in ['spherical', 'diag', 'tied', 'full']
If a list/array, it must be a list/array of strings containing only 'spherical', 'tied', 'diag', and/or 'full'.
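The four covariance structures differ in how many free parameters they add to the model, and this is what the BIC penalizes. The sketch below counts them for a mixture with k components in d dimensions, using the standard counts for Gaussian mixtures; `n_free_parameters` is a hypothetical helper written for illustration.

```python
def n_free_parameters(n_components, n_features, covariance_type):
    """Free parameters of a Gaussian mixture (hypothetical helper)."""
    k, d = n_components, n_features
    cov_params = {
        "full": k * d * (d + 1) // 2,  # one symmetric d x d matrix per component
        "tied": d * (d + 1) // 2,      # one symmetric d x d matrix shared by all
        "diag": k * d,                 # d variances per component
        "spherical": k,                # one variance per component
    }[covariance_type]
    mean_params = k * d                # one d-dimensional mean per component
    weight_params = k - 1              # mixing weights sum to 1
    return cov_params + mean_params + weight_params


counts = {
    cov: n_free_parameters(3, 3, cov)
    for cov in ("spherical", "diag", "tied", "full")
}
```

For three components in three dimensions, 'full' needs roughly twice as many parameters as 'spherical', so it must fit the data noticeably better to win on BIC.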

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Attributes:
n_components_ : int

Optimal number of components based on BIC.

covariance_type_ : str

Optimal covariance type based on BIC.

model_ : GaussianMixture object

GaussianMixture object fitted with the optimal number of components and the optimal covariance structure.

bic_ : pandas.DataFrame

A pandas DataFrame of BIC values computed for each number of components in range(min_components, max_components + 1) and all covariance structures given by covariance_type.

ari_ : pandas.DataFrame

Only computed when y is given. A pandas DataFrame of ARI values computed for each number of components in range(min_components, max_components + 1) and all covariance structures given by covariance_type.

fit(self, X, y=None)[source]

Fits Gaussian mixture models to the data. Model parameters are estimated with the EM algorithm.

Parameters:
X : array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y : array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
self

fit_predict(self, X, y=None)

Fit the models and predict clusters based on best model.

Parameters:
X : array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y : array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
labels : array, shape (n_samples,)

Component labels.

ari : float

Adjusted Rand index. Only returned if y is given.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

predict(self, X, y=None)

Predict clusters based on best model.

Parameters:
X : array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y : array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
labels : array, shape (n_samples,)

Component labels.

ari : float

Adjusted Rand index. Only returned if y is given.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Returns:
self