coco_pipe.dim_reduction.reducers.linear#

Linear dimensionality reduction reducers.

This module provides linear projection wrappers built on top of scikit-learn and optional Dask backends. These reducers follow the shared coco_pipe.dim_reduction.reducers.base.BaseReducer contract so they can be used directly with coco_pipe.dim_reduction.DimReduction, reporting, and visualization utilities.

Classes#

PCAReducer

Principal Component Analysis wrapper based on sklearn.decomposition.PCA.

IncrementalPCAReducer

Incremental PCA wrapper for batch-wise fitting on larger datasets.

DaskPCAReducer

Optional Dask-ML PCA wrapper for lazy or distributed arrays.

DaskTruncatedSVDReducer

Optional Dask-ML Truncated SVD wrapper for lazy or distributed arrays.

References

[1] Pearson, K. (1901). “On Lines and Planes of Closest Fit to Systems of Points in Space”. Philosophical Magazine, 2(11), 559-572.

[2] Hotelling, H. (1933). “Analysis of a complex of statistical variables into principal components”. Journal of Educational Psychology, 24(6), 417-441.

[3] Scikit-learn PCA documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Author: Hamza Abdelhedi (hamza.abdelhedi@umontreal.ca)

Classes#

PCAReducer

Principal Component Analysis reducer.

IncrementalPCAReducer

Incremental PCA reducer.

DaskPCAReducer

Dask-ML PCA reducer for lazy or distributed data.

DaskTruncatedSVDReducer

Dask-ML Truncated SVD reducer.

Module Contents#

class coco_pipe.dim_reduction.reducers.linear.PCAReducer(n_components=2, **kwargs)#

Bases: coco_pipe.dim_reduction.reducers.base.BaseReducer

Principal Component Analysis reducer.

This reducer wraps sklearn.decomposition.PCA and provides a linear low-dimensional embedding based on singular value decomposition.

Parameters:
  • n_components (int, default=2) – Number of principal components to keep.

  • **kwargs (dict) – Additional keyword arguments forwarded to sklearn.decomposition.PCA after signature filtering. Common options include whiten, svd_solver, and random_state.
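The "signature filtering" mentioned above can be pictured roughly as follows. This is a minimal sketch, not the package's actual implementation: `filter_kwargs` and `make_estimator` are hypothetical names introduced only for illustration, and the real filtering logic may differ.

```python
import inspect


def filter_kwargs(func, kwargs):
    """Keep only the keyword arguments that ``func`` explicitly accepts.

    Hypothetical helper: the reducer's real filtering code is not shown
    in this documentation.
    """
    params = inspect.signature(func).parameters
    # If the callable itself takes **kwargs, nothing needs to be dropped.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}


def make_estimator(n_components=2, whiten=False, random_state=None):
    """Stand-in for an estimator constructor (hypothetical)."""
    return {
        "n_components": n_components,
        "whiten": whiten,
        "random_state": random_state,
    }


# "n_neighbors" is not part of the constructor signature, so it is dropped.
accepted = filter_kwargs(make_estimator, {"whiten": True, "n_neighbors": 15})
estimator = make_estimator(n_components=2, **accepted)
```

Under this assumption, unsupported options are dropped before the estimator is built instead of causing a TypeError.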

Variables:

model (sklearn.decomposition.PCA or None) – Fitted PCA estimator after fit.

Notes

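PCA is a deterministic linear reducer unless a randomized solver is used (for example svd_solver="randomized"); in that case, pass random_state to obtain reproducible results.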
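To make the SVD-based projection concrete, here is a minimal NumPy sketch of PCA on centered data. The function `pca_svd` is a hypothetical name used only for illustration; it is not part of coco_pipe, and the signs of the resulting axes may differ from those returned by scikit-learn.

```python
import numpy as np


def pca_svd(X, n_components=2):
    """Illustrative PCA via SVD (hypothetical helper, not the library code)."""
    X = np.asarray(X, dtype=float)
    # PCA operates on mean-centered data.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Variance along each axis, and its share of the total variance.
    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    # Project the centered data onto the retained axes.
    embedding = Xc @ components.T
    return embedding, components, ratio


rng = np.random.default_rng(0)
X = rng.random((100, 10))
embedding, components, ratio = pca_svd(X, n_components=2)
```

The shapes mirror the reducer's documented outputs: the embedding has shape (n_samples, n_components) and the components have shape (n_components, n_features).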

See also

IncrementalPCAReducer

Linear PCA variant for batch-wise fitting.

DaskPCAReducer

Linear PCA variant for lazy or distributed arrays.

DaskTruncatedSVDReducer

Linear factorization alternative for lazy arrays.

IsomapReducer

Nonlinear manifold learner based on geodesic distances.

TSNEReducer

Nonlinear neighborhood-preserving embedding.

UMAPReducer

Nonlinear graph-based embedding balancing local and global structure.

PHATEReducer

Nonlinear diffusion-based embedding for smooth trajectories.

Examples

>>> import numpy as np
>>> from coco_pipe.dim_reduction import PCAReducer
>>> X = np.random.rand(100, 10)
>>> reducer = PCAReducer(n_components=2, random_state=42)
>>> _ = reducer.fit(X)
>>> X_reduced = reducer.transform(X)
>>> X_reduced.shape
(100, 2)
>>> reducer.explained_variance_ratio_.shape
(2,)
>>> reducer.components_.shape
(2, 10)
>>> reducer = PCAReducer(n_components=3, whiten=True)
>>> reducer.fit_transform(X).shape
(100, 3)

property capabilities: dict#

Return capability metadata for PCA.

Returns:

Capability mapping describing PCA as a linear component-based reducer.

Return type:

dict

fit(X, y=None)#

Fit PCA on the input data.

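Parameters:
  • X (ArrayLike of shape (n_samples, n_features)) – Training data.

  • y (ArrayLike, optional) – Ignored. Present for API compatibility.

Returns: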

Fitted reducer instance.

Return type:

PCAReducer

Examples

>>> import numpy as np
>>> from coco_pipe.dim_reduction import PCAReducer
>>> X = np.random.rand(20, 5)
>>> reducer = PCAReducer(n_components=2)
>>> _ = reducer.fit(X)
>>> reducer.model is not None
True

transform(X)#

Project data onto the fitted principal component basis.

Parameters:

X (ArrayLike of shape (n_samples, n_features)) – Data to project.

Returns:

Projected coordinates in principal component space.

Return type:

np.ndarray of shape (n_samples, n_dims)

Raises:

RuntimeError – If the reducer has not been fitted.
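The fitted-state check behind this RuntimeError can be sketched as below. `TinyReducer` is a hypothetical toy class written only to illustrate the pattern; the actual check in BaseReducer is not shown in this documentation and may be implemented differently.

```python
class TinyReducer:
    """Hypothetical reducer used only to illustrate the fitted-state check."""

    def __init__(self):
        # Set by fit(), mirroring the documented `model` attribute.
        self.model = None

    def fit(self, X):
        # Store per-feature means as a placeholder "model".
        n = len(X)
        self.model = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        # Refuse to project before fit() has produced a model.
        if self.model is None:
            raise RuntimeError("Reducer is not fitted; call fit() first.")
        return [[v - m for v, m in zip(row, self.model)] for row in X]


reducer = TinyReducer()
try:
    reducer.transform([[1.0, 2.0]])
    raised = False
except RuntimeError:
    raised = True

# After fitting, transform works as expected.
centered = reducer.fit([[1.0, 2.0], [3.0, 4.0]]).transform([[1.0, 2.0]])
```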

property explained_variance_ratio_: numpy.ndarray#

Fraction of the total variance explained by each retained component (values between 0 and 1).

Returns:

Explained variance ratio for each retained component.

Return type:

np.ndarray of shape (n_dims,)

Raises:

RuntimeError – If the reducer has not been fitted.

property participation_ratio_: float#

Effective dimensionality computed as the Participation Ratio.

Returns:

Participation ratio of the retained components.

Return type:

float

Raises:

RuntimeError – If the reducer has not been fitted.
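The documentation does not spell out the formula. A common definition of the participation ratio is the squared sum of the component variances divided by the sum of their squares, PR = (Σ λ_i)² / Σ λ_i². The sketch below assumes that definition; `participation_ratio` is a hypothetical standalone function, not the property's actual code.

```python
import numpy as np


def participation_ratio(variances):
    """Participation ratio (sum(l))**2 / sum(l**2) of non-negative variances.

    Assumed definition; the library's exact computation is not documented here.
    """
    lam = np.asarray(variances, dtype=float)
    return float(lam.sum() ** 2 / np.sum(lam**2))


# Variance spread evenly over four components: effective dimensionality 4.
pr_uniform = participation_ratio([1.0, 1.0, 1.0, 1.0])
# All variance in a single component: effective dimensionality 1.
pr_single = participation_ratio([5.0, 0.0, 0.0])
```

Because the ratio is invariant to rescaling, this definition gives the same value for raw explained variances and for explained variance ratios.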

property components_: numpy.ndarray#

Principal axes in feature space.

Returns:

Principal component loading matrix.

Return type:

np.ndarray of shape (n_dims, n_features)

Raises:

RuntimeError – If the reducer has not been fitted.

get_components()#

Return the principal component loading matrix.

Returns:

Principal component loading matrix.

Return type:

np.ndarray

Raises:

RuntimeError – If the reducer has not been fitted.

class coco_pipe.dim_reduction.reducers.linear.IncrementalPCAReducer(n_components=2, batch_size=None, **kwargs)#

Bases: coco_pipe.dim_reduction.reducers.base.BaseReducer

Incremental PCA reducer.

This reducer wraps sklearn.decomposition.IncrementalPCA for batch-wise fitting when the full dataset is too large to process in one pass.

Parameters:
  • n_components (int, default=2) – Number of principal components to keep.

  • batch_size (int, optional) – Number of samples processed per batch.

  • **kwargs (dict) – Additional keyword arguments forwarded to IncrementalPCA after signature filtering.

Variables:
  • batch_size (int or None) – Batch size used when fitting the incremental estimator.

  • model (sklearn.decomposition.IncrementalPCA or None) – Fitted IncrementalPCA estimator after fit or partial_fit.
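For orientation, batch-wise fitting with the underlying scikit-learn estimator looks roughly like this. The sketch calls sklearn.decomposition.IncrementalPCA directly and does not go through the reducer; the batch size of 25 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.random((100, 12))

ipca = IncrementalPCA(n_components=3)

# Feed the data in consecutive batches of 25 rows instead of all at once.
for start in range(0, X.shape[0], 25):
    batch = X[start:start + 25]
    ipca.partial_fit(batch)

# The fitted estimator can project any number of new rows.
X_reduced = ipca.transform(X[:10])
```

As a rule of thumb, keep each batch at least as large as n_components so that every update has enough samples to estimate the retained axes.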

See also

PCAReducer

Standard in-memory linear PCA reducer.

DaskPCAReducer

Linear PCA variant for lazy or distributed arrays.

DaskTruncatedSVDReducer

Linear factorization alternative for lazy arrays.

IsomapReducer

Nonlinear manifold learner based on geodesic distances.

TSNEReducer

Nonlinear neighborhood-preserving embedding.

UMAPReducer

Nonlinear graph-based embedding balancing local and global structure.

Examples

>>> import numpy as np
>>> from coco_pipe.dim_reduction import IncrementalPCAReducer
>>> X = np.random.rand(100, 12)
>>> reducer = IncrementalPCAReducer(n_components=3, batch_size=25)
>>> _ = reducer.fit(X)
>>> reducer.transform(X[:10]).shape
(10, 3)
>>> stream = IncrementalPCAReducer(n_components=2, batch_size=20)
>>> _ = stream.partial_fit(X[:50])
>>> _ = stream.partial_fit(X[50:])
>>> stream.transform(X).shape
(100, 2)

property capabilities: dict#

Return capability metadata for Incremental PCA.

Returns:

Capability mapping describing Incremental PCA as a linear component-based reducer.

Return type:

dict

batch_size = None#

fit(X, y=None)#

Fit Incremental PCA in batch mode.

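Parameters:
  • X (ArrayLike of shape (n_samples, n_features)) – Training data.

  • y (ArrayLike, optional) – Ignored. Present for API compatibility.

Returns: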

Fitted reducer instance.

Return type:

IncrementalPCAReducer

Examples

>>> import numpy as np
>>> from coco_pipe.dim_reduction import IncrementalPCAReducer
>>> X = np.random.rand(30, 6)
>>> reducer = IncrementalPCAReducer(n_components=2, batch_size=10)
>>> _ = reducer.fit(X)
>>> reducer.model is not None
True

partial_fit(X, y=None)#

Incrementally fit the estimator on a batch of samples.

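Parameters:
  • X (ArrayLike of shape (n_samples, n_features)) – Batch of samples used to update the estimator.

  • y (ArrayLike, optional) – Ignored. Present for API compatibility.

Returns: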

Reducer instance after updating the incremental estimator.

Return type:

IncrementalPCAReducer

Examples

>>> import numpy as np
>>> from coco_pipe.dim_reduction import IncrementalPCAReducer
>>> X = np.random.rand(40, 6)
>>> reducer = IncrementalPCAReducer(n_components=2, batch_size=20)
>>> _ = reducer.partial_fit(X[:20])
>>> _ = reducer.partial_fit(X[20:])
>>> reducer.model is not None
True

transform(X)#

Project data onto the fitted incremental PCA basis.

Parameters:

X (ArrayLike of shape (n_samples, n_features)) – Data to project.

Returns:

Projected coordinates in component space.

Return type:

np.ndarray of shape (n_samples, n_dims)

Raises:

RuntimeError – If the reducer has not been fitted.

property explained_variance_ratio_: numpy.ndarray#

Fraction of the total variance explained by each retained component (values between 0 and 1).

Returns:

Explained variance ratio for each retained component.

Return type:

np.ndarray of shape (n_dims,)

Raises:

RuntimeError – If the reducer has not been fitted.

property participation_ratio_: float#

Effective dimensionality computed as the Participation Ratio.

Returns:

Participation ratio of the retained components.

Return type:

float

Raises:

RuntimeError – If the reducer has not been fitted.

property components_: numpy.ndarray#

Principal axes in feature space.

Returns:

Principal component loading matrix.

Return type:

np.ndarray of shape (n_dims, n_features)

Raises:

RuntimeError – If the reducer has not been fitted.

get_components()#

Return the incremental PCA component loading matrix.

Returns:

Principal component loading matrix.

Return type:

np.ndarray

Raises:

RuntimeError – If the reducer has not been fitted.

class coco_pipe.dim_reduction.reducers.linear.DaskPCAReducer(n_components=2, svd_solver='auto', **kwargs)#

Bases: coco_pipe.dim_reduction.reducers.base.BaseReducer

Dask-ML PCA reducer for lazy or distributed data.

This reducer wraps dask_ml.decomposition.PCA. The backend is imported lazily so the rest of the package remains importable without dask-ml.

Parameters:
  • n_components (int, default=2) – Number of principal components to keep.

  • svd_solver ({"auto", "full", "tsqr", "randomized"}, default="auto") – Solver used by the Dask PCA backend.

  • **kwargs (dict) – Additional keyword arguments forwarded to dask_ml.decomposition.PCA after signature filtering.

Variables:
  • svd_solver (str) – Solver used when instantiating the Dask PCA estimator.

  • model (dask_ml.decomposition.PCA or None) – Fitted Dask PCA estimator after fit.

Notes

This reducer requires the optional dask-ml backend.
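The lazy-import pattern for an optional backend can be sketched as follows. `load_backend` is a hypothetical helper, and mapping a missing attribute to RuntimeError is an assumption made for illustration; the reducer's real import logic is not shown in this documentation.

```python
import importlib


def load_backend(module_name, attr):
    """Import ``attr`` from ``module_name`` only when it is first needed.

    Hypothetical helper illustrating a deferred optional import.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this reducer but is not installed."
        ) from exc
    try:
        return getattr(module, attr)
    except AttributeError as exc:
        raise RuntimeError(
            f"{module_name!r} was imported but does not provide {attr!r}."
        ) from exc


# A standard-library module stands in for the optional backend here.
sqrt = load_backend("math", "sqrt")

# A module that does not exist triggers the ImportError branch.
try:
    load_backend("not_a_real_backend_xyz", "PCA")
    missing_raised = False
except ImportError:
    missing_raised = True
```

Deferring the import to the point of use keeps the package importable even when the optional dependency is absent.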

See also

PCAReducer

Standard in-memory linear PCA reducer.

IncrementalPCAReducer

Linear PCA variant for batch-wise fitting.

DaskTruncatedSVDReducer

Linear SVD-based alternative for lazy arrays.

IsomapReducer

Nonlinear manifold learner based on geodesic distances.

TSNEReducer

Nonlinear neighborhood-preserving embedding.

UMAPReducer

Nonlinear graph-based embedding balancing local and global structure.

Examples

>>> import dask.array as da
>>> import numpy as np
>>> from coco_pipe.dim_reduction import DaskPCAReducer
>>> X = da.from_array(np.random.rand(100, 10), chunks=(25, 10))
>>> reducer = DaskPCAReducer(n_components=2, svd_solver="tsqr")
>>> _ = reducer.fit(X)
>>> reducer.transform(X).shape
(100, 2)

property capabilities: dict#

Return capability metadata for Dask PCA.

Returns:

Capability mapping describing Dask PCA as a linear component-based reducer.

Return type:

dict

svd_solver = 'auto'#

fit(X, y=None)#

Fit Dask PCA on the input data.

Parameters:
  • X (ArrayLike) – Training data, typically a Dask array or a compatible array-like object accepted by the Dask backend.

  • y (ArrayLike, optional) – Ignored. Present for API compatibility.

Returns:

Fitted reducer instance.

Return type:

DaskPCAReducer

Raises:
  • ImportError – If dask-ml is not installed.

  • RuntimeError – If dask-ml is installed but fails during initialization.

Examples

>>> import dask.array as da
>>> import numpy as np
>>> from coco_pipe.dim_reduction import DaskPCAReducer
>>> X = da.from_array(np.random.rand(40, 8), chunks=(20, 8))
>>> reducer = DaskPCAReducer(n_components=2)
>>> _ = reducer.fit(X)
>>> reducer.model is not None
True

transform(X)#

Project data using the fitted Dask PCA model.

Parameters:

X (ArrayLike) – Data to project.

Returns:

Backend-specific transformed output, typically a Dask array.

Return type:

Any

Raises:

RuntimeError – If the reducer has not been fitted.
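Because the transformed output is typically lazy, downstream code that needs an in-memory array may have to materialize it. The sketch below shows one way to do that; `to_numpy` and `FakeLazyArray` are hypothetical names, and the stand-in class exists only so the example runs without Dask installed. Dask arrays expose the same `compute()` method.

```python
import numpy as np


def to_numpy(result):
    """Materialize a lazy result if it exposes ``compute()``; otherwise coerce."""
    if hasattr(result, "compute"):
        result = result.compute()
    return np.asarray(result)


class FakeLazyArray:
    """Stand-in for a lazy array (hypothetical), so the sketch runs without Dask."""

    def __init__(self, data):
        self._data = data

    def compute(self):
        return np.asarray(self._data)


lazy = FakeLazyArray([[0.1, 0.2], [0.3, 0.4]])
dense = to_numpy(lazy)
```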

property explained_variance_ratio_: numpy.ndarray#

Fraction of the total variance explained by each retained component (values between 0 and 1).

Returns:

Explained variance ratio for each retained component.

Return type:

np.ndarray of shape (n_dims,)

Raises:

RuntimeError – If the reducer has not been fitted.

property participation_ratio_: float#

Effective dimensionality computed as the Participation Ratio.

Returns:

Participation ratio of the retained components.

Return type:

float

Raises:

RuntimeError – If the reducer has not been fitted.

property components_: numpy.ndarray#

Principal axes in feature space.

Returns:

Principal component loading matrix.

Return type:

np.ndarray of shape (n_dims, n_features)

Raises:

RuntimeError – If the reducer has not been fitted.

get_components()#

Return the Dask PCA component loading matrix.

Returns:

Principal component loading matrix or Dask-backed equivalent.

Return type:

np.ndarray

Raises:

RuntimeError – If the reducer has not been fitted.

class coco_pipe.dim_reduction.reducers.linear.DaskTruncatedSVDReducer(n_components=2, algorithm='tsqr', **kwargs)#

Bases: coco_pipe.dim_reduction.reducers.base.BaseReducer

Dask-ML Truncated SVD reducer.

This reducer wraps dask_ml.decomposition.TruncatedSVD and provides a linear projection for lazy or distributed arrays.

Parameters:
  • n_components (int, default=2) – Number of components to keep.

  • algorithm ({"tsqr", "randomized"}, default="tsqr") – SVD algorithm used by the Dask backend.

  • **kwargs (dict) – Additional keyword arguments forwarded to dask_ml.decomposition.TruncatedSVD after signature filtering.

Variables:
  • algorithm (str) – SVD algorithm used when instantiating the backend estimator.

  • model (dask_ml.decomposition.TruncatedSVD or None) – Fitted TruncatedSVD estimator after fit.

Notes

This reducer requires the optional dask-ml backend.
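One practical difference from PCA is that truncated SVD factorizes the data matrix without centering it first. That is how scikit-learn's TruncatedSVD behaves, and the same is assumed here for the Dask-ML backend. The NumPy sketch below illustrates the idea; `truncated_svd` is a hypothetical helper, not the reducer's code.

```python
import numpy as np


def truncated_svd(X, n_components=2):
    """Illustrative truncated SVD projection (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Unlike PCA, the data matrix is factorized as-is, with no mean removal.
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:n_components]
    # Projecting onto the top right singular vectors gives U[:, :k] * S[:k].
    embedding = X @ components.T
    return embedding, components, S[:n_components]


rng = np.random.default_rng(0)
X = rng.random((120, 15))
embedding, components, singular_values = truncated_svd(X, n_components=3)
```

Since the projection equals the left singular vectors scaled by the singular values, each column of the embedding has a Euclidean norm equal to the corresponding singular value.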

See also

PCAReducer

Standard in-memory linear PCA reducer.

IncrementalPCAReducer

Linear PCA variant for batch-wise fitting.

DaskPCAReducer

Linear PCA variant for lazy or distributed arrays.

IsomapReducer

Nonlinear manifold learner based on geodesic distances.

TSNEReducer

Nonlinear neighborhood-preserving embedding.

UMAPReducer

Nonlinear graph-based embedding balancing local and global structure.

Examples

>>> import dask.array as da
>>> import numpy as np
>>> from coco_pipe.dim_reduction import DaskTruncatedSVDReducer
>>> X = da.from_array(np.random.rand(120, 15), chunks=(30, 15))
>>> reducer = DaskTruncatedSVDReducer(n_components=3, algorithm="randomized")
>>> _ = reducer.fit(X)
>>> reducer.transform(X).shape
(120, 3)

property capabilities: dict#

Return capability metadata for Dask Truncated SVD.

Returns:

Capability mapping describing Dask Truncated SVD as a linear component-based reducer.

Return type:

dict

algorithm = 'tsqr'#

fit(X, y=None)#

Fit Dask Truncated SVD on the input data.

Parameters:
  • X (ArrayLike) – Training data, typically a Dask array or compatible array-like object accepted by the backend.

  • y (ArrayLike, optional) – Ignored. Present for API compatibility.

Returns:

Fitted reducer instance.

Return type:

DaskTruncatedSVDReducer

Raises:
  • ImportError – If dask-ml is not installed.

  • RuntimeError – If dask-ml is installed but fails during initialization.

Examples

>>> import dask.array as da
>>> import numpy as np
>>> from coco_pipe.dim_reduction import DaskTruncatedSVDReducer
>>> X = da.from_array(np.random.rand(40, 8), chunks=(20, 8))
>>> reducer = DaskTruncatedSVDReducer(n_components=2)
>>> _ = reducer.fit(X)
>>> reducer.model is not None
True

transform(X)#

Project data using the fitted Dask Truncated SVD model.

Parameters:

X (ArrayLike) – Data to project.

Returns:

Backend-specific transformed output, typically a Dask array.

Return type:

Any

Raises:

RuntimeError – If the reducer has not been fitted.

property explained_variance_ratio_: numpy.ndarray#

Fraction of the total variance explained by each retained component (values between 0 and 1).

Returns:

Explained variance ratio for each retained component.

Return type:

np.ndarray of shape (n_dims,)

Raises:

RuntimeError – If the reducer has not been fitted.

property participation_ratio_: float#

Effective dimensionality computed as the Participation Ratio.

Returns:

Participation ratio of the retained components.

Return type:

float

Raises:

RuntimeError – If the reducer has not been fitted.

property components_: numpy.ndarray#

Component axes in feature space (the right singular vectors of the data).

Returns:

Component loading matrix.

Return type:

np.ndarray of shape (n_dims, n_features)

Raises:

RuntimeError – If the reducer has not been fitted.

get_components()#

Return the Truncated SVD component loading matrix.

Returns:

Component loading matrix or Dask-backed equivalent.

Return type:

np.ndarray

Raises:

RuntimeError – If the reducer has not been fitted.