.. _dim-reduction-concepts:

==================================
Scientific Concepts and Principles
==================================

This page explains the foundational decisions behind
``coco_pipe.dim_reduction``. Understanding them prevents the most common
mistakes when reducing high-dimensional scientific data: pseudo-clustering,
embedding-leakage in evaluation, and misinterpreting reducer outputs.

---

1. Reduction vs. Evaluation vs. Interpretation
================================================

Three concerns are kept deliberately separate:

- **Reduction** produces an embedding from data. A
  :class:`~coco_pipe.dim_reduction.DimReduction` wraps one fitted reducer.
- **Evaluation** measures whether the embedding preserves the original
  structure. Implemented by ``evaluation.core.evaluate_embedding``.
- **Interpretation** measures which input features appear to drive the
  embedding axes. Implemented by
  :func:`~coco_pipe.dim_reduction.analysis.interpret_features`.

Each step accepts an **explicit embedding array** rather than re-running the
reducer. The manager never silently re-fits or re-embeds.

.. admonition:: Why the explicit-embedding contract?

   Caching embeddings on the manager would make it easy to score an embedding
   the reducer no longer produces (e.g., after parameter changes), or to score
   an embedding that was computed on a different sample. Forcing the user to
   pass the embedding object explicitly makes lineage visible at the call
   site.

---

2. Sample-Layout Matters: 2D vs. 3D Embeddings
================================================

The evaluator routes by embedding shape, not by method name:

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - ``X_emb.shape``
     - Path
   * - ``(n_samples, n_components)``
     - Standard 2D metrics (trustworthiness, continuity, LCMC, MRRE, Shepard).
   * - ``(n_trajectories, n_times, n_dims)``
     - Trajectory metrics (speed, curvature, dispersion, separation).
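
The routing rule in the table can be sketched as follows. ``route_by_shape``
is a hypothetical helper written for illustration, not a function of
``coco_pipe``; it only shows that the decision depends on the number of array
dimensions and that no reshape is attempted.

.. code-block:: python

   import numpy as np


   def route_by_shape(X_emb):
       """Hypothetical helper: choose a metric path from the embedding shape."""
       X_emb = np.asarray(X_emb)
       if X_emb.ndim == 2:
           # (n_samples, n_components): standard preservation metrics.
           return "standard"
       if X_emb.ndim == 3:
           # (n_trajectories, n_times, n_dims): trajectory metrics.
           return "trajectory"
       # Anything else is rejected instead of being reshaped silently.
       raise ValueError(f"Expected a 2D or 3D embedding, got ndim={X_emb.ndim}")


   print(route_by_shape(np.zeros((100, 2))))    # standard
   print(route_by_shape(np.zeros((8, 50, 3))))  # trajectory
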

For 2D paths, the **original** data ``X`` is required (the co-ranking matrix
needs both spaces). For 3D paths, ``X`` is optional, since most trajectory
metrics operate purely on the embedded tensor.

.. warning::

   Trajectory metrics never reshape a flat 2D embedding into a 3D tensor. Any
   reshape has to happen upstream (e.g., via
   :meth:`coco_pipe.io.DataContainer.unstack`). Silent reshaping would invent
   trajectory structure that is not in the data.

---

3. Strict Configuration
========================

Every reducer has a pydantic
:class:`~coco_pipe.dim_reduction.config.BaseReducerConfig` subclass:

- **No unknown fields**: typos like ``n_neigbors`` fail at construction, not
  at fit time.
- **Canonical method names**: exact strings (``"PCA"``, ``"UMAP"``,
  ``"TopologicalAE"``…). No aliasing.
- **Typed constructors**: the manager can be built from
  ``DimReduction("UMAP", n_neighbors=15)`` *or* from
  ``DimReduction(UMAPConfig(n_neighbors=15))``.

The :class:`~coco_pipe.dim_reduction.config.EvaluationConfig` follows the same
contract: invalid metric names, duplicate entries, and selection metrics not
present in ``metrics`` fail at parse time. This prevents the common pattern
where the experiment runs but downstream ranking silently does nothing.

---

4. Embedding-Aware Metric Selection
=====================================

The evaluator never silently skips a requested metric. If the embedding shape
or required inputs are incompatible, the metric is reported as unavailable
with a clear reason in the metric payload.

Standard 2D metric families:

- **Co-ranking based**: ``trustworthiness``, ``continuity``, ``lcmc``,
  ``mrre_intrusion``, ``mrre_extrusion``, ``mrre_total``. All require a
  ``(n_samples - 1) × (n_samples - 1)`` co-ranking matrix and a chosen
  neighborhood ``k``.
- **Distance preservation**: ``shepard_correlation`` (rank-correlation of
  pairwise original vs. embedded distances).
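
To make the co-ranking family concrete, here is a NumPy-only sketch of the
co-ranking matrix and of trustworthiness derived from it. The function names
are hypothetical and this is not the package's implementation; it assumes
Euclidean distances, breaks ties by index order, and uses the usual
trustworthiness normalisation, which is valid for ``k < n_samples / 2``.

.. code-block:: python

   import numpy as np


   def pairwise_ranks(X):
       """ranks[i, j] = position of j in i's neighbor ordering (self = 0)."""
       diff = X[:, None, :] - X[None, :, :]
       dist = np.sqrt((diff ** 2).sum(axis=-1))
       np.fill_diagonal(dist, -1.0)  # force each point to rank itself first
       order = np.argsort(dist, axis=1, kind="stable")
       n = X.shape[0]
       ranks = np.empty_like(order)
       ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
       return ranks


   def coranking_matrix(X, X_emb):
       """Q[a, b] counts pairs with original rank a+1 and embedded rank b+1."""
       n = X.shape[0]
       r_high = pairwise_ranks(X)
       r_low = pairwise_ranks(X_emb)
       off_diag = ~np.eye(n, dtype=bool)  # drop self-pairs (rank 0)
       Q = np.zeros((n - 1, n - 1), dtype=int)
       np.add.at(Q, (r_high[off_diag] - 1, r_low[off_diag] - 1), 1)
       return Q


   def trustworthiness_from_coranking(Q, k):
       """Penalise points inside the embedded k-neighborhood but outside the original one."""
       n = Q.shape[0] + 1
       original_rank = np.arange(1, n)
       weights = np.clip(original_rank - k, 0, None)[:, None]
       penalty = (weights * Q[:, :k]).sum()
       return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


   rng = np.random.default_rng(0)
   X = rng.normal(size=(20, 5))
   X_emb = X[:, :2]  # a crude "embedding": keep the first two columns
   Q = coranking_matrix(X, X_emb)
   print(Q.shape)  # (19, 19), i.e. n_samples - 1 per side
   print(trustworthiness_from_coranking(Q, k=5))

A perfect embedding puts all counts on the diagonal of ``Q``, which is why the
family needs both the original data and the embedding.
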

Trajectory metric families (operating on
``(n_trajectories, n_times, n_dims)``):

- **Kinematics**: ``trajectory_speed``, ``trajectory_acceleration``,
  ``trajectory_curvature``, ``trajectory_turning_angle``.
- **Geometry**: ``trajectory_path_length``, ``trajectory_displacement``,
  ``trajectory_tortuosity``, ``trajectory_dispersion``.
- **Group structure**: ``trajectory_separation`` (requires per-trajectory
  ``labels``).

See :ref:`dim-reduction-evaluation` and :ref:`dim-reduction-trajectories` for
the full metric catalog and the math behind each one.

---

5. Tidy Records and Post-Hoc Comparison
=========================================

Every scored reducer caches the evaluator's tidy long-form output on
``DimReduction.metric_records_``. Records have these columns:

- ``method``: reducer name.
- ``metric``: metric name.
- ``value``: numeric value.
- ``scope``: what the value is parameterized by (``"k"``, ``"time"``,
  ``"window"``, etc., or ``None`` for global scalars).
- ``scope_value``: value of ``scope`` for this record.

Optional columns (``group``, ``condition``, ``pair``, ``subject``,
``session``, ``seed``, ``fold``) survive when present.

This is the same shape consumed by:

- :class:`coco_pipe.dim_reduction.evaluation.MethodSelector` for ranking,
- :func:`coco_pipe.viz.plot_metrics` for visualization,
- :meth:`coco_pipe.report.Report.add_comparison` for report sections.

.. admonition:: Why post-hoc?

   Some users score and rank in the same script; others score on a cluster and
   compare later. ``MethodSelector`` accepts either a list of scored
   :class:`~coco_pipe.dim_reduction.DimReduction` objects or a frame of tidy
   records (``MethodSelector.from_records`` / ``from_frame``) so both flows
   share the same ranking semantics.

---

6. Interpretation Is Not Preservation Scoring
==============================================

Preservation tells you *whether* an embedding faithfully represents the
original data; interpretation tells you *which input features* appear to drive
each embedding axis. These are independent questions:

- A PCA embedding can have perfect ``trustworthiness`` *and* a misleading
  feature interpretation if multiple features are collinear.
- A non-linear embedding can be highly informative even with weak Spearman
  correlations between input features and embedding axes.

The three interpretation backends ("correlation", "perturbation", "gradient")
target different reducer classes and computational budgets. See
:ref:`dim-reduction-interpretation`.

---

7. Lazy Optional Dependencies
==============================

Heavy libraries (``torch``, ``umap-learn``, ``dask``, ``pydmd``, ``ivis``…)
are imported inside reducer methods, not at package import.
``import coco_pipe.dim_reduction`` is safe even if you only have the base
scientific Python stack. See :ref:`dim-reduction-dependencies` for which
extras unlock which methods.

---

8. Reducer Capability Contracts
=================================

Every reducer exposes a ``capabilities`` dict that the manager and the
evaluator inspect. Common flags include:

- ``is_linear``: whether components are linear projections of inputs.
- ``has_components``: whether component loadings can be extracted via
  ``DimReduction.get_components()``.
- ``has_loss_history``: whether training loss is available for plotting.
- ``input_ndim`` / ``input_layout``: expected input shape, used to validate
  inputs early.

Custom reducers declare their own capabilities by overriding the property; see
:ref:`dim-reduction-custom-reducers`.
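
As a rough illustration of the capability pattern, the sketch below uses
hypothetical stand-in classes (``BaseReducerSketch``, ``CenteredProjection``)
and a hypothetical ``check_input`` helper. None of these names come from
``coco_pipe``; only the flag names are taken from the list above, and the
default values are assumptions.

.. code-block:: python

   import numpy as np


   class BaseReducerSketch:
       """Hypothetical stand-in for a reducer base class (not the real API)."""

       @property
       def capabilities(self):
           # Assumed defaults, for illustration only.
           return {
               "is_linear": False,
               "has_components": False,
               "has_loss_history": False,
               "input_ndim": 2,
           }


   class CenteredProjection(BaseReducerSketch):
       """Hypothetical custom reducer that overrides the property."""

       @property
       def capabilities(self):
           caps = dict(super().capabilities)
           caps.update(is_linear=True, has_components=True)
           return caps


   def check_input(reducer, X):
       """Validate the input dimensionality before any fitting happens."""
       X = np.asarray(X)
       expected = reducer.capabilities["input_ndim"]
       if X.ndim != expected:
           raise ValueError(f"Expected a {expected}D input, got ndim={X.ndim}")
       return X


   reducer = CenteredProjection()
   print(reducer.capabilities["is_linear"])             # True
   print(check_input(reducer, np.zeros((5, 3))).shape)  # (5, 3)

Reading flags from one dict lets a caller reject an incompatible input or skip
an unsupported plot without knowing the concrete reducer class.
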