.. _dim-reduction-evaluation:

=============================
Evaluation and Interpretation
=============================

Evaluation and Method Comparison
================================

The evaluation layer answers two questions:

1. **For one embedding**: how well does it preserve the structure of the
   original data?
2. **Across multiple embeddings**: which reducer should I prefer for this
   dataset?

Both questions flow through a single pure evaluator
(:func:`~coco_pipe.dim_reduction.evaluation.core.evaluate_embedding`) that
emits tidy long-form records. Those records are then consumed either through
manager scoring or through
:class:`~coco_pipe.dim_reduction.evaluation.MethodSelector` for ranking.

----

1. Standard 2D Metric Catalog
-----------------------------

All standard metrics operate on an embedding of shape
``(n_samples, n_components)`` and the corresponding original ``X`` with shape
``(n_samples, n_features)``. ``trustworthiness``, ``continuity``, and ``lcmc``
are computed from a shared co-ranking matrix; ``shepard_correlation`` is
distance-based.

================================  ==================================================
Metric                            What it measures
================================  ==================================================
``trustworthiness``               Penalizes *intrusions* — points that appear in the
                                  embedding's ``k``-nearest neighbors but were not
                                  in the original's ``k``-NN. ``[0, 1]``, higher is
                                  better.
``continuity``                    Penalizes *extrusions* — points that were in the
                                  original's ``k``-NN but were pushed out of the
                                  embedding's ``k``-NN. ``[0, 1]``, higher is better.
``lcmc``                          Local Continuity Meta-Criterion: overlap of the
                                  original and embedding ``k``-NN sets, normalized.
``mrre_intrusion``                Mean Relative Rank Error, intrusion component.
                                  Lower is better.
``mrre_extrusion``                Mean Relative Rank Error, extrusion component.
                                  Lower is better.
``mrre_total``                    Mean Relative Rank Error, intrusion and extrusion
                                  components combined. Lower is better.
``shepard_correlation``           Spearman correlation between original and embedded
                                  pairwise distances, computed on a subsample.
================================  ==================================================
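
The intrusion and extrusion vocabulary can be made concrete with a few lines
of NumPy. This is an illustrative sketch, not the library's implementation:
``knn_sets`` and ``neighbor_changes`` are hypothetical helper names, the toy
coordinates are made up, and no rank weighting or normalization is applied.

.. code-block:: python

   import numpy as np


   def knn_sets(points, k):
       """Return, for each row, the index set of its k nearest neighbors."""
       diffs = points[:, None, :] - points[None, :, :]
       dists = np.sqrt((diffs ** 2).sum(axis=-1))
       np.fill_diagonal(dists, np.inf)        # a point is not its own neighbor
       order = np.argsort(dists, axis=1)[:, :k]
       return [set(row.tolist()) for row in order]


   def neighbor_changes(X, X_emb, k):
       """Per sample: neighbors gained (intruders) and lost (extruded)."""
       original = knn_sets(np.asarray(X, dtype=float), k)
       embedded = knn_sets(np.asarray(X_emb, dtype=float), k)
       intruders = [e - o for o, e in zip(original, embedded)]
       extruded = [o - e for o, e in zip(original, embedded)]
       return intruders, extruded


   # Four points on a line; the "embedding" pulls the last point inward.
   X = np.array([[0.0], [1.0], [3.0], [7.0]])
   X_emb = np.array([[0.0], [1.0], [3.0], [1.5]])
   intruders, extruded = neighbor_changes(X, X_emb, k=1)

For a fixed ``k`` the two sets always have the same size for a given sample.
``trustworthiness`` and ``continuity`` differ in the weighting: intruders are
penalized by their rank in the original space, extruded points by their rank
in the embedding.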

The co-ranking-based metrics are only valid when ``k`` is small enough
relative to the sample count, specifically when
``2 * n_samples - 3 * k - 1 > 0``. The evaluator checks this condition before
computing anything and surfaces a clear error if it fails.
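
As a worked example of that inequality, the sketch below filters a list of
requested ``k`` values before scoring. The helper names are hypothetical and
only restate the condition; they are not part of the library.

.. code-block:: python

   def coranking_domain_ok(n_samples, k):
       """True when 2 * n_samples - 3 * k - 1 > 0."""
       return 2 * n_samples - 3 * k - 1 > 0


   def max_valid_k(n_samples):
       """Largest integer k that satisfies the condition (0 if none does)."""
       # 3 * k < 2 * n_samples - 1  <=>  k <= (2 * n_samples - 2) // 3
       return max((2 * n_samples - 2) // 3, 0)


   n_samples = 30
   requested = [5, 10, 20]
   usable = [k for k in requested if coranking_domain_ok(n_samples, k)]
   largest = max_valid_k(n_samples)

With 30 samples the condition holds up to ``k = 19``, so ``k = 20`` is dropped
from the requested list.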

.. code-block:: python

   from coco_pipe.dim_reduction import DimReduction, trustworthiness, continuity, lcmc
   from coco_pipe.dim_reduction.evaluation.metrics import compute_coranking_matrix

   reducer = DimReduction("PCA", n_components=2)
   embedding = reducer.fit_transform(X)

   # Direct use of the primitives (rare — usually done via score()):
   Q = compute_coranking_matrix(X, embedding)
   print(trustworthiness(Q, k=10), continuity(Q, k=10), lcmc(Q, k=10))

In practice, prefer the manager:

.. code-block:: python

   reducer.score(embedding, X=X, k_values=[5, 10, 20])
   reducer.metrics_           # scalar summaries
   reducer.metric_records_    # tidy long-form, one row per (metric, k)

----

2. Trajectory Metrics (Native 3D Paths)
---------------------------------------

When ``X_emb.shape == (n_trajectories, n_times, n_dims)``, the evaluator
switches to trajectory metrics. They are covered in detail in
:ref:`dim-reduction-trajectories`.

----

3. Calling the Pure Evaluator Directly
--------------------------------------

Most workflows go through ``DimReduction.score``, but the pure evaluator is
public for advanced use:

.. code-block:: python

   from coco_pipe.dim_reduction.evaluation.core import evaluate_embedding

   payload = evaluate_embedding(
       X_emb=embedding,
       X=X,
       method_name="UMAP",
       metrics=["trustworthiness", "continuity"],
       k_values=[5, 10, 20],
       random_state=42,
   )
   payload["metrics"]       # scalar summaries
   payload["records"]       # tidy long-form, ready for plotting / reports
   payload["diagnostics"]   # arrays (e.g., coranking_matrix, shepard_distances)

Inputs:

================================  ==========================================
``X_emb``                         2D ``(n_samples, n_components)`` for
                                  standard metrics; 3D
                                  ``(n_trajectories, n_times, n_dims)`` for
                                  trajectory metrics.
``X``                             Required for 2D paths; optional for 3D.
``metrics``                       Optional metric subset; defaults to every
                                  metric applicable to the input shape.
``labels`` / ``groups``           Used by supervised separation metrics and
                                  ``trajectory_separation``.
``times``                         Optional time coords for trajectory AUC.
``random_state``                  Seed for sampled Shepard distances.
================================  ==========================================

Output: a dict with keys ``embedding``, ``metrics``, ``metadata``,
``diagnostics``, ``records``, ``artifacts``.

----

4. Tidy Records Schema
----------------------

Every record is a flat dictionary with at minimum:

================  =================================================
``method``        Reducer name (filled in by the manager / selector).
``metric``        Metric name (e.g., ``"trustworthiness"``).
``value``         Numeric value.
``scope``         Parameter dimension this row is parameterized by
                  (``"k"``, ``"time"``, ``"window"``, ``"pair"``, …) or
                  ``None`` for global scalars.
``scope_value``   Value of ``scope`` for this row.
================  =================================================

Optional columns survive when present: ``group``, ``condition``, ``pair``,
``subject``, ``session``, ``seed``, ``fold``. These are not required by the
selector but pass through to plots and reports unchanged.

This is the same shape consumed by:

- :func:`coco_pipe.viz.plot_metrics` for visualization,
- :class:`~coco_pipe.dim_reduction.evaluation.MethodSelector` for ranking,
- :meth:`coco_pipe.report.Report.add_comparison` for report sections.
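
To make the schema concrete, here is a hand-written set of records and a small
filter over them. The values are made up for illustration, ``select`` and
``missing_columns`` are hypothetical helpers, and treating
``shepard_correlation`` as a global scalar (``scope=None``) is an assumption.

.. code-block:: python

   # Hand-written example rows; "seed" is one of the optional columns.
   records = [
       {"method": "PCA", "metric": "trustworthiness", "value": 0.91,
        "scope": "k", "scope_value": 5, "seed": 0},
       {"method": "PCA", "metric": "trustworthiness", "value": 0.88,
        "scope": "k", "scope_value": 10, "seed": 0},
       {"method": "UMAP", "metric": "trustworthiness", "value": 0.95,
        "scope": "k", "scope_value": 10, "seed": 0},
       {"method": "PCA", "metric": "shepard_correlation", "value": 0.84,
        "scope": None, "scope_value": None, "seed": 0},
   ]

   REQUIRED = ("method", "metric", "value", "scope", "scope_value")


   def missing_columns(record):
       """Names of required columns that a record lacks."""
       return [name for name in REQUIRED if name not in record]


   def select(records, metric, scope=None, scope_value=None):
       """Rows for one metric at one scope value (defaults pick global scalars)."""
       return [
           r for r in records
           if r["metric"] == metric
           and r["scope"] == scope
           and r["scope_value"] == scope_value
       ]


   at_k10 = select(records, "trustworthiness", scope="k", scope_value=10)
   by_method = {r["method"]: r["value"] for r in at_k10}

Because every row carries its own ``scope`` and ``scope_value``, the same
filter works for ``k``-scoped metrics and for global scalars.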

----

5. ``MethodSelector``: Post-Hoc Comparison and Ranking
------------------------------------------------------

:class:`~coco_pipe.dim_reduction.evaluation.MethodSelector` is a thin
collector and ranker over already-scored reducers. It never fits or scores
anything; it only reads the records that each reducer has already cached.

5.1 Construction
~~~~~~~~~~~~~~~~

.. code-block:: python

   from coco_pipe.dim_reduction import DimReduction
   from coco_pipe.dim_reduction.evaluation import MethodSelector

   reducers = [DimReduction(m, n_components=2) for m in ["PCA", "UMAP", "Isomap"]]
   for r in reducers:
       emb = r.fit_transform(X)
       r.score(emb, X=X, k_values=[5, 10, 20])

   selector = MethodSelector(reducers).collect()
   # Or: MethodSelector({"pca": pca_reducer, "umap": umap_reducer}).collect()

You can also build from existing records:

.. code-block:: python

   selector = MethodSelector.from_records(metric_records)
   selector = MethodSelector.from_frame(metric_frame)

5.2 Frame Export and Ranking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   frame = selector.to_frame()         # tidy DataFrame
   ranked = selector.rank_methods(
       selection_metric="trustworthiness",
       selection_k=10,
       tie_breakers=["continuity"],
   )
   best_name = ranked.iloc[0]["method"]

``rank_methods`` ranks methods by the **mean** of the selected metric,
computed per method. For ``k``-scoped metrics, ``selection_k`` restricts the
comparison to one neighborhood size. Ties on the selection metric are broken
by the ``tie_breakers`` metrics, applied in the order given.
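
The ranking rule can be sketched in plain Python over tidy records. This is
not the selector's implementation: ``mean_by_method`` and ``rank_sketch`` are
hypothetical names, the values are made up, and the sketch assumes that higher
is better for every metric involved.

.. code-block:: python

   from collections import defaultdict
   from statistics import mean

   # (method, metric, k, value) tuples with made-up values; two rows per
   # method stand for repeated runs that get averaged.
   rows = [
       ("PCA", "trustworthiness", 10, 0.875),
       ("PCA", "trustworthiness", 10, 0.9375),
       ("UMAP", "trustworthiness", 10, 0.90625),
       ("UMAP", "trustworthiness", 10, 0.90625),
       ("Isomap", "trustworthiness", 10, 0.75),
       ("PCA", "trustworthiness", 5, 0.5),      # ignored when selection_k=10
       ("PCA", "continuity", 10, 0.75),
       ("UMAP", "continuity", 10, 0.875),
       ("Isomap", "continuity", 10, 0.9375),
   ]
   records = [
       {"method": m, "metric": name, "value": v, "scope": "k", "scope_value": k}
       for m, name, k, v in rows
   ]


   def mean_by_method(records, metric, k):
       """Mean value per method for one metric at one neighborhood size."""
       values = defaultdict(list)
       for r in records:
           if r["metric"] == metric and r["scope"] == "k" and r["scope_value"] == k:
               values[r["method"]].append(r["value"])
       return {method: mean(vals) for method, vals in values.items()}


   def rank_sketch(records, selection_metric, selection_k, tie_breakers=()):
       """Order methods by mean selection metric, then by each tie-breaker."""
       tables = [
           mean_by_method(records, metric, selection_k)
           for metric in (selection_metric, *tie_breakers)
       ]

       def sort_key(method):
           # Tuple comparison applies the tie-breakers in the order given.
           return tuple(table.get(method, float("-inf")) for table in tables)

       return sorted(tables[0], key=sort_key, reverse=True)


   ranking = rank_sketch(records, "trustworthiness", 10, tie_breakers=["continuity"])

Here ``PCA`` and ``UMAP`` tie on mean trustworthiness at ``k = 10``, so the
``continuity`` tie-breaker decides their order.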

5.3 Failure modes the selector catches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Reducers without cached ``metric_records_`` (you forgot to call
  ``score()``).
- Asking to rank by a metric that no reducer ever computed.
- Asking for a ``selection_k`` that none of the records cover.

These all raise :class:`ValueError` at ranking time with a specific message.

----

6. Driving Evaluation From ``EvaluationConfig``
-----------------------------------------------

When the same metric stack is used across many experiments, drive everything
from one :class:`~coco_pipe.dim_reduction.config.EvaluationConfig`:

.. code-block:: python

   from coco_pipe.dim_reduction.config import EvaluationConfig
   from coco_pipe.dim_reduction.evaluation import MethodSelector

   eval_cfg = EvaluationConfig(
       metrics=["trustworthiness", "continuity", "lcmc"],
       k_range=[5, 10, 20],
       selection_metric="trustworthiness",
       selection_k=10,
       tie_breakers=["continuity"],
       separation_method="centroid",
   )

   for r in reducers:
       emb = r.fit_transform(X)
       r.score(emb, X=X,
               metrics=eval_cfg.metrics,
               k_values=eval_cfg.k_range,
               separation_method=eval_cfg.separation_method)

   ranked = MethodSelector(reducers).collect().rank_methods(
       selection_metric=eval_cfg.selection_metric,
       selection_k=eval_cfg.selection_k,
       tie_breakers=eval_cfg.tie_breakers,
   )

.. _dim-reduction-interpretation:

Feature Interpretation
======================

Interpretation answers: *which input features appear to drive each embedding
axis*? This is independent of preservation scoring (covered in
:ref:`dim-reduction-evaluation`).

Three backends with different cost / reducer-class tradeoffs are available
through :meth:`coco_pipe.dim_reduction.DimReduction.interpret` and the pure
backend :func:`coco_pipe.dim_reduction.analysis.interpret_features`.

----

1. Backends at a Glance
-----------------------

==================  ==========================================  ========================
Backend             What it measures                            Reducer requirements
==================  ==========================================  ========================
``correlation``     Spearman correlation between each input     Any reducer (just needs
                    feature and each embedding axis.            an embedding).
``perturbation``    Mean-squared shift in the embedding when    Any reducer with a
                    each feature is independently shuffled.     fitted ``transform``.
``gradient``        Encoder saliency: ``∂‖z‖ / ∂x`` averaged    Torch-based encoders
                    over samples.                               (``IVIS``,
                                                                ``ParametricUMAP``,
                                                                ``TopologicalAE``).
==================  ==========================================  ========================

All three return tidy long-form records suitable for the same plotting and
report paths.

----

2. Correlation (Default)
------------------------

Spearman correlation between every column of ``X`` and every column of
``X_emb``. Returns a nested mapping of dimension → feature → correlation,
sorted by absolute magnitude within each dimension.

.. code-block:: python

   from coco_pipe.dim_reduction.analysis import correlate_features

   per_dim = correlate_features(X, X_emb, feature_names=feature_names)
   # {"Dimension 1": {"feat_07": -0.81, "feat_12": 0.74, ...}, ...}

When an input feature is constant or an embedding axis is degenerate (zero
variance), the Spearman coefficient is undefined; ``correlate_features``
reports those pairs as ``0.0`` so the output stays sortable.

Cost: ``O(n_features * n_components)`` Spearman computations, which is
negligible next to fitting the reducer.
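
The shape of the output and the ``0.0`` convention for undefined coefficients
can be reproduced with NumPy alone. This is a sketch under stated assumptions,
not the library's code: ``average_ranks``, ``spearman_or_zero``, and
``correlate_sketch`` are hypothetical names, and Spearman is computed as the
Pearson correlation of average ranks.

.. code-block:: python

   import numpy as np


   def average_ranks(values):
       """1-based ranks, with tied values sharing their average rank."""
       _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
       ends = np.cumsum(counts)
       starts = ends - counts + 1
       return ((starts + ends) / 2.0)[inverse]


   def spearman_or_zero(a, b):
       """Spearman correlation, or 0.0 when either input has no variation."""
       ra, rb = average_ranks(a), average_ranks(b)
       if ra.max() == ra.min() or rb.max() == rb.min():
           return 0.0
       return float(np.corrcoef(ra, rb)[0, 1])


   def correlate_sketch(X, X_emb, feature_names):
       """Nested mapping: dimension -> feature -> correlation, sorted by |rho|."""
       result = {}
       for d in range(X_emb.shape[1]):
           scores = {
               name: spearman_or_zero(X[:, j], X_emb[:, d])
               for j, name in enumerate(feature_names)
           }
           ordered = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
           result[f"Dimension {d + 1}"] = dict(ordered)
       return result


   a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
   b = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
   const = np.full(5, 7.0)
   X = np.column_stack([a, b, const])
   X_emb = np.column_stack([a ** 3, -b])   # monotone in a; reversed in b
   per_dim = correlate_sketch(X, X_emb, ["a", "b", "const"])

Because Spearman depends only on ranks, the cubic transform of ``a`` still
correlates perfectly with ``a``, and the constant feature sorts last with a
score of ``0.0``.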

----

3. Perturbation Importance
--------------------------

Model-agnostic. For each feature, shuffle it ``n_repeats`` times, ask the
reducer to re-embed, and measure mean squared deviation from the reference
embedding. Aggregate across repeats and normalize so importances sum to 1.

.. code-block:: python

   from coco_pipe.dim_reduction.analysis import perturbation_importance

   scores = perturbation_importance(
       reducer.reducer,                # the underlying fitted reducer
       X,
       feature_names=feature_names,
       X_emb=X_emb,
       n_repeats=5,
       random_state=42,
   )
   # {"feat_07": 0.31, "feat_12": 0.18, ...}

Cost: ``n_features * n_repeats`` calls to ``transform``. This is slow for any
reducer whose ``transform`` is expensive, and unavailable for reducers such as
``TSNE`` that do not expose ``transform`` at all.

Caveats:

- **Requires ``transform``.** Non-parametric methods (``TSNE``, ``MDS``,
  ``PHATE``, ``Isomap``, ``LLE``, ``SpectralEmbedding``) do not implement it.
  Use ``correlation`` or fit a parametric proxy.
- **Correlated features dilute importance.** If two features are perfectly
  correlated, the signal is shared between them, so shuffling one changes the
  embedding less than shuffling a unique feature would. Both can look less
  important than the pair really is.
- **Stochastic.** Set ``random_state`` for reproducibility.
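
The shuffle, re-embed, and compare loop described in this section can be
sketched with NumPy. Everything here is illustrative: ``LinearProjector`` is a
hypothetical stand-in for a fitted reducer with ``transform``, and
``perturbation_sketch`` is not the library's ``perturbation_importance``.

.. code-block:: python

   import numpy as np


   class LinearProjector:
       """Stand-in for a fitted reducer: a fixed linear map with ``transform``."""

       def __init__(self, weights):
           self.weights = np.asarray(weights, dtype=float)

       def transform(self, X):
           return X @ self.weights


   def perturbation_sketch(model, X, feature_names, n_repeats=5, random_state=0):
       """Normalized mean squared embedding shift per shuffled feature."""
       rng = np.random.default_rng(random_state)
       reference = model.transform(X)
       raw = np.zeros(X.shape[1])
       for j in range(X.shape[1]):
           shifts = []
           for _ in range(n_repeats):
               X_perm = X.copy()
               X_perm[:, j] = rng.permutation(X_perm[:, j])   # shuffle one feature
               shifts.append(np.mean((model.transform(X_perm) - reference) ** 2))
           raw[j] = np.mean(shifts)
       total = raw.sum()
       scores = raw / total if total > 0 else raw   # importances sum to 1
       return dict(zip(feature_names, scores.tolist()))


   rng = np.random.default_rng(42)
   X = rng.normal(size=(200, 3))
   # Feature 0 drives axis 1 strongly, feature 1 drives axis 2, feature 2 is unused.
   model = LinearProjector([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
   scores = perturbation_sketch(model, X, ["strong", "weak", "unused"], n_repeats=5)

The unused feature receives zero importance because shuffling it leaves the
projection unchanged, and the remaining mass is split according to how far
each shuffle moves the embedding.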

----

4. Gradient Saliency
--------------------

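Encoder-based methods can compute input sensitivities by automatic
differentiation. The backend calls ``wrapper.get_pytorch_module()``, runs
``z = encoder(x)``, backpropagates ``z.sum()``, and averages the absolute
gradients across the sample axis. Because the backpropagated scalar is
``z.sum()``, the ``∂‖z‖ / ∂x`` shorthand stands for the gradient of the summed
embedding coordinates, not of a vector norm.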

.. code-block:: python

   from coco_pipe.dim_reduction.analysis import gradient_importance

   scores = gradient_importance(
       reducer.reducer,                # the underlying torch-backed reducer
       X,
       feature_names=feature_names,
   )
   # {"feat_07": 0.41, ...} for 1D inputs;
   # {"importance_matrix": ndarray} for higher-rank inputs.

Cost: one forward + one backward pass. The cheapest option *when applicable*.

Requirements:

- The reducer must expose ``get_pytorch_module()`` returning a module with an
  ``encoder`` submodule.
- Currently supported: ``IVIS``, ``ParametricUMAP``, ``TopologicalAE``.
- ``torch`` must be installed. Use the ``[topology]`` or ``[ivis]`` extras
  depending on the reducer.
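
To show what the saliency score measures without requiring ``torch``, the
sketch below swaps autograd for a closed-form gradient on a toy one-layer
``tanh`` encoder and checks it against central finite differences. The
encoder, its weights, and all function names are assumptions made for
illustration; the library's backend uses the PyTorch module instead.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   W = rng.normal(size=(4, 2))     # toy encoder weights: 4 features -> 2 dims
   X = rng.normal(size=(50, 4))


   def encoder(x):
       return np.tanh(x @ W)


   def saliency_closed_form(x):
       """Mean over samples of |d z.sum() / d x| for z = tanh(x @ W)."""
       z = encoder(x)
       grads = (1.0 - z ** 2) @ W.T          # chain rule, one row per sample
       return np.abs(grads).mean(axis=0)


   def saliency_finite_difference(x, eps=1e-6):
       """The same quantity approximated by central differences."""
       grads = np.zeros_like(x)
       for j in range(x.shape[1]):
           step = np.zeros(x.shape[1])
           step[j] = eps
           plus = encoder(x + step).sum(axis=1)
           minus = encoder(x - step).sum(axis=1)
           grads[:, j] = (plus - minus) / (2 * eps)
       return np.abs(grads).mean(axis=0)


   exact = saliency_closed_form(X)
   approx = saliency_finite_difference(X)

Both functions return one non-negative score per input feature. Autograd
yields the same per-sample gradients as the closed form in a single backward
pass, which is why this backend is cheap when an encoder is available.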

----

5. Unified Backend: ``interpret_features``
------------------------------------------

For most workflows, use the high-level backend directly through the manager:

.. code-block:: python

   result = reducer.interpret(
       X,
       X_emb=embedding,
       analyses=["correlation", "perturbation"],
       feature_names=feature_names,
       n_repeats=5,
       random_state=42,
   )
   result["analysis"]   # dict keyed by analysis name
   result["records"]    # tidy long-form records

The same backend is also importable as a pure function:

.. code-block:: python

   from coco_pipe.dim_reduction.analysis import interpret_features

   payload = interpret_features(
       X,
       X_emb=embedding,
       model=reducer.reducer,
       analyses=["correlation", "perturbation"],
       feature_names=feature_names,
       method_name="UMAP",
       n_repeats=5,
       random_state=42,
   )

Outputs are cached on ``DimReduction.interpretation_`` and
``DimReduction.interpretation_records_`` so subsequent plotting and reporting
don't need to recompute.

----

6. Visualization
----------------

Tidy records flow straight into
:func:`coco_pipe.viz.plot_reduction_feature_importance` and
:func:`coco_pipe.viz.plot_feature_correlation_heatmap`.

.. code-block:: python

   from coco_pipe.viz import (
       plot_reduction_feature_importance,
       plot_feature_correlation_heatmap,
   )

   plot_reduction_feature_importance(
       reducer.interpretation_records_,
       analysis="perturbation",
       method=reducer.method,
       top_n=20,
   )
   plot_feature_correlation_heatmap(
       reducer.interpretation_["correlation"],
       method=reducer.method,
   )

----

7. Choosing a Backend
---------------------

- **First pass on any reducer**: ``correlation``, which is cheap and always
  available.
- **Parametric reducer with a fitted ``transform``**: ``perturbation``, which
  measures actual input-to-output sensitivity at the cost of
  ``n_features * n_repeats`` calls to ``transform``.
- **Encoder-based reducer**: ``gradient``, the cheapest sensitivity measure
  when it applies (one forward and one backward pass).

Combine them: ``correlation`` for ranking, ``perturbation`` or ``gradient``
for the final published interpretation.