Evaluation and Interpretation#

Evaluation and Method Comparison#

The evaluation layer answers two questions:

  1. For one embedding: how well does it preserve the structure of the original data?

  2. Across multiple embeddings: which reducer should I prefer for this dataset?

Both flow through a single pure evaluator, evaluate_embedding(), which emits tidy long-form records. Those records are then consumed either through manager scoring or through MethodSelector for ranking.

1. Standard 2D Metric Catalog#

All standard metrics operate on an embedding of shape (n_samples, n_components) and the corresponding original X with shape (n_samples, n_features). The first three are computed from a shared co-ranking matrix; the last is distance-based.

| Metric | What it measures |
| --- | --- |
| trustworthiness | Penalizes intrusions: points that appear in the embedding’s k-nearest neighbors but were not in the original’s k-NN. Range [0, 1], higher is better. |
| continuity | Penalizes extrusions: points that were in the original’s k-NN but were pushed out of the embedding’s k-NN. Range [0, 1], higher is better. |
| lcmc | Local Continuity Meta-Criterion: overlap of the original and embedding k-NN sets, normalized. |
| mrre_intrusion / mrre_extrusion / mrre_total | Mean Relative Rank Error, split into intrusion and extrusion components and combined as mrre_total. Lower is better. |
| shepard_correlation | Spearman correlation between original and embedded pairwise distances, computed on a subsample. |

The co-ranking-based metrics share a validity domain that depends on the sample size: 2 * n_samples - 3 * k - 1 > 0, which for integer k means k <= (2 * n_samples - 2) // 3. The evaluator validates this before computing and surfaces a clear error if it fails.
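The bound can also be checked up front, before any metric is computed. The helper below is a minimal sketch and not part of coco_pipe: the names max_valid_k and check_k_values are hypothetical, and only the inequality itself is taken from the text above.

```python
def max_valid_k(n_samples: int) -> int:
    """Largest integer k with 2 * n_samples - 3 * k - 1 > 0."""
    # 3k < 2n - 1  <=>  3k <= 2n - 2  <=>  k <= (2n - 2) // 3
    return (2 * n_samples - 2) // 3


def check_k_values(n_samples: int, k_values: list[int]) -> list[int]:
    """Return k_values as a list, or raise if any k is outside the domain."""
    bad = [k for k in k_values if k < 1 or 2 * n_samples - 3 * k - 1 <= 0]
    if bad:
        raise ValueError(
            f"k values {bad} are invalid for n_samples={n_samples}; "
            f"the largest valid k is {max_valid_k(n_samples)}"
        )
    return list(k_values)


print(max_valid_k(30))                  # 19
print(check_k_values(30, [5, 10, 19]))  # [5, 10, 19]
```

With 30 samples, k=19 still satisfies the inequality (60 - 57 - 1 = 2), while k=20 does not (60 - 60 - 1 = -1).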

from coco_pipe.dim_reduction import DimReduction, trustworthiness, continuity, lcmc
from coco_pipe.dim_reduction.evaluation.metrics import compute_coranking_matrix

reducer = DimReduction("PCA", n_components=2)
embedding = reducer.fit_transform(X)

# Direct use of the primitives (rare — usually done via score()):
Q = compute_coranking_matrix(X, embedding)
print(trustworthiness(Q, k=10), continuity(Q, k=10), lcmc(Q, k=10))
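For intuition about what these primitives compute, here is a NumPy-only sketch of a co-ranking matrix and three scores derived from it. It is an illustration under stated assumptions, not the coco_pipe implementation: all function names are hypothetical, distances are Euclidean, and the formulas are the common textbook definitions (the library's normalization, especially for lcmc, may differ).

```python
import numpy as np


def neighbor_ranks(points: np.ndarray) -> np.ndarray:
    """ranks[i, j] = rank of j among the neighbors of i (self has rank 0)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, -1.0)  # force each point to be its own rank-0 neighbor
    order = np.argsort(dist, axis=1, kind="stable")
    n = points.shape[0]
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return ranks


def coranking_sketch(X_high: np.ndarray, X_low: np.ndarray) -> np.ndarray:
    """Q[a, b] counts pairs with original rank a + 1 and embedded rank b + 1."""
    n = X_high.shape[0]
    rho = neighbor_ranks(X_high)  # ranks in the original space
    r = neighbor_ranks(X_low)     # ranks in the embedding
    off_diag = ~np.eye(n, dtype=bool)
    Q = np.zeros((n - 1, n - 1), dtype=np.int64)
    np.add.at(Q, (rho[off_diag] - 1, r[off_diag] - 1), 1)
    return Q


def trustworthiness_sketch(Q: np.ndarray, k: int) -> float:
    n = Q.shape[0] + 1
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    excess = np.arange(1, n - k)  # original rank minus k, for ranks k+1 .. n-1
    return 1.0 - norm * float((excess[:, None] * Q[k:, :k]).sum())


def continuity_sketch(Q: np.ndarray, k: int) -> float:
    n = Q.shape[0] + 1
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    excess = np.arange(1, n - k)  # embedded rank minus k, for ranks k+1 .. n-1
    return 1.0 - norm * float((excess[None, :] * Q[:k, k:]).sum())


def lcmc_sketch(Q: np.ndarray, k: int) -> float:
    n = Q.shape[0] + 1
    return k / (1 - n) + float(Q[:k, :k].sum()) / (n * k)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 6))
Q_same = coranking_sketch(X_demo, X_demo)         # perfect "embedding": Q is diagonal
Q_proj = coranking_sketch(X_demo, X_demo[:, :2])  # crude 2D embedding: keep two columns
print(trustworthiness_sketch(Q_same, k=5))        # 1.0
print(trustworthiness_sketch(Q_proj, k=5), continuity_sketch(Q_proj, k=5))
```

Intrusions live in the lower-left block of Q (large original rank, small embedded rank) and extrusions in the upper-right block, which is why one matrix serves all three scores.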

In practice, prefer the manager:

reducer.score(embedding, X=X, k_values=[5, 10, 20])
reducer.metrics_           # scalar summaries
reducer.metric_records_    # tidy long-form, one row per (metric, k)

2. Trajectory Metrics (Native 3D Paths)#

When X_emb.shape == (n_trajectories, n_times, n_dims), the evaluator switches to trajectory metrics. They are covered in detail in Trajectory Analysis.

3. Calling the Pure Evaluator Directly#

Most workflows go through DimReduction.score, but the pure evaluator is public for advanced use:

from coco_pipe.dim_reduction.evaluation.core import evaluate_embedding

payload = evaluate_embedding(
    X_emb=embedding,
    X=X,
    method_name="UMAP",
    metrics=["trustworthiness", "continuity"],
    k_values=[5, 10, 20],
    random_state=42,
)
payload["metrics"]       # scalar summaries
payload["records"]       # tidy long-form, ready for plotting / reports
payload["diagnostics"]   # arrays (e.g., coranking_matrix, shepard_distances)

Inputs:

X_emb

2D (n_samples, n_components) for standard metrics; 3D (n_trajectories, n_times, n_dims) for trajectory metrics.

X

Required for 2D paths; optional for 3D.

metrics

Optional metric subset; defaults to “all applicable for the shape”.

labels / groups

Used by supervised separation metrics and trajectory_separation.

times

Optional time coords for trajectory AUC.

random_state

Seed for sampled Shepard distances.
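To see why random_state matters for the Shepard metric, the sketch below computes a Spearman correlation between original and embedded pairwise distances on a random subsample of points. It is only an illustration: the function name, the sample_size parameter, and the subsampling scheme are assumptions, and the rank helper ignores ties for brevity.

```python
import numpy as np


def ranks_no_ties(values: np.ndarray) -> np.ndarray:
    """Ordinal ranks (0-based); ties are broken by position, not averaged."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def shepard_sketch(X, X_emb, sample_size=50, random_state=None) -> float:
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    # Subsample points, then use every pair among the sampled points.
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    i, j = np.triu_indices(len(idx), k=1)
    d_high = np.linalg.norm(X[idx][i] - X[idx][j], axis=1)
    d_low = np.linalg.norm(X_emb[idx][i] - X_emb[idx][j], axis=1)
    # Spearman correlation = Pearson correlation of the ranks.
    return float(np.corrcoef(ranks_no_ties(d_high), ranks_no_ties(d_low))[0, 1])


rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 5))
rho_scaled = shepard_sketch(X_demo, 2.0 * X_demo, random_state=42)  # distance order unchanged
rho_proj_a = shepard_sketch(X_demo, X_demo[:, :2], random_state=42)
rho_proj_b = shepard_sketch(X_demo, X_demo[:, :2], random_state=42)
print(rho_scaled, rho_proj_a == rho_proj_b)
```

Because only a subset of points is used, two runs with different seeds can give slightly different values; fixing the seed makes the result repeatable.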

Output: a dict with keys embedding, metrics, metadata, diagnostics, records, artifacts.

4. Tidy Records Schema#

Every record is a flat dictionary with at minimum:

method

Reducer name (filled in by the manager / selector).

metric

Metric name (e.g., "trustworthiness").

value

Numeric value.

scope

Parameter dimension this row is parameterized by ("k", "time", "window", "pair", …) or None for global scalars.

scope_value

Value of scope for this row.

Optional columns survive when present: group, condition, pair, subject, session, seed, fold. These are not required by the selector but pass through to plots and reports unchanged.
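As a concrete picture of the schema, the sketch below builds a few records by hand and filters them. The numeric values are made up for illustration, and the helpers missing_keys and select are hypothetical, not coco_pipe functions.

```python
# Keys every record must carry, per the schema above.
REQUIRED_KEYS = {"method", "metric", "value", "scope", "scope_value"}

# Made-up values, for illustration only.
records = [
    {"method": "PCA", "metric": "trustworthiness", "value": 0.91,
     "scope": "k", "scope_value": 5},
    {"method": "PCA", "metric": "trustworthiness", "value": 0.93,
     "scope": "k", "scope_value": 10},
    # A global scalar: no parameter dimension, so scope is None.
    {"method": "PCA", "metric": "shepard_correlation", "value": 0.88,
     "scope": None, "scope_value": None},
    # Optional columns such as "seed" simply ride along.
    {"method": "PCA", "metric": "continuity", "value": 0.90,
     "scope": "k", "scope_value": 10, "seed": 0},
]


def missing_keys(record: dict) -> set:
    """Required keys that a record lacks (empty set when it is valid)."""
    return REQUIRED_KEYS - record.keys()


def select(records: list, metric: str, scope_value=None) -> list:
    """Rows for one metric, optionally narrowed to one scope_value."""
    return [
        r for r in records
        if r["metric"] == metric
        and (scope_value is None or r["scope_value"] == scope_value)
    ]


print([missing_keys(r) for r in records])                 # four empty sets
print(select(records, "trustworthiness", scope_value=10))
```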

This is the same shape consumed by MethodSelector (through from_records and from_frame) and passed on to plots and reports.

5. MethodSelector: Post-Hoc Comparison and Ranking#

MethodSelector is a thin collector and ranker over already-scored reducers. It never fits or scores anything; it only reads what is already cached.

5.1 Construction#

from coco_pipe.dim_reduction.evaluation import MethodSelector

reducers = [DimReduction(m, n_components=2) for m in ["PCA", "UMAP", "Isomap"]]
for r in reducers:
    emb = r.fit_transform(X)
    r.score(emb, X=X, k_values=[5, 10, 20])

selector = MethodSelector(reducers).collect()
# Or: MethodSelector({"pca": pca_reducer, "umap": umap_reducer}).collect()

You can also build from existing records:

selector = MethodSelector.from_records(metric_records)
selector = MethodSelector.from_frame(metric_frame)

5.2 Frame Export and Ranking#

frame = selector.to_frame()         # tidy DataFrame
ranked = selector.rank_methods(
    selection_metric="trustworthiness",
    selection_k=10,
    tie_breakers=["continuity"],
)
best_name = ranked.iloc[0]["method"]

rank_methods ranks methods by the mean of the selected metric. For k-scoped metrics, selection_k narrows the comparison to one neighborhood size. Ties are broken by each tie_breakers metric in turn.
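The ranking rule can be sketched in plain Python over tidy records. This is an illustration, not the MethodSelector code: the function names are hypothetical, the values are made up, and it assumes that higher is better for every metric involved and that tie-breakers are evaluated at the same selection_k.

```python
from collections import defaultdict
from statistics import mean


def mean_score(records, metric, k=None):
    """Mean value per method for one metric, optionally at a single k."""
    values = defaultdict(list)
    for rec in records:
        if rec["metric"] != metric:
            continue
        if k is not None and rec["scope"] == "k" and rec["scope_value"] != k:
            continue
        values[rec["method"]].append(rec["value"])
    return {method: mean(vals) for method, vals in values.items()}


def rank_sketch(records, selection_metric, selection_k=None, tie_breakers=()):
    """Method names, best first (assumes higher is better)."""
    primary = mean_score(records, selection_metric, selection_k)
    extras = [mean_score(records, m, selection_k) for m in tie_breakers]

    def sort_key(method):
        return tuple(
            [primary[method]] + [e.get(method, float("-inf")) for e in extras]
        )

    return sorted(primary, key=sort_key, reverse=True)


def rec(method, metric, k, value):
    return {"method": method, "metric": metric, "value": value,
            "scope": "k", "scope_value": k}


demo_records = [
    rec("PCA", "trustworthiness", 5, 0.99),   # ignored when selection_k=10
    rec("PCA", "trustworthiness", 10, 0.90),
    rec("UMAP", "trustworthiness", 10, 0.90),  # ties with PCA at k=10
    rec("Isomap", "trustworthiness", 10, 0.85),
    rec("PCA", "continuity", 10, 0.80),
    rec("UMAP", "continuity", 10, 0.95),       # wins the tie-break
    rec("Isomap", "continuity", 10, 0.99),
]

ranking = rank_sketch(demo_records, "trustworthiness",
                      selection_k=10, tie_breakers=["continuity"])
print(ranking)  # ['UMAP', 'PCA', 'Isomap']
```

Note that Isomap stays last despite its high continuity: tie-breakers only matter among methods that tie on the selection metric.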

5.3 Failure modes the selector catches#

  • Reducers without cached metric_records_ (you forgot to call score()).

  • Asking to rank by a metric that no reducer ever computed.

  • Asking for a selection_k that none of the records cover.

These all raise ValueError at ranking time with a specific message.
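The three checks can be pictured as a small pre-flight function. The sketch below is a hypothetical stand-in written for illustration; the function name, its arguments, and the error messages are assumptions and do not reproduce the selector's actual messages.

```python
def check_rankable(records_by_method, selection_metric, selection_k=None):
    """Raise ValueError for the three failure modes listed above."""
    # 1. Reducers that were never scored have no cached records.
    unscored = [name for name, recs in records_by_method.items() if not recs]
    if unscored:
        raise ValueError(f"no cached metric records for {unscored}; call score() first")

    # 2. The selection metric must appear in at least one record.
    matching = [
        r for recs in records_by_method.values() for r in recs
        if r["metric"] == selection_metric
    ]
    if not matching:
        raise ValueError(f"metric {selection_metric!r} was never computed")

    # 3. For k-scoped metrics, the requested k must be covered.
    ks = sorted({r["scope_value"] for r in matching if r["scope"] == "k"})
    if selection_k is not None and ks and selection_k not in ks:
        raise ValueError(f"selection_k={selection_k} not covered; available k: {ks}")


scored = {
    "PCA": [{"method": "PCA", "metric": "trustworthiness", "value": 0.9,
             "scope": "k", "scope_value": 10}],
}
check_rankable(scored, "trustworthiness", selection_k=10)  # passes silently

for bad_call in (
    lambda: check_rankable({"PCA": []}, "trustworthiness"),
    lambda: check_rankable(scored, "continuity"),
    lambda: check_rankable(scored, "trustworthiness", selection_k=7),
):
    try:
        bad_call()
    except ValueError as err:
        print(err)
```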

6. Driving Evaluation From EvaluationConfig#

When the same metric stack is used across many experiments, drive everything from one EvaluationConfig:

from coco_pipe.dim_reduction.config import EvaluationConfig

eval_cfg = EvaluationConfig(
    metrics=["trustworthiness", "continuity", "lcmc"],
    k_range=[5, 10, 20],
    selection_metric="trustworthiness",
    selection_k=10,
    tie_breakers=["continuity"],
    separation_method="centroid",
)

for r in reducers:
    emb = r.fit_transform(X)
    r.score(emb, X=X,
            metrics=eval_cfg.metrics,
            k_values=eval_cfg.k_range,
            separation_method=eval_cfg.separation_method)

ranked = MethodSelector(reducers).collect().rank_methods(
    selection_metric=eval_cfg.selection_metric,
    selection_k=eval_cfg.selection_k,
    tie_breakers=eval_cfg.tie_breakers,
)

Feature Interpretation#

Interpretation answers: which input features appear to drive each embedding axis? This is independent of preservation scoring (covered in Evaluation and Method Comparison).

Three backends with different cost / reducer-class tradeoffs are available through coco_pipe.dim_reduction.DimReduction.interpret() and the pure backend coco_pipe.dim_reduction.analysis.interpret_features().

1. Backends at a Glance#

| Backend | What it measures | Reducer requirements |
| --- | --- | --- |
| correlation | Spearman correlation between each input feature and each embedding axis. | Any reducer (just needs an embedding). |
| perturbation | Mean-squared shift in the embedding when each feature is independently shuffled. | Any reducer with a fitted transform. |
| gradient | Encoder saliency: ∂‖z‖ / ∂x averaged over samples. | Torch-based encoders (IVIS, ParametricUMAP, TopologicalAE). |

All three return tidy long-form records suitable for the same plotting and report paths.

2. Correlation (Default)#

Spearman correlation between every column of X and every column of X_emb. Returns a nested mapping of dimension → feature → correlation, sorted by absolute magnitude within each dimension.

from coco_pipe.dim_reduction.analysis import correlate_features

per_dim = correlate_features(X, X_emb, feature_names=feature_names)
# {"Dimension 1": {"feat_07": -0.81, "feat_12": 0.74, ...}, ...}

When the input is constant or the embedding axis is degenerate, the Spearman coefficient is undefined; correlate_features reports those as 0.0 so the output stays sortable.

Cost: O(n_features * n_components) Spearman calls — essentially free.
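The behavior described above (per-dimension mapping, sorting by absolute magnitude, 0.0 for undefined coefficients) can be sketched with NumPy alone. This is an illustration under assumptions, not the correlate_features source: the helper names are hypothetical, and Spearman is computed as the Pearson correlation of average ranks.

```python
import numpy as np


def average_ranks(values: np.ndarray) -> np.ndarray:
    """1-based ranks where tied values share their average rank."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)   # highest rank occupied by each unique value
    first = last - counts + 1  # lowest rank occupied by each unique value
    return ((first + last) / 2.0)[inverse]


def spearman_or_zero(a: np.ndarray, b: np.ndarray) -> float:
    ra, rb = average_ranks(a), average_ranks(b)
    if ra.min() == ra.max() or rb.min() == rb.max():
        return 0.0  # constant input: coefficient undefined, report 0.0
    return float(np.corrcoef(ra, rb)[0, 1])


def correlate_sketch(X, X_emb, feature_names):
    out = {}
    for d in range(X_emb.shape[1]):
        scores = {
            name: spearman_or_zero(X[:, j], X_emb[:, d])
            for j, name in enumerate(feature_names)
        }
        # Strongest features first, by absolute correlation.
        out[f"Dimension {d + 1}"] = dict(
            sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
        )
    return out


rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 50)
X_demo = np.column_stack([t, np.full(50, 3.0), rng.normal(size=50)])
X_emb_demo = np.column_stack([t ** 3, -t])  # both axes are monotone in feat_a
per_dim = correlate_sketch(X_demo, X_emb_demo, ["feat_a", "feat_b", "feat_c"])
print(per_dim["Dimension 1"])
```

In this toy setup feat_a comes first on both axes (close to +1 on the first, close to -1 on the second), and the constant feat_b is reported as 0.0.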

3. Perturbation Importance#

Model-agnostic. For each feature, shuffle it n_repeats times, ask the reducer to re-embed, and measure mean squared deviation from the reference embedding. Aggregate across repeats and normalize so importances sum to 1.

from coco_pipe.dim_reduction.analysis import perturbation_importance

scores = perturbation_importance(
    reducer.reducer,                # the underlying fitted reducer
    X,
    feature_names=feature_names,
    X_emb=X_emb,
    n_repeats=5,
    random_state=42,
)
# {"feat_07": 0.31, "feat_12": 0.18, ...}

Cost: n_features * n_repeats calls to transform. This is slow for reducers whose transform is expensive, and it is unavailable for reducers such as TSNE that do not expose transform at all.

Caveats:

  • Requires transform. Non-parametric methods (TSNE, MDS, PHATE, Isomap, LLE, SpectralEmbedding) do not implement it. Use correlation or fit a parametric proxy.

  • Correlated features dilute importance. If two features are perfectly correlated, shuffling one barely changes the embedding — both will appear unimportant.

  • Stochastic. Set random_state for reproducibility.
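The shuffle, re-embed, and normalize loop described in this section can be sketched as follows. It is a minimal illustration, not the perturbation_importance source: the function name is hypothetical, and a fixed linear projection stands in for a fitted reducer's transform.

```python
import numpy as np


def perturbation_sketch(transform, X, n_repeats=5, random_state=None):
    """Normalized mean-squared embedding shift per shuffled feature."""
    rng = np.random.default_rng(random_state)
    reference = transform(X)
    raw = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        shifts = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X[:, j])  # shuffle one feature only
            shifts.append(np.mean((transform(X_perm) - reference) ** 2))
        raw[j] = np.mean(shifts)  # aggregate across repeats
    total = raw.sum()
    return raw / total if total > 0 else raw  # importances sum to 1


# Stand-in "reducer": feature 0 matters a lot, feature 1 a little, feature 2 not at all.
W_demo = np.array([[1.0, 0.0],
                   [0.0, 0.1],
                   [0.0, 0.0]])
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
importances = perturbation_sketch(lambda A: A @ W_demo, X_demo,
                                  n_repeats=5, random_state=42)
print(importances)
```

Fixing random_state makes the shuffles, and therefore the importances, repeatable across runs.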

4. Gradient Saliency#

Encoder-based methods can compute ∂‖z‖ / ∂x analytically. The backend calls wrapper.get_pytorch_module(), runs z = encoder(x), backpropagates z.sum(), and averages absolute gradients across the sample axis.

from coco_pipe.dim_reduction.analysis import gradient_importance

scores = gradient_importance(
    reducer.reducer,                # the underlying torch-backed reducer
    X,
    feature_names=feature_names,
)
# {"feat_07": 0.41, ...} for 1D inputs;
# {"importance_matrix": ndarray} for higher-rank inputs.

Cost: one forward and one backward pass. This is the cheapest model-based option when it applies.
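To make the forward and backward pass concrete without requiring torch, the sketch below writes out the gradient of z.sum() by hand for a one-layer tanh encoder and averages absolute gradients over samples. Everything here is an assumption made for illustration (the encoder, the weights, and the function names); the real backend relies on autograd and the module returned by get_pytorch_module().

```python
import numpy as np


def encoder(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Toy encoder: z = tanh(x @ W + b)."""
    return np.tanh(x @ W + b)


def grad_of_sum_z(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """d(z.sum()) / dx per sample, via the chain rule for tanh."""
    z = encoder(x, W, b)            # forward pass
    return (1.0 - z ** 2) @ W.T     # backward pass: tanh'(u) = 1 - tanh(u)^2


def saliency_sketch(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Mean absolute gradient per input feature, averaged over samples."""
    return np.abs(grad_of_sum_z(x, W, b)).mean(axis=0)


rng = np.random.default_rng(0)
W_demo = rng.normal(size=(4, 2))
W_demo[3] = 0.0                     # feature 3 never reaches the encoder output
b_demo = rng.normal(size=2)
x_demo = rng.normal(size=(100, 4))
saliency = saliency_sketch(x_demo, W_demo, b_demo)
print(saliency)
```

The feature whose weights are all zero gets a saliency of exactly 0, since no gradient can flow back to it.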

Requirements:

  • The reducer must expose get_pytorch_module() returning a module with an encoder submodule.

  • Currently supported: IVIS, ParametricUMAP, TopologicalAE.

  • torch must be installed. Use the [topology] or [ivis] extras depending on the reducer.

5. Unified Backend: interpret_features#

For most workflows, use the high-level backend directly through the manager:

result = reducer.interpret(
    X,
    X_emb=embedding,
    analyses=["correlation", "perturbation"],
    feature_names=feature_names,
    n_repeats=5,
    random_state=42,
)
result["analysis"]   # dict keyed by analysis name
result["records"]    # tidy long-form records

The same backend is also importable as a pure function:

from coco_pipe.dim_reduction.analysis import interpret_features

payload = interpret_features(
    X,
    X_emb=embedding,
    model=reducer.reducer,
    analyses=["correlation", "perturbation"],
    feature_names=feature_names,
    method_name="UMAP",
    n_repeats=5,
    random_state=42,
)

Outputs are cached on DimReduction.interpretation_ and DimReduction.interpretation_records_ so subsequent plotting and reporting don’t need to recompute.

6. Visualization#

Tidy records flow straight into coco_pipe.viz.plot_reduction_feature_importance() and coco_pipe.viz.plot_feature_correlation_heatmap().

from coco_pipe.viz import (
    plot_reduction_feature_importance,
    plot_feature_correlation_heatmap,
)

plot_reduction_feature_importance(
    reducer.interpretation_records_,
    analysis="perturbation",
    method=reducer.method,
    top_n=20,
)
plot_feature_correlation_heatmap(
    reducer.interpretation_["correlation"],
    method=reducer.method,
)

7. Choosing a Backend#

  • First pass on any reducer: correlation — cheap, always available.

  • Parametric reducer with a non-trivial cost per transform: perturbation. It gives a true input → output sensitivity but needs n_features * n_repeats calls to transform.

  • Encoder-based reducer: gradient — by far the cheapest accurate measure when it applies.

Combine them: correlation for ranking, perturbation or gradient for the final published interpretation.