Evaluation and Interpretation#
Evaluation and Method Comparison#
The evaluation layer answers two questions:
- For one embedding: how well does it preserve the structure of the original data?
- Across multiple embeddings: which reducer should I prefer for this dataset?
Both flow through a single pure evaluator, evaluate_embedding(), which emits
tidy long-form records. Those records are then consumed either through manager
scoring or through MethodSelector for ranking.
—
1. Standard 2D Metric Catalog#
All standard metrics operate on an embedding of shape
(n_samples, n_components) and the corresponding original X with shape
(n_samples, n_features). The first three are computed from a shared
co-ranking matrix; the last is distance-based.
| Metric | What it measures |
|---|---|
| trustworthiness | Penalizes intrusions: points that appear in the embedding’s k-neighborhood but not in the original’s. |
| continuity | Penalizes extrusions: points that were in the original’s k-neighborhood but are missing from the embedding’s. |
| lcmc | Local Continuity Meta-Criterion: overlap of the original and embedding k-neighborhoods. |
| MRRE | Mean Relative Rank Error split into intrusion and extrusion components, and combined as … |
| Shepard correlation | Spearman correlation between original and embedded pairwise distances, computed on a subsample. |
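The distance-based metric can be illustrated with a short sketch. This is not the evaluator's own code: the function name shepard_sketch, the n_sub parameter, and the default subsample size are assumptions made for illustration. It only shows the general idea of correlating pairwise distances on a random subsample.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def shepard_sketch(X, X_emb, n_sub=200, random_state=0):
    """Spearman correlation of pairwise distances on a random subsample.

    Hypothetical helper: the name and the n_sub default are illustrative only.
    """
    rng = np.random.default_rng(random_state)
    idx = rng.choice(len(X), size=min(n_sub, len(X)), replace=False)
    d_high = pdist(X[idx])      # condensed distances in the original space
    d_low = pdist(X_emb[idx])   # same pairs, measured in the embedding
    rho, _ = spearmanr(d_high, d_low)
    return float(rho)


rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
print(shepard_sketch(X, 2.0 * X))                     # rescaling keeps the distance order
print(shepard_sketch(X, rng.normal(size=(300, 2))))   # unrelated embedding
```

Subsampling keeps the number of pairwise distances manageable, which is why a seed matters for reproducibility.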
The co-ranking-based metrics share a validity domain that depends on the
sample size: 2 * n_samples - 3 * k - 1 > 0. The evaluator checks this condition
before computing anything and raises a clear error if it fails.
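To make the co-ranking idea concrete, here is a minimal NumPy sketch under stated assumptions. All function names ending in _sketch, plus neighbor_ranks and check_k, are hypothetical and are not part of coco_pipe; the formulas are the standard textbook definitions of trustworthiness, continuity, and LCMC, which may differ in detail from the library's implementation.

```python
import numpy as np


def neighbor_ranks(points):
    """ranks[i, j] = position of j in i's neighbor ordering (0 = i itself)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, -1.0)  # force every point to be its own rank-0 neighbor
    order = np.argsort(dist, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(len(points))[:, None]
    ranks[rows, order] = np.arange(len(points))[None, :]
    return ranks


def coranking_sketch(X, X_emb):
    """Q[a - 1, b - 1] counts pairs with original rank a and embedding rank b."""
    n = len(X)
    high, low = neighbor_ranks(X), neighbor_ranks(X_emb)
    mask = ~np.eye(n, dtype=bool)  # drop the self-pairs
    Q = np.zeros((n - 1, n - 1), dtype=np.int64)
    np.add.at(Q, (high[mask] - 1, low[mask] - 1), 1)
    return Q


def check_k(n_samples, k):
    """Validity domain quoted above: 2 * n_samples - 3 * k - 1 > 0."""
    if 2 * n_samples - 3 * k - 1 <= 0:
        raise ValueError(f"k={k} is too large for n_samples={n_samples}")


def trustworthiness_sketch(Q, k):
    n = Q.shape[0] + 1
    check_k(n, k)
    excess = np.arange(1, n - k)  # original rank minus k, for ranks k+1 .. n-1
    penalty = (excess[:, None] * Q[k:, :k]).sum()  # intrusions
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


def continuity_sketch(Q, k):
    n = Q.shape[0] + 1
    check_k(n, k)
    excess = np.arange(1, n - k)  # embedding rank minus k
    penalty = (Q[:k, k:] * excess[None, :]).sum()  # extrusions
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


def lcmc_sketch(Q, k):
    n = Q.shape[0] + 1
    check_k(n, k)
    return Q[:k, :k].sum() / (n * k) - k / (n - 1)


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
Q_same = coranking_sketch(X, X.copy())                  # perfect embedding
Q_rand = coranking_sketch(X, rng.normal(size=(40, 2)))  # unrelated embedding
print(trustworthiness_sketch(Q_same, 5), trustworthiness_sketch(Q_rand, 5))
```

Computing Q once and slicing it for each k is what makes it cheap to score several neighborhood sizes in one pass.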
from coco_pipe.dim_reduction import DimReduction, trustworthiness, continuity, lcmc
from coco_pipe.dim_reduction.evaluation.metrics import compute_coranking_matrix
reducer = DimReduction("PCA", n_components=2)
embedding = reducer.fit_transform(X)
# Direct use of the primitives (rare — usually done via score()):
Q = compute_coranking_matrix(X, embedding)
print(trustworthiness(Q, k=10), continuity(Q, k=10), lcmc(Q, k=10))
In practice, prefer the manager:
reducer.score(embedding, X=X, k_values=[5, 10, 20])
reducer.metrics_ # scalar summaries
reducer.metric_records_ # tidy long-form, one row per (metric, k)
—
2. Trajectory Metrics (Native 3D Paths)#
When X_emb.shape == (n_trajectories, n_times, n_dims), the evaluator
switches to trajectory metrics. They are covered in detail in
Trajectory Analysis.
—
3. Calling the Pure Evaluator Directly#
Most workflows go through DimReduction.score, but the pure evaluator is
public for advanced use:
from coco_pipe.dim_reduction.evaluation.core import evaluate_embedding
payload = evaluate_embedding(
    X_emb=embedding,
    X=X,
    method_name="UMAP",
    metrics=["trustworthiness", "continuity"],
    k_values=[5, 10, 20],
    random_state=42,
)
payload["metrics"] # scalar summaries
payload["records"] # tidy long-form, ready for plotting / reports
payload["diagnostics"] # arrays (e.g., coranking_matrix, shepard_distances)
Inputs:
| Parameter | Description |
|---|---|
| X_emb | 2D … |
| X | Required for 2D paths; optional for 3D. |
| metrics | Optional metric subset; defaults to “all applicable for the shape”. |
| … | Used by supervised separation metrics and … |
| … | Optional time coords for trajectory AUC. |
| random_state | Seed for sampled Shepard distances. |
Output: a dict with keys embedding, metrics, metadata,
diagnostics, records, artifacts.
—
4. Tidy Records Schema#
Every record is a flat dictionary with at minimum:
| Column | Description |
|---|---|
| method | Reducer name (filled in by the manager / selector). |
| … | Metric name (e.g., …). |
| … | Numeric value. |
| … | Parameter dimension this row is parameterized by (…). |
| … | Value of … |
Optional columns survive when present: group, condition, pair,
subject, session, seed, fold. These are not required by the
selector but pass through to plots and reports unchanged.
This is the same shape consumed by:
- coco_pipe.viz.plot_metrics() for visualization,
- MethodSelector for ranking,
- coco_pipe.report.Report.add_comparison() for report sections.
—
5. MethodSelector: Post-Hoc Comparison and Ranking#
MethodSelector is a thin collector and ranker over already-scored reducers.
It never fits or scores anything; it only reads what is already cached.
5.1 Construction#
from coco_pipe.dim_reduction.evaluation import MethodSelector
reducers = [DimReduction(m, n_components=2) for m in ["PCA", "UMAP", "Isomap"]]
for r in reducers:
    emb = r.fit_transform(X)
    r.score(emb, X=X, k_values=[5, 10, 20])
selector = MethodSelector(reducers).collect()
# Or: MethodSelector({"pca": pca_reducer, "umap": umap_reducer}).collect()
You can also build from existing records:
selector = MethodSelector.from_records(metric_records)
selector = MethodSelector.from_frame(metric_frame)
5.2 Frame Export and Ranking#
frame = selector.to_frame() # tidy DataFrame
ranked = selector.rank_methods(
    selection_metric="trustworthiness",
    selection_k=10,
    tie_breakers=["continuity"],
)
best_name = ranked.iloc[0]["method"]
rank_methods ranks by the mean of the selected metric. For k-scoped
metrics, selection_k narrows the comparison to one neighborhood size. Ties
are broken by each successive tie_breakers metric, in order.
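The ranking rule can be approximated in a few lines of pandas. This is a sketch under assumptions, not the selector's implementation: the record keys metric, k, and value are hypothetical column names chosen for illustration, and rank_by_mean is a made-up helper.

```python
import pandas as pd

# Hypothetical tidy records; the key names are assumptions for this sketch.
records = [
    {"method": "PCA", "metric": "trustworthiness", "k": 10, "value": 0.75},
    {"method": "PCA", "metric": "trustworthiness", "k": 10, "value": 1.0},
    {"method": "PCA", "metric": "trustworthiness", "k": 5, "value": 1.0},
    {"method": "UMAP", "metric": "trustworthiness", "k": 10, "value": 0.875},
    {"method": "UMAP", "metric": "trustworthiness", "k": 5, "value": 0.5},
    {"method": "Isomap", "metric": "trustworthiness", "k": 10, "value": 0.5},
    {"method": "PCA", "metric": "continuity", "k": 10, "value": 0.70},
    {"method": "UMAP", "metric": "continuity", "k": 10, "value": 0.80},
    {"method": "Isomap", "metric": "continuity", "k": 10, "value": 0.90},
]
frame = pd.DataFrame(records)


def rank_by_mean(frame, selection_metric, selection_k=None, tie_breakers=()):
    """Rank methods by mean metric value, breaking ties with further metrics."""

    def mean_per_method(metric):
        rows = frame[frame["metric"] == metric]
        if selection_k is not None:
            rows = rows[rows["k"] == selection_k]
        if rows.empty:
            raise ValueError(f"no records for metric={metric!r}, k={selection_k!r}")
        return rows.groupby("method")["value"].mean().rename(metric)

    columns = [selection_metric, *tie_breakers]
    table = pd.concat([mean_per_method(m) for m in columns], axis=1)
    ranked = table.sort_values(columns, ascending=False)
    return ranked.rename_axis("method").reset_index()


ranked = rank_by_mean(frame, "trustworthiness", selection_k=10,
                      tie_breakers=["continuity"])
print(ranked)
```

In this toy data PCA and UMAP tie on mean trustworthiness at k=10, so the continuity column decides their order. Dropping selection_k averages over every k and can change the winner.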
5.3 Failure modes the selector catches#
- Reducers without cached metric_records_ (you forgot to call score()).
- Asking to rank by a metric that no reducer ever computed.
- Asking for a selection_k that none of the records cover.
These all raise ValueError at ranking time with a specific message.
—
6. Driving Evaluation From EvaluationConfig#
When the same metric stack is used across many experiments, drive everything
from one EvaluationConfig:
from coco_pipe.dim_reduction.config import EvaluationConfig
eval_cfg = EvaluationConfig(
    metrics=["trustworthiness", "continuity", "lcmc"],
    k_range=[5, 10, 20],
    selection_metric="trustworthiness",
    selection_k=10,
    tie_breakers=["continuity"],
    separation_method="centroid",
)
# Score every reducer with the same metric stack.
for r in reducers:
    emb = r.fit_transform(X)
    r.score(
        emb,
        X=X,
        metrics=eval_cfg.metrics,
        k_values=eval_cfg.k_range,
        separation_method=eval_cfg.separation_method,
    )
# Rank using the selection settings from the same config.
ranked = MethodSelector(reducers).collect().rank_methods(
    selection_metric=eval_cfg.selection_metric,
    selection_k=eval_cfg.selection_k,
    tie_breakers=eval_cfg.tie_breakers,
)
Feature Interpretation#
Interpretation answers: which input features appear to drive each embedding axis? This is independent of preservation scoring (covered in Evaluation and Interpretation).
Three backends with different cost / reducer-class tradeoffs are available
through coco_pipe.dim_reduction.DimReduction.interpret() and the pure
backend coco_pipe.dim_reduction.analysis.interpret_features().
—
1. Backends at a Glance#
| Backend | What it measures | Reducer requirements |
|---|---|---|
| correlation | Spearman correlation between each input feature and each embedding axis. | Any reducer (just needs an embedding). |
| perturbation | Mean-squared shift in the embedding when each feature is independently shuffled. | Any reducer with a fitted transform. |
| gradient | Encoder saliency: … | Torch-based encoders (…). |
All three return tidy long-form records suitable for the same plotting and report paths.
—
2. Correlation (Default)#
Spearman correlation between every column of X and every column of
X_emb. Returns a nested mapping of dimension → feature → correlation,
sorted by absolute magnitude within each dimension.
from coco_pipe.dim_reduction.analysis import correlate_features
per_dim = correlate_features(X, X_emb, feature_names=feature_names)
# {"Dimension 1": {"feat_07": -0.81, "feat_12": 0.74, ...}, ...}
When the input is constant or the embedding axis is degenerate, the Spearman
coefficient is undefined; correlate_features reports those as 0.0 so
the output stays sortable.
Cost: O(n_features * n_components) Spearman calls — essentially free.
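The behavior described in this section can be sketched with SciPy. The helper name spearman_by_dimension is hypothetical and this is not the body of correlate_features; it only mirrors the documented output shape (dimension → feature → correlation, sorted by absolute magnitude) and the 0.0 fallback for undefined coefficients.

```python
import numpy as np
from scipy.stats import spearmanr


def spearman_by_dimension(X, X_emb, feature_names):
    """Nested mapping: dimension label -> {feature: Spearman rho}, sorted by |rho|."""
    result = {}
    for d in range(X_emb.shape[1]):
        axis = X_emb[:, d]
        scores = {}
        for j, name in enumerate(feature_names):
            column = X[:, j]
            if column.max() == column.min() or axis.max() == axis.min():
                rho = 0.0  # coefficient undefined for constant input; report 0.0
            else:
                rho, _ = spearmanr(column, axis)
            scores[name] = float(rho)
        ordered = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
        result[f"Dimension {d + 1}"] = dict(ordered)
    return result


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 1.0  # a constant feature
# Toy embedding: axis 1 is a monotone function of feat_a, axis 2 reverses feat_b.
X_emb = np.column_stack([X[:, 0] ** 3, -X[:, 1]])
per_dim = spearman_by_dimension(X, X_emb, ["feat_a", "feat_b", "feat_const"])
print(per_dim)
```

Because Spearman works on ranks, the cubic relationship on the first axis still scores as a perfect correlation, which a Pearson coefficient would understate.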
—
3. Perturbation Importance#
Model-agnostic. For each feature, shuffle it n_repeats times, ask the
reducer to re-embed, and measure mean squared deviation from the reference
embedding. Aggregate across repeats and normalize so importances sum to 1.
from coco_pipe.dim_reduction.analysis import perturbation_importance
scores = perturbation_importance(
    reducer.reducer,  # the underlying fitted reducer
    X,
    feature_names=feature_names,
    X_emb=X_emb,
    n_repeats=5,
    random_state=42,
)
# {"feat_07": 0.31, "feat_12": 0.18, ...}
Cost: n_features * n_repeats calls to transform. This is slow for methods
where transform is expensive, such as PHATE. TSNE does not expose transform
at all.
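The shuffle, re-embed, and normalize loop can be sketched in plain NumPy. Everything here is illustrative: LinearProjector is a hypothetical stand-in for a fitted reducer with a transform method, and perturbation_sketch is not the library function.

```python
import numpy as np


class LinearProjector:
    """Hypothetical stand-in for a fitted reducer that exposes transform()."""

    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)

    def transform(self, X):
        return X @ self.weights


def perturbation_sketch(model, X, n_repeats=5, random_state=0):
    """Mean squared embedding shift per shuffled feature, normalized to sum to 1."""
    rng = np.random.default_rng(random_state)
    reference = model.transform(X)
    raw = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        shifts = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column only
            shifts.append(np.mean((model.transform(X_perm) - reference) ** 2))
        raw[j] = np.mean(shifts)
    total = raw.sum()
    return raw / total if total > 0 else raw


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Feature 0 drives axis 1 strongly, feature 1 drives axis 2 weakly, feature 2 is ignored.
model = LinearProjector([[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
scores = perturbation_sketch(model, X, n_repeats=5, random_state=42)
print(scores)
```

The nested loop makes the n_features * n_repeats cost visible: each inner iteration is one full call to transform.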
Caveats:
- Requires transform. Non-parametric methods (TSNE, MDS, PHATE, Isomap, LLE, SpectralEmbedding) do not implement it. Use correlation or fit a parametric proxy.
- Correlated features dilute importance. If two features are perfectly correlated, shuffling one barely changes the embedding, so both will appear unimportant.
- Stochastic. Set random_state for reproducibility.
—
4. Gradient Saliency#
Encoder-based methods can compute ∂‖z‖ / ∂x analytically. The backend
calls wrapper.get_pytorch_module(), runs z = encoder(x), backpropagates
z.sum(), and averages absolute gradients across the sample axis.
from coco_pipe.dim_reduction.analysis import gradient_importance
scores = gradient_importance(
    reducer.reducer,  # the underlying torch-backed reducer
    X,
    feature_names=feature_names,
)
# {"feat_07": 0.41, ...} for 1D inputs;
# {"importance_matrix": ndarray} for higher-rank inputs.
Cost: one forward + one backward pass. The cheapest option when applicable.
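As an illustration of the saliency computation, the sketch below works the gradient of z.sum() by hand for a hypothetical one-layer tanh encoder in NumPy, then checks it against finite differences. A torch-backed reducer would obtain the same quantity from autograd; the encoder, its weights, and all function names here are assumptions made to keep the example dependency-free.

```python
import numpy as np


def encode(X, W, b):
    """Hypothetical encoder: z = tanh(X @ W + b)."""
    return np.tanh(X @ W + b)


def input_gradients(X, W, b):
    """Gradient of z.sum() with respect to each input value, via the chain rule."""
    z = encode(X, W, b)
    return (1.0 - z ** 2) @ W.T  # shape (n_samples, n_features)


def gradient_saliency(X, W, b):
    """Average absolute input gradient across the sample axis."""
    return np.abs(input_gradients(X, W, b)).mean(axis=0)


def numeric_gradients(X, W, b, eps=1e-6):
    """Central finite differences, used only to sanity-check the analytic form."""
    grad = np.zeros_like(X)
    for j in range(X.shape[1]):
        step = np.zeros(X.shape[1])
        step[j] = eps
        plus = encode(X + step, W, b).sum(axis=1)
        minus = encode(X - step, W, b).sum(axis=1)
        grad[:, j] = (plus - minus) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W = np.array([[2.0, 1.0], [0.5, 0.0], [0.0, 0.0]])  # feature 2 is not connected
b = np.array([0.1, -0.2])
saliency = gradient_saliency(X, W, b)
print(saliency)
```

A feature with no path into the encoder receives zero saliency, whereas the shuffle-based approach would need repeated re-embedding to show the same thing.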
Requirements:
- The reducer must expose get_pytorch_module() returning a module with an encoder submodule.
- Currently supported: IVIS, ParametricUMAP, TopologicalAE.
- torch must be installed. Use the [topology] or [ivis] extras depending on the reducer.
—
5. Unified Backend: interpret_features#
For most workflows, use the high-level backend directly through the manager:
result = reducer.interpret(
    X,
    X_emb=embedding,
    analyses=["correlation", "perturbation"],
    feature_names=feature_names,
    n_repeats=5,
    random_state=42,
)
result["analysis"] # dict keyed by analysis name
result["records"] # tidy long-form records
The same backend is also importable as a pure function:
from coco_pipe.dim_reduction.analysis import interpret_features
payload = interpret_features(
    X,
    X_emb=embedding,
    model=reducer.reducer,
    analyses=["correlation", "perturbation"],
    feature_names=feature_names,
    method_name="UMAP",
    n_repeats=5,
    random_state=42,
)
Outputs are cached on DimReduction.interpretation_ and
DimReduction.interpretation_records_ so subsequent plotting and reporting
don’t need to recompute.
—
6. Visualization#
Tidy records flow straight into coco_pipe.viz.plot_feature_importance()
(in the dim-reduction module) and coco_pipe.viz.plot_feature_correlation_heatmap().
from coco_pipe.viz import (
    plot_reduction_feature_importance,
    plot_feature_correlation_heatmap,
)
plot_reduction_feature_importance(
    reducer.interpretation_records_,
    analysis="perturbation",
    method=reducer.method,
    top_n=20,
)
plot_feature_correlation_heatmap(
    reducer.interpretation_["correlation"],
    method=reducer.method,
)
—
7. Choosing a Backend#
- First pass on any reducer: correlation. Cheap and always available.
- Parametric reducer with a non-trivial cost per transform: perturbation. Gives a true input → output sensitivity but is n_features-times slower.
- Encoder-based reducer: gradient. By far the cheapest accurate measure when it applies.
Combine them: correlation for ranking, perturbation or gradient
for the final published interpretation.