Building and Running Experiments#
The Experiment Orchestrator#
coco_pipe.decoding.Experiment is the main entry point for all decoding
experiments. It validates configuration, orchestrates the outer CV loop,
and returns a fully populated ExperimentResult.
—
1. Initialization#
from coco_pipe.decoding import Experiment, ExperimentConfig
from coco_pipe.decoding.configs import ClassicalModelConfig, CVConfig
config = ExperimentConfig(
task="classification",
models={"lr": ClassicalModelConfig(estimator="LogisticRegression")},
metrics=["accuracy"],
cv=CVConfig(strategy="stratified", n_splits=5),
)
exp = Experiment(config)
At construction time, Experiment.__init__ immediately:
- Resolves all model specs from ESTIMATOR_SPECS.
- Validates task/metric/model compatibility (raises ValueError if any combination is invalid).
- Propagates the master random_state to all sub-configs.
—
2. Running an Experiment#
result = exp.run(
X,
y,
groups=None, # or np.ndarray of group labels
sample_ids=None, # or array of unique sample identifiers
sample_metadata=None, # or dict/DataFrame with Subject, Session, ...
feature_names=None, # or list of feature name strings
time_axis=None, # or np.ndarray of timepoints for 3D inputs
observation_level="epoch", # or "trial", "subject", etc.
inferential_unit=None, # auto-inferred from metadata
)
2.1 X and y#
- X: 2D array (n_samples, n_features) for classical models, or 3D array (n_samples, n_channels, n_times) for temporal estimators.
- y: 1D array (n_samples,) of class labels (classification) or continuous values (regression).
2.2 sample_metadata#
A dict or DataFrame with one column per metadata variable. It must include
Subject and Session (capitalized) when the outer CV uses a group key.
Additional columns (e.g., Site, Age) are stored in predictions and splits
for downstream analysis.
sample_metadata = {
"Subject": subject_ids, # unique subject identifiers
"Session": session_ids, # recording session identifiers
"Site": site_ids, # optional acquisition site
}
2.3 observation_level#
A string label stored in result.meta["observation_level"]. It describes what
each row of X represents ("epoch", "trial", "subject", etc.).
This metadata does not affect fitting but documents the result for downstream
analysis and reporting.
—
3. Per-Fold Pipeline#
For each outer CV fold, Experiment executes the following sequence:
1. Split: divide X, y, and metadata into training and test partitions.
2. Validate fold integrity: check for degenerate folds (empty partitions, single-class training sets for classification).
3. Build pipeline: create a sklearn.pipeline.Pipeline with steps scaler → feature_selector → model. Each step is instantiated fresh for this fold.
4. Wrap with tuning: if TuningConfig.enabled, wrap the pipeline in GridSearchCV or RandomizedSearchCV.
5. Fit: call pipeline.fit(X_train, y_train) (with groups if required).
6. Calibrate: if CalibrationConfig.enabled, wrap in CalibratedClassifierCV and refit with calibration folds.
7. Score: compute all requested metrics on X_test.
8. Extract diagnostics: feature importances, predictions, timing, warnings.
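The sequence above can be approximated with plain scikit-learn. The following is a minimal sketch, not coco_pipe's implementation: the synthetic data, the k=10 filter, and the fold_scores list are illustrative assumptions, and the tuning and calibration steps are omitted for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (n_samples, n_features) data.
X, y = make_classification(
    n_samples=120, n_features=30, n_informative=5, random_state=0
)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
    # Validate fold integrity: no empty test set, at least two training classes.
    if len(test_idx) == 0 or np.unique(y[train_idx]).size < 2:
        raise ValueError(f"Degenerate fold {fold}")

    # Build a fresh pipeline for this fold: scaler -> feature_selector -> model.
    pipeline = Pipeline(
        [
            ("scaler", StandardScaler()),
            ("feature_selector", SelectKBest(f_classif, k=10)),
            ("model", LogisticRegression(max_iter=200)),
        ]
    )

    # Fit on the training partition only, then score on the held-out partition.
    pipeline.fit(X[train_idx], y[train_idx])
    y_pred = pipeline.predict(X[test_idx])
    fold_scores.append(accuracy_score(y[test_idx], y_pred))

print(f"mean accuracy over {len(fold_scores)} folds: {np.mean(fold_scores):.3f}")
```

Because the scaler and the selector live inside the pipeline, their statistics are learned from the training rows of each fold and never from the test rows.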
—
4. Parallel Execution#
config = ExperimentConfig(
...,
n_jobs=4, # number of parallel outer CV jobs
)
result = Experiment(config).run(X, y)
n_jobs controls the number of parallel outer-fold evaluations via joblib.
For exact reproducibility, use n_jobs=1 (see Reproducibility Architecture).
—
5. Save and Load#
# Save result to JSON
path = result.save("results/my_experiment.json")
# Load from JSON
from coco_pipe.decoding.result import ExperimentResult
loaded = ExperimentResult.load(path)
The result is serialized as a self-contained JSON payload (schema version
decoding_result_v1), including the config, metadata, per-fold outputs,
and provenance information.
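To illustrate what a self-contained, versioned JSON payload buys you, here is a minimal standard-library sketch of a save/load round trip with a schema check. Apart from the schema name decoding_result_v1, everything here is an assumption: the payload layout, the schema_version key, and the helper names save_payload and load_payload are hypothetical and do not describe the actual ExperimentResult format.

```python
import json
import tempfile
from pathlib import Path

SCHEMA_VERSION = "decoding_result_v1"  # schema name quoted from the text above


def save_payload(payload: dict, path: Path) -> Path:
    """Write a JSON payload, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2))
    return path


def load_payload(path: Path) -> dict:
    """Read a JSON payload and refuse unknown schema versions."""
    payload = json.loads(path.read_text())
    found = payload.get("schema_version")
    if found != SCHEMA_VERSION:
        raise ValueError(f"Unsupported schema version: {found!r}")
    return payload


# Hypothetical payload layout, for illustration only.
payload = {
    "schema_version": SCHEMA_VERSION,
    "config": {"task": "classification", "metrics": ["accuracy"]},
    "meta": {"observation_level": "epoch"},
    "folds": [{"fold": 0, "accuracy": 0.8}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = save_payload(payload, Path(tmp) / "results" / "my_experiment.json")
    loaded = load_payload(path)
```

Checking the schema version on load turns a silently misread file into an explicit error.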
—
6. Configuration Reference#
See Configuration Reference for a full listing of all configuration classes.
The most important fields on ExperimentConfig:
| Field | Description |
|---|---|
| models | Dict mapping model names to model configs. |
| metrics | List of metric keys (validated against the task and model capabilities). |
| use_scaler | Whether to prepend a StandardScaler to the pipeline. |
| n_jobs | Number of parallel outer CV jobs. |
| random_state | Master seed for reproducibility. |
| tag | Descriptive label stored in the result metadata. |
Configuration Reference#
All experiment configuration is declarative and Pydantic-validated. Every
config class uses extra="forbid" so misspelled or unsupported field names
raise a ValidationError immediately — before any training starts.
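The fail-fast behaviour of extra="forbid" can be seen with a toy model. This is a sketch assuming Pydantic v2; ToyCVConfig is a hypothetical stand-in and not one of the coco_pipe config classes.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class ToyCVConfig(BaseModel):
    """Hypothetical stand-in for a config class that forbids unknown fields."""

    model_config = ConfigDict(extra="forbid")

    strategy: str = "stratified"
    n_splits: int = 5


# A correctly spelled field is accepted.
ok = ToyCVConfig(n_splits=3)

# A misspelled field ("n_split") is rejected at construction time.
try:
    ToyCVConfig(n_split=3)
    rejected = False
except ValidationError as err:
    rejected = True
    first_error = err.errors()[0]
    print(first_error["loc"], first_error["msg"])
```

Without extra="forbid", the misspelled keyword would be ignored and the default n_splits=5 would be used silently.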
—
1. ExperimentConfig#
Top-level configuration for a decoding experiment.
from coco_pipe.decoding.configs import ExperimentConfig
config = ExperimentConfig(
task="classification", # required: "classification" or "regression"
models={"lr": ...}, # required: dict of model configs
metrics=["accuracy"], # default: task-appropriate defaults
cv=CVConfig(...), # default: StratifiedKFold(5)
tuning=TuningConfig(...), # default: disabled
feature_selection=FeatureSelectionConfig(...), # default: disabled
reducer=ReducerConfig(...), # default: disabled (in-pipeline reduction)
calibration=CalibrationConfig(...), # default: disabled
statistical_assessment=StatisticalAssessmentConfig(...), # default: disabled
grids={"lr": {"C": [0.1, 1.0]}}, # hyperparameter grids for tuning
use_scaler=True, # prepend StandardScaler to pipeline
n_jobs=1, # outer CV parallelism
verbose=False,
tag="my_experiment", # descriptive label in result metadata
random_state=42,
)
—
2. CVConfig#
Controls the outer cross-validation loop.
from coco_pipe.decoding.configs import CVConfig
cv = CVConfig(
strategy="stratified_group_kfold",
n_splits=5, # also the number of groups left out for "leave_p_out"
group_key="Subject", # column name in sample_metadata
test_size=0.2, # for "split" / "group_shuffle_split" only
stratify=True, # for "split" + classification only
auto_reduce_n_splits=True, # shrink n_splits if too few groups
random_state=42,
)
See Cross-Validation Strategies Guide for a complete strategy guide.
—
3. ClassicalModelConfig#
Configures a classical scikit-learn estimator.
from coco_pipe.decoding.configs import ClassicalModelConfig
model = ClassicalModelConfig(
estimator="LogisticRegression", # key in ESTIMATOR_SPECS
params={"C": 1.0, "max_iter": 200},
)
Short-form aliases are also available for common estimators:
from coco_pipe.decoding.configs import LogisticRegressionConfig
model = LogisticRegressionConfig(C=1.0, max_iter=200)
—
4. TemporalDecoderConfig#
Wraps a classical base estimator for 3D temporal inputs.
from coco_pipe.decoding.configs import TemporalDecoderConfig, ClassicalModelConfig
model = TemporalDecoderConfig(
wrapper="sliding", # or "generalizing"
base=ClassicalModelConfig(estimator="LogisticRegression"),
scoring="accuracy",
n_jobs=-1,
)
Requires mne as an optional dependency.
—
5. TuningConfig#
Controls hyperparameter search.
from coco_pipe.decoding.configs import TuningConfig, CVConfig
tuning = TuningConfig(
enabled=True,
search_type="grid", # or "random"
scoring="accuracy",
n_iter=20, # for "random" search only
n_jobs=1,
refit=True,
cv=CVConfig(strategy="stratified", n_splits=3), # inner CV
allow_nongroup_inner_cv=False, # leakage guard
random_state=42,
)
—
6. FeatureSelectionConfig#
from coco_pipe.decoding.configs import FeatureSelectionConfig, CVConfig
fs = FeatureSelectionConfig(
enabled=True,
method="k_best", # or "sfs"
n_features=20,
scoring="accuracy", # scoring criterion for SFS inner CV
cv=CVConfig(strategy="stratified", n_splits=3), # SFS inner CV
direction="forward", # for SFS: "forward" or "backward"
allow_nongroup_inner_cv=False,
)
—
7. CalibrationConfig#
Enables probability calibration inside the training path.
from coco_pipe.decoding.configs import CalibrationConfig, CVConfig
calibration = CalibrationConfig(
enabled=True,
method="sigmoid", # or "isotonic"
cv=CVConfig(strategy="stratified", n_splits=3),
allow_nongroup_inner_cv=False,
)
—
8. StatisticalAssessmentConfig#
from coco_pipe.decoding.configs import (
StatisticalAssessmentConfig, ChanceAssessmentConfig, ConfidenceIntervalConfig
)
assessment = StatisticalAssessmentConfig( # pass as statistical_assessment=assessment
enabled=True,
random_state=42,
unit_of_inference="group_mean", # "sample", "group_mean", "group_majority", "custom"
chance=ChanceAssessmentConfig(
method="permutation", # or "binomial", "auto"
n_permutations=1000,
p0=None, # required for "binomial"
temporal_correction="max_stat", # "max_stat", "fdr_bh", "none"
store_null_distribution=False,
),
confidence_intervals=ConfidenceIntervalConfig(
alpha=0.05,
method="clopper_pearson", # or "wilson"
),
)
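For intuition about confidence_intervals with method="wilson", here is a minimal sketch of the standard Wilson score interval for a binomial proportion, such as the number of correct predictions out of n. The function name wilson_interval is hypothetical, and this is not necessarily how coco_pipe computes the interval.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(n_correct: int, n_total: int, alpha: float = 0.05):
    """Return the (low, high) Wilson score interval for n_correct / n_total."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided normal quantile
    p_hat = n_correct / n_total
    denom = 1.0 + z**2 / n_total
    center = (p_hat + z**2 / (2.0 * n_total)) / denom
    half_width = (
        z * sqrt(p_hat * (1.0 - p_hat) / n_total + z**2 / (4.0 * n_total**2)) / denom
    )
    return center - half_width, center + half_width


# 70 correct predictions out of 100, at alpha = 0.05.
low, high = wilson_interval(n_correct=70, n_total=100)
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

The interval is not symmetric around the observed proportion, which keeps it inside [0, 1] even for accuracies near 0 or 1.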
—
9. Foundation Model Configs#
from coco_pipe.decoding.configs import (
FoundationEmbeddingModelConfig,
FrozenBackboneDecoderConfig,
NeuralFineTuneConfig,
LoRAConfig,
QuantizationConfig,
DeviceConfig,
CheckpointConfig,
)
# Frozen embedding extractor
embed_cfg = FoundationEmbeddingModelConfig(
backend="braindecode", # "auto" (default), "braindecode", "hugging_face"
model_key="labram", # a registered model — see list_foundation_models()
pooling="mean", # "mean" or "flatten"
cache_embeddings=True,
normalize_embeddings=True,
)
# Full / parameter-efficient neural fine-tuning
ft_cfg = NeuralFineTuneConfig(
backend="hugging_face",
model_key="reve",
input_kind="epoched", # "temporal", "epoched", "tokens"
train_mode="qlora", # "full", "frozen", "linear_probe", "lora", "qlora"
lora=LoRAConfig(r=16, alpha=32),
quantization=QuantizationConfig(enabled=True, load_in_4bit=True),
device=DeviceConfig(device="auto", precision="bf16"), # "fp32", "fp16", "bf16"
checkpoints=CheckpointConfig(save="best"), # "none", "best", "last", "all"
)
Discover available backbones and their capabilities with
list_foundation_models() and
get_foundation_model_spec().
ExperimentResult API#
ExperimentResult is the structured container returned by Experiment.run().
It provides 20+ accessor methods for tidy-data inspection, diagnostic reporting,
and statistical inference — all without rerunning the experiment.
—
1. Structure#
result.raw # per-model dict of fold outputs
result.meta # environment provenance, task, model names, capabilities
result.config # original ExperimentConfig
—
2. Prediction Accessors#
# All out-of-fold predictions in tidy long form
preds = result.get_predictions()
# columns: Model, Fold, SampleIndex, SampleID, Group, y_true, y_pred
# + y_proba_0, y_proba_1, ... (if probabilities available)
# + Subject, Session, Site (from sample_metadata)
# + Time (sliding) or TrainTime, TestTime (generalizing)
—
3. Score Accessors#
# Per-fold, per-metric scores
scores = result.get_detailed_scores()
# columns: Model, Fold, Metric, Value, Time (if temporal)
# Fold-level split information
splits = result.get_splits(with_metadata=True)
# Fit/predict/score timing and convergence warnings
fit_diag = result.get_fit_diagnostics()
—
4. Curve Diagnostics#
# ROC curves (binary or one-vs-rest multiclass)
roc = result.get_roc_curve()
# columns: Model, Fold, Class, FPR, TPR, Threshold, AUC
# Precision-recall curves
pr = result.get_pr_curve()
# columns: Model, Fold, Class, Precision, Recall, Threshold
# Calibration (reliability) curves
cal = result.get_calibration_curve()
# Probability quality summary (log-loss + Brier per fold)
prob_diag = result.get_probability_diagnostics()
# Summary statistics for ROC AUC
roc_summary = result.get_roc_auc_summary()
# Summary statistics for PR AUC
pr_summary = result.get_pr_auc_summary()
—
5. Confusion Matrices#
# Per-fold confusion matrices in long form
cm = result.get_confusion_matrices(normalize=True)
# columns: Model, Fold, TrueLabel, PredLabel, Count
# Pooled (over folds) confusion matrix
pooled_cm = result.get_pooled_confusion_matrix(normalize="true")
—
6. Temporal Accessors#
# Score summary per timepoint (sliding only)
temporal = result.get_temporal_score_summary()
# columns: Model, Metric, Time, MeanScore, StdScore
# Generalization matrix: shape (n_train_times, n_test_times)
matrix = result.get_generalization_matrix("accuracy")
# or long form:
matrix_long = result.get_generalization_matrix("accuracy", long=True)
—
7. Statistical Inference#
# Full-pipeline or lightweight permutation/binomial assessment
assessment = result.get_statistical_assessment()
# Lightweight (fixed-prediction, fast, biased)
assessment_fast = result.get_statistical_assessment(lightweight=True, metric="accuracy")
# Bootstrap CI over independent units
ci = result.get_bootstrap_confidence_intervals(
metric="accuracy",
unit="Subject",
n_bootstraps=2000,
ci=0.95,
)
# Null distribution (if stored via store_null_distribution=True)
nulls = result.get_statistical_nulls()
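The idea behind a bootstrap CI over independent units can be sketched in pure Python: aggregate correctness per unit (for example per subject), then resample units rather than individual epochs. This is an illustrative sketch under stated assumptions, not the library's code: the helper name bootstrap_ci_by_unit, the equal weighting of units, and the percentile method are all choices made for this example.

```python
import random
from collections import defaultdict


def bootstrap_ci_by_unit(correct, units, n_bootstraps=2000, ci=0.95, seed=0):
    """Percentile bootstrap CI for mean per-unit accuracy, resampling units."""
    per_unit = defaultdict(list)
    for is_correct, unit in zip(correct, units):
        per_unit[unit].append(is_correct)
    unit_means = [sum(v) / len(v) for v in per_unit.values()]

    rng = random.Random(seed)
    n_units = len(unit_means)
    stats = []
    for _ in range(n_bootstraps):
        # Resample whole units with replacement, never individual rows.
        sample = [unit_means[rng.randrange(n_units)] for _ in range(n_units)]
        stats.append(sum(sample) / n_units)
    stats.sort()

    low_idx = int((1.0 - ci) / 2.0 * (n_bootstraps - 1))
    high_idx = int((1.0 + ci) / 2.0 * (n_bootstraps - 1))
    return stats[low_idx], stats[high_idx]


# Toy data: six subjects with ten epochs each and different accuracies.
n_correct_per_subject = {"s1": 9, "s2": 8, "s3": 7, "s4": 6, "s5": 8, "s6": 5}
correct, units = [], []
for subject, k in n_correct_per_subject.items():
    correct += [1] * k + [0] * (10 - k)
    units += [subject] * 10

point = sum(k / 10 for k in n_correct_per_subject.values()) / len(n_correct_per_subject)
low, high = bootstrap_ci_by_unit(correct, units)
print(f"accuracy {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Resampling epochs instead of subjects would treat correlated rows as independent and typically produce an interval that is too narrow.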
—
8. Model Comparison#
# Paired permutation test between two models (in-result)
paired = result.compare_models_paired("lr", "svm", metric="accuracy", unit="Subject")
# All pairwise comparisons with correction
all_pairs = result.compare_models(metric="accuracy", correction="fdr_bh")
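A paired permutation test between two models can be sketched as a sign-flip test on per-unit score differences. The sketch below is an assumption about the general technique, not a description of compare_models_paired: the function name paired_sign_flip_test, the two-sided statistic, and the toy per-subject scores are all hypothetical.

```python
import random


def paired_sign_flip_test(scores_a, scores_b, n_permutations=5000, seed=0):
    """Two-sided paired permutation p-value for the mean score difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))

    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_permutations):
        # Under the null, the label "model A" vs "model B" is exchangeable
        # within each unit, so each difference keeps or flips its sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed - 1e-12:
            exceed += 1
    # Add-one correction so the p-value is never exactly zero.
    return (exceed + 1) / (n_permutations + 1)


# Toy per-subject accuracies for two models (one value per subject).
lr_scores = [0.80, 0.75, 0.82, 0.78, 0.85, 0.79, 0.81, 0.77]
svm_scores = [0.70, 0.72, 0.74, 0.69, 0.80, 0.71, 0.73, 0.70]

p_value = paired_sign_flip_test(lr_scores, svm_scores)
p_same = paired_sign_flip_test(lr_scores, lr_scores)
print(f"p = {p_value:.4f} (different models), p = {p_same:.4f} (identical models)")
```

With eight subjects there are only 2^8 = 256 distinct sign patterns, so the smallest achievable two-sided p-value is about 2/256; more subjects are needed for stronger evidence.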
—
9. Feature Importances#
# Mean ± std feature importance across folds
importances = result.get_feature_importances()
# columns: FeatureName, MeanImportance, StdImportance
# Per-fold importances
fold_imp = result.get_feature_importances(fold_level=True)
# Ranked importances (descending by mean)
ranked = result.get_feature_importances(rank=True)
—
10. Feature Selection Accessors#
# Selected features per fold
selected = result.get_selected_features(ordered=True)
# Feature stability: selection rate across folds
stability = result.get_feature_stability()
# Per-fold univariate feature scores (k_best only)
scores = result.get_feature_scores(with_pvalues=True)
—
11. Hyperparameter Tuning#
# Best hyperparameters per fold
best = result.get_best_params()
# Full grid search results
grid = result.get_search_results()
—
12. Model Artifact Metadata#
# Neural model training history, checkpoints, etc.
artifacts = result.get_model_artifacts()
—
13. Serialization#
# Serialize to JSON-compatible payload
payload = result.to_payload()
# Save to file
path = result.save("results/my_result.json")
# Load from file
from coco_pipe.decoding.result import ExperimentResult
loaded = ExperimentResult.load("results/my_result.json")
Metric Registry#
All metrics are registered in coco_pipe.decoding._metrics.METRIC_REGISTRY.
Metric/task compatibility is enforced at config validation time — before any
model is trained — preventing silent misuse of classification metrics for
regression tasks (or vice versa).
—
1. Registry API#
from coco_pipe.decoding._metrics import (
get_metric_spec,
get_metric_names,
get_metric_families,
get_scorer,
METRIC_REGISTRY,
)
# Inspect a single metric
spec = get_metric_spec("accuracy")
print(spec.name) # "accuracy"
print(spec.task) # "classification"
print(spec.family) # "label"
print(spec.response_method) # "predict"
print(spec.greater_is_better) # True
# List all classification metrics in the "threshold_sweep" family
names = get_metric_names(task="classification", family="threshold_sweep")
# Get a callable scorer
scorer = get_scorer("f1") # sklearn-compatible callable
Each MetricSpec contains:
| Field | Type | Description |
|---|---|---|
| name | str | Unique key in the registry. |
| family | str | Grouping for reporting (see below). |
| greater_is_better | bool | Directionality for permutation p-values and Max-Stat correction. |
—
2. Classification Metrics#
2.1 Label Metrics (family="label")#
Require only predict output. Work with any classifier.
| Metric | Description |
|---|---|
| Accuracy | Fraction of correctly classified samples. Sensitive to class imbalance. |
| Balanced accuracy | Mean recall per class. Recommended over accuracy when classes are imbalanced. |
| Zero-one loss | Fraction misclassified. |
| Hamming loss | Per-label Hamming loss (fraction of labels incorrectly predicted). |
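A small worked example shows why mean per-class recall is the safer choice under class imbalance. This is a pure-Python sketch with toy labels; the helper functions are written for illustration and are not the registry's scorers.

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# 90 negatives, 10 positives, and a classifier that always predicts class 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(acc, bal_acc)  # 0.9 0.5
```

The majority-class predictor looks strong on accuracy but sits exactly at chance on balanced accuracy.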
2.2 Confusion-Derived Metrics (family="confusion")#
Derived from the confusion matrix. Require only predict.
| Metric | Description |
|---|---|
| F1 (binary) | Binary F1 score (harmonic mean of precision and recall). |
| F1 (macro) | Unweighted macro-average F1 across classes. |
| F1 (micro) | Global precision/recall pooled across classes. |
| Precision | Positive predictive value. TP / (TP + FP). |
| Recall | Sensitivity / true positive rate. TP / (TP + FN). |
| Sensitivity | Synonym for recall. Binary only; raises an error otherwise. |
| Specificity | True negative rate. TN / (TN + FP). Binary only. |
| Jaccard index | Intersection-over-union for binary labels. |
| Matthews correlation coefficient (MCC) | Matthews correlation coefficient. Balanced for all class distributions. |
| Cohen's kappa | Agreement corrected for chance. Range [-1, 1]. |
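The binary formulas in this family all follow from the four confusion-matrix counts. The sketch below computes them directly; the function name binary_confusion_metrics and the toy counts are illustrative assumptions, not part of the registry.

```python
from math import sqrt


def binary_confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive common binary metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    jaccard = tp / (tp + fp + fn)
    mcc_denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "jaccard": jaccard,
        "mcc": mcc,
    }


# Toy counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
m = binary_confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

This version assumes every denominator except the MCC one is non-zero; a production scorer would need to handle empty classes explicitly.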
2.3 Threshold-Sweep Metrics (family="threshold_sweep")#
Require probability or decision scores. Use predict_proba when available,
decision_function as fallback for binary classifiers.
| Metric | Description |
|---|---|
| ROC AUC | Area under the ROC curve (binary OvR). Insensitive to the decision threshold. |
| ROC AUC (one-vs-rest) | Macro-averaged one-vs-rest AUC for multiclass. |
| Average precision (AP) | Area under the PR curve using sklearn’s average precision (AP), a step-wise sum without interpolation (binary). |
| PR AUC | Trapezoidal AUC of the precision-recall curve. Preferred over AP when positive fraction is small. |
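The two precision-recall summaries can be compared directly with scikit-learn. This sketch uses toy scores and standard scikit-learn functions; it does not show how the registry's scorers are wired up.

```python
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy binary labels and decision scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.10, 0.30, 0.35, 0.40, 0.60, 0.65, 0.70, 0.90]

# Average precision: step-wise sum of precision weighted by recall increments.
ap = average_precision_score(y_true, y_score)

# Trapezoidal area under the same precision-recall curve.
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc_trapz = auc(recall, precision)

print(f"AP = {ap:.4f}, trapezoidal PR AUC = {pr_auc_trapz:.4f}")
```

The two numbers summarize the same curve but generally differ, because the trapezoidal rule linearly interpolates between operating points while average precision does not.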
2.4 Probability-Score Metrics (family="score_probability")#
Require predict_proba. Enable calibration diagnostics.
| Metric | Description |
|---|---|
| Log loss | Cross-entropy loss. Lower is better. |
| Brier score | Mean squared error of probability predictions. Lower is better. |
—
3. Regression Metrics (family="regression")#
Require only predict output.
| Metric | Description |
|---|---|
| R² | Coefficient of determination. 1.0 is perfect fit; can be negative. |
| Negative MSE | Negative mean squared error. Negated so that higher is better, for optimization consistency. |
| Negative MAE | Negative mean absolute error. More robust than MSE to outliers. |
| Negative RMSE | Negative root mean squared error. Same units as the target variable. |
| Explained variance | Proportion of variance explained. Similar to R² but not penalized for bias. |
—
4. Compatibility Rules#
The registry enforces three compatibility checks at ExperimentConfig
validation time:
1. Task mismatch: A metric’s task must match ExperimentConfig.task.
2. Proba requirement: If response_method == "proba", the model must declare predict_proba or calibration must be enabled.
3. Score requirement: If response_method == "proba_or_score", the model must declare predict_proba or decision_function.
These checks fire before any model is trained, producing a clear ValueError
with the specific metric and model name.
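The three checks can be expressed as a small pure-Python validator. This is a sketch of the rules as stated above, not the registry's code: ToyMetric, ToyModel, check_compatibility, and the toy metric and model names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMetric:
    name: str
    task: str
    response_method: str  # "predict", "proba", or "proba_or_score"


@dataclass(frozen=True)
class ToyModel:
    name: str
    methods: frozenset  # method names the estimator declares


def check_compatibility(task, metric, model, calibration_enabled=False):
    """Raise ValueError if the metric cannot be used with this task and model."""
    if metric.task != task:
        raise ValueError(
            f"Metric '{metric.name}' targets task '{metric.task}', not '{task}'."
        )
    has_proba = "predict_proba" in model.methods
    has_score = "decision_function" in model.methods
    if metric.response_method == "proba" and not (has_proba or calibration_enabled):
        raise ValueError(
            f"Metric '{metric.name}' needs predict_proba, "
            f"which model '{model.name}' lacks."
        )
    if metric.response_method == "proba_or_score" and not (has_proba or has_score):
        raise ValueError(
            f"Metric '{metric.name}' needs predict_proba or decision_function, "
            f"which model '{model.name}' lacks."
        )


def explain(task, metric, model, calibration_enabled=False):
    """Return 'ok' or the validation error message."""
    try:
        check_compatibility(task, metric, model, calibration_enabled)
        return "ok"
    except ValueError as err:
        return str(err)


# Toy model exposing decision_function but not predict_proba.
margin_model = ToyModel("margin_clf", frozenset({"predict", "decision_function"}))
log_loss = ToyMetric("log_loss", "classification", "proba")
roc_auc = ToyMetric("roc_auc", "classification", "proba_or_score")
r2 = ToyMetric("r2", "regression", "predict")

results = {
    "task_mismatch": explain("classification", r2, margin_model),
    "proba_missing": explain("classification", log_loss, margin_model),
    "proba_via_calibration": explain("classification", log_loss, margin_model, True),
    "score_ok": explain("classification", roc_auc, margin_model),
}
```

Each failure message names both the metric and the model, which is what makes an early configuration error actionable.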
—
5. Custom Metrics#
You can extend the registry for project-specific metrics:
from coco_pipe.decoding._metrics import METRIC_REGISTRY, MetricSpec
from sklearn.metrics import top_k_accuracy_score
from functools import partial
top2 = partial(top_k_accuracy_score, k=2, labels=[0, 1, 2])
METRIC_REGISTRY["top2_accuracy"] = MetricSpec(
name="top2_accuracy",
task="classification",
scorer=top2,
response_method="proba",
family="label",
greater_is_better=True,
)
Warning
Custom metrics are added to the in-process registry only. They are not
persisted in saved ExperimentResult payloads and must be re-registered
in any new Python process that loads existing results.
Feature Selection#
coco_pipe.decoding supports two feature selection strategies that execute
inside each outer CV fold on the training partition only, guaranteeing
zero test-set leakage.
—
1. Filter Selection (k_best)#
SelectKBest selects the top-k features based on a univariate statistical
test. It has no inner CV loop. It is fast and well-suited for high-dimensional data
(e.g., many EEG channels/frequency bins) where a quick, interpretable feature
ranking is desired.
from coco_pipe.decoding.configs import (
ExperimentConfig, CVConfig, ClassicalModelConfig, FeatureSelectionConfig
)
config = ExperimentConfig(
task="classification",
models={"lr": ClassicalModelConfig(estimator="LogisticRegression")},
metrics=["accuracy"],
cv=CVConfig(strategy="stratified_group_kfold", n_splits=5, group_key="Subject"),
feature_selection=FeatureSelectionConfig(
enabled=True,
method="k_best",
n_features=20,
scoring="accuracy", # optional; defaults to task-appropriate test
),
)
1.1 Score Function#
For classification, the default univariate test is f_classif (ANOVA F-value).
For regression, it is f_regression. Override via feature_selection.scoring.
1.2 Accessing Feature Scores#
After fitting, retrieve per-fold and per-feature scores:
feature_scores = result.get_feature_scores()
# columns: FeatureName, Fold, Score, PValue
# Mean score across folds
mean_scores = feature_scores.groupby("FeatureName")["Score"].mean().sort_values(ascending=False)
—
2. Sequential Feature Selection (sfs)#
SequentialFeatureSelector is a wrapper-based method. It iteratively adds
(forward SFS) or removes (backward SFS) features by evaluating the model’s
cross-validated performance on each candidate feature set. Because it uses the
model’s predictive performance as the selection criterion, it is more powerful
than filter methods but significantly more expensive.
config = ExperimentConfig(
task="classification",
models={"lr": ClassicalModelConfig(estimator="LogisticRegression")},
metrics=["balanced_accuracy"],
cv=CVConfig(strategy="stratified_group_kfold", n_splits=5, group_key="Subject"),
feature_selection=FeatureSelectionConfig(
enabled=True,
method="sfs",
n_features=10,
scoring="balanced_accuracy", # criterion for SFS inner evaluation
cv=CVConfig(strategy="stratified_group_kfold", n_splits=3, group_key="Subject"),
direction="forward", # or "backward"
),
)
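The wrapper idea behind method="sfs" can be tried with scikit-learn's SequentialFeatureSelector directly. This is a minimal sketch on synthetic data, fitted on the whole toy dataset purely for illustration; inside an experiment the selector would only ever see the training partition of each outer fold. The dataset sizes and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Small synthetic problem so the greedy search stays fast.
X, y = make_classification(
    n_samples=80, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=200),
    n_features_to_select=3,
    direction="forward",  # start empty and add one feature per step
    scoring="balanced_accuracy",
    cv=3,  # inner CV used to score each candidate feature set
)
sfs.fit(X, y)

selected_idx = sfs.get_support(indices=True)
print("selected feature indices:", selected_idx.tolist())
```

Forward selection of 3 out of 8 features evaluates 8 + 7 + 6 = 21 candidate sets, each with 3 inner folds, which is why the cost grows quickly with the number of features.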
2.1 Inner CV for SFS#
SFS requires an inner CV loop to evaluate candidate feature sets. When omitted,
coco_pipe.decoding derives the inner SFS CV from:
- tuning.cv if tuning is enabled.
- The outer CV family (group-based if outer is group-based).
When the outer CV is group-based, the SFS inner CV is automatically group-based.
Overriding requires allow_nongroup_inner_cv=True.
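The reason for this guard can be demonstrated with scikit-learn splitters: a non-group inner CV places rows from the same subject on both sides of a split. The sketch below uses toy subject labels; the helper n_leaky_folds is hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Six subjects with ten epochs each; features are irrelevant for splitting.
groups = np.repeat(np.arange(6), 10)
X = np.zeros((groups.size, 1))


def n_leaky_folds(cv, **split_kwargs):
    """Count folds in which a subject appears in both train and test."""
    leaky = 0
    for train_idx, test_idx in cv.split(X, **split_kwargs):
        shared = np.intersect1d(groups[train_idx], groups[test_idx])
        if shared.size:
            leaky += 1
    return leaky


group_leaks = n_leaky_folds(GroupKFold(n_splits=3), groups=groups)
plain_leaks = n_leaky_folds(KFold(n_splits=3, shuffle=True, random_state=0))
print(f"GroupKFold leaky folds: {group_leaks}, shuffled KFold leaky folds: {plain_leaks}")
```

With a group-based splitter no subject straddles a split, so inner-CV scores used for selection are not inflated by within-subject similarity.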
2.2 Group-Aware SFS#
coco_pipe.decoding uses scikit-learn metadata routing to pass the
outer-fold training groups into the SFS inner CV. This requires
scikit-learn >= 1.6.
2.3 SFS with Tuning#
SFS combined with hyperparameter tuning evaluates feature subsets inside the
tuning inner folds. coco_pipe.decoding uses a sklearn.pipeline.Pipeline
cache to avoid redundant refitting:
config = ExperimentConfig(
...,
feature_selection=FeatureSelectionConfig(enabled=True, method="sfs", n_features=10),
tuning=TuningConfig(enabled=True, scoring="accuracy"),
grids={"lr": {"C": [0.1, 1.0, 10.0]}},
)
Warning
SFS + tuning is computationally intensive. Reduce the outer n_splits or
the SFS inner n_splits for development runs.
—
3. Feature Stability Analysis#
For both k_best and sfs, coco_pipe.decoding tracks which features
were selected in each fold. The stability score is the proportion of folds in
which a feature was selected:
stability = result.get_feature_stability()
# columns: FeatureName, SelectionRate, MeanRank, StdRank
# Most stable features
top = stability.sort_values("SelectionRate", ascending=False).head(20)
Note
Feature stability across folds is a measure of generalizability, not importance. A feature selected in all folds is a robust signal across the sampled subjects, regardless of its average selection score.
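The selection rate itself is simple to compute from per-fold selections. The sketch below uses hypothetical feature names and a hypothetical selection_rate helper; it mirrors the definition given above (proportion of folds in which a feature was selected) rather than the library's internals.

```python
from collections import Counter

# Hypothetical per-fold selections (fold index -> selected feature names).
selected_per_fold = {
    0: ["alpha_Fz", "beta_Cz", "theta_Pz"],
    1: ["alpha_Fz", "beta_Cz", "gamma_Oz"],
    2: ["alpha_Fz", "beta_Cz", "delta_Fp1"],
    3: ["alpha_Fz", "gamma_Oz", "delta_Fp1"],
}


def selection_rate(per_fold):
    """Proportion of folds in which each feature was selected."""
    n_folds = len(per_fold)
    counts = Counter(name for names in per_fold.values() for name in set(names))
    return {name: count / n_folds for name, count in counts.items()}


rates = selection_rate(selected_per_fold)
# Sort by descending rate, breaking ties alphabetically.
ranked = sorted(rates.items(), key=lambda item: (-item[1], item[0]))
for name, rate in ranked:
    print(f"{name}: {rate:.2f}")
```

Here alpha_Fz is selected in every fold (rate 1.0), while theta_Pz appears only once (rate 0.25) and is more likely a fold-specific artifact.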
—
4. Selected Features per Fold#
selected = result.get_selected_features()
# columns: FeatureName, Fold, Rank
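# Features selected in every fold. Count the folds actually present, since
# auto_reduce_n_splits may run fewer folds than config.cv.n_splits.
n_folds = selected["Fold"].nunique()
folds_per_feature = selected.groupby("FeatureName")["Fold"].nunique()
universal = folds_per_feature[folds_per_feature == n_folds].index.tolist()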
—
5. Compatibility Notes#
- Feature selection is only valid for 2D tabular inputs (input_kind in {"tabular_2d", "embedding_2d"}).
- Feature selection is incompatible with temporal estimators (SlidingEstimator, GeneralizingEstimator). The registry blocks this at validation time.
- k_best does not support ranked importances beyond fold scores/p-values. For importance-based selection, use tree ensemble importances via result.get_feature_importances().