Cross-Validation and Statistical Inference
Cross-Validation Strategies Guide
The cross-validation strategy is the most consequential choice in a decoding experiment. It determines whether the performance estimate is statistically valid, whether group independence is preserved, and whether the inner model-selection loops can correctly inherit the outer splitting logic.
—
1. Available Strategies
All strategies are configured via CVConfig.strategy:
| Strategy | Group-aware | Use case | scikit-learn equivalent |
|---|---|---|---|
| | ❌ | Balanced class folds (classification) | |
| | ❌ | Regression, or classification without imbalance | |
| | ✅ | K folds, subjects exclusive to test | |
| | ✅ | K folds, class-balanced, subjects exclusive | |
| | ✅ | Leave-one-subject-out (LOSO) | |
| | ✅ | Leave-P-subjects-out | |
| | ❌ | Ordered splits for time-series data | |
| | ❌ | Single train/test holdout | Custom |
| | ✅ | Randomized group-aware train/test splits | |
Set CVConfig.auto_reduce_n_splits=True to let the splitter shrink
n_splits automatically when a group-aware strategy cannot honor the
requested number of folds (e.g. too few groups).
—
2. Group-Based Strategies
2.1 When Groups Are Required
Use a group-based strategy whenever your data contains multiple observations per independent unit (e.g., multiple epochs per subject). Failure to do so means the model trains and tests on data from the same subjects, producing inflated accuracy estimates.
Provide groups via sample metadata:
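from coco_pipe.decoding import Experiment
from coco_pipe.decoding.configs import CVConfig

cv = CVConfig(
    strategy="stratified_group_kfold",
    n_splits=5,
    group_key="Subject",  # must match a column in sample_metadata
)
# `config` is an ExperimentConfig built with cv=cv; subject_ids and
# session_ids hold one entry per sample.
result = Experiment(config).run(
    X, y,
    sample_metadata={"Subject": subject_ids, "Session": session_ids},
)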
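The leakage described above can be made concrete with scikit-learn's own splitters. The following is a minimal sketch on a toy layout (6 subjects, 5 epochs each, both assumed for illustration); it does not use coco_pipe and only counts folds in which a subject appears on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Toy layout (illustrative): 6 subjects with 5 consecutive epochs each.
subject_ids = np.repeat(np.arange(6), 5)
X = np.zeros((subject_ids.size, 3))


def count_leaky_folds(splitter, groups=None):
    """Count folds where at least one subject is in both train and test."""
    leaky = 0
    for train_idx, test_idx in splitter.split(X, groups=groups):
        shared = set(subject_ids[train_idx]) & set(subject_ids[test_idx])
        leaky += bool(shared)
    return leaky


# Plain K-fold cuts through subjects; group K-fold keeps them intact.
leaky_plain = count_leaky_folds(KFold(n_splits=4))
leaky_grouped = count_leaky_folds(GroupKFold(n_splits=4), groups=subject_ids)
print(leaky_plain, leaky_grouped)
```

Any fold counted as leaky lets the model see epochs from a test subject during training, which is the source of the inflated estimates.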
2.2 LOSO (Leave-One-Subject-Out)
LOSO holds out all epochs from one subject in each fold. It is the most conservative and most clinically relevant evaluation strategy, but it requires as many folds as there are subjects, which can be computationally expensive.
cv = CVConfig(strategy="leave_one_group_out", group_key="Subject")
Note
leave_one_group_out does not accept an n_splits parameter. The number
of folds equals the number of unique subjects.
2.3 Leave-P-Subjects-Out
Leave-P-groups-out holds out p subjects per fold. With p > 1, each fold is
tested on a larger and more varied held-out set than in LOSO, but the number
of folds grows combinatorially with the number of subjects.
cv = CVConfig(strategy="leave_p_out", n_splits=2, group_key="Subject") # leaves 2 groups out per fold
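To see how quickly the fold count grows, here is a small standard-library sketch. It assumes every combination of p subjects is held out exactly once, which is how scikit-learn's LeavePGroupsOut behaves; the subject names are made up.

```python
from itertools import combinations
from math import comb

subjects = ["s01", "s02", "s03", "s04", "s05", "s06"]  # illustrative IDs
p = 2

# Each fold holds out one distinct set of p subjects.
held_out_sets = list(combinations(subjects, p))
n_folds_lpo = len(held_out_sets)
n_folds_loso = len(subjects)

# The count matches the binomial coefficient C(n, p).
assert n_folds_lpo == comb(len(subjects), p)
print(n_folds_loso, n_folds_lpo)
```

With 6 subjects, LOSO needs 6 folds while leave-2-out needs 15; at 30 subjects leave-2-out already needs 435.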
—
3. Group Propagation to Inner CV
When the outer CV is group-based, coco_pipe.decoding automatically propagates
group constraints to all inner CV loops:
- Hyperparameter tuning (TuningConfig): uses a group-based inner CV by default.
- Sequential Feature Selection (FeatureSelectionConfig(method="sfs")): uses a group-based inner CV by default.
- Calibration (CalibrationConfig): uses a group-based inner calibration split by default.
Overriding this requires explicitly setting allow_nongroup_inner_cv=True on
the relevant config object:
from coco_pipe.decoding.configs import TuningConfig, CVConfig

tuning = TuningConfig(
    enabled=True,
    cv=CVConfig(strategy="stratified", n_splits=3),
    allow_nongroup_inner_cv=True,  # explicit acknowledgement of leakage risk
)
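What group propagation means in practice can be sketched with plain scikit-learn splitters. This is an illustration of the idea only, not coco_pipe's internal implementation: the inner splitter receives the group labels of the outer training set, so no subject is shared between inner training and validation data.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy layout (illustrative): 8 subjects, 5 epochs each.
subject_ids = np.repeat(np.arange(8), 5)
X = np.zeros((subject_ids.size, 2))

outer = GroupKFold(n_splits=4)
inner = GroupKFold(n_splits=3)
inner_overlaps = []

for outer_train, _outer_test in outer.split(X, groups=subject_ids):
    # Propagate the group labels of the outer training set to the inner loop.
    inner_groups = subject_ids[outer_train]
    for inner_train, inner_val in inner.split(X[outer_train], groups=inner_groups):
        shared = set(inner_groups[inner_train]) & set(inner_groups[inner_val])
        inner_overlaps.append(len(shared))

print(len(inner_overlaps), max(inner_overlaps))
```

Replacing the inner splitter with a non-group one would allow shared subjects in the inner loop, which is the risk that allow_nongroup_inner_cv=True asks you to acknowledge.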
—
4. Stratified Strategies
Stratified strategies keep class proportions approximately equal across folds. This matters for imbalanced datasets, where unstratified folds can end up with no minority-class examples.
- "stratified" and "stratified_group_kfold" are only valid for classification.
- For regression tasks, use "kfold" or group-based strategies.
cv = CVConfig(
    strategy="stratified_group_kfold",
    n_splits=5,
    group_key="Subject",
    random_state=42,  # reproducibility for stratification
)
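The effect of stratification on an imbalanced dataset can be checked directly with scikit-learn. This sketch uses an assumed toy label vector (45 majority and 5 minority samples, sorted by class) and counts minority samples in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Illustrative labels: 45 majority-class and 5 minority-class samples.
y = np.array([0] * 45 + [1] * 5)
X = np.zeros((y.size, 1))


def minority_per_fold(splitter):
    """Number of minority-class samples in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


plain = minority_per_fold(KFold(n_splits=5))
stratified = minority_per_fold(StratifiedKFold(n_splits=5))
print(plain, stratified)
```

Without stratification, four of the five test folds contain no minority examples at all, so per-fold metrics such as ROC AUC are undefined there.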
—
5. Holdout Split
For large datasets or as a quick sanity check, a single train/test holdout avoids the overhead of K outer folds:
cv = CVConfig(
    strategy="split",
    test_size=0.2,
    stratify=True,  # stratified split for classification
    random_state=42,
)
The n_splits field is ignored for "split" — it always produces exactly
one fold.
—
6. Time Series Split
For EEG/MEG data that is not epoched (e.g., continuous recordings), or for
temporal regression, use "timeseries":
cv = CVConfig(
    strategy="timeseries",
    n_splits=5,
    test_size=0.2,  # optional, overrides sklearn default
)
TimeSeriesSplit ensures that training data always comes before test data
in time, preventing future data from leaking into the model.
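The ordering guarantee can be verified on a toy index with scikit-learn's TimeSeriesSplit. This is a sketch on 20 synthetic time points; note that in scikit-learn itself test_size is an integer number of samples, so how coco_pipe maps a fractional value onto it is not shown here.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 samples in chronological order (illustrative).
X = np.arange(20).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=4).split(X))

# Every training index precedes every test index in the same fold.
ordered = all(train.max() < test.min() for train, test in splits)

# The training window expands from fold to fold.
train_sizes = [len(train) for train, _ in splits]
print(ordered, train_sizes)
```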
—
7. Random State and Reproducibility
CVConfig.random_state seeds the splitter. For full reproducibility, also
set ExperimentConfig.random_state, which propagates derived seeds to the CV,
tuning, feature selection, and calibration configs via a SeedSequence.
config = ExperimentConfig(
    task="classification",
    models={"lr": LogisticRegressionConfig()},
    metrics=["accuracy"],
    cv=CVConfig(strategy="stratified", n_splits=5),
    random_state=42,  # propagated to all sub-components
)
See Reproducibility Architecture for the full seed propagation architecture.
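As a rough illustration of seed derivation with NumPy's SeedSequence, the sketch below spawns one child sequence per component and turns each into an integer seed. The helper name derive_seeds and the component names are hypothetical; the library's actual propagation logic may differ.

```python
import numpy as np


def derive_seeds(master_seed, components):
    """Derive one reproducible integer seed per named component (hypothetical helper)."""
    children = np.random.SeedSequence(master_seed).spawn(len(components))
    return {
        name: int(child.generate_state(1)[0])
        for name, child in zip(components, children)
    }


components = ["cv", "tuning", "feature_selection", "calibration"]
seeds = derive_seeds(42, components)
print(seeds)
```

Spawned children are statistically independent streams, so changing the number of tuning iterations, for example, does not shift the random numbers consumed by the CV splitter.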
Statistical Assessment Guide
coco_pipe.decoding cleanly separates descriptive CV performance from
inferential claims. Descriptive metrics (fold scores, confusion matrices,
curves) are always computed. Inferential statistics are opt-in and require
explicit configuration of StatisticalAssessmentConfig.
—
1. Two Levels of Assessment
1.1 Descriptive Performance
Every ExperimentResult provides fold-level and summary scores without
any statistical testing:
scores = result.get_detailed_scores()
print(scores[["Model", "Fold", "Metric", "Value"]])
# Per-model summary: mean ± std across folds
summary = scores.groupby(["Model", "Metric"])["Value"].agg(["mean", "std"])
This is the correct starting point for all decoding reports. Always report fold-level variability alongside the mean.
1.2 Finite-Sample Inferential Assessment
Statistical significance claims require a null distribution. coco_pipe.decoding
supports two null-generation strategies:
| Method | How the null is generated | When to use |
|---|---|---|
| | Full outer CV rerun under label permutations. | Gold standard. Correct for any preprocessing pipeline. |
| | Analytical Clopper-Pearson interval on hard accuracy. | Only valid for scalar accuracy, one prediction per independent unit. |
—
2. Full-Pipeline Permutation Assessment
from coco_pipe.decoding import Experiment
from coco_pipe.decoding.configs import (
    ExperimentConfig, CVConfig, ClassicalModelConfig,
    StatisticalAssessmentConfig, ChanceAssessmentConfig,
)

config = ExperimentConfig(
    task="classification",
    models={"lr": ClassicalModelConfig(estimator="LogisticRegression")},
    metrics=["accuracy"],
    cv=CVConfig(strategy="stratified_group_kfold", n_splits=5, group_key="Subject"),
    statistical_assessment=StatisticalAssessmentConfig(
        enabled=True,
        unit_of_inference="group_mean",
        chance=ChanceAssessmentConfig(
            method="permutation",
            n_permutations=1000,
        ),
    ),
)
result = Experiment(config).run(
    X, y,
    sample_metadata={"Subject": subject_ids, "Session": session_ids},
    observation_level="epoch",
)
assessment = result.get_statistical_assessment()
The returned DataFrame contains, per model and metric:
| Column | Description |
|---|---|
| | Observed score on the true labels. |
| | Empirical p-value from permutation distribution. |
| | Multiple-comparison corrected p-value. |
| | Boolean: |
| | Bootstrap CI for the observed score. |
| | Null distribution percentiles. |
| | Number of permutations used. |
| | Effective sample size (number of independent units). |
| | Present only for temporal outputs. |
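An empirical p-value from a permutation distribution is commonly computed with a "+1" correction, so that it can never be exactly zero. The sketch below shows that convention on made-up null scores; whether coco_pipe uses exactly this estimator is an assumption.

```python
def empirical_p_value(observed, null_scores):
    """One-sided permutation p-value with the +1 correction."""
    n_extreme = sum(score >= observed for score in null_scores)
    return (1 + n_extreme) / (1 + len(null_scores))


# Illustrative null scores from 9 label permutations.
null = [0.48, 0.52, 0.50, 0.55, 0.47, 0.51, 0.49, 0.53, 0.50]
p = empirical_p_value(0.54, null)  # only 0.55 reaches the observed score
print(p)
```

The smallest attainable p-value is 1 / (n_permutations + 1), which is why 1000 permutations cannot support a claim below roughly 0.001.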
2.1 Label Permutation Across Groups
When unit_of_inference="group_mean" and cv.group_key is set, labels are
permuted across groups, not within them. This preserves within-subject epoch
correlations in the null distribution, yielding a correctly-calibrated p-value
for the group-level null hypothesis.
Note
Permuting only within subjects (swapping epochs within a subject) would be the wrong null for testing whether the model performs above chance at the population level.
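A group-level permutation can be sketched in NumPy as follows. It assumes each subject carries a single label (for example patient versus control), which is an assumption for this illustration; the helper name is hypothetical. Labels are shuffled between subjects and then broadcast back to that subject's epochs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 6 subjects, 4 epochs each, one label per subject.
subject_ids = np.repeat(np.arange(6), 4)
subject_labels = np.array([0, 0, 0, 1, 1, 1])
y = subject_labels[subject_ids]


def permute_labels_across_groups(groups, labels, rng):
    """Shuffle labels between groups; all epochs of a group move together."""
    _, first_idx, inverse = np.unique(groups, return_index=True, return_inverse=True)
    group_labels = labels[first_idx]           # one label per group
    shuffled = rng.permutation(group_labels)   # reassign labels to groups
    return shuffled[inverse]                   # broadcast back to epochs


y_perm = permute_labels_across_groups(subject_ids, y, rng)
print(y_perm.reshape(6, 4))
```

Every row of the printed array is constant, so the within-subject structure is untouched while the subject-to-label mapping is randomized.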
—
3. Binomial Assessment
Binomial testing uses the Clopper-Pearson exact interval. It is valid only when:
- Task is classification.
- Metric is plain accuracy.
- Each independent unit contributes exactly one prediction (no aggregation needed).
- An explicit chance level p0 is provided.
statistical_assessment=StatisticalAssessmentConfig(
    enabled=True,
    chance=ChanceAssessmentConfig(
        method="binomial",
        p0=0.5,  # chance level for binary classification
    ),
    confidence_intervals=ConfidenceIntervalConfig(
        method="clopper_pearson",  # or "wilson"
        alpha=0.05,
    ),
)
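The test evaluates the one-sided tail probability of observing at least \(k\) correct predictions under the chance level \(p_0\):
\[p = P(X \ge k) = 1 - F(k - 1;\, n, p_0)\]
where \(F\) is the binomial CDF, \(k\) is the number of correct predictions, and \(n\) is the number of independent observations.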
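As a worked example, the one-sided binomial tail probability P(X ≥ k), equivalently 1 − F(k − 1; n, p0), can be evaluated with the standard library alone. The numbers below (15 correct out of 20 at chance level 0.5) are illustrative, and the function name is hypothetical.

```python
from math import comb


def binomial_tail_p_value(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0), i.e. 1 - F(k - 1; n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# 15 correct predictions from 20 independent units, chance level 0.5.
p_value = binomial_tail_p_value(k=15, n=20, p0=0.5)
print(round(p_value, 4))
```

The sum counts 21700 of the 2^20 equally likely outcomes, giving a p-value of about 0.0207.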
—
4. Bootstrap Confidence Intervals
Confidence intervals for any metric can be computed independently of the permutation test, using non-parametric bootstrap over independent units:
ci = result.get_bootstrap_confidence_intervals(
    metric="accuracy",
    unit="Subject",  # or "Session", "sample", etc.
    n_bootstraps=2000,
    ci=0.95,
)
Bootstrap CI is also automatically included in the permutation assessment output
(CILower, CIUpper columns).
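A percentile bootstrap over independent units can be sketched in a few lines of NumPy. The per-subject accuracies are made up, and the percentile method is one common choice; the interval type used by the library is not specified here.

```python
import numpy as np


def bootstrap_ci(unit_scores, n_bootstraps=2000, ci=0.95, seed=0):
    """Percentile bootstrap CI for the mean, resampling whole units."""
    rng = np.random.default_rng(seed)
    unit_scores = np.asarray(unit_scores, dtype=float)
    n_units = unit_scores.size
    # Resample units with replacement; one row per bootstrap replicate.
    idx = rng.integers(0, n_units, size=(n_bootstraps, n_units))
    boot_means = unit_scores[idx].mean(axis=1)
    alpha = 1.0 - ci
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Illustrative per-subject accuracies (one value per independent unit).
subject_accuracy = [0.62, 0.71, 0.58, 0.66, 0.74, 0.69, 0.60, 0.65]
lower, upper = bootstrap_ci(subject_accuracy)
print(lower, upper)
```

Resampling subjects rather than epochs keeps correlated epochs together, so the interval width reflects between-subject variability.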
—
5. Temporal Correction Methods
For sliding/generalizing decoders, one p-value per timepoint must be corrected
for multiple comparisons. Set temporal_correction in ChanceAssessmentConfig:
| Method | Description |
|---|---|
| | Permutation Max-Stat (default). FWER control. Builds the null from the maximum statistic across all timepoints within each permutation. Recommended for temporal data with moderate-to-high positive correlation between timepoints. |
| | Benjamini-Hochberg FDR. Controls the expected proportion of false discoveries. More powerful than Max-Stat, but with a weaker guarantee than FWER control. |
| | No correction. For exploratory analysis only. |
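The max-statistic correction can be sketched in NumPy as below. The null scores are synthetic and the function name is hypothetical; the point is that each timepoint is compared against the distribution of per-permutation maxima across all timepoints.

```python
import numpy as np


def max_stat_p_values(observed, null):
    """FWER-corrected p-values from a (n_permutations, n_times) null array."""
    observed = np.asarray(observed, dtype=float)
    null = np.asarray(null, dtype=float)
    # One maximum per permutation, taken across all timepoints.
    max_null = null.max(axis=1)
    exceed = (max_null[:, None] >= observed[None, :]).sum(axis=0)
    return (1 + exceed) / (1 + null.shape[0])


rng = np.random.default_rng(0)
null = rng.normal(0.5, 0.02, size=(999, 50))  # synthetic chance-level scores
observed = np.full(50, 0.5)
observed[20:30] = 0.70                        # synthetic effect window
p_corrected = max_stat_p_values(observed, null)
print(p_corrected[18:32])
```

Because a single null distribution of maxima is shared by all timepoints, the probability of any false positive across the whole time course is controlled at the chosen level.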
—
6. Lightweight Post-Hoc Diagnostics
For quick exploratory inspection without rerunning training:
# Lightweight label permutation over fixed predictions (fast but biased)
null = result.get_statistical_assessment(lightweight=True, metric="accuracy")
# Direct post-hoc permutation (bypasses full retraining)
from coco_pipe.decoding.stats import assess_post_hoc_permutation
posthoc = assess_post_hoc_permutation(result.raw["lr"], metric="accuracy", n_permutations=500)
Warning
Post-hoc permutations that shuffle labels over fixed predictions do not
account for preprocessing, feature selection, or hyperparameter search. If any
of these steps used the labels, the resulting null scores are too low and the
p-values overly optimistic. Use method="permutation" (full-pipeline) for
any claim of statistical significance in publications.
—
7. Paired Model Comparison
See Model Comparison for a full guide. Quick reference:
# Paired permutation test: does model A outperform model B?
paired = result.compare_models_paired("lr", "svm", metric="accuracy")
print(paired[["Difference", "PValue", "Significant"]])

# Full-pipeline paired assessment across two result objects
from coco_pipe.decoding.stats import run_paired_permutation_assessment
df = run_paired_permutation_assessment(
    result_a, result_b, "lr", "accuracy", config=eval_config
)
—
8. Unit of Inference Options
| Value | Aggregation behavior |
|---|---|
| | No aggregation. Each prediction row is treated as independent. |
| | Average probabilities per group, then classify. Recommended for epoch-level EEG. |
| | Majority vote of hard labels per group. |
| | Aggregate by a named column in |
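The probability-averaging and majority-vote rules in the table can give different group-level predictions. The NumPy sketch below uses made-up epoch probabilities for two subjects and hypothetical helper names to show one such case.

```python
import numpy as np

# Illustrative positive-class probabilities for 3 epochs per subject.
subject_ids = np.array(["s1"] * 3 + ["s2"] * 3)
proba = np.array([0.90, 0.40, 0.45,   # s1: one confident epoch
                  0.20, 0.60, 0.55])  # s2: two weakly positive epochs


def aggregate_mean(groups, proba, threshold=0.5):
    """Average probabilities per group, then threshold."""
    return {str(g): int(proba[groups == g].mean() >= threshold)
            for g in np.unique(groups)}


def aggregate_majority(groups, proba, threshold=0.5):
    """Threshold each epoch first, then take the majority vote per group."""
    hard = (proba >= threshold).astype(int)
    return {str(g): int(hard[groups == g].mean() > 0.5)
            for g in np.unique(groups)}


by_mean = aggregate_mean(subject_ids, proba)
by_vote = aggregate_majority(subject_ids, proba)
print(by_mean, by_vote)
```

Averaging retains the model's confidence, so a single highly confident epoch can outweigh two borderline ones, whereas voting treats every epoch equally.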
Model Comparison
After running a decoding experiment with multiple models, coco_pipe.decoding
provides rigorous paired statistical tests to determine whether observed
performance differences are beyond chance. All comparison methods swap model
assignments within subjects to control for subject-specific baseline variance.
—
1. Why Paired Tests?
Independent-sample tests compare two models assuming the samples are drawn independently. In a within-subject decoding design, the same subjects appear in both models’ test folds, making the samples positively correlated. A paired test exploits this correlation to achieve higher statistical power.
Paired permutation test: randomly swap model assignments within each independent unit (subject) and recompute the observed difference. The resulting null distribution represents the expected difference under no true effect.
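Swapping the two models' scores within a subject is equivalent to flipping the sign of that subject's score difference. The sketch below implements this sign-flip test in NumPy on made-up per-subject accuracies; it illustrates the principle and is not the library's implementation.

```python
import numpy as np


def paired_sign_flip_test(scores_a, scores_b, n_permutations=5000, seed=0):
    """Two-sided paired permutation test on per-unit score differences."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = diffs.mean()
    # Swapping models within a unit flips the sign of its difference.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    null = (signs * diffs).mean(axis=1)
    # Small tolerance guards against floating-point ties.
    n_extreme = int(np.sum(np.abs(null) >= abs(observed) - 1e-12))
    p_value = (1 + n_extreme) / (1 + n_permutations)
    return float(observed), p_value


# Illustrative per-subject accuracies for two models on the same 8 subjects.
acc_a = [0.70, 0.68, 0.75, 0.72, 0.66, 0.74, 0.71, 0.69]
acc_b = [0.64, 0.65, 0.70, 0.66, 0.63, 0.69, 0.66, 0.65]
diff, p = paired_sign_flip_test(acc_a, acc_b)
print(diff, p)
```

With 8 subjects there are only 2^8 = 256 distinct sign patterns, so the smallest achievable two-sided p-value is about 2/256; small cohorts limit resolution regardless of the number of permutations.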
—
2. Quick Paired Comparison
For a fast paired comparison using existing outer-fold predictions:
from coco_pipe.decoding import Experiment, ExperimentConfig
from coco_pipe.decoding.configs import ClassicalModelConfig, CVConfig

config = ExperimentConfig(
    task="classification",
    models={
        "lr": ClassicalModelConfig(estimator="LogisticRegression"),
        "svm": ClassicalModelConfig(estimator="SVC"),
    },
    metrics=["accuracy", "roc_auc"],
    cv=CVConfig(strategy="stratified_group_kfold", n_splits=5, group_key="Subject"),
)
result = Experiment(config).run(
    X, y,
    sample_metadata={"Subject": subject_ids, "Session": session_ids},
)
paired = result.compare_models_paired(
    "lr", "svm",
    metric="accuracy",
    unit="Subject",
    n_permutations=5000,
    random_state=42,
)
print(paired[["Metric", "ScoreA", "ScoreB", "Difference", "PValue", "Significant"]])
The returned DataFrame has one row per temporal coordinate (or a single row for scalar metrics), with the following columns:
| Column | Description |
|---|---|
| | Observed score for model A. |
| | Observed score for model B. |
| | |
| | Two-sided p-value from the sign-swap permutation distribution. |
| | Boolean: |
| | Number of independent units used for swapping. |
| | Number of permutations used. |
—
3. Multiple Model Comparison
When comparing more than two models, use compare_models to compare all
pairs with optional multiple-comparison correction:
comparison = result.compare_models(
    metric="accuracy",
    unit="Subject",
    correction="fdr_bh",  # or "bonferroni", "none"
    n_permutations=5000,
)
print(comparison[["ModelA", "ModelB", "Difference", "PValue", "CorrectedPValue"]])
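To show what the two correction options do to a set of raw p-values, here is a NumPy sketch of the standard Benjamini-Hochberg and Bonferroni adjustments. The six p-values are made up (six is the number of pairs among four models), and the function name is hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


raw = [0.001, 0.02, 0.03, 0.04, 0.20, 0.60]  # illustrative raw p-values
bonferroni = np.minimum(np.array(raw) * len(raw), 1.0)
fdr = benjamini_hochberg(raw)
print(bonferroni, fdr)
```

At a 0.05 threshold only the first comparison survives either correction in this example, although four raw p-values were below 0.05; the Benjamini-Hochberg values are never larger than the Bonferroni ones.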
—
4. Full-Pipeline Paired Permutation Test
For rigorous inference where preprocessing, feature selection, and tuning must
be included in the null distribution, use run_paired_permutation_assessment:
from coco_pipe.decoding import Experiment
from coco_pipe.decoding.stats import run_paired_permutation_assessment
from coco_pipe.decoding.configs import StatisticalAssessmentConfig, ChanceAssessmentConfig

# Run two separate experiments with the same CV folds
result_a = Experiment(config_a).run(X, y, sample_metadata=meta)
result_b = Experiment(config_b).run(X, y, sample_metadata=meta)
eval_config = StatisticalAssessmentConfig(
    chance=ChanceAssessmentConfig(
        n_permutations=1000,
        temporal_correction="max_stat",  # for temporal outputs
    ),
    unit_of_inference="sample",
    random_state=42,
)
paired_df = run_paired_permutation_assessment(
    result_a, result_b, "model_name", "accuracy", eval_config
)
Note
The two experiments must have been run with the same outer CV configuration
and the same subjects. The function aligns predictions at the SampleID
level before computing the difference.
—
5. Interpreting Results
Effect Size
The Difference column is the primary effect size. A small but significant
difference is not necessarily scientifically meaningful. Always report both the
magnitude and statistical significance.
Temporal Generalization Comparison
For generalizing decoders, one comparison row is produced per
(TrainTime, TestTime) cell. Apply temporal correction across the matrix:
max_stat controls the family-wise error rate, while fdr_bh controls the false
discovery rate.
Multiple Model Pitfall
If you run K pairwise comparisons without correction at α = 0.05 and no true
differences exist, the expected number of false positives is 0.05 × K. Always
apply correction when comparing more than two models.
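The arithmetic behind this pitfall is easy to reproduce. The sketch below counts the pairwise comparisons among a given number of models and reports the expected false positives at α = 0.05; the family-wise error rate line additionally assumes independent tests, which is a simplification since pairwise comparisons share models. The function name is hypothetical.

```python
from math import comb


def pairwise_error_rates(n_models, alpha=0.05):
    """Return (n_comparisons, expected false positives, FWER if independent)."""
    k = comb(n_models, 2)            # number of model pairs
    expected_fp = alpha * k          # holds when no true differences exist
    fwer = 1 - (1 - alpha) ** k      # assumes independent tests
    return k, expected_fp, fwer


for n_models in (3, 4, 6):
    k, expected_fp, fwer = pairwise_error_rates(n_models)
    print(n_models, k, round(expected_fp, 2), round(fwer, 3))
```

Six models already produce 15 comparisons, so an uncorrected analysis is more likely than not to report at least one spurious difference under these assumptions.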
—
6. Post-Hoc vs Full-Pipeline Comparison
| Method | Speed | Validity |
|---|---|---|
| | Fast (uses existing predictions) | Valid if preprocessing did not use the comparison metric during fitting. |
| | Slow (reruns full CV per permutation) | Fully valid; recommended for publications. |