Cross-Validation and Statistical Inference#

Cross-Validation Strategies Guide#

The cross-validation strategy is the most consequential choice in a decoding experiment. It determines whether the performance estimate is statistically valid, whether group independence is preserved, and whether the inner model-selection loops can correctly inherit the outer splitting logic.

1. Available Strategies#

All strategies are configured via CVConfig.strategy:

| Strategy | Group-aware | Use case | scikit-learn equivalent |
|---|---|---|---|
| "stratified" | No | Balanced class folds (classification) | StratifiedKFold |
| "kfold" | No | Regression, or classification without imbalance | KFold |
| "group_kfold" | Yes | K folds, subjects exclusive to test | GroupKFold |
| "stratified_group_kfold" | Yes | K folds, class-balanced, subjects exclusive | StratifiedGroupKFold |
| "leave_one_group_out" | Yes | Leave-one-subject-out (LOSO) | LeaveOneGroupOut |
| "leave_p_out" | Yes | Leave-P-subjects-out | LeavePGroupsOut |
| "timeseries" | No | Ordered splits for time-series data | TimeSeriesSplit |
| "split" | No | Single train/test holdout | Custom ShuffleSplit |
| "group_shuffle_split" | Yes | Randomized group-aware train/test splits | GroupShuffleSplit |

Set CVConfig.auto_reduce_n_splits=True to let the splitter shrink n_splits automatically when a group-aware strategy cannot honor the requested number of folds (e.g. too few groups).

2. Group-Based Strategies#

2.1 When Groups Are Required#

Use a group-based strategy whenever your data contains multiple observations per independent unit (e.g., multiple epochs per subject). Failure to do so means the model trains and tests on data from the same subjects, producing inflated accuracy estimates.

Provide groups via sample metadata:

from coco_pipe.decoding import Experiment
from coco_pipe.decoding.configs import CVConfig

cv = CVConfig(
    strategy="stratified_group_kfold",
    n_splits=5,
    group_key="Subject",    # must match a column in sample_metadata
)

# `config` is an ExperimentConfig built with cv=cv
result = Experiment(config).run(
    X, y,
    sample_metadata={"Subject": subject_ids, "Session": session_ids},
)
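To make the leakage argument concrete, the following standard-library sketch assigns whole subjects to folds and checks that no subject ends up on both sides of a split. The helper `group_kfold_indices` and the toy subject IDs are illustrative assumptions, not part of coco_pipe, and the round-robin assignment is a simplification of what GroupKFold does.

```python
def group_kfold_indices(groups, n_splits):
    """Yield (train_idx, test_idx) so that each group lands in exactly one test fold."""
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of groups")
    # Round-robin assignment of whole groups to folds (illustrative only).
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


# Toy data: 4 subjects with 3 epochs each.
subject_ids = [s for s in ("S1", "S2", "S3", "S4") for _ in range(3)]
splits = list(group_kfold_indices(subject_ids, n_splits=2))

for train, test in splits:
    train_subjects = {subject_ids[i] for i in train}
    test_subjects = {subject_ids[i] for i in test}
    # An empty intersection means no subject leaks across the split.
    print(sorted(test_subjects), "overlap:", train_subjects & test_subjects)
```

A plain K-fold over the 12 epochs would generally place epochs of the same subject in both train and test, which is exactly the situation a group-based strategy rules out.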

2.2 LOSO (Leave-One-Subject-Out)#

LOSO holds out all epochs from one subject in each fold. It is the most conservative and the most clinically relevant evaluation strategy, because every test score comes from a subject the model has never seen. The cost is one fold per subject, which can be computationally expensive.

cv = CVConfig(strategy="leave_one_group_out", group_key="Subject")

Note

leave_one_group_out does not accept an n_splits parameter. The number of folds equals the number of unique subjects.
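As a rough illustration of why the fold count is fixed by the data, here is a minimal standard-library version of leave-one-group-out. The function name and the toy subject IDs are hypothetical; coco_pipe delegates to its own splitter.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx), one fold per unique group."""
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, test


subject_ids = ["S1", "S1", "S2", "S2", "S2", "S3"]
folds = list(leave_one_group_out(subject_ids))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

Three unique subjects produce three folds, regardless of how many epochs each subject contributes.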

2.3 Leave-P-Subjects-Out#

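Leave-P-groups-out holds out p subjects per fold. With p > 1, each test set covers more subjects than in LOSO, but the number of folds grows combinatorially: scikit-learn's LeavePGroupsOut enumerates every subset of p groups, giving C(n, p) folds for n subjects.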

cv = CVConfig(strategy="leave_p_out", n_splits=2, group_key="Subject")  # leaves 2 groups out per fold
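The fold count can be checked before launching an expensive run. The sketch below assumes the behavior of scikit-learn's LeavePGroupsOut, which holds out every subset of p groups once; the subject list is made up for illustration.

```python
from itertools import combinations
from math import comb

subjects = ["S1", "S2", "S3", "S4", "S5"]
p = 2

# Every size-p subset of subjects is held out once.
held_out_sets = list(combinations(subjects, p))
n_folds = len(held_out_sets)
print(n_folds, "folds for", len(subjects), "subjects with p =", p)

# The count grows quickly with the number of subjects.
fold_counts = {n: comb(n, p) for n in (10, 20, 40)}
print(fold_counts)
```

With 40 subjects and p = 2 this already means 780 outer folds, each of which may contain its own inner tuning loop.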

3. Group Propagation to Inner CV#

When the outer CV is group-based, coco_pipe.decoding automatically propagates group constraints to all inner CV loops:

  • Hyperparameter tuning (TuningConfig): uses a group-based inner CV by default.

  • Sequential Feature Selection (FeatureSelectionConfig(method="sfs")): uses a group-based inner CV by default.

  • Calibration (CalibrationConfig): uses a group-based inner calibration split by default.

Overriding this requires explicitly setting allow_nongroup_inner_cv=True on the relevant config object:

from coco_pipe.decoding.configs import TuningConfig, CVConfig

tuning = TuningConfig(
    enabled=True,
    cv=CVConfig(strategy="stratified", n_splits=3),
    allow_nongroup_inner_cv=True,  # explicit acknowledgement of leakage risk
)

4. Stratified Strategies#

Stratified strategies keep class proportions approximately equal across folds. This matters for imbalanced datasets, where some folds might otherwise contain no minority-class examples.

  • "stratified" and "stratified_group_kfold" are only valid for classification.

  • For regression tasks, use "kfold" or group-based strategies.

cv = CVConfig(
    strategy="stratified_group_kfold",
    n_splits=5,
    group_key="Subject",
    random_state=42,    # reproducibility for stratification
)

5. Holdout Split#

For large datasets or as a quick sanity check, a single train/test holdout avoids the overhead of K outer folds:

cv = CVConfig(
    strategy="split",
    test_size=0.2,
    stratify=True,     # stratified split for classification
    random_state=42,
)

The n_splits field is ignored for "split"; this strategy always produces exactly one fold.

6. Time Series Split#

For EEG/MEG data that is not epoched (e.g., continuous recordings), or for temporal regression, use "timeseries":

cv = CVConfig(
    strategy="timeseries",
    n_splits=5,
    test_size=0.2,     # optional, overrides sklearn default
)

TimeSeriesSplit ensures that training data always comes before test data in time, preventing future data from leaking into the model.
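The ordering guarantee can be sketched with an expanding-window splitter written in plain Python. It follows the same idea as TimeSeriesSplit with default settings (no gap, no cap on training size), but the function name and sizes are illustrative assumptions rather than coco_pipe internals.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) with every training index earlier than every test index."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        start = first_test_start + k * test_size
        # Training set grows with each fold; the test block slides forward.
        yield list(range(0, start)), list(range(start, start + test_size))


splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each fold trains on everything before its test block, so no fold ever sees samples recorded after the data it is evaluated on.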

7. Random State and Reproducibility#

CVConfig.random_state seeds the splitter. For full reproducibility, also set ExperimentConfig.random_state, which propagates derived seeds to the CV, tuning, feature selection, and calibration configs via a SeedSequence.

config = ExperimentConfig(
    task="classification",
    models={"lr": LogisticRegressionConfig()},
    metrics=["accuracy"],
    cv=CVConfig(strategy="stratified", n_splits=5),
    random_state=42,    # propagated to all sub-components
)

See Reproducibility Architecture for the full seed propagation architecture.

Statistical Assessment Guide#

coco_pipe.decoding cleanly separates descriptive CV performance from inferential claims. Descriptive metrics (fold scores, confusion matrices, curves) are always computed. Inferential statistics are opt-in and require explicit configuration of StatisticalAssessmentConfig.

1. Two Levels of Assessment#

1.1 Descriptive Performance#

Every ExperimentResult provides fold-level and summary scores without any statistical testing:

scores = result.get_detailed_scores()
print(scores[["Model", "Fold", "Metric", "Value"]])

# Per-model summary: mean ± std across folds
summary = scores.groupby(["Model", "Metric"])["Value"].agg(["mean", "std"])

This is the correct starting point for all decoding reports. Always report fold-level variability alongside the mean.

1.2 Finite-Sample Inferential Assessment#

Statistical significance claims require a null distribution. coco_pipe.decoding supports two null-generation strategies:

| Method | How the null is generated | When to use |
|---|---|---|
| "permutation" | Full outer CV rerun under label permutations. | Gold standard. Correct for any preprocessing pipeline. |
| "binomial" | Analytical Clopper-Pearson interval on hard accuracy. | Only valid for scalar accuracy, one prediction per independent unit. |

2. Full-Pipeline Permutation Assessment#

from coco_pipe.decoding.configs import (
    ExperimentConfig, CVConfig, ClassicalModelConfig,
    StatisticalAssessmentConfig, ChanceAssessmentConfig,
)

config = ExperimentConfig(
    task="classification",
    models={"lr": ClassicalModelConfig(estimator="LogisticRegression")},
    metrics=["accuracy"],
    cv=CVConfig(strategy="stratified_group_kfold", n_splits=5, group_key="Subject"),
    statistical_assessment=StatisticalAssessmentConfig(
        enabled=True,
        unit_of_inference="group_mean",
        chance=ChanceAssessmentConfig(
            method="permutation",
            n_permutations=1000,
        ),
    ),
)

result = Experiment(config).run(
    X, y,
    sample_metadata={"Subject": subject_ids, "Session": session_ids},
    observation_level="epoch",
)

assessment = result.get_statistical_assessment()

The returned DataFrame contains, per model and metric:

| Column | Description |
|---|---|
| Observed | Observed score on the true labels. |
| PValue | Empirical p-value from permutation distribution. |
| CorrectedPValue | Multiple-comparison corrected p-value. |
| Significant | Boolean: CorrectedPValue <= alpha. |
| CILower, CIUpper | Bootstrap CI for the observed score. |
| NullMedian, NullLower, NullUpper | Null distribution percentiles. |
| NPermutations | Number of permutations used. |
| NEff | Effective sample size (number of independent units). |
| Time / TrainTime / TestTime | Present only for temporal outputs. |
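For readers who want to see how an empirical p-value follows from a set of null scores, here is a minimal sketch. The "+1" correction, which counts the observed score as one member of the null, is a common convention and an assumption here; the source does not state which estimator coco_pipe uses. The null scores are simulated stand-ins for scores obtained by rerunning CV on permuted labels.

```python
import random


def empirical_p_value(observed, null_scores):
    """One-sided p-value: share of null scores at least as large as the observed one.

    The +1 in numerator and denominator counts the observed score as one
    member of the null, so the p-value can never be exactly zero.
    """
    n_extreme = sum(score >= observed for score in null_scores)
    return (n_extreme + 1) / (len(null_scores) + 1)


rng = random.Random(0)
# Hypothetical null accuracies centered on chance (0.5).
null_scores = [rng.gauss(0.5, 0.03) for _ in range(999)]
p = empirical_p_value(0.62, null_scores)
print(f"p = {p:.4f}")
```

With this estimator the smallest reachable p-value is 1 / (n_permutations + 1), which is one reason to choose n_permutations with the target alpha in mind.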

2.1 Label Permutation Across Groups#

When unit_of_inference="group_mean" and cv.group_key is set, labels are permuted across groups as whole units, not within them. This preserves within-subject epoch correlations in the null distribution, yielding a correctly calibrated p-value for the group-level null hypothesis.

Note

Permuting only within subjects (swapping epochs within a subject) would be the wrong null for testing whether the model performs above chance at the population level.
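A group-level permutation can be sketched as follows. The example assumes a between-subject design in which every subject carries a single label (for instance patient vs. control); the helper name and data are hypothetical and do not reflect coco_pipe's internal implementation.

```python
import random


def permute_labels_across_groups(labels, groups, rng):
    """Shuffle labels between groups; all rows of a group keep a shared label.

    Assumes every group carries a single label (e.g. patient vs. control).
    """
    label_of = {}
    for label, group in zip(labels, groups):
        if label_of.setdefault(group, label) != label:
            raise ValueError(f"group {group!r} has more than one label")
    unique_groups = list(label_of)
    shuffled = [label_of[g] for g in unique_groups]
    rng.shuffle(shuffled)
    new_label_of = dict(zip(unique_groups, shuffled))
    # Broadcast the permuted group label back to every row of that group.
    return [new_label_of[g] for g in groups]


groups = ["S1", "S1", "S2", "S2", "S3", "S3", "S4", "S4"]
labels = [0, 0, 1, 1, 0, 0, 1, 1]
permuted = permute_labels_across_groups(labels, groups, random.Random(1))
print(permuted)
```

Because epochs of one subject always move together, the permuted datasets keep the same within-subject dependence as the real data.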

3. Binomial Assessment#

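Binomial assessment tests the number of correct predictions against the chance level with an exact binomial test and reports a Clopper-Pearson (exact) interval for the accuracy. It is valid only when: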

  • Task is classification.

  • Metric is plain accuracy.

  • Each independent unit contributes exactly one prediction (no aggregation needed).

  • An explicit chance level p0 is provided.

statistical_assessment=StatisticalAssessmentConfig(
    enabled=True,
    chance=ChanceAssessmentConfig(
        method="binomial",
        p0=0.5,     # chance level for binary classification
    ),
    confidence_intervals=ConfidenceIntervalConfig(
        method="clopper_pearson",  # or "wilson"
        alpha=0.05,
    ),
)

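The one-sided p-value is the probability of observing at least \(k\) correct predictions under the chance level \(p_0\):

\[p = P(K \ge k) = 1 - F(k - 1; n, p_0), \qquad K \sim \mathrm{Binomial}(n, p_0)\]

where \(F\) is the binomial CDF, \(k\) is the number of correct predictions, and \(n\) is the number of independent observations.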
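The formula can be evaluated directly with the standard library by summing the upper tail of the binomial distribution. This is a sketch of the arithmetic only; the function name is hypothetical and it is not the routine coco_pipe calls.

```python
from math import comb


def binomial_p_value(k, n, p0):
    """One-sided exact p-value P(K >= k) for K ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# 8 correct predictions out of 10 independent units, chance level 0.5.
p = binomial_p_value(k=8, n=10, p0=0.5)
print(p)  # 56 / 1024 = 0.0546875
```

Eight out of ten correct is therefore not significant at alpha = 0.05, which illustrates how little evidence a small number of independent units can provide.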

4. Bootstrap Confidence Intervals#

Confidence intervals for any metric can be computed independently of the permutation test, using non-parametric bootstrap over independent units:

ci = result.get_bootstrap_confidence_intervals(
    metric="accuracy",
    unit="Subject",    # or "Session", "sample", etc.
    n_bootstraps=2000,
    ci=0.95,
)

Bootstrap CIs are also included automatically in the permutation assessment output (CILower and CIUpper columns).
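Resampling over independent units, rather than over individual epochs, is the key point of the bootstrap described here. The sketch below uses a simple percentile bootstrap on made-up per-subject accuracies; the function name, the percentile method, and the data are assumptions for illustration, and coco_pipe may use a different interval construction.

```python
import random
from statistics import mean


def bootstrap_ci(unit_scores, n_bootstraps=2000, ci=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-unit scores."""
    rng = random.Random(seed)
    n = len(unit_scores)
    # Resample whole units with replacement and record the mean each time.
    boot_means = sorted(
        mean(rng.choices(unit_scores, k=n)) for _ in range(n_bootstraps)
    )
    alpha = 1.0 - ci
    lower = boot_means[int((alpha / 2) * n_bootstraps)]
    upper = boot_means[int((1 - alpha / 2) * n_bootstraps) - 1]
    return lower, upper


# Hypothetical per-subject accuracies (one value per independent unit).
subject_accuracy = [0.71, 0.64, 0.58, 0.80, 0.69, 0.75, 0.62, 0.66]
lower, upper = bootstrap_ci(subject_accuracy)
print(f"mean = {mean(subject_accuracy):.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Resampling epochs instead of subjects would treat correlated observations as independent and typically produce intervals that are too narrow.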

5. Temporal Correction Methods#

For sliding/generalizing decoders, one p-value per timepoint must be corrected for multiple comparisons. Set temporal_correction in ChanceAssessmentConfig:

| Method | Description |
|---|---|
| "max_stat" | Permutation Max-Stat (default). FWER control. Builds the null from the maximum statistic across all timepoints within each permutation. Recommended for temporal data with moderate-to-high positive correlation between timepoints. |
| "fdr_bh" | Benjamini-Hochberg FDR. Controls the expected proportion of false discoveries. More powerful than Max-Stat but with weaker guarantees. |
| "none" | No correction. For exploratory analysis only. |
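The max-statistic idea can be shown in a few lines: each permutation contributes only its largest score over all timepoints, and every observed timepoint is compared against that single null. The function name and the tiny score arrays are hypothetical; a real analysis would use hundreds or thousands of permutations.

```python
def max_stat_p_values(observed, null_scores):
    """FWER-corrected p-values from the per-permutation maximum over timepoints.

    observed: list of scores, one per timepoint.
    null_scores: list of permutations, each a list of scores per timepoint.
    """
    # One maximum per permutation, taken across all timepoints.
    null_max = [max(perm) for perm in null_scores]
    n_perm = len(null_max)
    return [
        (sum(m >= obs for m in null_max) + 1) / (n_perm + 1)
        for obs in observed
    ]


observed = [0.52, 0.61, 0.70]
null_scores = [
    [0.50, 0.55, 0.48],
    [0.53, 0.49, 0.51],
    [0.47, 0.52, 0.62],
    [0.51, 0.50, 0.54],
]
corrected = max_stat_p_values(observed, null_scores)
print(corrected)  # [1.0, 0.4, 0.2]
```

Because the null already reflects the best score anywhere in the time series, a timepoint is only flagged when it beats what chance achieves across the whole window.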

6. Lightweight Post-Hoc Diagnostics#

For quick exploratory inspection without rerunning training:

# Lightweight label permutation over fixed predictions (fast but biased)
null = result.get_statistical_assessment(lightweight=True, metric="accuracy")

# Direct post-hoc permutation (bypasses full retraining)
from coco_pipe.decoding.stats import assess_post_hoc_permutation
posthoc = assess_post_hoc_permutation(result.raw["lr"], metric="accuracy", n_permutations=500)

Warning

Post-hoc permutations that shuffle labels over fixed predictions do not account for preprocessing, feature selection, or hyperparameter search. If any of these steps used the labels, the resulting null distribution is too narrow and the p-values are overly optimistic. Use method="permutation" (full-pipeline) for any claim of statistical significance in publications.

7. Paired Model Comparison#

See Model Comparison for a full guide. Quick reference:

# Paired permutation test: does model A outperform model B?
paired = result.compare_models_paired("lr", "svm", metric="accuracy")
print(paired[["Difference", "PValue", "Significant"]])

# Full-pipeline paired assessment across two result objects
from coco_pipe.decoding.stats import run_paired_permutation_assessment
df = run_paired_permutation_assessment(
    result_a, result_b, "lr", "accuracy", config=eval_config
)

8. Unit of Inference Options#

| Value | Aggregation behavior |
|---|---|
| "sample" | No aggregation. Each prediction row is treated as independent. |
| "group_mean" | Average probabilities per group, then classify. Recommended for epoch-level EEG. |
| "group_majority" | Majority vote of hard labels per group. |
| "custom" | Aggregate by a named column in sample_metadata. |
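The difference between "group_mean" and "group_majority" shows up when a few confident predictions disagree with many weak ones. The sketch below assumes binary classification with a 0.5 threshold; the helper names are hypothetical, and tie-breaking in the majority vote is left to Counter's ordering rather than modeled on coco_pipe.

```python
from collections import Counter, defaultdict
from statistics import mean


def aggregate_group_mean(probabilities, groups, threshold=0.5):
    """Average positive-class probabilities per group, then threshold."""
    by_group = defaultdict(list)
    for p, g in zip(probabilities, groups):
        by_group[g].append(p)
    return {g: int(mean(ps) >= threshold) for g, ps in by_group.items()}


def aggregate_group_majority(probabilities, groups, threshold=0.5):
    """Threshold each row first, then take the most common hard label per group."""
    by_group = defaultdict(list)
    for p, g in zip(probabilities, groups):
        by_group[g].append(int(p >= threshold))
    return {g: Counter(votes).most_common(1)[0][0] for g, votes in by_group.items()}


# Three epochs for one subject: two weak "class 0" votes, one confident "class 1".
probabilities = [0.45, 0.45, 0.95]
groups = ["S1", "S1", "S1"]
by_mean = aggregate_group_mean(probabilities, groups)
by_majority = aggregate_group_majority(probabilities, groups)
print(by_mean, by_majority)  # {'S1': 1} {'S1': 0}
```

Averaging probabilities keeps the confidence information that a hard vote throws away, which is one reason the two options can disagree on the same subject.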

Model Comparison#

After running a decoding experiment with multiple models, coco_pipe.decoding provides paired statistical tests to determine whether observed performance differences exceed what chance would produce. All comparison methods swap the model assignments within each independent unit (typically the subject) to control for subject-specific baseline variance.

1. Why Paired Tests?#

Independent-sample tests compare two models assuming the samples are drawn independently. In a within-subject decoding design, the same subjects appear in both models’ test folds, making the samples positively correlated. A paired test exploits this correlation to achieve higher statistical power.

Paired permutation test: randomly swap model assignments within each independent unit (subject) and recompute the observed difference. The resulting null distribution represents the expected difference under no true effect.
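Swapping the two models within a unit is equivalent to flipping the sign of that unit's score difference, which makes the test easy to sketch. The function name and the per-subject accuracies below are made up for illustration, and the "+1" p-value estimator is an assumption rather than a documented detail of coco_pipe.

```python
import random
from statistics import mean


def paired_sign_flip_test(scores_a, scores_b, n_permutations=5000, seed=0):
    """Two-sided paired permutation test on per-unit score differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    n_extreme = 0
    for _ in range(n_permutations):
        # Swapping the two models within a unit flips the sign of its difference.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed):
            n_extreme += 1
    p_value = (n_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical per-subject accuracies for two models on the same subjects.
acc_lr = [0.70, 0.66, 0.74, 0.69, 0.72, 0.68, 0.75, 0.71]
acc_svm = [0.66, 0.65, 0.70, 0.66, 0.69, 0.67, 0.70, 0.68]
difference, p_value = paired_sign_flip_test(acc_lr, acc_svm)
print(f"difference = {difference:.3f}, p = {p_value:.4f}")
```

The mean difference is only 0.03, yet it is consistent across all eight subjects, which is the kind of effect a paired test can detect and an unpaired test would likely miss.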

2. Quick Paired Comparison#

For a fast paired comparison using existing outer-fold predictions:

from coco_pipe.decoding import Experiment, ExperimentConfig
from coco_pipe.decoding.configs import ClassicalModelConfig, CVConfig

config = ExperimentConfig(
    task="classification",
    models={
        "lr": ClassicalModelConfig(estimator="LogisticRegression"),
        "svm": ClassicalModelConfig(estimator="SVC"),
    },
    metrics=["accuracy", "roc_auc"],
    cv=CVConfig(strategy="stratified_group_kfold", n_splits=5, group_key="Subject"),
)

result = Experiment(config).run(
    X, y,
    sample_metadata={"Subject": subject_ids, "Session": session_ids}
)

paired = result.compare_models_paired(
    "lr", "svm",
    metric="accuracy",
    unit="Subject",
    n_permutations=5000,
    random_state=42,
)

print(paired[["Metric", "ScoreA", "ScoreB", "Difference", "PValue", "Significant"]])

The returned DataFrame has one row per temporal coordinate (or a single row for scalar metrics), with the following columns:

| Column | Description |
|---|---|
| ScoreA | Observed score for model A. |
| ScoreB | Observed score for model B. |
| Difference | ScoreA - ScoreB. |
| PValue | Two-sided p-value from the sign-swap permutation distribution. |
| Significant | Boolean: PValue <= 0.05. |
| NUnits | Number of independent units used for swapping. |
| NPermutations | Number of permutations used. |

3. Multiple Model Comparison#

When comparing more than two models, use compare_models to compare all pairs with optional multiple-comparison correction:

comparison = result.compare_models(
    metric="accuracy",
    unit="Subject",
    correction="fdr_bh",       # or "bonferroni", "none"
    n_permutations=5000,
)

print(comparison[["ModelA", "ModelB", "Difference", "PValue", "CorrectedPValue"]])
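For intuition about what an "fdr_bh" correction does to a set of raw p-values, here is a compact Benjamini-Hochberg adjustment in plain Python. It is a generic sketch of the procedure with made-up p-values, not the code path coco_pipe uses.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
corrected = benjamini_hochberg(raw)
print(corrected)  # approximately [0.03, 0.04, 0.04]
```

Each adjusted value is the smallest FDR level at which that comparison would still be declared significant.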

4. Full-Pipeline Paired Permutation Test#

For rigorous inference where preprocessing, feature selection, and tuning must be included in the null distribution, use run_paired_permutation_assessment:

from coco_pipe.decoding.stats import run_paired_permutation_assessment
from coco_pipe.decoding.configs import StatisticalAssessmentConfig, ChanceAssessmentConfig

# Run two separate experiments with the same CV folds
result_a = Experiment(config_a).run(X, y, sample_metadata=meta)
result_b = Experiment(config_b).run(X, y, sample_metadata=meta)

eval_config = StatisticalAssessmentConfig(
    chance=ChanceAssessmentConfig(
        n_permutations=1000,
        temporal_correction="max_stat",   # for temporal outputs
    ),
    unit_of_inference="sample",
    random_state=42,
)

paired_df = run_paired_permutation_assessment(
    result_a, result_b, "model_name", "accuracy", eval_config
)

Note

The two experiments must have been run with the same outer CV configuration and the same subjects. The function aligns predictions at the SampleID level before computing the difference.

5. Interpreting Results#

Effect Size

The Difference column is the primary effect size. A small but significant difference is not necessarily scientifically meaningful. Always report both the magnitude and statistical significance.

Temporal Generalization Comparison

For generalizing decoders, one comparison row is produced per (TrainTime, TestTime) cell. Apply temporal correction across the matrix: max_stat controls the family-wise error rate, while fdr_bh controls the false discovery rate.

Multiple Model Pitfall

If you run K pairwise comparisons without correction at alpha = 0.05 and no model truly differs from another, the expected number of false positives is 0.05 × K. Always apply correction when comparing more than two models.
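The arithmetic behind this pitfall is short enough to check by hand. The sketch below uses five models as an arbitrary example; the family-wise error rate line assumes independent tests, which pairwise comparisons on shared data are not, so treat it as a rough guide only.

```python
from math import comb

alpha = 0.05
n_models = 5

# Number of pairwise comparisons among the models.
n_comparisons = comb(n_models, 2)

# Expected false positives if every null hypothesis is true.
expected_false_positives = alpha * n_comparisons

# Chance of at least one false positive, assuming independent tests.
fwer_if_independent = 1 - (1 - alpha) ** n_comparisons

# Bonferroni per-comparison threshold that bounds the FWER at alpha.
bonferroni_threshold = alpha / n_comparisons

print(n_comparisons, expected_false_positives,
      round(fwer_if_independent, 3), bonferroni_threshold)
```

Five models already imply ten comparisons, so an uncorrected analysis has a substantial chance of reporting at least one spurious difference.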

6. Post-Hoc vs Full-Pipeline Comparison#

| Method | Speed | Validity |
|---|---|---|
| compare_models_paired | Fast (uses existing predictions) | Valid if preprocessing did not use the comparison metric during fitting. |
| run_paired_permutation_assessment | Slow (reruns full CV per permutation) | Fully valid; recommended for publications. |