.. _decoding-inference:

==========================================
Cross-Validation and Statistical Inference
==========================================

.. _decoding-cv:

Cross-Validation Strategies Guide
=================================

The cross-validation strategy is the most consequential choice in a decoding
experiment. It determines whether the performance estimate is statistically
valid, whether group independence is preserved, and whether the inner
model-selection loops can correctly inherit the outer splitting logic.

1. Available Strategies
-----------------------

All strategies are configured via ``CVConfig.strategy``:

.. list-table::
   :header-rows: 1
   :widths: 30 25 20 25

   * - Strategy
     - Group-aware
     - Use case
     - scikit-learn equivalent
   * - ``"stratified"``
     - ❌
     - Balanced class folds (classification)
     - ``StratifiedKFold``
   * - ``"kfold"``
     - ❌
     - Regression, or classification without imbalance
     - ``KFold``
   * - ``"group_kfold"``
     - ✅
     - K folds, subjects exclusive to test
     - ``GroupKFold``
   * - ``"stratified_group_kfold"``
     - ✅
     - K folds, class-balanced, subjects exclusive
     - ``StratifiedGroupKFold``
   * - ``"leave_one_group_out"``
     - ✅
     - Leave-one-subject-out (LOSO)
     - ``LeaveOneGroupOut``
   * - ``"leave_p_out"``
     - ✅
     - Leave-P-subjects-out
     - ``LeavePGroupsOut``
   * - ``"timeseries"``
     - ❌
     - Ordered splits for time-series data
     - ``TimeSeriesSplit``
   * - ``"split"``
     - ❌
     - Single train/test holdout
     - Custom ``ShuffleSplit``
   * - ``"group_shuffle_split"``
     - ✅
     - Randomized group-aware train/test splits
     - ``GroupShuffleSplit``

Set ``CVConfig.auto_reduce_n_splits=True`` to let the splitter shrink
``n_splits`` automatically when a group-aware strategy cannot honor the
requested number of folds (e.g. too few groups).

2. Group-Based Strategies
-------------------------

2.1 When Groups Are Required
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use a group-based strategy whenever your data contains **multiple observations
per independent unit** (e.g., multiple epochs per subject).
Failure to do so means the model trains and tests on data from the **same
subjects**, producing inflated accuracy estimates.

Provide groups via sample metadata:

.. code-block:: python

   from coco_pipe.decoding import Experiment
   from coco_pipe.decoding.configs import CVConfig

   cv = CVConfig(
       strategy="stratified_group_kfold",
       n_splits=5,
       group_key="Subject",  # must match a column in sample_metadata
   )

   # config is an ExperimentConfig that uses cv (see section 7)
   result = Experiment(config).run(
       X, y, sample_metadata={"Subject": subject_ids, "Session": session_ids}
   )

2.2 LOSO (Leave-One-Subject-Out)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

LOSO holds out all epochs from one subject in each fold. It is the most
conservative and the most clinically relevant evaluation strategy, but it
requires as many folds as there are subjects, which can be computationally
expensive.

.. code-block:: python

   cv = CVConfig(strategy="leave_one_group_out", group_key="Subject")

.. note::

   ``leave_one_group_out`` does not accept an ``n_splits`` parameter. The
   number of folds equals the number of unique subjects.

2.3 Leave-P-Subjects-Out
~~~~~~~~~~~~~~~~~~~~~~~~

Leave-P-groups-out holds out ``p`` subjects per fold. With ``p > 1``, each
test fold contains more subjects than in LOSO, but the number of folds grows
substantially (scikit-learn's ``LeavePGroupsOut`` produces one fold per
combination of ``p`` subjects).

.. code-block:: python

   cv = CVConfig(strategy="leave_p_out", n_splits=2, group_key="Subject")
   # leaves 2 groups out per fold

3. Group Propagation to Inner CV
--------------------------------

When the outer CV is group-based, ``coco_pipe.decoding`` automatically
propagates group constraints to all inner CV loops:

- **Hyperparameter tuning** (``TuningConfig``): uses a group-based inner CV by
  default.
- **Sequential Feature Selection** (``FeatureSelectionConfig(method="sfs")``):
  uses a group-based inner CV by default.
- **Calibration** (``CalibrationConfig``): uses a group-based inner calibration
  split by default.

Overriding this requires explicitly setting ``allow_nongroup_inner_cv=True`` on
the relevant config object:

.. code-block:: python

   from coco_pipe.decoding.configs import TuningConfig, CVConfig

   tuning = TuningConfig(
       enabled=True,
       cv=CVConfig(strategy="stratified", n_splits=3),
       allow_nongroup_inner_cv=True,  # explicit acknowledgement of leakage risk
   )

4. Stratified Strategies
------------------------

Stratified strategies ensure that class proportions are approximately equal
across folds. This matters for imbalanced datasets where, without
stratification, some folds might contain no minority-class examples.

- ``"stratified"`` and ``"stratified_group_kfold"`` are only valid for
  classification.
- For regression tasks, use ``"kfold"`` or group-based strategies.

.. code-block:: python

   cv = CVConfig(
       strategy="stratified_group_kfold",
       n_splits=5,
       group_key="Subject",
       random_state=42,  # reproducibility for stratification
   )

5. Holdout Split
----------------

For large datasets or as a quick sanity check, a single train/test holdout
avoids the overhead of K outer folds:

.. code-block:: python

   cv = CVConfig(
       strategy="split",
       test_size=0.2,
       stratify=True,  # stratified split for classification
       random_state=42,
   )

The ``n_splits`` field is ignored for ``"split"``; this strategy always
produces exactly one fold.

6. Time Series Split
--------------------

For EEG/MEG data that is **not epoched** (e.g., continuous recordings), or for
temporal regression, use ``"timeseries"``:

.. code-block:: python

   cv = CVConfig(
       strategy="timeseries",
       n_splits=5,
       test_size=0.2,  # optional, overrides sklearn default
   )

``TimeSeriesSplit`` ensures that training data always comes **before** test
data in time, preventing future data from leaking into the model.

7. Random State and Reproducibility
-----------------------------------

``CVConfig.random_state`` seeds the splitter. For full reproducibility, also
set ``ExperimentConfig.random_state``, which propagates derived seeds to the
CV, tuning, feature selection, and calibration configs via a ``SeedSequence``.

.. code-block:: python

   config = ExperimentConfig(
       task="classification",
       models={"lr": LogisticRegressionConfig()},
       metrics=["accuracy"],
       cv=CVConfig(strategy="stratified", n_splits=5),
       random_state=42,  # propagated to all sub-components
   )

See :ref:`decoding-reproducibility` for the full seed propagation architecture.

.. _decoding-stats:

Statistical Assessment Guide
============================

``coco_pipe.decoding`` cleanly separates **descriptive** CV performance from
**inferential** claims. Descriptive metrics (fold scores, confusion matrices,
curves) are always computed. Inferential statistics are opt-in and require
explicit configuration of ``StatisticalAssessmentConfig``.

1. Two Levels of Assessment
---------------------------

1.1 Descriptive Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Every ``ExperimentResult`` provides fold-level and summary scores without any
statistical testing:

.. code-block:: python

   scores = result.get_detailed_scores()
   print(scores[["Model", "Fold", "Metric", "Value"]])

   # Per-model summary: mean ± std across folds
   summary = scores.groupby(["Model", "Metric"])["Value"].agg(["mean", "std"])

This is the correct starting point for all decoding reports. Always report
fold-level variability alongside the mean.

1.2 Finite-Sample Inferential Assessment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Statistical significance claims require a null distribution.
``coco_pipe.decoding`` supports two null-generation strategies:

.. list-table::
   :header-rows: 1
   :widths: 25 35 40

   * - Method
     - How the null is generated
     - When to use
   * - ``"permutation"``
     - Full outer CV rerun under label permutations.
     - Gold standard. Correct for any preprocessing pipeline.
   * - ``"binomial"``
     - Analytical exact binomial test (with Clopper-Pearson interval) on hard
       accuracy.
     - Only valid for scalar accuracy, one prediction per independent unit.

2. Full-Pipeline Permutation Assessment
---------------------------------------

.. code-block:: python

   from coco_pipe.decoding import Experiment
   from coco_pipe.decoding.configs import (
       ExperimentConfig,
       CVConfig,
       ClassicalModelConfig,
       StatisticalAssessmentConfig,
       ChanceAssessmentConfig,
   )

   config = ExperimentConfig(
       task="classification",
       models={"lr": ClassicalModelConfig(estimator="LogisticRegression")},
       metrics=["accuracy"],
       cv=CVConfig(strategy="stratified_group_kfold", n_splits=5, group_key="Subject"),
       statistical_assessment=StatisticalAssessmentConfig(
           enabled=True,
           unit_of_inference="group_mean",
           chance=ChanceAssessmentConfig(
               method="permutation",
               n_permutations=1000,
           ),
       ),
   )

   result = Experiment(config).run(
       X,
       y,
       sample_metadata={"Subject": subject_ids, "Session": session_ids},
       observation_level="epoch",
   )

   assessment = result.get_statistical_assessment()

The returned DataFrame contains, per model and metric:

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``Observed``
     - Observed score on the true labels.
   * - ``PValue``
     - Empirical p-value from permutation distribution.
   * - ``CorrectedPValue``
     - Multiple-comparison corrected p-value.
   * - ``Significant``
     - Boolean: ``CorrectedPValue <= alpha``.
   * - ``CILower``, ``CIUpper``
     - Bootstrap CI for the observed score.
   * - ``NullMedian``, ``NullLower``, ``NullUpper``
     - Null distribution percentiles.
   * - ``NPermutations``
     - Number of permutations used.
   * - ``NEff``
     - Effective sample size (number of independent units).
   * - ``Time`` / ``TrainTime`` / ``TestTime``
     - Present only for temporal outputs.

2.1 Label Permutation Across Groups
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When ``unit_of_inference="group_mean"`` and ``cv.group_key`` is set, labels are
permuted **across groups**, not within them. This preserves within-subject
epoch correlations in the null distribution, yielding a correctly calibrated
p-value for the group-level null hypothesis.

.. note::

   Permuting only within subjects (shuffling labels among the epochs of a
   single subject) would be the wrong null for testing whether the model
   performs above chance at the **population** level.

3. Binomial Assessment
----------------------

Binomial testing uses the Clopper-Pearson exact interval. It is valid only
when:

- Task is classification.
- Metric is plain ``accuracy``.
- Each independent unit contributes **exactly one** prediction (no aggregation
  needed).
- An explicit chance level ``p0`` is provided.

.. code-block:: python

   # Passed as a keyword argument to ExperimentConfig(...)
   statistical_assessment=StatisticalAssessmentConfig(
       enabled=True,
       chance=ChanceAssessmentConfig(
           method="binomial",
           p0=0.5,  # chance level for binary classification
       ),
       confidence_intervals=ConfidenceIntervalConfig(
           method="clopper_pearson",  # or "wilson"
           alpha=0.05,
       ),
   )

The one-sided p-value, i.e. the probability of observing at least :math:`k`
correct predictions under the chance level :math:`p_0`, is:

.. math::

   p = 1 - F(k - 1; n, p_0)

where :math:`F` is the binomial CDF, :math:`k` is the number of correct
predictions, and :math:`n` is the number of independent observations.

4. Bootstrap Confidence Intervals
---------------------------------

Confidence intervals for any metric can be computed independently of the
permutation test, using non-parametric bootstrap over independent units:

.. code-block:: python

   ci = result.get_bootstrap_confidence_intervals(
       metric="accuracy",
       unit="Subject",  # or "Session", "sample", etc.
       n_bootstraps=2000,
       ci=0.95,
   )

Bootstrap CI is also automatically included in the permutation assessment
output (``CILower``, ``CIUpper`` columns).

5. Temporal Correction Methods
------------------------------

For sliding/generalizing decoders, one p-value is produced per timepoint (or
per train/test time pair), so the p-values must be corrected for multiple
comparisons. Set ``temporal_correction`` in ``ChanceAssessmentConfig``:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Method
     - Description
   * - ``"max_stat"``
     - Permutation Max-Stat (default). FWER control. Builds the null from the
       maximum statistic across all timepoints within each permutation.
       Recommended for temporal data with moderate-to-high positive correlation
       between timepoints.
   * - ``"fdr_bh"``
     - Benjamini-Hochberg FDR. Controls the expected proportion of false
       discoveries. More powerful than Max-Stat, but provides a weaker
       guarantee (FDR rather than FWER control).
   * - ``"none"``
     - No correction. For exploratory analysis only.

6. Lightweight Post-Hoc Diagnostics
-----------------------------------

For quick exploratory inspection without rerunning training:

.. code-block:: python

   # Lightweight label permutation over fixed predictions (fast but biased)
   null = result.get_statistical_assessment(lightweight=True, metric="accuracy")

   # Direct post-hoc permutation (bypasses full retraining)
   from coco_pipe.decoding.stats import assess_post_hoc_permutation

   posthoc = assess_post_hoc_permutation(
       result.raw["lr"], metric="accuracy", n_permutations=500
   )

.. warning::

   Post-hoc permutations that shuffle labels over **fixed** predictions do not
   account for preprocessing, feature selection, or hyperparameter search.
   If any of these steps used the labels, the resulting null distribution is
   too narrow and the p-values can be overly optimistic. Use
   ``method="permutation"`` (full-pipeline) for any claim of statistical
   significance in publications.

7. Paired Model Comparison
--------------------------

See :ref:`decoding-model-comparison` for a full guide. Quick reference:

.. code-block:: python

   # Paired permutation test: does model A outperform model B?
   paired = result.compare_models_paired("lr", "svm", metric="accuracy")
   print(paired[["Difference", "PValue", "Significant"]])

   # Full-pipeline paired assessment across two result objects
   from coco_pipe.decoding.stats import run_paired_permutation_assessment

   df = run_paired_permutation_assessment(
       result_a, result_b, "lr", "accuracy", config=eval_config
   )

8. Unit of Inference Options
----------------------------

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Value
     - Aggregation behavior
   * - ``"sample"``
     - No aggregation. Each prediction row is treated as independent.
   * - ``"group_mean"``
     - Average probabilities per group, then classify. Recommended for
       epoch-level EEG.
   * - ``"group_majority"``
     - Majority vote of hard labels per group.
   * - ``"custom"``
     - Aggregate by a named column in ``sample_metadata``.

.. _decoding-model-comparison:

Model Comparison
================

After running a decoding experiment with multiple models,
``coco_pipe.decoding`` provides rigorous paired statistical tests to determine
whether observed performance differences are beyond chance. All comparison
methods swap model assignments within each independent unit (e.g., subject) to
control for unit-specific baseline variance.

1. Why Paired Tests?
--------------------

Independent-sample tests compare two models assuming the samples are drawn
independently. In a within-subject decoding design, the **same subjects**
appear in both models' test folds, making the two sets of scores positively
correlated. A paired test exploits this correlation to achieve higher
statistical power.

**Paired permutation test**: randomly swap model assignments within each
independent unit (subject) and recompute the difference between the models.
Repeating this many times yields a null distribution of the differences
expected under no true effect.

2. Quick Paired Comparison
--------------------------

For a fast paired comparison using existing outer-fold predictions:

.. code-block:: python

   from coco_pipe.decoding import Experiment, ExperimentConfig
   from coco_pipe.decoding.configs import ClassicalModelConfig, CVConfig

   config = ExperimentConfig(
       task="classification",
       models={
           "lr": ClassicalModelConfig(estimator="LogisticRegression"),
           "svm": ClassicalModelConfig(estimator="SVC"),
       },
       metrics=["accuracy", "roc_auc"],
       cv=CVConfig(strategy="stratified_group_kfold", n_splits=5, group_key="Subject"),
   )

   result = Experiment(config).run(
       X, y, sample_metadata={"Subject": subject_ids, "Session": session_ids}
   )

   paired = result.compare_models_paired(
       "lr",
       "svm",
       metric="accuracy",
       unit="Subject",
       n_permutations=5000,
       random_state=42,
   )
   print(paired[["Metric", "ScoreA", "ScoreB", "Difference", "PValue", "Significant"]])

The returned DataFrame has one row per temporal coordinate (or a single row
for scalar metrics), with the following columns:

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``ScoreA``
     - Observed score for model A.
   * - ``ScoreB``
     - Observed score for model B.
   * - ``Difference``
     - ``ScoreA - ScoreB``.
   * - ``PValue``
     - Two-sided p-value from the sign-swap permutation distribution.
   * - ``Significant``
     - Boolean: ``PValue <= 0.05``.
   * - ``NUnits``
     - Number of independent units used for swapping.
   * - ``NPermutations``
     - Number of permutations used.

3. Multiple Model Comparison
----------------------------

When comparing more than two models, use ``compare_models`` to compare all
pairs with optional multiple-comparison correction:

.. code-block:: python

   comparison = result.compare_models(
       metric="accuracy",
       unit="Subject",
       correction="fdr_bh",  # or "bonferroni", "none"
       n_permutations=5000,
   )
   print(comparison[["ModelA", "ModelB", "Difference", "PValue", "CorrectedPValue"]])

4. Full-Pipeline Paired Permutation Test
----------------------------------------

For rigorous inference where preprocessing, feature selection, and tuning must
be included in the null distribution, use
``run_paired_permutation_assessment``:

.. code-block:: python

   from coco_pipe.decoding import Experiment
   from coco_pipe.decoding.stats import run_paired_permutation_assessment
   from coco_pipe.decoding.configs import StatisticalAssessmentConfig, ChanceAssessmentConfig

   # Run two separate experiments with the same CV folds
   # (config_a, config_b, and meta are defined by the caller)
   result_a = Experiment(config_a).run(X, y, sample_metadata=meta)
   result_b = Experiment(config_b).run(X, y, sample_metadata=meta)

   eval_config = StatisticalAssessmentConfig(
       chance=ChanceAssessmentConfig(
           n_permutations=1000,
           temporal_correction="max_stat",  # for temporal outputs
       ),
       unit_of_inference="sample",
       random_state=42,
   )

   paired_df = run_paired_permutation_assessment(
       result_a, result_b, "model_name", "accuracy", eval_config
   )

.. note::

   The two experiments must have been run with the **same outer CV
   configuration** and the **same subjects**. The function aligns predictions
   at the ``SampleID`` level before computing the difference.

5. Interpreting Results
-----------------------

.. rubric:: Effect Size

The ``Difference`` column is the primary effect size. A small but significant
difference is not necessarily scientifically meaningful. Always report both the
magnitude and statistical significance.

.. rubric:: Temporal Generalization Comparison

For generalizing decoders, one comparison row is produced per
``(TrainTime, TestTime)`` cell. Apply temporal correction across the matrix:
``max_stat`` controls the family-wise error rate, while ``fdr_bh`` controls the
false discovery rate.

.. rubric:: Multiple Model Pitfall

If you run ``K`` pairwise comparisons without correction at a significance
level of 0.05, the expected number of false positives is ``0.05 × K`` when no
true differences exist. Always apply correction when comparing more than two
models.

6. Post-Hoc vs Full-Pipeline Comparison
---------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - Method
     - Speed
     - Validity
   * - ``compare_models_paired``
     - Fast (uses existing predictions)
     - Valid if preprocessing did not use the comparison metric during fitting.
   * - ``run_paired_permutation_assessment``
     - Slow (reruns full CV per permutation)
     - Fully valid; recommended for publications.
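
The sign-swap null behind the fast, post-hoc style of comparison can be illustrated outside the library. The following is a minimal NumPy sketch, not the ``coco_pipe`` implementation: the function name ``paired_sign_flip_test`` and the per-subject accuracies are hypothetical, and it assumes one aggregated score per subject and per model. Swapping the two model assignments within a subject is equivalent to flipping the sign of that subject's score difference.

```python
import numpy as np


def paired_sign_flip_test(scores_a, scores_b, n_permutations=5000, random_state=0):
    """Two-sided paired permutation test on per-unit scores (illustrative only)."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    if scores_a.ndim != 1 or scores_a.shape != scores_b.shape:
        raise ValueError("scores_a and scores_b must be 1-D arrays of equal length")

    rng = np.random.default_rng(random_state)
    diffs = scores_a - scores_b          # one paired difference per unit
    observed = diffs.mean()

    # Swapping model assignments within a unit flips the sign of its difference.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    null = (signs * diffs).mean(axis=1)  # mean difference under each permutation

    # Count permutations at least as extreme as the observed difference.
    # The small tolerance guards against floating-point ties; the add-one
    # correction keeps the empirical p-value strictly positive.
    extreme = np.abs(null) >= np.abs(observed) - 1e-12
    p_value = (1 + extreme.sum()) / (n_permutations + 1)
    return observed, p_value


# Made-up per-subject accuracies, for illustration only
acc_lr = [0.71, 0.68, 0.74, 0.66, 0.70, 0.73, 0.69, 0.72]
acc_svm = [0.65, 0.66, 0.70, 0.60, 0.69, 0.68, 0.64, 0.70]

difference, p_value = paired_sign_flip_test(acc_lr, acc_svm)
print(f"difference={difference:.4f}, p={p_value:.4f}")
```

Because the swaps happen within subjects, between-subject baseline differences never enter the null distribution, which is why the paired test is more sensitive than an independent-sample comparison on the same scores.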