.. _decoding-concepts: ================================== Scientific Concepts and Principles ================================== This page explains the foundational principles that govern every design decision in ``coco_pipe.decoding``. Understanding these concepts is essential for interpreting results correctly and avoiding common pitfalls in brain decoding. --- 1. Cross-Validation and Data Leakage ===================================== 1.1 Why Outer-Only Scoring Is Insufficient ------------------------------------------- A decoding score is an estimate of *generalization performance* — how well a trained model predicts labels from **unseen** brain data. The critical word is "unseen". If any information from the test partition is visible during training (even implicitly, through preprocessing), the score is optimistically biased. In practice, leakage occurs when: - A scaler is fit on the **whole dataset** then applied fold-locally. - A feature selector's statistics are computed on the full feature matrix. - A hyperparameter is tuned using a validation set that overlaps with the test set. - A class-label encoder is fit on all samples before splitting. ``coco_pipe.decoding`` prevents all of these by construction: every preprocessing transformer is created inside a scikit-learn ``Pipeline`` that is fit **only on the training partition** of each outer fold. The test partition is never touched during training. 1.2 The Outer Cross-Validation Loop ------------------------------------- The outer CV loop is controlled by ``ExperimentConfig.cv``. It defines the *evaluation splits*. For each fold: 1. ``X_train, y_train`` → fit scaler, feature selector, hyperparameter search, calibration, and the final model. 2. ``X_test, y_test`` → predict and score using the fold-trained pipeline. 3. Fold scores, predictions, importances, and diagnostics are stored. The final score (e.g., ``get_detailed_scores()``) is the **average over outer folds**. It is an unbiased estimate of generalization performance — provided independence is respected (see section 2). 1.3 Inner CV Loops ------------------- When hyperparameter tuning (``TuningConfig``) or Sequential Feature Selection (``FeatureSelectionConfig(method='sfs')``) is enabled, an **inner** CV loop operates on the training partition of each outer fold. This inner loop selects the best model configuration without access to the test set. When the outer CV is group-based (e.g., ``group_kfold``), the inner CV is **automatically made group-based** as well. Overriding this requires setting ``allow_nongroup_inner_cv=True`` and explicitly acknowledges the data-leakage trade-off. .. warning:: Mixing group-based outer CV with non-group inner CV can cause test-set group information to leak into model selection, inflating performance. Always use matching group strategies for inner and outer CV when subjects must remain exclusive to test folds. --- 2. Independence and the Unit of Inference ========================================== 2.1 Pseudoreplication in Neural Data -------------------------------------- EEG and MEG experiments typically produce many epochs per subject. If a model is trained and tested on **epochs** from the same subject, the test scores are not independent. Each subject's neural patterns are correlated across epochs, so the effective sample size for statistical inference is the **number of subjects**, not the number of epochs. Using epochs as the unit of inference inflates degrees of freedom and produces incorrect p-values. This is called *pseudoreplication*. 2.2 Group-Based CV -------------------- The correct solution is to ensure all epochs from a given subject belong **exclusively** to either the training set or the test set — never both. This is achieved with ``CVConfig(strategy="group_kfold", group_key="Subject")``. ``coco_pipe.decoding`` accepts two equivalent ways to specify groups: - ``sample_metadata={"Subject": subject_ids}`` with ``cv.group_key="Subject"`` (recommended — keeps metadata tidy and allows downstream subject-level analysis). - ``groups=subject_ids`` (compatibility alias — binds groups directly to the splitter). 2.3 Subject-Level Aggregation for Statistical Tests ------------------------------------------------------ Even with group-based CV, the **predictions** stored after each fold are epoch-level. Before performing a statistical test, predictions must be aggregated to the independent unit (subjects) to restore correct degrees of freedom. ``coco_pipe.decoding.stats.aggregate_predictions_for_inference()`` handles this: - ``unit_of_inference="group_mean"``: soft-vote by averaging subject class probabilities across epochs, then hard-classify. - ``unit_of_inference="group_majority"``: hard-vote by majority of epoch labels. - ``unit_of_inference="sample"``: no aggregation (correct when each row is already an independent subject). The statistical assessment machinery uses this aggregation automatically: .. code-block:: python from coco_pipe.decoding.configs import StatisticalAssessmentConfig, ChanceAssessmentConfig eval_cfg = StatisticalAssessmentConfig( enabled=True, unit_of_inference="group_mean", chance=ChanceAssessmentConfig( method="permutation", n_permutations=1000, ), ) --- 3. Full-Pipeline Permutation Testing ====================================== 3.1 Why "Post-Hoc" Permutations Are Biased -------------------------------------------- The easiest permutation test shuffles labels and scores the **already-fitted** model's predictions. This is fast but biased: it does not reshuffle labels during hyperparameter search, feature selection, or calibration. If any of these steps use the labels (which they all do), the null distribution is too narrow. 3.2 The Correct Null: Full-Pipeline Permutation ------------------------------------------------- The correct null distribution is obtained by rerunning the **complete** training pipeline — scaler, feature selector, inner CV, hyperparameter search, calibration, and the final model fit — on permuted labels. This is what ``ChanceAssessmentConfig(method="permutation")`` does. Each permutation: 1. Shuffles ``y`` within each group (or globally for ``unit="sample"``). 2. Reruns the complete outer CV with the shuffled labels. 3. Aggregates the permuted predictions to the unit of inference. 4. Scores the aggregated permuted predictions. The observed score is then compared against this null distribution: .. math:: p = \frac{\#\{\text{null scores} \geq \text{observed score}\} + 1}{B + 1} where :math:`B` is the number of permutations. 3.3 Multiple Comparison Correction for Temporal Data ------------------------------------------------------ For sliding/generalizing temporal decoders, one p-value is produced per timepoint. Multiple testing correction is required. ``coco_pipe.decoding`` supports: - ``temporal_correction="max_stat"`` (default): permutation-based Max-Stat correction. The null at each timepoint is the *global maximum* of the permutation distribution, yielding family-wise error rate (FWER) control. Recommended for temporal decoding with moderate-to-high correlations. - ``temporal_correction="fdr_bh"``: Benjamini-Hochberg FDR control. - ``temporal_correction="none"``: no correction (exploratory use only). .. math:: p_t^{\text{max\_stat}} = \frac{\#\{B_b : \max_{t'} s_b(t') \geq s(t)\} + 1}{B + 1} where :math:`s(t)` is the observed score at time :math:`t` and :math:`s_b(t')` is the permuted score at any timepoint. --- 4. Probability Calibration ============================ A classifier is *calibrated* if its predicted probability of class 1 matches the empirical fraction of class-1 samples at that probability level. Poor calibration does not affect accuracy but matters for: - Log-loss and Brier score interpretation. - Clinical decision thresholds. - Ensemble averaging across models. ``coco_pipe.decoding`` supports ``sklearn.calibration.CalibratedClassifierCV`` inside the training path. The calibration estimator uses **disjoint inner folds** within each outer training partition, so the test set is never used for calibration fitting. Enabling calibration also makes probability metrics (``log_loss``, ``brier_score``) available for models that do not natively provide ``predict_proba`` (e.g., ``LinearSVC``). --- 5. Feature Importance and Stability ====================================== Feature importances are extracted per fold (when the fitted model supports them) and aggregated: - ``get_feature_importances(fold_level=False)``: mean importance ± std across folds. - ``get_feature_importances(fold_level=True)``: per-fold importances in long form. - ``get_feature_stability()``: proportion of folds in which each feature was selected (for SFS) or had positive importance. .. warning:: Fold-averaged importance is **not** the same as importance computed on the full dataset. Because each fold trains on a subset of subjects, the importance estimate has higher variance than whole-dataset importance. Always report the fold-level standard deviation alongside the mean. --- 6. Temporal Decoding Concepts ================================ 6.1 Sliding Decoding ---------------------- A ``SlidingEstimator`` (MNE) fits one independent model per timepoint. Each model sees the channel-space snapshot at its timepoint across all epochs in the training fold. The result is a score curve over time. - *Assumption*: The most discriminative time window is narrow relative to the total window length. - *Strength*: Identifies when (not just whether) neural representations are discriminative. 6.2 Generalizing Decoding (Temporal Generalization) ------------------------------------------------------ A ``GeneralizingEstimator`` (MNE) fits one model per training timepoint and tests each model at **every** test timepoint. The result is a ``(n_train_times, n_test_times)`` matrix of scores. Off-diagonal entries answer: "Does the representation learned at time :math:`t_1` generalize to predict the label at time :math:`t_2`?" A diagonal band indicates a rapidly changing representation; an extended off-diagonal band indicates a stable neural code. - *Scientific interpretation*: Off-diagonal generalization is evidence of a sustained, format-stable neural representation. - *Statistical note*: The generalizing matrix has ``n_train × n_test`` cells. Temporal correction (Max-Stat) is strongly recommended to control the family-wise error rate.