coco_pipe.decoding.stats#
Finite-sample statistical assessment for decoding results.
This module separates descriptive performance from inferential claims. The default inferential path reruns the complete decoding experiment under label permutations so learned preprocessing, feature selection, tuning, and calibration remain inside each null pipeline.
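The idea of rerunning the whole pipeline under permuted labels can be illustrated with a toy example. The sketch below is not coco_pipe's implementation: `loo_accuracy` and `full_pipeline_permutation_test` are hypothetical names, and the "pipeline" is a deliberately small leave-one-out nearest-class-mean classifier written with the standard library only. The point it shows is that every learned quantity (here, the class means) is refit inside each null run.

```python
import random


def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-class-mean classifier.

    The class means are re-learned inside every fold, so nothing fitted on
    the held-out sample influences its own prediction.
    """
    correct = 0
    for i in range(len(X)):
        train = [(x, t) for j, (x, t) in enumerate(zip(X, y)) if j != i]
        means = {}
        for label in set(t for _, t in train):
            values = [x for x, t in train if t == label]
            means[label] = sum(values) / len(values)
        pred = min(means, key=lambda label: abs(X[i] - means[label]))
        correct += pred == y[i]
    return correct / len(X)


def full_pipeline_permutation_test(X, y, n_permutations=200, seed=0):
    """Upper-tail permutation p-value; the pipeline is rerun per permutation."""
    rng = random.Random(seed)
    observed = loo_accuracy(X, y)
    shuffled = list(y)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Fitting and scoring both happen again on the permuted labels.
        if loo_accuracy(X, shuffled) >= observed:
            exceed += 1
    # The +1 terms count the observed labelling as one valid permutation.
    return observed, (exceed + 1) / (n_permutations + 1)


# Toy data (assumed for illustration): two well-separated 1-D classes.
X = [0.1, 0.3, 0.2, 0.4, 0.25, 0.35, 1.1, 1.3, 1.2, 1.4, 1.25, 1.35]
y = [0] * 6 + [1] * 6
observed, p_value = full_pipeline_permutation_test(X, y)
```

With a real pipeline, the body of `loo_accuracy` would be replaced by the complete cross-validated experiment, including preprocessing, feature selection, and tuning.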
Attributes#
Functions#
| benjamini_hochberg | Correct a family of p-values with the Benjamini-Hochberg procedure. |
| correct_sweep_pvalues | Apply BH-FDR independently within configured decoding sweep families. |
| aggregate_predictions_for_inference | Aggregate prediction rows to independent units for inference. |
| binomial_accuracy_test | Exact upper-tail binomial test for top-1 classification accuracy. |
| run_statistical_assessment | Orchestrate the statistical assessment of experiment results. |
| run_paired_permutation_assessment | Run a paired permutation test to compare two experimental results. |
| assess_post_hoc_permutation | Perform a post-hoc label permutation assessment on out-of-fold predictions. |
| assess_paired_comparison | Perform a paired permutation test between two models. |
| assess_bootstrap_ci | Estimate uncertainty of a metric via bootstrapping over units. |
| apply_multiple_comparison_correction | Apply multiple comparison correction to a DataFrame of results. |
Module Contents#
- coco_pipe.decoding.stats.logger#
- coco_pipe.decoding.stats.benjamini_hochberg(p_values, *, alpha=0.05)#
Correct a family of p-values with the Benjamini-Hochberg procedure.
- Parameters:
p_values (collections.abc.Sequence[float])
alpha (float)
- Return type:
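As a minimal sketch of the Benjamini-Hochberg step-up procedure, assuming plain Python lists as input: the helper name `bh_adjust` and its return shape (adjusted p-values plus rejection flags) are assumptions made for illustration, not the documented return type of `benjamini_hochberg`.

```python
def bh_adjust(p_values, alpha=0.05):
    """Return (adjusted_p_values, reject_flags) in the original input order."""
    m = len(p_values)
    if m == 0:
        return [], []
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    reject = [q <= alpha for q in adjusted]
    return adjusted, reject


adjusted, reject = bh_adjust([0.01, 0.04, 0.03, 0.20])
```

Here only the smallest p-value survives at `alpha=0.05`: its adjusted value is 0.01 × 4 / 1 = 0.04, while the next two share the adjusted value 0.04 × 4 / 3 ≈ 0.053.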
- coco_pipe.decoding.stats.correct_sweep_pvalues(results, *, p_column='p_value', family_columns=('target', 'input_mode', 'analysis_mode', 'model', 'selection_mode'), alpha=0.05)#
Apply BH-FDR independently within configured decoding sweep families.
- Parameters:
results (pandas.DataFrame)
p_column (str)
family_columns (collections.abc.Sequence[str])
alpha (float)
- Return type:
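Correcting "independently within families" means the number of tests used in the correction is the size of each family, not the size of the whole table. The sketch below illustrates this with a list of dictionaries instead of a DataFrame. The function `correct_within_families`, the output key `p_adjusted`, and the example rows are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def correct_within_families(rows, family_keys, p_key="p_value"):
    """Apply BH separately to each group of rows sharing the family keys."""
    families = defaultdict(list)
    for pos, row in enumerate(rows):
        families[tuple(row[k] for k in family_keys)].append(pos)
    out = [dict(row) for row in rows]  # copy so the input is not mutated
    for positions in families.values():
        adj = bh_adjust([rows[pos][p_key] for pos in positions])
        for pos, q in zip(positions, adj):
            out[pos]["p_adjusted"] = q
    return out


rows = [
    {"model": "LR", "time": 0, "p_value": 0.01},
    {"model": "LR", "time": 1, "p_value": 0.04},
    {"model": "SVM", "time": 0, "p_value": 0.03},
    {"model": "SVM", "time": 1, "p_value": 0.20},
]
corrected = correct_within_families(rows, family_keys=("model",))
```

Each model forms a family of two tests here, so the smallest p-value in each family is multiplied by 2 rather than by 4.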
- coco_pipe.decoding.stats.TEMPORAL_COLUMNS = ['Time', 'TrainTime', 'TestTime']#
- coco_pipe.decoding.stats.aggregate_predictions_for_inference(predictions, metric, task='classification', unit_of_inference='sample', custom_unit_column=None, custom_aggregation='mean', require_single_prediction=False)#
Aggregate prediction rows to independent units for inference.
This ensures that each independent unit (e.g., a subject or a specific trial) contributes exactly one prediction per temporal coordinate to the statistical test, preventing pseudoreplication.
Scientific Rationale#
Inferential statistics assume independence between observations. In EEG/MEG, multiple epochs from the same subject are correlated. By aggregating predictions to the ‘subject’ level before calculating p-values, we ensure the degrees of freedom in the test reflect the number of independent biological units rather than the number of recorded segments.
- param predictions:
Raw predictions from the experiment.
- type predictions:
pd.DataFrame
- param metric:
The metric for which the aggregation is tailored (e.g., ‘accuracy’).
- type metric:
str
- param task:
Task type (‘classification’ or ‘regression’).
- type task:
str, default=’classification’
- param unit_of_inference:
The level at which independence is assumed (‘sample’, ‘subject’, or ‘custom’).
- type unit_of_inference:
str, default=’sample’
- param custom_unit_column:
Column name in metadata to use as the independence unit if unit_of_inference is ‘custom’.
- type custom_unit_column:
str, optional
- param custom_aggregation:
Aggregation mode (‘mean’ or ‘majority’).
- type custom_aggregation:
str, default=’mean’
- param require_single_prediction:
If True, ensures that each unit has exactly one prediction per coordinate.
- type require_single_prediction:
bool, default=False
- returns:
aggregated_df – Aggregated predictions with an ‘InferentialUnitID’ column.
- rtype:
pd.DataFrame
- raises ValueError:
If the unit column is missing or aggregation is incompatible with the task.
Examples
>>> import pandas as pd
>>> from coco_pipe.decoding.stats import aggregate_predictions_for_inference
>>> df = pd.DataFrame(
...     {
...         "Subject": ["S1", "S1"],
...         "y_true": [1, 1],
...         "y_pred": [1, 0],
...         "SampleID": [0, 1],
...     }
... )
>>> res = aggregate_predictions_for_inference(
...     df, "accuracy", unit_of_inference="Subject"
... )
See also
ExperimentResult.get_predictions: Tidy prediction accessor.
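The aggregation idea can be sketched without the library. The example below collapses several prediction rows per subject into one row by majority vote, so a later test sees one observation per independent unit. The function `aggregate_by_unit`, the output keys, and the assumption that each unit carries a single true label are illustrative choices, not the behaviour of `aggregate_predictions_for_inference`.

```python
from collections import Counter, defaultdict


def aggregate_by_unit(rows, unit_key="Subject"):
    """Collapse prediction rows to one majority-vote row per unit."""
    by_unit = defaultdict(list)
    for row in rows:
        by_unit[row[unit_key]].append(row)
    aggregated = []
    for unit in sorted(by_unit):
        unit_rows = by_unit[unit]
        votes = Counter(r["y_pred"] for r in unit_rows)
        top = max(votes.values())
        # Ties are broken by the smallest label to keep the result deterministic.
        y_pred = min(label for label, count in votes.items() if count == top)
        aggregated.append(
            {
                "unit": unit,
                # Assumption: every row of a unit shares the same true label.
                "y_true": unit_rows[0]["y_true"],
                "y_pred": y_pred,
                "n_rows": len(unit_rows),
            }
        )
    return aggregated


rows = [
    {"Subject": "S1", "y_true": 1, "y_pred": 1},
    {"Subject": "S1", "y_true": 1, "y_pred": 1},
    {"Subject": "S1", "y_true": 1, "y_pred": 0},
    {"Subject": "S2", "y_true": 0, "y_pred": 0},
    {"Subject": "S2", "y_true": 0, "y_pred": 1},
    {"Subject": "S2", "y_true": 0, "y_pred": 1},
]
aggregated = aggregate_by_unit(rows)
```

Six correlated rows become two independent observations, which is the sample size a subject-level test should use.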
- coco_pipe.decoding.stats.binomial_accuracy_test(y_true, y_pred, p0, alpha=0.05, ci_method='wilson')#
Exact upper-tail binomial test for top-1 classification accuracy.
This test computes the probability of obtaining at least the observed number of correct predictions under the null hypothesis (theoretical chance).
Scientific Rationale#
For classification tasks with a known number of categories, the number of correct predictions among n independent predictions follows a Binomial distribution B(n, p0) under the null hypothesis. Because the test is exact, it does not rely on the normal approximation used by z-tests and therefore remains valid for small sample sizes.
- param y_true:
Actual ground-truth labels.
- type y_true:
Sequence[Any]
- param y_pred:
Predicted labels.
- type y_pred:
Sequence[Any]
- param p0:
The theoretical chance level (e.g., 0.5 for binary classification).
- type p0:
float
- param alpha:
Significance level for p-values and confidence intervals.
- type alpha:
float, default=0.05
- param ci_method:
Method for calculating confidence intervals (‘wilson’ or ‘clopper_pearson’).
- type ci_method:
str, default=’wilson’
- returns:
result – Dictionary containing ‘observed’ accuracy, ‘p_value’, ‘n_eff’, ‘chance_threshold’, and confidence intervals.
- rtype:
dict
- raises ValueError:
If p0 is missing or input arrays are empty.
Examples
>>> from coco_pipe.decoding.stats import binomial_accuracy_test
>>> res = binomial_accuracy_test([1, 0, 1], [1, 1, 1], p0=0.5)
>>> print(res["p_value"])
See also
run_statistical_assessment: Full-pipeline assessment driver.
- Parameters:
y_true (collections.abc.Sequence[Any])
y_pred (collections.abc.Sequence[Any])
p0 (float | None)
alpha (float)
ci_method (str)
- Return type:
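The two quantities described above, the exact upper-tail p-value and a Wilson confidence interval, can be computed with the standard library alone. This is a sketch of the underlying formulas, not the code of `binomial_accuracy_test`. The helper names and the example labels are assumptions, and the Wilson interval uses the fixed normal quantile for a 95% level.

```python
from math import comb, sqrt


def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def wilson_interval(k, n, z=1.959963984540054):
    """Wilson score interval for k successes in n trials (z fixed for 95%)."""
    phat = k / n
    denom = 1 + z**2 / n
    centre = (phat + z**2 / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
n = len(y_true)
k = sum(t == p for t, p in zip(y_true, y_pred))  # number of correct predictions
p_value = binomial_upper_tail(k, n, p0=0.5)
low, high = wilson_interval(k, n)
```

With 9 of 10 predictions correct and `p0=0.5`, the upper tail is (10 + 1) / 1024 ≈ 0.0107, and the Wilson interval is roughly (0.60, 0.98).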
- coco_pipe.decoding.stats.run_statistical_assessment(observed_result, experiment_config, X, y, groups, sample_ids, sample_metadata, feature_names, time_axis, observation_level, inferential_unit)#
Orchestrate the statistical assessment of experiment results.
Resolves the chosen statistical method (binomial or permutation) and dispatches analysis for each model and metric.
Scientific Rationale#
Establishing statistical significance in decoding is non-trivial because of temporal autocorrelation and multiple comparisons. This orchestrator runs either analytical binomial tests (fast, theoretical chance level) or full-pipeline permutation tests (rigorous, empirical null) to support inferential claims about model performance.
- param observed_result:
The result of the actual experiment run.
- type observed_result:
ExperimentResult
- param experiment_config:
The full configuration of the experiment.
- type experiment_config:
ExperimentConfig
- param X:
The raw feature array.
- type X:
np.ndarray
- param y:
The raw target vector.
- type y:
np.ndarray
- param groups:
CV grouping vector.
- type groups:
np.ndarray, optional
- param sample_ids:
Unique identifiers for samples.
- type sample_ids:
np.ndarray
- param sample_metadata:
Metadata for unit resolution.
- type sample_metadata:
pd.DataFrame, optional
- param feature_names:
Names of input features.
- type feature_names:
list of str, optional
- param time_axis:
Time coordinates for temporal data.
- type time_axis:
np.ndarray, optional
- param observation_level:
Level of input rows (‘sample’ or ‘epoch’).
- type observation_level:
str
- param inferential_unit:
Definition of statistical independence (‘sample’ or ‘subject’).
- type inferential_unit:
str
- returns:
assessment_payload – Summary containing ‘rows’ (standardized results) and ‘nulls’.
- rtype:
dict
Examples
>>> # Internal use within Experiment.run()
>>> # res = run_statistical_assessment(observed, config, X, y, ...)
See also
binomial_accuracy_test: Core analytical test.
assess_post_hoc_permutation: Fast post-hoc alternative.
- Parameters:
observed_result (Any)
experiment_config (Any)
X (numpy.ndarray)
y (numpy.ndarray)
groups (numpy.ndarray | None)
sample_ids (numpy.ndarray)
sample_metadata (pandas.DataFrame | None)
feature_names (collections.abc.Sequence[str] | None)
time_axis (numpy.ndarray | None)
observation_level (str)
inferential_unit (str)
- Return type:
- coco_pipe.decoding.stats.run_paired_permutation_assessment(results_a, results_b, model, metric, config)#
Run a paired permutation test to compare two experimental results.
Tests the null hypothesis that the difference between two models is zero by randomly swapping model labels within each independent unit.
Scientific Rationale#
This function performs a rigorous comparison of two experimental pipelines by aligning predictions at the ‘SampleID’ level and performing within-unit label swaps. This ensures that the comparison is not biased by unit-specific performance baselines and correctly estimates the p-value for the observed performance delta.
- param results_a:
Result of the first experiment in the comparison.
- type results_a:
ExperimentResult
- param results_b:
Result of the second experiment in the comparison.
- type results_b:
ExperimentResult
- param model:
The name of the model to compare.
- type model:
str
- param metric:
Metric to use for the comparison.
- type metric:
str
- param config:
Configuration for permutations and significance.
- type config:
StatisticalAssessmentConfig
- returns:
paired_df – DataFrame with Difference, PValue, and confidence intervals.
- rtype:
pd.DataFrame
Examples
>>> # diff = run_paired_permutation_assessment(res1, res2, 'LR', 'accuracy', config)
See also
assess_paired_comparison: Fast post-hoc alternative.
- Parameters:
results_a (coco_pipe.decoding.ExperimentResult)
results_b (coco_pipe.decoding.ExperimentResult)
model (str)
metric (str)
config (coco_pipe.decoding.configs.StatisticalAssessmentConfig)
- Return type:
- coco_pipe.decoding.stats.assess_post_hoc_permutation(res, metric='accuracy', unit=None, n_permutations=1000, random_state=None)#
Perform a post-hoc label permutation assessment on out-of-fold predictions.
Shuffles labels relative to fixed predictions to estimate the null distribution under exchangeability.
Scientific Rationale#
Unlike full-pipeline permutations, post-hoc permutations do not rerun feature selection or tuning. This makes them significantly faster but potentially over-optimistic if those steps ‘leaked’ label information. However, if the independence unit (e.g., subject) is respected during the shuffle, it provides a valid test of whether the model’s predictions are significantly associated with the labels beyond chance.
- param res:
Result dictionary from ExperimentResult.raw.
- type res:
dict
- param metric:
The metric to evaluate.
- type metric:
str, default=’accuracy’
- param unit:
Level of independence (e.g., ‘subject’).
- type unit:
str, optional
- param n_permutations:
Number of null permutations.
- type n_permutations:
int, default=1000
- param random_state:
Seed for reproducibility.
- type random_state:
int, optional
- returns:
posthoc_df – DataFrame with Observed score, PValue, and Significant status.
- rtype:
pd.DataFrame
Examples
>>> # posthoc = assess_post_hoc_permutation(res.raw['LR'], metric='accuracy')
See also
run_statistical_assessment: Full-pipeline assessment driver.
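A post-hoc label permutation can be sketched in a few lines: the predictions stay fixed, the true labels are shuffled, and the metric is recomputed each time. The sketch assumes every row is an exchangeable unit and uses accuracy as the metric. `post_hoc_label_permutation` is a hypothetical name, and the `unit` handling of `assess_post_hoc_permutation` is not reproduced here.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def post_hoc_label_permutation(y_true, y_pred, n_permutations=1000, seed=0):
    """Upper-tail permutation p-value with predictions held fixed."""
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    shuffled = list(y_true)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the label/prediction association
        if accuracy(shuffled, y_pred) >= observed:
            exceed += 1
    # The +1 terms count the observed labelling as one valid permutation.
    return observed, (exceed + 1) / (n_permutations + 1)


# Toy data (assumed): 20 balanced labels, 18 predicted correctly.
y_true = [0] * 10 + [1] * 10
y_pred = list(y_true)
y_pred[3] = 1
y_pred[15] = 0
observed, p_value = post_hoc_label_permutation(y_true, y_pred)
```

No model is refit inside the loop, which is why this variant is fast and also why it cannot detect label leakage that happened during fitting.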
- coco_pipe.decoding.stats.assess_paired_comparison(merged, metric='accuracy', unit=None, n_permutations=1000, random_state=None)#
Perform a paired permutation test between two models.
Tests the null hypothesis that the difference between two models is zero by randomly swapping model labels within each independent unit.
Scientific Rationale#
To compare two models (A and B), we test if the observed difference in performance is greater than what would be expected by chance if the labels ‘A’ and ‘B’ were interchangeable. By swapping labels within units (e.g., within subject), we control for subject-specific performance baselines and focus on the model-driven variance.
- param merged:
Merged prediction frame with suffixes ‘_A’ and ‘_B’.
- type merged:
pd.DataFrame
- param metric:
Metric to evaluate.
- type metric:
str, default=’accuracy’
- param unit:
Level of independence (e.g., ‘subject’).
- type unit:
str, optional
- param n_permutations:
Number of permutations.
- type n_permutations:
int, default=1000
- param random_state:
Seed for reproducibility.
- type random_state:
int, optional
- returns:
comparison_df – DataFrame with ScoreA, ScoreB, Difference, and PValue.
- rtype:
pd.DataFrame
Examples
>>> # comp = assess_paired_comparison(merged_df, metric='accuracy')
See also
run_paired_permutation_assessment: Full-pipeline paired comparison.
- Parameters:
merged (pandas.DataFrame)
metric (str)
unit (str | None)
n_permutations (int)
random_state (int | None)
- Return type:
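Swapping the labels "A" and "B" within a unit is equivalent to flipping the sign of that unit's score difference, which gives a compact way to sketch the paired test. The example assumes one score per unit and model (for instance, per-subject accuracy) and a two-sided alternative. `paired_sign_flip_test` and the example scores are hypothetical, not taken from the library.

```python
import random


def paired_sign_flip_test(scores_a, scores_b, n_permutations=1000, seed=0):
    """Two-sided paired permutation test on per-unit score differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    exceed = 0
    for _ in range(n_permutations):
        # Swapping A and B within a unit flips the sign of its difference.
        null = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(null) >= abs(observed) - 1e-12:
            exceed += 1
    return observed, (exceed + 1) / (n_permutations + 1)


# Toy per-subject accuracies (assumed): model A beats model B in every unit.
scores_a = [0.80, 0.75, 0.90, 0.85, 0.70, 0.88, 0.79, 0.83, 0.91, 0.77, 0.86, 0.82]
scores_b = [0.70, 0.68, 0.81, 0.80, 0.61, 0.80, 0.70, 0.78, 0.80, 0.70, 0.79, 0.71]
difference, p_value = paired_sign_flip_test(scores_a, scores_b)
_, p_same = paired_sign_flip_test(scores_a, scores_a)
```

Comparing a model with itself gives a p-value of 1, since every sign flip reproduces the observed difference of zero.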
- coco_pipe.decoding.stats.assess_bootstrap_ci(res, metric='accuracy', unit=None, n_bootstraps=1000, ci=0.95, random_state=None)#
Estimate uncertainty of a metric via bootstrapping over units.
This function computes the observed metric on the provided results and then generates a distribution of scores by resampling independent units with replacement.
Scientific Rationale#
Bootstrapping provides a non-parametric estimate of the sampling distribution of the metric. By resampling at the ‘unit’ level (e.g., subjects rather than individual trials), we account for within-unit correlations and avoid pseudoreplication, ensuring that the confidence intervals accurately reflect the uncertainty at the intended level of inference.
- param res:
Result dictionary for a single model from ExperimentResult.raw.
- type res:
dict
- param metric:
Metric to evaluate.
- type metric:
str, default=’accuracy’
- param unit:
The level of independence (e.g., ‘subject’).
- type unit:
str, optional
- param n_bootstraps:
Number of bootstrap iterations.
- type n_bootstraps:
int, default=1000
- param ci:
Confidence level (0.95 for 95% intervals).
- type ci:
float, default=0.95
- param random_state:
Seed for reproducibility.
- type random_state:
int, optional
- returns:
bootstrap_df – DataFrame with estimate, CILower, and CIUpper.
- rtype:
pd.DataFrame
Examples
>>> # ci_df = assess_bootstrap_ci(res.raw['LR'], unit='subject')
See also
binomial_accuracy_test: Analytical CI alternative.
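Resampling at the unit level can be sketched as a percentile bootstrap over per-unit scores. The sketch assumes the metric is the mean of one score per unit (for example, per-subject accuracy). The library may instead recompute the metric on the pooled rows of the resampled units, so treat `bootstrap_ci_over_units` as an illustrative stand-in.

```python
import random


def bootstrap_ci_over_units(unit_scores, n_bootstraps=1000, ci=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-unit scores."""
    rng = random.Random(seed)
    n = len(unit_scores)
    estimate = sum(unit_scores) / n
    boot_means = []
    for _ in range(n_bootstraps):
        # Draw n units with replacement; each unit is kept or dropped whole.
        resample = [rng.choice(unit_scores) for _ in range(n)]
        boot_means.append(sum(resample) / n)
    boot_means.sort()
    tail = (1.0 - ci) / 2.0
    lower = boot_means[int(tail * n_bootstraps)]
    upper = boot_means[min(n_bootstraps - 1, int((1.0 - tail) * n_bootstraps))]
    return estimate, lower, upper


# Toy per-subject accuracies (assumed for illustration).
unit_scores = [0.60, 0.70, 0.80, 0.55, 0.90, 0.65, 0.75, 0.70]
estimate, lower, upper = bootstrap_ci_over_units(unit_scores)
```

Because whole units are resampled, the interval width reflects the eight subjects rather than the number of trials each of them contributed.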
- coco_pipe.decoding.stats.apply_multiple_comparison_correction(df, p_col='PValue', method='fdr_bh', alpha=0.05)#
Apply multiple comparison correction to a DataFrame of results.
Scientific Rationale#
When testing multiple hypotheses (e.g., across many timepoints or models), the probability of at least one Type I error (false positive) grows with the number of tests. This utility applies standard corrections: Bonferroni (strict) controls the family-wise error rate, while Benjamini-Hochberg FDR controls the expected proportion of false discoveries.
- param df:
Results DataFrame containing p-values.
- type df:
pd.DataFrame
- param p_col:
Name of the column containing raw p-values.
- type p_col:
str, default=’PValue’
- param method:
Correction method (e.g., ‘fdr_bh’, ‘bonferroni’).
- type method:
str, default=’fdr_bh’
- param alpha:
Significance level.
- type alpha:
float, default=0.05
- returns:
corrected_df – The DataFrame with updated ‘PValueCorrected’ and ‘Significant’ columns.
- rtype:
pd.DataFrame
Examples
>>> # corrected = apply_multiple_comparison_correction(results_df, method='fdr_bh')
See also
statsmodels.stats.multitest.multipletests: Underlying implementation.
- Parameters:
df (pandas.DataFrame)
p_col (str)
method (str)
alpha (float)
- Return type:
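For the strict end of the spectrum, the Bonferroni correction multiplies each p-value by the number of tests and caps the result at 1. The helper below is a standalone sketch with a hypothetical name. It does not call `apply_multiple_comparison_correction` or statsmodels.

```python
def bonferroni_adjust(p_values, alpha=0.05):
    """Return (adjusted_p_values, reject_flags) under Bonferroni correction."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [q <= alpha for q in adjusted]
    return adjusted, reject


adjusted, reject = bonferroni_adjust([0.01, 0.02, 0.03, 0.20])
```

With four tests, only p-values at or below 0.05 / 4 = 0.0125 remain significant, so only the first hypothesis is rejected here.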