coco_pipe.decoding.stats#

Finite-sample statistical assessment for decoding results.

This module separates descriptive performance from inferential claims. The default inferential path reruns the complete decoding experiment under label permutations so learned preprocessing, feature selection, tuning, and calibration remain inside each null pipeline.

Attributes#

`logger`
`TEMPORAL_COLUMNS`

Functions#

`benjamini_hochberg`(p_values, *[, alpha])	Correct a family of p-values with the Benjamini-Hochberg procedure.
`correct_sweep_pvalues`(results, *[, p_column, ...])	Apply BH-FDR independently within configured decoding sweep families.
`aggregate_predictions_for_inference`(predictions, metric)	Aggregate prediction rows to independent units for inference.
`binomial_accuracy_test`(y_true, y_pred, p0[, alpha, ...])	Exact upper-tail binomial test for top-1 classification accuracy.
`run_statistical_assessment`(observed_result, ...)	Orchestrate the statistical assessment of experiment results.
`run_paired_permutation_assessment`(results_a, ...)	Run a paired permutation test to compare two experimental results.
`assess_post_hoc_permutation`(res[, metric, unit, ...])	Perform a post-hoc label permutation assessment on out-of-fold predictions.
`assess_paired_comparison`(merged[, metric, unit, ...])	Perform a paired permutation test between two models.
`assess_bootstrap_ci`(res[, metric, unit, n_bootstraps, ...])	Estimate uncertainty of a metric via bootstrapping over units.
`apply_multiple_comparison_correction`(df[, p_col, ...])	Apply multiple comparison correction to a DataFrame of results.

Module Contents#

coco_pipe.decoding.stats.logger#

coco_pipe.decoding.stats.benjamini_hochberg(p_values, *, alpha=0.05)#

Correct a family of p-values with the Benjamini-Hochberg procedure.

Parameters:

p_values (collections.abc.Sequence[float])
alpha (float)

Return type:

tuple[numpy.ndarray, numpy.ndarray]
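
As a rough illustration of the Benjamini-Hochberg step-up adjustment, the following dependency-free sketch may help. `bh_adjust` is a hypothetical helper, not this module's implementation, and returning rejection flags followed by adjusted p-values is an assumption about what the two arrays represent.

```python
def bh_adjust(p_values, alpha=0.05):
    """Return (reject flags, BH-adjusted p-values) in the input order.

    Hypothetical helper for illustration only.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    reject = [q <= alpha for q in adjusted]
    return reject, adjusted


reject, adjusted = bh_adjust([0.01, 0.04, 0.03, 0.20])
print(reject)  # [True, False, False, False]
```

Only the smallest p-value survives here: its adjusted value is 0.01 * 4 / 1 = 0.04, while the next two are both adjusted to about 0.053.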

coco_pipe.decoding.stats.correct_sweep_pvalues(results, *, p_column='p_value', family_columns=('target', 'input_mode', 'analysis_mode', 'model', 'selection_mode'), alpha=0.05)#

Apply BH-FDR independently within configured decoding sweep families.

Parameters:

results (pandas.DataFrame)
p_column (str)
family_columns (collections.abc.Sequence[str])
alpha (float)

Return type:

pandas.DataFrame

coco_pipe.decoding.stats.TEMPORAL_COLUMNS = ['Time', 'TrainTime', 'TestTime']#

coco_pipe.decoding.stats.aggregate_predictions_for_inference(predictions, metric, task='classification', unit_of_inference='sample', custom_unit_column=None, custom_aggregation='mean', require_single_prediction=False)#

Aggregate prediction rows to independent units for inference.

This ensures that each independent unit (e.g., a subject or a specific trial) contributes exactly one prediction per temporal coordinate to the statistical test, preventing pseudoreplication.

Scientific Rationale#

Inferential statistics assume independence between observations. In EEG/MEG, multiple epochs from the same subject are correlated. By aggregating predictions to the ‘subject’ level before calculating p-values, we ensure the degrees of freedom in the test reflect the number of independent biological units rather than the number of recorded segments.

param predictions:: Raw predictions from the experiment.
type predictions:: pd.DataFrame
param metric:: The metric to optimize aggregation for (e.g., ‘accuracy’).
type metric:: str
param task:: Task type (‘classification’ or ‘regression’).
type task:: str, default=’classification’
param unit_of_inference:: The level at which independence is assumed (‘sample’, ‘subject’, or ‘custom’).
type unit_of_inference:: str, default=’sample’
param custom_unit_column:: Column name in metadata to use as the independence unit if unit_of_inference is ‘custom’.
type custom_unit_column:: str, optional
param custom_aggregation:: Aggregation mode (‘mean’ or ‘majority’).
type custom_aggregation:: str, default=’mean’
param require_single_prediction:: If True, ensures that each unit has exactly one prediction per coordinate.
type require_single_prediction:: bool, default=False
returns:: aggregated_df – Aggregated predictions with an ‘InferentialUnitID’ column.
rtype:: pd.DataFrame
raises ValueError:: If the unit column is missing or aggregation is incompatible with the task.

Examples

>>> import pandas as pd
>>> from coco_pipe.decoding.stats import aggregate_predictions_for_inference
>>> df = pd.DataFrame(
...     {
...         "Subject": ["S1", "S1"],
...         "y_true": [1, 1],
...         "y_pred": [1, 0],
...         "SampleID": [0, 1],
...     }
... )
>>> res = aggregate_predictions_for_inference(
...     df, "accuracy", unit_of_inference="subject"
... )
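
The aggregation described in the rationale above can be sketched without the library. This is a minimal illustration that assumes majority voting within each subject and a single true label per subject; `majority_vote_by_unit` and the toy rows are hypothetical, and the real function may resolve units and ties differently.

```python
from collections import Counter, defaultdict

# Hypothetical prediction rows: three epochs per subject.
rows = [
    {"Subject": "S1", "y_true": 1, "y_pred": 1},
    {"Subject": "S1", "y_true": 1, "y_pred": 1},
    {"Subject": "S1", "y_true": 1, "y_pred": 0},
    {"Subject": "S2", "y_true": 0, "y_pred": 0},
    {"Subject": "S2", "y_true": 0, "y_pred": 0},
    {"Subject": "S2", "y_true": 0, "y_pred": 1},
    {"Subject": "S3", "y_true": 0, "y_pred": 1},
    {"Subject": "S3", "y_true": 0, "y_pred": 1},
    {"Subject": "S3", "y_true": 0, "y_pred": 0},
]


def majority_vote_by_unit(rows, unit_column="Subject"):
    """Collapse prediction rows to one row per independent unit."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[unit_column]].append(row)
    aggregated = []
    for unit_id, unit_rows in grouped.items():
        votes = Counter(r["y_pred"] for r in unit_rows)
        aggregated.append(
            {
                "InferentialUnitID": unit_id,
                "y_true": unit_rows[0]["y_true"],  # assumes one label per unit
                "y_pred": votes.most_common(1)[0][0],  # ties: first seen wins
            }
        )
    return aggregated


units = majority_vote_by_unit(rows)
unit_accuracy = sum(u["y_true"] == u["y_pred"] for u in units) / len(units)
```

The nine epoch rows collapse to three unit rows, so a downstream test sees n = 3 rather than n = 9.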

See also

ExperimentResult.get_predictions: Tidy prediction accessor.

Parameters:

predictions (pandas.DataFrame)
metric (str)
task (str)
unit_of_inference (str)
custom_unit_column (str | None)
custom_aggregation (str)
require_single_prediction (bool)

Return type:

pandas.DataFrame

coco_pipe.decoding.stats.binomial_accuracy_test(y_true, y_pred, p0, alpha=0.05, ci_method='wilson')#

Exact upper-tail binomial test for top-1 classification accuracy.

This test computes the probability of obtaining at least the observed number of correct predictions under the null hypothesis (theoretical chance).

Scientific Rationale#

For classification tasks with a known number of categories, the number of correct predictions follows a Binomial distribution B(n, p0) under the null hypothesis, provided the predictions are independent. Unlike a normal-approximation z-test, this exact test remains valid for small sample sizes and gives a rigorous reference for ‘chance-level’ performance.

param y_true:: Actual ground-truth labels.
type y_true:: Sequence[Any]
param y_pred:: Predicted labels.
type y_pred:: Sequence[Any]
param p0:: The theoretical chance level (e.g., 0.5 for binary classification).
type p0:: float
param alpha:: Significance level for p-values and confidence intervals.
type alpha:: float, default=0.05
param ci_method:: Method for calculating confidence intervals (‘wilson’ or ‘clopper_pearson’).
type ci_method:: str, default=’wilson’
returns:: result – Dictionary containing ‘observed’ accuracy, ‘p_value’, ‘n_eff’, ‘chance_threshold’, and confidence intervals.
rtype:: dict
raises ValueError:: If p0 is missing or input arrays are empty.

Examples

>>> from coco_pipe.decoding.stats import binomial_accuracy_test
>>> res = binomial_accuracy_test([1, 0, 1], [1, 1, 1], p0=0.5)
>>> print(res["p_value"])
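
For reference, the upper-tail probability and a Wilson interval can be computed by hand with the standard library. The sketch below is an independent illustration of the textbook formulas, not the function's actual code; `upper_tail_binomial` and its returned keys are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def upper_tail_binomial(y_true, y_pred, p0, alpha=0.05):
    """Exact P(X >= k) under B(n, p0) plus a Wilson score interval."""
    n = len(y_true)
    k = sum(t == p for t, p in zip(y_true, y_pred))
    # Sum the binomial probability mass from k correct up to n correct.
    p_value = sum(
        comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1)
    )
    acc = k / n
    # Wilson score interval for the observed accuracy.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z**2 / n
    center = (acc + z**2 / (2 * n)) / denom
    half = z * sqrt(acc * (1 - acc) / n + z**2 / (4 * n**2)) / denom
    return {"observed": acc, "p_value": p_value, "ci": (center - half, center + half)}


res = upper_tail_binomial([1, 0, 1], [1, 1, 1], p0=0.5)
print(res["p_value"])  # 0.5
```

With two of three predictions correct and p0 = 0.5, the tail sums the outcomes for two and three correct: 3/8 + 1/8 = 0.5.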

See also

run_statistical_assessment: Full-pipeline assessment driver.

Parameters:

y_true (collections.abc.Sequence[Any])
y_pred (collections.abc.Sequence[Any])
p0 (float | None)
alpha (float)
ci_method (str)

Return type:

dict[str, Any]

coco_pipe.decoding.stats.run_statistical_assessment(observed_result, experiment_config, X, y, groups, sample_ids, sample_metadata, feature_names, time_axis, observation_level, inferential_unit)#

Orchestrate the statistical assessment of experiment results.

Resolves the chosen statistical method (binomial or permutation) and dispatches analysis for each model and metric.

Scientific Rationale#

Establishing statistical significance in decoding is often non-trivial because of temporal autocorrelation and multiple comparisons. This orchestrator runs either analytical binomial tests (fast, against theoretical chance) or full-pipeline permutation tests (slower, against an empirical null) so that inferential claims about model performance rest on an appropriate null distribution.

param observed_result:: The result of the actual experiment run.
type observed_result:: ExperimentResult
param experiment_config:: The full configuration of the experiment.
type experiment_config:: ExperimentConfig
param X:: The raw feature array.
type X:: np.ndarray
param y:: The raw target vector.
type y:: np.ndarray
param groups:: CV grouping vector.
type groups:: np.ndarray, optional
param sample_ids:: Unique identifiers for samples.
type sample_ids:: np.ndarray
param sample_metadata:: Metadata for unit resolution.
type sample_metadata:: pd.DataFrame, optional
param feature_names:: Names of input features.
type feature_names:: list of str, optional
param time_axis:: Time coordinates for temporal data.
type time_axis:: np.ndarray, optional
param observation_level:: Level of input rows (‘sample’ or ‘epoch’).
type observation_level:: str
param inferential_unit:: Definition of statistical independence (‘sample’ or ‘subject’).
type inferential_unit:: str
returns:: assessment_payload – Summary containing ‘rows’ (standardized results) and ‘nulls’.
rtype:: dict

Examples

>>> # Internal use within Experiment.run()
>>> # res = run_statistical_assessment(observed, config, X, y, ...)
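
To make the full-pipeline permutation idea concrete, here is a toy sketch under strong simplifying assumptions: the "pipeline" is a leave-one-out nearest-class-mean classifier on 1-D data, and every name (`loo_accuracy`, `full_pipeline_permutation`) is hypothetical. The point is that the whole cross-validated fit is repeated for each shuffled label vector, so anything learned from the labels is relearned under the null.

```python
import random


def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-class-mean classifier (1-D data)."""
    correct = 0
    for i in range(len(X)):
        # Fit class means on every sample except the held-out one.
        means = {}
        for label in sorted(set(y)):
            vals = [x for j, (x, lab) in enumerate(zip(X, y)) if j != i and lab == label]
            means[label] = sum(vals) / len(vals)
        pred = min(means, key=lambda lab: abs(X[i] - means[lab]))
        correct += pred == y[i]
    return correct / len(X)


def full_pipeline_permutation(X, y, n_permutations=200, seed=0):
    """Rerun the complete pipeline on shuffled labels to build the null."""
    rng = random.Random(seed)
    observed = loo_accuracy(X, y)
    exceed = 0
    for _ in range(n_permutations):
        y_perm = list(y)
        rng.shuffle(y_perm)
        # The entire fit-and-score loop is repeated under the null labels.
        exceed += loo_accuracy(X, y_perm) >= observed
    # The +1 terms count the observed labelling as one permutation.
    return observed, (1 + exceed) / (1 + n_permutations)


X = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
y = [0, 0, 0, 0, 1, 1, 1, 1]
observed, p_value = full_pipeline_permutation(X, y)
```

The smallest attainable p-value is 1 / (1 + n_permutations), which is why the number of permutations bounds the resolution of the test.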

See also

binomial_accuracy_test: Core analytical test.
assess_post_hoc_permutation: Fast post-hoc alternative.

Parameters:

observed_result (Any)
experiment_config (Any)
X (numpy.ndarray)
y (numpy.ndarray)
groups (numpy.ndarray | None)
sample_ids (numpy.ndarray)
sample_metadata (pandas.DataFrame | None)
feature_names (collections.abc.Sequence[str] | None)
time_axis (numpy.ndarray | None)
observation_level (str)
inferential_unit (str)

Return type:

dict[str, Any]

coco_pipe.decoding.stats.run_paired_permutation_assessment(results_a, results_b, model, metric, config)#

Run a paired permutation test to compare two experimental results.

Tests the null hypothesis that the difference between two models is zero by randomly swapping model labels within each independent unit.

Scientific Rationale#

This function performs a rigorous comparison of two experimental pipelines by aligning predictions at the ‘SampleID’ level and performing within-unit label swaps. This ensures that the comparison is not biased by unit-specific performance baselines and correctly estimates the p-value for the observed performance delta.

param results_a:: Result of the first experiment in the comparison.
type results_a:: ExperimentResult
param results_b:: Result of the second experiment in the comparison.
type results_b:: ExperimentResult
param model:: The name of the model to compare.
type model:: str
param metric:: Metric to use for the comparison.
type metric:: str
param config:: Configuration for permutations and significance.
type config:: StatisticalAssessmentConfig
returns:: paired_df – DataFrame with Difference, PValue, and confidence intervals.
rtype:: pd.DataFrame

Examples

>>> # diff = run_paired_permutation_assessment(res1, res2, 'LR', 'accuracy', config)

See also

assess_paired_comparison: Fast post-hoc alternative.

Parameters:

results_a (coco_pipe.decoding.ExperimentResult)
results_b (coco_pipe.decoding.ExperimentResult)
model (str)
metric (str)
config (coco_pipe.decoding.configs.StatisticalAssessmentConfig)

Return type:

pandas.DataFrame

coco_pipe.decoding.stats.assess_post_hoc_permutation(res, metric='accuracy', unit=None, n_permutations=1000, random_state=None)#

Perform a post-hoc label permutation assessment on out-of-fold predictions.

Shuffles labels relative to fixed predictions to estimate the null distribution under exchangeability.

Scientific Rationale#

Unlike full-pipeline permutations, post-hoc permutations do not rerun feature selection or tuning. This makes them much faster, but potentially over-optimistic if those steps ‘leaked’ label information. Provided the independence unit (e.g., subject) is respected during the shuffle, they still give a valid test of whether the model’s predictions are associated with the labels beyond chance.

param res:: Result dictionary from ExperimentResult.raw.
type res:: dict
param metric:: The metric to evaluate.
type metric:: str, default=’accuracy’
param unit:: Level of independence (e.g., ‘subject’).
type unit:: str, optional
param n_permutations:: Number of null permutations.
type n_permutations:: int, default=1000
param random_state:: Seed for reproducibility.
type random_state:: int, optional
returns:: posthoc_df – DataFrame with Observed score, PValue, and Significant status.
rtype:: pd.DataFrame

Examples

>>> # posthoc = assess_post_hoc_permutation(res.raw['LR'], metric='accuracy')
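
The post-hoc procedure can be illustrated with a small standalone sketch: predictions stay fixed and only the true labels are shuffled. `post_hoc_label_permutation` is a hypothetical helper that treats each sample as its own unit; it does not reproduce the unit handling of the real function.

```python
import random


def post_hoc_label_permutation(y_true, y_pred, n_permutations=1000, seed=0):
    """Shuffle labels against fixed predictions to estimate a null for accuracy."""
    rng = random.Random(seed)

    def accuracy(labels):
        return sum(t == p for t, p in zip(labels, y_pred)) / len(y_pred)

    observed = accuracy(y_true)
    shuffled = list(y_true)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # no model is refit here
        exceed += accuracy(shuffled) >= observed
    return observed, (1 + exceed) / (1 + n_permutations)


y_true = [0, 1] * 10
y_pred = [0, 1] * 9 + [1, 0]  # 18 of 20 predictions match the labels
observed, p_value = post_hoc_label_permutation(y_true, y_pred)
```

Because nothing is retrained, each permutation costs only one metric evaluation. That is the speed advantage described above, and also why label leakage in earlier steps goes undetected.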

See also

run_statistical_assessment: Full-pipeline assessment driver.

Parameters:

res (dict[str, Any])
metric (str)
unit (str | None)
n_permutations (int)
random_state (int | None)

Return type:

pandas.DataFrame

coco_pipe.decoding.stats.assess_paired_comparison(merged, metric='accuracy', unit=None, n_permutations=1000, random_state=None)#

Perform a paired permutation test between two models.

Tests the null hypothesis that the difference between two models is zero by randomly swapping model labels within each independent unit.

Scientific Rationale#

To compare two models (A and B), we test if the observed difference in performance is greater than what would be expected by chance if the labels ‘A’ and ‘B’ were interchangeable. By swapping labels within units (e.g., within subject), we control for subject-specific performance baselines and focus on the model-driven variance.

param merged:: Merged prediction frame with suffixes ‘_A’ and ‘_B’.
type merged:: pd.DataFrame
param metric:: Metric to evaluate.
type metric:: str, default=’accuracy’
param unit:: Level of independence (e.g., ‘subject’).
type unit:: str, optional
param n_permutations:: Number of permutations.
type n_permutations:: int, default=1000
param random_state:: Seed for reproducibility.
type random_state:: int, optional
returns:: comparison_df – DataFrame with ScoreA, ScoreB, Difference, and PValue.
rtype:: pd.DataFrame

Examples

>>> # comp = assess_paired_comparison(merged_df, metric='accuracy')
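
Swapping the labels ‘A’ and ‘B’ within a unit is equivalent to flipping the sign of that unit's score difference, at least for metrics that average over units. The following sketch assumes one score per unit and model and a two-sided test; `paired_sign_flip_test` and the example scores are hypothetical.

```python
import random


def paired_sign_flip_test(score_a, score_b, n_permutations=1000, seed=0):
    """Two-sided paired permutation test on per-unit score differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(score_a, score_b)]
    observed = sum(diffs) / len(diffs)
    exceed = 0
    for _ in range(n_permutations):
        # Flipping a sign is the same as swapping A and B for that unit.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        null_stat = sum(flipped) / len(flipped)
        # Small tolerance guards against floating-point ties.
        exceed += abs(null_stat) >= abs(observed) - 1e-12
    return observed, (1 + exceed) / (1 + n_permutations)


# Hypothetical per-subject accuracies for two models.
score_a = [0.80, 0.75, 0.90, 0.85, 0.70, 0.95]
score_b = [0.70, 0.70, 0.80, 0.80, 0.65, 0.85]
difference, p_value = paired_sign_flip_test(score_a, score_b)
```

With six units there are only 2**6 = 64 distinct sign patterns, so very small p-values are unattainable no matter how many permutations are drawn.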

See also

run_paired_permutation_assessment: Full-pipeline paired comparison.

Parameters:

merged (pandas.DataFrame)
metric (str)
unit (str | None)
n_permutations (int)
random_state (int | None)

Return type:

pandas.DataFrame

coco_pipe.decoding.stats.assess_bootstrap_ci(res, metric='accuracy', unit=None, n_bootstraps=1000, ci=0.95, random_state=None)#

Estimate uncertainty of a metric via bootstrapping over units.

This function computes the observed metric on the provided results and then generates a distribution of scores by resampling independent units with replacement.

Scientific Rationale#

Bootstrapping provides a non-parametric estimate of the sampling distribution of the metric. By resampling at the ‘unit’ level (e.g., subjects rather than individual trials), we account for within-unit correlations and avoid pseudoreplication, ensuring that the confidence intervals accurately reflect the uncertainty at the intended level of inference.

param res:: Result dictionary for a single model from ExperimentResult.raw.
type res:: dict
param metric:: Metric to evaluate.
type metric:: str, default=’accuracy’
param unit:: The level of independence (e.g., ‘subject’).
type unit:: str, optional
param n_bootstraps:: Number of bootstrap iterations.
type n_bootstraps:: int, default=1000
param ci:: Confidence level (0.95 for 95% intervals).
type ci:: float, default=0.95
param random_state:: Seed for reproducibility.
type random_state:: int, optional
returns:: bootstrap_df – DataFrame with estimate, CILower, and CIUpper.
rtype:: pd.DataFrame

Examples

>>> # ci_df = assess_bootstrap_ci(res.raw['LR'], unit='subject')
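
Unit-level resampling can be sketched as follows. The example assumes a percentile interval and accuracy pooled over all trials of the resampled subjects; `bootstrap_ci_over_units` and the per-subject correctness flags are hypothetical, and the real function may use a different interval construction.

```python
import random


def bootstrap_ci_over_units(correct_by_unit, n_bootstraps=1000, ci=0.95, seed=0):
    """Percentile bootstrap CI for accuracy, resampling whole units."""
    rng = random.Random(seed)
    unit_ids = list(correct_by_unit)

    def pooled_accuracy(ids):
        flags = [flag for uid in ids for flag in correct_by_unit[uid]]
        return sum(flags) / len(flags)

    estimate = pooled_accuracy(unit_ids)
    # Draw units with replacement; all trials of a drawn unit come along.
    scores = sorted(
        pooled_accuracy(rng.choices(unit_ids, k=len(unit_ids)))
        for _ in range(n_bootstraps)
    )
    lower_idx = int(n_bootstraps * (1 - ci) / 2)
    upper_idx = min(n_bootstraps - 1, int(n_bootstraps * (1 + ci) / 2))
    return estimate, scores[lower_idx], scores[upper_idx]


# Hypothetical per-trial correctness flags (1 = correct) for six subjects.
correct_by_unit = {
    "S1": [1, 1, 1, 1],
    "S2": [1, 1, 1, 0],
    "S3": [1, 1, 0, 0],
    "S4": [1, 0, 1, 1],
    "S5": [0, 1, 1, 1],
    "S6": [1, 1, 0, 1],
}
estimate, ci_lower, ci_upper = bootstrap_ci_over_units(correct_by_unit)
```

Resampling the 24 individual trials instead would treat correlated trials as independent and typically produce a narrower, over-confident interval.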

See also

binomial_accuracy_test: Analytical CI alternative.

Parameters:

res (dict[str, Any])
metric (str)
unit (str | None)
n_bootstraps (int)
ci (float)
random_state (int | None)

Return type:

pandas.DataFrame

coco_pipe.decoding.stats.apply_multiple_comparison_correction(df, p_col='PValue', method='fdr_bh', alpha=0.05)#

Apply multiple comparison correction to a DataFrame of results.

Scientific Rationale#

When testing multiple hypotheses (e.g., across many timepoints or models), the probability of at least one Type I error (false positive) increases with the number of tests. This utility applies standard corrections: Bonferroni (strict) controls the family-wise error rate, while the Benjamini-Hochberg procedure controls the False Discovery Rate (FDR), the expected proportion of false discoveries.

param df:: Results DataFrame containing p-values.
type df:: pd.DataFrame
param p_col:: Name of the column containing raw p-values.
type p_col:: str, default=’PValue’
param method:: Correction method (e.g., ‘fdr_bh’, ‘bonferroni’).
type method:: str, default=’fdr_bh’
param alpha:: Significance level.
type alpha:: float, default=0.05
returns:: corrected_df – The DataFrame with updated ‘PValueCorrected’ and ‘Significant’ columns.
rtype:: pd.DataFrame

Examples

>>> # corrected = apply_multiple_comparison_correction(results_df, method='fdr_bh')
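
A short calculation shows how quickly the error rate grows without correction. The sketch assumes independent tests with all null hypotheses true, which is a simplification for illustration; the variable names are hypothetical.

```python
alpha = 0.05  # per-test significance level
m = 20        # number of independent tests, e.g. 20 timepoints

# Probability of at least one false positive when every null is true.
fwer_uncorrected = 1 - (1 - alpha) ** m

# Bonferroni tests each hypothesis at alpha / m instead.
bonferroni_alpha = alpha / m
fwer_bonferroni = 1 - (1 - bonferroni_alpha) ** m

print(round(fwer_uncorrected, 4))  # 0.6415
print(fwer_bonferroni < alpha)     # True
```

Twenty uncorrected tests therefore give roughly a 64% chance of at least one spurious finding, while the Bonferroni threshold keeps that chance below alpha.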

See also

statsmodels.stats.multitest.multipletests: Underlying implementation.

Parameters:

df (pandas.DataFrame)
p_col (str)
method (str)
alpha (float)

Return type:

pandas.DataFrame