coco_pipe.io.quality#

Data quality measurement and QC gating for loaded data containers.

Four QC levels are defined:

Level 1 — NaN / Inf / extreme-value row drops at load time (handled by load_descriptor_table(); counts surfaced via container meta).
Level 2 — Epoch-level MAD outlier rejection (drop_epoch_outliers()).
Level 3 — Subject-level outlier rejection (drop_subject_outliers()).
Level 4 — Compose levels 2 + 3 and record every decision in a QCResult (run_qc()).

The module also provides lower-level primitives used by the levels above: compute_row_outlier_scores() (MAD z-scores per row), compute_subject_outlier_burden() (per-subject mean epoch burden), and row_quality_score() (simple NaN/Inf/zero count per row, used for quality-weighted sampling).

Classes#

`EpochDropRecord`	Record of one dropped observation.
`SubjectDropRecord`	Record of one dropped subject.
`QCResult`	Structured log of QC decisions produced by `run_qc()`.
`CheckResult`	Result of a data quality check.

Functions#

`make_qc_flag`(level, code, message[, value, threshold, ...])	Create a structured QC flag record.
`resolve_qc_status`(flags)	Return the worst status level from a list of QC flag dicts.
`group_labels`(container[, group_by])	Unique feature-group labels a container spans, at `group_by` granularity.
`compute_row_outlier_scores`(df, feature_cols[, ...])	Compute per-row outlier fractions using MAD-based robust z-scores.
`compute_subject_outlier_burden`(df, feature_cols[, ...])	Aggregate row-level MAD outlier scores to one row per subject.
`check_missingness`(df[, threshold_warn, threshold_fail])	Check for missing values (NaNs).
`check_constant_columns`(df)	Check for columns/features with zero variance.
`check_outliers_zscore`(df[, sigma])	Check for extreme values (> sigma).
`check_flatline`(signal[, threshold])	Check if signal is effectively dead (flatline).
`drop_epoch_outliers`(container[, z_threshold, ...])	Drop observations with a high fraction of MAD-based feature outliers.
`drop_subject_outliers`(container[, z_threshold, ...])	Drop subjects with a high cohort-level feature outlier burden.
`run_qc`(container[, epoch_z_threshold, ...])	Run epoch QC followed by subject QC and return a merged result.

Module Contents#

class coco_pipe.io.quality.EpochDropRecord#

Record of one dropped observation.

obs_index: int#

obs_id: str#

outlier_fraction: float#

mad_z_max: float#

class coco_pipe.io.quality.SubjectDropRecord#

Record of one dropped subject.

subject_id: str#

outlier_fraction: float#

n_outlier_features: float#

class coco_pipe.io.quality.QCResult#

Structured log of QC decisions produced by run_qc().

Fields are populated by run_qc and, for the Level-1 counts (n_rows_entering_qc, n_dropped_nan_inf, n_dropped_extreme), read from the container’s meta dict written by load_descriptor_table().

family_qc is not populated by run_qc — the caller computes it via aggregate_family_qc() and attaches it afterwards (result.family_qc = aggregate_family_qc(...)), keeping the io layer free of descriptors imports.

n_rows_entering_qc: int | None = None#

n_dropped_nan_inf: int = 0#

n_dropped_extreme: int = 0#

n_obs_in: int = 0#

n_obs_out: int = 0#

n_subjects_in: int = 0#

n_subjects_out: int = 0#

epoch_drop_threshold: float | None = None#

epoch_outlier_fraction_threshold: float | None = None#

epochs_dropped: list[EpochDropRecord] = []#

subject_drop_threshold: float | None = None#

subject_outlier_fraction_threshold: float | None = None#

subjects_dropped: list[SubjectDropRecord] = []#

per_family_dropped: dict[str, list[EpochDropRecord | SubjectDropRecord]]#

subject_outlier_burden: pandas.DataFrame | None = None#

feature_missingness: pandas.DataFrame | None = None#

feature_columns_dropped: pandas.DataFrame | None = None#

family_qc: pandas.DataFrame | None = None#

thresholds: dict[str, Any]#

property n_epochs_dropped: int#

Return type: int

property n_subjects_dropped: int#

Return type: int

property retention_rate: float#

Return the fraction of input observations retained.

Return type: float

property total_dropped: int#

Return total rows dropped across all QC levels.

Return type: int

summary()#

Return a flat summary suitable for logs and report headers.

Return type: dict[str, Any]
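The retention_rate property is described as the fraction of input observations retained. A minimal sketch of that arithmetic follows, using a hypothetical stand-in class (QCCounts is not part of coco_pipe) and assuming the rate is n_obs_out / n_obs_in, with an empty input reported as 0.0.

```python
from dataclasses import dataclass


@dataclass
class QCCounts:
    """Hypothetical stand-in holding only the two counts used here."""

    n_obs_in: int = 0
    n_obs_out: int = 0

    @property
    def retention_rate(self) -> float:
        # Assumption: an empty input yields 0.0 instead of raising ZeroDivisionError.
        if self.n_obs_in == 0:
            return 0.0
        return self.n_obs_out / self.n_obs_in


counts = QCCounts(n_obs_in=200, n_obs_out=150)
rate = counts.retention_rate
```

How the real QCResult treats an empty input is not stated in this reference, so the zero-input branch is only a guess.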

class coco_pipe.io.quality.CheckResult#

Result of a data quality check.

Variables:

check_name (str) – Name of the check (e.g., “Missing Values”).
status (str) – “OK”, “WARN”, or “FAIL”.
message (str) – Human-readable description of the issue.
severity (int) – 0 (Info) to 10 (Critical).
metric_name (str, optional) – Name of the metric evaluated (e.g., “missing_pct”).
metric_value (float, int, or str, optional) – Value of the metric.

Examples

>>> from coco_pipe.io.quality import CheckResult
>>> res = CheckResult("Missingness", "FAIL", "Too many NaNs", 9)
>>> res.is_issue
True

check_name: str#

status: coco_pipe.io._constants.QualityStatus#

message: str#

severity: int#

metric_name: str | None = None#

metric_value: float | int | str | None = None#

property is_issue: bool#

Return True if status is WARN or FAIL.

Return type: bool

classmethod from_flag_dict(flag)#

Construct a CheckResult from a make_qc_flag() record.

Parameters: flag (dict[str, Any])
Return type: CheckResult

coco_pipe.io.quality.make_qc_flag(level, code, message, value=None, threshold=None, scope=None)#

Create a structured QC flag record.

Parameters:

level (coco_pipe.io._constants.QCFlagLevel)
code (str)
message (str)
value (float | int | str | None)
threshold (float | int | str | None)
scope (str | None)

Return type:

dict[str, Any]

coco_pipe.io.quality.resolve_qc_status(flags)#

Return the worst status level from a list of QC flag dicts.

Parameters: flags (list[dict[str, Any]])
Return type: str
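Picking the "worst" status amounts to ranking the levels and taking the maximum. The sketch below illustrates that idea only. It assumes each flag dict stores its level under a "level" key, that the level names are "OK", "WARN" and "FAIL", and that an empty list resolves to "OK". None of these details are confirmed by this reference.

```python
# Assumed ordering of level names, from least to most severe.
SEVERITY_ORDER = {"OK": 0, "WARN": 1, "FAIL": 2}


def worst_status(flags):
    """Return the most severe level found in a list of flag dicts."""
    if not flags:
        # Assumption: no flags means nothing went wrong.
        return "OK"
    levels = (flag["level"] for flag in flags)
    return max(levels, key=SEVERITY_ORDER.__getitem__)


flags = [
    {"level": "WARN", "code": "high_missingness"},  # hypothetical codes
    {"level": "FAIL", "code": "flatline"},
    {"level": "OK", "code": "range_check"},
]
status = worst_status(flags)
```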

coco_pipe.io.quality.group_labels(container, group_by='family')#

Unique feature-group labels a container spans, at group_by granularity.

Resolves labels from the container’s structured feature_schema() — enriched from descriptor-name parsing when the schema is partial — and returns them de-duplicated in first-seen order. This is the structured replacement for hand-rolled “which families/measures does this analysis unit cover” helpers: pass the sliced unit container from iter_analysis_units() to learn which QC labels it maps to.

Parameters:

container (coco_pipe.io.structures.DataContainer) – Any container with a feature axis (e.g. one analysis unit).
group_by (str) – Grouping granularity: "family", "measure", or "feature".

Returns:

Distinct labels at the requested granularity; [] when the container has no feature axis.

Return type:

list of str
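De-duplicating labels "in first-seen order" can be sketched with an insertion-ordered dict. The family names below are hypothetical, and the helper is a plain-Python illustration of the ordering guarantee, not the library's implementation.

```python
def unique_in_order(labels):
    """Drop repeated labels while keeping the order of first appearance."""
    # dict preserves insertion order, so the keys come back first-seen first.
    return list(dict.fromkeys(labels))


# One label per feature column, as a feature schema might list them.
feature_families = ["spectral", "spectral", "complexity", "spectral", "connectivity"]
labels = unique_in_order(feature_families)
```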

coco_pipe.io.quality.compute_row_outlier_scores(df, feature_cols, z_threshold=5.0, descriptor_names=None, group_by=None, feature_schema=None)#

Compute per-row outlier fractions using MAD-based robust z-scores.

When group_by is set ("family", "measure", or "feature") the result also carries per-group outlier_fraction_<label> columns so a row can be judged within each descriptor group rather than across all features.

Parameters:

df (pandas.DataFrame)
feature_cols (list[str])
z_threshold (float)
descriptor_names (list[str] | None)
group_by (str | None)
feature_schema (pandas.DataFrame | None)

Return type:

pandas.DataFrame
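As a rough illustration of a MAD-based robust z-score and the per-row outlier fraction built on it, here is a dependency-free sketch. It assumes the conventional 1.4826 scaling constant, a strict "greater than z_threshold" test, and that a zero-MAD column contributes no outliers. The actual function works on DataFrames and may differ on all three points.

```python
from statistics import median

MAD_SCALE = 1.4826  # makes the MAD comparable to a standard deviation for normal data


def robust_z_scores(values):
    """Return |x - median| / (1.4826 * MAD) for each value of one feature column."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption: a column with zero MAD flags nothing.
        return [0.0 for _ in values]
    return [abs(v - med) / (MAD_SCALE * mad) for v in values]


def row_outlier_fractions(rows, z_threshold=5.0):
    """rows: list of equal-length feature lists. Returns one fraction per row."""
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    z_by_col = [robust_z_scores(col) for col in columns]
    fractions = []
    for i in range(len(rows)):
        n_out = sum(1 for j in range(n_features) if z_by_col[j][i] > z_threshold)
        fractions.append(n_out / n_features)
    return fractions


rows = [
    [1.0, 10.0],
    [1.1, 10.5],
    [0.9, 9.5],
    [1.2, 10.2],
    [50.0, 10.1],  # first feature sits far from the column median
]
fractions = row_outlier_fractions(rows)
```

Only the last row has a flagged feature, so its fraction is 1 of 2 features while every other row scores zero.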

coco_pipe.io.quality.compute_subject_outlier_burden(df, feature_cols, subject_col='subject', z_threshold=5.0)#

Aggregate row-level MAD outlier scores to one row per subject.

Parameters:

df (pandas.DataFrame)
feature_cols (list[str])
subject_col (str)
z_threshold (float)

Return type:

pandas.DataFrame
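The module overview describes the subject burden as a per-subject mean of epoch-level burden. A minimal sketch of that aggregation, assuming the row-level outlier fractions are already available and using plain lists instead of a DataFrame:

```python
from collections import defaultdict


def subject_outlier_burden(subject_ids, row_fractions):
    """Average the per-row outlier fraction within each subject."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for subject, fraction in zip(subject_ids, row_fractions):
        totals[subject] += fraction
        counts[subject] += 1
    return {subject: totals[subject] / counts[subject] for subject in totals}


subjects = ["s01", "s01", "s02", "s02"]  # hypothetical subject IDs
fractions = [0.0, 0.5, 0.0, 0.0]
burden = subject_outlier_burden(subjects, fractions)
```

The real function returns one DataFrame row per subject. Which columns it carries beyond the mean is not documented here.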

coco_pipe.io.quality.check_missingness(df, threshold_warn=0.01, threshold_fail=0.2)#

Check for missing values (NaNs).

Parameters:

df (DataFrame or ndarray) – The data to check.
threshold_warn (float) – Ratio of NaNs to trigger a warning. Default 0.01 (1%).
threshold_fail (float) – Ratio of NaNs to trigger a failure. Default 0.20 (20%).

Returns:

Quality check result.

Return type:

CheckResult

Examples

>>> import numpy as np
>>> from coco_pipe.io.quality import check_missingness
>>> data = np.array([1, 2, np.nan, 4])
>>> check_missingness(data, threshold_warn=0.1)
CheckResult(check_name='Missingness', status='FAIL', ...)
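The two thresholds partition the NaN ratio into OK, WARN and FAIL bands. The sketch below reproduces that decision in plain Python. It assumes both thresholds are inclusive lower bounds, which this reference does not confirm. In the example above, one NaN out of four values gives a ratio of 0.25, which exceeds the default threshold_fail of 0.2 and so yields FAIL.

```python
import math


def missing_ratio(values):
    """Fraction of entries that are NaN."""
    nan_count = sum(1 for v in values if isinstance(v, float) and math.isnan(v))
    return nan_count / len(values)


def missingness_status(values, threshold_warn=0.01, threshold_fail=0.2):
    """Map the NaN ratio to 'OK', 'WARN' or 'FAIL'."""
    ratio = missing_ratio(values)
    # Assumption: a ratio exactly equal to a threshold triggers that level.
    if ratio >= threshold_fail:
        return "FAIL"
    if ratio >= threshold_warn:
        return "WARN"
    return "OK"


status = missingness_status([1.0, 2.0, float("nan"), 4.0], threshold_warn=0.1)
```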

coco_pipe.io.quality.check_constant_columns(df)#

Check for columns/features with zero variance.

Parameters: df (DataFrame or ndarray) – The data to check.
Returns: List of findings. Empty if no constant columns found.
Return type: List[CheckResult]

Examples

>>> import pandas as pd
>>> from coco_pipe.io.quality import check_constant_columns
>>> df = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3]})
>>> check_constant_columns(df)
[CheckResult(check_name='Constant Features', ...)]
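A zero-variance column is one whose values are all identical. A minimal sketch of that test, using a dict of lists as a hypothetical stand-in for a DataFrame:

```python
def constant_columns(table):
    """table: dict mapping column name -> list of values."""
    # A column is constant when it holds at most one distinct value.
    return [name for name, values in table.items() if len(set(values)) <= 1]


table = {"a": [1, 1, 1], "b": [1, 2, 3]}
constant = constant_columns(table)
```

This sketch ignores NaN handling, which the real check may treat differently.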

coco_pipe.io.quality.check_outliers_zscore(df, sigma=5.0)#

Check for extreme values (z-score > sigma). Uses a simple global z-score approach.

Parameters:

df (DataFrame or ndarray) – Data to check.
sigma (float) – Z-score threshold. Default 5.0.

Returns:

CheckResult if outliers found, else None.

Return type:

Optional[CheckResult]
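One way to read "global z-score" is a single mean and standard deviation computed over all values, with every value compared against them. The sketch below takes that reading. The population standard deviation and the use of the absolute z-score are assumptions, not confirmed behavior.

```python
from statistics import fmean, pstdev


def count_global_outliers(values, sigma=5.0):
    """Count values whose absolute z-score exceeds sigma."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # Constant data has no spread, so nothing can be an outlier.
        return 0
    return sum(1 for v in values if abs(v - mean) / std > sigma)


values = [0.0] * 99 + [100.0]
n_outliers = count_global_outliers(values)
```

With a threshold of 5, small samples can never trigger this check: the largest possible z-score among n points is (n - 1) / sqrt(n), which only exceeds 5 once n reaches 27.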

coco_pipe.io.quality.check_flatline(signal, threshold=1e-10)#

Check if signal is effectively dead (flatline).

Parameters:

signal (ndarray) – 1D signal array or flattened data.
threshold (float) – Standard deviation threshold. Default 1e-10.

Returns:

Result indicating if signal is flat.

Return type:

CheckResult
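The flatline test compares the signal's standard deviation against a very small threshold. A minimal sketch, assuming a strict "less than" comparison and the population standard deviation:

```python
from statistics import pstdev


def is_flatline(signal, threshold=1e-10):
    """Return True when the signal shows effectively no variation."""
    # Assumption: strict comparison against the threshold.
    return pstdev(signal) < threshold


dead_channel = [0.5] * 100
live_channel = [0.0, 1.0, 0.0, 1.0]
flat = is_flatline(dead_channel)
alive = not is_flatline(live_channel)
```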

coco_pipe.io.quality.drop_epoch_outliers(container, z_threshold=5.0, outlier_fraction_threshold=0.3, subject_col='subject', feature_cols=None, descriptor_names=None, group_by=None, min_obs=None, feature_schema=None)#

Drop observations with a high fraction of MAD-based feature outliers.

group_by=None makes one global drop decision across all features. When set ("family", "measure", or "feature") the decision is made per descriptor group at that granularity and the returned dict is keyed by group label.

Parameters:

container (coco_pipe.io.structures.DataContainer)
z_threshold (float)
outlier_fraction_threshold (float)
subject_col (str)
feature_cols (list[str] | None)
descriptor_names (list[str] | None)
group_by (str | None)
min_obs (int | None)
feature_schema (pandas.DataFrame | None)

Return type:

tuple[coco_pipe.io.structures.DataContainer | dict[str, numpy.ndarray], QCResult]
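The difference between a global and a grouped decision can be sketched with precomputed outlier fractions. This illustration assumes a row is dropped when its fraction strictly exceeds outlier_fraction_threshold, and that the per-group arrays act as boolean keep masks. The group labels are hypothetical, and the exact content of the returned dict is not specified in this reference.

```python
def keep_mask(fractions, outlier_fraction_threshold=0.3):
    """True for rows that survive, False for rows to drop."""
    # Assumption: only fractions strictly above the threshold are dropped.
    return [f <= outlier_fraction_threshold for f in fractions]


# Global decision: one outlier fraction per row across all features.
global_fractions = [0.0, 0.5, 0.1]
global_keep = keep_mask(global_fractions)

# Grouped decision: one fraction per row within each descriptor group,
# so the same row can pass in one group and fail in another.
per_group_fractions = {
    "spectral": [0.0, 0.8, 0.1],
    "complexity": [0.0, 0.2, 0.4],
}
per_group_keep = {
    label: keep_mask(fractions) for label, fractions in per_group_fractions.items()
}
```

In the grouped case the second row is rejected only for "spectral" and the third only for "complexity", which is the situation a single global fraction would blur.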

coco_pipe.io.quality.drop_subject_outliers(container, z_threshold=5.0, outlier_fraction_threshold=0.2, subject_col='subject', feature_cols=None, descriptor_names=None, group_by=None, feature_schema=None)#

Drop subjects with a high cohort-level feature outlier burden.

group_by=None makes one global decision across all features. When set ("family", "measure", or "feature") the burden is computed per descriptor group at that granularity and the returned dict is keyed by group label.

Parameters:

container (coco_pipe.io.structures.DataContainer)
z_threshold (float)
outlier_fraction_threshold (float)
subject_col (str)
feature_cols (list[str] | None)
descriptor_names (list[str] | None)
group_by (str | None)
feature_schema (pandas.DataFrame | None)

Return type:

tuple[coco_pipe.io.structures.DataContainer | dict[str, numpy.ndarray], QCResult]

coco_pipe.io.quality.run_qc(container, epoch_z_threshold=5.0, epoch_outlier_fraction_threshold=0.3, subject_z_threshold=5.0, subject_outlier_fraction_threshold=0.2, subject_col='subject', feature_cols=None, compute_missingness=True)#

Run epoch QC followed by subject QC and return a merged result.

Parameters:

container (coco_pipe.io.structures.DataContainer)
epoch_z_threshold (float | None)
epoch_outlier_fraction_threshold (float)
subject_z_threshold (float | None)
subject_outlier_fraction_threshold (float)
subject_col (str)
feature_cols (list[str] | None)
compute_missingness (bool)

Return type:

tuple[coco_pipe.io.structures.DataContainer, QCResult]
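The two-stage flow (epoch QC, then subject QC, with every decision logged) can be sketched as follows. This is a simplified stand-in, not the library's algorithm. It assumes each epoch already has an outlier fraction, reuses those fractions for the subject-level burden, and drops on a strict "greater than" comparison. The log keys mirror a few QCResult field names for readability only.

```python
def run_two_stage_qc(rows, epoch_threshold=0.3, subject_threshold=0.2):
    """rows: list of (subject_id, outlier_fraction) pairs, one per epoch."""
    log = {
        "n_obs_in": len(rows),
        "n_subjects_in": len({s for s, _ in rows}),
    }

    # Stage 1 (Level 2): drop individual epochs with too many outlying features.
    kept = [(s, f) for s, f in rows if f <= epoch_threshold]
    log["n_epochs_dropped"] = len(rows) - len(kept)

    # Stage 2 (Level 3): drop subjects whose mean remaining burden is too high.
    by_subject = {}
    for s, f in kept:
        by_subject.setdefault(s, []).append(f)
    bad = {s for s, fs in by_subject.items() if sum(fs) / len(fs) > subject_threshold}
    kept = [(s, f) for s, f in kept if s not in bad]

    log["subjects_dropped"] = sorted(bad)
    log["n_obs_out"] = len(kept)
    log["n_subjects_out"] = len({s for s, _ in kept})
    return kept, log


rows = [
    ("s01", 0.0),
    ("s01", 0.5),   # dropped at the epoch stage
    ("s02", 0.25),
    ("s02", 0.3),   # s02 averages 0.275 and is dropped at the subject stage
    ("s03", 0.0),
]
kept, log = run_two_stage_qc(rows)
```

Running the epoch stage first means a subject is judged on the epochs that survive, so one bad epoch does not by itself remove the whole subject.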