coco_pipe.io.quality#
Data quality measurement and QC gating for loaded data containers.
Four QC levels are defined:
- Level 1: NaN / Inf / extreme-value row drops at load time (handled by load_descriptor_table(); counts surfaced via container meta).
- Level 2: Epoch-level MAD outlier rejection (drop_epoch_outliers()).
- Level 3: Subject-level outlier rejection (drop_subject_outliers()).
- Level 4: Compose levels 2 + 3 and record every decision in a QCResult (run_qc()).
The module also provides lower-level primitives used by the levels above:
compute_row_outlier_scores() (MAD z-scores per row),
compute_subject_outlier_burden() (per-subject mean epoch burden), and
row_quality_score() (simple NaN/Inf/zero count per row, used for
quality-weighted sampling).
Classes#
EpochDropRecord | Record of one dropped observation.
SubjectDropRecord | Record of one dropped subject.
QCResult | Structured log of QC decisions produced by run_qc().
CheckResult | Result of a data quality check.
Functions#
make_qc_flag() | Create a structured QC flag record.
resolve_qc_status() | Return the worst status level from a list of QC flag dicts.
group_labels() | Unique feature-group labels a container spans, at group_by granularity.
compute_row_outlier_scores() | Compute per-row outlier fractions using MAD-based robust z-scores.
compute_subject_outlier_burden() | Aggregate row-level MAD outlier scores to one row per subject.
check_missingness() | Check for missing values (NaNs).
check_constant_columns() | Check for columns/features with zero variance.
check_outliers_zscore() | Check for extreme values (> sigma).
check_flatline() | Check if signal is effectively dead (flatline).
drop_epoch_outliers() | Drop observations with a high fraction of MAD-based feature outliers.
drop_subject_outliers() | Drop subjects with a high cohort-level feature outlier burden.
run_qc() | Run epoch QC followed by subject QC and return a merged result.
Module Contents#
- class coco_pipe.io.quality.EpochDropRecord#
Record of one dropped observation.
- class coco_pipe.io.quality.SubjectDropRecord#
Record of one dropped subject.
- class coco_pipe.io.quality.QCResult#
Structured log of QC decisions produced by run_qc().
Fields are populated by run_qc and, for the Level-1 counts (n_rows_entering_qc, n_dropped_nan_inf, n_dropped_extreme), read from the container's meta dict written by load_descriptor_table(). family_qc is not populated by run_qc; the caller computes it via aggregate_family_qc() and attaches it afterwards (result.family_qc = aggregate_family_qc(...)), keeping the io layer free of descriptors imports.
- epochs_dropped: list[EpochDropRecord] = []#
- subjects_dropped: list[SubjectDropRecord] = []#
- per_family_dropped: dict[str, list[EpochDropRecord | SubjectDropRecord]]#
- subject_outlier_burden: pandas.DataFrame | None = None#
- feature_missingness: pandas.DataFrame | None = None#
- feature_columns_dropped: pandas.DataFrame | None = None#
- family_qc: pandas.DataFrame | None = None#
- class coco_pipe.io.quality.CheckResult#
Result of a data quality check.
- Variables:
check_name (str) – Name of the check (e.g., “Missing Values”).
status (str) – “OK”, “WARN”, or “FAIL”.
message (str) – Human-readable description of the issue.
severity (int) – 0 (Info) to 10 (Critical).
metric_name (str, optional) – Name of the metric evaluated (e.g., “missing_pct”).
metric_value (float, int, or str, optional) – Value of the metric.
Examples
>>> res = CheckResult("Missingness", "FAIL", "Too many NaNs", 9)
>>> res.is_issue
True
- status: coco_pipe.io._constants.QualityStatus#
- classmethod from_flag_dict(flag)#
Construct a CheckResult from a make_qc_flag() record.
- Parameters:
- Return type:
- coco_pipe.io.quality.make_qc_flag(level, code, message, value=None, threshold=None, scope=None)#
Create a structured QC flag record.
- coco_pipe.io.quality.resolve_qc_status(flags)#
Return the worst status level from a list of QC flag dicts.
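A minimal sketch of how a worst-status reduction can work, assuming each flag dict stores its status under a "level" key and that statuses rank OK < WARN < FAIL. The key name, the ranking table, and the helper name worst_status are illustrative assumptions, not the library's implementation.

```python
# Hypothetical severity ranking over the three statuses used by CheckResult.
STATUS_RANK = {"OK": 0, "WARN": 1, "FAIL": 2}


def worst_status(flags):
    """Return the most severe status among flag dicts (assumed key: "level")."""
    if not flags:
        # Assumption: an empty flag list means nothing went wrong.
        return "OK"
    return max((flag["level"] for flag in flags), key=STATUS_RANK.__getitem__)
```

Ranking through a lookup table keeps the comparison explicit and avoids relying on the alphabetical order of the status strings.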
- coco_pipe.io.quality.group_labels(container, group_by='family')#
Unique feature-group labels a container spans, at group_by granularity.
Resolves labels from the container's structured feature_schema(), enriched from descriptor-name parsing when the schema is partial, and returns them de-duplicated in first-seen order. This is the structured replacement for hand-rolled "which families/measures does this analysis unit cover" helpers: pass the sliced unit container from iter_analysis_units() to learn which QC labels it maps to.
- Parameters:
container (coco_pipe.io.structures.DataContainer) – Any container with a feature axis (e.g. one analysis unit).
group_by (str) – Grouping granularity: "family", "measure", or "feature".
- Returns:
Distinct labels at the requested granularity; [] when the container has no feature axis.
- Return type:
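The "de-duplicated in first-seen order" behaviour can be illustrated with a small stand-alone sketch. The helper name unique_in_order and the example labels are hypothetical; how the library actually resolves labels from the schema is not shown here.

```python
def unique_in_order(labels):
    """De-duplicate labels while keeping the order of first appearance."""
    # dict keys preserve insertion order (Python 3.7+), so the first
    # occurrence of each label fixes its position in the result.
    return list(dict.fromkeys(labels))
```

For example, unique_in_order(["spectral", "complexity", "spectral"]) yields ["spectral", "complexity"], and an empty input yields [].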
- coco_pipe.io.quality.compute_row_outlier_scores(df, feature_cols, z_threshold=5.0, descriptor_names=None, group_by=None, feature_schema=None)#
Compute per-row outlier fractions using MAD-based robust z-scores.
When group_by is set ("family", "measure", or "feature") the result also carries per-group outlier_fraction_<label> columns so a row can be judged within each descriptor group rather than across all features.
- Parameters:
df (pandas.DataFrame)
z_threshold (float)
group_by (str | None)
feature_schema (pandas.DataFrame | None)
- Return type:
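As a rough illustration of a per-row outlier fraction built on MAD-based robust z-scores, the dependency-free sketch below scores each feature column against its own median and MAD, then reports the share of features in each row whose absolute robust z-score exceeds z_threshold. The 1.4826 scaling constant, the zero-MAD handling, and the helper names are assumptions; the library may differ on all three.

```python
from statistics import median

# Assumption: the usual constant that makes the MAD comparable to a
# standard deviation under normality.
MAD_SCALE = 1.4826


def robust_z_scores(values):
    """Robust z-scores of one feature column: (x - median) / (1.4826 * MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption: a column with zero MAD flags nothing.
        return [0.0 for _ in values]
    return [(v - med) / (MAD_SCALE * mad) for v in values]


def row_outlier_fractions(rows, z_threshold=5.0):
    """Fraction of features per row with |robust z| above z_threshold."""
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    z_by_col = [robust_z_scores(col) for col in columns]
    return [
        sum(abs(z_by_col[j][i]) > z_threshold for j in range(n_features)) / n_features
        for i in range(len(rows))
    ]
```

With five rows of two features where only the last row holds an extreme value in the first feature, the last row scores 0.5 and the others 0.0.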
- coco_pipe.io.quality.compute_subject_outlier_burden(df, feature_cols, subject_col='subject', z_threshold=5.0)#
Aggregate row-level MAD outlier scores to one row per subject.
- Parameters:
df (pandas.DataFrame)
subject_col (str)
z_threshold (float)
- Return type:
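The module overview describes this aggregation as a per-subject mean epoch burden. A minimal sketch of that idea, assuming the per-row outlier fractions are already computed and aligned with a list of subject IDs (the helper name and input layout are hypothetical):

```python
from collections import defaultdict


def subject_mean_burden(subjects, outlier_fractions):
    """Average the per-row outlier fractions within each subject."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for subject, fraction in zip(subjects, outlier_fractions):
        totals[subject] += fraction
        counts[subject] += 1
    return {subject: totals[subject] / counts[subject] for subject in totals}
```

The real function returns one row per subject and may carry additional columns; this sketch only shows the mean.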
- coco_pipe.io.quality.check_missingness(df, threshold_warn=0.01, threshold_fail=0.2)#
Check for missing values (NaNs).
- Parameters:
- Returns:
Quality check result.
- Return type:
Examples
>>> data = np.array([1, 2, np.nan, 4])
>>> check_missingness(data, threshold_warn=0.1)
CheckResult(check_name='Missingness', status='FAIL', ...)
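The two thresholds can be read as a three-way classification of the missing fraction. The sketch below is one plausible reading, assuming strict "greater than" comparisons and a flat sequence of floats; the helper name and the returned tuple are illustrative and are not the CheckResult the library produces.

```python
import math


def missingness_status(values, threshold_warn=0.01, threshold_fail=0.2):
    """Classify the fraction of NaN entries as OK, WARN, or FAIL."""
    values = list(values)
    if not values:
        return "OK", 0.0
    missing = sum(1 for v in values if isinstance(v, float) and math.isnan(v))
    fraction = missing / len(values)
    # Assumption: strict comparisons; the library may use >= instead.
    if fraction > threshold_fail:
        return "FAIL", fraction
    if fraction > threshold_warn:
        return "WARN", fraction
    return "OK", fraction
```

Under this reading, one NaN in four values gives a missing fraction of 0.25, which exceeds the default threshold_fail of 0.2 and is consistent with the FAIL status in the example above.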
- coco_pipe.io.quality.check_constant_columns(df)#
Check for columns/features with zero variance.
- Parameters:
df (DataFrame or ndarray) – The data to check.
- Returns:
List of findings. Empty if no constant columns found.
- Return type:
List[CheckResult]
Examples
>>> df = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3]})
>>> check_constant_columns(df)
[CheckResult(check_name='Constant Features', ...)]
- coco_pipe.io.quality.check_outliers_zscore(df, sigma=5.0)#
Check for extreme values (> sigma). Uses a simple global Z-score approach.
- Parameters:
df (DataFrame or ndarray) – Data to check.
sigma (float) – Z-score threshold. Default 5.0.
- Returns:
CheckResult if outliers found, else None.
- Return type:
Optional[CheckResult]
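A "simple global Z-score approach" can be sketched as pooling every value, computing one mean and one standard deviation, and counting values more than sigma standard deviations from the mean. The use of the population standard deviation and the helper name are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def count_global_z_outliers(values, sigma=5.0):
    """Count values whose global |z-score| exceeds sigma."""
    values = list(values)
    if len(values) < 2:
        return 0
    mean = fmean(values)
    std = pstdev(values)  # assumption: population std (ddof=0)
    if std == 0:
        return 0  # constant data has no z-score outliers
    return sum(abs(v - mean) / std > sigma for v in values)
```

Because extreme values inflate the mean and standard deviation they are scored against, a global z-score is less sensitive than the MAD-based scores used by the drop functions; in a small sample no value can reach a z-score of 5 at all.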
- coco_pipe.io.quality.check_flatline(signal, threshold=1e-10)#
Check if signal is effectively dead (flatline).
- Parameters:
signal (ndarray) – 1D signal array or flattened data.
threshold (float) – Standard deviation threshold. Default 1e-10.
- Returns:
Result indicating if signal is flat.
- Return type:
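The flatline criterion amounts to comparing the signal's standard deviation with a very small threshold. A minimal sketch, assuming a strict "less than" comparison and the population standard deviation (both assumptions, as is the helper name):

```python
from statistics import pstdev


def is_flatline(signal, threshold=1e-10):
    """Return True when the signal's standard deviation is below threshold."""
    signal = list(signal)
    if len(signal) < 2:
        # Assumption: a signal with fewer than two samples has no variation.
        return True
    return pstdev(signal) < threshold
```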
- coco_pipe.io.quality.drop_epoch_outliers(container, z_threshold=5.0, outlier_fraction_threshold=0.3, subject_col='subject', feature_cols=None, descriptor_names=None, group_by=None, min_obs=None, feature_schema=None)#
Drop observations with a high fraction of MAD-based feature outliers.
group_by=None makes one global drop decision across all features. When set ("family", "measure", or "feature") the decision is made per descriptor group at that granularity and the returned dict is keyed by group label.
- Parameters:
- Return type:
tuple[coco_pipe.io.structures.DataContainer | dict[str, numpy.ndarray], QCResult]
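To illustrate the per-group decision, the sketch below takes precomputed per-row outlier fractions for each group label and returns one keep mask per label. What the returned arrays actually contain in the library (masks, indices, or something else) is not stated above, so the boolean-mask layout, the strict comparison, and the helper name are all assumptions.

```python
def keep_masks_by_group(fractions_by_group, outlier_fraction_threshold=0.3):
    """Map each group label to a per-row keep mask.

    A row is kept for a group when its outlier fraction within that group
    does not exceed the threshold (assumption: strict '>' triggers a drop).
    """
    return {
        label: [fraction <= outlier_fraction_threshold for fraction in fractions]
        for label, fractions in fractions_by_group.items()
    }
```

Keeping the decisions keyed by label means a row can be rejected for one descriptor group while remaining usable for another.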
- coco_pipe.io.quality.drop_subject_outliers(container, z_threshold=5.0, outlier_fraction_threshold=0.2, subject_col='subject', feature_cols=None, descriptor_names=None, group_by=None, feature_schema=None)#
Drop subjects with a high cohort-level feature outlier burden.
group_by=None makes one global decision across all features. When set ("family", "measure", or "feature") the burden is computed per descriptor group at that granularity and the returned dict is keyed by group label.
- Parameters:
container (coco_pipe.io.structures.DataContainer)
z_threshold (float)
outlier_fraction_threshold (float)
subject_col (str)
group_by (str | None)
feature_schema (pandas.DataFrame | None)
- Return type:
tuple[coco_pipe.io.structures.DataContainer | dict[str, numpy.ndarray], QCResult]
- coco_pipe.io.quality.run_qc(container, epoch_z_threshold=5.0, epoch_outlier_fraction_threshold=0.3, subject_z_threshold=5.0, subject_outlier_fraction_threshold=0.2, subject_col='subject', feature_cols=None, compute_missingness=True)#
Run epoch QC followed by subject QC and return a merged result.
- Parameters:
- Return type:
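The two-stage flow (epoch QC first, then subject QC on the survivors, with every decision logged) can be sketched without the library as follows. QCLog is a hypothetical stand-in for QCResult, the input is assumed to be (epoch_id, subject, outlier_fraction) tuples with fractions already computed, and the subject burden is taken to be the mean fraction over surviving epochs; the actual burden computation may differ.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class QCLog:
    """Hypothetical stand-in for QCResult: records what was dropped."""
    epochs_dropped: list = field(default_factory=list)
    subjects_dropped: list = field(default_factory=list)


def two_stage_qc(epochs, epoch_threshold=0.3, subject_threshold=0.2):
    """Drop outlier epochs, then outlier subjects, logging both steps."""
    log = QCLog()

    # Stage 1: epoch-level rejection on the per-epoch outlier fraction.
    kept = []
    for epoch_id, subject, fraction in epochs:
        if fraction > epoch_threshold:
            log.epochs_dropped.append(epoch_id)
        else:
            kept.append((epoch_id, subject, fraction))

    # Stage 2: subject-level rejection on the mean burden of surviving epochs.
    per_subject = defaultdict(list)
    for _, subject, fraction in kept:
        per_subject[subject].append(fraction)
    bad = {s for s, f in per_subject.items() if sum(f) / len(f) > subject_threshold}
    log.subjects_dropped.extend(sorted(bad))

    survivors = [epoch for epoch in kept if epoch[1] not in bad]
    return survivors, log
```

Running the subject stage after the epoch stage means a subject is judged only on epochs that passed the first gate, so a few bad epochs do not condemn the whole subject.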