coco_pipe.descriptors.io#
Descriptor-table file IO: save, load, merge, and feature-column consistency.
This is the descriptor-specific table IO layer. It builds on the generic
coco_pipe.io._serialization.read_table() primitive but owns the
descriptor concerns: the _feature_columns.json sidecar contract, loading a
descriptor table into a DataContainer (flat or
sensor x feature), and the cross-shard merge stage.
Author: Hamza Abdelhedi <hamza.abdelhedi@umontreal.ca>
Functions#
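| save_descriptor_table() | Write a descriptor table, with an optional feature-column sidecar. |
| check_feature_column_consistency() | Load a feature-column sidecar from shard_root and assert consistency. |
| merge_descriptor_tables() | Merge per-shard tables of one table kind into a single table. |
| load_descriptor_table() | Load a descriptor feature table into a DataContainer. |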
Module Contents#
- coco_pipe.descriptors.io.save_descriptor_table(df, base_path, feature_columns=None, formats=('parquet',))#
Write a descriptor table, with an optional feature-column sidecar.
load_descriptor_table() reads whichever single file it is pointed at, so by default only the canonical {base_path}.parquet is written. Pass formats=("parquet", "csv") to additionally emit a human-readable {base_path}.csv (doubles the on-disk footprint). A {base_path.name}_feature_columns.json sidecar is written when feature_columns is given.
- Parameters:
df (pandas.DataFrame) – Table to write.
base_path (pathlib.Path | str) – Output path without suffix, e.g. combined/sensor_subject_features.
feature_columns (collections.abc.Sequence[str] | None) – Optional ordered list of descriptor feature-column names written to a _feature_columns.json sidecar alongside the table.
formats (collections.abc.Sequence[str]) – Table formats to write. Any non-empty subset of {"parquet", "csv"}; defaults to parquet only.
- Raises:
ValueError – If formats is empty or contains an unsupported format.
- Return type:
None
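The file-naming and validation rules described above can be sketched without pandas. This is a minimal illustration, not the library's implementation: the helper names (validate_formats, planned_outputs, write_sidecar), the example column names, and the assumption that the sidecar is a plain JSON array of column names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

SUPPORTED_FORMATS = {"parquet", "csv"}  # per the documented `formats` contract


def validate_formats(formats):
    """Reject empty or unsupported format lists (hypothetical helper)."""
    formats = list(formats)
    if not formats:
        raise ValueError("formats must not be empty")
    unsupported = sorted(set(formats) - SUPPORTED_FORMATS)
    if unsupported:
        raise ValueError(f"unsupported formats: {unsupported}")
    return formats


def planned_outputs(base_path, formats=("parquet",), feature_columns=None):
    """Return the paths a save call would be expected to produce."""
    base_path = Path(base_path)
    # Append the suffix to the full name so a dotted base name is preserved.
    paths = [
        base_path.parent / f"{base_path.name}.{fmt}"
        for fmt in validate_formats(formats)
    ]
    if feature_columns is not None:
        paths.append(base_path.parent / f"{base_path.name}_feature_columns.json")
    return paths


def write_sidecar(base_path, feature_columns):
    """Write the ordered column list as a JSON array (assumed sidecar layout)."""
    base_path = Path(base_path)
    sidecar = base_path.parent / f"{base_path.name}_feature_columns.json"
    sidecar.parent.mkdir(parents=True, exist_ok=True)
    sidecar.write_text(json.dumps(list(feature_columns), indent=2))
    return sidecar


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "combined" / "sensor_subject_features"
    names = [
        p.name for p in planned_outputs(base, ("parquet", "csv"), ["alpha_power"])
    ]
    sidecar = write_sidecar(base, ["alpha_power", "theta_power"])
    restored = json.loads(sidecar.read_text())
```

With the real function, the table itself would be written by pandas next to the sidecar; only the naming and the formats check are modelled here.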
- coco_pipe.descriptors.io.check_feature_column_consistency(shard_root, json_name, accumulated, col_key)#
Load a feature-column sidecar from shard_root and assert consistency.
Intended for merging per-shard descriptor outputs: on the first call for a given col_key the loaded column list is stored in accumulated. On every subsequent call the loaded list is compared against the stored one and a ValueError is raised on any mismatch, preventing a silent merge of shards produced with incompatible feature sets.
- Parameters:
shard_root (pathlib.Path | str) – Directory containing the json_name feature-column sidecar.
json_name (str) – Filename of the feature-column JSON sidecar within shard_root.
accumulated (dict[str, list[str] | None]) – Mapping of col_key -> feature column list, mutated in place.
col_key (str) – Key identifying which feature-column set this sidecar belongs to (e.g. "sensor_epoch").
- Return type:
None
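The store-on-first-call, compare-afterwards behaviour can be sketched as follows. This is an assumption-laden illustration rather than the library's code: check_consistency_sketch is a hypothetical stand-in, the sidecar is assumed to be a plain JSON array, and the comparison is assumed to be order-sensitive list equality.

```python
import json
import tempfile
from pathlib import Path


def check_consistency_sketch(shard_root, json_name, accumulated, col_key):
    """Store the first column list seen for col_key; compare later ones to it."""
    # Assumption: the sidecar is a plain JSON array of column names.
    columns = json.loads((Path(shard_root) / json_name).read_text())
    if accumulated.get(col_key) is None:
        accumulated[col_key] = columns  # the first shard sets the reference
        return
    # Assumption: order matters, so plain list equality is used.
    if columns != accumulated[col_key]:
        raise ValueError(
            f"feature columns for {col_key!r} in {shard_root} "
            "differ from the first shard"
        )


with tempfile.TemporaryDirectory() as tmp:
    # Three hypothetical shards; the last one has an incompatible feature set.
    shards = {
        "shard_0": ["f1", "f2"],
        "shard_1": ["f1", "f2"],
        "shard_2": ["f1", "f3"],
    }
    for name, cols in shards.items():
        root = Path(tmp) / name
        root.mkdir()
        (root / "cols.json").write_text(json.dumps(cols))

    accumulated = {"sensor_epoch": None}
    for name in ("shard_0", "shard_1"):
        check_consistency_sketch(
            Path(tmp) / name, "cols.json", accumulated, "sensor_epoch"
        )
    try:
        check_consistency_sketch(
            Path(tmp) / "shard_2", "cols.json", accumulated, "sensor_epoch"
        )
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

Because accumulated is mutated in place, one dictionary can be shared across all shards and all col_key values in a merge loop.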
- coco_pipe.descriptors.io.merge_descriptor_tables(table_paths, feature_columns_paths=None, *, out_base_path=None, formats=('parquet',))#
Merge per-shard tables of one table kind into a single table.
The cross-shard merge stage. A “table kind” is one descriptor output table (e.g. sensor_epoch / sensor_subject / pooled_subject) written once per shard; this function row-concatenates that kind across shards. (It is not about the band/param/complexity descriptor family.) Each shard is read, its feature-column sidecar is optionally checked against the first, the rows are concatenated, and the combined table (plus sidecar) is optionally written via save_descriptor_table(). Discovery, manifests, and dataset-level QC are deliberately left to the caller, which calls this once per table kind.
- Parameters:
table_paths (collections.abc.Sequence[pathlib.Path | str]) – Per-shard table files (.csv / .parquet) for one table kind, in the desired row order.
feature_columns_paths (collections.abc.Sequence[pathlib.Path | str] | None) – Optional per-shard feature-column JSON sidecars, aligned with table_paths. When given, cross-shard consistency is enforced via check_feature_column_consistency() and the agreed column list is used as the combined sidecar.
out_base_path (pathlib.Path | str | None) – Optional output path without suffix. When set, the combined table is written there via save_descriptor_table().
formats (collections.abc.Sequence[str]) – Output formats forwarded to save_descriptor_table() (default parquet only).
- Returns:
(combined_df, feature_columns) where feature_columns is the validated column list when feature_columns_paths was provided, else None.
- Return type:
- Raises:
ValueError – If table_paths is empty, the sidecar list is misaligned, or a shard’s feature columns differ from the first shard.
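The control flow of the merge stage (validate inputs, check each shard's columns against the first, concatenate rows, return the agreed column list) can be sketched in pure Python. This is only an illustration under stated assumptions: merge_sketch is hypothetical, and in-memory lists of row dicts stand in for the table files and JSON sidecars that the real function reads from disk.

```python
def merge_sketch(shard_tables, shard_feature_columns=None):
    """Row-concatenate per-shard tables after the documented validation steps."""
    if not shard_tables:
        raise ValueError("no shard tables given")
    if shard_feature_columns is not None and len(shard_feature_columns) != len(
        shard_tables
    ):
        raise ValueError("sidecar list is not aligned with the table list")

    agreed = None  # column list of the first shard, once seen
    combined = []
    for i, rows in enumerate(shard_tables):
        if shard_feature_columns is not None:
            columns = list(shard_feature_columns[i])
            if agreed is None:
                agreed = columns
            elif columns != agreed:
                raise ValueError(
                    f"shard {i} feature columns differ from the first shard"
                )
        combined.extend(rows)  # preserves the caller's shard order
    return combined, agreed


# Two hypothetical shards of one table kind, sharing one feature column.
shard_a = [{"subject": "s01", "f1": 0.1}, {"subject": "s02", "f1": 0.2}]
shard_b = [{"subject": "s03", "f1": 0.3}]
combined, feature_columns = merge_sketch([shard_a, shard_b], [["f1"], ["f1"]])
```

Row order follows the order of the input shards, and the second return value stays None when no sidecars are supplied, mirroring the documented return contract.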
- coco_pipe.descriptors.io.load_descriptor_table(table_path, feature_columns_path, known_families=('band', 'param', 'complexity'), condition=None, target_col=None, subjects=None, subject_col='subject', analysis_mode='flat', descriptor_families=None, descriptor_max_abs_value=None, drop_degenerate_columns=False, max_missing_rate=0.2, drop_constant_columns=True, constant_tol=1e-12, max_row_drop_rate=None, location_statistic=None, exclude_subfamilies=None)#
Load a descriptor feature table into a DataContainer.
- Parameters:
table_path (pathlib.Path | str)
feature_columns_path (pathlib.Path | str)
condition (str | None)
target_col (str | None)
subjects (collections.abc.Sequence[str] | None)
subject_col (str)
analysis_mode (str)
descriptor_families (collections.abc.Sequence[str] | None)
descriptor_max_abs_value (float | None)
drop_degenerate_columns (bool)
max_missing_rate (float)
drop_constant_columns (bool)
constant_tol (float)
max_row_drop_rate (float | None)
location_statistic (str | None)
exclude_subfamilies (collections.abc.Sequence[str] | None)
- Return type: