coco_pipe.descriptors.io

Descriptor-table file IO: save, load, merge, and feature-column consistency.

This is the descriptor-specific table IO layer. It builds on the generic coco_pipe.io._serialization.read_table() primitive but owns the descriptor concerns: the _feature_columns.json sidecar contract, loading a descriptor table into a DataContainer (flat or sensor x feature), and the cross-shard merge stage.

Author: Hamza Abdelhedi <hamza.abdelhedi@umontreal.ca>

Functions

save_descriptor_table(df, base_path[, ...])

Write a descriptor table, with an optional feature-column sidecar.

check_feature_column_consistency(shard_root, ...)

Load a feature-column sidecar from shard_root and assert consistency.

merge_descriptor_tables(table_paths[, ...])

Merge per-shard tables of one table kind into a single table.

load_descriptor_table(table_path, feature_columns_path)

Load a descriptor feature table into a DataContainer.

Module Contents

coco_pipe.descriptors.io.save_descriptor_table(df, base_path, feature_columns=None, formats=('parquet',))

Write a descriptor table, with an optional feature-column sidecar.

Because load_descriptor_table() reads only the single file it is pointed at, only the canonical {base_path}.parquet is written by default. Pass formats=("parquet", "csv") to also emit a human-readable {base_path}.csv; this writes the table a second time and so increases the on-disk footprint. A {base_path.name}_feature_columns.json sidecar is written when feature_columns is given.

Parameters:
  • df (pandas.DataFrame) – Table to write.

  • base_path (pathlib.Path | str) – Output path without suffix, e.g. combined/sensor_subject_features.

  • feature_columns (collections.abc.Sequence[str] | None) – Optional ordered list of descriptor feature-column names written to a _feature_columns.json sidecar alongside the table.

  • formats (collections.abc.Sequence[str]) – Table formats to write. Any subset of {"parquet", "csv"}; defaults to parquet only.

Raises:

ValueError – If formats is empty or contains an unsupported format.

Return type:

None
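The output-naming rules described above can be sketched with the standard library alone. The helper `planned_outputs` below is hypothetical and is not part of coco_pipe; it only mirrors the documented contract (one file per requested format, plus a sidecar when feature_columns is given). It assumes the sidecar sits in the same directory as the table, and the feature names are illustrative.

```python
from pathlib import Path

SUPPORTED_FORMATS = ("parquet", "csv")  # the two formats named in the docs


def planned_outputs(base_path, feature_columns=None, formats=("parquet",)):
    """Hypothetical helper: list the files a save call would be expected to write."""
    formats = tuple(formats)
    if not formats:
        raise ValueError("formats must not be empty")
    unsupported = [fmt for fmt in formats if fmt not in SUPPORTED_FORMATS]
    if unsupported:
        raise ValueError(f"unsupported formats: {unsupported}")

    base_path = Path(base_path)
    # One table file per requested format: {base_path}.parquet, {base_path}.csv.
    # with_name() is used so a dot inside the base name is never treated as a suffix.
    outputs = [base_path.with_name(f"{base_path.name}.{fmt}") for fmt in formats]
    if feature_columns is not None:
        # Assumption: the sidecar is written next to the table files.
        outputs.append(base_path.with_name(f"{base_path.name}_feature_columns.json"))
    return outputs


paths = planned_outputs(
    "combined/sensor_subject_features",
    feature_columns=["feature_a", "feature_b"],  # illustrative names only
    formats=("parquet", "csv"),
)
for path in paths:
    print(path.as_posix())
```

With the default formats and no feature_columns, the sketch yields a single path ending in .parquet, which matches the default behaviour described above.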

coco_pipe.descriptors.io.check_feature_column_consistency(shard_root, json_name, accumulated, col_key)

Load a feature-column sidecar from shard_root and assert consistency.

Intended for merging per-shard descriptor outputs. On the first call for a given col_key, the loaded column list is stored in accumulated. On every subsequent call, the loaded list is compared against the stored one and a ValueError is raised on any mismatch, which prevents a silent merge of shards produced with incompatible feature sets.

Parameters:
  • shard_root (pathlib.Path | str) – Directory containing the json_name feature-column sidecar.

  • json_name (str) – Filename of the feature-column JSON sidecar within shard_root.

  • accumulated (dict[str, list[str] | None]) – Mapping of col_key -> feature column list, mutated in place.

  • col_key (str) – Key identifying which feature-column set this sidecar belongs to (e.g. "sensor_epoch").

Return type:

None
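The store-on-first-call, compare-afterwards behaviour can be illustrated with a small stand-in. `check_consistency_sketch` below is not the library function; it is a minimal sketch that assumes the sidecar is a plain JSON array of column names (the actual sidecar layout is not specified here). The shard names and feature names are made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def check_consistency_sketch(shard_root, json_name, accumulated, col_key):
    """Illustrative stand-in for the documented behaviour, not the real function."""
    # Assumption: the sidecar holds a plain JSON array of column names.
    columns = json.loads((Path(shard_root) / json_name).read_text())
    if accumulated.get(col_key) is None:
        # First call for this key: remember the column list.
        accumulated[col_key] = columns
    elif accumulated[col_key] != columns:
        # Later call: any difference (names or order) is an error.
        raise ValueError(
            f"{col_key}: feature columns in {shard_root} differ from the first shard"
        )


with tempfile.TemporaryDirectory() as tmp:
    shards = {
        "shard_0": ["f1", "f2"],
        "shard_1": ["f1", "f2"],
        "shard_2": ["f1", "f3"],  # deliberately incompatible
    }
    for name, cols in shards.items():
        shard_dir = Path(tmp) / name
        shard_dir.mkdir()
        (shard_dir / "sensor_epoch_feature_columns.json").write_text(json.dumps(cols))

    accumulated = {"sensor_epoch": None}
    mismatch_detected = False
    for name in shards:
        try:
            check_consistency_sketch(
                Path(tmp) / name,
                "sensor_epoch_feature_columns.json",
                accumulated,
                "sensor_epoch",
            )
        except ValueError as err:
            mismatch_detected = True
            print("rejected:", name)

print(accumulated)
```

Because accumulated is mutated in place, one dictionary can be shared across every col_key handled in the same merge loop.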

coco_pipe.descriptors.io.merge_descriptor_tables(table_paths, feature_columns_paths=None, *, out_base_path=None, formats=('parquet',))

Merge per-shard tables of one table kind into a single table.

This is the cross-shard merge stage. A “table kind” is one descriptor output table (e.g. sensor_epoch, sensor_subject, or pooled_subject) that is written once per shard; this function row-concatenates that kind across shards. It is unrelated to the band/param/complexity descriptor families. Each shard is read, its feature-column sidecar is optionally checked against the first shard’s, the rows are concatenated, and the combined table (plus sidecar) is optionally written via save_descriptor_table(). Discovery, manifests, and dataset-level QC are deliberately left to the caller, which calls this function once per table kind.

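Parameters:
  • table_paths – Paths to the per-shard tables of one table kind. Must not be empty.

  • feature_columns_paths – Optional per-shard feature-column sidecar paths, aligned with table_paths. When given, each shard’s feature columns are checked against the first shard’s.

  • out_base_path – Optional output base path. When given, the combined table (plus sidecar) is written via save_descriptor_table().

  • formats – Table formats to write, as in save_descriptor_table(); defaults to parquet only.
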
Returns:

(combined_df, feature_columns) where feature_columns is the validated column list when feature_columns_paths was provided, else None.

Return type:

tuple

Raises:

ValueError – If table_paths is empty, the sidecar list is misaligned, or a shard’s feature columns differ from the first shard.
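The merge contract (non-empty input, aligned sidecar list, every shard matching the first, then row concatenation) can be sketched in memory. `merge_rows_sketch` below is a hypothetical analogue, not the library function: it takes already-loaded rows as lists of dicts instead of reading files, and it returns a plain list where the real function returns a combined table.

```python
def merge_rows_sketch(shard_rows, shard_feature_columns=None):
    """Illustrative in-memory analogue of the documented merge contract."""
    if not shard_rows:
        raise ValueError("no shards to merge")
    if shard_feature_columns is not None and len(shard_feature_columns) != len(shard_rows):
        raise ValueError("feature-column list is misaligned with the shard list")

    reference = None  # validated column list, taken from the first shard
    combined = []
    for index, rows in enumerate(shard_rows):
        if shard_feature_columns is not None:
            columns = list(shard_feature_columns[index])
            if reference is None:
                reference = columns
            elif columns != reference:
                raise ValueError(
                    f"shard {index} feature columns differ from the first shard"
                )
        # Row concatenation: append this shard's rows after the previous ones.
        combined.extend(rows)
    # reference stays None when no sidecar information was supplied.
    return combined, reference


# Illustrative data: two shards of the same table kind.
shard_a = [{"subject": "s01", "f1": 0.1}]
shard_b = [{"subject": "s02", "f1": 0.3}, {"subject": "s03", "f1": 0.2}]

combined, feature_columns = merge_rows_sketch([shard_a, shard_b], [["f1"], ["f1"]])
print(len(combined), feature_columns)

combined_unchecked, no_columns = merge_rows_sketch([shard_a, shard_b])
print(len(combined_unchecked), no_columns)
```

A caller that handles several table kinds would repeat this once per kind, keeping discovery and QC outside the merge step as described above.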

coco_pipe.descriptors.io.load_descriptor_table(table_path, feature_columns_path, known_families=('band', 'param', 'complexity'), condition=None, target_col=None, subjects=None, subject_col='subject', analysis_mode='flat', descriptor_families=None, descriptor_max_abs_value=None, drop_degenerate_columns=False, max_missing_rate=0.2, drop_constant_columns=True, constant_tol=1e-12, max_row_drop_rate=None, location_statistic=None, exclude_subfamilies=None)

Load a descriptor feature table into a DataContainer.

Parameters:
Return type:

coco_pipe.io.structures.DataContainer