coco_pipe.descriptors.io

Descriptor-table file IO: save, load, merge, and feature-column consistency.

This is the descriptor-specific table IO layer. It builds on the generic coco_pipe.io._serialization.read_table() primitive but owns the descriptor concerns: the _feature_columns.json sidecar contract, loading a descriptor table into a DataContainer (flat or sensor x feature), and the cross-shard merge stage.

Author: Hamza Abdelhedi <hamza.abdelhedi@umontreal.ca>

Functions

`save_descriptor_table`(df, base_path[, ...])	Write a descriptor table, with an optional feature-column sidecar.
`check_feature_column_consistency`(shard_root, ...)	Load a feature-column sidecar from shard_root and assert consistency.
`merge_descriptor_tables`(table_paths[, ...])	Merge per-shard tables of one table kind into a single table.
`load_descriptor_table`(table_path, feature_columns_path[, ...])	Load a descriptor feature table into a `DataContainer`.

Module Contents

coco_pipe.descriptors.io.save_descriptor_table(df, base_path, feature_columns=None, formats=('parquet',))

Write a descriptor table, with an optional feature-column sidecar.

load_descriptor_table() reads only the single file it is pointed at, so by default only the canonical {base_path}.parquet is written. Pass formats=("parquet", "csv") to also emit a human-readable {base_path}.csv, which stores a second copy of the table on disk. When feature_columns is given, a {base_path.name}_feature_columns.json sidecar is written alongside the table.

Parameters:

df (pandas.DataFrame) – Table to write.
base_path (pathlib.Path | str) – Output path without suffix, e.g. combined/sensor_subject_features.
feature_columns (collections.abc.Sequence[str] | None) – Optional ordered list of descriptor feature-column names written to a _feature_columns.json sidecar alongside the table.
formats (collections.abc.Sequence[str]) – Table formats to write. Any subset of {"parquet", "csv"}; defaults to parquet only.

Raises:

ValueError – If formats is empty or contains an unsupported format.

Return type:

None
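
The naming rules above can be sketched without touching the library. The helper below is hypothetical (planned_outputs is not part of coco_pipe). It only mirrors the documented contract: one file per requested format, an optional sidecar next to the table, and a ValueError for an empty or unsupported formats argument. The feature-column names are made up for illustration.

```python
from pathlib import Path

# Formats the documentation lists as supported.
SUPPORTED_FORMATS = {"parquet", "csv"}


def planned_outputs(base_path, feature_columns=None, formats=("parquet",)):
    """Return the paths a save call would be expected to produce.

    Hypothetical helper: it mirrors the documented naming contract only
    and writes nothing to disk.
    """
    base = Path(base_path)
    formats = list(formats)
    if not formats:
        raise ValueError("formats must not be empty")
    unsupported = sorted(set(formats) - SUPPORTED_FORMATS)
    if unsupported:
        raise ValueError(f"unsupported formats: {unsupported}")
    # Build names from base.name so a dotted base name is not truncated.
    paths = [base.with_name(f"{base.name}.{fmt}") for fmt in formats]
    if feature_columns is not None:
        paths.append(base.with_name(f"{base.name}_feature_columns.json"))
    return paths


outputs = planned_outputs(
    "combined/sensor_subject_features",
    feature_columns=["alpha_power", "beta_power"],  # made-up column names
    formats=("parquet", "csv"),
)
print([p.as_posix() for p in outputs])
```

With the default formats and no feature_columns, the same helper returns a single .parquet path, which matches the file load_descriptor_table() would then be pointed at.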

coco_pipe.descriptors.io.check_feature_column_consistency(shard_root, json_name, accumulated, col_key)

Load a feature-column sidecar from shard_root and assert consistency.

Intended for merging per-shard descriptor outputs. On the first call for a given col_key, the loaded column list is stored in accumulated. On every subsequent call, the loaded list is compared against the stored one and a ValueError is raised on any mismatch, which prevents silently merging shards produced with incompatible feature sets.

Parameters:

shard_root (pathlib.Path | str) – Directory containing the json_name feature-column sidecar.
json_name (str) – Filename of the feature-column JSON sidecar within shard_root.
accumulated (dict[str, list[str] | None]) – Mapping of col_key -> feature column list, mutated in place.
col_key (str) – Key identifying which feature-column set this sidecar belongs to (e.g. "sensor_epoch").

Raises:

ValueError – If the loaded column list differs from the list already stored for col_key.

Return type:

None
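
The store-then-compare behaviour described above can be illustrated with a small stand-in. This is a sketch, not the library's implementation: check_columns is a hypothetical function, the sidecar is assumed to be a plain JSON list of column names, and the comparison is assumed to be order-sensitive. The shard directories and the cols.json filename are invented for the demo.

```python
import json
import tempfile
from pathlib import Path


def check_columns(shard_root, json_name, accumulated, col_key):
    """Hypothetical stand-in for the documented store-then-compare check.

    Assumes the sidecar file holds a plain JSON list of column names.
    """
    loaded = json.loads((Path(shard_root) / json_name).read_text())
    stored = accumulated.get(col_key)
    if stored is None:
        # First call for this key: remember the column list.
        accumulated[col_key] = loaded
    elif loaded != stored:
        # Later call: list equality also catches a changed column order.
        raise ValueError(
            f"{col_key}: feature columns in {shard_root} differ from earlier shards"
        )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    shard_columns = {
        "shard_0": ["f1", "f2"],
        "shard_1": ["f1", "f2"],
        "shard_2": ["f1", "f3"],  # deliberately incompatible
    }
    for name, cols in shard_columns.items():
        (root / name).mkdir()
        (root / name / "cols.json").write_text(json.dumps(cols))

    accumulated = {"sensor_epoch": None}
    check_columns(root / "shard_0", "cols.json", accumulated, "sensor_epoch")
    check_columns(root / "shard_1", "cols.json", accumulated, "sensor_epoch")
    try:
        check_columns(root / "shard_2", "cols.json", accumulated, "sensor_epoch")
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

Because accumulated is mutated in place, one dictionary can be shared across all shards and all table kinds, with col_key separating the column sets.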

coco_pipe.descriptors.io.merge_descriptor_tables(table_paths, feature_columns_paths=None, *, out_base_path=None, formats=('parquet',))

Merge per-shard tables of one table kind into a single table.

This is the cross-shard merge stage. A “table kind” is one descriptor output table (e.g. sensor_epoch, sensor_subject, or pooled_subject) that is written once per shard; this function row-concatenates that kind across shards. It is unrelated to the band/param/complexity descriptor families. Each shard is read, its feature-column sidecar is optionally checked against the first shard's, the rows are concatenated, and the combined table (plus sidecar) is optionally written via save_descriptor_table(). Discovery, manifests, and dataset-level QC are deliberately left to the caller, which calls this function once per table kind.

Parameters:

table_paths (collections.abc.Sequence[pathlib.Path | str]) – Per-shard table files (.csv / .parquet) for one table kind, in the desired row order.
feature_columns_paths (collections.abc.Sequence[pathlib.Path | str] | None) – Optional per-shard feature-column JSON sidecars, aligned with table_paths. When given, cross-shard consistency is enforced via check_feature_column_consistency() and the agreed column list is used as the combined sidecar.
out_base_path (pathlib.Path | str | None) – Optional output path without suffix. When set, the combined table is written there via save_descriptor_table().
formats (collections.abc.Sequence[str]) – Output formats forwarded to save_descriptor_table() (default parquet only).

Returns:

(combined_df, feature_columns) where feature_columns is the validated column list when feature_columns_paths was provided, else None.

Return type:

tuple

Raises:

ValueError – If table_paths is empty, the sidecar list is misaligned, or a shard’s feature columns differ from the first shard.
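
The merge flow and its three error cases can be sketched on in-memory data. The function below is a hypothetical, pandas-free stand-in: each shard table is represented as a list of row dictionaries and each sidecar as a list of column names, whereas the real function reads table files and returns a combined table. Subject IDs and the f1 column are made up.

```python
def merge_shards(shard_rows, shard_columns=None):
    """Row-concatenate per-shard tables given as lists of dict rows.

    Hypothetical stand-in for the documented merge flow; it performs
    no file IO and does not use pandas.
    """
    if not shard_rows:
        raise ValueError("no shard tables given")
    if shard_columns is not None and len(shard_columns) != len(shard_rows):
        raise ValueError("sidecar list is not aligned with the table list")

    agreed = None
    combined = []
    for i, rows in enumerate(shard_rows):
        if shard_columns is not None:
            cols = list(shard_columns[i])
            if agreed is None:
                agreed = cols  # the first shard sets the reference list
            elif cols != agreed:
                raise ValueError(
                    f"shard {i} feature columns differ from the first shard"
                )
        # Rows keep the order in which the shards were passed.
        combined.extend(rows)
    return combined, agreed


shard_a = [{"subject": "s01", "f1": 0.1}]
shard_b = [{"subject": "s02", "f1": 0.2}, {"subject": "s03", "f1": 0.3}]

combined, columns = merge_shards([shard_a, shard_b], [["f1"], ["f1"]])
print(len(combined), columns)
```

As in the documented return value, the second element is the agreed column list when sidecars are supplied and None otherwise.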

coco_pipe.descriptors.io.load_descriptor_table(table_path, feature_columns_path, known_families=('band', 'param', 'complexity'), condition=None, target_col=None, subjects=None, subject_col='subject', analysis_mode='flat', descriptor_families=None, descriptor_max_abs_value=None, drop_degenerate_columns=False, max_missing_rate=0.2, drop_constant_columns=True, constant_tol=1e-12, max_row_drop_rate=None, location_statistic=None, exclude_subfamilies=None)

Load a descriptor feature table into a DataContainer.

Parameters:

table_path (pathlib.Path | str)
feature_columns_path (pathlib.Path | str)
known_families (tuple[str, ...])
condition (str | None)
target_col (str | None)
subjects (collections.abc.Sequence[str] | None)
subject_col (str)
analysis_mode (str)
descriptor_families (collections.abc.Sequence[str] | None)
descriptor_max_abs_value (float | None)
drop_degenerate_columns (bool)
max_missing_rate (float)
drop_constant_columns (bool)
constant_tol (float)
max_row_drop_rate (float | None)
location_statistic (str | None)
exclude_subfamilies (collections.abc.Sequence[str] | None)

Return type:

coco_pipe.io.structures.DataContainer