coco_pipe.descriptors.io
========================

.. py:module:: coco_pipe.descriptors.io

.. autoapi-nested-parse::

   Descriptor-table file IO: save, load, merge, and feature-column consistency.

   This is the descriptor-specific table IO layer. It builds on the generic
   :func:`coco_pipe.io._serialization.read_table` primitive but owns the
   descriptor concerns: the ``_feature_columns.json`` sidecar contract, loading a
   descriptor table into a :class:`~coco_pipe.io.structures.DataContainer` (flat
   or sensor x feature), and the cross-shard **merge** stage.

   Author: Hamza Abdelhedi


Functions
---------

.. autoapisummary::

   coco_pipe.descriptors.io.save_descriptor_table
   coco_pipe.descriptors.io.check_feature_column_consistency
   coco_pipe.descriptors.io.merge_descriptor_tables
   coco_pipe.descriptors.io.load_descriptor_table


Module Contents
---------------

.. py:function:: save_descriptor_table(df, base_path, feature_columns = None, formats = ('parquet', ))

   Write a descriptor table, with an optional feature-column sidecar.

   :func:`load_descriptor_table` reads whichever single file it is pointed at, so
   by default only the canonical ``{base_path}.parquet`` is written. Pass
   ``formats=("parquet", "csv")`` to additionally emit a human-readable
   ``{base_path}.csv`` (doubles the on-disk footprint). A
   ``{base_path.name}_feature_columns.json`` sidecar is written when
   *feature_columns* is given.

   :param df: Table to write.
   :param base_path: Output path without suffix, e.g.
                     ``combined/sensor_subject_features``.
   :param feature_columns: Optional ordered list of descriptor feature-column
                           names written to a ``_feature_columns.json`` sidecar
                           alongside the table.
   :param formats: Table formats to write. Any subset of ``{"parquet", "csv"}``;
                   defaults to parquet only.

   :raises ValueError: If *formats* is empty or contains an unsupported format.


.. py:function:: check_feature_column_consistency(shard_root, json_name, accumulated, col_key)

   Load a feature-column sidecar from *shard_root* and assert consistency.

   Intended for merging per-shard descriptor outputs: on the first call for a
   given *col_key* the loaded column list is stored in *accumulated*. On every
   subsequent call the loaded list is compared against the stored one and a
   :class:`ValueError` is raised on any mismatch, preventing a silent merge of
   shards produced with incompatible feature sets.

   :param shard_root: Directory containing the ``json_name`` feature-column
                      sidecar.
   :param json_name: Filename of the feature-column JSON sidecar within
                     *shard_root*.
   :param accumulated: Mapping of ``col_key -> feature column list``, mutated in
                       place.
   :param col_key: Key identifying which feature-column set this sidecar belongs
                   to (e.g. ``"sensor_epoch"``).


.. py:function:: merge_descriptor_tables(table_paths, feature_columns_paths = None, *, out_base_path = None, formats = ('parquet', ))

   Merge per-shard tables of one *table kind* into a single table.

   The cross-shard **merge** stage. A "table kind" is one descriptor output table
   (e.g. ``sensor_epoch`` / ``sensor_subject`` / ``pooled_subject``) written once
   per shard; this row-concatenates that kind across shards. (It is not about the
   band/param/complexity descriptor *family*.)

   Each shard is read, its feature-column sidecar is optionally checked against
   the first, the rows are concatenated, and the combined table (plus sidecar) is
   optionally written via :func:`save_descriptor_table`. Discovery, manifests, and
   dataset-level QC are deliberately left to the caller, which calls this once per
   table kind.

   :param table_paths: Per-shard table files (``.csv`` / ``.parquet``) for one
                       table kind, in the desired row order.
   :param feature_columns_paths: Optional per-shard feature-column JSON sidecars,
                                 aligned with *table_paths*. When given,
                                 cross-shard consistency is enforced via
                                 :func:`check_feature_column_consistency` and the
                                 agreed column list is used as the combined
                                 sidecar.
   :param out_base_path: Optional output path without suffix. When set, the
                         combined table is written there via
                         :func:`save_descriptor_table`.
   :param formats: Output formats forwarded to :func:`save_descriptor_table`
                   (default parquet only).

   :returns: ``(combined_df, feature_columns)`` where ``feature_columns`` is the
             validated column list when *feature_columns_paths* was provided,
             else ``None``.
   :rtype: tuple

   :raises ValueError: If *table_paths* is empty, the sidecar list is misaligned,
                       or a shard's feature columns differ from the first shard.


.. py:function:: load_descriptor_table(table_path, feature_columns_path, known_families = ('band', 'param', 'complexity'), condition = None, target_col = None, subjects = None, subject_col = 'subject', analysis_mode = 'flat', descriptor_families = None, descriptor_max_abs_value = None, drop_degenerate_columns = False, max_missing_rate = 0.2, drop_constant_columns = True, constant_tol = 1e-12, max_row_drop_rate = None, location_statistic = None, exclude_subfamilies = None)

   Load a descriptor feature table into a :class:`~coco_pipe.io.DataContainer`.
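
The store-then-compare behaviour documented for
:func:`check_feature_column_consistency` can be illustrated with a minimal,
standard-library-only sketch. This is not the library's implementation:
``check_sidecar_sketch`` and ``run_demo`` are hypothetical names, the shard
directories and column names are made up, and the sketch assumes the sidecar is a
plain JSON list of column names, which the documentation above does not confirm.

.. code-block:: python

   import json
   import tempfile
   from pathlib import Path


   def check_sidecar_sketch(shard_root, json_name, accumulated, col_key):
       """Hypothetical stand-in for the documented consistency check."""
       # Assumption: the sidecar holds a plain JSON list of column names.
       columns = json.loads((Path(shard_root) / json_name).read_text())
       if col_key not in accumulated:
           # First shard seen for this key: remember its column list.
           accumulated[col_key] = columns
       elif accumulated[col_key] != columns:
           # Later shard disagrees with the first one: refuse to merge.
           raise ValueError(
               f"{shard_root}: feature columns for {col_key!r} "
               "differ from the first shard"
           )


   def run_demo():
       """Check three made-up shards; the third has a different column set."""
       json_name = "sensor_epoch_feature_columns.json"
       shards = {
           "shard_a": ["alpha_power", "theta_power"],
           "shard_b": ["alpha_power", "theta_power"],
           "shard_c": ["alpha_power"],
       }
       accumulated = {}
       rejected = []
       with tempfile.TemporaryDirectory() as tmp:
           for name, cols in shards.items():
               root = Path(tmp) / name
               root.mkdir()
               (root / json_name).write_text(json.dumps(cols))
           for name in shards:
               try:
                   check_sidecar_sketch(
                       Path(tmp) / name, json_name, accumulated, "sensor_epoch"
                   )
               except ValueError:
                   rejected.append(name)
       return accumulated, rejected


   if __name__ == "__main__":
       print(run_demo())

Because *accumulated* is keyed by *col_key*, one dictionary can be shared across
several table kinds, and each kind is compared only against its own first shard.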