coco_pipe.io.load

High-level data loading factory.

Author: Hamza Abdelhedi <hamza.abdelhedi@umontreal.ca>

Functions

load_data([path, mode, target_col, index_col, sep, ...])

Universal data loader factory.

Module Contents

coco_pipe.io.load.load_data(path=None, mode='auto', target_col=None, index_col=None, sep='\t', header=0, sheet_name=0, columns_to_dims=None, col_sep='_', meta_columns=None, clean=False, clean_kwargs=None, task=None, session=None, runs=None, datatype='eeg', suffix=None, loading_mode='epochs', window_length=None, stride=None, event_id=None, tmin=-0.2, tmax=0.5, baseline=None, drop_short_epochs=True, units=None, subject_metadata_df=None, subject_key=None, pattern='*.pkl', dims=('obs', 'feature'), coords=None, run=None, processing=None, reader=None, id_fn=None, subjects=None, config=None, **kwargs)

Universal data loader factory. Dispatches to BIDSDataset, TabularDataset, or EmbeddingDataset based on mode.

Parameters:
  • path (str or Path, optional) – Path to data source (file or directory). Required unless config is given (in which case config.path is used).

  • mode ({"auto", "tabular", "bids", "embedding"}, default="auto") –

    Type of data to load.

    • "auto": Infers the type from the file extension or directory structure. A directory with dataset_description.json or sub-* entries is treated as "bids"; .csv/.tsv/.xls/.xlsx/.txt files as "tabular"; everything else as "embedding".

    • "tabular": Uses TabularDataset (CSV, TSV, Excel, TXT).

    • "bids": Uses BIDSDataset (BIDS-compliant directories).

    • "embedding": Uses EmbeddingDataset (NPY, PKL, H5, JSON).

  • config (DatasetConfig or {Tabular,BIDS,Embedding}Config, optional) – A pre-validated configuration object (see coco_pipe.io.config). When provided, its fields drive the load and mode is taken from the config; the matching keyword arguments below are ignored. When omitted, the relevant keyword arguments are assembled into a config and validated before dispatch. The non-serializable parameters reader, id_fn, subject_metadata_df, and subject_key are always passed through directly and are never part of the config schema.

  Tabular Arguments (mode="tabular")

  • target_col (str, optional) – Name of the column to extract as target y. Removed from features X.

  • index_col (str or int, optional) – Column to use as index (observation IDs).

  • sep (str, default='\t') – Separator for text files (e.g. ',' for CSV).

  • header (int or list of int, default=0) – Row number(s) to use as column names.

  • sheet_name (str or int, default=0) – Sheet name or index for Excel files.

  • columns_to_dims (list of str, optional) – If provided, attempts to reshape 2D feature columns into N-D dimensions. Columns must follow: dim1_dim2_…_feature.

  • col_sep (str, default='_') – Separator used in column names for reshaping.

  • meta_columns (list of str, optional) – Columns to extract as metadata coordinates instead of features.

  • clean (bool, default=False) – Whether to perform automated cleaning (drop NaNs/Infs).

  • clean_kwargs (dict, optional) – Arguments passed to TabularDataset.clean.

  BIDS Arguments (mode="bids")

  • task (str, optional) – BIDS task name (e.g., ‘rest’, ‘audiovisual’).

  • session (str or List[str], optional) – Session ID(s) to load. Defaults to all available.

  • datatype (str, default='eeg') – Data type folder (e.g., ‘eeg’, ‘meg’, ‘ieeg’).

  • suffix (str, optional) – File suffix to load (e.g., ‘eeg’, ‘epo’, ‘ave’).

  • loading_mode (str, default='epochs') –

    How to process the data. Renamed to loading_mode here (and in BIDSConfig) to avoid colliding with this function’s mode argument; it is passed through as mode to BIDSDataset.

    • 'epochs': Slices continuous data into fixed-length windows.

    • 'continuous': Loads the data as single continuous segments.

    • 'load_existing': Loads pre-computed epochs.

  • window_length (float, optional) – Window length in seconds (for ‘epochs’ mode).

  • stride (float, optional) – Stride in seconds (for ‘epochs’ mode).

  • subject_metadata_df (DataFrame, optional) – External subject-level metadata to merge by subject during BIDS loading.

  • subject_key (str, optional) – Column in subject_metadata_df containing the BIDS subject identifier.

  • subjects (int or list, optional) – Specific subject IDs to load (without the 'sub-' prefix).

  Embedding Arguments (mode="embedding")

  • pattern (str, default='*.pkl') – Glob pattern to match files.

  • dims (tuple of str, default=('obs', 'feature')) – Dimension labels for the data arrays.

  • coords (dict, optional) – Dictionary of coordinates for dimensions.

  • reader (callable, optional) – Custom file reader function.

  • id_fn (callable, optional) – Custom subject ID extraction function.

  • subjects – If int, loads first N subjects. If list, filters by ID.

  • runs (str | list[str] | None)

  • event_id (dict[str, int] | str | list[str] | None)

  • tmin (float)

  • tmax (float)

  • baseline (tuple[float | None, float | None] | None)

  • drop_short_epochs (bool)

  • units (str | None)

  • run (str | None)

  • processing (str | None)
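
The column naming convention required by columns_to_dims (dim1_dim2_…_feature, joined by col_sep) can be illustrated with a small parser. This is only a sketch of the convention, not the library's reshaping code: split_column_name is a hypothetical helper, and letting the trailing feature name contain the separator is an assumption.

```python
def split_column_name(name, dims, col_sep="_"):
    """Hypothetical helper: map 'dim1_dim2_..._feature' to ({dim: label}, feature)."""
    # Split at most len(dims) times so that everything after the last
    # dimension label is kept together as the feature name (assumption).
    parts = name.split(col_sep, len(dims))
    if len(parts) != len(dims) + 1:
        raise ValueError(f"{name!r} does not match {col_sep.join(dims)}{col_sep}feature")
    return dict(zip(dims, parts[:-1])), parts[-1]


labels, feature = split_column_name("Fz_alpha_power", ["channel", "band"])
print(labels, feature)  # {'channel': 'Fz', 'band': 'alpha'} power
```

Under this reading, a column named "Fz_alpha_power" with columns_to_dims=["channel", "band"] contributes the value of feature "power" at channel "Fz" and band "alpha".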

Returns:

Standardized data container with attributes:

  • X: (N_obs, …) data array

  • y: Targets (if available)

  • ids: Observation identifiers

  • coords: Coordinate metadata

Return type:

DataContainer

Examples

Two equivalent ways to load. The keyword form is convenient for quick, interactive use:

>>> container = load_data("features.csv", mode="tabular", target_col="y")

The config-first form is recommended for pipelines and reproducible runs: a TabularConfig / BIDSConfig / EmbeddingConfig is validated once and can be serialized, version-controlled, and reused. It also keeps each mode’s options self-contained instead of mixing all three modes’ keywords:

>>> from coco_pipe.io.config import TabularConfig
>>> cfg = TabularConfig(path="features.csv", target_col="y")
>>> container = load_data(config=cfg)
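
When mode is left as "auto", the type is inferred from the path. The rules listed under the mode parameter can be sketched roughly as follows. infer_mode is a hypothetical stand-in, not the function load_data actually calls, and the case-insensitive extension match is an assumption:

```python
from pathlib import Path

# Extensions the mode="auto" description lists as tabular.
TABULAR_EXTENSIONS = {".csv", ".tsv", ".xls", ".xlsx", ".txt"}


def infer_mode(path):
    """Hypothetical helper: guess the load mode from a path, per the documented rules."""
    p = Path(path)
    if p.is_dir():
        # A directory with dataset_description.json or sub-* entries counts as BIDS.
        if (p / "dataset_description.json").exists() or any(p.glob("sub-*")):
            return "bids"
        return "embedding"
    # Assumption: extensions are compared case-insensitively.
    if p.suffix.lower() in TABULAR_EXTENSIONS:
        return "tabular"
    # Everything else falls through to the embedding loader.
    return "embedding"


print(infer_mode("features.csv"))    # tabular
print(infer_mode("embeddings.pkl"))  # embedding
```

Passing mode explicitly avoids any surprise from this fallback, for example when a directory of CSV files would otherwise be routed to "embedding".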

BIDS loading uses loading_mode (not mode) to choose the windowing strategy:

>>> container = load_data(
...     "/data/bids",
...     mode="bids",
...     task="rest",
...     loading_mode="epochs",
...     window_length=2.0,
... )
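
To make the roles of window_length and stride concrete, the sketch below computes fixed-length window boundaries in samples. It only illustrates the idea of windowing and is not how BIDSDataset is implemented: window_bounds, the sfreq argument, the non-overlapping default when stride is omitted, and the dropping of a trailing partial window are all assumptions.

```python
def window_bounds(n_samples, sfreq, window_length, stride=None):
    """Hypothetical helper: (start, stop) sample indices of fixed-length windows."""
    win = int(round(window_length * sfreq))
    # Assumption: a missing stride means non-overlapping windows.
    step = win if stride is None else int(round(stride * sfreq))
    if win <= 0 or step <= 0:
        raise ValueError("window_length and stride must be positive")
    # Only full-length windows are kept; a trailing partial window is dropped.
    return [(start, start + win) for start in range(0, n_samples - win + 1, step)]


# 10 s of data at 100 Hz, 2 s windows, 1 s stride.
bounds = window_bounds(n_samples=1000, sfreq=100.0, window_length=2.0, stride=1.0)
print(len(bounds), bounds[0], bounds[-1])  # 9 (0, 200) (800, 1000)
```

A stride shorter than window_length yields overlapping windows; a stride equal to window_length tiles the recording without overlap.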