coco_pipe.io.load_data

coco_pipe.io.load_data(path=None, mode='auto', target_col=None, index_col=None, sep='\t', header=0, sheet_name=0, columns_to_dims=None, col_sep='_', meta_columns=None, clean=False, clean_kwargs=None, task=None, session=None, runs=None, datatype='eeg', suffix=None, loading_mode='epochs', window_length=None, stride=None, event_id=None, tmin=-0.2, tmax=0.5, baseline=None, drop_short_epochs=True, units=None, subject_metadata_df=None, subject_key=None, pattern='*.pkl', dims=('obs', 'feature'), coords=None, run=None, processing=None, reader=None, id_fn=None, subjects=None, config=None, **kwargs)

Universal data loader factory. Dispatches to BIDSDataset, TabularDataset, or EmbeddingDataset based on mode.

Parameters:

path (str or Path, optional) – Path to data source (file or directory). Required unless config is given (in which case config.path is used).
mode ({"auto", "tabular", "bids", "embedding"}, default="auto") –
Type of data to load.
- "auto": Infers the type from the file extension or directory structure. A directory with dataset_description.json or sub-* entries is treated as "bids"; .csv/.tsv/.xls/.xlsx/.txt files as "tabular"; everything else as "embedding".
- "tabular": Uses TabularDataset (CSV, TSV, Excel, TXT).
- "bids": Uses BIDSDataset (BIDS-compliant directories).
- "embedding": Uses EmbeddingDataset (NPY, PKL, H5, JSON).
config (DatasetConfig or {Tabular,BIDS,Embedding}Config, optional) – A pre-validated configuration object (see coco_pipe.io.config). When provided, its fields drive the load and mode is taken from the config; the matching keyword arguments below are ignored. When omitted, the relevant keyword arguments are assembled into a config and validated before dispatch. The non-serializable parameters reader, id_fn, subject_metadata_df, and subject_key are always passed through directly and are never part of the config schema.
Tabular Arguments (mode="tabular")
----------------------------------
target_col (str, optional) – Name of the column to extract as target y. Removed from features X.
index_col (str or int, optional) – Column to use as index (observation IDs).
sep (str, default='\t') – Separator for text files (e.g. ',' for CSV).
header (int or list of int, default=0) – Row number(s) to use as column names.
sheet_name (str or int, default=0) – Sheet name or index for Excel files.
columns_to_dims (list of str, optional) – If provided, attempts to reshape 2D feature columns into N-D dimensions. Columns must follow: dim1_dim2_…_feature.
col_sep (str, default='_') – Separator used in column names for reshaping.
meta_columns (list of str, optional) – Columns to extract as metadata coordinates instead of features.
clean (bool, default=False) – Whether to perform automated cleaning (drop NaNs/Infs).
clean_kwargs (dict, optional) – Arguments passed to TabularDataset.clean.
BIDS Arguments (mode="bids")
----------------------------
task (str, optional) – BIDS task name (e.g., ‘rest’, ‘audiovisual’).
session (str or List[str], optional) – Session ID(s) to load. Defaults to all available.
datatype (str, default='eeg') – Data type folder (e.g., ‘eeg’, ‘meg’, ‘ieeg’).
suffix (str, optional) – File suffix to load (e.g., ‘eeg’, ‘epo’, ‘ave’).
loading_mode (str, default='epochs') –
How to process the data. Renamed to loading_mode here (and in BIDSConfig) to avoid colliding with this function’s mode argument; it is passed through as mode to BIDSDataset.
- 'epochs': Slices continuous data into fixed-length windows.
- 'continuous': Loads each recording as a single continuous segment.
- 'load_existing': Loads pre-computed epochs.
window_length (float, optional) – Window length in seconds (for ‘epochs’ mode).
stride (float, optional) – Stride in seconds (for ‘epochs’ mode).
subject_metadata_df (DataFrame, optional) – External subject-level metadata to merge by subject during BIDS loading.
subject_key (str, optional) – Column in subject_metadata_df containing the BIDS subject identifier.
subjects (int or list, optional) – Specific subject IDs to load (without the 'sub-' prefix).
Embedding Arguments (mode="embedding")
--------------------------------------
pattern (str, default='*.pkl') – Glob pattern to match files.
dims (tuple of str, default=('obs', 'feature')) – Dimension labels for the data arrays.
coords (dict, optional) – Dictionary of coordinates for dimensions.
reader (callable, optional) – Custom file reader function.
id_fn (callable, optional) – Custom subject ID extraction function.
subjects (int or list, optional) – If an int, loads the first N subjects. If a list, keeps only the listed subject IDs.
runs (str | list[str] | None)
event_id (dict[str, int] | str | list[str] | None)
tmin (float)
tmax (float)
baseline (tuple[float | None, float | None] | None)
drop_short_epochs (bool)
units (str | None)
run (str | None)
processing (str | None)
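
The "auto" rule described under mode can be pictured with a small standalone sketch. This is not the library's implementation: infer_mode is a hypothetical helper, and lower-casing the file extension is an assumption made here for illustration. It only mirrors the documented rule: BIDS markers in a directory give "bids", a tabular extension gives "tabular", and everything else gives "embedding".

```python
from pathlib import Path

# Extensions the documentation lists for the tabular loader.
TABULAR_EXTENSIONS = {".csv", ".tsv", ".xls", ".xlsx", ".txt"}


def infer_mode(path):
    """Hypothetical helper mirroring the documented "auto" rule (not part of coco_pipe)."""
    p = Path(path)
    if p.is_dir():
        # A BIDS root has dataset_description.json or sub-* entries.
        has_description = (p / "dataset_description.json").exists()
        has_subjects = any(p.glob("sub-*"))
        if has_description or has_subjects:
            return "bids"
    elif p.suffix.lower() in TABULAR_EXTENSIONS:  # lower-casing is an assumption
        return "tabular"
    # Everything else falls through to the embedding loader.
    return "embedding"


print(infer_mode("features.csv"))    # tabular
print(infer_mode("embeddings.npy"))  # embedding
```

Passing mode explicitly skips this kind of inference, which is the safer choice when a directory or file name is ambiguous.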

Returns:

Standardized data container with attributes:
- X: (N_obs, …) data array
- y: Targets (if available)
- ids: Observation identifiers
- coords: Coordinate metadata

Return type:

DataContainer

Examples

Two equivalent ways to load. The keyword form is convenient for quick, interactive use:

>>> from coco_pipe.io import load_data
>>> container = load_data("features.csv", mode="tabular", target_col="y")

The config-first form is recommended for pipelines and reproducible runs: a TabularConfig / BIDSConfig / EmbeddingConfig is validated once and can be serialized, version-controlled, and reused. It also keeps each mode’s options self-contained instead of mixing all three modes’ keywords:

>>> from coco_pipe.io.config import TabularConfig
>>> cfg = TabularConfig(path="features.csv", target_col="y")
>>> container = load_data(config=cfg)
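
For tabular data, the columns_to_dims and col_sep options describe how flat column names map onto N-D dimensions. The following is a minimal sketch of that idea, not the library's code: parse_columns is a hypothetical helper, the column and dimension names are made up, and it assumes every token in a column name corresponds to one entry of columns_to_dims. How TabularDataset treats the trailing feature token is not specified in this section.

```python
def parse_columns(columns, dim_names, col_sep="_"):
    """Hypothetical helper: map flat column names to N-D coordinates."""
    coords = {name: [] for name in dim_names}
    index = {}
    for col in columns:
        parts = col.split(col_sep)
        if len(parts) != len(dim_names):
            raise ValueError(f"{col!r} does not split into {len(dim_names)} parts")
        position = []
        for name, label in zip(dim_names, parts):
            if label not in coords[name]:
                coords[name].append(label)  # keep first-seen order
            position.append(coords[name].index(label))
        index[col] = tuple(position)
    # One axis per dimension name, sized by its number of distinct labels.
    shape = tuple(len(coords[name]) for name in dim_names)
    return coords, shape, index


# Illustrative column names: channel first, frequency band second.
columns = ["Fz_alpha", "Fz_beta", "Cz_alpha", "Cz_beta"]
coords, shape, index = parse_columns(columns, ["channel", "band"])
print(shape)             # (2, 2)
print(index["Cz_beta"])  # (1, 1)
```

Each observation's row of four values would then fill a 2 x 2 array at the positions given by index.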

BIDS loading uses loading_mode (not mode) to choose the windowing strategy:

>>> container = load_data(
...     "/data/bids",
...     mode="bids",
...     task="rest",
...     loading_mode="epochs",
...     window_length=2.0,
... )
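
To make window_length and stride concrete, here is a rough sketch of fixed-length windowing over a continuous recording. It is an illustration under stated assumptions rather than what BIDSDataset does internally: window_bounds and the sfreq argument are hypothetical, a missing stride is assumed to mean non-overlapping windows, and drop_short is assumed to behave like drop_short_epochs by discarding a final window shorter than window_length.

```python
def window_bounds(n_samples, sfreq, window_length, stride=None, drop_short=True):
    """Hypothetical helper: (start, stop) sample indices of fixed-length windows."""
    win = int(round(window_length * sfreq))
    # Assumption: a missing stride means non-overlapping windows.
    step = win if stride is None else int(round(stride * sfreq))
    if win <= 0 or step <= 0:
        raise ValueError("window_length and stride must be positive")
    bounds = []
    for start in range(0, n_samples, step):
        stop = min(start + win, n_samples)
        if drop_short and stop - start < win:
            break  # every later window would be short as well
        bounds.append((start, stop))
    return bounds


# 10 s of data at 100 Hz, 2 s windows, 1 s stride.
bounds = window_bounds(n_samples=1000, sfreq=100.0, window_length=2.0, stride=1.0)
print(len(bounds))  # 9
print(bounds[0])    # (0, 200)
print(bounds[-1])   # (800, 1000)
```

With a stride shorter than the window, consecutive windows overlap; with the stride equal to the window length, they tile the recording without overlap.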