coco_pipe.io.DataContainer#
- class coco_pipe.io.DataContainer(X, dims, coords=<factory>, y=None, ids=None, meta=<factory>)#
Bases: object
Generic container for N-dimensional neurophysiological data.
Acts as a lightweight labelled array (like xarray but simpler), managing dimensions, coordinates, and associated target labels (y) and IDs.
- Variables:
X (np.ndarray) – The primary data tensor. Shape must match dims.
dims (Tuple[str, ...]) – Labels for each dimension of X. Examples: (‘obs’, ‘feature’), (‘obs’, ‘channel’, ‘time’). Note: The ‘obs’ dimension is special and typically represents independent samples.
coords (Dict[str, Union[List, np.ndarray]]) – Coordinates/Labels for dimensions. Keys must be in dims. Values must match the length of the corresponding dimension in X.
y (Optional[np.ndarray], optional) – Target labels corresponding to the ‘obs’ dimension. Used for supervised learning or coloring plots.
ids (Optional[np.ndarray], optional) – Identifiers for observations (e.g., subject IDs, trial names). Should correspond to ‘obs’ dim in coords if provided. Kept separate from coords for convenient tracking.
meta (Dict[str, Any]) – Arbitrary metadata (sfreq, units, source path, etc).
- Parameters:
Examples
Accessing data:
>>> container.X.shape
(10, 64, 500)
Accessing coordinates:
>>> container.coords["channel"][:3]
['Fz', 'Cz', 'Pz']
- save(path)#
Save the DataContainer to disk using joblib.
- Parameters:
path (str or Path) – Destination file path.
- Return type:
None
- observation_frame()#
Return observation-aligned coordinates and stable sample IDs.
- Return type:
DataFrame
- classmethod load(path)#
Load a DataContainer from disk.
- Parameters:
path (str or Path) – Source file path.
- Return type:
DataContainer
- classmethod concat(containers, fill_condition_from_meta=True)#
Concatenate containers along their observation axis.
All containers must use matching dimensions and non-observation shapes. Observation-aligned coordinates are concatenated, filling missing entries with None when a coordinate is absent from a container. Non-observation coordinates are copied from the first container.
- Parameters:
containers (sequence of DataContainer) – One or more containers to concatenate.
fill_condition_from_meta (bool, default=True) – If no observation-level condition coordinate is available, create one from each container’s meta["condition"] value.
- Returns:
The concatenated container.
- Return type:
DataContainer
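The concatenation rules above can be illustrated with plain NumPy. This is a minimal sketch under stated assumptions, not coco_pipe's implementation: containers are stood in for by plain dicts, the observation axis is assumed to be axis 0, and the variable names are hypothetical.

```python
import numpy as np

# Two stand-in "containers" (plain dicts, purely illustrative).
a = {"X": np.zeros((2, 3)), "coords": {"subject": ["s1", "s2"]},
     "meta": {"condition": "rest"}}
b = {"X": np.ones((1, 3)), "coords": {}, "meta": {"condition": "task"}}
parts = [a, b]

# Data is joined along the observation axis (assumed to be axis 0 here).
X = np.concatenate([p["X"] for p in parts], axis=0)

# Observation-aligned coords are joined; a container that lacks a coord
# contributes one None per observation.
keys = sorted({k for p in parts for k in p["coords"]})
coords = {
    k: [v for p in parts for v in p["coords"].get(k, [None] * len(p["X"]))]
    for k in keys
}

# With no observation-level "condition" coord, one is built from meta.
condition = [p["meta"]["condition"] for p in parts for _ in range(len(p["X"]))]
```

Here `b` has no `subject` coordinate, so its single observation receives `None`.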
- obs_table(include_ids=False, id_col='obs_id', include_y=False, y_col='y', include_obs_coord=False)#
Return one-dimensional coordinates aligned to the observation axis.
This helper is useful when exporting a row-wise table from a container. It only materializes metadata that can map cleanly to one row per observation, skipping coordinates that belong to other axes such as channel, time, feature, or stat.
- Parameters:
include_ids (bool, default=False) – If True, include self.ids as the first column.
id_col (str, default="obs_id") – Column name used when exporting self.ids.
include_y (bool, default=False) – If True, include self.y as a column when present.
y_col (str, default="y") – Column name used when exporting self.y.
include_obs_coord (bool, default=False) – If True, include coords["obs"] when present.
- Returns:
DataFrame containing only one-dimensional observation-aligned metadata columns.
- Return type:
DataFrame
- Raises:
ValueError – If the container has no obs dimension, or if include_ids is requested when self.ids is missing.
- isel(**indexers)#
Select data by integer indices on specified dimensions.
This method is the integer-index equivalent of select. It operates directly on the dimensions of the data tensor X. It is robust and handles metadata splitting/alignment automatically.
- Parameters:
**indexers (dict) –
Dimension names mapped to the desired index. The index can be:
List or numpy array of integers: [0, 1, 5]
Slice object: slice(0, 10)
Single integer: 0
Note: If you provide a list of indices with repeats (e.g., [0, 0, 1]), the output will be oversampled accordingly.
- Returns:
A new DataContainer instance with the sliced data and coordinates.
- Return type:
DataContainer
Examples
>>> # Select first 10 observations
>>> subset = container.isel(obs=slice(0, 10))
>>> # Select specific channels by index
>>> subset = container.isel(channel=[0, 5, 12])
>>> # Select time range by index
>>> subset = container.isel(time=slice(100, 200))
>>> # Bootstrap/Resample (Select index 0 five times)
>>> bootstrap = container.isel(obs=[0, 0, 0, 0, 0])
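The indexing behaviour described above, including oversampling through repeated indices, can be sketched with plain NumPy. This is only an illustration of the idea: `take_along` is a hypothetical helper, not part of coco_pipe, and it always keeps the indexed axis, which may differ from how the library treats a single integer.

```python
import numpy as np

# Toy tensor laid out as (obs, channel, time); sizes are illustrative.
dims = ("obs", "channel", "time")
X = np.arange(4 * 3 * 5).reshape(4, 3, 5)
channels = np.array(["Fz", "Cz", "Pz"])


def take_along(X, dims, dim, index):
    """Index one named axis with an int, a list of ints, or a slice."""
    axis = dims.index(dim)
    if isinstance(index, slice):
        # Expand the slice into explicit positions along this axis.
        index = np.arange(X.shape[axis])[index]
    return np.take(X, np.atleast_1d(index), axis=axis)


subset = take_along(X, dims, "channel", [0, 2])
subset_names = channels[[0, 2]]  # the coordinate is indexed the same way
oversampled = take_along(X, dims, "obs", [0, 0, 1])  # repeats duplicate rows
windowed = take_along(X, dims, "time", slice(1, 4))
```

Indexing the coordinate array with the same indices as the data is what keeps labels aligned with the selected slices.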
- balance(target='y', strategy='undersample', covariates=None, random_state=42, **kwargs)#
Balance the dataset classes using undersampling or oversampling.
This method adjusts the number of observations (rows) in the container so that class counts in target are equalized. It supports simple random sampling and stratified sampling based on covariates.
- Parameters:
target (str or array-like) –
The target vector to balance against:
’y’: Uses self.y.
Any other string: Looks for the variable in self.coords.
Array-like: Direct labels to use.
strategy (str, default='undersample') –
Balancing strategy:
’undersample’: Downsample majority classes to match the minority class frequency. Uses self.ids to ensure repeatability.
’oversample’: Upsample minority classes (with replacement) to match the majority frequency.
’auto’: Heuristic choice. Uses undersampling if total size remains > 20% of original, else oversampling.
covariates (list of str, optional) – List of covariate names in self.coords to preserve distribution of. If provided, the balancing is performed within strata defined by these covariates.
random_state (int, default=42) – Seed for the random number generator. Change this value to produce different random subsets (e.g., for bagging).
**kwargs (dict) –
Additional arguments passed to internal logic:
n_bins (int): Number of bins for continuous covariates (default 5).
binning (str): ‘quantile’ (default) or ‘uniform’ binning.
prefer_clean_rows (bool): If True, weighs sampling to prefer rows with fewer NaNs/artifacts.
- Returns:
A new DataContainer instance with balanced classes.
- Return type:
DataContainer
Examples
>>> # 1. Simple Undersampling of 'y'
>>> balanced = container.balance(strategy="undersample")
>>> # 2. Balance based on a metadata column 'condition'
>>> balanced = container.balance(target="condition")
>>> # 3. Stratified Balancing (Balance 'y' while preserving 'sex' and 'age'
>>> # ratios)
>>> balanced = container.balance(target="y", covariates=["sex", "age"])
>>> # 4. Iterative Bootstrapping (Different seeds)
>>> for seed in [1, 2, 3]:
...     subset = container.balance(strategy="undersample", random_state=seed)
...     # process subset...
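For intuition, simple random undersampling can be sketched as follows. This is a minimal sketch, not the library's algorithm: it ignores covariates, binning, and `prefer_clean_rows`, and `undersample_indices` is a hypothetical helper name.

```python
import numpy as np


def undersample_indices(labels, random_state=42):
    """Return sorted row indices with every class cut to the minority count."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    keep = [
        # Draw n_keep rows of this class without replacement.
        rng.choice(np.flatnonzero(labels == cls), size=n_keep, replace=False)
        for cls in classes
    ]
    return np.sort(np.concatenate(keep))


y = np.array(["ctrl"] * 6 + ["patient"] * 3)
idx = undersample_indices(y, random_state=0)
balanced_y = y[idx]
```

The resulting integer indices have the form that `isel(obs=...)` accepts, and a fixed seed reproduces the same subset.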
- select(ignore_case=False, fuzzy=False, **selections)#
Select data subsets based on coordinates, ids, or y.
This method supports exact matching, wildcard matching, operator-based filtering, and custom callable filters.
- Parameters:
ignore_case (bool, default=False) – If True, string matching is case-insensitive (e.g., ‘fz’ matches ‘Fz’).
fuzzy (bool, default=False) – If True, uses difflib to find closest matches for string queries (e.g., ‘Alpha’ matches ‘alpha’). Useful for handling typos.
**selections (dict) –
Key is the dimension name (or special keys ‘y’, ‘ids’). Value is the query. Supported query types:
List/Array (Exact or Wildcard): Matches values present in the list. Strings can use shell-style wildcards (‘*’, ‘?’).
Dictionary (Operator Queries): Filters numerical or string values using operators. Keys: ‘>’, ‘<’, ‘>=’, ‘<=’, ‘==’, ‘!=’, ‘in’.
Callable: A function taking the coordinate array and returning a boolean mask.
- Returns:
A new DataContainer instance containing the selected subset.
- Return type:
DataContainer
Examples
>>> # 1. Exact Selection (Sensors)
>>> sub = container.select(channel=["Fz", "Cz"])
>>> # 2. Wildcard Selection (All Alpha features)
>>> sub = container.select(feature="*alpha*")
>>> # 3. Range Selection (Time)
>>> sub = container.select(time={">=": 0.1, "<": 0.5})
>>> # 4. Case-Insensitive Matching
>>> sub = container.select(channel=["fz"], ignore_case=True)
>>> # 5. Filter by Target (y)
>>> sub = container.select(y=["Patient"])
>>> # 6. Complex Logic (Subjects 1-5 via Operator)
>>> sub = container.select(subject_id={">=": 1, "<=": 5})
>>> # 7. Stratified Selection (First 2 epochs per subject via Callable)
>>> def first_n(ids, n=2):
...     # ... logic ...
...     return mask
>>> sub = container.select(ids=first_n)
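The query types above all reduce to a boolean mask over a coordinate array. The sketch below shows one way such masks could be built with the standard library and NumPy. It assumes that several operators in one dictionary are combined with logical AND, it omits the 'in' operator, and the helper names are hypothetical rather than coco_pipe functions.

```python
import fnmatch
import operator

import numpy as np

# Comparison operators keyed by their query token.
OPS = {
    ">": operator.gt, "<": operator.lt, ">=": operator.ge,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}


def wildcard_mask(values, patterns, ignore_case=False):
    """True where a value matches any shell-style pattern."""
    values = [str(v) for v in values]
    if ignore_case:
        values = [v.lower() for v in values]
        patterns = [p.lower() for p in patterns]
    return np.array(
        [any(fnmatch.fnmatchcase(v, p) for p in patterns) for v in values],
        dtype=bool,
    )


def operator_mask(values, query):
    """True where a value satisfies every operator in the query (assumed AND)."""
    values = np.asarray(values)
    mask = np.ones(values.shape, dtype=bool)
    for op, bound in query.items():
        mask &= OPS[op](values, bound)
    return mask


features = ["Fz_alpha", "Cz_alpha", "Fz_beta"]
times = np.array([0.0, 0.1, 0.3, 0.5])
alpha_mask = wildcard_mask(features, ["*alpha*"])
time_mask = operator_mask(times, {">=": 0.1, "<": 0.5})
```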
- flatten(preserve='obs', sep='_')#
Flatten dimensions NOT in preserve into a single ‘feature’ dimension.
This is useful for preparing N-dimensional data for standard 2D machine learning algorithms (scikit-learn). It automatically generates composite feature names (e.g., ‘Fz_0.1s’) for tracking.
- Parameters:
preserve (str or list of str, default="obs") –
Dimensions to preserve. All other dimensions are flattened into a new dimension ‘feature’.
’obs’: Result shape (N_obs, N_features). Standard specification.
[‘obs’, ‘time’]: Result shape (N_obs, N_time, N_features). Useful for time-resolved decoding distributions.
sep (str, default="_") – Separator used when generating composite feature names.
- Returns:
A new DataContainer with reshaped X and generated ‘feature’ coordinates.
- Return type:
DataContainer
Examples
>>> # Flatten (10, 64, 500) -> (10, 32000)
>>> flat = container.flatten(preserve="obs")
>>> flat.shape
(10, 32000)
>>> flat.coords["feature"][0]
'Fz_0.0'
>>> # Flatten spatial only, keep time (10, 64, 500) -> (10, 500, 64)
>>> time_resolved = container.flatten(preserve=["obs", "time"])
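The reshaping and composite naming can be sketched with NumPy. This is an illustration only: it assumes row-major flattening and that names join the channel and time labels in that order, and the library's exact label formatting may differ.

```python
import itertools

import numpy as np

X = np.random.default_rng(0).normal(size=(10, 3, 4))  # (obs, channel, time)
channels = ["Fz", "Cz", "Pz"]
times = [0.0, 0.1, 0.2, 0.3]
sep = "_"

# preserve="obs": merge channel and time into one trailing axis.
flat = X.reshape(X.shape[0], -1)

# Row-major reshape makes time vary fastest, matching itertools.product.
names = [f"{ch}{sep}{t}" for ch, t in itertools.product(channels, times)]

# preserve=["obs", "time"]: keep time next to obs, channels become features.
time_resolved = np.moveaxis(X, 2, 1)
```

Generating the names in the same order as the reshape is what lets each flattened column be traced back to its channel and time point.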
- feature_schema()#
Return feature-axis metadata, or None when no feature coord exists.
Only coordinates aligned to the feature axis are included.
feature_* metadata is mapped to canonical schema names such as family and measure; primary dimension coords folded by flatten() are used in the feature labels and are not recovered as structured metadata.
- Return type:
DataFrame | None
- stack(dims, new_dim='obs')#
Stack multiple dimensions into a single new dimension.
This reshapes N-dimensional data into (N-K) dimensions by combining specified dimensions. It is useful for transforming spatiotemporal data (Trials, Channels, Time) -> (Trials*Time, Channels) for trajectory analysis.
- Parameters:
- Returns:
New container with stacked dimension. Metadata (coords/ids) are expanded/tiled to match the new shape.
- Return type:
DataContainer
Examples
>>> # Stack time into observations:
>>> # (10 obs, 64 ch, 500 time) -> (5000 obs, 64 ch)
>>> stacked = container.stack(dims=("obs", "time"), new_dim="obs")
>>> stacked.shape
(5000, 64)
- unstack(dim)#
Unstack a dimension into multiple dimensions.
Inverse operation of stack. Reshapes the data tensor by splitting one dimension into multiple using metadata stored during the stack operation.
- Parameters:
dim (str) – Dimension to unstack (e.g. ‘obs’).
- Returns:
New container with unstacked dimensions.
- Return type:
DataContainer
- Raises:
ValueError – If the container was not previously stacked (missing metadata).
Examples
>>> # Stack 'trials' and 'time' -> 'obs'
>>> stacked = container.stack(("trials", "time"), new_dim="obs")
>>> # Unstack 'obs' -> ('trials', 'time') (automatically inferred)
>>> unstacked = stacked.unstack("obs")
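A stack followed by an unstack amounts to a transpose plus a reshape and its inverse. The NumPy sketch below illustrates the round trip on a toy array. It is an assumption about the general technique, not a description of how coco_pipe stores its stacking metadata.

```python
import numpy as np

X = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # (obs, channel, time)
trial_ids = np.array(["t0", "t1"])
n_obs, n_ch, n_time = X.shape

# Stack: bring the axes to merge next to each other, then merge them.
moved = np.transpose(X, (0, 2, 1))            # (obs, time, channel)
stacked = moved.reshape(n_obs * n_time, n_ch)  # (obs*time, channel)

# Per-observation metadata is repeated so each new row keeps its trial label.
stacked_ids = np.repeat(trial_ids, n_time)

# Unstack: the original sizes are needed to split the merged axis again.
restored = stacked.reshape(n_obs, n_time, n_ch).transpose(0, 2, 1)
```

The inverse needs the original axis sizes, which is consistent with `unstack` raising an error when the stacking metadata is missing.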
- with_features(X, names=None, feature_dim=None, new_dim_name='component')#
Return a new container with the feature axis replaced.
Typical use: re-attach reduced-dimensionality scores (e.g. PCA components) to a container, so downstream operations (unstack, aggregate, plotting) keep working with proper coordinates.
- Parameters:
X (np.ndarray) – New data array. The leading axes must match the container’s non-feature axes; the trailing axis is the new feature axis.
names (sequence of str, optional) – Coordinate labels for the new feature axis. When None, integer indices are used. Must have length X.shape[-1].
feature_dim (str, optional) – Name of the dimension being replaced. Defaults to the last dimension of the container.
new_dim_name (str, default='component') – Dimension name to assign to the replaced axis when feature_dim is not present in self.dims (e.g., when replacing channel with component after PCA).
- Returns:
New container with X replaced and the feature-axis coord updated. All other dims, coords, y, ids, and meta are preserved.
- Return type:
DataContainer
- Raises:
ValueError – If X’s leading shape doesn’t match the container, or if names has the wrong length.
Examples
>>> # After fitting PCA on stacked data:
>>> scores = reducer.fit_transform(c_stacked.X)  # (n_obs, 3)
>>> c_pc = c_stacked.with_features(
...     scores,
...     names=["PC1", "PC2", "PC3"],
...     new_dim_name="component",
... )
>>> c_pc.dims
('obs', 'component')
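To make the shape requirements concrete, the sketch below computes PCA scores with a NumPy SVD and runs the two checks listed under Raises. The SVD stands in for the `reducer` in the example, and the check variables are hypothetical, not library code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))  # (obs, channel)

# PCA via SVD of the column-centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:3].T  # (obs, component)

names = [f"PC{i + 1}" for i in range(scores.shape[-1])]

# The two conditions that the replaced feature axis has to satisfy.
leading_ok = scores.shape[:-1] == X.shape[:-1]
names_ok = len(names) == scores.shape[-1]
```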
- center(dim='time', inplace=False)#
Remove mean along a specified dimension (Centering/Baseline Correction).
This operation computes the mean along dim (ignoring NaNs) and subtracts it. Commonly used in EEG for baseline correction (subtracting mean of pre-stimulus interval) or centering features before covariance calculation.
- Parameters:
- Returns:
Container with centered data.
- Return type:
DataContainer
Examples
>>> # Baseline correction over time
>>> container.center(dim="time")
- zscore(dim='time', eps=1e-08, inplace=False)#
Standardize (Z-score) along a specified dimension.
Computes (X - mean) / std along the given dimension. Robust to NaNs. Useful for normalizing features or standardizing temporal dynamics.
- Parameters:
- Return type:
DataContainer
Examples
>>> # Standardize each channel's timecourse
>>> container.zscore(dim="time")
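NaN-aware centering and standardization along one axis can be sketched in NumPy as follows. Where exactly `eps` enters the formula is an assumption here (it is added to the standard deviation); the library may guard against division by zero differently.

```python
import numpy as np

X = np.array([
    [1.0, 2.0, 3.0, np.nan],     # timecourse with a missing sample
    [10.0, 10.0, 10.0, 10.0],    # constant timecourse (zero variance)
])  # (channel, time)
eps = 1e-8

# NaN-aware statistics along the time axis, kept broadcastable.
mean = np.nanmean(X, axis=-1, keepdims=True)
std = np.nanstd(X, axis=-1, keepdims=True)

centered = X - mean          # what centering along "time" computes
z = centered / (std + eps)   # eps placement is assumed
```

NaN entries stay NaN, and the constant row maps to zeros instead of producing a division by zero.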
- rms_scale(dim='time', eps=1e-08, inplace=False)#
Scale by Root Mean Square (RMS) amplitude along a dimension.
Divides data by sqrt(mean(X**2)) along the dimension. Preserves relative shape but normalizes energy.
- Parameters:
- Return type:
DataContainer
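The RMS formula above can be written directly in NumPy. This is a minimal sketch that assumes `eps` is added to the RMS before dividing, which the documentation does not state.

```python
import numpy as np

# Two rows with the same shape but a tenfold difference in amplitude.
X = np.array([[3.0, -4.0], [0.3, -0.4]])
eps = 1e-8

rms = np.sqrt(np.mean(X ** 2, axis=-1, keepdims=True))
scaled = X / (rms + eps)  # eps placement is assumed
```

After scaling, both rows coincide and have unit RMS, which shows that relative shape is kept while amplitude is normalized.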
- baseline_correction(dim='time', inplace=False)#
Alias for center(). Common in EEG.
- Parameters:
- Return type:
DataContainer
- combine_coords(keys, name, *, sep='_', pair_sep='-', overwrite=False)#
Return a copy with a new coordinate combining several existing coords.
Each element of the new coordinate joins the corresponding elements of keys in the order given. This materializes a single composite key (e.g. a recording_id from subject/session/run) that can then be used anywhere a single coordinate is expected, most notably as the by argument to aggregate(), but also for select, observation_frame, or provenance labels.
- Parameters:
keys (sequence of str) – Names of existing coordinates to combine. All must be present in coords and share the same length.
name (str) – Name of the new coordinate to create.
sep (str, default="_") – Separator placed between components.
pair_sep (str or None, default="-") – Separator between a component’s source name and its value, yielding "<key><pair_sep><value>" (e.g. "subject-0001"). When None, names are omitted and only the values are joined.
overwrite (bool, default=False) – Whether to replace name if a coordinate by that name exists.
- Returns:
A copy with the new composite coordinate added.
- Return type:
DataContainer
- Raises:
ValueError – If keys is empty, any key is missing, the keys differ in length, or name already exists and overwrite is False.
Examples
>>> grouped = container.combine_coords(
...     ["subject", "session", "run"], "recording_id"
... ).aggregate(by="recording_id", stats="mean")
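The label format described for `sep` and `pair_sep` can be reproduced with a few lines of plain Python. The `combine` helper below is hypothetical and only illustrates the string construction, not the library's validation or copying behaviour.

```python
def combine(coords, keys, sep="_", pair_sep="-"):
    """Join the coords named in `keys` into one composite label per element."""
    columns = [coords[k] for k in keys]
    labels = []
    for row in zip(*columns):
        parts = [
            # With pair_sep=None only the value is kept, without its key name.
            str(v) if pair_sep is None else f"{k}{pair_sep}{v}"
            for k, v in zip(keys, row)
        ]
        labels.append(sep.join(parts))
    return labels


coords = {"subject": ["0001", "0001", "0002"], "session": ["a", "b", "a"]}
labelled = combine(coords, ["subject", "session"])
bare = combine(coords, ["subject", "session"], pair_sep=None)
```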
- aggregate(by, stats='mean', min_count=1, on_insufficient='raise')#
Aggregate observations into grouped summaries along the obs axis.
- Parameters:
by (str or array-like) –
Grouping definition.
If str: resolve the key from self.coords or from self.y (if “y” is passed).
If array-like: explicit group labels aligned with obs.
stats (str or sequence of str, default="mean") – Aggregation statistic or ordered list of statistics. Supported tokens are "mean", "median", "std", "var", "sem", "mad", "iqr", "min", "max", "count", and "first". Legacy "obs-*" aliases are accepted and normalized.
min_count (int, default=1) – Minimum number of valid observations required per group. A valid observation is one with at least one finite value across the non-observation axes.
on_insufficient ({"raise", "warn", "collect"}, default="raise") – Policy applied when a group has fewer than min_count valid observations.
- Returns:
Aggregated container with grouped observations on the obs axis. When multiple stats are requested, a stat dimension is inserted immediately after obs.
- Return type:
DataContainer
- Raises:
ValueError – If the container has no obs dimension, grouping is invalid, requested stats are unsupported, or min_count/on_insufficient are invalid.
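Grouped aggregation along the observation axis, including the extra stat axis for multiple statistics, can be sketched with NumPy. This is an illustration under assumptions: groups are ordered by `np.unique` (sorted), which may not match the library's output order, and `min_count` handling is left out.

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]])  # (obs, feature)
by = np.array(["s1", "s1", "s2"])

# Map each observation to the index of its group.
groups, inverse = np.unique(by, return_inverse=True)
n_groups = len(groups)

# One statistic: the result keeps the (obs, feature) layout, one row per group.
means = np.stack([np.nanmean(X[inverse == g], axis=0) for g in range(n_groups)])

# Several statistics: a stat axis is placed right after obs.
maxima = np.stack([np.nanmax(X[inverse == g], axis=0) for g in range(n_groups)])
both = np.stack([means, maxima], axis=1)  # (obs, stat, feature)
```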
- aggregate_groups(by, groups, min_count=1, on_insufficient='raise', skip_empty=True)#
Aggregate selected feature groups with different statistics.
This is a thin wrapper around aggregate() for tabular feature containers. Each group spec selects a subset of feature columns and applies one or more stats to that subset. The outputs are concatenated along the feature dimension, and each resulting feature name is prefixed with its stat (for example "mean_band_log_abs_alpha").
- Parameters:
by (str or array-like) – Group definition for the observation axis. Passed through to aggregate().
groups (sequence of dict) –
Ordered group specifications. Each group must provide "stats" and may optionally provide include/exclude selectors:
names / exclude_names
prefixes / exclude_prefixes
suffixes / exclude_suffixes
contains / exclude_contains
regex / exclude_regex
If a group provides no include selectors, it starts from all features and then applies exclusions.
min_count (int, default=1) – Minimum number of valid observations required per group. Passed through to aggregate().
on_insufficient ({"raise", "warn", "collect"}, default="raise") – Policy applied when a group has fewer than min_count valid observations. Passed through to aggregate().
skip_empty (bool, default=True) – If True, silently skip group specs that match no features. If False, raise a ValueError when a group matches nothing.
- Returns:
Aggregated container with dims ("obs", "feature") and stat-prefixed feature names.
- Return type:
DataContainer
- Raises:
ValueError – If the container lacks a feature dimension or coord, no groups are provided, a group spec is invalid, multiple groups would emit the same output feature name, or no non-empty grouped outputs are produced.
- __hash__ = None#