coco_pipe.io.DataContainer#

class coco_pipe.io.DataContainer(X, dims, coords=<factory>, y=None, ids=None, meta=<factory>)#

Bases: object

Generic container for N-dimensional neurophysiological data.

Acts as a lightweight labelled array (like xarray but simpler), managing dimensions, coordinates, and associated target labels (y) and IDs.

Variables:
  • X (np.ndarray) – The primary data tensor. Shape must match dims.

  • dims (Tuple[str, ...]) – Labels for each dimension of X. Examples: (‘obs’, ‘feature’), (‘obs’, ‘channel’, ‘time’). Note: The ‘obs’ dimension is special and typically represents independent samples.

  • coords (Dict[str, Union[List, np.ndarray]]) – Coordinates/Labels for dimensions. Keys must be in dims. Values must match the length of the corresponding dimension in X.

  • y (Optional[np.ndarray], optional) – Target labels corresponding to the ‘obs’ dimension. Used for supervised learning or coloring plots.

  • ids (Optional[np.ndarray], optional) – Identifiers for observations (e.g., subject IDs, trial names). Should correspond to ‘obs’ dim in coords if provided. Kept separate from coords for convenient tracking.

  • meta (Dict[str, Any]) – Arbitrary metadata (sfreq, units, source path, etc).

Examples

Accessing data:

>>> container.X.shape
(10, 64, 500)

Accessing coordinates:

>>> container.coords["channel"][:3]
['Fz', 'Cz', 'Pz']

X: ndarray#
dims: tuple[str, ...]#
coords: dict[str, list | ndarray | Sequence]#
y: ndarray | None = None#
ids: ndarray | None = None#
meta: dict[str, Any]#
property shape: tuple[int, ...]#
Return type:

tuple[int, ...]

save(path)#

Save the DataContainer to disk using joblib.

Parameters:

path (str or Path) – Destination file path.

Return type:

None

observation_frame()#

Return observation-aligned coordinates and stable sample IDs.

Return type:

DataFrame

classmethod load(path)#

Load a DataContainer from disk.

Parameters:

path (str or Path) – Source file path.

Return type:

DataContainer

classmethod concat(containers, fill_condition_from_meta=True)#

Concatenate containers along their observation axis.

All containers must use matching dimensions and non-observation shapes. Observation-aligned coordinates are concatenated, filling missing entries with None when a coordinate is absent from a container. Non-observation coordinates are copied from the first container.

Parameters:
  • containers (sequence of DataContainer) – One or more containers to concatenate.

  • fill_condition_from_meta (bool, default=True) – If no observation-level condition coordinate is available, create one from each container’s meta["condition"] value.

Returns:

The concatenated container.

Return type:

DataContainer
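
The None-filling rule for observation-aligned coordinates can be sketched in plain Python. This is an illustration of the documented behaviour only, not the library's implementation; the helper name `concat_obs_coords` and the `lengths` argument are assumptions made for the example.

```python
def concat_obs_coords(coord_dicts, lengths):
    """Concatenate per-container observation coords, padding absent keys with None.

    coord_dicts: one dict of observation-aligned coords per container.
    lengths: number of observations in each container (hypothetical argument).
    """
    all_keys = []
    for coords in coord_dicts:
        for key in coords:
            if key not in all_keys:
                all_keys.append(key)  # keep first-seen key order

    merged = {key: [] for key in all_keys}
    for coords, n_obs in zip(coord_dicts, lengths):
        for key in all_keys:
            # A container without this coordinate contributes n_obs None entries.
            merged[key].extend(coords.get(key, [None] * n_obs))
    return merged


a = {"subject": ["s1", "s1"], "condition": ["rest", "task"]}
b = {"subject": ["s2", "s2", "s2"]}  # no 'condition' coordinate
merged = concat_obs_coords([a, b], lengths=[2, 3])
```

Here `merged["condition"]` has one entry per concatenated observation, with None for the three observations that came from the second container.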

obs_table(include_ids=False, id_col='obs_id', include_y=False, y_col='y', include_obs_coord=False)#

Return one-dimensional coordinates aligned to the observation axis.

This helper is useful when exporting a row-wise table from a container. It only materializes metadata that can map cleanly to one row per observation, skipping coordinates that belong to other axes such as channel, time, feature, or stat.

Parameters:
  • include_ids (bool, default=False) – If True, include self.ids as the first column.

  • id_col (str, default="obs_id") – Column name used when exporting self.ids.

  • include_y (bool, default=False) – If True, include self.y as a column when present.

  • y_col (str, default="y") – Column name used when exporting self.y.

  • include_obs_coord (bool, default=False) – If True, include coords["obs"] when present.

Returns:

DataFrame containing only one-dimensional observation-aligned metadata columns.

Return type:

pandas.DataFrame

Raises:

ValueError – If the container has no obs dimension, or if include_ids is requested when self.ids is missing.

isel(**indexers)#

Select data by integer indices on specified dimensions.

This method is the integer-index equivalent of select. It operates directly on the dimensions of the data tensor X and slices the associated metadata automatically so that it stays aligned with the data.

Parameters:

**indexers (dict) –

Dimension names mapped to the desired index. The index can be:

  • List or numpy array of integers: [0, 1, 5]

  • Slice object: slice(0, 10)

  • Single integer: 0

Note: If you provide a list of indices with repeats (e.g., [0, 0, 1]), the output will be oversampled accordingly.

Returns:

A new DataContainer instance with the sliced data and coordinates.

Return type:

DataContainer

Examples

>>> # Select first 10 observations
>>> subset = container.isel(obs=slice(0, 10))
>>> # Select specific channels by index
>>> subset = container.isel(channel=[0, 5, 12])
>>> # Select time range by index
>>> subset = container.isel(time=slice(100, 200))
>>> # Bootstrap/Resample (Select index 0 five times)
>>> bootstrap = container.isel(obs=[0, 0, 0, 0, 0])
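
The effect of repeated indices can be illustrated with plain NumPy indexing. This is a sketch of the underlying idea, assuming that observation metadata such as ids is indexed with the same indexer as X; it does not call the library.

```python
import numpy as np

X = np.arange(4 * 3).reshape(4, 3)     # 4 observations, 3 channels
ids = np.array(["a", "b", "c", "d"])   # one identifier per observation

idx = [0, 0, 1]                        # index 0 appears twice
X_sub = X[idx]                         # rows 0, 0, 1 -> shape (3, 3)
ids_sub = ids[idx]                     # metadata follows the same indexer

X_first_two = X[slice(0, 2)]           # a slice keeps rows 0 and 1
```

Because the same indexer is applied to the data and to the per-observation metadata, a duplicated index yields duplicated rows and duplicated IDs.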
balance(target='y', strategy='undersample', covariates=None, random_state=42, **kwargs)#

Balance the dataset classes using undersampling or oversampling.

This method adjusts the number of observations (rows) in the container so that class counts in target are equalized. It supports simple random sampling and stratified sampling based on covariates.

Parameters:
  • target (str or array-like) –

    The target vector to balance against:

    • ‘y’: Uses self.y.

    • Any other string: Looks for the variable in self.coords.

    • Array-like: Direct labels to use.

  • strategy (str, default='undersample') –

    Balancing strategy:

    • ‘undersample’: Downsample majority classes to match the minority class frequency. Uses self.ids to ensure repeatability.

    • ‘oversample’: Upsample minority classes (with replacement) to match the majority frequency.

    • ‘auto’: Heuristic choice. Uses undersampling if the undersampled dataset retains more than 20% of the original observations; otherwise uses oversampling.

  • covariates (list of str, optional) – List of covariate names in self.coords to preserve distribution of. If provided, the balancing is performed within strata defined by these covariates.

  • random_state (int, default=42) – Seed for the random number generator. Change this value to produce different random subsets (e.g., for bagging).

  • **kwargs (dict) –

    Additional arguments passed to internal logic:

    • n_bins (int): Number of bins for continuous covariates (default 5).

    • binning (str): ‘quantile’ (default) or ‘uniform’ binning.

    • prefer_clean_rows (bool): If True, weights sampling to prefer rows with fewer NaNs/artifacts.

Returns:

A new DataContainer instance with balanced classes.

Return type:

DataContainer

Examples

>>> # 1. Simple Undersampling of 'y'
>>> balanced = container.balance(strategy="undersample")
>>> # 2. Balance based on a metadata column 'condition'
>>> balanced = container.balance(target="condition")
>>> # 3. Stratified Balancing (Balance 'y' while preserving 'sex' and 'age'
>>> #    ratios)
>>> balanced = container.balance(target="y", covariates=["sex", "age"])
>>> # 4. Iterative Bootstrapping (Different seeds)
>>> for seed in [1, 2, 3]:
...     subset = container.balance(strategy="undersample", random_state=seed)
...     # process subset...
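
As a rough illustration of seeded random undersampling, the following standard-library sketch picks the same number of observations from every class. It is a simplified stand-in, not the library's algorithm: it ignores covariates and binning, and the helper name `undersample_indices` is hypothetical.

```python
import random
from collections import defaultdict


def undersample_indices(labels, random_state=42):
    """Return sorted observation indices with equal counts per class."""
    rng = random.Random(random_state)      # seeded for reproducibility
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    n_min = min(len(members) for members in by_class.values())
    keep = []
    for label in sorted(by_class):         # fixed class order across runs
        keep.extend(rng.sample(by_class[label], n_min))
    return sorted(keep)


labels = ["patient"] * 6 + ["control"] * 2
keep = undersample_indices(labels, random_state=1)
```

The returned indices could then be passed to an integer indexer such as isel(obs=keep). Changing random_state selects a different subset of the majority class.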
select(ignore_case=False, fuzzy=False, **selections)#

Select data subsets based on coordinates, ids, or y.

This method supports exact matching, wildcard matching, operator-based filtering, and custom callable filters.

Parameters:
  • ignore_case (bool, default=False) – If True, string matching is case-insensitive (e.g., ‘fz’ matches ‘Fz’).

  • fuzzy (bool, default=False) – If True, uses difflib to find closest matches for string queries (e.g., ‘Alpha’ matches ‘alpha’). Useful for handling typos.

  • **selections (dict) –

    Key is the dimension name (or special keys ‘y’, ‘ids’). Value is the query. Supported query types:

    1. List/Array (Exact or Wildcard): Matches values present in the list. Strings can use shell-style wildcards (‘*’, ‘?’).

    2. Dictionary (Operator Queries): Filters numerical or string values using operators. Keys: ‘>’, ‘<’, ‘>=’, ‘<=’, ‘==’, ‘!=’, ‘in’.

    3. Callable: A function taking the coordinate array and returning a boolean mask.

Returns:

A new DataContainer instance containing the selected subset.

Return type:

DataContainer

Examples

>>> # 1. Exact Selection (Sensors)
>>> sub = container.select(channel=["Fz", "Cz"])
>>> # 2. Wildcard Selection (All Alpha features)
>>> sub = container.select(feature="*alpha*")
>>> # 3. Range Selection (Time)
>>> sub = container.select(time={">=": 0.1, "<": 0.5})
>>> # 4. Case-Insensitive Matching
>>> sub = container.select(channel=["fz"], ignore_case=True)
>>> # 5. Filter by Target (y)
>>> sub = container.select(y=["Patient"])
>>> # 6. Complex Logic (Subjects 1-5 via Operator)
>>> sub = container.select(subject_id={">=": 1, "<=": 5})
>>> # 7. Stratified Selection (First 2 epochs per subject via Callable)
>>> def first_n(ids, n=2):
...     # ... logic ...
...     return mask
>>> sub = container.select(ids=first_n)
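
The three query types can be pictured as different ways of building a boolean mask over a coordinate. The sketch below uses only the standard library and makes two assumptions that the text above does not state: that several operators in one dictionary are combined with AND, and that wildcard matching is case-sensitive unless ignore_case is set. The helper `build_mask` is hypothetical.

```python
import difflib
import operator
from fnmatch import fnmatchcase

OPS = {
    ">": operator.gt, "<": operator.lt, ">=": operator.ge,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}


def build_mask(values, query, ignore_case=False):
    """Return one boolean per coordinate value (illustrative only)."""
    if callable(query):                       # 3. callable -> mask
        return [bool(m) for m in query(values)]
    if isinstance(query, dict):               # 2. operator queries, ANDed
        mask = [True] * len(values)
        for op, ref in query.items():
            if op == "in":
                passed = [v in ref for v in values]
            else:
                passed = [OPS[op](v, ref) for v in values]
            mask = [m and p for m, p in zip(mask, passed)]
        return mask
    # 1. exact values or shell-style wildcards
    patterns = [query] if isinstance(query, str) else list(query)
    fold = str.lower if ignore_case else str
    return [
        any(fnmatchcase(fold(str(v)), fold(str(p))) for p in patterns)
        for v in values
    ]


time_mask = build_mask([0.0, 0.1, 0.3, 0.5], {">=": 0.1, "<": 0.5})
chan_mask = build_mask(["Fz", "Cz", "Pz"], ["fz"], ignore_case=True)
feat_mask = build_mask(["band_alpha_abs", "band_beta_abs"], "*alpha*")

# Fuzzy matching with difflib: the misspelled query resolves to the closest label.
closest = difflib.get_close_matches("alpa", ["alpha", "beta", "theta"], n=1)
```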
flatten(preserve='obs', sep='_')#

Flatten dimensions NOT in preserve into a single ‘feature’ dimension.

This is useful for preparing N-dimensional data for standard 2D machine learning algorithms (scikit-learn). It automatically generates composite feature names (e.g., ‘Fz_0.1s’) for tracking.

Parameters:
  • preserve (str or list of str, default="obs") –

    Dimensions to preserve. All other dimensions are flattened into a new dimension ‘feature’.

    • ‘obs’: Result shape (N_obs, N_features). The standard 2D layout for scikit-learn estimators.

    • [‘obs’, ‘time’]: Result shape (N_obs, N_time, N_features). Useful for time-resolved decoding distributions.

  • sep (str, default="_") – Separator used when generating composite feature names.

Returns:

A new DataContainer with reshaped X and generated ‘feature’ coordinates.

Return type:

DataContainer

Examples

>>> # Flatten (10, 64, 500) -> (10, 32000)
>>> flat = container.flatten(preserve="obs")
>>> flat.shape
(10, 32000)
>>> flat.coords["feature"][0]
'Fz_0.0'
>>> # Flatten spatial only, keep time (10, 64, 500) -> (10, 500, 64)
>>> time_resolved = container.flatten(preserve=["obs", "time"])
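
The composite feature names can be sketched with itertools.product. The ordering is an assumption: the example treats the last flattened dimension (time) as the fastest-changing one, which is consistent with the first label 'Fz_0.0' shown above but is not confirmed for the library in general. The channel and time values are made up for illustration.

```python
from itertools import product

channels = ["Fz", "Cz"]      # hypothetical channel labels
times = [0.0, 0.1, 0.2]      # hypothetical time points
sep = "_"

# One name per (channel, time) pair; time cycles fastest.
feature_names = [f"{ch}{sep}{t}" for ch, t in product(channels, times)]

n_features = len(channels) * len(times)
```

With 64 channels and 500 time points the same product gives 64 * 500 = 32000 features, matching the flattened shape in the example.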
feature_schema()#

Return feature-axis metadata, or None when no feature coord exists.

Only coordinates aligned to the feature axis are included. feature_* metadata is mapped to canonical schema names such as family and measure; primary dimension coords folded by flatten() are used in the feature labels and are not recovered as structured metadata.

Return type:

DataFrame | None

stack(dims, new_dim='obs')#

Stack multiple dimensions into a single new dimension.

This reshapes N-dimensional data into (N-K) dimensions by combining specified dimensions. It is useful for transforming spatiotemporal data (Trials, Channels, Time) -> (Trials*Time, Channels) for trajectory analysis.

Parameters:
  • dims (sequence of str) – Dimensions to stack. The order determines the nesting (slowest to fastest). e.g., (‘obs’, ‘time’) means ‘obs’ changes slowly, ‘time’ cycles fast.

  • new_dim (str, default='obs') – Name of the resulting stacked dimension.

Returns:

New container with stacked dimension. Metadata (coords/ids) are expanded/tiled to match the new shape.

Return type:

DataContainer

Examples

>>> # Stack time into observations:
>>> # (10 obs, 64 ch, 500 time) -> (5000 obs, 64 ch)
>>> stacked = container.stack(dims=("obs", "time"), new_dim="obs")
>>> stacked.shape
(5000, 64)
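
In NumPy terms, stacking ('obs', 'time') amounts to moving the stacked axes next to each other in nesting order and merging them with a row-major reshape. The sketch below illustrates that idea on random data; it is not the library's code, and the index arrays only show how per-observation metadata would need to be repeated.

```python
import numpy as np

n_obs, n_ch, n_time = 10, 64, 500
X = np.random.default_rng(0).normal(size=(n_obs, n_ch, n_time))

# (obs, channel, time) -> (obs, time, channel), then merge obs and time.
# 'obs' is the slow index and 'time' the fast one.
X_stacked = np.moveaxis(X, 2, 1).reshape(n_obs * n_time, n_ch)

# Original indices of every stacked row.
obs_index = np.repeat(np.arange(n_obs), n_time)   # 0, 0, ..., 1, 1, ...
time_index = np.tile(np.arange(n_time), n_obs)    # 0, 1, ..., 499, 0, 1, ...
```

Row `o * n_time + t` of the stacked array holds the channel vector of observation `o` at time sample `t`.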
unstack(dim)#

Unstack a dimension into multiple dimensions.

Inverse operation of stack. Reshapes the data tensor by splitting one dimension into multiple using metadata stored during the stack operation.

Parameters:

dim (str) – Dimension to unstack (e.g. ‘obs’).

Returns:

New container with unstacked dimensions.

Return type:

DataContainer

Raises:

ValueError – If the container was not previously stacked (missing metadata).

Examples

>>> # Stack 'trials' and 'time' -> 'obs'
>>> stacked = container.stack(("trials", "time"), new_dim="obs")
>>> # Unstack 'obs' -> ('trials', 'time') (automatically inferred)
>>> unstacked = stacked.unstack("obs")
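
The round trip can be sketched with NumPy reshapes. The `stack_meta` dictionary is a hypothetical stand-in for whatever the library records during stack; only the idea that the original axis sizes must be remembered is taken from the description above.

```python
import numpy as np

X = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # (trials, channel, time)

# Stack ('trials', 'time') into one axis, remembering the original sizes.
stacked = np.moveaxis(X, 2, 1).reshape(2 * 4, 3)
stack_meta = {"dims": ("trials", "time"), "shape": (2, 4)}  # hypothetical record

# Unstack: split the merged axis, then move 'time' back to the last position.
restored = np.moveaxis(stacked.reshape(*stack_meta["shape"], 3), 1, 2)
```

Without the recorded sizes the merged axis of length 8 could be split as 2 x 4, 4 x 2, or 8 x 1, which is why unstacking a container that was never stacked cannot be resolved.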
with_features(X, names=None, feature_dim=None, new_dim_name='component')#

Return a new container with the feature axis replaced.

Typical use: re-attach reduced-dimensionality scores (e.g. PCA components) to a container, so downstream operations (unstack, aggregate, plotting) keep working with proper coordinates.

Parameters:
  • X (np.ndarray) – New data array. The leading axes must match the container’s non-feature axes; the trailing axis is the new feature axis.

  • names (sequence of str, optional) – Coordinate labels for the new feature axis. When None, integer indices are used. Must have length X.shape[-1].

  • feature_dim (str, optional) – Name of the dimension being replaced. Defaults to the last dimension of the container.

  • new_dim_name (str, default='component') – Dimension name to assign to the replaced axis when feature_dim is not present in self.dims (e.g., when replacing channel with component after PCA).

Returns:

New container with X replaced and the feature-axis coord updated. All other dims, coords, y, ids, and meta are preserved.

Return type:

DataContainer

Raises:

ValueError – If X’s leading shape doesn’t match the container, or if names has the wrong length.

Examples

>>> # After fitting PCA on stacked data:
>>> scores = reducer.fit_transform(c_stacked.X)  # (n_obs, 3)
>>> c_pc = c_stacked.with_features(
...     scores,
...     names=["PC1", "PC2", "PC3"],
...     new_dim_name="component",
... )
>>> c_pc.dims
('obs', 'component')
center(dim='time', inplace=False)#

Remove mean along a specified dimension (Centering/Baseline Correction).

This operation computes the mean along dim (ignoring NaNs) and subtracts it. Commonly used in EEG for baseline correction (subtracting mean of pre-stimulus interval) or centering features before covariance calculation.

Parameters:
  • dim (str or sequence of str, default='time') – Dimension name(s) to center over (e.g., ‘time’, ‘channel’, ‘obs’, or (‘obs’, ‘time’)).

  • inplace (bool, default=False) – If True, modifies X in-place to save memory. Returns self.

Returns:

Container with centered data.

Return type:

DataContainer

Examples

>>> # Baseline correction over time
>>> container.center(dim="time")
zscore(dim='time', eps=1e-08, inplace=False)#

Standardize (Z-score) along a specified dimension.

Computes (X - mean) / std along the given dimension. Robust to NaNs. Useful for normalizing features or standardizing temporal dynamics.

Parameters:
  • dim (str or sequence of str, default='time') – Dimension(s) to standardize.

  • eps (float, default=1e-08) – Stability epsilon to avoid division by zero.

  • inplace (bool, default=False)

Return type:

DataContainer

Examples

>>> # Standardize each channel's timecourse
>>> container.zscore(dim="time")
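
A NaN-aware z-score along the last axis can be sketched as follows. How eps enters the computation is an assumption (here it is added to the standard deviation); the library may apply it differently.

```python
import numpy as np

X = np.array([
    [1.0, 2.0, 3.0, 4.0],   # varying channel
    [5.0, 5.0, 5.0, 5.0],   # constant channel: std is 0
])
eps = 1e-8

mean = np.nanmean(X, axis=-1, keepdims=True)   # ignores NaNs
std = np.nanstd(X, axis=-1, keepdims=True)
Z = (X - mean) / (std + eps)                   # eps avoids division by zero
```

The constant channel comes out as zeros instead of NaN, which is the purpose of the stability epsilon.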
rms_scale(dim='time', eps=1e-08, inplace=False)#

Scale by Root Mean Square (RMS) amplitude along a dimension.

Divides data by sqrt(mean(X**2)) along the dimension. Preserves relative shape but normalizes energy.

Parameters:
  • dim (str or sequence of str, default='time') – Dimension(s) to scale.

  • eps (float, default=1e-08) – Stability epsilon.

  • inplace (bool, default=False)

Return type:

DataContainer
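
The scaling formula can be written out in NumPy as below. This is a sketch under the assumption that eps is added to the RMS before dividing; NaN handling is not described for this method, so the example uses a plain mean.

```python
import numpy as np

X = np.array([
    [3.0, 4.0],    # RMS = sqrt((9 + 16) / 2)
    [0.0, 0.0],    # silent channel: RMS is 0
])
eps = 1e-8

rms = np.sqrt(np.mean(X ** 2, axis=-1, keepdims=True))
scaled = X / (rms + eps)

# After scaling, each non-silent row has unit RMS.
rms_after = np.sqrt(np.mean(scaled ** 2, axis=-1))
```

Unlike z-scoring, no mean is subtracted, so the sign and relative shape of each trace are unchanged.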

baseline_correction(dim='time', inplace=False)#

Alias for center(). Common in EEG.

Return type:

DataContainer

combine_coords(keys, name, *, sep='_', pair_sep='-', overwrite=False)#

Return a copy with a new coordinate combining several existing coords.

Each element of the new coordinate joins the corresponding elements of keys in the order given. This materializes a single composite key (e.g. a recording_id from subject/session/run) that can then be used anywhere a single coordinate is expected — most notably as the by argument to aggregate(), but also for select, observation_frame, or provenance labels.

Parameters:
  • keys (sequence of str) – Names of existing coordinates to combine. All must be present in coords and share the same length.

  • name (str) – Name of the new coordinate to create.

  • sep (str, default="_") – Separator placed between components.

  • pair_sep (str or None, default="-") – Separator between a component’s source name and its value, yielding "<key><pair_sep><value>" (e.g. "subject-0001"). When None, names are omitted and only the values are joined.

  • overwrite (bool, default=False) – Whether to replace name if a coordinate by that name exists.

Returns:

A copy with the new composite coordinate added.

Return type:

DataContainer

Raises:

ValueError – If keys is empty, any key is missing, the keys differ in length, or name already exists and overwrite is False.

Examples

>>> grouped = container.combine_coords(
...     ["subject", "session", "run"], "recording_id"
... ).aggregate(by="recording_id", stats="mean")
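
The way sep and pair_sep shape the composite key can be sketched in plain Python. The helper `combine` is hypothetical and works on a bare dictionary of lists; it only mirrors the "<key><pair_sep><value>" format described above.

```python
def combine(coords, keys, sep="_", pair_sep="-"):
    """Join the per-observation values of several coords into one string key."""
    columns = [coords[key] for key in keys]
    combined = []
    for row in zip(*columns):
        if pair_sep is None:
            parts = [str(value) for value in row]              # values only
        else:
            parts = [f"{key}{pair_sep}{value}" for key, value in zip(keys, row)]
        combined.append(sep.join(parts))
    return combined


coords = {
    "subject": ["0001", "0001", "0002"],
    "session": ["01", "02", "01"],
}
with_names = combine(coords, ["subject", "session"])
values_only = combine(coords, ["subject", "session"], pair_sep=None)
```

With the defaults the first key is "subject-0001_session-01"; with pair_sep=None it is "0001_01".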
aggregate(by, stats='mean', min_count=1, on_insufficient='raise')#

Aggregate observations into grouped summaries along the obs axis.

Parameters:
  • by (str or array-like) –

    Grouping definition.

    • If str: resolve the key from self.coords or from self.y (if “y” is passed).

    • If array-like: explicit group labels aligned with obs.

  • stats (str or sequence of str, default="mean") – Aggregation statistic or ordered list of statistics. Supported tokens are "mean", "median", "std", "var", "sem", "mad", "iqr", "min", "max", "count", and "first". Legacy "obs-*" aliases are accepted and normalized.

  • min_count (int, default=1) – Minimum number of valid observations required per group. A valid observation is one with at least one finite value across the non-observation axes.

  • on_insufficient ({"raise", "warn", "collect"}, default="raise") – Policy applied when a group has fewer than min_count valid observations.

Returns:

Aggregated container with grouped observations on the obs axis. When multiple stats are requested, a stat dimension is inserted immediately after obs.

Return type:

DataContainer

Raises:

ValueError – If the container has no obs dimension, grouping is invalid, requested stats are unsupported, or min_count / on_insufficient are invalid.
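
The output layout for multiple statistics can be sketched with NumPy. The example assumes that groups appear in sorted label order, which the text above does not specify, and it covers only two of the supported statistics.

```python
import numpy as np

X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [10.0, 20.0],
])                                   # (obs, feature)
by = np.array(["s1", "s1", "s2"])    # one group label per observation

groups = np.unique(by)               # sorted unique labels (assumed order)
stats = {"mean": np.nanmean, "max": np.nanmax}

# One block per group, one row per statistic -> (obs, stat, feature).
out = np.stack([
    np.stack([fn(X[by == g], axis=0) for fn in stats.values()])
    for g in groups
])
```

The result has one entry per group on the first axis and a stat axis immediately after it, mirroring the description of the returned container.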

aggregate_groups(by, groups, min_count=1, on_insufficient='raise', skip_empty=True)#

Aggregate selected feature groups with different statistics.

This is a thin wrapper around aggregate() for tabular feature containers. Each group spec selects a subset of feature columns and applies one or more stats to that subset. The outputs are concatenated along the feature dimension, and each resulting feature name is prefixed with its stat (for example "mean_band_log_abs_alpha").

Parameters:
  • by (str or array-like) – Group definition for the observation axis. Passed through to aggregate().

  • groups (sequence of dict) –

    Ordered group specifications. Each group must provide "stats" and may optionally provide include/exclude selectors:

    • names / exclude_names

    • prefixes / exclude_prefixes

    • suffixes / exclude_suffixes

    • contains / exclude_contains

    • regex / exclude_regex

    If a group provides no include selectors, it starts from all features and then applies exclusions.

  • min_count (int, default=1) – Minimum number of valid observations required per group. Passed through to aggregate().

  • on_insufficient ({"raise", "warn", "collect"}, default="raise") – Policy applied when a group has fewer than min_count valid observations. Passed through to aggregate().

  • skip_empty (bool, default=True) – If True, silently skip group specs that match no features. If False, raise a ValueError when a group matches nothing.

Returns:

Aggregated container with dims ("obs", "feature") and stat-prefixed feature names.

Return type:

DataContainer

Raises:

ValueError – If the container lacks a feature dimension or coord, no groups are provided, a group spec is invalid, multiple groups would emit the same output feature name, or no non-empty grouped outputs are produced.
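
The include/exclude selectors can be pictured as a filter over feature names. The sketch below is an assumption-heavy illustration: it treats every selector value as a list of strings, combines include selectors with OR, and uses a hypothetical helper `match_features`. Only the selector names, the "start from all features" rule, and the stat-prefixed output names come from the description above.

```python
import re

INCLUDE_KEYS = ("names", "prefixes", "suffixes", "contains", "regex")


def match_features(names, spec):
    """Return the feature names selected by one group spec (illustrative)."""

    def hits(name, prefix=""):
        return (
            name in spec.get(prefix + "names", ())
            or any(name.startswith(p) for p in spec.get(prefix + "prefixes", ()))
            or any(name.endswith(s) for s in spec.get(prefix + "suffixes", ()))
            or any(c in name for c in spec.get(prefix + "contains", ()))
            or any(re.search(r, name) for r in spec.get(prefix + "regex", ()))
        )

    has_include = any(key in spec for key in INCLUDE_KEYS)
    return [
        n for n in names
        if (hits(n) or not has_include) and not hits(n, "exclude_")
    ]


features = ["band_log_abs_alpha", "band_log_abs_beta", "entropy_perm"]
groups = [
    {"prefixes": ["band_"], "exclude_suffixes": ["beta"], "stats": ["mean"]},
    {"exclude_prefixes": ["band_"], "stats": ["std"]},   # no include selectors
]

output_names = [
    f"{stat}_{name}"
    for spec in groups
    for stat in spec["stats"]
    for name in match_features(features, spec)
]
```

The second group has no include selectors, so it starts from all features and keeps whatever survives its exclusions.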

__hash__ = None#

Examples using coco_pipe.io.DataContainer#

Basic Feature Descriptors Extraction

Data Structures Demo