Data Structures Demo

This example demonstrates the DataContainer and other core IO structures used in the coco-pipe package. The DataContainer wraps N-dimensional numpy arrays and keeps track of their dimensions, coordinates, and labels.

Imports

First, let’s import the necessary libraries.

import numpy as np

from coco_pipe.io.structures import DataContainer

1. Tabular Data (2D)

We can store standard 2D tabular data (Observations x Features). The DataContainer will automatically track the coordinates for each dimension.

X_tab = np.random.randn(5, 3)
container_tab = DataContainer(
    X=X_tab,
    dims=("obs", "feature"),
    coords={
        "obs": [f"sub-{i}" for i in range(5)],
        "feature": ["Alpha_Cz", "Alpha_Fz", "Beta_Pz"],
    },
)

print(f"Original container:\n{container_tab}")
Original container:
<DataContainer [obs=5 x feature=3], coords=['obs', 'feature']>

We can easily select data using wildcards on the coordinates. Let’s select all features starting with “Alpha”:

subset = container_tab.select(feature=["Alpha*"])
print(f"Selected (Alpha*):\n{subset}")
Selected (Alpha*):
<DataContainer [obs=5 x feature=2], coords=['obs', 'feature']>
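
How select() resolves a pattern is not shown in this example. As a rough illustration only, shell-style wildcard matching over coordinate labels can be sketched with Python's standard fnmatch module. The helper name match_labels is hypothetical, and this is not the coco-pipe implementation.

```python
from fnmatch import fnmatchcase


def match_labels(labels, patterns):
    """Return indices of labels matching any shell-style pattern (case-sensitive)."""
    return [
        i
        for i, label in enumerate(labels)
        if any(fnmatchcase(label, pat) for pat in patterns)
    ]


features = ["Alpha_Cz", "Alpha_Fz", "Beta_Pz"]
keep = match_labels(features, ["Alpha*"])
selected = [features[i] for i in keep]
print(selected)
```

Returning indices rather than labels makes it straightforward to slice the matching axis of the underlying array with the same result.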

2. EEG Data (3D)

The DataContainer excels at handling multi-dimensional data like EEG, which typically has dimensions (Observations x Channels x Time).

Let’s simulate data for 2 subjects, 2 conditions, and 4 epochs per subject and condition, for 2 * 2 * 4 = 16 observations in total.

n_subs = 2
n_conds = 2
n_epochs = 4
n_obs = n_subs * n_conds * n_epochs
n_chans = 3
n_times = 10

X_eeg = np.random.randn(n_obs, n_chans, n_times)

# Create tracking labels
ids = []
conditions = []
for sub in range(n_subs):
    for cond in ["A", "B"]:
        for ep in range(n_epochs):
            ids.append(f"sub-{sub}_cond-{cond}_ep-{ep}")
            conditions.append(cond)

container_eeg = DataContainer(
    X=X_eeg,
    y=np.array(conditions),
    ids=np.array(ids),
    dims=("obs", "channel", "time"),
    coords={"obs": ids, "channel": ["Fz", "Cz", "Pz"], "time": np.arange(n_times)},
)

print(f"EEG Container:\n{container_eeg}")
print(f"First 5 IDs:\n{container_eeg.ids[:5]}")
EEG Container:
<DataContainer [obs=16 x channel=3 x time=10], coords=['obs', 'channel', 'time']>
First 5 IDs:
['sub-0_cond-A_ep-0' 'sub-0_cond-A_ep-1' 'sub-0_cond-A_ep-2'
 'sub-0_cond-A_ep-3' 'sub-0_cond-B_ep-0']

3. Flattening Data

We often need to flatten high-dimensional data into 2D matrices for standard machine learning algorithms (like PCA or classifiers), while preserving specific dimensions.

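Flatten for TRCA (Spatial): The intent is to keep Observations and Channels and flatten only Time, i.e. (16, 3, 10) -> (Obs, Chan, Feature=Time). Note that the call below passes preserve=["obs"] only, so the output shown is the 2D (16, 30) result; keeping the channel axis would presumably require preserve=["obs", "channel"].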

flat_spatial = container_eeg.flatten(preserve=["obs"])
print(
    f"Flattened (Spatial): {flat_spatial.shape} dims={flat_spatial.dims} | "
    f"Coords: {list(flat_spatial.coords.keys())}"
)
Flattened (Spatial): (16, 30) dims=('obs', 'feature') | Coords: ['obs', 'feature']

Flatten for Standard ML (2D): Keep Observations only. Result: (16, 3, 10) -> (16, 3*10) = (16, 30), i.e. (Obs, Feature=Chan*Time)

flat_ml = container_eeg.flatten(preserve=["obs"])
print(f"Flattened (Standard 2D): {flat_ml.shape} dims={flat_ml.dims}")
print(f"Sample Composite Features:\n{flat_ml.coords['feature'][:5]}")
Flattened (Standard 2D): (16, 30) dims=('obs', 'feature')
Sample Composite Features:
['Fz_0', 'Fz_1', 'Fz_2', 'Fz_3', 'Fz_4']
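
The composite feature names in the output suggest that channel and time are collapsed in row-major order, with time varying fastest. A minimal plain-numpy sketch of that idea follows; it assumes C-order reshaping and a "channel_time" naming scheme inferred from the printed output, and it does not reproduce coco-pipe's internals.

```python
import numpy as np

channels = ["Fz", "Cz", "Pz"]
times = np.arange(10)
X = np.arange(16 * 3 * 10, dtype=float).reshape(16, 3, 10)

# Collapse channel and time into a single feature axis.
# With the default C order, the last axis (time) varies fastest.
X_flat = X.reshape(X.shape[0], -1)

# Build composite names in the same order as the reshape.
feature_names = [f"{ch}_{t}" for ch in channels for t in times]

print(X_flat.shape)
print(feature_names[:5])
```

Because the names are generated in the same nested order as the reshape, column j of X_flat always corresponds to feature_names[j].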

4. Aggregation

You can aggregate data across coordinates or labels. Let’s average the data across our “Condition” labels (A and B).

agg_cond = container_eeg.aggregate(by=container_eeg.y, stats="mean")
print(f"Aggregated by Condition (A, B): {agg_cond.shape}\nIDs={agg_cond.ids}")
Aggregated by Condition (A, B): (2, 3, 10)
IDs=[np.str_('A') np.str_('B')]
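
For readers who want to see what a label-wise mean amounts to, here is a small plain-numpy sketch. It assumes the aggregation groups observations by label and averages over the observation axis, which matches the (2, 3, 10) shape printed above; the variable names are illustrative and the code is independent of coco-pipe.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 3, 10))

# Same label layout as the example: 4 epochs of A, then 4 of B, per subject.
labels = np.array((["A"] * 4 + ["B"] * 4) * 2)

# np.unique returns the sorted unique labels.
groups = np.unique(labels)

# Average all observations that share a label, keeping channel and time.
X_mean = np.stack([X[labels == g].mean(axis=0) for g in groups])

print(groups, X_mean.shape)
```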

5. Advanced Selection

The select() method supports wildcards, fuzzy matching, comparison operators, and custom callables.

Wildcard Epoch Selection

subset_epochs = container_eeg.select(obs=["*ep-0", "*ep-1"])
print(f"Selected (*ep-0, *ep-1): {subset_epochs.shape} from {container_eeg.shape}")
print(f"Selected IDs:\n{subset_epochs.ids}")
Selected (*ep-0, *ep-1): (8, 3, 10) from (16, 3, 10)
Selected IDs:
['sub-0_cond-A_ep-0' 'sub-0_cond-A_ep-1' 'sub-0_cond-B_ep-0'
 'sub-0_cond-B_ep-1' 'sub-1_cond-A_ep-0' 'sub-1_cond-A_ep-1'
 'sub-1_cond-B_ep-0' 'sub-1_cond-B_ep-1']

Case-Insensitive Selection

subset_fuzzy = container_eeg.select(channel=["fz"], ignore_case=True, fuzzy=False)
print(f"Case-Insensitive 'fz' -> {subset_fuzzy.coords['channel']}")
Case-Insensitive 'fz' -> ['Fz']

Operator Selection (e.g., Time >= 5)

subset_time = container_eeg.select(time={">=": 5})
print(f"Time >= 5 -> {subset_time.coords['time']}")
Time >= 5 -> [5 6 7 8 9]
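
One plausible way to interpret an operator dictionary such as {">=": 5} is to map each key to a function from Python's standard operator module. The sketch below is only an illustration of that idea: the helper select_by_ops and the OPS table are hypothetical, and combining several conditions in one dictionary is a property of this sketch, not a documented feature of select().

```python
import operator

# Map operator strings to the corresponding comparison functions.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def select_by_ops(values, conditions):
    """Keep the values that satisfy every {operator: threshold} condition."""
    return [
        v
        for v in values
        if all(OPS[op](v, thr) for op, thr in conditions.items())
    ]


times = list(range(10))
late = select_by_ops(times, {">=": 5})
window = select_by_ops(times, {">=": 2, "<": 5})
print(late)
print(window)
```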

Filter by Target Label (Y)

subset_cond = container_eeg.select(y=["B"])
print(f"Select Y='B' -> IDs:\n{subset_cond.ids[:3]}... (Total {subset_cond.shape[0]})")
Select Y='B' -> IDs:
['sub-0_cond-B_ep-0' 'sub-0_cond-B_ep-1' 'sub-0_cond-B_ep-2']... (Total 8)

Stratified Selection via Callable: Keep only the first 2 epochs for each unique subject.

def first_n_per_subject(ids_array, n=2):
    """Custom selector: keeps first n occurrences of each unique subject prefix."""
    subjects = [i.split("_")[0] for i in ids_array]

    mask = np.zeros(len(ids_array), dtype=bool)
    counts = {}
    for idx, sub in enumerate(subjects):
        if counts.get(sub, 0) < n:
            mask[idx] = True
            counts[sub] = counts.get(sub, 0) + 1
    return mask


subset_strat = container_eeg.select(ids=lambda x: first_n_per_subject(x, n=2))
print(f"First 2 epochs per subject:\n{subset_strat.ids}")
First 2 epochs per subject:
['sub-0_cond-A_ep-0' 'sub-0_cond-A_ep-1' 'sub-1_cond-A_ep-0'
 'sub-1_cond-A_ep-1']

6. Data Scaling and Normalization

The container provides built-in methods for data normalization. These operations return a new container with the normalized data.

# Z-score normalization (mean=0, std=1) across the time dimension
zscored_eeg = container_eeg.zscore(dim="time")
print(
    f"Z-scored EEG Data:\nMean: {np.mean(zscored_eeg.X):.3f},"
    f"\nStd: {np.std(zscored_eeg.X):.3f}"
)
Z-scored EEG Data:
Mean: -0.000,
Std: 1.000
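
As a sanity check on what z-scoring along one dimension means, the following plain-numpy sketch standardizes each (observation, channel) trace over its time axis. It assumes the population standard deviation (ddof=0), which is consistent with the overall Std of 1.000 printed above but is not confirmed by the example itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 3, 10))

# Statistics are computed over the last axis (time) and kept
# as size-1 axes so they broadcast back against X.
mean = X.mean(axis=-1, keepdims=True)
std = X.std(axis=-1, keepdims=True)  # population std (ddof=0), an assumption

Z = (X - mean) / std

print(f"Mean: {Z.mean():.3f}, Std: {Z.std():.3f}")
```

Since every trace ends up with mean 0 and variance 1, the statistics of the whole array are also 0 and 1, which is why a global check like the one above is sufficient.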

7. Restructuring Dimensions

You can stack and unstack dimensions to change the shape of your data dynamically. Let’s stack Observations and Channels into a single “obs_chan” dimension.

stacked = container_eeg.stack(dims=["obs", "channel"], new_dim="obs_chan")
print(f"Stacked (Obs+Chan): {stacked.shape} dims={stacked.dims}")
Stacked (Obs+Chan): (48, 10) dims=('obs_chan', 'time')

And unstack it back out:

unstacked = stacked.unstack(dim="obs_chan")
print(f"Unstacked back to: {unstacked.shape} dims={unstacked.dims}")
Unstacked back to: (16, 3, 10) dims=('obs', 'channel', 'time')
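
On the raw array, stacking two leading dimensions and unstacking them again corresponds to a pair of reshapes. The sketch below shows that round trip in plain numpy, together with the paired labels a stacked axis would need in order to be unstacked later; the label layout is an assumption made for illustration, not coco-pipe's actual coordinate format.

```python
import numpy as np

X = np.arange(16 * 3 * 10).reshape(16, 3, 10)
n_obs, n_chan, n_time = X.shape
channels = ["Fz", "Cz", "Pz"]

# Stack: merge obs and channel into one axis (channel varies fastest).
stacked = X.reshape(n_obs * n_chan, n_time)

# Paired labels in the same order as the stacked rows.
stacked_labels = [(obs, ch) for obs in range(n_obs) for ch in channels]

# Unstack: restore the original three axes.
unstacked = stacked.reshape(n_obs, n_chan, n_time)

print(stacked.shape, unstacked.shape)
```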

8. Working with Pandas

For standard machine learning pipelines or EDA, you might want to export your observation metadata to a Pandas DataFrame.

df_obs = container_eeg.observation_frame()
print("Observation DataFrame (First 5 rows):")
print(df_obs.head())
Observation DataFrame (First 5 rows):
                 obs          sample_id
0  sub-0_cond-A_ep-0  sub-0_cond-A_ep-0
1  sub-0_cond-A_ep-1  sub-0_cond-A_ep-1
2  sub-0_cond-A_ep-2  sub-0_cond-A_ep-2
3  sub-0_cond-A_ep-3  sub-0_cond-A_ep-3
4  sub-0_cond-B_ep-0  sub-0_cond-B_ep-0
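
Once the metadata is in a DataFrame, structured IDs like the ones used here can be split into separate columns for grouping. The sketch below rebuilds a comparable frame from scratch with pandas alone; the column names subject, condition, and epoch are chosen for illustration and are not produced by observation_frame().

```python
import pandas as pd

# Recreate IDs with the same pattern as the example.
ids = [
    f"sub-{s}_cond-{c}_ep-{e}"
    for s in range(2)
    for c in "AB"
    for e in range(4)
]
df = pd.DataFrame({"sample_id": ids})

# Named groups in the regex become column names in the result.
pattern = r"sub-(?P<subject>\d+)_cond-(?P<condition>[A-Za-z]+)_ep-(?P<epoch>\d+)"
parts = df["sample_id"].str.extract(pattern)
df = pd.concat([df, parts], axis=1)

# Count epochs per subject and condition.
counts = df.groupby(["subject", "condition"]).size()
print(counts)
```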
