OnDiskPlibData

class OnDiskPlibData(data=None, data_sources=None, path_to_data_sources=None, columns=None)

Bases: PlibData[OutT], Generic[OutT]

Class for handling on-disk data. Implemented using a PyTables backend.

    +====================+
    |   OnDiskPlibData   |
    |                    | --- __getitem__ --->
    |                    | --- __iter__ ------>
    |                    |
    +====================+
              |
              v
    _iterate_shards (internal)
              |
              v
    +-------------------+
    |   ShuffleBuffer   |
    +-------------------+
              |
    __iter__ (produce batches)
              |
              v
    +-------------------+
    |    DataLoader     |
    +-------------------+
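The flow in the diagram can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: `iterate_shards` and `shuffle_buffer` are hypothetical stand-ins for `_iterate_shards` and `ShuffleBuffer`, shards are modeled as plain lists, and the `buffer_size` and `seed` parameters are assumptions made for the example.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iterate_shards(shards: Iterable[List[T]]) -> Iterator[T]:
    """Yield samples shard by shard (hypothetical stand-in for _iterate_shards)."""
    for shard in shards:
        yield from shard


def shuffle_buffer(samples: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximate shuffling with bounded memory (hypothetical stand-in for ShuffleBuffer)."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            # Once the buffer is full, swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain the remaining samples in random order.
    rng.shuffle(buffer)
    yield from buffer


shards = [[0, 1, 2], [3, 4, 5], [6, 7]]
shuffled = list(shuffle_buffer(iterate_shards(shards), buffer_size=4))
```

A bounded buffer of this kind trades shuffle quality for memory: only `buffer_size` samples are held at once, so data that does not fit in memory can still be read in a partly randomized order.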

apply_transform(transform)

Apply a transformation to the data.

Return type: OnDiskPlibData[TypeVar(NewOutT)]
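The return type indicates that a transform may change the item type from OutT to NewOutT. The sketch below illustrates that typing pattern with a hypothetical `TransformedData` class. It is an assumption-laden stand-in, not the library's code; in particular, whether the real method applies the transform lazily or eagerly is not stated here, and the sketch chooses lazy composition.

```python
from typing import Callable, Generic, Iterator, List, TypeVar

OutT = TypeVar("OutT")
NewOutT = TypeVar("NewOutT")


class TransformedData(Generic[OutT]):
    """Hypothetical stand-in for a dataset whose items have type OutT."""

    def __init__(self, rows: List[dict], transform: Callable[[dict], OutT]) -> None:
        self._rows = rows
        self._transform = transform

    def apply_transform(
        self, transform: Callable[[OutT], NewOutT]
    ) -> "TransformedData[NewOutT]":
        # Compose the new transform with the existing one; nothing is computed yet.
        previous = self._transform
        return TransformedData(self._rows, lambda row: transform(previous(row)))

    def __iter__(self) -> Iterator[OutT]:
        # Transforms run only when items are actually requested.
        for row in self._rows:
            yield self._transform(row)


data = TransformedData([{"x": 1}, {"x": 2}], transform=lambda row: row)
squares = data.apply_transform(lambda row: row["x"] ** 2)
labels = squares.apply_transform(str)
```

In this sketch each call returns a new object, so `data` still yields the original rows while `squares` and `labels` yield transformed items.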

property columns: list[str]

The list of column names.

property dtypes: dict

Dictionary of data types.

get_data_loader(batch_size, num_workers=0, pin_memory=False, shuffle=False)

Fetch a torch-style data loader for batch sampling.

Parameters:

batch_size (Optional[int]) – The size of a batch to fetch in each iteration. If None, shards are returned directly.
num_workers (int) – Number of PyTorch workers.
pin_memory (bool) – If true, copy tensors into device pinned memory before returning them.
shuffle (bool) – If false, samples are drawn sequentially to form batches. If true, samples are shuffled.

Return type:

DataLoader[TypeVar(OutT)]

Returns:

An instance of torch.utils.data.DataLoader.
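The documented `batch_size` and `shuffle` semantics can be sketched without the library. The helper `iter_batches` below is hypothetical and only mirrors the parameter descriptions: `batch_size=None` yields shards unchanged, otherwise samples are grouped into batches, either in order or shuffled. How the real class shuffles is not shown; the sketch assumes a full in-memory shuffle with a fixed `seed`, purely for illustration.

```python
import random
from typing import Iterator, List, Optional, Sequence


def iter_batches(
    shards: Sequence[List[int]],
    batch_size: Optional[int],
    shuffle: bool = False,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Hypothetical helper mirroring the documented parameters."""
    if batch_size is None:
        # No batching: hand back each shard as-is.
        yield from shards
        return
    samples = [sample for shard in shards for sample in shard]
    if shuffle:
        # Assumption: a full in-memory shuffle, for illustration only.
        random.Random(seed).shuffle(samples)
    for start in range(0, len(samples), batch_size):
        # The last batch may be shorter than batch_size.
        yield samples[start:start + batch_size]


shards = [[0, 1, 2], [3, 4, 5], [6, 7]]
as_shards = list(iter_batches(shards, batch_size=None))
sequential = list(iter_batches(shards, batch_size=4))
shuffled = list(iter_batches(shards, batch_size=3, shuffle=True))
```

Here `as_shards` keeps the shard boundaries, `sequential` regroups the samples into batches of four in their original order, and `shuffled` contains the same samples in a permuted order.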

init_from_files(path_to_data_sources, data_sources)

Initialize PlibData from multiple files.

Return type: DataFrame

subset_columnwise(columns)

Select a subset of existing columns.

Parameters: columns (list[str]) – The names of columns to keep.
Return type: Self
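A column subset that returns the same type as the receiver can be sketched as follows. `ColumnTable` is a hypothetical in-memory stand-in, not the on-disk class. Two choices in it are assumptions: returning a new object rather than modifying the receiver, and raising `KeyError` for unknown column names.

```python
from typing import Dict, List


class ColumnTable:
    """Hypothetical in-memory stand-in with named columns."""

    def __init__(self, data: Dict[str, list]) -> None:
        self._data = data

    @property
    def columns(self) -> List[str]:
        return list(self._data)

    def subset_columnwise(self, columns: List[str]) -> "ColumnTable":
        missing = [name for name in columns if name not in self._data]
        if missing:
            # Assumption: unknown names are treated as an error.
            raise KeyError(f"unknown columns: {missing}")
        # type(self) keeps the result the same class as the receiver (the "Self" idea).
        return type(self)({name: self._data[name] for name in columns})


table = ColumnTable({"id": [1, 2], "score": [0.5, 0.9], "label": ["a", "b"]})
subset = table.subset_columnwise(["id", "label"])
```

In this sketch `subset.columns` contains only the requested names in the requested order, and `table` is left unchanged.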