OnDiskPlibData

class OnDiskPlibData(data=None, data_sources=None, path_to_data_sources=None, columns=None)[source]

Bases: PlibData[OutT], Generic[OutT]

Class for handling on-disk data. Implemented using pytables backend.

    +====================+
    |   OnDiskPlibData   |
    |                    | --- __getitem__ ----->
    |                    | --- __iter__ -------->
    |                    |
    +====================+
              |
              v
    _iterate_shards (internal)
              |
              v
    +-------------------+
    |   ShuffleBuffer   |
    +-------------------+
              |
    __iter__ (produce batches)
              |
              v
    +-------------------+
    |    DataLoader     |
    +-------------------+
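The path from shards through a shuffle buffer can be illustrated with a small, self-contained sketch. This is not the library's implementation: `iterate_shards`, `shuffle_buffer`, `buffer_size`, and `seed` are hypothetical names, and the sketch only shows the general technique of approximately shuffling a sequentially read stream while keeping a bounded number of samples in memory.

```python
import random
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iterate_shards(shards: Sequence[Sequence[T]]) -> Iterator[T]:
    """Yield samples shard by shard, in stored order (hypothetical stand-in)."""
    for shard in shards:
        yield from shard


def shuffle_buffer(samples: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size samples."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for sample in samples:
        if len(buffer) < buffer_size:
            # Still filling the buffer: store the sample and read on.
            buffer.append(sample)
            continue
        # Buffer is full: emit a random resident and reuse its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = sample
    # Input exhausted: drain the remaining samples in random order.
    rng.shuffle(buffer)
    yield from buffer


shards = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
shuffled = list(shuffle_buffer(iterate_shards(shards), buffer_size=4))
```

In this sketch a larger `buffer_size` gives a shuffle closer to uniform at the cost of memory, while `buffer_size=1` degenerates to sequential order.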

apply_transform(transform)[source]

Apply a transformation to the data.

Return type:

OnDiskPlibData[TypeVar(NewOutT)]
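The return type indicates that a transformation may change the item type from `OutT` to `NewOutT`. A minimal sketch of that typing pattern follows. `TransformedData` is a hypothetical in-memory stand-in, and the lazy, composing behaviour shown here is an assumption rather than something the documentation above states.

```python
from typing import Callable, Generic, Iterator, List, TypeVar

OutT = TypeVar("OutT")
NewOutT = TypeVar("NewOutT")


class TransformedData(Generic[OutT]):
    """Hypothetical stand-in for a dataset whose items have type OutT."""

    def __init__(self, rows: List[dict], transform: Callable[[dict], OutT]) -> None:
        self._rows = rows
        self._transform = transform

    def apply_transform(
        self, transform: Callable[[OutT], NewOutT]
    ) -> "TransformedData[NewOutT]":
        """Return a new object whose items are transform(previous item)."""
        previous = self._transform
        # Assumption: transforms compose lazily and the rows are left untouched.
        return TransformedData(self._rows, lambda row: transform(previous(row)))

    def __getitem__(self, index: int) -> OutT:
        return self._transform(self._rows[index])

    def __iter__(self) -> Iterator[OutT]:
        return (self._transform(row) for row in self._rows)


raw = TransformedData([{"x": 1, "y": 2}, {"x": 3, "y": 4}], lambda row: row)
sums = raw.apply_transform(lambda row: row["x"] + row["y"])  # items are now int
labels = sums.apply_transform(lambda total: f"sum={total}")  # items are now str
```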

property columns: list[str]

The list of column names.

property dtypes: dict

Dictionary of data types.

get_data_loader(batch_size, num_workers=0, pin_memory=False, shuffle=False)[source]

Fetch a torch-style data loader for batch sampling.

Parameters:
  • batch_size (Optional[int]) – The size of a batch to fetch in each iteration. If None, shards are returned directly.

  • num_workers (int) – Number of PyTorch workers.

  • pin_memory (bool) – If true, copy tensors into device pinned memory before returning them.

  • shuffle (bool) – If false, samples are drawn sequentially to form batches. If true, samples are shuffled.

Return type:

DataLoader[TypeVar(OutT)]

Returns:

An instance of torch.utils.data.DataLoader.
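The two modes of `batch_size` with `shuffle=False` can be sketched without the library. `iter_batches` is a hypothetical helper, not part of the API above. Two details are assumptions made for the sketch: batches may span shard boundaries, and a final smaller batch is kept rather than dropped.

```python
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(
    shards: Sequence[Sequence[T]], batch_size: Optional[int]
) -> Iterator[List[T]]:
    """Yield whole shards when batch_size is None, else sequential batches."""
    if batch_size is None:
        # No batching requested: hand each shard back as stored.
        for shard in shards:
            yield list(shard)
        return
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer or None")
    batch: List[T] = []
    for shard in shards:
        for sample in shard:
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        # Assumption: the last, smaller batch is kept rather than dropped.
        yield batch


shards = [[0, 1, 2], [3, 4], [5, 6, 7]]
whole_shards = list(iter_batches(shards, batch_size=None))
batches = list(iter_batches(shards, batch_size=3))
```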

init_from_files(path_to_data_sources, data_sources)[source]

Initializes PlibData from multiple files.

Return type:

DataFrame

subset_columnwise(columns)[source]

Select a subset of existing columns.

Parameters:

columns (list[str]) – The names of columns to keep

Return type:

Self
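A `Self` return type means the result has the same class as the object the method was called on. The sketch below illustrates that pattern with a hypothetical in-memory class, `ColumnTable`. It assumes that a new object is returned, the original is left unchanged, and unknown column names raise an error; none of these behaviours is confirmed by the documentation above.

```python
from typing import Dict, List


class ColumnTable:
    """Hypothetical in-memory stand-in holding one list of values per column."""

    def __init__(self, data: Dict[str, list]) -> None:
        self._data = dict(data)

    @property
    def columns(self) -> List[str]:
        """The list of column names, in insertion order."""
        return list(self._data)

    def subset_columnwise(self, columns: List[str]) -> "ColumnTable":
        """Return a new object of the same class with only the requested columns."""
        missing = [name for name in columns if name not in self._data]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        # type(self) keeps subclasses intact, mirroring a Self return type.
        return type(self)({name: self._data[name] for name in columns})


table = ColumnTable({"id": [1, 2], "price": [9.5, 3.0], "label": ["a", "b"]})
subset = table.subset_columnwise(["id", "label"])
```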