Models Tutorial
– On model handling –
In this tutorial, we will cover the following aspects of Perturb-lib machine-learning modeling:

- perturb-lib models
- how to perform model fitting (training time)
- how to perform predictions (inference time)
- how to load and save trained models
- registration of new models
[1]:
import perturb_lib as plib
plib.list_models()
[1]:
['Catboost', 'GlobalMean', 'LPM', 'NoPerturb', 'ReadoutMean']
Let us pick two models: one based on the CatBoost regressor, and the other on an LPM (large perturbation model):
[2]:
print("Catboost:\n", plib.describe_model("Catboost"))
print("LPM:\n", plib.describe_model("LPM"))
Catboost:
CatBoostRegressor used on top of predefined embeddings.
LPM:
Large perturbation model.
Args:
    embedding_dim: Dimensionality of all embedding layers.
    optimizer_name: Name of pytorch optimizer to use.
    learning_rate: Learning rate.
    learning_rate_decay: Exponential learning rate decay.
    num_layers: Depth of the MLP.
    hidden_dim: Number of units in the hidden nodes.
    batch_size: Size of batches during training.
    embedding_aggregation_mode: Defines how to aggregate embeddings.
    num_workers: Number of workers to use during data loading.
    pin_memory: Whether to pin the memory.
    early_stopping_patience: Patience for early stopping in case validation set is given.
    lightning_trainer_pars: Parameters for pytorch-lightning.
We can now load a model instance. All we need to decide on are the model parameters, which are fully described in the API reference. Let us instantiate two example models:
[8]:
catboost_model = plib.load_model(
    "Catboost", model_args={"learning_rate": 0.75}
)  # the same model signature as for sklearn
print(catboost_model)
<perturb_lib.models.collection.baselines.Catboost object at 0x335b26300>
[9]:
lpm = plib.load_model(
    "LPM",
    model_args={
        "lightning_trainer_pars": {
            "accelerator": "cpu",
            "max_epochs": 1,
            "enable_checkpointing": False,
        },
        "optimizer_name": "AdamW",
        "learning_rate": 0.002,
        "learning_rate_decay": 0.98,
        "num_layers": 2,
        "hidden_dim": 512,
        "dropout": 0.0,
        "batch_size": 5000,
        "embedding_dim": 64,
    },
)
print(lpm)
LPM(
  (loss): MSELoss()
  (predictor): Sequential(
    (0): Linear(in_features=192, out_features=512, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.0, inplace=False)
    (3): Linear(in_features=512, out_features=512, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.0, inplace=False)
    (6): Linear(in_features=512, out_features=1, bias=True)
  )
)
Note that we set max_epochs=1 in the PyTorch Lightning trainer to speed up training. Also note in_features=192 in the first Linear layer, which is consistent with the three 64-dimensional embeddings (context, perturbation, and readout) being aggregated by concatenation before the MLP. Let us now prepare some training data:
[10]:
pdata = plib.load_plibdata("HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22")
traindata, _, _ = plib.split_plibdata_3fold(
    pdata, context_ids="HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22"
)
print(len(traindata))
12344962
We can now fit the LPM:
[11]:
lpm.fit(traindata)
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
16:03:26 | INFO | Fitting LPM..
/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
| Name | Type | Params | Mode
-----------------------------------------------------------------
0 | loss | MSELoss | 0 | train
1 | predictor | Sequential | 361 K | train
2 | context_embedding_layer | Embedding | 64 | train
3 | perturb_embedding_layer | EmbeddingBag | 92.3 K | train
4 | readout_embedding_layer | Embedding | 547 K | train
-----------------------------------------------------------------
1.0 M Trainable params
0 Non-trainable params
1.0 M Total params
4.009 Total estimated model params size (MB)
12 Modules in train mode
0 Modules in eval mode
/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=10` in the `DataLoader` to improve performance.
`Trainer.fit` stopped: `max_epochs=1` reached.
16:05:32 | INFO | Cleaning up...
16:05:32 | INFO | Model fitting completed
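Note that fit also accepts an optional validation set (the ModelMixin signature in the registration example below makes this explicit), which is what early_stopping_patience in the LPM arguments refers to. As a sketch, assuming the second value returned by split_plibdata_3fold is the validation split:
[ ]:
# Sketch only: pass a validation split so that early stopping can trigger.
# Assumes split_plibdata_3fold returns (train, validation, test) in that order.
traindata, valdata, _ = plib.split_plibdata_3fold(
    pdata, context_ids="HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22"
)
lpm.fit(traindata, valdata=valdata)
In the run above we trained without a validation set to keep things fast.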
For the sake of simplicity, let us make predictions on the same data we trained on:
[12]:
print(lpm.predict(traindata))
[ 0.03736594 0.02409402 0.03611039 ... 0.00019517 -0.03289356
0.02492389]
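The overview also promised saving and loading of trained models. Perturb-lib may provide dedicated helpers for this (check the API reference); as a minimal sketch, a fitted model object can be persisted with Python's standard pickle module:
[ ]:
import pickle

# Minimal sketch using only the standard library; perturb-lib may ship
# its own save/load utilities, so consult the API reference first.
with open("lpm.pkl", "wb") as f:
    pickle.dump(lpm, f)  # save the trained model

with open("lpm.pkl", "rb") as f:
    lpm_restored = pickle.load(f)  # load it back

print(lpm_restored.predict(traindata)[:3])  # should match the original model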
Registering new models is also straightforward:
[13]:
import numpy as np


@plib.register_model
class CoolModel(plib.ModelMixin):
    def fit(self, traindata: plib.PlibData, valdata: plib.PlibData | None = None):
        pass  # a real model would estimate its parameters from traindata here

    def predict(self, data_x: plib.PlibData):
        return np.zeros(len(data_x))  # predicts zero effect for every sample
[14]:
"CoolModel" in plib.list_models()
[14]:
True
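Once registered, the new model can be loaded, fitted, and used for prediction through exactly the same interface as the built-in models. A quick sketch (assuming model_args may be omitted for a model that takes no parameters):
[ ]:
cool_model = plib.load_model("CoolModel")
cool_model.fit(traindata)
print(cool_model.predict(traindata)[:5])  # all zeros, by construction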