Models Tutorial
– On model handling –
In this tutorial, we will cover the following aspects of Perturb-lib machine-learning modeling:
- Perturb-lib models
- how to perform model fitting (training time) 
- how to perform predictions (inference time) 
- how to load and save trained models 
- registration of new models 
[1]:
import perturb_lib as plib
plib.list_models()
[1]:
['Catboost', 'GlobalMean', 'LPM', 'NoPerturb', 'ReadoutMean']
Let us pick two models, one based on the CatBoost regressor and the other on a large perturbation model (LPM):
[2]:
print("Catboost:\n", plib.describe_model("Catboost"))
print("LPM:\n", plib.describe_model("LPM"))
Catboost:
 CatBoostRegressor used on top of predefined embeddings.
LPM:
 Large perturbation model.
    Args:
        embedding_dim: Dimensionality of all embedding layers.
        optimizer_name: Name of pytorch optimizer to use.
        learning_rate: Learning rate.
        learning_rate_decay: Exponential learning rate decay.
        num_layers: Depth of the MLP.
        hidden_dim: Number of units in the hidden nodes.
        batch_size: Size of batches during training.
        embedding_aggregation_mode: Defines how to aggregate embeddings.
        num_workers: Number of workers to use during data loading.
        pin_memory: Whether to pin the memory.
        early_stopping_patience: Patience for early stopping in case validation set is given.
        lightning_trainer_pars: Parameters for pytorch-lightning.
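The `learning_rate_decay` argument is described as an exponential decay. As a rough sketch of what that usually means, assuming the factor is applied once per epoch (the description above does not say when the decay step happens), the learning rate after `t` steps is `learning_rate * learning_rate_decay ** t`. The helper `decayed_learning_rate` is hypothetical and not part of Perturb-lib:

```python
def decayed_learning_rate(initial_lr: float, decay: float, epoch: int) -> float:
    """Learning rate after `epoch` decay steps (hypothetical helper)."""
    return initial_lr * decay**epoch


# Illustrative values only; assumes one decay step per epoch.
schedule = [decayed_learning_rate(0.002, 0.98, epoch) for epoch in range(4)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

With a decay factor close to 1 the learning rate shrinks slowly, so the setting only has a visible effect when training runs for many epochs.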
We can now load a model instance. All we need to decide on are the model parameters, which are fully described in the API reference. Let us instantiate two example models:
[8]:
catboost_model = plib.load_model(
    "Catboost", model_args={"learning_rate": 0.75}
)  # the same model signature as for sklearn
print(catboost_model)
<perturb_lib.models.collection.baselines.Catboost object at 0x335b26300>
[9]:
lpm = plib.load_model(
    "LPM",
    model_args={
        "lightning_trainer_pars": {
            "accelerator": "cpu",
            "max_epochs": 1,
            "enable_checkpointing": False,
        },
        "optimizer_name": "AdamW",
        "learning_rate": 0.002,
        "learning_rate_decay": 0.98,
        "num_layers": 2,
        "hidden_dim": 512,
        "dropout": 0.0,
        "batch_size": 5000,
        "embedding_dim": 64,
    },
)
print(lpm)
LPM(
  (loss): MSELoss()
  (predictor): Sequential(
    (0): Linear(in_features=192, out_features=512, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.0, inplace=False)
    (3): Linear(in_features=512, out_features=512, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.0, inplace=False)
    (6): Linear(in_features=512, out_features=1, bias=True)
  )
)
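The printed architecture can be sanity-checked by hand. The sketch below counts the weights and biases of the three `Linear` layers shown above; a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out + n_out` parameters. The input width of 192 equals 3 × 64, which would be consistent with three 64-dimensional embeddings being concatenated, although that reading is an assumption and not something the output states. The helper `linear_params` is hypothetical:

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Parameter count of a fully connected layer with bias (hypothetical helper)."""
    return n_in * n_out + n_out


# (in_features, out_features) of the Linear layers in the printed predictor
layer_shapes = [(192, 512), (512, 512), (512, 1)]
predictor_params = sum(linear_params(n_in, n_out) for n_in, n_out in layer_shapes)
print(predictor_params)  # 361985
```

This agrees with the roughly 361 K parameters that PyTorch Lightning reports for `predictor` in the fitting log.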
Note that we set max_epochs=1 in the PyTorch Lightning trainer to speed up training. Let us now prepare some training data:
[10]:
pdata = plib.load_plibdata("HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22")
traindata, _, _ = plib.split_plibdata_3fold(pdata, context_ids="HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22")
print(len(traindata))
12344962
We can now fit LPM:
[11]:
lpm.fit(traindata)
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
16:03:26 | INFO | Fitting LPM..
/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
  | Name                    | Type         | Params | Mode
-----------------------------------------------------------------
0 | loss                    | MSELoss      | 0      | train
1 | predictor               | Sequential   | 361 K  | train
2 | context_embedding_layer | Embedding    | 64     | train
3 | perturb_embedding_layer | EmbeddingBag | 92.3 K | train
4 | readout_embedding_layer | Embedding    | 547 K  | train
-----------------------------------------------------------------
1.0 M     Trainable params
0         Non-trainable params
1.0 M     Total params
4.009     Total estimated model params size (MB)
12        Modules in train mode
0         Modules in eval mode
/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=10` in the `DataLoader` to improve performance.
`Trainer.fit` stopped: `max_epochs=1` reached.
16:05:32 | INFO | Cleaning up...
16:05:32 | INFO | Model fitting completed
For simplicity, let us make predictions on the same data we trained on:
[12]:
print(lpm.predict(traindata))
[ 0.03736594  0.02409402  0.03611039 ...  0.00019517 -0.03289356
  0.02492389]
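The printed output suggests that `predict` returns a NumPy array with one value per data point, so the result can be compared against observed values with ordinary array operations. The sketch below computes the mean squared error, the quantity that the `MSELoss` in the model summary measures. Both arrays are made-up stand-ins: how to extract observed values from a `PlibData` object is not covered here, and `mean_squared_error` is a hypothetical helper.

```python
import numpy as np


def mean_squared_error(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Average squared difference between predictions and targets."""
    return float(np.mean((predictions - targets) ** 2))


# Made-up stand-in values, not real model output or measurements
predictions = np.array([0.04, 0.02, -0.03])
targets = np.array([0.05, 0.00, -0.01])
mse = mean_squared_error(predictions, targets)
print(mse)
```

Keep in mind that an error computed on the training data says little about how well a model generalizes; a held-out split is the better yardstick.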
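Introducing a new model is also straightforward: subclass plib.ModelMixin, implement fit and predict, and decorate the class with plib.register_model: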
[13]:
import numpy as np


@plib.register_model
class CoolModel(plib.ModelMixin):
    def fit(self, traindata: plib.PlibData, valdata: plib.PlibData | None = None):
        pass  # nothing to learn: this toy model always predicts zero

    def predict(self, data_x: plib.PlibData):
        # return one prediction per data point
        return np.zeros(len(data_x))
[14]:
"CoolModel" in plib.list_models()
[14]:
True
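A class decorator such as `plib.register_model` is commonly built on a name-to-class registry. The sketch below shows that general pattern in plain Python. It is not Perturb-lib's actual implementation: `_MODEL_REGISTRY`, `register_model`, `list_models`, `load_model`, and `ConstantModel` are standalone stand-ins defined here purely for illustration.

```python
from typing import Optional

# Maps a model name to the class that implements it
_MODEL_REGISTRY: dict = {}


def register_model(cls: type) -> type:
    """Store the class under its own name and return it unchanged."""
    _MODEL_REGISTRY[cls.__name__] = cls
    return cls


def list_models() -> list:
    """Return the registered model names in alphabetical order."""
    return sorted(_MODEL_REGISTRY)


def load_model(name: str, model_args: Optional[dict] = None):
    """Instantiate a registered model, passing model_args as keyword arguments."""
    return _MODEL_REGISTRY[name](**(model_args or {}))


@register_model
class ConstantModel:
    def __init__(self, fill_value: float = 0.0):
        self.fill_value = fill_value

    def predict(self, data_x) -> list:
        # Predict the same constant for every data point
        return [self.fill_value for _ in data_x]


model = load_model("ConstantModel", model_args={"fill_value": 2.0})
print(list_models(), model.predict(range(3)))
```

Because the decorator returns the class unchanged, the registered class can still be used directly, while the registry makes it discoverable by name.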