{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Models Tutorial\n", "\n", "**-- On model handling --**\n", "\n", "In this tutorial, we will cover the following aspects of Perturb-lib machine-learning modeling:\n", "\n", "- `perturb-lib` models\n", "- how to perform model fitting (training time)\n", "- how to perform predictions (inference time)\n", "- how to load and save trained models\n", "- how to register new models" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Catboost', 'GlobalMean', 'LPM', 'NoPerturb', 'ReadoutMean']" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import perturb_lib as plib\n", "\n", "plib.list_models()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us pick two of the listed models, one based on a CatBoost regressor and the other on an LPM (large perturbation model), and print their descriptions:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Catboost:\n", " CatBoostRegressor used on top of predefined embeddings.\n", "LPM:\n", " Large perturbation model.\n", "\n", " Args:\n", " embedding_dim: Dimensionality of all embedding layers.\n", " optimizer_name: Name of pytorch optimizer to use.\n", " learning_rate: Learning rate.\n", " learning_rate_decay: Exponential learning rate decay.\n", " num_layers: Depth of the MLP.\n", " hidden_dim: Number of units in the hidden nodes.\n", " batch_size: Size of batches during training.\n", " embedding_aggregation_mode: Defines how to aggregate embeddings.\n", " num_workers: Number of workers to use during data loading.\n", " pin_memory: Whether to pin the memory.\n", " early_stopping_patience: Patience for early stopping in case validation set is given.\n", " lightning_trainer_pars: Parameters for pytorch-lightning.\n", " \n" ] } ], "source": [ "print(\"Catboost:\\n\", plib.describe_model(\"Catboost\"))\n", "print(\"LPM:\\n\", 
plib.describe_model(\"LPM\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now load a model instance. All we need to decide on are the model parameters, which are fully described in the API reference. Let us instantiate two example models:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "catboost_model = plib.load_model(\n", "    \"Catboost\", model_args={\"learning_rate\": 0.75}\n", ")  # the same model signature as for sklearn\n", "print(catboost_model)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LPM(\n", " (loss): MSELoss()\n", " (predictor): Sequential(\n", " (0): Linear(in_features=192, out_features=512, bias=True)\n", " (1): ReLU()\n", " (2): Dropout(p=0.0, inplace=False)\n", " (3): Linear(in_features=512, out_features=512, bias=True)\n", " (4): ReLU()\n", " (5): Dropout(p=0.0, inplace=False)\n", " (6): Linear(in_features=512, out_features=1, bias=True)\n", " )\n", ")\n" ] } ], "source": [ "lpm = plib.load_model(\n", "    \"LPM\",\n", "    model_args={\n", "        \"lightning_trainer_pars\": {\n", "            \"accelerator\": \"cpu\",\n", "            \"max_epochs\": 1,\n", "            \"enable_checkpointing\": False,\n", "        },\n", "        \"optimizer_name\": \"AdamW\",\n", "        \"learning_rate\": 0.002,\n", "        \"learning_rate_decay\": 0.98,\n", "        \"num_layers\": 2,\n", "        \"hidden_dim\": 512,\n", "        \"dropout\": 0.0,\n", "        \"batch_size\": 5000,\n", "        \"embedding_dim\": 64,\n", "    },\n", ")\n", "print(lpm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we set `max_epochs=1` in the PyTorch Lightning trainer parameters (`lightning_trainer_pars`) to keep training short. 
Let us now prepare some training data:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "12344962\n" ] } ], "source": [ "pdata = plib.load_plibdata(\"HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22\")\n", "traindata, _, _ = plib.split_plibdata_3fold(pdata, context_ids=\"HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22\")\n", "print(len(traindata))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "We can now fit LPM:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "GPU available: True (mps), used: False\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.\n", "16:03:26 | INFO | Fitting LPM..\n", "/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. 
Skipping val loop.\n", "\n", " | Name | Type | Params | Mode \n", "-----------------------------------------------------------------\n", "0 | loss | MSELoss | 0 | train\n", "1 | predictor | Sequential | 361 K | train\n", "2 | context_embedding_layer | Embedding | 64 | train\n", "3 | perturb_embedding_layer | EmbeddingBag | 92.3 K | train\n", "4 | readout_embedding_layer | Embedding | 547 K | train\n", "-----------------------------------------------------------------\n", "1.0 M Trainable params\n", "0 Non-trainable params\n", "1.0 M Total params\n", "4.009 Total estimated model params size (MB)\n", "12 Modules in train mode\n", "0 Modules in eval mode\n", "/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=10` in the `DataLoader` to improve performance.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e52f2a3970724cbd82406dd10b3e9aad", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Training: | …" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "`Trainer.fit` stopped: `max_epochs=1` reached.\n", "16:05:32 | INFO | Cleaning up...\n", "16:05:32 | INFO | Model fitting completed\n" ] } ], "source": [ "lpm.fit(traindata)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "For the sake of simplicity, let's make the predictions on the same set of data:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0.03736594 0.02409402 0.03611039 ... 
 0.00019517 -0.03289356\n", " 0.02492389]\n" ] } ], "source": [ "print(lpm.predict(traindata))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "Introducing new models is also straightforward: subclass `plib.ModelMixin`, implement `fit` and `predict`, and decorate the class with `@plib.register_model`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import numpy as np\n", "\n", "\n", "@plib.register_model\n", "class CoolModel(plib.ModelMixin):\n", "    def fit(self, traindata: plib.PlibData, valdata: plib.PlibData | None = None):\n", "        # Nothing to learn: this toy model always predicts zero.\n", "        pass\n", "\n", "    def predict(self, data_x: plib.PlibData):\n", "        # Return one prediction per data point.\n", "        return np.zeros(len(data_x))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"CoolModel\" in plib.list_models()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 4 }