{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Models Tutorial\n",
    "\n",
    "**-- On model handling --**\n",
    "\n",
    "In this tutorial, we will cover the following aspects of Perturb-lib machine-learning modeling:\n",
    "\n",
    "- `perturb-lib` models\n",
    "- how to perform model fitting (training time)\n",
    "- how to perform predictions (inference time)\n",
    "- how to load and save trained models\n",
    "- registration of new models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Catboost', 'GlobalMean', 'LPM', 'NoPerturb', 'ReadoutMean']"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import perturb_lib as plib\n",
    "\n",
    "plib.list_models()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us pick two models, one based on Catboost regressor, and the other one based on an LPM:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Catboost:\n",
      " CatBoostRegressor used on top of predefined embeddings.\n",
      "LPM:\n",
      " Large perturbation model.\n",
      "\n",
      "    Args:\n",
      "        embedding_dim: Dimensionality of all embedding layers.\n",
      "        optimizer_name: Name of pytorch optimizer to use.\n",
      "        learning_rate: Learning rate.\n",
      "        learning_rate_decay: Exponential learning rate decay.\n",
      "        num_layers: Depth of the MLP.\n",
      "        hidden_dim: Number of units in the hidden nodes.\n",
      "        batch_size: Size of batches during training.\n",
      "        embedding_aggregation_mode: Defines how to aggregate embeddings.\n",
      "        num_workers: Number of workers to use during data loading.\n",
      "        pin_memory: Whether to pin the memory.\n",
      "        early_stopping_patience: Patience for early stopping in case validation set is given.\n",
      "        lightning_trainer_pars: Parameters for pytorch-lightning.\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "print(\"Catboost:\\n\", plib.describe_model(\"Catboost\"))\n",
    "print(\"LPM:\\n\", plib.describe_model(\"LPM\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now load a model instance. All we need to decide is the model parameters which are fully described in API reference. Let us instantiate two example models:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<perturb_lib.models.collection.baselines.Catboost object at 0x335b26300>\n"
     ]
    }
   ],
   "source": [
    "catboost_model = plib.load_model(\n",
    "    \"Catboost\", model_args={\"learning_rate\": 0.75}\n",
    ")  # the same model signature as for sklearn\n",
    "print(catboost_model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LPM(\n",
      "  (loss): MSELoss()\n",
      "  (predictor): Sequential(\n",
      "    (0): Linear(in_features=192, out_features=512, bias=True)\n",
      "    (1): ReLU()\n",
      "    (2): Dropout(p=0.0, inplace=False)\n",
      "    (3): Linear(in_features=512, out_features=512, bias=True)\n",
      "    (4): ReLU()\n",
      "    (5): Dropout(p=0.0, inplace=False)\n",
      "    (6): Linear(in_features=512, out_features=1, bias=True)\n",
      "  )\n",
      ")\n"
     ]
    }
   ],
   "source": [
    "lpm = plib.load_model(\n",
    "    \"LPM\",\n",
    "    model_args={\n",
    "        \"lightning_trainer_pars\": {\n",
    "            \"accelerator\": \"cpu\",\n",
    "            \"max_epochs\": 1,\n",
    "            \"enable_checkpointing\": False,\n",
    "        },\n",
    "        \"optimizer_name\": \"AdamW\",\n",
    "        \"learning_rate\": 0.002,\n",
    "        \"learning_rate_decay\": 0.98,\n",
    "        \"num_layers\": 2,\n",
    "        \"hidden_dim\": 512,\n",
    "        \"dropout\": 0.0,\n",
    "        \"batch_size\": 5000,\n",
    "        \"embedding_dim\": 64,\n",
    "    },\n",
    ")\n",
    "print(lpm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we set `max_epochs=1` in pytorch lightning trainer to speed up the training. Let us now prepare some training data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "12344962\n"
     ]
    }
   ],
   "source": [
    "pdata = plib.load_plibdata(\"HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22\")\n",
    "traindata, _, _ = plib.split_plibdata_3fold(pdata, context_ids=\"HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22\")\n",
    "print(len(traindata))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "We can now fit LPM:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "GPU available: True (mps), used: False\n",
      "TPU available: False, using: 0 TPU cores\n",
      "HPU available: False, using: 0 HPUs\n",
      "/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.\n",
      "16:03:26 | INFO | Fitting LPM..\n",
      "/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.\n",
      "\n",
      "  | Name                    | Type         | Params | Mode \n",
      "-----------------------------------------------------------------\n",
      "0 | loss                    | MSELoss      | 0      | train\n",
      "1 | predictor               | Sequential   | 361 K  | train\n",
      "2 | context_embedding_layer | Embedding    | 64     | train\n",
      "3 | perturb_embedding_layer | EmbeddingBag | 92.3 K | train\n",
      "4 | readout_embedding_layer | Embedding    | 547 K  | train\n",
      "-----------------------------------------------------------------\n",
      "1.0 M     Trainable params\n",
      "0         Non-trainable params\n",
      "1.0 M     Total params\n",
      "4.009     Total estimated model params size (MB)\n",
      "12        Modules in train mode\n",
      "0         Modules in eval mode\n",
      "/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=10` in the `DataLoader` to improve performance.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e52f2a3970724cbd82406dd10b3e9aad",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Training: |                                                                                                   …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "`Trainer.fit` stopped: `max_epochs=1` reached.\n",
      "16:05:32 | INFO | Cleaning up...\n",
      "16:05:32 | INFO | Model fitting completed\n"
     ]
    }
   ],
   "source": [
    "lpm.fit(traindata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "For the sake of simplicity, let's make the predictions on the same set of data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.03736594  0.02409402  0.03611039 ...  0.00019517 -0.03289356\n",
      "  0.02492389]\n"
     ]
    }
   ],
   "source": [
    "print(lpm.predict(traindata))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Introducing new models is also trivial:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "@plib.register_model\n",
    "class CoolModel(plib.ModelMixin):\n",
    "    def fit(self, traindata: plib.PlibData, valdata: plib.PlibData = None):\n",
    "        pass\n",
    "\n",
    "    def predict(self, data_x: plib.PlibData):\n",
    "        return np.zeros(len(data_x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"CoolModel\" in plib.list_models()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}