{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# Data Tutorial\n",
    "\n",
    "**-- On data handling --**\n",
    "\n",
    "One of the contributions of `perturb-lib` is to provide users with *one-liners to manage available perturbation data*. This tutorial covers:\n",
    "\n",
    "- Browsing through available data.\n",
    "- Loading and handling data in ``AnnData`` format suitable for exploration, filtering, and preprocessing.\n",
    "- Loading and handling data in built-in ``PlibData`` format suitable for machine-learning operations.\n",
    "\n",
    "We start by listing currently available data. The basic unit of data in `perturb-lib` is `context`. We assume that each `context` collects the data produced within a specific experimental context and without confounders within a single context such as batch effects. Larger datasets that are given in batches across which differences are considerable (think statistically significant), are split into contexts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "jupyter": {
     "is_executing": true
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['DummyData',\n",
       " 'DummyDataLongStrings',\n",
       " 'HumanCellLine_1HAE_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_A375_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_A375_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_A549_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_A549_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_AGS_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_ASC_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_BICR6_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_BJAB_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_BT20_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_CD34_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_DANG_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_ES2_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_HA1E_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HAP1_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HBL1_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HCC1806_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_HCC515_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HCT116_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HEK293_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HELA_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HEPG2_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HFL1_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HME1_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HPTEC_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HS578T_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HS944T_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_HT29_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HT29_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_HUES3_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HUH7_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_HUVEC_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_IMR90_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_IPC298_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_JURKAT_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_Jurkat_GrowthScreen_Horlbeck18',\n",
       " 'HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22',\n",
       " 'HumanCellLine_K562_GrowthScreen_Horlbeck18',\n",
       " 'HumanCellLine_K562_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_KELLY_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_KMS34_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_KYSE30_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_LCLC103H_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_LNCAP_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_MCF10A_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_MCF7_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_MCF7_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_MDAMB231_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_MINO_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_MNEU_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_NALM6_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_NKDBA_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_NL20_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_NPC_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_OCILY10_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_OCILY19_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_OCILY3_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_P1A82_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_PC3_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_PC3_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_PHH_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_RPE1_10xChromium3-scRNA-seq_Replogle22',\n",
       " 'HumanCellLine_SHSY5Y_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_SKBR3_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_SKB_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_SKL_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_SKMEL5_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_SKNSH_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_SNGM_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_SNU761_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_THP1_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_TMD8_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_U251MG_L1000-RNA-seq_LINCS-CMap2020_XPR',\n",
       " 'HumanCellLine_U2OS_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_VCAP_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_WA09_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_WI38_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.L10_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P026_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P031_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P033_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P091_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P092_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P901_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P904_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P905_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P906_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P907_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P908_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P909_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P910_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P911_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P912_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P914_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P915_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P922_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P930_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P931_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P932_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P933_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P934_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P935_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.P936_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_XC.R10_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_YAPC_L1000-RNA-seq_LINCS-CMap2020_CMP',\n",
       " 'HumanCellLine_YAPC_L1000-RNA-seq_LINCS-CMap2020_XPR']"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import perturb_lib as plib\n",
    "\n",
    "plib.list_contexts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note the naming logic: The name of the context contains details that describe the experimental context as closely as possible. The model system is mentioned, the technology, the type of readouts, as well as the name that is specific to the dataset used. To get better understanding of individual contexts, e.g. to identify the type of outcome measured in the particular experiment (gene expression, cell viability, etc.), we can obtain the high-level description as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Human cell line K562 prepared for perturbation analysis of essential genes as described in`2022 Replogle et al. <https://pubmed.ncbi.nlm.nih.gov/35688146>`_.\n"
     ]
    }
   ],
   "source": [
    "context = \"HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22\"\n",
    "print(plib.describe_context(context))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get your hands on the particular dataset, the best is to load the dataset in AnnData format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "15:14:54 | INFO | Loading dataset 'replogle_K562' which is originally given in AnnData format..\n",
      "15:14:54 | INFO | Downloading replogle_K562.h5ad...\n",
      "/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/urllib3/connectionpool.py:1097: InsecureRequestWarning: Unverified HTTPS request is being made to host 'plus.figshare.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings\n",
      "  warnings.warn(\n",
      "/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/urllib3/connectionpool.py:1097: InsecureRequestWarning: Unverified HTTPS request is being made to host 's3-eu-west-1.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings\n",
      "  warnings.warn(\n",
      "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10.7G/10.7G [09:23<00:00, 18.9MiB/s]\n",
      "15:24:30 | INFO | Adding train/val/test splits..\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "perturb_lib.data.collection.replogle22.HumanCellLine_K562_10xChromium3scRNAseq_Replogle22"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata = plib.load_anndata(context)\n",
    "type(adata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "AnnData object contains a data matrix that describes the value of a readout after a certain perturbation has occurred. We can explore its object to investigate the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AnnData object with n_obs × n_vars = 310385 × 8563\n",
       "    obs: 'cell_barcode', 'perturbation_type', 'perturbation_target', 'batch', 'perturbation', 'context', 'split'\n",
       "    var: 'readout', 'readout_type', 'readout_target'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cell_barcode</th>\n",
       "      <th>perturbation_type</th>\n",
       "      <th>perturbation_target</th>\n",
       "      <th>batch</th>\n",
       "      <th>perturbation</th>\n",
       "      <th>context</th>\n",
       "      <th>split</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AAACCCAAGAAATCCA-27</td>\n",
       "      <td>CRISPRi</td>\n",
       "      <td>NAF1</td>\n",
       "      <td>27</td>\n",
       "      <td>CRISPRi_NAF1</td>\n",
       "      <td>HumanCellLine_K562_10xChromium3-scRNA-seq_Repl...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AAACCCAAGAACTTCC-31</td>\n",
       "      <td>CRISPRi</td>\n",
       "      <td>BUB1</td>\n",
       "      <td>31</td>\n",
       "      <td>CRISPRi_BUB1</td>\n",
       "      <td>HumanCellLine_K562_10xChromium3-scRNA-seq_Repl...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AAACCCAAGAAGCCAC-34</td>\n",
       "      <td>CRISPRi</td>\n",
       "      <td>UBL5</td>\n",
       "      <td>34</td>\n",
       "      <td>CRISPRi_UBL5</td>\n",
       "      <td>HumanCellLine_K562_10xChromium3-scRNA-seq_Repl...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AAACCCAAGAATAGTC-43</td>\n",
       "      <td>CRISPRi</td>\n",
       "      <td>C9orf16</td>\n",
       "      <td>43</td>\n",
       "      <td>CRISPRi_C9orf16</td>\n",
       "      <td>HumanCellLine_K562_10xChromium3-scRNA-seq_Repl...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AAACCCAAGACAGCGT-28</td>\n",
       "      <td>CRISPRi</td>\n",
       "      <td>TIMM9</td>\n",
       "      <td>28</td>\n",
       "      <td>CRISPRi_TIMM9</td>\n",
       "      <td>HumanCellLine_K562_10xChromium3-scRNA-seq_Repl...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          cell_barcode perturbation_type perturbation_target  batch  \\\n",
       "0  AAACCCAAGAAATCCA-27           CRISPRi                NAF1     27   \n",
       "1  AAACCCAAGAACTTCC-31           CRISPRi                BUB1     31   \n",
       "2  AAACCCAAGAAGCCAC-34           CRISPRi                UBL5     34   \n",
       "3  AAACCCAAGAATAGTC-43           CRISPRi             C9orf16     43   \n",
       "4  AAACCCAAGACAGCGT-28           CRISPRi               TIMM9     28   \n",
       "\n",
       "      perturbation                                            context  split  \n",
       "0     CRISPRi_NAF1  HumanCellLine_K562_10xChromium3-scRNA-seq_Repl...  train  \n",
       "1     CRISPRi_BUB1  HumanCellLine_K562_10xChromium3-scRNA-seq_Repl...  train  \n",
       "2     CRISPRi_UBL5  HumanCellLine_K562_10xChromium3-scRNA-seq_Repl...  train  \n",
       "3  CRISPRi_C9orf16  HumanCellLine_K562_10xChromium3-scRNA-seq_Repl...  train  \n",
       "4    CRISPRi_TIMM9  HumanCellLine_K562_10xChromium3-scRNA-seq_Repl...  train  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.obs[0:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>readout</th>\n",
       "      <th>readout_type</th>\n",
       "      <th>readout_target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Transcriptome_LINC01409</td>\n",
       "      <td>Transcriptome</td>\n",
       "      <td>LINC01409</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Transcriptome_LINC01128</td>\n",
       "      <td>Transcriptome</td>\n",
       "      <td>LINC01128</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Transcriptome_NOC2L</td>\n",
       "      <td>Transcriptome</td>\n",
       "      <td>NOC2L</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Transcriptome_KLHL17</td>\n",
       "      <td>Transcriptome</td>\n",
       "      <td>KLHL17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Transcriptome_HES4</td>\n",
       "      <td>Transcriptome</td>\n",
       "      <td>HES4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   readout   readout_type readout_target\n",
       "0  Transcriptome_LINC01409  Transcriptome      LINC01409\n",
       "1  Transcriptome_LINC01128  Transcriptome      LINC01128\n",
       "2      Transcriptome_NOC2L  Transcriptome          NOC2L\n",
       "3     Transcriptome_KLHL17  Transcriptome         KLHL17\n",
       "4       Transcriptome_HES4  Transcriptome           HES4"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.var[0:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((310385, 8563), numpy.ndarray)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.X.shape, type(adata.X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.5457988 , -0.21936147, -0.6191385 ,  1.8765817 , -0.18855709,\n",
       "        -0.70445   ],\n",
       "       [-0.7178899 , -1.0262779 , -0.5378534 , -0.3611275 , -1.1285864 ,\n",
       "         0.3140653 ],\n",
       "       [ 0.20289904,  0.8282157 ,  0.5052937 , -0.36487058, -0.9389877 ,\n",
       "        -0.7573743 ],\n",
       "       [-0.67229235, -0.2952343 , -0.5420161 , -0.26392677, -0.8980309 ,\n",
       "         1.6804963 ],\n",
       "       [-0.7307865 , -0.2523772 , -0.548757  , -0.3520529 , -0.75549585,\n",
       "         0.5462019 ]], dtype=float32)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.X[45:50, 45:51]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we see, AnnData is suitable for exploration and also for data manipulation routines. However, performing machine-learning operations such as batching is not trivial. To enable data handling optimized for training machine-learning models, we introduce ``PlibData``.\n",
    "\n",
    "``PlibData`` data structure comes in two main flavours: One based on in-memory (``InMemoryPlibData``) and one based on-disk (``OnDiskPlibData``) data handling routines (built on top of `polars` and `pyarrow` for optimized performance). Using ``PlibData``, we can load multiple contexts stacked together in the context-perturbation-readout (CPR) format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "15:35:32 | INFO | Loading dataset 'replogle_K562' which is originally given in AnnData format..\n",
      "15:35:32 | INFO | replogle_K562.h5ad found in cache.\n",
      "15:35:44 | INFO | Adding train/val/test splits..\n",
      "15:35:44 | INFO | mean-aggregating data..\n",
      "15:36:19 | INFO | Casting to standardized DataFrame format..\n",
      "15:36:22 | INFO | Loading dataset 'replogle_RPE1' which is originally given in AnnData format..\n",
      "15:36:22 | INFO | Downloading replogle_RPE1.h5ad...\n",
      "/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/urllib3/connectionpool.py:1097: InsecureRequestWarning: Unverified HTTPS request is being made to host 'plus.figshare.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings\n",
      "  warnings.warn(\n",
      "/Users/dm922386/Library/Caches/pypoetry/virtualenvs/perturb-lib-h68r_ta--py3.12/lib/python3.12/site-packages/urllib3/connectionpool.py:1097: InsecureRequestWarning: Unverified HTTPS request is being made to host 's3-eu-west-1.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings\n",
      "  warnings.warn(\n",
      "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.70G/8.70G [07:32<00:00, 19.2MiB/s]\n",
      "15:44:04 | INFO | Adding train/val/test splits..\n",
      "15:44:04 | INFO | mean-aggregating data..\n",
      "15:44:24 | INFO | Casting to standardized DataFrame format..\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "perturb_lib.data.plibdata.InMemoryPlibData"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdata = plib.load_plibdata(\n",
    "    contexts_ids=[\n",
    "        \"HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22\",\n",
    "        \"HumanCellLine_RPE1_10xChromium3-scRNA-seq_Replogle22\",\n",
    "    ]\n",
    ")\n",
    "type(pdata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Note that after ``PlibData`` construction the instance is cached by default for fast retrieval in later attempts. In fact, each context is cached separately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['context', 'perturbation', 'readout', 'value']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdata.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, there are four columns in the table defined by our data structure:\n",
    "\n",
    "- ``context``: contains symbols that this value came from.\n",
    "- ``perturbation``: contains perturbation symbols.\n",
    "- ``readout``: contains readout symbols.\n",
    "- ``value``: the value of the readout."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "polars.dataframe.frame.DataFrame"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "batch_of_four = pdata[1000:1004]\n",
    "type(batch_of_four)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice the batches of the dataset are given in the form of `pd.DataFrame` to enable simple exploration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (4, 4)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>context</th><th>perturbation</th><th>readout</th><th>value</th></tr><tr><td>str</td><td>str</td><td>str</td><td>f32</td></tr></thead><tbody><tr><td>&quot;HumanCellLine_K562_10xChromium…</td><td>&quot;CRISPRi_RPA2&quot;</td><td>&quot;Transcriptome_A1BG&quot;</td><td>0.026806</td></tr><tr><td>&quot;HumanCellLine_K562_10xChromium…</td><td>&quot;CRISPRi_RPA3&quot;</td><td>&quot;Transcriptome_A1BG&quot;</td><td>0.138879</td></tr><tr><td>&quot;HumanCellLine_K562_10xChromium…</td><td>&quot;CRISPRi_RPAP2&quot;</td><td>&quot;Transcriptome_A1BG&quot;</td><td>-0.084605</td></tr><tr><td>&quot;HumanCellLine_K562_10xChromium…</td><td>&quot;CRISPRi_RPAP3&quot;</td><td>&quot;Transcriptome_A1BG&quot;</td><td>0.060994</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (4, 4)\n",
       "┌─────────────────────────────────┬───────────────┬────────────────────┬───────────┐\n",
       "│ context                         ┆ perturbation  ┆ readout            ┆ value     │\n",
       "│ ---                             ┆ ---           ┆ ---                ┆ ---       │\n",
       "│ str                             ┆ str           ┆ str                ┆ f32       │\n",
       "╞═════════════════════════════════╪═══════════════╪════════════════════╪═══════════╡\n",
       "│ HumanCellLine_K562_10xChromium… ┆ CRISPRi_RPA2  ┆ Transcriptome_A1BG ┆ 0.026806  │\n",
       "│ HumanCellLine_K562_10xChromium… ┆ CRISPRi_RPA3  ┆ Transcriptome_A1BG ┆ 0.138879  │\n",
       "│ HumanCellLine_K562_10xChromium… ┆ CRISPRi_RPAP2 ┆ Transcriptome_A1BG ┆ -0.084605 │\n",
       "│ HumanCellLine_K562_10xChromium… ┆ CRISPRi_RPAP3 ┆ Transcriptome_A1BG ┆ 0.060994  │\n",
       "└─────────────────────────────────┴───────────────┴────────────────────┴───────────┘"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "batch_of_four"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where the columns are of the following types:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[String, String, String, Float32]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "batch_of_four.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "To prepare the data for predictive modelling, we can split the dataset into train, validation, and test parts as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "traindata, valdata, testdata = plib.split_plibdata_3fold(\n",
    "    pdata, context_ids=\"HumanCellLine_K562_10xChromium3-scRNA-seq_Replogle22\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Splitting operation provides traindata valdata, and testdata as specified in the ``AnnData`` format for the specified context. The valdata and testdata contain only data from the target context. The traindata optionally contains data from other datasets.\n",
    "\n",
    "Finally, we can fetch a ``pytorch`` `DataLoader` for the corresponding dataset as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (2, 4)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>context</th><th>perturbation</th><th>readout</th><th>value</th></tr><tr><td>str</td><td>str</td><td>str</td><td>f32</td></tr></thead><tbody><tr><td>&quot;HumanCellLine_K562_10xChromium…</td><td>&quot;CRISPRi_RRP15&quot;</td><td>&quot;Transcriptome_USF3&quot;</td><td>-0.066255</td></tr><tr><td>&quot;HumanCellLine_K562_10xChromium…</td><td>&quot;CRISPRi_CWC25&quot;</td><td>&quot;Transcriptome_ZNF276&quot;</td><td>0.01206</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (2, 4)\n",
       "┌─────────────────────────────────┬───────────────┬──────────────────────┬───────────┐\n",
       "│ context                         ┆ perturbation  ┆ readout              ┆ value     │\n",
       "│ ---                             ┆ ---           ┆ ---                  ┆ ---       │\n",
       "│ str                             ┆ str           ┆ str                  ┆ f32       │\n",
       "╞═════════════════════════════════╪═══════════════╪══════════════════════╪═══════════╡\n",
       "│ HumanCellLine_K562_10xChromium… ┆ CRISPRi_RRP15 ┆ Transcriptome_USF3   ┆ -0.066255 │\n",
       "│ HumanCellLine_K562_10xChromium… ┆ CRISPRi_CWC25 ┆ Transcriptome_ZNF276 ┆ 0.01206   │\n",
       "└─────────────────────────────────┴───────────────┴──────────────────────┴───────────┘"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_loader = traindata.get_data_loader(batch_size=2, num_workers=0, pin_memory=False, shuffle=True)\n",
    "next(iter(train_loader))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}