From 0be405b9c534d2cc9ecafe671af78bb844312a8a Mon Sep 17 00:00:00 2001
From: Simon Axelrod <saxelrod@.mit.edu>
Date: Wed, 2 Dec 2020 16:08:06 -0500
Subject: [PATCH] Modification of first tutorial to include references for the
 new CoV-2 datasets

---
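Note (not part of the commit): the first tutorial now hardcodes `direc` to
one user's home directory, so every reader has to edit that cell. One
possible alternative is to read the directory from an environment variable
and fall back to the current directory. This is a minimal sketch, not what
the tutorial does: the variable name `GEOM_DATA_DIR` and the helper
`geom_path` are hypothetical and do not exist in the GEOM repository.

```python
import os

# Hypothetical environment variable name (an assumption of this sketch).
ENV_VAR = "GEOM_DATA_DIR"


def geom_path(filename, default_dir="."):
    """Return the path to `filename` inside the GEOM data directory.

    The directory is taken from the GEOM_DATA_DIR environment variable
    if it is set, and from `default_dir` otherwise.
    """
    direc = os.environ.get(ENV_VAR, default_dir)
    return os.path.join(direc, filename)


# The two files the tutorial opens, resolved through the helper.
drugs_file = geom_path("drugs_crude.msgpack")
qm9_features_file = geom_path("qm9_featurized.msgpack")
```

With this, a reader could run `export GEOM_DATA_DIR=/path/to/data` once
before starting Jupyter instead of editing the notebook cell.
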
 tutorials/01_loading_data.ipynb       | 51 +++++++++++++++++----------
 tutorials/02_loading_rdkit_mols.ipynb |  2 +-
 2 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/tutorials/01_loading_data.ipynb b/tutorials/01_loading_data.ipynb
index df22a03..67f01fb 100644
--- a/tutorials/01_loading_data.ipynb
+++ b/tutorials/01_loading_data.ipynb
@@ -11,7 +11,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This tutorial shows how to load and understand the GEOM data.\n"
+    "This tutorial shows how to load and understand the GEOM data. **Caution: As new data is added to GEOM, we will update only the RDKit files, not the MessagePack files. Use the RDKit files if you want the most up-to-date data.**\n"
    ]
   },
   {
@@ -38,7 +38,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -54,11 +54,16 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 3,
    "metadata": {},
    "outputs": [],
    "source": [
-    "drugs_file = \"drugs_crude.msgpack\"\n",
+    "import os\n",
+    "\n",
+    "# Change this to the directory where your GEOM data is located\n",
+    "direc = \"/home/saxelrod/rgb_nfs/GEOM_NON_TAR\"\n",
+    "\n",
+    "drugs_file = os.path.join(direc, \"drugs_crude.msgpack\")\n",
     "unpacker = msgpack.Unpacker(open(drugs_file, \"rb\"))\n"
    ]
   },
@@ -73,7 +78,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -89,7 +94,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 5,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -107,7 +112,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 6,
    "metadata": {},
    "outputs": [
     {
@@ -144,7 +149,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 7,
    "metadata": {},
    "outputs": [
     {
@@ -171,7 +176,7 @@
        " 'datasets': ['plpro', 'aid1706']}"
       ]
      },
-     "execution_count": 6,
+     "execution_count": 7,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -204,7 +209,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 8,
    "metadata": {},
    "outputs": [
     {
@@ -213,7 +218,7 @@
        "84"
       ]
      },
-     "execution_count": 7,
+     "execution_count": 8,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -231,7 +236,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [
     {
@@ -289,7 +294,7 @@
        " 'conformerweights': [0.22575, 0.22548]}"
       ]
      },
-     "execution_count": 8,
+     "execution_count": 9,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -340,7 +345,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "qm9_features_file = \"qm9_featurized.msgpack\"\n",
+    "qm9_features_file = os.path.join(direc, \"qm9_featurized.msgpack\")\n",
     "qm9_unpacker = msgpack.Unpacker(open(qm9_features_file, \"rb\"))\n",
     "qm9_feat_1k = next(iter(qm9_unpacker))"
    ]
@@ -364,7 +369,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 19,
+   "execution_count": 14,
    "metadata": {},
    "outputs": [
     {
@@ -787,7 +792,7 @@
        " 'canon_smiles': 'CN1C[C@H]2NC[C@]21C#N'}"
       ]
      },
-     "execution_count": 19,
+     "execution_count": 14,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -842,6 +847,16 @@
     "\n",
     "Here we provide the references for the different possible values of the `dataset` key in the dictionaries:\n",
     "\n",
+    "- `ellinger`:\n",
+    "    - Bernhard Ellinger, Denisa Bojkova, Andrea Zaliani, Jindrich Cinatl, Carsten Claussen, Sandra Westhaus, Jeanette Reinshagen, Maria Kuzikov, Markus Wolf, Gerd Geisslinger, et al. Identification of inhibitors of SARS-CoV-2 in-vitro cellular toxicity in human (Caco-2) cells using a large scale drug repurposing collection. 2020.\n",
+    "\n",
+    "- `amu_sars_cov_2`: \n",
+    "    - Franck Touret, Magali Gilles, Karine Barral, Antoine Nougairède, Etienne Decroly, Xavier de Lamballerie, and Bruno Coutard. In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication. BioRxiv, 2020.\n",
+    "    \n",
+    "- `mpro_xchem`: \n",
+    "    - Main protease structure and XChem fragment screen. https://www.diamond.ac.uk/covid-19/for-scientists/Main-protease-structure-and-XChem.html.\n",
+    "   \n",
+    "    \n",
     "- `aid1706`: \n",
     "    - Valerie Tokars and Andrew Mesecar. QFRET-based primary biochemical high throughput screening assayto identify inhibitors of the SARS coronavirus 3C-like Protease (3CLPro). https://pubchem.ncbi.nlm.nih.gov/bioassay/1706\n",
     "    - https://github.com/yangkevin2/coronavirus_data/blob/master/data/AID1706_binarized_sars.csv. Accessed: 2020-03-28\n",
@@ -871,9 +886,9 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python [conda env:geom]",
+   "display_name": "Python 3",
    "language": "python",
-   "name": "geom"
+   "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
diff --git a/tutorials/02_loading_rdkit_mols.ipynb b/tutorials/02_loading_rdkit_mols.ipynb
index 3ac42a7..2b77d9e 100644
--- a/tutorials/02_loading_rdkit_mols.ipynb
+++ b/tutorials/02_loading_rdkit_mols.ipynb
@@ -2366,7 +2366,7 @@
    "source": [
     "In this case we only had to remove the `Br-` anion and the take the rest of the SMILES string.\n",
     "\n",
-    "If you want to compare the SMILES in our data to the SMILES in another dataset, you can use `uncleaned_smiles` for our data and `Chem.MolToSmiles(Chem.MolFromSmiles(<their_smiles>))` for theirs. Our SMILES are already in canonical form, and application of `Chem.MolToSmiles(Chem.MolFromSmiles(` converts `<their_smiles>` to canonical form, too."
+    "If you want to compare the SMILES in our data to the SMILES in another dataset, you can use `uncleaned_smiles` for our data and `Chem.MolToSmiles(Chem.MolFromSmiles(<their_smiles>))` for theirs. Our SMILES are already in canonical form, and application of `Chem.MolToSmiles(Chem.MolFromSmiles())` converts `<their_smiles>` to canonical form, too."
    ]
   },
   {