Modification of first tutorial to include references for the new CoV-2 datasets
aleksy-k · Dec 2, 2020 · 0be405b
1 parent df4d6ca
Showing 2 changed files with 34 additions and 19 deletions.
diff --git a/tutorials/01_loading_data.ipynb b/tutorials/01_loading_data.ipynb
@@ -11,7 +11,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This tutorial shows how to load and understand the GEOM data.\n"
+    "This tutorial shows how to load and understand the GEOM data. **Caution: We will only be updating the RDKit files as new data gets added to GEOM, and we will not be updating the messagepack files. Make sure you are using the RDKit files if you want the most up-to-date data.**\n"
    ]
   },
   {
@@ -38,7 +38,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -54,11 +54,16 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 3,
    "metadata": {},
    "outputs": [],
    "source": [
-    "drugs_file = \"drugs_crude.msgpack\"\n",
+    "import os\n",
+    "\n",
+    "# change to where your data is located\n",
+    "direc = \"/home/saxelrod/rgb_nfs/GEOM_NON_TAR\"\n",
+    "\n",
+    "drugs_file = os.path.join(direc, \"drugs_crude.msgpack\")\n",
     "unpacker = msgpack.Unpacker(open(drugs_file, \"rb\"))\n"
    ]
   },
@@ -73,7 +78,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -89,7 +94,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 5,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -107,7 +112,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 6,
    "metadata": {},
    "outputs": [
     {
@@ -144,7 +149,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 7,
    "metadata": {},
    "outputs": [
     {
@@ -171,7 +176,7 @@
        " 'datasets': ['plpro', 'aid1706']}"
       ]
      },
-     "execution_count": 6,
+     "execution_count": 7,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -204,7 +209,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 8,
    "metadata": {},
    "outputs": [
     {
@@ -213,7 +218,7 @@
        "84"
       ]
      },
-     "execution_count": 7,
+     "execution_count": 8,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -231,7 +236,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [
     {
@@ -289,7 +294,7 @@
        " 'conformerweights': [0.22575, 0.22548]}"
       ]
      },
-     "execution_count": 8,
+     "execution_count": 9,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -340,7 +345,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "qm9_features_file = \"qm9_featurized.msgpack\"\n",
+    "qm9_features_file = os.path.join(direc, \"qm9_featurized.msgpack\")\n",
     "qm9_unpacker = msgpack.Unpacker(open(qm9_features_file, \"rb\"))\n",
     "qm9_feat_1k = next(iter(qm9_unpacker))"
    ]
@@ -364,7 +369,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 19,
+   "execution_count": 14,
    "metadata": {},
    "outputs": [
     {
@@ -787,7 +792,7 @@
        " 'canon_smiles': 'CN1C[C@H]2NC[C@]21C#N'}"
       ]
      },
-     "execution_count": 19,
+     "execution_count": 14,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -842,6 +847,16 @@
     "\n",
     "Here we provide the references for the different possible values of the `dataset` key in the dictionaries:\n",
     "\n",
+    "- `ellinger`:\n",
+    "    - Bernhard Ellinger, Denisa Bojkova, Andrea Zaliani, Jindrich Cinatl, Carsten Claussen, Sandra Westhaus, Jeanette Reinshagen, Maria Kuzikov, Markus Wolf, Gerd Geisslinger, et al. Identification of inhibitors of SARS-CoV-2 in-vitro cellular toxicity in human (Caco-2) cells using a large scale drug repurposing collection. 2020.\n",
+    "\n",
+    "- `amu_sars_cov_2`: \n",
+    "    - Franck Touret, Magali Gilles, Karine Barral, Antoine Nougairède, Etienne Decroly, Xavier de Lamballerie, and Bruno Coutard. In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication. BioRxiv, 2020.\n",
+    "    \n",
+    "- `mpro_xchem`: \n",
+    "    - Main protease structure and XChem fragment screen. https://www.diamond.ac.uk/covid-19/for-scientists/Main-protease-structure-and-XChem.html.\n",
+    "   \n",
+    "    \n",
     "- `aid1706`: \n",
     "    - Valerie Tokars and Andrew Mesecar. QFRET-based primary biochemical high throughput screening assay to identify inhibitors of the SARS coronavirus 3C-like Protease (3CLPro). https://pubchem.ncbi.nlm.nih.gov/bioassay/1706\n",
     "    - https://github.com/yangkevin2/coronavirus_data/blob/master/data/AID1706_binarized_sars.csv. Accessed: 2020-03-28\n",
@@ -871,9 +886,9 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python [conda env:geom]",
+   "display_name": "Python 3",
    "language": "python",
-   "name": "geom"
+   "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {

diff --git a/tutorials/02_loading_rdkit_mols.ipynb b/tutorials/02_loading_rdkit_mols.ipynb
@@ -2366,7 +2366,7 @@
    "source": [
     "In this case we only had to remove the `Br-` anion and then take the rest of the SMILES string.\n",
     "\n",
-    "If you want to compare the SMILES in our data to the SMILES in another dataset, you can use `uncleaned_smiles` for our data and `Chem.MolToSmiles(Chem.MolFromSmiles(<their_smiles>))` for theirs. Our SMILES are already in canonical form, and application of `Chem.MolToSmiles(Chem.MolFromSmiles(` converts `<their_smiles>` to canonical form, too."
+    "If you want to compare the SMILES in our data to the SMILES in another dataset, you can use `uncleaned_smiles` for our data and `Chem.MolToSmiles(Chem.MolFromSmiles(<their_smiles>))` for theirs. Our SMILES are already in canonical form, and application of `Chem.MolToSmiles(Chem.MolFromSmiles())` converts `<their_smiles>` to canonical form, too."
    ]
   },
   {