iomega/molDiscovery

 
 

molDiscovery: Learning Mass Spectrometry Fragmentation of Small Molecules

About MolDiscovery

MolDiscovery is a mass spectral database search method that improves both the efficiency and the accuracy of small molecule identification by (i) using an efficient algorithm to generate mass spectrometry fragmentations, and (ii) learning a probabilistic model to match small molecules with their mass spectra. A search of over six million spectra from the Global Natural Products Social Molecular Networking (GNPS) infrastructure shows that our probabilistic model can identify nearly twice as many small molecules as the previous method.

MolDiscovery is developed as a collaboration between Carnegie Mellon University (PA, USA) and Saint Petersburg State University (Russia). The current version (beta) can be downloaded from https://github.com/mohimanilab/molDiscovery/releases. The stable release and all further versions will be available in the Natural Product Discovery toolkit (NPDtools) at https://github.com/ablab/npdtools.

Web version

You can try the molDiscovery workflow online on the GNPS platform (registration is required, but it is quick and simple). Note that you need to log in first to be able to open the workflow link. Alternatively, follow the instructions below to install and run our command line tool (available for Linux and macOS).

Installation & Running instructions

Please refer to the NPDtools manual for all details. Specific details regarding molDiscovery are in this section.

Basic example (this is for Linux; to replicate it on macOS, substitute Darwin for Linux):

    wget https://github.com/mohimanilab/molDiscovery/releases/download/npdtools-2.6.0-beta/NPDtools-2.6.0-beta-Linux.tar.gz
    tar -xzf NPDtools-2.6.0-beta-Linux.tar.gz
    cd NPDtools-2.6.0-beta-Linux
    python2.7 bin/moldiscovery.py share/npdtools/test_data/moldiscovery/ --db-path share/npdtools/test_data/sample_database/ -o moldiscovery_outdir
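
The Linux-to-Darwin substitution mentioned above can be sketched as a small helper. This is only an illustration: the function names `archive_name` and `archive_url` are hypothetical, and the Darwin file name is an assumption derived from the substitution rule stated in this README rather than from a verified release listing.

```python
import platform

# Release location copied from the Linux example in this README.
BASE_URL = ("https://github.com/mohimanilab/molDiscovery/releases/"
            "download/npdtools-2.6.0-beta")


def archive_name(system=None):
    """Return the archive name for 'Linux' or 'Darwin' (macOS).

    If no system is given, the name reported by the running OS is used.
    """
    system = system or platform.system()
    if system not in ("Linux", "Darwin"):
        # The command line tool is described as Linux/macOS only.
        raise ValueError("unsupported platform: %r" % system)
    return "NPDtools-2.6.0-beta-%s.tar.gz" % system


def archive_url(system=None):
    """Return the full download URL (assumed pattern) for the archive."""
    return "%s/%s" % (BASE_URL, archive_name(system))
```

Calling `archive_url()` with no argument picks the name based on `platform.system()`, which reports `Linux` on Linux and `Darwin` on macOS.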

If the moldiscovery.py run finishes correctly, you will see identifications of a nonribosomal peptide (Surugamide) and a polyketide (Chalcomycin) listed in moldiscovery_outdir/significant_matches.tsv. The column names are largely self-explanatory, but you can find more details in the corresponding section of the manual.
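
To inspect the results file programmatically, a minimal sketch along the following lines may help. It assumes only that `significant_matches.tsv` is a tab-separated file with a header row (as the `.tsv` extension suggests); it deliberately makes no assumption about the actual column names, which are documented in the manual. The function name `load_matches` is hypothetical, and the script targets Python 3 even though the tool itself is launched with `python2.7`.

```python
import csv


def load_matches(path):
    """Read a tab-separated file with a header row.

    Returns (column_names, rows), where each row is a dict
    keyed by the column names found in the header.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)
        # fieldnames is None for an empty file, so fall back to [].
        columns = list(reader.fieldnames or [])
    return columns, rows
```

For example, `columns, rows = load_matches("moldiscovery_outdir/significant_matches.tsv")` followed by `print(columns)` lists the headers actually present in your output, and `len(rows)` gives the number of reported matches.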

Results

Massive GNPS analysis

We analyzed 7.6 million spectra from the Global Natural Products Social Molecular Networking infrastructure (GNPS, https://gnps.ucsd.edu/) using molDiscovery and regular Dereplicator+. The figure below shows the performance of both tools at different false discovery rate (FDR) levels.
The curves show the number of (A) small molecule-spectrum matches and (B) unique compounds identified by Dereplicator+ and molDiscovery in the search of 45 GNPS spectral datasets against AllDB (719,958 compounds from AntiMarin, DNP, UNPD, and other databases).

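[Figure: (A) small molecule-spectrum matches and (B) unique compounds identified by Dereplicator+ and molDiscovery at different FDR levels]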

Novel BGCs in Streptomyces dataset

MolDiscovery successfully discovered novel biosynthetic gene clusters (BGCs) for three small molecule families from the Streptomyces dataset MSV000083738 (see the previous analysis of this dataset in Navarro-Muñoz et al., 2020 and Doroghazi et al., 2014). MolDiscovery search results for this dataset on GNPS are available here.

Pseudomonas dataset manual curation

We benchmarked molDiscovery against Dereplicator+ on the top 100 identifications from the extensively studied GNPS dataset MSV000079450 (400,000 spectra from Pseudomonas isolates). For each identified compound, we checked its origin using a literature search. Out of the top 100 small molecule-spectrum matches reported by molDiscovery, 78 correspond to compounds of Pseudomonas origin, based on taxonomies reported for molecules in the AntiMarin database. The second largest genus among the identifications (20 out of 100) is Bacillus. However, these matches are still likely to be true positives, since the dataset is known to be contaminated with Bacillus species (see Gurevich et al., 2018). While the top 100 identifications from Dereplicator+ also contain 20 Bacillus matches, the number of hits related to Pseudomonas species is just 62, about 21% fewer than molDiscovery's 78 (equivalently, molDiscovery reports about 26% more). Eighteen identifications from Dereplicator+ are annotated as having fungal origin, which makes them likely false positives. A subset of the AntiMarin database used in this search and containing all compounds from both top-100 lists can be downloaded here.
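
The gap between the two Pseudomonas counts can be expressed relative to either tool, and the percentage depends on which one is taken as the baseline. A minimal sketch of that arithmetic, using only the counts quoted above (78 for molDiscovery and 62 for Dereplicator+); the variable names are illustrative:

```python
# Pseudomonas-origin hits among the top 100 identifications of each tool.
moldiscovery_hits = 78
dereplicator_hits = 62

gap = moldiscovery_hits - dereplicator_hits  # 16 hits

# Dereplicator+ relative to molDiscovery: how many fewer hits, in percent.
fewer_pct = 100.0 * gap / moldiscovery_hits

# molDiscovery relative to Dereplicator+: how many more hits, in percent.
more_pct = 100.0 * gap / dereplicator_hits

print("Dereplicator+ has %.1f%% fewer hits" % fewer_pct)
print("molDiscovery has %.1f%% more hits" % more_pct)
```

With these counts, the difference is about 20.5% when measured against molDiscovery and about 25.8% when measured against Dereplicator+.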

Feedback and Bug reports

Your comments, bug reports, and suggestions are very welcome. They will help us to further improve NPDtools in general and molDiscovery in particular. You can leave them at our GitHub repository tracker or send them via support e-mail: [email protected].
