Merge pull request #152 from iomega/workflow_k_fold_validation
Added k fold cross validation script
niekdejonge authored Nov 17, 2022
2 parents 03de1fc + 3053612 commit 1f869ad
Showing 43 changed files with 2,148 additions and 1,358 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -29,7 +29,7 @@ docs/apidocs
config.py
output/
/data/
/ms2query/use_case_python_files
/experiment_python_files
models_trained/
computed_results/
notebooks/.ipynb_checkpoints/
13 changes: 12 additions & 1 deletion CHANGELOG.md
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.5.0]
### Added
- Training models is now fully automatic (no need for notebooks)
- Functions for creating benchmarking results
- Functions for doing k_fold_cross_validation
- Functions for visualizing benchmarking results
### Changed
- Method for creating new library files
- Cleaning spectra functions for running are now combined with cleaning spectra functions for training

## [0.4.3]
- Do not store MS2Deepscores in results table, to prevent memory issues

@@ -114,7 +124,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- First test workflow and basic batches.
- Licence.

[Unreleased]: https://github.com/iomega/ms2query/compare/0.4.2...HEAD
[Unreleased]: https://github.com/iomega/ms2query/compare/0.5.0...HEAD
[0.5.0]: https://github.com/iomega/ms2query/compare/0.4.3...0.5.0
[0.4.3]: https://github.com/iomega/ms2query/compare/0.4.1...0.4.3
[0.4.1]: https://github.com/iomega/ms2query/compare/0.4.0...0.4.1
[0.4.0]: https://github.com/iomega/ms2query/compare/0.3.3...0.4.0
64 changes: 53 additions & 11 deletions README.md
@@ -77,7 +80,10 @@ run_complete_folder(ms2library, ms2_spectra_directory)

```

## Create your own library
## Create your own library (without training new models)
The code below creates all required library files for your own in-house library.
No new models for MS2Deepscore, Spec2Vec and MS2Query will be trained; to train new models, see the next section.

To create your own library you also need to install RDKit, by running the following in your command line (while in the ms2query conda environment):
```
conda install -c conda-forge rdkit
@@ -87,23 +90,22 @@ It is important that the library spectra are annotated with smiles, inchi's or i
are not included in the library.

Fill in the blank spots with the file locations.
The models for spec2vec, ms2deepscore and ms2query can be downloaded from the zenodo links (see above).
The models for spec2vec, ms2deepscore and ms2query can be downloaded from the zenodo links (see above).

```python
from ms2query.library_files_creator import LibraryFilesCreator
from ms2query.create_new_library.library_files_creator import LibraryFilesCreator
from ms2query.clean_and_filter_spectra import clean_normalize_and_split_annotated_spectra
from ms2query.utils import load_matchms_spectrum_objects_from_file

spectrum_file_location = # The file location of the library spectra
library_spectra = load_matchms_spectrum_objects_from_file(spectrum_file_location)
# Fill in the missing values:
library_creator = LibraryFilesCreator(library_spectra,
output_directory=, # For instance "data/library_data/all_GNPS_positive_mode_"
ion_mode="positive",
ms2ds_model_file_name=, # The file location of the ms2ds model
s2v_model_file_name=, ) # The file location of the s2v model
library_creator.clean_up_smiles_inchi_and_inchikeys(do_pubchem_lookup=True)
library_creator.clean_peaks_and_normalise_intensities_spectra()
library_creator.remove_not_fully_annotated_spectra()
cleaned_library_spectra = clean_normalize_and_split_annotated_spectra(library_spectra, ion_mode_to_keep="")[
0] # fill in "positive" or "negative"
library_creator = LibraryFilesCreator(cleaned_library_spectra,
output_directory="", # For instance "data/library_data/all_GNPS_positive_mode_"
ms2ds_model_file_name="", # The file location of the ms2ds model
s2v_model_file_name="", ) # The file location of the s2v model
library_creator.create_all_library_files()
```
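
The annotation requirement mentioned above can be checked before building a library. The following is a minimal sketch, not part of ms2query: it works on plain dictionaries standing in for spectrum metadata, and the rule "at least one of smiles, inchi or inchikey is non-empty" is an assumption derived from the README wording rather than the exact filter ms2query applies. The helper names `has_annotation` and `count_annotated` are hypothetical.

```python
# Metadata keys named in the README text. Treating "at least one present" as
# sufficient is an assumption of this sketch, not ms2query's exact rule.
ANNOTATION_KEYS = ("smiles", "inchi", "inchikey")


def has_annotation(metadata):
    """Return True if at least one annotation key holds a non-empty value."""
    for key in ANNOTATION_KEYS:
        value = metadata.get(key)
        if value is not None and str(value).strip():
            return True
    return False


def count_annotated(spectra_metadata):
    """Return (annotated, unannotated) counts for a collection of metadata."""
    spectra_metadata = list(spectra_metadata)
    annotated = sum(1 for metadata in spectra_metadata if has_annotation(metadata))
    return annotated, len(spectra_metadata) - annotated


# Hypothetical example data: only the first entry carries a usable annotation.
example = [
    {"smiles": "CCO", "inchikey": ""},
    {"inchi": None},
    {"compound_name": "unknown"},
]
annotated, unannotated = count_annotated(example)
print(f"{annotated} annotated, {unannotated} would be skipped")
```

Because the check only relies on a `get` method, it should carry over to objects that expose metadata the same way, although that is untested here.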

@@ -143,6 +145,46 @@ ms2library = MS2Library(sqlite_file_name= ,
)
```

## Create your own library and train new models
The code below trains new MS2Deepscore, Spec2Vec and MS2Query models for your in-house library
and creates all files needed for running MS2Query.

It is important that the library spectra are annotated with smiles, inchis or inchikeys in the metadata; otherwise they
are not included in the library and training.

Fill in the blank spots below and run the code (this can take several days).
The models will be stored in the specified output_folder, after which MS2Query can be run on the new library.

```python
from ms2query.create_new_library.train_models import clean_and_train_models
clean_and_train_models(spectrum_file=, #Fill in the location of the file containing the library spectra
# Accepted formats are: "mzML", "json", "mgf", "msp", "mzxml", "usi" or a pickled matchms object.
ion_mode=, # Fill in the ion mode, choose from "positive" or "negative"
output_folder= # The output folder in which all the models are stored.
)
```
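
Since training can take several days, it may be worth checking the three arguments before starting. The sketch below is not part of ms2query; `check_training_arguments` is a hypothetical helper that only uses the standard library. The accepted extensions are taken from the formats listed in the comment above, and the `.pickle` extension for pickled matchms objects is an assumption.

```python
from pathlib import Path

# Extensions based on the formats listed in the README comment.
# ".pickle" for pickled matchms objects is an assumption of this sketch.
ACCEPTED_EXTENSIONS = {".mzml", ".json", ".mgf", ".msp", ".mzxml", ".usi", ".pickle"}
ACCEPTED_ION_MODES = {"positive", "negative"}


def check_training_arguments(spectrum_file, ion_mode, output_folder):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if Path(spectrum_file).suffix.lower() not in ACCEPTED_EXTENSIONS:
        problems.append(f"unexpected file extension: {spectrum_file}")
    if ion_mode not in ACCEPTED_ION_MODES:
        problems.append(f"ion_mode must be 'positive' or 'negative', got {ion_mode!r}")
    if not output_folder:
        problems.append("output_folder is empty")
    return problems


# Hypothetical file names, used only to show the two outcomes.
print(check_training_arguments("library.mgf", "positive", "models/"))
print(check_training_arguments("library.txt", "both", ""))
```

The helper collects all problems instead of raising on the first one, so a single run reports everything that needs fixing before the long training job is launched.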

To run MS2Query on your own created library run the code below (again fill in the blanks).

```python
from ms2query.run_ms2query import run_complete_folder
from ms2query.ms2library import create_library_object_from_one_dir

# Define the folder in which your query spectra are stored.
# Accepted formats are: "mzML", "json", "mgf", "msp", "mzxml", "usi" or a pickled matchms object.
ms2_spectra_directory = # Specify the folder containing the query spectra you want to run against the library
ms2_library_directory = # Specify the directory containing all the library and model files

# Create a MS2Library object from one directory
# If this does not work (because files have unexpected names or are not in one dir) see below.
ms2library = create_library_object_from_one_dir(ms2_library_directory)

# Run library search and analog search on your files.
run_complete_folder(ms2library, ms2_spectra_directory)
```

After running, the model can be loaded.

## Documentation for developers
### Prepare environment
We recommend creating an Anaconda environment with
2 changes: 1 addition & 1 deletion ms2query/__init__.py
@@ -1,6 +1,6 @@
import logging
from .__version__ import __version__
from .library_files_creator import LibraryFilesCreator
from ms2query.create_new_library.library_files_creator import LibraryFilesCreator
from .ms2library import MS2Library
from .results_table import ResultsTable

2 changes: 1 addition & 1 deletion ms2query/__version__.py
@@ -1 +1 @@
__version__ = '0.4.3'
__version__ = '0.5.0'