add methods to export results in tabular format #280

Merged
merged 45 commits into from
Dec 4, 2024
45 commits
6b8d3d3
add print links method to LinkGraph, improve LinkGraph string represe…
liannette Oct 15, 2024
cdd26c3
feat: add a method to print tabular results files
liannette Oct 16, 2024
ec8b8ae
improve method names and docstrings, remove unused method to export g…
liannette Oct 16, 2024
2207df1
improve doctring and typing
liannette Oct 16, 2024
c6e166a
fix a failing test
liannette Oct 16, 2024
32ca3dd
refactor a little bit the spectrum method to covert to dict
liannette Oct 16, 2024
8e7945d
change the output format for gnps_annotations in metabolomics results…
liannette Oct 16, 2024
2592810
fix: convert int to str before using join
liannette Oct 17, 2024
7f53de8
change representation of empty values in output files for improved in…
liannette Oct 17, 2024
ad049c8
refactoring the export methods
liannette Oct 17, 2024
b220fb0
small refactor: specify staticmethod
liannette Oct 18, 2024
f98fa98
add more tests
liannette Oct 18, 2024
a8a8329
correct typing in doctrings
liannette Oct 18, 2024
c6c33e6
typing: changed typings to pass mypy static typing checks
liannette Oct 22, 2024
a260338
refactor: change the order of methods/functions
liannette Oct 22, 2024
3289683
restore the order of already existing functions and methods
liannette Nov 4, 2024
d2272e2
make dicts json compatible
liannette Nov 4, 2024
cb49209
rename functions and variables
liannette Nov 4, 2024
6a4da5f
refactor: changed the place when the index is added to the link dict
liannette Nov 4, 2024
edcc7db
use csv package to write the tabular output files
liannette Nov 4, 2024
05f9f76
make sure all elements of the input list have the same type of data.
liannette Nov 4, 2024
bff7731
shorten to long doc string lines, correct some doc strings
liannette Nov 4, 2024
d4bf9fb
tests: adapted the test to the changes
liannette Nov 4, 2024
2c05efb
remove a file that was committed by accident
liannette Nov 4, 2024
229a11d
Merge branch 'NPLinker:dev' into output_files
liannette Nov 5, 2024
32d78c3
Improve docstrings
liannette Nov 19, 2024
b04226b
Improve docstrings
liannette Nov 19, 2024
5fd4108
refactor: add method to convert a value to string for tabular output
liannette Nov 19, 2024
8137f7d
Merge branch 'output_files' of https://github.com/liannette/nplinker …
liannette Nov 19, 2024
940eb19
improve doctring, add a comment about key order of bgc dict represent…
liannette Nov 20, 2024
e551dcc
move to_string method to the BGC/Spectrum class, add a to_tabular method
liannette Nov 20, 2024
f9ae9f2
add tests for the to_string method
liannette Nov 20, 2024
1b00262
change to_tabular to it returns a list and not a string
liannette Nov 20, 2024
0d6bec3
refactor: to_tabular returns dict, to_string turned into private func…
liannette Nov 25, 2024
41757c7
fix typing in to_tabular methods
liannette Dec 2, 2024
b94eddf
update docstrings and comments
liannette Dec 2, 2024
94bcb67
ensure 0 and 0.0 are correctly converted to strings, and not to empty…
liannette Dec 2, 2024
16a56c7
change the order of methods
liannette Dec 2, 2024
183bd5f
remove whitespace in blank lines
liannette Dec 2, 2024
e2227df
update and add tests
liannette Dec 2, 2024
642c67c
change variable name to fix mypy error
liannette Dec 2, 2024
7cd675f
test: trying to fix unit test issue where the spectrum rt is a dict i…
liannette Dec 2, 2024
cacd504
Merge branch 'NPLinker:dev' into output_files
liannette Dec 2, 2024
19b6f1e
tests: add precursor charge to the test spectra
liannette Dec 2, 2024
40391fe
Update src/nplinker/metabolomics/spectrum.py
CunliangGeng Dec 4, 2024
2 changes: 1 addition & 1 deletion .github/workflows/format-typing-check.yml
@@ -37,7 +37,7 @@ jobs:
      - name: Install ruff and mypy
        run: |
          pip install ruff mypy typing_extensions \
-            types-Deprecated types-beautifulsoup4 types-jsonschema types-networkx pandas-stubs
+            types-Deprecated types-beautifulsoup4 types-jsonschema types-networkx types-tabulate pandas-stubs
      - name: Get all changed python files
        id: changed-python-files
        uses: tj-actions/changed-files@v44
1 change: 1 addition & 0 deletions pyproject.toml
@@ -63,6 +63,7 @@ dev = [
"types-beautifulsoup4",
"types-jsonschema",
"types-networkx",
"types-tabulate",
"pandas-stubs",
# docs
"black",
68 changes: 68 additions & 0 deletions src/nplinker/genomics/bgc.py
@@ -1,6 +1,7 @@
from __future__ import annotations
import logging
from typing import TYPE_CHECKING
from typing import Any
from deprecated import deprecated
from nplinker.strain import Strain
from .aa_pred import predict_aa
@@ -173,6 +174,73 @@ def is_mibig(self) -> bool:
        """
        return self.id.startswith("BGC")

    def to_dict(self) -> dict[str, Any]:
        """Convert the BGC object to a dictionary for export.

        Returns:
            A dictionary containing the following key-value pairs:

            - GCF_id (list[str]): A list of GCF IDs.
            - GCF_bigscape_class (list[str]): A list of BiG-SCAPE classes.
            - strain_id (str | None): The ID of the strain.
            - description (str | None): A description of the BGC.
            - BGC_name (str): The name of the BGC.
            - product_prediction (list[str]): Predicted products or product classes of the BGC.
            - mibig_bgc_class (list[str] | None): MIBiG biosynthetic classes.
            - antismash_id (str | None): The antiSMASH ID.
            - antismash_region (int | None): The antiSMASH region number.
        """
        # Keys are ordered to make the output easier to analyze
        return {
            "GCF_id": [gcf.id for gcf in self.parents if gcf.id is not None],
            "GCF_bigscape_class": [bsc for bsc in self.bigscape_classes if bsc is not None],
            "strain_id": self.strain.id if self.strain is not None else None,
            "description": self.description,
            "BGC_name": self.id,
            "product_prediction": list(self.product_prediction),
            "mibig_bgc_class": self.mibig_bgc_class,
            "antismash_id": self.antismash_id,
            "antismash_region": self.antismash_region,
        }

    def to_tabular(self) -> dict[str, str]:
        """Convert the BGC object to a tabular format.

        Returns:
            dict: A dictionary representing the BGC object in tabular format.
                The keys can be treated as headers and values are strings in which tabs are removed.
                This dict can be exported as a TSV file.
        """
        return {
            key: self._to_string(value).replace("\t", " ")
            for key, value in self.to_dict().items()
        }

    @staticmethod
    def _to_string(value: Any) -> str:
        """Convert various types of values to a string.

        Args:
            value: The value to be converted to a string.
                Can be a list, dict, or any other JSON-compatible type.

        Returns:
            A string representation of the input value.
        """
        # Convert list to comma-separated string
        if isinstance(value, list):
            formatted_value = ", ".join(map(str, value))
        # Convert dict to comma-separated string
        elif isinstance(value, dict):
            formatted_value = ", ".join([f"{k}:{v}" for k, v in value.items()])
        # Convert None to empty string
        elif value is None:
            formatted_value = ""
        # Convert anything else to string

Member:

Do you really want to convert None to ""? Does it make sense for the BGC attributes?

Contributor Author:

If the value is None, the corresponding field in the tabular output file should be left empty. This ensures that when the file is opened in Excel, numeric fields are correctly recognized as numbers rather than text, allowing the columns to be sorted properly. For text fields, leaving them empty is also preferable to displaying None, as it is cleaner and more intuitive.

        else:
            formatted_value = str(value)
        return formatted_value

    # CG: why not providing whole product but only amino acid as product monomer?
    # this property is not used in NPLinker core business.
    @property
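To make the conversion rules discussed in the review thread easier to check, here is a minimal sketch that copies the `_to_string` logic from the diff into a free function and applies the same tab replacement as `to_tabular`. The function names and the sample record are assumptions made for illustration; the record is not real NPLinker data.

```python
from typing import Any


def to_string(value: Any) -> str:
    """Standalone copy of the `_to_string` logic shown in the diff."""
    if isinstance(value, list):
        return ", ".join(map(str, value))
    if isinstance(value, dict):
        return ", ".join(f"{k}:{v}" for k, v in value.items())
    if value is None:
        return ""
    return str(value)


def to_tabular(record: dict[str, Any]) -> dict[str, str]:
    """Stringify every value and replace tabs so a TSV row stays intact."""
    return {key: to_string(value).replace("\t", " ") for key, value in record.items()}


# Hypothetical record, shaped like the output of BGC.to_dict()
record = {
    "GCF_id": ["1", "2"],
    "strain_id": None,
    "description": "example\tdescription",
    "antismash_region": 1,
}
row = to_tabular(record)
print(row)
```

Under these assumptions, the list is joined with commas, `None` becomes an empty field, and the embedded tab in the description is replaced by a space, so the row cannot gain an extra column.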
63 changes: 63 additions & 0 deletions src/nplinker/metabolomics/spectrum.py
@@ -1,6 +1,7 @@
from __future__ import annotations
from functools import cached_property
from typing import TYPE_CHECKING
from typing import Any
import numpy as np
from nplinker.strain import Strain
from nplinker.strain import StrainCollection
@@ -108,3 +109,65 @@ def has_strain(self, strain: Strain) -> bool:
            True when the given strain exists in the spectrum.
        """
        return strain in self.strains

    def to_dict(self) -> dict[str, Any]:
        """Convert the Spectrum object to a dictionary for export.

        Returns:
            A dictionary containing the following key-value pairs:

            - "spectrum_id" (str): The unique identifier of the spectrum.
            - "num_strains_with_spectrum" (int): The number of strains associated with the spectrum.
            - "precursor_mz" (float): The precursor m/z value, rounded to four decimal places.
            - "rt" (float): The retention time, rounded to three decimal places.
            - "molecular_family" (str | None): The identifier of the molecular family.
            - "gnps_id" (str | None): The GNPS identifier.
            - "gnps_annotations" (dict[str, str]): A dictionary of GNPS annotations.
        """
        return {
            "spectrum_id": self.id,
            "num_strains_with_spectrum": len(self.strains),
            "precursor_mz": round(self.precursor_mz, 4),
            "rt": round(self.rt, 3),
            "molecular_family": self.family.id if self.family else None,
            "gnps_id": self.gnps_id,
            "gnps_annotations": self.gnps_annotations,
        }

    def to_tabular(self) -> dict[str, str]:
        """Convert the Spectrum object to a tabular format.

        Returns:
            dict: A dictionary representing the Spectrum object in tabular format.
                The keys can be treated as headers and values are strings in which tabs are removed.
                This dict can be exported as a TSV file.
        """
        return {
            key: self._to_string(value).replace("\t", " ")
            for key, value in self.to_dict().items()
        }

    @staticmethod
    def _to_string(value: Any) -> str:
        """Convert various types of values to a string.

        Args:
            value: The value to be converted to a string.
                Can be a list, dict, or any other JSON-compatible type.

        Returns:
            A string representation of the input value.
        """
        # Convert list to comma-separated string
        if isinstance(value, list):
            formatted_value = ", ".join(map(str, value))
        # Convert dict to comma-separated string
        elif isinstance(value, dict):
            formatted_value = ", ".join([f"{k}:{v}" for k, v in value.items()])
        # Convert None to empty string
        elif value is None:
            formatted_value = ""
        # Convert anything else to string
        else:
            formatted_value = str(value)
        return formatted_value
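Commit 94bcb67 in this PR mentions making sure that 0 and 0.0 are converted to strings rather than to empty values. The sketch below contrasts the explicit `value is None` check used in `_to_string` with a truthiness check. Both function names are assumptions, and `falsy_to_empty` is a hypothetical counter-example written for this note, not code taken from the PR history.

```python
from typing import Any


def none_to_empty(value: Any) -> str:
    """Mirrors the check in the diff: only None becomes an empty field."""
    return "" if value is None else str(value)


def falsy_to_empty(value: Any) -> str:
    """Hypothetical counter-example: a truthiness check also blanks 0 and 0.0."""
    return str(value) if value else ""


for value in (None, 0, 0.0, 1.5):
    print(repr(value), "->", repr(none_to_empty(value)), "vs", repr(falsy_to_empty(value)))
```

This matters for numeric columns such as `rt` or `antismash_region`, where a legitimate zero should stay visible in the exported file.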
41 changes: 41 additions & 0 deletions src/nplinker/nplinker.py
@@ -1,4 +1,5 @@
from __future__ import annotations
import csv
import logging
import pickle
from collections.abc import Sequence
@@ -355,3 +356,43 @@ def save_data(
        data = (self.bgcs, self.gcfs, self.spectra, self.mfs, self.strains, links)
        with open(file, "wb") as f:
            pickle.dump(data, f)

    def objects_to_tsv(self, objects: Sequence[BGC] | Sequence[Spectrum], filename: str) -> None:
        """Export a list of BGC or Spectrum objects to a TSV file.

        Args:
            objects (list): A list of BGC or a list of Spectrum objects to be exported.
            filename (str): The name of the output file, created in the output directory.
        """
        if not objects:
            raise ValueError("No objects provided to export")

        # Ensure all elements in the list are of the same type
        obj_type = type(objects[0])
        if not all(isinstance(obj, obj_type) for obj in objects):
            raise TypeError("All objects in the list must be of the same type")

        with open(self._output_dir / filename, "w", newline="") as outfile:
            headers = objects[0].to_tabular().keys()
            writer = csv.DictWriter(outfile, fieldnames=headers, delimiter="\t")
            writer.writeheader()
            for obj in objects:
                writer.writerow(obj.to_tabular())

    def to_tsv(self, lg: LinkGraph | None = None) -> None:
        """Export data to TSV files.

        This method exports the following data to separate TSV files:

        - BGC objects: `genomics_data.tsv`
        - Spectrum objects: `metabolomics_data.tsv`
        - LinkGraph object (if given): `links.tsv`

        Args:
            lg (LinkGraph | None): An optional LinkGraph object. If provided,
                the links data will be exported to 'links.tsv'.
        """
        self.objects_to_tsv(self.bgcs, "genomics_data.tsv")
        self.objects_to_tsv(self.spectra, "metabolomics_data.tsv")
        if lg is not None:
            lg.to_tsv(self._output_dir / "links.tsv")
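The writing pattern in `objects_to_tsv` can be tried without an NPLinker instance. The following is a minimal sketch that uses `csv.DictWriter` with a tab delimiter on an in-memory buffer instead of a file in the output directory, then reads the result back. The rows are invented sample data shaped like the dicts returned by `to_tabular()`.

```python
import csv
import io

# Invented rows, shaped like the dicts returned by to_tabular()
rows = [
    {"spectrum_id": "spec1", "precursor_mz": "150.1234", "gnps_id": ""},
    {"spectrum_id": "spec2", "precursor_mz": "0.0", "gnps_id": "CCMSLIB0001"},
]

# newline="" leaves line endings to the csv module, as in the diff's open() call
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Read the TSV back to check that the header and empty fields survive
buffer.seek(0)
parsed = list(csv.DictReader(buffer, delimiter="\t"))
print(buffer.getvalue())
```

Because the header is taken from the first row's keys, every row must expose the same keys, which is what the same-type check in `objects_to_tsv` guards against.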