add class DatasetArranger #215

Merged
merged 12 commits into dev from add_dataset_arranger on Mar 5, 2024

Conversation

CunliangGeng
Member

@CunliangGeng CunliangGeng commented Mar 1, 2024

This is a big PR that implements the data arranging pipelines, enabling the local and podp modes.

Arranging data means

  • creating data folders in the root_dir
  • downloading the dataset if needed (e.g. for podp mode)
  • validating the dataset, whether downloaded or provided by users

In short, it covers all steps needed to make the data ready for loading.

The pipelines for arranging the different types of data are shown in the diagram in #117.

To keep the data arranging workflow simple, we use a fixed project directory structure (see #163) with fixed directory and file names (see globals.py); a rough sketch of the folder creation follows below.
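
As an illustration of what "creating data folders in the root_dir" amounts to, here is a minimal sketch; the sub-directory names used here are assumptions, and the real directory and file names are defined in globals.py.

```python
# Hedged sketch of creating fixed data folders under root_dir.
# The sub-directory names below are assumptions for illustration only;
# the real directory and file names live in nplinker's globals.py.
from pathlib import Path

ASSUMED_SUBDIRS = ["downloads", "gnps", "antismash", "bigscape"]

def create_data_dirs(root_dir: str) -> None:
    """Create the fixed data folders inside an existing project root directory."""
    root = Path(root_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"root_dir must be created by the user first: {root}")
    for name in ASSUMED_SUBDIRS:
        (root / name).mkdir(exist_ok=True)
```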

To use nplinker, users are required to

  • create a root_dir manually and use it as the root directory of the nplinker project
  • provide a config file nplinker.toml and put it in the root_dir
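
The two steps above could look roughly like the following sketch; the config keys `root_dir` and `mode` are assumptions used only for illustration, not a statement of the actual nplinker.toml schema.

```python
# Hedged setup sketch for the two user steps above.
# "root_dir" and "mode" are assumed config keys; check the nplinker docs
# for the real nplinker.toml schema.
from pathlib import Path

root_dir = Path.home() / "nplinker_project"
root_dir.mkdir(parents=True, exist_ok=True)                  # step 1: create the root_dir

config_text = f'root_dir = "{root_dir}"\nmode = "local"\n'   # assumed keys
(root_dir / "nplinker.toml").write_text(config_text)         # step 2: provide the config file
```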

Major changes

  • Added arranger.py, which contains the class DatasetArranger and some validation functions implementing the data arranging pipelines (see the interface sketch after this list)

  • Cleaned/removed/updated some files to make the arranger work (some may need further refactoring in future PRs)

    • Cleaned runbigscape.py
    • Deleted downloader.py and its tests; its functionality is replaced by DatasetArranger
    • Updated loader.py and nplinker.py to use DatasetArranger
  • Added integration tests for the arranger (tests passed)

    • Created nplinker_local_mode.toml
    • Updated tests/conftest.py
    • Updated test_nplinker_local.py to test the local mode
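
For reference, here is a rough sketch of the kind of interface DatasetArranger provides, inferred from the description above; only the class name comes from this PR, and the method and attribute names are assumptions.

```python
# Hedged sketch of a DatasetArranger-style interface; method and attribute
# names are assumptions, only the class name is taken from this PR.
class DatasetArranger:
    def __init__(self, config: dict):
        self.config = config  # parsed nplinker.toml (e.g. mode, root_dir)

    def arrange(self) -> None:
        """Run the arranging pipeline: create folders, download if needed, validate."""
        self._create_dirs()                    # fixed folders under root_dir
        if self.config.get("mode") == "podp":
            self._download()                   # only the podp mode downloads data
        self._validate()                       # validate downloaded or user-provided data

    def _create_dirs(self) -> None: ...
    def _download(self) -> None: ...
    def _validate(self) -> None: ...
```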

Tests for the podp mode also passed on my local machine. Due to the cost of running BiG-SCAPE, those tests will be added to the codebase in follow-up PRs.

@CunliangGeng CunliangGeng requested a review from gcroci2 March 1, 2024 13:56
@CunliangGeng CunliangGeng self-assigned this Mar 1, 2024
Base automatically changed from use_git_large_file_storage to dev March 1, 2024 14:59
- Add class `DatasetArranger`
- Add dataset validation functions `validate_gnps`, `validate_antismash` and `validate_bigscape`
- Remove function `podp_run_bigscape`
- Update function `run_bigscape`
Remove invalid steps
@CunliangGeng CunliangGeng force-pushed the add_dataset_arranger branch from f7ab6ce to 5400394 Compare March 1, 2024 15:35
This was referenced Mar 5, 2024
Contributor

@gcroci2 gcroci2 left a comment

Great refactor :D

src/nplinker/arranger.py (outdated review thread, resolved)
"DatasetLoader({}, {}, {})".format(self._root, self.dataset_id, self._remote_loading)
)

def __repr__(self):
gcroci2 (Contributor)

Are you sure we don't want to implement this anymore?

CunliangGeng (Member, Author)

I do not get which part of the code you're talking about. loader.py still needs further refactoring; in this PR I just cleaned it up enough to make the arranger work.

gcroci2 (Contributor)

I meant the `__repr__` magic method.

CunliangGeng (Member, Author)

It's still there; I did not remove it. GitHub does not render the changes correctly...

This is a workaround to solve the issues in `tests/conftest.py`: it copies the example data for each process if multi-process testing is enabled
@CunliangGeng
Member Author

CunliangGeng commented Mar 5, 2024

> Great refactor :D

I talked about that at the sprint meeting last week ;-) Two ways:

  1. GitHub supports it natively; if you check the raw data of that issue (by editing the issue), you will see how it works.
  2. a very good live editor, https://mermaid.live/ (see the diagram)

I first used the live editor and then copied the raw code to GitHub, which renders it as a diagram automatically.

> I haven't run local tests myself, should I?

Not necessary for this PR.
I just pushed a new commit to stop the tests from running in parallel. This is a workaround to stop `tests/conftest.py` from copying data for every process when parallel testing is enabled. I'm planning to separate the unit tests from the integration tests, so the parallel copying issue should be properly fixed then.
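
For context, this is roughly what such a data-copying fixture in `tests/conftest.py` could look like; the fixture name and example-data path are assumptions, not the actual test code.

```python
# Hedged sketch of a session-scoped fixture that copies example data into a
# temporary root_dir. With pytest-xdist each worker process builds its own
# session, so the copy runs once per worker; disabling parallel runs avoids
# the repeated copying until unit and integration tests are separated.
import shutil
from pathlib import Path

import pytest

@pytest.fixture(scope="session")
def local_mode_root(tmp_path_factory) -> Path:
    """Copy the bundled example dataset into a fresh root_dir for the tests."""
    root = tmp_path_factory.mktemp("nplinker_root")
    shutil.copytree("tests/data", root, dirs_exist_ok=True)  # assumed example-data path
    return root
```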

Member Author

CunliangGeng commented Mar 5, 2024

Merge activity

@CunliangGeng CunliangGeng merged commit f6ac5a3 into dev Mar 5, 2024
3 of 4 checks passed
@CunliangGeng CunliangGeng deleted the add_dataset_arranger branch March 5, 2024 15:39
@CunliangGeng CunliangGeng linked an issue Mar 5, 2024 that may be closed by this pull request
@gcroci2
Contributor

gcroci2 commented Mar 7, 2024

> Great refactor :D
>
> I talked about that at the sprint meeting last week ;-) Two ways:
>
> 1. GitHub supports it natively; if you check the raw data of that issue (by editing the issue), you will see how it works.
> 2. a very good live editor, https://mermaid.live/ (see the diagram)

Nice, thanks! I was looking for something like option 2.

Projects: Status: Done
Development
Successfully merging this pull request may close these issues: PODP mode and local data mode

2 participants