add class DatasetArranger #215

Merged
merged 12 commits into dev from add_dataset_arranger on Mar 5, 2024

Conversation

CunliangGeng
Member

@CunliangGeng CunliangGeng commented Mar 1, 2024

This is a big PR that implements the data arranging pipelines, enabling the local and podp modes.

Arranging data means

  • creating data folders in the root_dir
  • downloading the dataset if needed (e.g. for podp mode)
  • validating the dataset, whether downloaded or provided by users

In short, it covers all steps needed to make the data ready for loading.

The pipelines for arranging the different types of data are shown in the diagram in #117.

To keep the data arranging workflow simple, we use a fixed project directory structure (see #163) with fixed directory and file names (see globals.py); a rough sketch of the folder creation follows below.
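
As an illustration of what "creating data folders in the root_dir" amounts to, here is a minimal sketch; the sub-directory names used here are assumptions, and the real directory and file names are defined in globals.py.

```python
# Hedged sketch of creating fixed data folders under root_dir.
# The sub-directory names below are assumptions for illustration only;
# the real directory and file names live in nplinker's globals.py.
from pathlib import Path

ASSUMED_SUBDIRS = ["downloads", "gnps", "antismash", "bigscape"]

def create_data_dirs(root_dir: str) -> None:
    """Create the fixed data folders inside an existing project root directory."""
    root = Path(root_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"root_dir must be created by the user first: {root}")
    for name in ASSUMED_SUBDIRS:
        (root / name).mkdir(exist_ok=True)
```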

To use nplinker, users are required to

  • create a root_dir manually and use it as the root directory of the nplinker project
  • provide a config file nplinker.toml and put it in the root_dir
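
The two steps above could look roughly like the following sketch; the config keys `root_dir` and `mode` are assumptions used only for illustration, not a statement of the actual nplinker.toml schema.

```python
# Hedged setup sketch for the two user steps above.
# "root_dir" and "mode" are assumed config keys; check the nplinker docs
# for the real nplinker.toml schema.
from pathlib import Path

root_dir = Path.home() / "nplinker_project"
root_dir.mkdir(parents=True, exist_ok=True)                  # step 1: create the root_dir

config_text = f'root_dir = "{root_dir}"\nmode = "local"\n'   # assumed keys
(root_dir / "nplinker.toml").write_text(config_text)         # step 2: provide the config file
```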

Major changes

  • Added arranger.py, which contains the class DatasetArranger and some validation functions implementing the data arranging pipelines (see the interface sketch after this list)

  • Cleaned/removed/updated some files to make the arranger work (some may need further refactoring in future PRs)

    • Cleaned runbigscape.py
    • Deleted downloader.py and its tests; its functionality is replaced by DatasetArranger
    • Updated loader.py and nplinker.py to use DatasetArranger
  • Added integration tests for the arranger (tests passed)

    • Created nplinker_local_mode.toml
    • Updated tests/conftest.py
    • Updated test_nplinker_local.py to test the local mode
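
For reference, here is a rough sketch of the kind of interface DatasetArranger provides, inferred from the description above; only the class name comes from this PR, and the method and attribute names are assumptions.

```python
# Hedged sketch of a DatasetArranger-style interface; method and attribute
# names are assumptions, only the class name is taken from this PR.
class DatasetArranger:
    def __init__(self, config: dict):
        self.config = config  # parsed nplinker.toml (e.g. mode, root_dir)

    def arrange(self) -> None:
        """Run the arranging pipeline: create folders, download if needed, validate."""
        self._create_dirs()                    # fixed folders under root_dir
        if self.config.get("mode") == "podp":
            self._download()                   # only the podp mode downloads data
        self._validate()                       # validate downloaded or user-provided data

    def _create_dirs(self) -> None: ...
    def _download(self) -> None: ...
    def _validate(self) -> None: ...
```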

Tests for the podp mode also passed on my local machine. Due to the cost of running BiG-SCAPE, those tests will be added to the codebase in follow-up PRs.

@CunliangGeng CunliangGeng requested a review from gcroci2 March 1, 2024 13:56
@CunliangGeng CunliangGeng self-assigned this Mar 1, 2024
Base automatically changed from use_git_large_file_storage to dev March 1, 2024 14:59
- Add class `DatasetArranger`
- Add dataset validation functions `validate_gnps`, `validate_antismash` and `validate_bigscape`
- Remove function `podp_run_bigscape`
- Update function `run_bigscape`
Remove invalid steps
@CunliangGeng CunliangGeng force-pushed the add_dataset_arranger branch from f7ab6ce to 5400394 Compare March 1, 2024 15:35
This was referenced Mar 5, 2024
Contributor

@gcroci2 gcroci2 left a comment

Great refactor :D

src/nplinker/arranger.py (outdated review thread, resolved)
"DatasetLoader({}, {}, {})".format(self._root, self.dataset_id, self._remote_loading)
)

def __repr__(self):
gcroci2 (Contributor)

Are you sure we don't want to implement this anymore?

CunliangGeng (Member, Author)

I do not get which part of the code you're talking about. loader.py still needs further refactoring; in this PR I just cleaned it up enough to make the arranger work.

gcroci2 (Contributor)

I meant the `__repr__` magic method.

CunliangGeng (Member, Author)

It's still there; I did not remove it. GitHub does not render the changes correctly...

This is a workaround to solve the issues in `tests/conftest.py`: it copies the example data for each process if multi-process testing is enabled
@CunliangGeng
Member Author

CunliangGeng commented Mar 5, 2024

> Great refactor :D

I talked about that at the sprint meeting last week ;-) Two ways:

  1. GitHub supports it natively; if you check the raw data of that issue (by editing the issue), you will see how it works.
  2. a very good live editor, https://mermaid.live/ (see the diagram)

I first used the live editor and then copied the raw code to GitHub, which renders it as a diagram automatically.

> I haven't run local tests myself, should I?

Not necessary for this PR.
I just pushed a new commit to stop the tests from running in parallel. This is a workaround to stop `tests/conftest.py` from copying data for every process when parallel testing is enabled. I'm planning to separate the unit tests from the integration tests, so the parallel copying issue should be properly fixed then.
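
For context, this is roughly what such a data-copying fixture in `tests/conftest.py` could look like; the fixture name and example-data path are assumptions, not the actual test code.

```python
# Hedged sketch of a session-scoped fixture that copies example data into a
# temporary root_dir. With pytest-xdist each worker process builds its own
# session, so the copy runs once per worker; disabling parallel runs avoids
# the repeated copying until unit and integration tests are separated.
import shutil
from pathlib import Path

import pytest

@pytest.fixture(scope="session")
def local_mode_root(tmp_path_factory) -> Path:
    """Copy the bundled example dataset into a fresh root_dir for the tests."""
    root = tmp_path_factory.mktemp("nplinker_root")
    shutil.copytree("tests/data", root, dirs_exist_ok=True)  # assumed example-data path
    return root
```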

Member Author

CunliangGeng commented Mar 5, 2024

Merge activity

@CunliangGeng CunliangGeng merged commit f6ac5a3 into dev Mar 5, 2024
3 of 4 checks passed
@CunliangGeng CunliangGeng deleted the add_dataset_arranger branch March 5, 2024 15:39
@CunliangGeng CunliangGeng linked an issue Mar 5, 2024 that may be closed by this pull request
@gcroci2
Contributor

gcroci2 commented Mar 7, 2024

> Great refactor :D
>
> I talked about that at the sprint meeting last week ;-) Two ways:
>
> 1. GitHub supports it natively; if you check the raw data of that issue (by editing the issue), you will see how it works.
> 2. a very good live editor, https://mermaid.live/ (see the diagram)

Nice, thanks! I was looking for something like option 2.

Projects: Status: Done
Development
Successfully merging this pull request may close these issues: PODP mode and local data mode

2 participants