Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove filtering of common strains during loading #200

Merged
merged 1 commit into from
Jan 24, 2024

Conversation

CunliangGeng
Copy link
Member

@CunliangGeng CunliangGeng commented Dec 12, 2023

The loading part should return all relevant strains to scoring part. And then users should do the filtering of strains in the scoring part. So removing the method _filter_only_common_strains from loading process.

The upcoming PRs will add the filtering of common strains in the scoring part.

@CunliangGeng CunliangGeng requested a review from gcroci2 December 12, 2023 15:49
@CunliangGeng CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from db67ce5 to 9b08108 Compare December 14, 2023 08:33
@CunliangGeng CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from 666f018 to 827ae43 Compare December 14, 2023 08:34
@CunliangGeng CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from 9b08108 to 256692a Compare December 14, 2023 10:00
@CunliangGeng CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from 827ae43 to 0d42e94 Compare December 14, 2023 10:01
@CunliangGeng CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from 256692a to f7bd16d Compare December 19, 2023 13:21
@CunliangGeng CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from 0d42e94 to a4be9e2 Compare December 19, 2023 13:21
@CunliangGeng CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from f7bd16d to ee755e0 Compare December 19, 2023 13:38
@CunliangGeng CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from a4be9e2 to 7794618 Compare December 19, 2023 13:38
@CunliangGeng CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from ee755e0 to 8517df0 Compare December 20, 2023 11:53
@CunliangGeng CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from 7794618 to a810870 Compare December 20, 2023 11:53
@CunliangGeng CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from 8517df0 to 0f4bc7c Compare December 20, 2023 12:24
@CunliangGeng CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from a810870 to 88d39a5 Compare December 20, 2023 12:24
Copy link
Contributor

@gcroci2 gcroci2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something here, but if a strain is not relevant (i.e., it's not present in both genomic and metabolomic data), what is the use case in which it could be used for the scoring part anyway? Which means, how can a link be present (and scored) at all in such cases?

@justinjjvanderhooft
Copy link

Good point @gcroci2, I think it depends on how we organize the loading of the strains - and the matching. This is one of the most critical steps as so far, most of the issues arise from this. @CunliangGeng, what is your current idea? First load everything and then ensure there are matches in a second step, before moving on to the scoring?

@CunliangGeng
Copy link
Member Author

@gcroci2 @justinjjvanderhooft If a strain is not a common strain, it could be used by only genomics or only metabolomics or neither. The current workflow will load everything and pass them to scoring part, users then set what strains to use in the scoring part, e.g. there would be a parameter in scoring like use_only_common_strains and/or filter_specified_strains...

@justinjjvanderhooft
Copy link

@CunliangGeng I think I can follow your logic, as in, we load all that is there, and then prior to scoring we let the user choose/assess what will be considered....is that how you envisioned it? I think what @gcroci2 and myself are referring to is that, for any "linking score" to work, we need at least 1 (and preferably more!) "common" strains between the genome and metabolome data that is uploaded. For the PoDP data, this is not an issue, as they are matched through the PoDP metadata. For the local data (where many users will start with), what do we currently have in mind? Upon preparing the data, do a validation that at least one strain label is matching (common) between the genome and metabolome data? And/or give an overview of common and "not-matching" strains (labels) prior to calculating the score(s). It is good that you raise this point. I am happy to follow your advice/logic @CunliangGeng, as long as it is clear to the user in an early stage how many matching (common) strains there are available for any scoring. Whilst it is still unclear to me now what we can do with the not matching strain labels, I can think of enough use cases (i.e., streamlined loading of genome data for networking, etc.) - is that what you had in mind @CunliangGeng?

@CunliangGeng
Copy link
Member Author

@justinjjvanderhooft @gcroci2
Here are my answers to Justin's questions:

If users set it to use common strains for scoring and actually there is no common strains, then it should raise an warning/error.

We can provide methods in class NPLinker to check if common strains exist or not, or to check the number of common/non-common strains.

For the local data (where many users will start with), what do we currently have in mind? Upon preparing the data, do a validation that at least one strain label is matching (common) between the genome and metabolome data?

For the local data mode, users have to manually prepare the strain mappings and ensure the strain id has alias for both genomics and metabolomics data. After loading the strain mappings (and also other data), users can use the new method in NPLinker class to check the number of different types of strains.

Whilst it is still unclear to me now what we can do with the not matching strain labels, I can think of enough use cases (i.e., streamlined loading of genome data for networking, etc.) - is that what you had in mind

For the data like BGC that do not have a matched strain, they won't be passed to scoring stage. In the loading stage, the number of matched/unmatched data will be reported in the log.

@justinjjvanderhooft
Copy link

Thanks for your answers @gcroci2! I think it may also be a "semantic" confusion, but I think I can follow the logic now. Let's get back to this in an online / in person meeting to clarify everything :-)

@CunliangGeng
Copy link
Member Author

@gcroci2 Please explicitly approve the PR if it looks good. Without approval from at least one reviewer, the PR cannot be merged :-)

@gcroci2 gcroci2 self-requested a review January 24, 2024 15:07
Copy link
Member Author

CunliangGeng commented Jan 24, 2024

Merge activity

@CunliangGeng CunliangGeng force-pushed the 12-12-remove_loading_optional_data branch from 0f4bc7c to f0235cb Compare January 24, 2024 15:25
Base automatically changed from 12-12-remove_loading_optional_data to dev January 24, 2024 15:27
The loading part should return all relevant strains to scoring part. It should allow users to specify if they want to filter common strains or not in the scoring part (to be implemented in the class `NPLinker`).
@CunliangGeng CunliangGeng force-pushed the 12-12-remove-filtering-common-strains branch from 88d39a5 to d1adc66 Compare January 24, 2024 15:28
@CunliangGeng CunliangGeng merged commit 4de30e3 into dev Jan 24, 2024
1 of 2 checks passed
@CunliangGeng CunliangGeng deleted the 12-12-remove-filtering-common-strains branch January 24, 2024 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants