Enable automatic dataset splitting #126

Open
SebieF opened this issue Dec 2, 2024 · 0 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

Collaborator

SebieF commented Dec 2, 2024

In addition to providing the dataset splits via the input FASTA files, it would be nice to allow automatic dataset splitting within the biotrainer pipeline. The splitting should not be purely random: it should also check for sequence similarity within the dataset splits, using mmseqs.

Steps include:

  • Create new config file option(s) for automatic dataset splitting
  • Add a step in the pipeline where the splitting is done
  • Run mmseqs and split afterwards
  • Provide feedback in the out.yml file and the logging about the splits
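
The splitting step could be sketched roughly as follows. This is a minimal illustration, not biotrainer's implementation: it assumes mmseqs clustering has already been run and has produced a two-column, tab-separated file (cluster representative, member ID) in which each representative is also listed as its own member. The function names `parse_cluster_tsv` and `split_by_cluster`, the split names, and the default fractions are all hypothetical.

```python
import random
from collections import defaultdict


def parse_cluster_tsv(lines):
    """Group member IDs by cluster representative.

    Assumes each line holds two tab-separated columns:
    representative ID, member ID.
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return dict(clusters)


def split_by_cluster(clusters, fractions=(0.8, 0.1, 0.1), seed=42):
    """Assign whole clusters to train/val/test.

    Keeping a cluster intact means similar sequences never end up
    in two different splits. Fractions are targets, not guarantees.
    """
    names = ("train", "val", "test")  # hypothetical split names
    total = sum(len(members) for members in clusters.values())
    targets = {name: frac * total for name, frac in zip(names, fractions)}
    splits = {name: [] for name in names}

    # Shuffle reproducibly, then place the largest clusters first.
    # The sort is stable, so ties keep their shuffled order.
    reps = sorted(clusters)
    random.Random(seed).shuffle(reps)
    reps.sort(key=lambda rep: len(clusters[rep]), reverse=True)

    for rep in reps:
        # Put the cluster into the split furthest below its target size.
        name = max(names, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(clusters[rep])
    return splits
```

The per-split counts, for example `{name: len(ids) for name, ids in splits.items()}`, are the kind of summary that could then be written to the log and to out.yml. Assigning whole clusters rather than single sequences is the design choice that distinguishes this from a pure random split.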
SebieF added the enhancement and good first issue labels on Dec 2, 2024