Enable automatic dataset splitting #126

SebieF · 2024-12-02T14:11:24Z

In addition to accepting predefined dataset splits via the input FASTA files, it would be nice to support automatic dataset splitting within the biotrainer pipeline. The splits should not be purely random: the splitting should also include sequence-similarity checks on the resulting splits, using mmseqs.

Steps include:

- Create new config file option(s) for automatic dataset splitting
- Add a step in the pipeline where the splitting is done
- Run mmseqs and split afterwards
- Provide feedback about the splits in the out.yml file and in the logging
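
To make the "run mmseqs and split afterwards" step concrete, here is a minimal sketch of one possible approach, not biotrainer's actual implementation. It assumes clustering with `mmseqs easy-cluster`, which writes a two-column `<prefix>_cluster.tsv` (representative, member), and then assigns whole clusters to splits so that sequences from the same cluster never land in different sets. The function names, the default fractions, and the 0.3 identity threshold are all hypothetical choices for illustration.

```python
import random
import subprocess
from collections import defaultdict
from pathlib import Path


def run_mmseqs_clustering(fasta_path, out_prefix, tmp_dir, min_seq_id=0.3):
    """Cluster sequences with `mmseqs easy-cluster` (requires mmseqs on PATH).

    Returns the path of the representative/member TSV that mmseqs writes.
    The identity threshold is an illustrative default, not a recommendation.
    """
    subprocess.run(
        ["mmseqs", "easy-cluster", str(fasta_path), str(out_prefix),
         str(tmp_dir), "--min-seq-id", str(min_seq_id)],
        check=True,
    )
    return Path(f"{out_prefix}_cluster.tsv")


def read_clusters(tsv_path):
    """Parse the cluster TSV into {representative_id: [member_ids]}."""
    clusters = defaultdict(list)
    with open(tsv_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            representative, member = line.split("\t")
            clusters[representative].append(member)
    return dict(clusters)


def split_by_cluster(clusters, fractions=(0.8, 0.1, 0.1), seed=42):
    """Assign whole clusters to train/val/test, approximating `fractions`."""
    names = ("train", "val", "test")
    total = sum(len(members) for members in clusters.values())
    targets = {name: frac * total for name, frac in zip(names, fractions)}
    splits = {name: [] for name in names}

    # Shuffle for reproducible randomness, then place large clusters first
    # (the stable sort keeps the shuffled order among equally sized clusters).
    representatives = sorted(clusters)
    random.Random(seed).shuffle(representatives)
    representatives.sort(key=lambda rep: len(clusters[rep]), reverse=True)

    for rep in representatives:
        # Greedy choice: the split that is currently furthest below its target.
        name = max(names, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(clusters[rep])
    return splits
```

Splitting at the cluster level is a greedy approximation: the resulting set sizes only roughly match the requested fractions, especially when a few clusters are very large. The per-split counts returned here are the kind of information that could be reported in out.yml and the log.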

heispv · 2024-12-06T10:23:42Z

I was thinking of working on this since I've been working with the project's configuration recently. What do you think? :)

SebieF added the labels "enhancement" (New feature or request) and "good first issue" (Good for newcomers) · Dec 2, 2024
