Changes to allow finetuning at scale #12

Closed
wants to merge 12 commits

Conversation

@peterdays (Contributor) commented Oct 24, 2024

  • Updated to Python 3.11
  • Updated to pyproject.toml
  • Created conda environment.yml for reproducibility
  • Data generation:
    • changed strategy of splitting (train, val and test vs train and val)
    • added option to pass a custom function to process the reads
    • added verbose control to finetune_generator
    • added progress bars to data generation
  • Finetuning:
    • added progress bars
    • separated gradient accumulation from evaluation

peterdays and others added 12 commits September 11, 2024 16:52
added environment.yml for conda users;
added minor changes in the bam function;
minor changes and linting
changed tokenization of all data in one go in the dataloader;
reverted finetunedata
updated to py3.11, changes in training loop
major improvements in finetuning loop;
@hanyangii self-assigned this Oct 26, 2024
@hanyangii (Collaborator) left a comment


I made some comments on your changes! Please let me know if you have different opinions.

dmrs["abs_areaStat"] = dmrs["areaStat"].abs()
dmrs = dmrs.sort_values(by="abs_areaStat", ascending=False)
dmrs = dmrs.sort_values(by="areaStat", ascending=False)
Collaborator:

Please keep sorting by the absolute value of areaStat.
'areaStat' (from the DSS package) has a direction: areaStat < 0 means hypomethylated in the targeted phenotype and vice versa. Therefore, if you choose the top-N DMRs based on the raw value, you only get hypermethylated regions in the targeted phenotype.
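For illustration, a minimal pandas sketch of the suggested top-N selection (the dmrs frame and column names follow the snippet above; select_top_dmrs and n_dmrs are hypothetical, not part of the package's API):

```python
import pandas as pd

def select_top_dmrs(dmrs: pd.DataFrame, n_dmrs: int = 100) -> pd.DataFrame:
    # Rank DMRs by the magnitude of areaStat so that both hyper- and
    # hypomethylated regions can be selected, then drop the helper column.
    dmrs = dmrs.copy()
    dmrs["abs_areaStat"] = dmrs["areaStat"].abs()
    dmrs = dmrs.sort_values(by="abs_areaStat", ascending=False)
    return dmrs.head(n_dmrs).drop(columns="abs_areaStat")
```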

dmrs["abs_diff.Methy"] = dmrs["diff.Methy"].abs()
dmrs = dmrs.sort_values(by="abs_diff.Methy", ascending=False)
dmrs = dmrs.sort_values(by="diff.Methy", ascending=False)
Collaborator:

Same comment as line 229.

n_mers: int = 3,
split_ratio: float = 1.0,
split_ratios: List[float] = [0.8, 0.1, 0.1],
Collaborator:

I am not 100% sure about this part... Although this is the standard design for machine learning, I prefer to keep the test data set from other samples (not seen during training) for bioinformatic applications. Even if the sequencing reads are different, they are still related/not independent if they come from the same biological sample. This is why the original version only splits the data into training and validation sets.

Contributor (author):

Hmm, I see! We should make it flexible for both cases. What do you think?
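As an illustration of what "flexible for both cases" could look like, a minimal sketch (the split_reads helper and its handling of split_ratios are assumptions, not the package's actual API):

```python
import random
from typing import List, Tuple

def split_reads(reads: list, split_ratios: List[float], seed: int = 42) -> Tuple[list, ...]:
    # Accept either [train, val] ratios (keeping the test set on held-out
    # samples, as in the original design) or [train, val, test] ratios.
    assert len(split_ratios) in (2, 3) and abs(sum(split_ratios) - 1.0) < 1e-6
    reads = reads.copy()
    random.Random(seed).shuffle(reads)
    splits, start = [], 0
    for ratio in split_ratios[:-1]:
        end = start + int(len(reads) * ratio)
        splits.append(reads[start:end])
        start = end
    splits.append(reads[start:])  # the last split takes the remainder
    return tuple(splits)

# e.g. train, val = split_reads(reads, [0.9, 0.1])
#      train, val, test = split_reads(reads, [0.8, 0.1, 0.1])
```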

@@ -16,18 +16,19 @@ bioRxiv 2023.10.29.564590; doi: https://doi.org/10.1101/2023.10.29.564590
## Installation
_MethylBERT_ runs most stably with __Python=3.9__
Collaborator:

If you updated the package to Python 3.11, this might need to be changed! Or did you test it with 3.9 as well?

Contributor (author):

No, you are right! I've only worked with Python 3.11.

> methylbert
MethylBERT v2.0.0
> methylbert
MethylBERT v2.0.1
Collaborator:

I already updated methylbert to v2.0.1 after fixing some bugs. Please change it to '2.0.2'.

@@ -0,0 +1,40 @@
[project]
name = "methylbert"
version = "1.0.0"
Collaborator:

This version needs to be fixed!

methyl_caller: str = "bismark"
methyl_caller: str = "bismark",
verbose: int = 2,
read_extract_sequences_func: Optional[callable] = None
Collaborator:

Is this an option for passing your own function?

Contributor (author):

Exactly! I find it more user-friendly this way: one can use a custom read-extraction function, different from the original one, without needing to fork the package.
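To illustrate the intent, a minimal sketch of such a plug-in (the signature and the call below are assumptions about how read_extract_sequences_func would be invoked, not the package's documented API):

```python
import pysam

def my_read_extract_sequences(read: pysam.AlignedSegment) -> str:
    # Hypothetical custom processing: keep only reads that pass a
    # mapping-quality filter and return their aligned sequence.
    if read.mapping_quality < 20:
        return ""
    return read.query_sequence or ""

# Hypothetical usage: pass the function instead of forking the package, e.g.
# finetune_generator(..., read_extract_sequences_func=my_read_extract_sequences, verbose=2)
```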


self.step += 1
duration=0
global_step_loss = 0
Collaborator:

I am a bit confused here. Wouldn't global_step_loss need to be reset when the gradient accumulation step is reached? I think this should be under "if (local_step+1) % self._config.gradient_accumulation_steps == 0". Maybe I should add one more line to divide global_step_loss by gradient_accumulation_steps before recording the performance in the f_perform file.
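A minimal sketch of the accumulation pattern being discussed (variable names follow the snippet above; train_epoch and record_performance are hypothetical stand-ins for the trainer's loop and the f_perform logging):

```python
def train_epoch(model, dataloader, optimizer, config, record_performance):
    """Hypothetical finetuning loop illustrating where the running loss is reset."""
    global_step_loss, step = 0.0, 0
    for local_step, batch in enumerate(dataloader):
        loss = model(batch)  # forward pass returning the loss
        (loss / config.gradient_accumulation_steps).backward()
        global_step_loss += loss.item()

        if (local_step + 1) % config.gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            # Average over the accumulated micro-batches before recording
            # (the division mentioned above), then reset the running loss.
            record_performance(step, global_step_loss / config.gradient_accumulation_steps)
            global_step_loss = 0.0
```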

Contributor (author):

You are right! I've only worked with gradient accumulation of 1, so I didn't notice that. It's up to you; I don't find it necessary to add that division 🤔

@hanyangii (Collaborator):
Merged to the main branch after fixing some bugs!

@hanyangii closed this Dec 10, 2024