Changes to allow finetuning at scale #12

Closed
wants to merge 12 commits

Conversation

@peterdays (Contributor) commented Oct 24, 2024

  • Updated to Python 3.11
  • Updated to pyproject.toml
  • Created conda environment.yml for reproducibility
  • Data generation:
    • changed strategy of splitting (train, val and test vs train and val)
    • added option to pass a custom function to process the reads
    • added verbose control to finetune_generator
    • added progress bars to data generation
  • Finetuning:
    • added progress bars
    • separated gradient accumulation from evaluation

peterdays and others added 12 commits September 11, 2024 16:52
added environment.yml for conda users;
added minor changes in the bam function;
minor changes and linting
changed tokenization of all data in one go in the dataloader;
reverted finetunedata
updated to py3.11, changes in training loop
major improvements in finetuning loop;
@hanyangii self-assigned this Oct 26, 2024
@hanyangii (Collaborator) left a comment


I made some comments on your changes! Please let me know if you have different opinions.

dmrs["abs_areaStat"] = dmrs["areaStat"].abs()
dmrs = dmrs.sort_values(by="abs_areaStat", ascending=False)
dmrs = dmrs.sort_values(by="areaStat", ascending=False)
Collaborator:

Please keep sorting by the absolute value of areaStat.
'areaStat' (from the DSS package) has a direction: areaStat < 0 means hypomethylated in the targeted phenotype and vice versa. Therefore, if you choose the top-N DMRs based on the raw value, you only get hypermethylated regions in the targeted phenotype.
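For illustration, a minimal pandas sketch of the suggested top-N selection (the dmrs frame and column names follow the snippet above; select_top_dmrs and n_dmrs are hypothetical, not part of the package's API):

```python
import pandas as pd

def select_top_dmrs(dmrs: pd.DataFrame, n_dmrs: int = 100) -> pd.DataFrame:
    # Rank DMRs by the magnitude of areaStat so that both hyper- and
    # hypomethylated regions can be selected, then drop the helper column.
    dmrs = dmrs.copy()
    dmrs["abs_areaStat"] = dmrs["areaStat"].abs()
    dmrs = dmrs.sort_values(by="abs_areaStat", ascending=False)
    return dmrs.head(n_dmrs).drop(columns="abs_areaStat")
```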

dmrs["abs_diff.Methy"] = dmrs["diff.Methy"].abs()
dmrs = dmrs.sort_values(by="abs_diff.Methy", ascending=False)
dmrs = dmrs.sort_values(by="diff.Methy", ascending=False)
Collaborator:

Same comment as line 229.

n_mers: int = 3,
split_ratio: float = 1.0,
split_ratios: List[float] = [0.8, 0.1, 0.1],
Collaborator:

I am not 100% sure about this part... Although this is the standard design for machine learning, I prefer to keep the test data set from other samples (not seen during training) for bioinformatic applications. Even if the sequencing reads are different, they are still related/not independent if they come from the same biological sample. This is why the original version only splits the data into training and validation sets.

Contributor (author):

Hmm, I see! We should make it flexible for both cases. What do you think?
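As an illustration of what "flexible for both cases" could look like, a minimal sketch (the split_reads helper and its handling of split_ratios are assumptions, not the package's actual API):

```python
import random
from typing import List, Tuple

def split_reads(reads: list, split_ratios: List[float], seed: int = 42) -> Tuple[list, ...]:
    # Accept either [train, val] ratios (keeping the test set on held-out
    # samples, as in the original design) or [train, val, test] ratios.
    assert len(split_ratios) in (2, 3) and abs(sum(split_ratios) - 1.0) < 1e-6
    reads = reads.copy()
    random.Random(seed).shuffle(reads)
    splits, start = [], 0
    for ratio in split_ratios[:-1]:
        end = start + int(len(reads) * ratio)
        splits.append(reads[start:end])
        start = end
    splits.append(reads[start:])  # the last split takes the remainder
    return tuple(splits)

# e.g. train, val = split_reads(reads, [0.9, 0.1])
#      train, val, test = split_reads(reads, [0.8, 0.1, 0.1])
```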

@@ -16,18 +16,19 @@ bioRxiv 2023.10.29.564590; doi: https://doi.org/10.1101/2023.10.29.564590
## Installation
_MethylBERT_ runs most stably with __Python=3.9__
Collaborator:

If you updated the package to Python 3.11, this might need to be changed! Or did you test it with 3.9 as well?

Contributor (author):

No, you are right! I've only worked with Python 3.11.

> methylbert
MethylBERT v2.0.0
> methylbert
MethylBERT v2.0.1
Collaborator:

I already updated methylbert to v2.0.1 after fixing some bugs. Please change it to '2.0.2'.

@@ -0,0 +1,40 @@
[project]
name = "methylbert"
version = "1.0.0"
Collaborator:

This version needs to be fixed!

methyl_caller: str = "bismark"
methyl_caller: str = "bismark",
verbose: int = 2,
read_extract_sequences_func: Optional[callable] = None
Collaborator:

Is this an option for passing your own function?

Contributor (author):

Exactly! I find it more user-friendly this way: one can use a custom read-extraction function, different from the original one, without needing to fork the package.
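To illustrate the intent, a minimal sketch of such a plug-in (the signature and the call below are assumptions about how read_extract_sequences_func would be invoked, not the package's documented API):

```python
import pysam

def my_read_extract_sequences(read: pysam.AlignedSegment) -> str:
    # Hypothetical custom processing: keep only reads that pass a
    # mapping-quality filter and return their aligned sequence.
    if read.mapping_quality < 20:
        return ""
    return read.query_sequence or ""

# Hypothetical usage: pass the function instead of forking the package, e.g.
# finetune_generator(..., read_extract_sequences_func=my_read_extract_sequences, verbose=2)
```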


self.step += 1
duration=0
global_step_loss = 0
Collaborator:

I am a bit confused here. Wouldn't global_step_loss need to be reset when the gradient accumulation step is reached? I think this should be under "if (local_step+1) % self._config.gradient_accumulation_steps == 0". Maybe I should add one more line to divide global_step_loss by gradient_accumulation_steps before recording the performance in the f_perform file.
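A minimal sketch of the accumulation pattern being discussed (variable names follow the snippet above; train_epoch and record_performance are hypothetical stand-ins for the trainer's loop and the f_perform logging):

```python
def train_epoch(model, dataloader, optimizer, config, record_performance):
    """Hypothetical finetuning loop illustrating where the running loss is reset."""
    global_step_loss, step = 0.0, 0
    for local_step, batch in enumerate(dataloader):
        loss = model(batch)  # forward pass returning the loss
        (loss / config.gradient_accumulation_steps).backward()
        global_step_loss += loss.item()

        if (local_step + 1) % config.gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            # Average over the accumulated micro-batches before recording
            # (the division mentioned above), then reset the running loss.
            record_performance(step, global_step_loss / config.gradient_accumulation_steps)
            global_step_loss = 0.0
```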

Contributor (author):

You are right! I've only worked with gradient accumulation of 1, so I didn't notice that. It's up to you; I don't find it necessary to add that division 🤔

@hanyangii (Collaborator):
Merged to the main branch after fixing some bugs!

@hanyangii closed this Dec 10, 2024