Good afternoon!
Thank you for open-sourcing such fantastic work!
I have been trying to fine-tune BioGPT on a subset of textual data to give it more knowledge of a specific domain.
However, after training the model for one epoch on a very small subset of PubMed abstracts, it loses the ability to generate coherent English and seems to output just random words. Would you be able to provide insight into why the model degrades so quickly?
I am fine-tuning the base pre-trained model from https://huggingface.co/microsoft/biogpt. To fine-tune, I freeze all layers except the last one and train on the new data in a self-supervised fashion, predicting the next token from the previous ones:
# Freeze embed_tokens and embed_positions
for param in abs_model.biogpt.embed_tokens.parameters():
    param.requires_grad = False
for param in abs_model.biogpt.embed_positions.parameters():
    param.requires_grad = False

# Freeze the first 23 transformer layers (all layers except the last one)
for i, layer in enumerate(abs_model.biogpt.layers):
    if i < 23:
        for param in layer.parameters():
            param.requires_grad = False

# Keep the final layer norm trainable
for param in abs_model.biogpt.layer_norm.parameters():
    param.requires_grad = True

# Check which parameters are trainable
for name, param in abs_model.named_parameters():
    print(name, param.requires_grad)
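As a quick sanity check (not part of the notebook itself), I also summarize how much of the model remains trainable after freezing; this is just a small sketch on top of the same abs_model:

# Sketch: count trainable vs. total parameters after freezing
trainable = sum(p.numel() for p in abs_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in abs_model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")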
In this repository, I couldn't find a dataset definition to train the foundation model, so I came up with my own:
import os
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BioGptTokenizer, BioGptForCausalLM
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

# Prepare the dataset
class AbstractsDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length):
        self.tokenizer = tokenizer
        self.texts = texts
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts.iloc[idx]
        # Tokenize, truncate and pad the sequence to max_length
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            return_tensors='pt',    # Return PyTorch tensors
            padding='max_length',   # Pad to max_length
            truncation=True
        )
        # Shift the labels to predict the next token
        input_ids = encoding['input_ids'].squeeze(0)  # Remove the batch dimension added by `return_tensors`
        labels = input_ids.clone()
        labels[:-1] = input_ids[1:]
        labels[-1] = -100  # Typically the label for the last token is set to -100 so it is ignored in the loss
        return {
            'input_ids': input_ids,
            'labels': labels
        }
# Create the datasets
train_dataset = AbstractsDataset(train_abstracts, abs_tokenizer, max_length=512)  # Adjust max_length as needed
val_dataset = AbstractsDataset(val_abstracts, abs_tokenizer, max_length=512)
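For completeness, train_abstracts, val_abstracts, abs_tokenizer, and abs_model are prepared along the following lines; this is a minimal sketch, and the file name and column name are placeholders rather than the exact code from my notebook:

import pandas as pd
from transformers import BioGptTokenizer, BioGptForCausalLM

abs_tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
abs_model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

# Hypothetical input file: one PubMed abstract per row, in an "abstract" column
abstracts = pd.read_csv("pubmed_abstracts.csv")["abstract"]

# Simple 90/10 split into training and validation Series
split = int(0.9 * len(abstracts))
train_abstracts = abstracts.iloc[:split].reset_index(drop=True)
val_abstracts = abstracts.iloc[split:].reset_index(drop=True)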
Below is the code for training:
# Training arguments
training_args = TrainingArguments(
    output_dir=f'./results_{formatted_date}',  # output directory
    num_train_epochs=1,                # number of training epochs, adjust as needed
    per_device_train_batch_size=4,     # batch size per device during training, adjust for your GPU(s)
    warmup_steps=500,                  # number of warmup steps for the learning rate scheduler
    eval_steps=500,                    # evaluate every 500 steps
    save_steps=500,                    # save a checkpoint every 500 steps
    weight_decay=0.01,                 # strength of weight decay
    load_best_model_at_end=True,       # load the best model at the end of training (based on `metric_for_best_model`)
    logging_dir='./logs',              # directory for storing logs
    metric_for_best_model="loss",      # use loss to pick the best model
    evaluation_strategy="steps",       # evaluate at regular intervals
    greater_is_better=False,           # lower loss indicates a better model
    logging_steps=1,
)
# Initialize the Trainer
trainer = Trainer(
    model=abs_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]  # stop after 5 evaluations without improvement
)
trainer.args._n_gpu = 1

# Train the model
trainer.train()

# Save the fine-tuned model
abs_model.save_pretrained(f'./saved_model/updated_{formatted_date}')
After training one epoch with 10,000 abstracts, the validation loss decreased from 11.34 to 3.40.
However, the quality of the generated text became much worse. For example, prompted with "COVID-19 is", the fine-tuned model generates the following:
COVID-19 is on world of, to the and this has the to the of world., its of. patients and to the world, have a on illness caused a., the of.,, and of people.
While the base pre-trained BioGPT provides a more reasonable answer:
COVID-19 is still an ongoing pandemic.
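For reference, I produce these completions with a plain generate call, roughly like the sketch below; the decoding parameters shown are approximate rather than the exact ones I used:

import torch
from transformers import BioGptTokenizer, BioGptForCausalLM

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
# Load either the fine-tuned checkpoint or "microsoft/biogpt" for the baseline comparison
model = BioGptForCausalLM.from_pretrained(f'./saved_model/updated_{formatted_date}')

inputs = tokenizer("COVID-19 is", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50, num_beams=5, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))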
This leads me to wonder: how can the loss decrease while the quality of the generated text gets worse? Were there any specific training techniques or considerations used in the original training of BioGPT that I might be missing?
Any insights or suggestions would be greatly appreciated. Thank you for your time and assistance.
For your reference, I am attaching a Jupyter notebook of the run described above.