git clone [email protected]:braingpt-lovelab/matching_experts.git --recursive
Training-related scripts and files are in `model_training/`; post-training analysis scripts and results are in `analyses/`.
cd model_training
- To build the neuro-tokenizer: `python tokenizer.py`
- To train a GPT-2 using a specific configuration: `bash launch_training.sh`. Configurations are in `configs/`.
- Make sure to supply wandb info in the config JSON.
- Accelerate config: `accel_config.yaml`
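Because training expects wandb info in the config JSON, it can help to check a config before launching. The snippet below is only a sketch: the key names `wandb_project` and `wandb_entity` are hypothetical placeholders, not taken from this repository, so substitute whatever keys the files in `configs/` actually use.

```python
import json
from pathlib import Path

# Hypothetical key names; check the JSON files in configs/ for the real ones.
REQUIRED_WANDB_KEYS = ("wandb_project", "wandb_entity")


def missing_wandb_fields(config_path, required=REQUIRED_WANDB_KEYS):
    """Return the required keys that are absent or empty in the config JSON."""
    config = json.loads(Path(config_path).read_text())
    return [key for key in required if not config.get(key)]
```

An empty list means every listed key is present and non-empty; anything else names the fields still to fill in.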
Domain-specific neuroscience training data are available at: https://huggingface.co/datasets/BrainGPT/train_valid_split_pmc_neuroscience_2002-2022_filtered_subset
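One way to pull this dataset programmatically is through the Hugging Face `datasets` library. This is a minimal sketch, assuming `datasets` is installed, network access is available, and the dataset loads with its default configuration; the split and column names are not stated here, so inspect the returned object rather than relying on specific ones.

```python
DATASET_ID = "BrainGPT/train_valid_split_pmc_neuroscience_2002-2022_filtered_subset"


def load_neuro_corpus(streaming=True):
    """Load the dataset from the Hugging Face Hub (requires network access)."""
    # Imported lazily so this module imports even without `datasets` installed.
    from datasets import load_dataset

    # streaming=True avoids downloading the full corpus up front.
    return load_dataset(DATASET_ID, streaming=streaming)
```

Calling `load_neuro_corpus()` returns a dictionary-like object keyed by split name.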
cd analyses
- Run inference with GPT-2 variants on BrainBench test cases: `python run_choice.py`
- Produce intermediate results for the token analysis: `python common_and_unique.py`
- Call GPT-4 to identify neuroscience terms in the pretrained GPT-2 tokenizer and neuro-tokenizer vocabularies: `python neuro_term_tagging.py`
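The inference step can be pictured as a two-way choice between candidate texts. The sketch below assumes the choice is made by comparing perplexity and picking the lower one; the exact procedure in `run_choice.py` may differ. The function names and the log-probability values are illustrative, not model output.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def choose_lower_perplexity(logprobs_a, logprobs_b):
    """Return 0 if candidate A has lower (or equal) perplexity, else 1."""
    return 0 if perplexity(logprobs_a) <= perplexity(logprobs_b) else 1


# Toy per-token log-probabilities, made up for illustration.
original = [-1.0, -0.5, -0.8]
altered = [-1.0, -2.5, -0.8]
choice = choose_lower_perplexity(original, altered)
```

In practice the per-token log-probabilities would come from a forward pass of the language model over each candidate.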
cd analyses
- Fig 1: `python model_vs_human.py`
- Fig 2: `python token_analyses.py`
- Fig 3: `python tokenization_viz.py`
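The token analyses compare the pretrained GPT-2 vocabulary with the neuro-tokenizer vocabulary. As a rough illustration of that kind of comparison (not the repository's implementation), the sketch below splits two vocabularies into shared and tokenizer-specific tokens. The function name and the toy vocabularies are hypothetical.

```python
def split_vocab(vocab_a, vocab_b):
    """Partition two token collections into common and unique tokens."""
    set_a, set_b = set(vocab_a), set(vocab_b)
    return {
        "common": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# Toy vocabularies, made up for illustration.
general_vocab = ["the", "cell", "house", "run"]
neuro_vocab = ["the", "cell", "hippocampus", "synaptic"]
parts = split_vocab(general_vocab, neuro_vocab)
```

With real tokenizers, the inputs would be the token strings of each vocabulary.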
cd analyses/model_results
Variant | Training | Data | Tokenizer | Raw Results Directory |
---|---|---|---|---|
Untrained | - | - | pretrained | gpt2_init/ |
Pretrained | from scratch | WebText | pretrained | gpt2/ |
Scratch | from scratch | neuroscience | pretrained | gpt2_scratch/ |
Finetuned (from pretrained) | finetune | neuroscience | pretrained | finetune_gpt2/ |
Scratch (Neuro tokenizer) | from scratch | neuroscience | custom | gpt2_scratch_neuro_tokenizer/ |
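For scripted access to the raw results, the table can be mirrored as a small lookup. This is a convenience sketch only: the directory names are copied from the table, while `RESULTS_ROOT` and `results_dir` are hypothetical names that assume the script runs from the repository root.

```python
from pathlib import Path

# Assumed location relative to the repository root.
RESULTS_ROOT = Path("analyses/model_results")

# Variant name -> raw results directory, copied from the table.
VARIANT_DIRS = {
    "Untrained": "gpt2_init",
    "Pretrained": "gpt2",
    "Scratch": "gpt2_scratch",
    "Finetuned (from pretrained)": "finetune_gpt2",
    "Scratch (Neuro tokenizer)": "gpt2_scratch_neuro_tokenizer",
}


def results_dir(variant, root=RESULTS_ROOT):
    """Return the raw results directory for a model variant."""
    return root / VARIANT_DIRS[variant]
```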
@misc{luo2024matching,
title={Matching domain experts by training from scratch on domain knowledge},
author={Xiaoliang Luo and Guangzhi Sun and Bradley C. Love},
year={2024},
eprint={2405.09395},
archivePrefix={arXiv},
primaryClass={q-bio.NC}
}