reference thesis: https://nsaphra.net/uploads/thesis.pdf
older work (e.g. that thesis) uses LSTMs; it is unclear whether any of that transfers to a modern architecture, and the pythia checkpoints give a good reference point
As JDC has mentioned:
- Establish what the fully trained pythia-12b embeddings have learned.
- Look through the checkpoints to see at what point the model learns those things, how quickly it learns them, and what the learning curve looks like for each of them (see the checkpoint-loading sketch after this list)
- See how this extends to other pythia models/sizes.
- Upload all (or a meaningful subset, e.g. every power of 2) pythia-12b checkpoints to HF
- Analysis of token meanings/categories in the fully trained pythia-12b model
- Analysis of what meanings show up when in training pythia-12b
- Potentially expand this to other pythia model sizes to see whether the same patterns hold across scales?
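
A minimal sketch of how the checkpoint sweep could start. It assumes the EleutherAI/pythia-12b repo on HF exposes intermediate checkpoints as revisions named step{N} (the step values below are an arbitrary illustrative subset), and it only grabs the input embedding matrix at each step; loading the full 12B model just for that is wasteful, so in practice one might read the embedding tensor straight from the checkpoint shards instead.

```python
# Sketch: collect the token-embedding matrix from a few pythia-12b checkpoints.
# Assumes intermediate checkpoints are published as HF revisions named "step{N}";
# the step list below is an illustrative subset, not a prescribed one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-12b"
STEPS = [1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000]

# tokenizer maps vocab ids to strings for the token-meaning/category analysis
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def embedding_matrix(step: int) -> torch.Tensor:
    """Return the (vocab_size, d_model) input embedding matrix at a given training step."""
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, revision=f"step{step}", torch_dtype=torch.float16
    )
    emb = model.get_input_embeddings().weight.detach().clone()
    del model  # free the rest of the 12B parameters before loading the next checkpoint
    return emb

embeddings = {step: embedding_matrix(step) for step in STEPS}
```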
GSON has uploaded weights here: https://huggingface.co/amphora/pythia-12b-weights
And data on their similarities here: https://huggingface.co/amphora/pythia-12b-weights/tree/main/cos_sim
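
The cos_sim files above presumably hold something like per-token cosine similarities between checkpoints (an assumption; check the repo for the exact definition). A sketch of how such numbers could be recomputed locally from two embedding matrices, reusing the hypothetical embedding_matrix() helper from the sketch above:

```python
# Sketch: per-token cosine similarity between two checkpoints' embedding matrices.
import torch
import torch.nn.functional as F

def per_token_cos_sim(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of each token's embedding row between two checkpoints."""
    return F.cosine_similarity(emb_a.float(), emb_b.float(), dim=-1)

# e.g. with the embedding_matrix() helper sketched earlier:
#   sims = per_token_cos_sim(embedding_matrix(1000), embedding_matrix(2000))
#   sims.mean()                  # overall embedding drift between the two steps
#   (1 - sims).topk(10).indices  # tokens whose embeddings moved the most
```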
- Representation Degeneration Problem in Training Natural Language Generation Models & Is anisotropy really the cause of BERT embeddings not being semantic? - both describe the anisotropic, hypercone behaviour of token embeddings, and the latter links it to known biases such as frequency, subword, punctuation, and case (a crude anisotropy measure is sketched after this list)
- Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse - probably the only relevant part is Fig. 1, the evolution of the cosine angle between tokens, which could serve as another way to quantify embedding quality, or at least as behaviour worth keeping in mind
- Interpreting Word Embeddings with Eigenvector Analysis - embedding SVD (spectrum sketch after this list)
- (https://github.com/saprmarks/dictionary_learning) - candidate implementation of sparse autoencoders (a generic SAE sketch follows below)
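
For the anisotropy/hypercone bullet, one cheap per-checkpoint statistic is the mean cosine similarity over random token pairs; this follows the spirit of those papers rather than their exact metrics (the function and pair count below are my own choices).

```python
# Sketch: crude anisotropy estimate for an embedding matrix, as the mean cosine
# similarity over randomly sampled token pairs. Near 0 suggests an isotropic
# space; near 1 suggests the hypercone / representation-degeneration behaviour.
import torch
import torch.nn.functional as F

def mean_pairwise_cos_sim(emb: torch.Tensor, n_pairs: int = 100_000, seed: int = 0) -> float:
    g = torch.Generator().manual_seed(seed)
    vocab_size = emb.shape[0]
    i = torch.randint(0, vocab_size, (n_pairs,), generator=g)
    j = torch.randint(0, vocab_size, (n_pairs,), generator=g)
    return F.cosine_similarity(emb[i].float(), emb[j].float(), dim=-1).mean().item()

# Evaluating this at every uploaded checkpoint gives one candidate learning curve.
```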
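For the eigenvector-analysis bullet, a minimal look at the embedding SVD; the paper's actual analyses go further, this only exposes the singular value spectrum, whose skew is another way to summarise an embedding matrix at a checkpoint.

```python
# Sketch: singular value spectrum of a (mean-centred) embedding matrix.
import torch

def singular_value_spectrum(emb: torch.Tensor, center: bool = True) -> torch.Tensor:
    x = emb.float()
    if center:
        x = x - x.mean(dim=0, keepdim=True)  # remove the shared mean direction
    return torch.linalg.svdvals(x)

# Fraction of variance explained by the top singular directions:
#   s = singular_value_spectrum(emb)
#   (s**2 / (s**2).sum()).cumsum(0)
```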
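For the dictionary_learning link, the core idea is roughly the following (the repo's actual API, architecture choices, and training recipe differ, so treat this as illustration only): an overcomplete ReLU encoder/decoder trained with reconstruction loss plus an L1 penalty on the codes.

```python
# Sketch: a minimal sparse autoencoder over embedding vectors. Illustrative only;
# the linked repo implements this more carefully (decoder constraints,
# normalisation, handling of dead features, etc.).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        codes = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        return self.decoder(codes), codes

def train_step(sae: SparseAutoencoder, opt: torch.optim.Optimizer,
               batch: torch.Tensor, l1_coeff: float = 1e-3) -> float:
    recon, codes = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```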