Model scores for the paper "Event knowledge in large language models: the gap between the impossible and the unlikely".
The main analysis repo: https://github.com/carina-kauf/lm-event-knowledge
We tested four attention-based Transformer (Vaswani et al., 2017) language models:
- RoBERTa (Liu et al., 2019)
- BERT (Devlin et al., 2018)
- GPT-J (B. Wang & Komatsuzaki, 2021)
- GPT-2 (Radford et al., 2019)
In the script names and result files, we use the name ANN instead of LLM.
Main metric: Adapted Pseudo-log-likelihood (PLL)
We use a modified version of the sentence’s pseudo-log-likelihood under the model (PLL; Salazar et al., 2020; A. Wang & Cho, 2019), which defines the sentence score as the sum of the log-probabilities of each token given all other tokens. To avoid biasing the scores in favor of multi-token lexical items, we modify the original procedure to additionally mask tokens within multi-token words if they are located to the right of the target.
- associated script at: ANN_MLM_adapted.py
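For illustration, a minimal sketch of this adapted masking scheme with a Hugging Face masked LM (the model choice and helper names here are illustrative assumptions; the actual implementation is ANN_MLM_adapted.py):

```python
# Minimal sketch of the adapted PLL: mask the target token AND any
# within-word tokens to its right, then score the target token.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # illustrative choice
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def adapted_pll(sentence: str) -> float:
    """Sum of log P(token | context) with within-word right-side masking."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    word_ids = enc.word_ids()  # token -> source word index (None for specials)
    total = 0.0
    for i, wid in enumerate(word_ids):
        if wid is None:  # skip [CLS]/[SEP]
            continue
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        for j in range(i + 1, len(word_ids)):  # also mask later same-word tokens
            if word_ids[j] == wid:
                masked[j] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total
```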
Secondary metrics
- PLL (Salazar et al., 2020)
- Verb probability, i.e., the average log-likelihood of the verb's tokens v = w_t ... w_{t'} conditioned on their bidirectional sentence context
- Last-word probability, i.e., the average log-likelihood of the subtokens that compose the last word in the sequence according to the model’s tokenizer
- Left-to-right (l2r), causal sentence-generation probability, i.e., the average log-likelihood of each token w_i in the sequence, conditioned only on the preceding tokens w_{<i} according to the model.
- associated script at: ANN_MLM_scores.py
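A rough sketch of the span-based metrics above (verb / last-word probability), assuming the span's subtokens are masked jointly and scored against the remaining bidirectional context; the exact procedure in ANN_MLM_scores.py may differ:

```python
# Sketch: average log-likelihood of a masked token span (e.g., the verb's
# tokens or the last word's subtokens). Joint masking is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def span_avg_log_prob(sentence: str, start: int, end: int) -> float:
    """Average log P of tokens at positions [start, end), all masked at once."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    masked = ids.clone()
    masked[start:end] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(masked.unsqueeze(0)).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)
    return log_probs[torch.arange(start, end), ids[start:end]].mean().item()
```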
For the unidirectional (GPT) models, we define the sentence score as the sum of the log-probabilities of each token w_i in the sequence, conditioned on the preceding sentence tokens w_{<i}.
- associated script at: ANN_GPT2_scores.py
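A minimal sketch of this causal score with a Hugging Face GPT-2 (the actual scoring code is ANN_GPT2_scores.py and may differ in detail):

```python
# Sketch: sentence score = sum of log P(w_i | w_<i) under a causal LM.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def causal_score(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts t+1
    targets = ids[0, 1:]
    return log_probs.gather(1, targets.unsqueeze(1)).sum().item()
```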
- tinyLSTM (Gauthier et al., 2020): computes the surprisal of a sentence as the sum of the surprisals of each token in the sentence
- associated script at: baseline_lmzoo_tinylstm.py
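If the per-token surprisals are exported to a TSV, the sentence score is a simple group-sum; the file and column names below mirror LM Zoo's surprisal output and are assumptions here:

```python
# Sketch: sum per-token surprisals into a sentence-level score.
import pandas as pd

surprisals = pd.read_csv("surprisals.tsv", sep="\t")  # hypothetical export
sentence_scores = surprisals.groupby("sentence_id")["surprisal"].sum()
```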
- Thematic fit: computes the fit of a patient with the prototype representation of that role, considering the patient-role fillers most associated with the agent AND the predicate of the sentence (following Lenci, 2011).
Procedure (a sketch follows this list):
- we retrieve the N most strongly associated objects for the subject and the verb respectively, and we take the intersection of the two lists;
- we update their association scores using the product (prod) function;
- we select the FastText embeddings corresponding to the first M objects in this list and we average them together (centroid) to create the prototype vector of the object given the subject and the verb;
- the thematic fit of the object x with respect to the other items in the sentence is computed as the similarity score of its corresponding lexical vector v(x) with the prototype vector.
To avoid zero scores, we apply the following methodology in case the intersection of fillers is empty:
- if the two lists are non-empty but their intersection is empty, we use the verb's fillers to create the prototype;
- if one list is empty, we take the other one.
- associated script at: baseline_TF-update.ipynb
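A hypothetical sketch of the procedure above; the `assoc` and `vectors` structures and the parameter defaults are illustrative assumptions (the repo's implementation is baseline_TF-update.ipynb):

```python
# Sketch: thematic fit of an object given a subject and a verb (Lenci, 2011).
import numpy as np

def thematic_fit(subj, verb, obj, assoc, vectors, N=50, M=20):
    """assoc[word] -> {filler: association score}; vectors[word] -> np.ndarray."""
    subj_fillers = dict(sorted(assoc[subj].items(), key=lambda kv: -kv[1])[:N])
    verb_fillers = dict(sorted(assoc[verb].items(), key=lambda kv: -kv[1])[:N])
    shared = set(subj_fillers) & set(verb_fillers)
    if shared:  # update association scores with the product (prod) function
        scored = {f: subj_fillers[f] * verb_fillers[f] for f in shared}
    elif subj_fillers and verb_fillers:  # empty intersection, both lists non-empty
        scored = verb_fillers            # back off to the verb's fillers
    else:                                # one list empty: take the other one
        scored = verb_fillers or subj_fillers
    top = sorted(scored, key=scored.get, reverse=True)[:M]
    prototype = np.mean([vectors[f] for f in top], axis=0)  # centroid
    v = vectors[obj]
    # similarity of the object's lexical vector with the prototype vector
    return float(v @ prototype / (np.linalg.norm(v) * np.linalg.norm(prototype)))
```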
- Structured Distributional Model (SDM; Chersoni et al., 2019): computes thematic fit using both a context-independent and a context-dependent representation of the prototypical role filler, based on the current linguistic context.
- associated script available upon request
- PPMI-syntax (structured input, i.e., input annotated with grammatical roles)
After extracting <verbal head, nominal dependent, relation> triples from the corpora (keeping only triples with frequency >= 2), we compute the PPMI of each triple, where N = total frequency of all triples (see the sketch below).
- associated script at: baseline_PPMI_structured_and_unstructured.ipynb
- NOTE: frequency files can be found here: drive_folder
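One standard way to compute PPMI over such triples, offered as a sketch (the exact formulation in baseline_PPMI_structured_and_unstructured.ipynb may differ):

```python
# Sketch: PPMI of each <head, dependent, relation> triple against the
# independence of its three components.
import math
from collections import Counter

def ppmi_scores(triples):
    """triples: list of (head, dependent, relation) tuples (frequency >= 2 applied upstream)."""
    triple_freq = Counter(triples)
    N = sum(triple_freq.values())  # total frequency of all triples
    head_freq = Counter(h for h, _, _ in triples)
    dep_freq = Counter(d for _, d, _ in triples)
    rel_freq = Counter(r for _, _, r in triples)
    scores = {}
    for (h, d, r), f in triple_freq.items():
        pmi = math.log2((f * N * N) / (head_freq[h] * dep_freq[d] * rel_freq[r]))
        scores[(h, d, r)] = max(0.0, pmi)  # clip negative PMI to zero
    return scores
```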
Dataset 1 - EventsAdapt (based on Fedorenko et al., 2020): newsentences_EventsAdapt
Dataset 2 - DTFit (based on Vassallo et al., 2018): DTFit
Dataset 3 - EventsRev (based on Ivanova et al., 2021): ev1