This is the supplementary material for the paper "EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles" presented at CIKM 2024.
Table A1 displays the N-grams referring to related debunk articles cited in EUvsDisinfo's response section. To obtain the N-grams, we lowercase the text within the response section, remove punctuation, and then sentence tokenise it. Next, we use the Gensim Python library to compute 3-grams and 4-grams, count the number of sentences in which each N-gram occurs, and sort them in decreasing order of frequency. Finally, we select the N-grams that refer to other debunks, such as "recurring pro-kremlin disinformation" or "see similar cases". Using this list of N-grams, we remove URLs that appear in the same sentence as an N-gram in the list. A simplified sketch of this procedure is given after Table A1.
N-gram | #Sentences |
---|---|
recurring pro-kremlin disinformation narrative | 3,123 |
pro-kremlin disinformation narrative about | 2,398 |
disinformation narrative about | 1,236 |
see other examples of | 927 |
a recurring pro-kremlin disinformation | 848 |
this is a recurring | 802 |
disinformation cases alleging that | 753 |
similar cases claiming that | 736 |
pro-kremlin disinformation narratives about | 697 |
recurring pro-kremlin disinformation narratives | 681 |
read more about the | 560 |
read similar cases claiming | 525 |
is a recurring pro-kremlin | 464 |
other examples of similar | 453 |
recurring pro-kremlin narrative about | 447 |
a recurring pro-kremlin narrative | 441 |
a recurring disinformation narrative | 439 |
earlier disinformation cases alleging | 430 |
see earlier disinformation cases | 422 |
disinformation narratives about | 375 |
---------------------------------------------- | ------------ |
recurring pro-kremlin disinformation | 4,541 |
pro-kremlin disinformation narrative | 4,015 |
disinformation narrative about | 2,767 |
a recurring pro-kremlin | 1,363 |
see other examples | 1,145 |
pro-kremlin disinformation narratives | 1,114 |
recurring pro-kremlin narrative | 1,036 |
other examples of | 1,008 |
disinformation narratives about | 952 |
is a recurring | 898 |
see similar cases | 731 |
[Table A1: N-grams associated with related debunk articles cited in EUvsDisinfo’s response section, along with the number of sentences in which the N-gram occurs. 4-grams are above the dashed line, and 3-grams are below it.]
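The sketch below illustrates this filtering step in plain Python; the Gensim-based N-gram computation used in practice is replaced here by a simple Counter over sentence tokens, and `debunk_ngrams` stands in for the list in Table A1.

```python
import re
from collections import Counter

def sentence_ngrams(sentence, n):
    """All n-grams (space-joined) in a whitespace-tokenised sentence."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_sentence_counts(sentences, n):
    """Number of sentences each n-gram occurs in, sorted by frequency."""
    counts = Counter()
    for sent in sentences:
        counts.update(set(sentence_ngrams(sent, n)))
    return counts.most_common()

def drop_debunk_urls(sentence, debunk_ngrams):
    """Strip URLs from a sentence if it contains any of the listed N-grams."""
    if any(ng in sentence for ng in debunk_ngrams):
        return re.sub(r"https?://\S+", "", sentence)
    return sentence
```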
Table B1 shows the detailed breakdown of disinformation and trustworthy articles per language for EUvsDisinfo. The dashed line separates the languages used throughout the experiments (above) from the remaining languages (below).
Language | Total | Disinformation | Trustworthy |
---|---|---|---|
English | 6,546 | 425 | 6,121 |
Russian | 5,825 | 5,356 | 469 |
German | 313 | 216 | 97 |
French | 292 | 165 | 127 |
Spanish | 287 | 243 | 44 |
Georgian | 156 | 146 | 10 |
Czech | 152 | 111 | 41 |
Polish | 147 | 44 | 103 |
Italian | 103 | 85 | 18 |
Lithuanian | 78 | 28 | 50 |
Romanian | 68 | 17 | 51 |
Slovak | 35 | 32 | 3 |
Serbian | 31 | 27 | 4 |
Finnish | 30 | 8 | 22 |
----------------- | ------- | ---------------- | ------------- |
Arabic | 3,451 | 3,449 | 2 |
Ukrainian | 323 | 8 | 315 |
Hungarian | 147 | 144 | 3 |
Armenian | 87 | 83 | 4 |
Azerbaijani | 54 | 54 | 0 |
Swedish | 22 | 4 | 18 |
Bulgarian | 18 | 4 | 14 |
Dutch | 11 | 3 | 8 |
Norwegian | 9 | 0 | 9 |
Estonian | 8 | 0 | 8 |
Indonesian | 8 | 6 | 2 |
Bosnian | 7 | 6 | 1 |
Latvian | 6 | 3 | 3 |
Croatian | 6 | 4 | 2 |
Greek | 5 | 2 | 3 |
Belarusian | 5 | 0 | 5 |
Afrikaans | 3 | 3 | 0 |
Macedonian | 3 | 1 | 2 |
Chinese | 2 | 2 | 0 |
Persian | 2 | 1 | 1 |
Filipino | 2 | 0 | 2 |
Turkish | 1 | 0 | 1 |
Norwegian (Nynorsk) | 1 | 0 | 1 |
Japanese | 1 | 0 | 1 |
Danish | 1 | 0 | 1 |
Catalan | 1 | 0 | 1 |
Korean | 1 | 1 | 0 |
Portuguese | 1 | 1 | 0 |
[Table B1: Class distribution per language for EUvsDisinfo. Languages above the dashed line are used in the classification experiments.]
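A breakdown of this kind can be obtained with a short pandas aggregation; the sketch below assumes a dataframe with hypothetical `language` and `label` columns and is not the exact script used to produce Table B1.

```python
import pandas as pd

def class_distribution(df: pd.DataFrame) -> pd.DataFrame:
    """Per-language counts of each class, plus a total column.

    Assumes one row per article with a `language` column and a
    `label` column ("disinformation" / "trustworthy").
    """
    table = (
        df.groupby(["language", "label"])
          .size()
          .unstack(fill_value=0)
    )
    table["Total"] = table.sum(axis=1)
    return table.sort_values("Total", ascending=False)
```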
For the Multinomial Naive Bayes (MNB) baseline, we try four different values for alpha (the Laplace smoothing constant):
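The sketch below shows how such a sweep could be set up with scikit-learn; the TF-IDF vectoriser, the scoring metric, and the listed alpha values are illustrative placeholders rather than the exact configuration used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("mnb", MultinomialNB()),
])

# Hypothetical grid of four Laplace smoothing constants; the values
# actually swept are not listed here.
param_grid = {"mnb__alpha": [0.01, 0.1, 0.5, 1.0]}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=5)
# `texts` and `labels` are placeholders for the training split:
# search.fit(texts, labels)
# print(search.best_params_)
```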
For the mBERT and XLM-RoBERTa baselines, we finetune the pre-trained bert-base-multilingual-cased and xlm-roberta-base models, respectively.
The best mBERT configuration uses a learning rate of
All experiments involving mBERT and XLM-RoBERTa are performed on a single Nvidia A100 40GB GPU.
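For reference, a minimal fine-tuning setup with the Hugging Face Transformers Trainer could look as follows; the hyperparameter values shown are placeholders, not the best configurations reported above.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # or "xlm-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def tokenize(batch):
    # Truncate articles to the model's maximum input length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-5,               # placeholder value
    num_train_epochs=3,               # placeholder value
    per_device_train_batch_size=16,   # placeholder value
)

# `train_dataset` / `eval_dataset` are assumed to be datasets.Dataset
# objects with "text" and "label" columns, mapped through tokenize():
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset,
#                   tokenizer=tokenizer)
# trainer.train()
```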
Lastly, we study the cross-dataset generalisation capabilities of the best-performing mBERT model. In addition to the EUvsDisinfo dataset, the experiments use the two other inherently multilingual datasets presented in Table 1, namely MM-Covid and FakeCovid. Following the methodology described in Section 4, for each dataset pair we train the model on one dataset and evaluate it on the test set of the other (Table D1).
Train | Test | Score |
---|---|---|
EUvsDisinfo | FakeCovid | 0.37 (↓41.3%) |
EUvsDisinfo | MM-Covid | 0.41 (↓57.3%) |
FakeCovid | EUvsDisinfo | 0.44 (↓47.0%) |
FakeCovid | MM-Covid | 0.18 (↓81.3%) |
MM-Covid | EUvsDisinfo | 0.55 (↓33.7%) |
MM-Covid | FakeCovid | 0.46 (↓27.0%) |
[Table D1: Cross-dataset results. The percentage decrease with respect to the baseline score of the test set (EUvsDisinfo = 0.83, MM-Covid = 0.96, FakeCovid = 0.66) is shown in parentheses.]
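Schematically, the protocol behind Table D1 can be expressed as the following loop; `splits`, `train_model`, and `evaluate` are placeholders for the data splits, the mBERT fine-tuning routine, and the scoring function.

```python
from itertools import permutations

def cross_dataset_eval(splits, train_model, evaluate):
    """Train on each dataset and score on every other dataset's test split.

    `splits` maps a dataset name to {"train": ..., "test": ...};
    `train_model` and `evaluate` stand in for the fine-tuning and
    scoring routines used for the in-dataset baselines.
    """
    results = {}
    for train_name, test_name in permutations(splits, 2):
        model = train_model(splits[train_name]["train"])
        results[(train_name, test_name)] = evaluate(model, splits[test_name]["test"])
    return results
```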
** The best configurations for FakeCovid and MM-Covid are, respectively, a learning rate of