I understand that the sentences in IMST originate in the METU Turkish Corpus (MTC) and while the original corpus contains whole documents, in the IMST treebank the sentences have been shuffled. Unfortunately, sentences from one document are scattered across train / dev / test partitions of the treebank.
I am wondering whether this could be fixed, i.e., sentences reordered so that they reflect the original order from MTC, documents are not split between train, dev and test, and document boundaries are annotated using # newdoc. (Even if some sentences from the original documents in MTC are missing in IMST, I think it would still be better to restore the order of the sentences that were selected for IMST.)
Without this, the dataset cannot be used for any NLP task beyond the sentence level (such as coreference or discourse). Having sentences from one document in both train and test makes evaluation less realistic. In addition, the treebank is not compatible with other annotations over the same sentences, such as the ITCC dataset in CorefUD.
For more context, here is a copy of some observations originally posted by @martinpopel as a CorefUD issue:
The train-dev-test split of ITCC is not compatible with the UD_Turkish-IMST (UD) split. For example, 316 sentences from the ITCC test set appear in the UD train set. This is a big problem, as explained in #42.
| | UD train | UD dev | UD test | UD any | lines |
|---|---|---|---|---|---|
| ITCC train | 1402 | 363 | 372 | 2132 | 3531 |
| ITCC dev | 269 | 75 | 73 | 398 | 556 |
| ITCC test | 316 | 79 | 88 | 477 | 645 |
| lines | 3685 | 975 | 975 | | |
As explained in #41, ITCC is missing SpaceAfter=No, so I've ignored spaces when generating the above table:
grep '# text =' tr_imst-ud-train.conllu | sed 's/ //g' > ud-train.txt
grep '# text =' tr-corefud-train.conllu | sed 's/ //g' > itcc-train.txt
...
cat itcc-train.txt | grep -Ff ud-train.txt | wc -l
...
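The elided commands presumably repeat the same steps for every pair of files. A minimal sketch of that loop follows, assuming the dev and test files are named by analogy with the train files shown above (only the two train file names appear in the original commands). The helper names `normalized_text` and `overlap_count` are made up for illustration. Like the `grep -Ff` call above, it matches fixed strings as substrings; adding `-x` would require whole-line matches and might give slightly different counts.

```shell
# Print the "# text =" lines of a CoNLL-U file with all spaces removed.
normalized_text() {
  grep '^# text =' "$1" | sed 's/ //g'
}

# Count sentences of file $1 whose normalized text contains the normalized
# text of some sentence in file $2 (same matching as grep -Ff above).
overlap_count() {
  local patterns
  patterns=$(mktemp)
  normalized_text "$2" > "$patterns"
  # grep -c prints 0 and exits non-zero when nothing matches; keep going.
  normalized_text "$1" | grep -c -F -f "$patterns" || true
  rm -f "$patterns"
}

# File names for dev/test are assumptions by analogy with the train files.
for itcc in train dev test; do
  for ud in train dev test; do
    itcc_file="tr-corefud-$itcc.conllu"
    ud_file="tr_imst-ud-$ud.conllu"
    # Skip pairs whose files are not present in the current directory.
    if [ -f "$itcc_file" ] && [ -f "$ud_file" ]; then
      printf 'ITCC %s vs UD %s: %s\n' "$itcc" "$ud" \
        "$(overlap_count "$itcc_file" "$ud_file")"
    fi
  done
done
```

Run from a directory containing both sets of files, this prints one line per cell of the 3×3 part of the table.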
Unfortunately, it seems the sentences in the UD treebank were shuffled randomly without respecting document boundaries. For example, the first document in ITCC (00016112), with 92 sentences, is spread across all three UD files (60 sentences in train, 9 in dev and 15 in test). Of course, we need to keep each document in one file. I have not checked the other documents, but it seems the only way to make the train/dev/test split compatible is to fix UD_Turkish-IMST for the next UD release.
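To check the other documents as well, a per-document count could be computed along these lines. This is only a sketch: it assumes the ITCC files mark document starts with `# newdoc id = ...` lines, and the function names `doc_sentences` and `doc_overlap` are hypothetical. Unlike the `grep -Ff` commands above, it compares whole normalized sentences exactly, so the numbers may differ slightly.

```shell
# Emit "<doc id><TAB><text without spaces>" for every sentence of a CoNLL-U
# file that marks document starts with "# newdoc id = ..." (an assumption).
doc_sentences() {
  awk '
    /^# newdoc id = / { doc = substr($0, 15); next }
    /^# text = /      { text = substr($0, 10); gsub(/ /, "", text)
                        print doc "\t" text }
  ' "$1"
}

# For each document in file $1, print "<doc id> <hits> <total>", where hits
# is the number of its sentences that also occur in file $2 (exact match
# after removing spaces).
doc_overlap() {
  local ud_texts doc_texts
  ud_texts=$(mktemp)
  doc_texts=$(mktemp)
  grep '^# text = ' "$2" | sed 's/^# text = //; s/ //g' > "$ud_texts"
  doc_sentences "$1" > "$doc_texts"
  awk -F '\t' '
    from_ud { seen[$0] = 1; next }               # sentences of the UD file
    { total[$1]++; if ($2 in seen) hit[$1]++ }   # sentences per document
    END { for (d in total) printf "%s\t%d\t%d\n", d, hit[d] + 0, total[d] }
  ' from_ud=1 "$ud_texts" from_ud=0 "$doc_texts" | sort
  rm -f "$ud_texts" "$doc_texts"
}
```

For example, `doc_overlap tr-corefud-train.conllu tr_imst-ud-train.conllu` would list, for every ITCC train document, how many of its sentences ended up in UD train. Running it against each UD file in turn shows which documents are split across partitions.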
Hey @martinpopel, because the issue was mentioned, I wanted to read it. I will not be making any edits. I know of the first 2 people and had a conversation with @Tugbapmy recently (although not about this). @dan-zeman already provided the relevant parts of the issue, so it does not seem necessary for me to see the issue itself. He already told me about the problem and the possible solutions. Hopefully, it's going to be resolved.