Documents vs. shuffled sentences #12

dan-zeman · 2023-05-01T10:40:54Z

I understand that the sentences in IMST originate in the METU Turkish Corpus (MTC). While the original corpus contains whole documents, the sentences in the IMST treebank have been shuffled. Unfortunately, sentences from one document are scattered across the train / dev / test partitions of the treebank.

I am wondering whether this could be fixed, i.e., sentences reordered so that they reflect the original order from MTC, documents are not split between train, dev and test, and document boundaries are annotated using # newdoc. (Even if some sentences from the original documents in MTC are missing in IMST, I think it would still be better to restore the order of the sentences that were selected for IMST.)
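A minimal sketch of what such a reordering could look like, assuming a mapping from each sentence to its source document and position has already been recovered from MTC. The `origin` mapping, the function name and the sentence ids are hypothetical; only the `# newdoc id = ...` comment follows the CoNLL-U convention.

```python
from collections import defaultdict


def restore_documents(sentences, origin):
    """Group CoNLL-U sentence blocks by document and restore their order.

    sentences: dict mapping sent_id -> CoNLL-U block (string without the
               trailing blank line)
    origin:    dict mapping sent_id -> (doc_id, position_in_document).
               Hypothetical: it would have to be recovered by aligning
               IMST sentences with the MTC documents.
    """
    by_doc = defaultdict(list)
    for sent_id, block in sentences.items():
        doc_id, position = origin[sent_id]
        by_doc[doc_id].append((position, block))

    output = []
    for doc_id in sorted(by_doc):
        # Sort by the original position only, never by block content.
        ordered = sorted(by_doc[doc_id], key=lambda item: item[0])
        for index, (_, block) in enumerate(ordered):
            if index == 0:
                # Mark the document boundary on its first sentence.
                block = f"# newdoc id = {doc_id}\n{block}"
            output.append(block)
    # Every CoNLL-U sentence, including the last, ends with a blank line.
    return "\n\n".join(output) + "\n\n"
```

Sentences missing from IMST simply leave gaps in the positions; the remaining ones still come out in their original relative order.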

Without it, the dataset cannot be used for any NLP task beyond the sentence level (such as coreference or discourse). Having sentences from one document in both train and test makes evaluation less realistic. In addition, the treebank is not compatible with other annotations over the same sentences, such as the ITCC dataset in CorefUD.

For more context, here is a copy of some observations originally posted by @martinpopel as a CorefUD issue:

The train-dev-test split of ITCC is not compatible with the UD_Turkish-IMST (UD) split. For example, 316 sentences from the ITCC test set appear in the UD train set. This is a big problem, as explained in #42.

             UD train   UD dev   UD test   UD any   lines
ITCC train       1402      363       372     2132    3531
ITCC dev          269       75        73      398     556
ITCC test         316       79        88      477     645
lines            3685      975       975

As explained in #41, ITCC is missing SpaceAfter=No, so I've ignored spaces when generating the above table:

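# Extract the sentence texts and strip all spaces (ITCC lacks SpaceAfter=No)
grep '# text =' tr_imst-ud-train.conllu | sed 's/ //g' > ud-train.txt
grep '# text =' tr-corefud-train.conllu | sed 's/ //g' > itcc-train.txt
...
# Count ITCC train sentences that also occur in UD train (fixed-string match)
grep -Ff ud-train.txt itcc-train.txt | wc -l
...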
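
The same comparison can be sketched in Python for all partition pairs at once. This is only an illustration under stated assumptions: the function names are made up, and it uses exact matching on unique space-free texts, so its counts may differ slightly from the `grep -F` line counts above (which match substrings and count duplicate lines).

```python
def text_keys(conllu_lines):
    """Return the set of space-free sentence texts found in CoNLL-U lines."""
    keys = set()
    for line in conllu_lines:
        if line.startswith("# text ="):
            text = line[len("# text ="):]
            # Remove all whitespace so that missing SpaceAfter=No is ignored.
            keys.add("".join(text.split()))
    return keys


def overlap_table(itcc, ud):
    """Count shared sentences for every (ITCC partition, UD partition) pair.

    itcc, ud: dicts mapping a partition name to a set returned by text_keys.
    """
    return {
        (itcc_name, ud_name): len(itcc_keys & ud_keys)
        for itcc_name, itcc_keys in itcc.items()
        for ud_name, ud_keys in ud.items()
    }
```

In practice each set would be built from an opened file, for example `text_keys(open("tr_imst-ud-train.conllu", encoding="utf-8"))`.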

Unfortunately, it seems the sentences in the UD treebank were shuffled randomly without respecting document boundaries. For example, the first document in ITCC (00016112), with 92 sentences, is included in all three files of UD (60 sentences in train, 9 in dev and 15 in test). Of course, we need to keep each document in one file. I have not checked the other documents, but it seems the only way to make the train/dev/test split compatible is to fix UD_Turkish-IMST for the next UD release.
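
A per-document check like the one described for document 00016112 could be sketched as follows. The function and argument names are hypothetical, and sentences are matched exactly after removing whitespace, which is an assumption rather than the method actually used for the numbers above.

```python
from collections import Counter


def document_spread(doc_texts, splits):
    """Count in which split each sentence of one document is found.

    doc_texts: list of sentence texts belonging to one document
    splits:    dict mapping a split name to an iterable of sentence texts
    """
    lookup = {}
    for name, texts in splits.items():
        for text in texts:
            # Keep the first split in which a space-free text is seen.
            lookup.setdefault("".join(text.split()), name)

    spread = Counter()
    for text in doc_texts:
        # Sentences not found in any split are counted as "missing".
        spread[lookup.get("".join(text.split()), "missing")] += 1
    return spread
```

A document that is kept in one file yields a single split name (plus possibly "missing"); more than one split name indicates the problem described above.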


furkanakkurt1335 · 2023-05-10T12:37:05Z

Dear @dan-zeman, I have obtained the METU corpus and am working on this issue. I don't have permission to view the issue on ufal/corefUD.

martinpopel · 2023-05-10T12:52:01Z

Hi @furkanakkurt1335, I can grant you access to https://github.com/ufal/corefUD. But any edits should first be coordinated with the maintainers of ITCC: Gülşen Eryiğit, @TugbaP and @kutaygallo. Are you in contact with them? The data of CorefUD 1.1 (except for the test set) are also published at http://hdl.handle.net/11234/1-5053

furkanakkurt1335 · 2023-05-10T12:56:11Z

Hey @martinpopel, I only wanted to read the issue because it was mentioned here; I will not be making any edits. I know of the first two people and recently had a conversation with @Tugbapmy (although not about this). @dan-zeman has already provided the relevant parts of the issue and told me about the problem and the possible solutions, so it does not seem necessary for me to see the issue itself. Hopefully, it will be resolved.

dan-zeman added the enhancement label May 1, 2023

furkanakkurt1335 self-assigned this May 10, 2023
