Documents vs. shuffled sentences #12

Open
dan-zeman opened this issue May 1, 2023 · 3 comments

@dan-zeman
Member

I understand that the sentences in IMST originate in the METU Turkish Corpus (MTC), and that while the original corpus contains whole documents, the sentences in the IMST treebank have been shuffled. Unfortunately, sentences from one document are scattered across the train / dev / test partitions of the treebank.

I am wondering whether this could be fixed, i.e., sentences reordered so that they reflect the original order from MTC, documents are not split between train, dev and test, and document boundaries are annotated using # newdoc. (Even if some sentences from the original documents in MTC are missing in IMST, I think it would still be better to restore the order of the sentences that were selected for IMST.)

Without this fix, the dataset cannot be used for any NLP task beyond the sentence level (such as coreference or discourse). Having sentences from one document in both train and test also makes evaluation less realistic. Finally, the treebank is not compatible with other annotations over the same sentences, such as the ITCC dataset in CorefUD.
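The requested fix could be sketched roughly as follows. This is only an illustration under stated assumptions, not the maintainers' procedure: it assumes a hypothetical mapping `origin` from each `sent_id` to a `(doc_id, position_in_document)` pair recovered from MTC, and the function names are made up for this example.

```python
from collections import defaultdict


def sent_id_of(block):
    """Return the value of the '# sent_id = ...' comment in a CoNLL-U sentence block."""
    for line in block.splitlines():
        if line.startswith("# sent_id ="):
            return line.split("=", 1)[1].strip()
    raise ValueError("sentence block has no sent_id comment")


def restore_documents(blocks, origin):
    """Group sentence blocks by document and sort them into their original order.

    `origin` is an ASSUMED mapping sent_id -> (doc_id, position_in_document).
    The first sentence of each document gets a '# newdoc id = ...' comment.
    """
    docs = defaultdict(list)
    for block in blocks:
        doc_id, position = origin[sent_id_of(block)]
        docs[doc_id].append((position, block))
    restored = {}
    for doc_id, items in docs.items():
        ordered = [block for _, block in sorted(items, key=lambda item: item[0])]
        ordered[0] = f"# newdoc id = {doc_id}\n{ordered[0]}"
        restored[doc_id] = ordered
    return restored


def split_by_document(restored, dev_docs, test_docs):
    """Assign whole documents to train/dev/test so that no document is divided."""
    splits = {"train": [], "dev": [], "test": []}
    for doc_id in sorted(restored):
        if doc_id in dev_docs:
            name = "dev"
        elif doc_id in test_docs:
            name = "test"
        else:
            name = "train"
        splits[name].extend(restored[doc_id])
    return splits
```

How documents are chosen for dev and test is left open here; the only property the sketch guarantees is that every sentence of a document ends up in the same partition.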

For more context, here is a copy of some observations originally posted by @martinpopel as a CorefUD issue:


The train-dev-test split of ITCC is not compatible with the UD_Turkish-IMST (UD) split. For example, 316 sentences from the ITCC test set appear in the UD train set. This is a big problem, as explained in #42.

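| | UD train | UD dev | UD test | UD any | lines |
|---|---:|---:|---:|---:|---:|
| ITCC train | 1402 | 363 | 372 | 2132 | 3531 |
| ITCC dev | 269 | 75 | 73 | 398 | 556 |
| ITCC test | 316 | 79 | 88 | 477 | 645 |
| lines | 3685 | 975 | 975 | | |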

As explained in #41, ITCC is missing SpaceAfter=No, so I've ignored spaces when generating the above table:

```
grep '# text =' tr_imst-ud-train.conllu | sed 's/ //g' > ud-train.txt
grep '# text =' tr-corefud-train.conllu | sed 's/ //g' > itcc-train.txt
...
cat itcc-train.txt | grep -Ff ud-train.txt | wc -l
...
```
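For readers who prefer not to use the shell, the same comparison might look like this in Python. This is a sketch, not the script used to produce the table: the function names are invented here, and it counts exact matches of the space-free text, whereas `grep -F` matches substrings, so the numbers could differ slightly.

```python
def normalized_texts(conllu_text):
    """Collect the values of '# text = ...' comments with all spaces removed."""
    texts = []
    for line in conllu_text.splitlines():
        if line.startswith("# text ="):
            texts.append(line[len("# text ="):].replace(" ", ""))
    return texts


def count_overlap(itcc_conllu, ud_conllu):
    """Count ITCC sentences whose space-free text also occurs in the UD file."""
    ud_texts = set(normalized_texts(ud_conllu))
    return sum(1 for text in normalized_texts(itcc_conllu) if text in ud_texts)
```

Calling `count_overlap` once for each pair of ITCC and UD files would fill one cell of the table per call.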

Unfortunately, it seems the sentences in the UD treebank were shuffled randomly without respecting document boundaries. For example, the first document in ITCC (00016112), with 92 sentences, is included in all three files of UD (60 sentences in train, 9 in dev and 15 in test). Of course, we need to keep each document in one file. I have not checked the other documents, but it seems the only way to make the train/dev/test split compatible is to fix UD_Turkish-IMST for the next UD release.
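Checking the remaining documents could be automated along these lines. This is a hypothetical sketch: it assumes two dictionaries keyed by the space-free sentence text, one giving the ITCC document id and one giving the UD split, and both the names and the data layout are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def split_distribution(sentence_docs, sentence_splits):
    """Count, for each document, how many of its sentences fall in each UD split.

    sentence_docs:   ASSUMED mapping normalized text -> ITCC document id
    sentence_splits: ASSUMED mapping normalized text -> 'train' / 'dev' / 'test'
    Sentences that are not found in UD are skipped.
    """
    distribution = defaultdict(Counter)
    for text, doc_id in sentence_docs.items():
        split = sentence_splits.get(text)
        if split is not None:
            distribution[doc_id][split] += 1
    return distribution


def scattered_documents(distribution):
    """Return the ids of documents whose sentences occur in more than one split."""
    return sorted(doc_id for doc_id, counts in distribution.items() if len(counts) > 1)
```

An empty result from `scattered_documents` would mean that the two splits are already compatible at the document level.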

@furkanakkurt1335
Contributor

Dear @dan-zeman, I have obtained the METU corpus and am working on the issue. I don't have permission to view the issue on ufal/corefUD.

@martinpopel
Member

martinpopel commented May 10, 2023

Hi @furkanakkurt1335, I can grant you access to https://github.com/ufal/corefUD. But any edits should be coordinated first with the maintainers of ITCC: Gülşen Eryiğit, @TugbaP and @kutaygallo. Are you in contact with them? The data of CorefUD 1.1 (except for the test set) are published also at http://hdl.handle.net/11234/1-5053

@furkanakkurt1335
Contributor

Hey @martinpopel, I wanted to read the issue because it was mentioned here. I will not be making any edits. I know of the first two people and had a conversation with @Tugbapmy recently (although not about this). @dan-zeman has already provided the relevant parts of the issue, so it does not seem necessary for me to see it. He has already told me about the problem and the possible solutions. Hopefully, it will be resolved.
