This directory contains the document-level data of DEplain-APA (DEplain-APA-doc).
The APA (Austrian Press Agency) data is restricted to non-commercial research purposes. To get access to DEplain-APA, please request it via Zenodo (https://zenodo.org/record/7674560).
Download the data and add the content with `cp -r src/* target`.
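Where no shell is available, the copy step can also be done from Python. This is a minimal sketch, assuming `src` is the unpacked Zenodo download and `target` is the destination directory (both placeholder paths, as in the command above); `copy_dataset` is a hypothetical helper name, not part of the dataset release.

```python
import shutil
from pathlib import Path


def copy_dataset(src: str, target: str) -> None:
    """Recursively copy everything inside `src` into `target`.

    Roughly equivalent to `cp -r src/* target`, except that hidden
    files in `src` are copied as well.
    """
    # dirs_exist_ok=True (Python 3.8+) allows copying into an existing directory.
    shutil.copytree(Path(src), Path(target), dirs_exist_ok=True)
```

Existing files in `target` with the same name are overwritten, which matches the behaviour of `cp -r`.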
In the following, we provide a dataset card for DEplain-APA (following the format of Hugging Face's dataset cards).
- Dataset Description
- Dataset Structure
- Dataset Creation
- Considerations for Using the Data
- Additional Information
- Repository: DEplain-APA zenodo repository
- Paper: Regina Stodden, Omar Momen, and Laura Kallmeyer. 2023. "DEplain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification." In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada. Association for Computational Linguistics.
- Point of Contact: Regina Stodden
DEplain-APA (Stodden et al., 2023) is a dataset for the training and evaluation of sentence and document simplification in German. All texts of this dataset are provided by the Austrian Press Agency. The simple-complex sentence pairs are manually aligned.
The dataset supports the training and evaluation of text-simplification systems. Success in this task is typically measured using the SARI and FKBLEU metrics described in the paper Optimizing Statistical Machine Translation for Text Simplification.
The text in this dataset is in Austrian German (`de-at`).
All texts in this dataset are news data.
- The dataset is licensed for academic purposes only, with restricted access. To download the dataset, please request access on Zenodo.
- document-simplification configuration: an instance consists of an original document and one reference simplification (in plain-text format).
- sentence-simplification configuration: an instance consists of original sentence(s) and one manually aligned reference simplification (including one or more sentences).
data field | data field description |
---|---|
original | an original text from the source dataset |
simplification | a simplified text from the source dataset |
pair_id | document pair id |
complex_document_id (on doc-level) | id of complex document (-1) |
simple_document_id (on doc-level) | id of simple document (-0) |
original_id (on sent-level) | id of sentence(s) of the original text |
simplification_id (on sent-level) | id of sentence(s) of the simplified text |
domain | text domain of the document pair |
corpus | subcorpus name |
simple_url | origin URL of the simplified document |
complex_url | origin URL of the original document |
simple_level or language_level_simple | required CEFR language level to understand the simplified document |
complex_level or language_level_original | required CEFR language level to understand the original document |
simple_location_html | location on hard disk where the HTML file of the simple document is stored |
complex_location_html | location on hard disk where the HTML file of the original document is stored |
simple_location_txt | location on hard disk where the content extracted from the HTML file of the simple document is stored |
complex_location_txt | location on hard disk where the content extracted from the HTML file of the original document is stored |
alignment_location | location on hard disk where the alignment is stored |
simple_author | author (or copyright owner) of the simplified document |
complex_author | author (or copyright owner) of the original document |
simple_title | title of the simplified document |
complex_title | title of the original document |
license | license of the data |
last_access or access_date | date of data origin, or date when the HTML files were downloaded |
rater | id of the rater who annotated the sentence pair |
alignment | type of alignment, e.g., 1:1, 1:n, n:1 or n:m |
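The relation between the sentence-level id fields and the `alignment` field can be sketched in code. This is an illustration only: it assumes the ids of a pair are available as Python lists, which the card does not specify, and `SentencePair` and `alignment_type` are hypothetical names covering just a subset of the fields above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentencePair:
    """Hypothetical container for a subset of the sentence-level fields."""
    original: str
    simplification: str
    original_id: List[str]        # id(s) of the original sentence(s)
    simplification_id: List[str]  # id(s) of the simplified sentence(s)


def alignment_type(pair: SentencePair) -> str:
    """Derive the alignment label (1:1, 1:n, n:1 or n:m) from the id counts."""
    if not pair.original_id or not pair.simplification_id:
        raise ValueError("both sides of a pair need at least one sentence id")
    left = "1" if len(pair.original_id) == 1 else "n"
    right = "1" if len(pair.simplification_id) == 1 else "n"
    if left == "n" and right == "n":
        return "n:m"
    return f"{left}:{right}"


# Invented placeholder texts and ids: one original sentence split into two.
example = SentencePair(
    original="Satz A.",
    simplification="Satz B. Satz C.",
    original_id=["0"],
    simplification_id=["0", "1"],
)
print(alignment_type(example))  # prints "1:n"
```

In the released data the `alignment` field is already filled in, so a function like this would mainly serve as a consistency check.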
DEplain-APA is randomly split into training, development and test sets. The training set of the sentence-simplification configuration contains only texts from documents that are part of the training set of the document-simplification configuration; the same holds for the dev and test sets. The statistics are given below.
| | Train | Dev | Test | Total |
---|---|---|---|---|
Document Pairs | 387 | 48 | 48 | 483 |
Sentence Pairs | 10660 | 1231 | 1231 | 13122 |
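The split sizes above can be checked and turned into proportions with a few lines of Python. The sketch also shows the rule that a sentence pair inherits the split of its document pair. It assumes each sentence pair can be mapped to a document `pair_id`; the function names and the `doc_split` mapping are hypothetical.

```python
from typing import Dict

# Numbers copied from the statistics table.
SPLITS = {
    "Document Pairs": {"train": 387, "dev": 48, "test": 48, "total": 483},
    "Sentence Pairs": {"train": 10660, "dev": 1231, "test": 1231, "total": 13122},
}


def split_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total in percent (one decimal)."""
    parts = {name: n for name, n in counts.items() if name != "total"}
    if sum(parts.values()) != counts["total"]:
        raise ValueError("split sizes do not add up to the total")
    return {name: round(100 * n / counts["total"], 1) for name, n in parts.items()}


def sentence_split(pair_id: str, doc_split: Dict[str, str]) -> str:
    """A sentence pair inherits the split of the document pair it belongs to."""
    return doc_split[pair_id]


print(split_shares(SPLITS["Document Pairs"]))  # {'train': 80.1, 'dev': 9.9, 'test': 9.9}
print(split_shares(SPLITS["Sentence Pairs"]))  # {'train': 81.2, 'dev': 9.4, 'test': 9.4}
```

Both levels come out close to an 80/10/10 split. The sentence-level shares differ slightly from the document-level ones because documents vary in length.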
Inter-Annotator-Agreement: 0.7497 (moderate)
Here, more information on simplification operations will follow soon.
DEplain-APA was created to improve the training and evaluation of German document and sentence simplification. The data comes from the same data provider as the APA-LHA data. In contrast to APA-LHA (automatically aligned), the sentence pairs of DEplain-APA are all manually aligned. Further, DEplain-APA aligns the texts in language level B1 with the texts in A2, which results in mostly mild simplifications.
Furthermore, DEplain-APA contains parallel documents as well as parallel sentence pairs.
The original news texts (in CEFR level B2) were manually simplified by professional translators, i.e. capito – CFS GmbH, and provided to us by the Austrian Press Agency. All documents date from 2019 to 2021. Two native speakers of German manually aligned the sentence pairs using the text simplification annotation tool TS-ANNO. The data was split into sentences using a German spaCy model.
The original news texts (in CEFR level B2) were manually simplified by professional translators, i.e. capito – CFS GmbH. No other demographic or compensation information is known.
The instructions given to the annotators are available here.
The annotators are two native speakers of German who are trained in linguistics. Both were compensated with at least the minimum wage of their country of residence. They are not part of any target group of text simplification.
No sensitive data.
Many people do not understand texts due to their complexity. With automatic text simplification (TS) methods, the texts can be simplified for them. Our new training data can help in training TS models.
No bias is known.
The dataset is provided for research purposes only. Please check the dataset license for additional information.
Researchers at the Heinrich-Heine-University Düsseldorf, Germany, developed DEplain-APA. This research is part of the PhD-program Online Participation, supported by the North Rhine-Westphalian (German) funding scheme Forschungskolleg.
The dataset (DEplain-APA) is provided for research purposes only. Please request access using the following form: https://zenodo.org/record/7674560.
If you use part of this work, please cite our paper:
@inproceedings{stodden-etal-2023-deplain,
  title = "{DE}plain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification",
  author = "Stodden, Regina and
    Momen, Omar and
    Kallmeyer, Laura",
  booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
  month = jul,
  year = "2023",
  address = "Toronto, Canada",
  publisher = "Association for Computational Linguistics",
  note = "preprint: https://arxiv.org/abs/2305.18939",
}
This dataset card uses material written by Juan Diego Rodriguez and Yacine Jernite.