DEplain: DEplain-APA for Document Simplification

This directory contains the document-level data of DEplain-APA (DEplain-APA-doc).

The APA (Austrian Press Agency) data is restricted to non-commercial research purposes. To get access to DEplain-APA, please request access via Zenodo (https://zenodo.org/record/7674560). Then download the data and copy its content with cp -r src/* target.
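The copy step can also be done from Python, for example on systems without `cp`. The following is a minimal sketch: `src` and `target` are the placeholder paths from the command above, and the helper name `copy_dataset` is hypothetical. Unlike `cp -r src/* target`, it also copies hidden files at the top level of `src`.

```python
import shutil
from pathlib import Path


def copy_dataset(src: str, target: str) -> int:
    """Copy everything under `src` into `target`, keeping the directory layout.

    Returns the number of files found under `target` afterwards.
    """
    src_path, target_path = Path(src), Path(target)
    if not src_path.is_dir():
        raise FileNotFoundError(f"source directory not found: {src_path}")
    # dirs_exist_ok=True merges into an existing target (Python 3.8+).
    shutil.copytree(src_path, target_path, dirs_exist_ok=True)
    return sum(1 for p in target_path.rglob("*") if p.is_file())
```

Calling `copy_dataset("src", "target")` after unpacking the Zenodo download should leave the files in `target` with the same subdirectory layout as in `src`.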

Dataset Statement for DEplain-APA

In the following, we provide a dataset statement for DEplain-APA (following Hugging Face's data cards).

Dataset Description

Dataset Summary

DEplain-APA (Stodden et al., 2023) is a dataset for the training and evaluation of sentence and document simplification in German. All texts of this dataset are provided by the Austrian Press Agency. The simple-complex sentence pairs are manually aligned.

Supported Tasks and Leaderboards

The dataset supports the training and evaluation of text-simplification systems. Success in this task is typically measured using the SARI and FKBLEU metrics described in the paper Optimizing Statistical Machine Translation for Text Simplification.
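To give an intuition for what SARI rewards, the sketch below scores a system output by the words it adds, keeps, and deletes relative to the source, judged against one reference. It is a deliberately simplified illustration and not the official metric: it uses unigram sets and a single reference, whereas SARI as defined in the paper works with n-grams up to order 4, counts, and possibly several references. The function name `unigram_edit_score` is hypothetical, and its values are not comparable to published SARI scores.

```python
from typing import Set


def _ratio(hits: int, total: int) -> float:
    # Convention in this sketch: with nothing to score, the component counts as perfect.
    return 1.0 if total == 0 else hits / total


def _f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def unigram_edit_score(source: str, prediction: str, reference: str) -> float:
    """Average of add-F1, keep-F1 and delete-precision over lowercase word sets."""
    src: Set[str] = set(source.lower().split())
    pred: Set[str] = set(prediction.lower().split())
    ref: Set[str] = set(reference.lower().split())

    # Words the system added, and the ones the reference also added.
    added, ref_added = pred - src, ref - src
    add_hits = len(added & ref_added)
    add_f1 = _f1(_ratio(add_hits, len(added)), _ratio(add_hits, len(ref_added)))

    # Words the system kept from the source, and the ones the reference also kept.
    kept, ref_kept = pred & src, ref & src
    keep_hits = len(kept & ref_kept)
    keep_f1 = _f1(_ratio(keep_hits, len(kept)), _ratio(keep_hits, len(ref_kept)))

    # Words the system deleted; a deletion is correct if the reference dropped the word too.
    deleted, ref_deleted = src - pred, src - ref
    del_precision = _ratio(len(deleted & ref_deleted), len(deleted))

    return (add_f1 + keep_f1 + del_precision) / 3
```

For reported results, an established SARI implementation should be used instead of this sketch.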

Languages

The text in this dataset is in Austrian German (de-at).

Domains

All texts in this dataset are news data.

Dataset Structure

Data Access

  • Access to the dataset is restricted to academic purposes only. To download the dataset, please request access on Zenodo.

Data Instances

  • document-simplification configuration: an instance consists of an original document and one reference simplification (in plain-text format).
  • sentence-simplification configuration: an instance consists of one or more original sentences and one manually aligned reference simplification (including one or more sentences).

Data Fields

| data field | data field description |
| --- | --- |
| original | an original text from the source dataset |
| simplification | a simplified text from the source dataset |
| pair_id | document pair id |
| complex_document_id (on doc-level) | id of complex document (-1) |
| simple_document_id (on doc-level) | id of simple document (-0) |
| original_id (on sent-level) | id of sentence(s) of the original text |
| simplification_id (on sent-level) | id of sentence(s) of the simplified text |
| domain | text domain of the document pair |
| corpus | subcorpus name |
| simple_url | origin URL of the simplified document |
| complex_url | origin URL of the original document |
| simple_level or language_level_simple | required CEFR language level to understand the simplified document |
| complex_level or language_level_original | required CEFR language level to understand the original document |
| simple_location_html | location on hard disk where the HTML file of the simple document is stored |
| complex_location_html | location on hard disk where the HTML file of the original document is stored |
| simple_location_txt | location on hard disk where the content extracted from the HTML file of the simple document is stored |
| complex_location_txt | location on hard disk where the content extracted from the HTML file of the original document is stored |
| alignment_location | location on hard disk where the alignment is stored |
| simple_author | author (or copyright owner) of the simplified document |
| complex_author | author (or copyright owner) of the original document |
| simple_title | title of the simplified document |
| complex_title | title of the original document |
| license | license of the data |
| last_access or access_date | date of origin of the data or date when the HTML files were downloaded |
| rater | id of the rater who annotated the sentence pair |
| alignment | type of alignment, e.g., 1:1, 1:n, n:1 or n:m |
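To make the `alignment` field concrete, the sketch below derives the alignment type from the number of original and simplified sentences in a pair. It is only an illustration: the function name `alignment_type`, the example ids, and the idea of passing the ids as Python lists are assumptions, not the format used in the released files. It also assumes that the original side is written first (so 1:n means one original sentence aligned with several simplified ones).

```python
from typing import Sequence


def alignment_type(original_ids: Sequence[str], simplification_ids: Sequence[str]) -> str:
    """Return '1:1', '1:n', 'n:1' or 'n:m' for one aligned sentence pair."""
    if not original_ids or not simplification_ids:
        raise ValueError("both sides of an alignment need at least one sentence")
    left = "1" if len(original_ids) == 1 else "n"
    if len(simplification_ids) == 1:
        right = "1"
    else:
        # Use 'm' on the right only when the left side is also plural.
        right = "n" if left == "1" else "m"
    return f"{left}:{right}"
```

For example, `alignment_type(["o1"], ["s1", "s2"])` returns `"1:n"`, which would correspond to a sentence split.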

Data Splits

DEplain-APA is randomly split into training, development, and test sets. The sentence-level splits follow the document-level splits: the training set of the sentence-simplification configuration contains only sentences from documents in the training set of the document-simplification configuration, and the same holds for the dev and test sets. The statistics are given below.

| | Train | Dev | Test | Total |
| --- | --- | --- | --- | --- |
| Document Pairs | 387 | 48 | 48 | 483 |
| Sentence Pairs | 10660 | 1231 | 1231 | 13122 |
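The rule that sentence pairs follow the split of their document can be checked mechanically. The following is a minimal sketch, assuming each sentence-level row carries the `pair_id` of its document pair and that the splits have already been loaded into plain Python containers; the function name `find_split_leaks` and the ids in the usage example are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_split_leaks(
    doc_splits: Dict[str, Set[str]],
    sent_splits: Dict[str, Iterable[str]],
) -> List[Tuple[str, str]]:
    """Return (split, pair_id) for sentence pairs whose document is not in the same split.

    doc_splits maps a split name to the set of document pair ids in that split.
    sent_splits maps a split name to the pair_id of each sentence pair in that split.
    """
    leaks = []
    for split, pair_ids in sent_splits.items():
        allowed = doc_splits.get(split, set())
        for pair_id in pair_ids:
            if pair_id not in allowed:
                leaks.append((split, pair_id))
    return leaks
```

An empty result means that no sentence pair appears in a split other than that of its document, which is the property described for DEplain-APA.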

Inter-Annotator-Agreement: 0.7497 (moderate)

Here, more information on simplification operations will follow soon.

Dataset Creation

Curation Rationale

DEplain-APA was created to improve the training and evaluation of German document and sentence simplification. The data comes from the same provider as the APA-LHA data. In contrast to APA-LHA, which is automatically aligned, all sentence pairs of DEplain-APA are manually aligned. Further, DEplain-APA aligns texts at language level B1 with texts at level A2, which results in mostly mild simplifications.

Furthermore, DEplain-APA contains parallel documents as well as parallel sentence pairs.

Source Data

Initial Data Collection and Normalization

The original news texts (in CEFR level B2) were manually simplified by professional translators, i.e. capito – CFS GmbH, and provided to us by the Austrian Press Agency. All documents date from 2019 to 2021. Two native speakers of German manually aligned the sentence pairs using the text simplification annotation tool TS-ANNO. The data was split into sentences using a German spaCy model.

Who are the source language producers?

The original news texts (in CEFR level B2) were manually simplified by professional translators, i.e. capito – CFS GmbH. No other demographic or compensation information is known.

Annotations

Annotation process

The instructions given to the annotators are available here.

Who are the annotators?

The annotators are two native speakers of German who are trained in linguistics. Both were compensated with at least the minimum wage of their country of residence. They are not part of any target group of text simplification.

Personal and Sensitive Information

No sensitive data.

Considerations for Using the Data

Social Impact of Dataset

Many people do not understand texts due to their complexity. With automatic text simplification (TS) methods, the texts can be simplified for them. Our new training data can help in training TS models.

Discussion of Biases

No bias is known.

Other Known Limitations

The dataset is provided for research purposes only. Please check the dataset license for additional information.

Additional Information

Dataset Curators

Researchers at the Heinrich-Heine-University Düsseldorf, Germany, developed DEplain-APA. This research is part of the PhD program Online Participation, supported by the North Rhine-Westphalian (German) funding scheme Forschungskolleg.

Licensing Information

The dataset (DEplain-APA) is provided for research purposes only. Please request access using the following form: https://zenodo.org/record/7674560.

Citation Information

If you use part of this work, please cite our paper:

@inproceedings{stodden-etal-2023-deplain,
    title = "{DE}-plain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification",
    author = "Stodden, Regina  and
      Momen, Omar  and
      Kallmeyer, Laura",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    notes = "preprint: https://arxiv.org/abs/2305.18939",
}

This dataset card uses material written by Juan Diego Rodriguez and Yacine Jernite.