Skip to content

Commit

Permalink
Initial Commit
Browse files Browse the repository at this point in the history
  • Loading branch information
rooa committed Nov 17, 2020
0 parents commit 7bc6c2f
Show file tree
Hide file tree
Showing 3 changed files with 225 additions and 0 deletions.
186 changes: 186 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,186 @@
# WikiAsp: A Dataset for Multi-domain Aspect-based Summarization

This repository contains the dataset from the paper "[WikiAsp: A Dataset for Multi-domain Aspect-based Summarization](http://arxiv.org/abs/2011.07832)".

WikiAsp is a multi-domain, aspect-based summarization dataset in the encyclopedic domain.
In this task, models are asked to summarize *cited reference documents* of a Wikipedia article into aspect-based summaries.
Each of the 20 domains include 10 domain-specific pre-defined aspects.

<div align="center"><img alt="wikiasp" width="50%" src="wikiasp_task.jpg"></div>

## Dataset

### Download

WikiAsp is a available via 20 zipped archives, each of which corresponds to a domain.
**More than 28GB of storage space** is necessary to download and store all the domains (unzipped).
The following command will download all of them and extract archives:

```sh
./scripts/download_and_extract_all.sh /path/to/save_directory
```
Alternatively, one can individually download an archive for each domain from the table below.

<table>
<thead>
<tr>
<th>Domain</th>
<th>Link</th>
<th>Size (unzipped)</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Album">Album</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Album.tar.bz2">Download</a></td>
<td>2.3GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Animal">Animal</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Animal.tar.bz2">Download</a></td>
<td>589MB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Artist">Artist</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Artist.tar.bz2">Download</a></td>
<td>2.2GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Building">Building</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Building.tar.bz2">Download</a></td>
<td>1.3GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Company">Company</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Company.tar.bz2">Download</a></td>
<td>1.9GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:EducationalInstitution">EducationalInstitution</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/EducationalInstitution.tar.bz2">Download</a></td>
<td>1.9GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Event">Event</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Event.tar.bz2">Download</a></td>
<td>900MB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Film">Film</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Film.tar.bz2">Download</a></td>
<td>2.8GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Group">Group</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Group.tar.bz2">Download</a></td>
<td>1.2GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:HistoricPlace">HistoricPlace</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/HistoricPlace.tar.bz2">Download</a></td>
<td>303MB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Infrastructure">Infrastructure</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Infrastructure.tar.bz2">Download</a></td>
<td>1.3GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:MeanOfTransportation">MeanOfTransportation</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/MeanOfTransportation.tar.bz2">Download</a></td>
<td>792MB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:OfficeHolder">OfficeHolder</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/OfficeHolder.tar.bz2">Download</a></td>
<td>2.0GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Plant">Plant</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Plant.tar.bz2">Download</a></td>
<td>286MB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Single">Single</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Single.tar.bz2">Download</a></td>
<td>1.5GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:SoccerPlayer">SoccerPlayer</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/SoccerPlayer.tar.bz2">Download</a></td>
<td>721MB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Software">Software</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Software.tar.bz2">Download</a></td>
<td>1.3GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:TelevisionShow">TelevisionShow</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/TelevisionShow.tar.bz2">Download</a></td>
<td>1.1GB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:Town">Town</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/Town.tar.bz2">Download</a></td>
<td>932MB</td>
</tr>
<tr>
<td><a href="http://mappings.dbpedia.org/index.php/OntologyClass:WrittenWork">WrittenWork</a></td>
<td><a href="https://github.com/neulab/wikiasp/releases/download/v1.0/WrittenWork.tar.bz2">Download</a></td>
<td>1.8GB</td>
</tr>
</tbody>
</table>

### Format

Each domain includes three files `{train,valid,test}.jsonl`, and each line represents one instance in JSON format.
Each instance forms the following structure:

```json
{
"exid": "train-1-1",
"input": [
"tokenized and uncased sentence_1 from document_1",
"tokenized and uncased sentence_2 from document_1",
"...",
"tokenized and uncased sentence_i from document_j",
"..."
],
"targets": [
["a_1", "tokenized and uncased aspect-based summary for a_1"],
["a_2", "tokenized and uncased aspect-based summary for a_2"],
"..."
]
}
```
where,
* exid: `str`
* input: `List[str]`
* targets: `List[Tuple[str,str]]`

Here, `input` is the cited references and consists of tokenized sentences (with NLTK).
The `targets` key points to a list of aspect-based summaries, where each element is a pair of a) the target aspect and b) the aspect-based summary.

Inheriting from the base [corpus](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wikisum), this dataset exhibits the following characteristics:

* Cited references are composed of multiple documents, but the document boundaries are lost, thus expressed simply in terms of list of sentences.
* Sentences in the cited references (`input`) are tokenized using NLTK.
* The number of target summaries for each instance varies.


## Citation
If you use the dataset, please consider citing with
```
@article{hayashi2020wikiasp,
author = {Hayashi, Hiroaki and Budania, Prashant and Wang, Peng and Ackerson, Chris and Neervannan, Raj and Neubig, Graham},
title = {WikiAsp: A Dataset for Multi-domain Aspect-based Summarization},
journal = {arXiv preprint arXiv:2011.07832},
year = {2020},
}
```

## LICENSE

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
39 changes: 39 additions & 0 deletions scripts/download_and_extract_all.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
#!/bin/bash

info() {
printf "\r [\033[00;34mINFO\033[0m] %s\n" "$1"
}

fail() {
printf "\r\033[2K [\033[0;31mFAIL\033[0m] %s\n" "$1"
echo ''
exit
}

main() {
DEST=${1:-wikiasp}
PREFIX="https://github.com/rooa/summarization/releases/v1.0"

mkdir -p "$DEST"
info "Saving to $DEST"

DOMAINS=(Album Animal Artist Building Company EducationalInstitution Event Film Group
HistoricPlace Infrastructure MeanOfTransportation OfficeHolder Plant Single
SoccerPlayer Software TelevisionShow Town WrittenWork)

for DOM in "${DOMAINS[@]}"; do
TEMP_TARGET="wikiasp_temp_downloaded_${DOM}.tar.bz2"
wget -O "${TEMP_TARGET}" "$PREFIX/${DOM}.tar.bz2"
if [ ! -e "${TEMP_TARGET}" ]; then
fail "Could not download."
fi
info "Extracting $DOM data..."
tar xjvf "${TEMP_TARGET}"
mv "${DOM}" "$DEST"
rm -f "${TEMP_TARGET}"
done

info "All downloads and extraction are done at $DEST."
}

main "$@"
Binary file added wikiasp_task.jpg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 7bc6c2f

Please sign in to comment.