diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1ee8d0e
--- /dev/null
+++ b/README.md
@@ -0,0 +1,186 @@
+# WikiAsp: A Dataset for Multi-domain Aspect-based Summarization
+
+This repository contains the dataset from the paper "[WikiAsp: A Dataset for Multi-domain Aspect-based Summarization](http://arxiv.org/abs/2011.07832)".
+
+WikiAsp is a multi-domain, aspect-based summarization dataset in the encyclopedic domain.
+In this task, models are asked to summarize the *cited reference documents* of a Wikipedia article into aspect-based summaries.
+Each of the 20 domains includes 10 pre-defined, domain-specific aspects.
+
+![wikiasp](wikiasp_task.jpg)
+
+## Dataset
+
+### Download
+
+WikiAsp is available as 20 compressed archives, one per domain.
+**More than 28GB of storage space** is needed to download and store all the domains (unzipped).
+The following command downloads all of them and extracts the archives:
+
+```sh
+./scripts/download_and_extract_all.sh /path/to/save_directory
+```
+
+Alternatively, one can download the archive for each domain individually from the table below.
+
| Domain | Link | Size (unzipped) |
| --- | --- | --- |
| Album | Download | 2.3GB |
| Animal | Download | 589MB |
| Artist | Download | 2.2GB |
| Building | Download | 1.3GB |
| Company | Download | 1.9GB |
| EducationalInstitution | Download | 1.9GB |
| Event | Download | 900MB |
| Film | Download | 2.8GB |
| Group | Download | 1.2GB |
| HistoricPlace | Download | 303MB |
| Infrastructure | Download | 1.3GB |
| MeanOfTransportation | Download | 792MB |
| OfficeHolder | Download | 2.0GB |
| Plant | Download | 286MB |
| Single | Download | 1.5GB |
| SoccerPlayer | Download | 721MB |
| Software | Download | 1.3GB |
| TelevisionShow | Download | 1.1GB |
| Town | Download | 932MB |
| WrittenWork | Download | 1.8GB |
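Once an archive is extracted, each split is a JSON Lines file that can be read instance by instance. Below is a minimal Python sketch for loading a split and tallying aspect frequencies; the `Album/train.jsonl` path is a hypothetical example and should point to wherever the archives were extracted.

```python
import json
import os
from collections import Counter


def read_split(path):
    """Yield WikiAsp instances from a {train,valid,test}.jsonl file.

    Each instance is a dict with keys "exid" (str), "input" (list of
    tokenized sentences), and "targets" (list of [aspect, summary] pairs).
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def aspect_counts(path):
    """Count how often each target aspect appears in one split."""
    counts = Counter()
    for instance in read_split(path):
        for aspect, _summary in instance["targets"]:
            counts[aspect] += 1
    return counts


# Hypothetical path; adjust to your extraction directory.
if os.path.exists("Album/train.jsonl"):
    print(aspect_counts("Album/train.jsonl").most_common(5))
```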
+
+### Format
+
+Each domain contains three files, `{train,valid,test}.jsonl`, where each line is one instance in JSON format.
+Each instance has the following structure:
+
+```json
+{
+  "exid": "train-1-1",
+  "input": [
+    "tokenized and uncased sentence_1 from document_1",
+    "tokenized and uncased sentence_2 from document_1",
+    "...",
+    "tokenized and uncased sentence_i from document_j",
+    "..."
+  ],
+  "targets": [
+    ["a_1", "tokenized and uncased aspect-based summary for a_1"],
+    ["a_2", "tokenized and uncased aspect-based summary for a_2"],
+    "..."
+  ]
+}
+```
+
+where:
+* `exid`: `str`
+* `input`: `List[str]`
+* `targets`: `List[Tuple[str, str]]`
+
+Here, `input` holds the cited references as a list of sentences tokenized with NLTK.
+The `targets` key points to a list of aspect-based summaries, where each element is a pair of (a) the target aspect and (b) its aspect-based summary.
+
+Inheriting from the base [corpus](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wikisum), this dataset has the following characteristics:
+
+* Cited references are composed of multiple documents, but the document boundaries are lost, so each instance's input is a flat list of sentences.
+* Sentences in the cited references (`input`) are tokenized using NLTK.
+* The number of target summaries per instance varies.
+
+## Citation
+
+If you use the dataset, please cite:
+
+```
+@article{hayashi2020wikiasp,
+  author  = {Hayashi, Hiroaki and Budania, Prashant and Wang, Peng and Ackerson, Chris and Neervannan, Raj and Neubig, Graham},
+  title   = {WikiAsp: A Dataset for Multi-domain Aspect-based Summarization},
+  journal = {arXiv preprint arXiv:2011.07832},
+  year    = {2020},
+}
+```
+
+## LICENSE
+
+Creative Commons License
+This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
diff --git a/scripts/download_and_extract_all.sh b/scripts/download_and_extract_all.sh
new file mode 100755
index 0000000..b1fbb51
--- /dev/null
+++ b/scripts/download_and_extract_all.sh
@@ -0,0 +1,39 @@
+#!/bin/bash
+
+info() {
+  printf "\r  [\033[00;34mINFO\033[0m] %s\n" "$1"
+}
+
+fail() {
+  printf "\r\033[2K  [\033[0;31mFAIL\033[0m] %s\n" "$1"
+  echo ''
+  exit 1
+}
+
+main() {
+  DEST=${1:-wikiasp}
+  # GitHub serves release assets from releases/download/<tag>/<asset>.
+  PREFIX="https://github.com/rooa/summarization/releases/download/v1.0"
+
+  mkdir -p "$DEST"
+  info "Saving to $DEST"
+
+  DOMAINS=(Album Animal Artist Building Company EducationalInstitution Event Film Group
+           HistoricPlace Infrastructure MeanOfTransportation OfficeHolder Plant Single
+           SoccerPlayer Software TelevisionShow Town WrittenWork)
+
+  for DOM in "${DOMAINS[@]}"; do
+    TEMP_TARGET="wikiasp_temp_downloaded_${DOM}.tar.bz2"
+    # Check wget's exit status so a failed download aborts instead of
+    # leaving a partial file to be extracted.
+    if ! wget -O "${TEMP_TARGET}" "$PREFIX/${DOM}.tar.bz2"; then
+      rm -f "${TEMP_TARGET}"
+      fail "Could not download ${DOM}."
+    fi
+    info "Extracting $DOM data..."
+    # Extract directly into the destination directory.
+    tar xjf "${TEMP_TARGET}" -C "$DEST"
+    rm -f "${TEMP_TARGET}"
+  done
+
+  info "All downloads and extractions are done at $DEST."
+}
+
+main "$@"
diff --git a/wikiasp_task.jpg b/wikiasp_task.jpg
new file mode 100644
index 0000000..5e76248
Binary files /dev/null and b/wikiasp_task.jpg differ