
# MMN

This project hosts the code and dataset for our paper.

We address the problem of abstractive summarization in two directions: proposing a novel dataset and a new model. First, we collected Reddit TIFU dataset, consisting of 120K posts from the online discussion forum Reddit. Second, we propose a novel abstractive summarization model named multi-level memory networks (MMN), equipped with multi-level memory to store the information of text from different levels of abstraction.
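To give a feel for the mechanism the model's name refers to, here is a toy sketch of reading from multi-level memories with a separate attention over each level. This is only an illustration of the general idea, not the MMN implementation from the paper; `multi_level_read`, the slot counts, and the random memories are all made up for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_level_read(query, memories):
    """Attend over each memory level separately, then combine the reads.

    query:    (d,) vector
    memories: list of (slots_i, d) arrays, one per abstraction level
    returns:  (len(memories) * d,) concatenated read vectors
    """
    reads = []
    for M in memories:
        weights = softmax(M @ query)   # attention over this level's slots
        reads.append(weights @ M)      # weighted sum of this level's slots
    return np.concatenate(reads)

# Three memory levels with different numbers of slots, hidden dimension 8
rng = np.random.default_rng(0)
memories = [rng.normal(size=(n, 8)) for n in (16, 8, 4)]
query = rng.normal(size=8)
print(multi_level_read(query, memories).shape)  # (24,)
```

Keeping one attention distribution per level lets the decoder weight coarse and fine-grained representations of the source text independently before combining them.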

## Reference

If you use this code or dataset in any published research, please cite the following paper:

```bibtex
@inproceedings{Kim:2019:NAACL-HLT,
    author = {Kim, Byeongchang and Kim, Hyunwoo and Kim, Gunhee},
    title = "{Abstractive Summarization of Reddit Posts with Multi-level Memory Networks}",
    booktitle = {NAACL-HLT},
    year = 2019
}
```

## Running Code

TBU

## Reddit TIFU Dataset

Reddit TIFU is our newly collected dataset, where TIFU is the name of the subreddit /r/tifu from which the posts were drawn.

Key statistics of the Reddit TIFU dataset are outlined below, reported as averages with medians in parentheses. In total, the dataset contains 122,933 text-summary pairs.

| Dataset | #posts | #words/post | #words/summ |
| --- | --- | --- | --- |
| TIFU-short | 79,949 | 342.4 (269) | 9.33 (8) |
| TIFU-long | 42,984 | 432.6 (351) | 23.0 (21) |

You can download the data from the link below. The file includes both raw and tokenized text.

[Download json]

You can read and explore the dataset as follows:

```python
import json

# The file is in JSON Lines format: one JSON object (post) per line
posts = []
with open('tifu_tokenized_and_filtered.json', 'r') as fp:
    for line in fp:
        posts.append(json.loads(line))

# Fields of a single post
print(list(posts[50000].keys()))
# ['title_tokenized',
#  'permalink',
#  'title',
#  'url',
#  'num_comments',
#  'tldr',  # (optional)
#  'created_utc',
#  'trimmed_title_tokenized',
#  'ups',
#  'selftext_html',
#  'score',
#  'upvote_ratio',
#  'tldr_tokenized',  # (optional)
#  'selftext',
#  'trimmed_title',
#  'selftext_without_tldr_tokenized',
#  'id',
#  'selftext_without_tldr']
```
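As a quick sanity check on the statistics above, you can recompute average and median word counts from the tokenized fields. A minimal sketch, assuming the tokenized fields hold lists of tokens and that the summary field may be missing for some posts (the `toy` data below is made up; `length_stats` is not part of the released code):

```python
import statistics

def length_stats(posts, text_key, summ_key):
    """Average and median token counts over posts that have a summary.

    Assumes tokenized fields hold lists of tokens; adjust if they are
    space-separated strings in the released file.
    """
    kept = [p for p in posts if p.get(summ_key)]
    text_lens = [len(p[text_key]) for p in kept]
    summ_lens = [len(p[summ_key]) for p in kept]
    return (statistics.mean(text_lens), statistics.median(text_lens),
            statistics.mean(summ_lens), statistics.median(summ_lens))

# Toy example with made-up posts (field names follow the dataset schema)
toy = [
    {'selftext_without_tldr_tokenized': ['w'] * 300, 'tldr_tokenized': ['s'] * 20},
    {'selftext_without_tldr_tokenized': ['w'] * 400, 'tldr_tokenized': ['s'] * 24},
    {'selftext_without_tldr_tokenized': ['w'] * 100, 'tldr_tokenized': None},
]
print(length_stats(toy, 'selftext_without_tldr_tokenized', 'tldr_tokenized'))
```

The same call with `'trimmed_title_tokenized'` as the summary key would cover the TIFU-short setting, where the post title serves as the summary.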

## Acknowledgement

We thank PRAW developers for their API and Reddit users for their valuable posts.

We also appreciate Chris Dongjoo Kim and Yunseok Jang for helpful comments and discussions.

This work was supported by Kakao and Kakao Brain corporations, and Creative-Pioneering Researchers Program through Seoul National University.

## Authors

Byeongchang Kim, Hyunwoo Kim and Gunhee Kim

Vision and Learning Lab @ Computer Science and Engineering, Seoul National University, Seoul, Korea

## License

MIT license