aparnadutta / ner-restaurant-reviews Public

An annotated corpus of entities in NYT Restaurant Reviews.

NER for Restaurant Reviews

Set Up Virtual Environment

$ python3 -m venv ner-reviews
$ source ner-reviews/bin/activate
(ner-reviews) $ pip3 install -r requirements.txt

If you prefer conda:

$ conda create -n ner-reviews python=3.9
$ conda activate ner-reviews
(ner-reviews) $ pip3 install -r requirements.txt

Data

`html_reviews`

This directory contains the HTML files for New York Times food reviews. It includes url_list.txt listing the corresponding URLs.

`raw_data`

This directory contains the processed New York Times food reviews as txt files. It also includes cleaned_reviews.json and edit_cleaned_reviews.json, which contain all of the processed food review data, and unprocessed_URLs.txt, which lists any URLs that were not processed.

`tags.json`

This file contains entity types to import into an annotation tool.

`small_batch_annotation`

This directory contains the JSON files for the annotated data from the initial small batch annotation.

`annotated_data`

This directory contains the JSON files for the annotated data, along with corresponding files that were corrected to remove encoding issues.

`all_annotations.txt`

This file contains the annotations from every annotator in CoNLL format.

`adjudicated_annotations.txt`

This file contains the adjudicated annotations in CoNLL format.
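
For readers unfamiliar with CoNLL-style files, here is a minimal reading sketch. It assumes one token per line with whitespace-separated columns, the token in the first column, the tag in the last, and a blank line between sentences. The exact column layout of this repository's files is not documented here, and the `read_conll` helper and the `B-DISH` tag are hypothetical names used only for illustration.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: first column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        # Flush the final sentence if the input lacks a trailing blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample; the tag name B-DISH is illustrative only.
sample = ["The O", "ramen B-DISH", "", "Try O", "it O"]
print(read_conll(sample))
```

To read a real file, pass an open file handle, for example `read_conll(open("adjudicated_annotations.txt", encoding="utf-8"))`, adjusting the path to wherever the file lives in your checkout.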

`counts.txt`

This file contains counts of entities in adjudicated_annotations.txt.
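
As a rough illustration of how such counts could be derived, the sketch below tallies entities per type from a sequence of tags. It assumes a BIO tagging scheme in which every entity starts with a `B-` tag. Neither the scheme nor the `count_entities` helper is confirmed by this README, and the tag names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_entities(tags: Iterable[str]) -> Counter:
    """Count entities per type, assuming each entity begins with a B- tag."""
    counts: Counter = Counter()
    for tag in tags:
        if tag.startswith("B-"):
            # Strip the "B-" prefix to get the entity type.
            counts[tag[2:]] += 1
    return counts


# Hypothetical tags for illustration only.
tags = ["B-DISH", "I-DISH", "O", "B-DISH", "B-CHEF"]
print(count_entities(tags))
```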

`final_data_splits`

This directory contains the train, dev, and test sets for the adjudicated annotations.

Scripts

`review_fetcher.py`

This script downloads food reviews from The New York Times and stores them in html_reviews.

`clean_data.py`

This script processes and cleans html_reviews, and outputs raw_data.
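
The README does not describe how the cleaning works. As one possible approach, the sketch below pulls paragraph text out of an HTML page with the standard library's `html.parser`. The `ParagraphExtractor` class and `extract_paragraphs` function are hypothetical and are not taken from clean_data.py, which may use a different library and different rules.

```python
from html.parser import HTMLParser
from typing import List, Optional


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._buffer: Optional[List[str]] = None  # None while outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def extract_paragraphs(html: str) -> List[str]:
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


page = "<html><body><div>menu</div><p>The <em>ramen</em> is   good.</p></body></html>"
print(extract_paragraphs(page))
```

Text outside `<p>` elements, such as navigation, is ignored, which is a simplification that real review pages may not satisfy.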

`prepare_data.py`

This script prepares raw_data for annotation.

`fix_encoding_issue.py`

This script fixes encoding issues in annotated_data.

`data_process_pipeline.py`

This script accepts annotated_data and processes the data to output all_annotations.txt.

`data_process_utils.py`

This script includes additional functions to support data_process_pipeline.py.

`inter_annotator_agreement.py`

This script calculates inter-annotator agreement with the processed data in all_annotations.txt.
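
The README does not say which agreement metric the script uses. One common choice for two annotators is Cohen's kappa over token-level labels, sketched below. The `cohens_kappa` function is hypothetical and is not taken from inter_annotator_agreement.py.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same tokens."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of tokens with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for illustration only.
annotator_1 = ["O", "O", "B-X", "O"]
annotator_2 = ["O", "O", "B-X", "B-X"]
print(cohens_kappa(annotator_1, annotator_2))
```

In this example the annotators agree on 3 of 4 tokens (observed 0.75) and chance agreement is 0.5, so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.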

`adjudication.py`

This script runs adjudication to produce gold standard annotations and outputs adjudicated_annotations.txt.
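
How adjudication is decided is not documented here. A simple strategy, shown below purely as an assumption, is a per-token majority vote across annotators with a fallback label when the top labels tie. The `adjudicate` function and its `fallback` parameter are hypothetical and may not match what adjudication.py does.

```python
from collections import Counter
from typing import List, Sequence


def adjudicate(annotations: Sequence[Sequence[str]], fallback: str = "O") -> List[str]:
    """Pick the majority label per token; use `fallback` when the top labels tie."""
    gold: List[str] = []
    # zip(*annotations) yields, for each token, the labels from all annotators.
    for labels in zip(*annotations):
        (top, top_count), *rest = Counter(labels).most_common()
        if rest and rest[0][1] == top_count:
            # No single most frequent label: fall back.
            gold.append(fallback)
        else:
            gold.append(top)
    return gold


# Three hypothetical annotators labelling the same three tokens.
votes = [
    ["O", "B-X", "B-X"],
    ["O", "B-X", "O"],
    ["O", "O", "B-Y"],
]
print(adjudicate(votes))
```

The third token has three different labels, so this sketch resolves it to the fallback. A real adjudication pass might instead send such tokens to a human adjudicator.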

`create_train_dev_test.py`

This script accepts adjudicated_annotations.txt and splits the data into train, dev, and test sets.
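
The split proportions and method are not stated in this README. The sketch below shows one conventional approach: shuffle with a fixed seed, then cut into dev, test, and train portions. The `split_data` function, the 80/10/10 proportions, and the seed are all assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_data(
    items: Sequence[T],
    dev_frac: float = 0.1,
    test_frac: float = 0.1,
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items reproducibly and split them into train, dev, and test lists."""
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


train, dev, test = split_data(list(range(100)))
print(len(train), len(dev), len(test))
```

For NER data, the items would normally be whole sentences or whole reviews rather than individual tokens, so that no sentence is divided between splits.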

`NER_restaurant_reviews.ipynb`

This notebook runs the model and experiments using final_data_splits.
