$ python3 -m venv ner-reviews
$ source ner-reviews/bin/activate
(ner-reviews) $ pip3 install -r requirements.txt
If you prefer conda:
$ conda create -n ner-reviews python=3.9
$ conda activate ner-reviews
(ner-reviews) $ pip3 install -r requirements.txt
This directory contains the HTML files for New York Times food reviews.
It includes url_list.txt, listing the corresponding URLs.
This directory contains txt files for a processed dataset of New York Times food reviews.
It includes cleaned_reviews.json and edit_cleaned_reviews.json, which contain all of the processed food review data.
It includes unprocessed_URLs.txt, listing any unprocessed URLs.
This file contains entity types to import into an annotation tool.
This directory contains the JSON files for the annotated data from the initial small batch annotation.
This directory contains the JSON files for the annotated data, along with corresponding files that were corrected to remove encoding issues.
This file contains all annotations from all annotators in CoNLL format.
This file contains adjudicated annotations in CoNLL format.
This file contains counts of entities in adjudicated_annotations.txt.
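As a rough illustration of how such counts could be derived, the sketch below tallies entities per type from CoNLL-style lines. It assumes one token per line with the tag in the last whitespace-separated column, BIO tagging, and blank lines between sentences; the entity type names (DISH, RESTAURANT) and the function name count_entities are hypothetical and not taken from this repository.

```python
from collections import Counter


def count_entities(lines):
    """Count entities per type in BIO-tagged, CoNLL-style lines (assumed layout)."""
    counts = Counter()
    prev_tag = "O"
    for line in lines:
        parts = line.split()
        if not parts:
            # A blank line is treated as a sentence boundary.
            prev_tag = "O"
            continue
        tag = parts[-1]  # assumption: the tag is the last column
        if tag.startswith("B-"):
            counts[tag[2:]] += 1
        elif tag.startswith("I-") and prev_tag[2:] != tag[2:]:
            # An I- tag with no matching B-/I- before it starts a new entity.
            counts[tag[2:]] += 1
        prev_tag = tag
    return counts


# Hypothetical sample data, not from the real dataset.
sample = [
    "The O",
    "lobster B-DISH",
    "roll I-DISH",
    "at O",
    "Balthazar B-RESTAURANT",
    "",
    "Try O",
    "the O",
    "ramen B-DISH",
]
entity_counts = count_entities(sample)
print(dict(entity_counts))
```

To run this against the real file, the lines could be read with open("adjudicated_annotations.txt", encoding="utf-8"), provided the file follows the layout assumed here.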
This directory contains the train, dev, and test sets for the adjudicated annotations.
This script downloads food reviews from the New York Times and stores them in html_reviews.
This script processes and cleans html_reviews and outputs raw_data.
This script prepares raw_data for annotation.
This script fixes encoding issues in annotated_data.
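The README does not say which encoding issues occur. One common case is UTF-8 text that was mistakenly decoded as Windows-1252, which turns "é" into two garbage characters. The sketch below reverses that specific mistake; the function name repair_mojibake and the choice of cp1252 are assumptions, and the actual script may handle different problems.

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-cp1252 mix-up; return the text unchanged if it does not apply."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mix-up, so leave it alone.
        return text


# Simulate the damage on a hypothetical string, then repair it.
original = "caf\u00e9 chef\u2019s special"
broken = original.encode("utf-8").decode("cp1252")
repaired = repair_mojibake(broken)
print(repaired == original)
```

Falling back to the input on failure keeps already-correct strings intact, since they generally fail the round trip.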
This script accepts annotated_data and processes the data to output all_annotations.txt.
This script includes additional functions to support data_process_pipeline.py.
This script calculates inter-annotator agreement with the processed data in all_annotations.txt.
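The agreement metric used by the script is not stated here. As one possible illustration, the sketch below computes Cohen's kappa between two annotators over token-level labels. The function name, the tag names, and the choice of kappa itself are assumptions rather than a description of the actual script.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical token-level labels from two annotators.
annotator_1 = ["O", "B-DISH", "I-DISH", "O"]
annotator_2 = ["O", "B-DISH", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

With more than two annotators, a pairwise average of kappa or a multi-rater statistic such as Fleiss' kappa would be needed instead.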
This script runs adjudication to produce gold standard annotations and outputs adjudicated_annotations.txt.
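The adjudication rule is not described in this README. A simple strategy that is often used is a per-token majority vote, sketched below. The function name adjudicate, the tag names, and the tie-breaking fallback to "O" are all assumptions made for illustration.

```python
from collections import Counter


def adjudicate(token_labels, fallback="O"):
    """Pick the majority label for one token; use the fallback on a tie."""
    if not token_labels:
        raise ValueError("at least one label is required")
    ranked = Counter(token_labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        # No single label has the most votes (assumed tie-breaking policy).
        return fallback
    return ranked[0][0]


# Each inner list holds the labels that different annotators gave one token.
votes_per_token = [
    ["B-DISH", "B-DISH", "O"],
    ["I-DISH", "I-DISH", "I-DISH"],
    ["B-DISH", "I-DISH"],
]
gold = [adjudicate(votes) for votes in votes_per_token]
print(gold)
```

Ties could instead be routed to a human adjudicator; the automatic fallback here only keeps the example self-contained.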
This script accepts adjudicated_annotations.txt and splits the data into train, dev, and test sets.
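A split of this kind can be sketched as follows. The example groups CoNLL-style lines into sentences on blank lines, shuffles them with a fixed seed, and cuts them 80/10/10. The ratios, the seed, and the function names are assumptions; the actual script may split differently, for example by review rather than by sentence.

```python
import random


def read_sentences(lines):
    """Group CoNLL-style lines into sentences, using blank lines as boundaries."""
    sentences, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_sentences(sentences, train=0.8, dev=0.1, seed=13):
    """Shuffle with a fixed seed and return (train, dev, test) lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],  # whatever remains becomes the test set
    )


# Ten hypothetical one-token sentences separated by blank lines.
lines = []
for i in range(10):
    lines += [f"token{i} O", ""]
sentences = read_sentences(lines)
train_set, dev_set, test_set = split_sentences(sentences)
print(len(train_set), len(dev_set), len(test_set))
```

Fixing the seed makes the split reproducible across runs, which matters when comparing experiments on the same data.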
This script runs the model and experiments using final_data_splits.