Please cite the following paper if you are using TALLOR. Thanks!
- Jiacheng Li, Haibo Ding, Jingbo Shang, Julian McAuley, Zhe Feng. Weakly Supervised Named Entity Tagging with Learnable Logical Rules. (ACL 2021)
pip install -r requirements.txt
This section describes how to reproduce the experiments in our paper. All required datasets and rule files are already included in this repo.
Three preprocessed datasets are included in this repository.
Dataset | Task code | Dir | Source |
---|---|---|---|
BC5CDR | bc5cdr | data/bc5cdr | link |
CHEMDNER | chemdner | data/chemdner | link |
CoNLL 2003 | conll2003 | data/conll2003 | link |
python train_demo.py --dataset bc5cdr --encoder scibert
We include example output files from the experiments on the BC5CDR dataset and describe them below.
Filename or Path | Descriptions |
---|---|
checkpoint/ | Best checkpoint of the neural model. |
logging/JointIE.log | Evaluation results on the dev and test sets at each iteration. |
logging/RuleSelector.log | Rules selected in each iteration. |
logging/InstanceSelector.log | Thresholds and scores of dynamic instance selection. |
This section describes how to run TALLOR on your own plain-text dataset to recognize entities starting from only a few rules. We include the dataset bc5cdr_serving as an example.
- Prepare your text data as a .json file in the following format and put it in data/[your dataset name]/[your file name].json (example file: data/bc5cdr_serving/serving.json):
{"sentence": ["A", "lesser", "degree", "of", "orthostatic", "hypotension", "occurred", "with", "standing", "."]}
- Put the multi-gram phrase file generated by AutoPhrase at data/[your dataset name]/AutoPhrase_multi-words.txt (example file: data/bc5cdr_serving/AutoPhrase_multi-words.txt).
- Write your own rules for your dataset in tallor/label_functions/serving_template.py as a Python dict, as shown below. This file already contains rules for the BC5CDR dataset; comment them out before adding your own:
dictionary = {'proteinuria': 'Disease', 'esrd': 'Disease', 'thrombosis': 'Disease', 'tremor': 'Disease', 'hepatotoxicity': 'Disease','nicotine': 'Chemical', 'morphine': 'Chemical', 'haloperidol': 'Chemical', 'warfarin': 'Chemical', 'clonidine': 'Chemical'}
- Run TALLOR (see the next section for the hyperparameters):
python serving.py --dataset [your dataset name]
Example: python serving.py --dataset bc5cdr_serving --filename serving.json --encoder scibert
- Extracted rules and recognized entities are saved to serving/[your dataset name] in two files: extracted_rules.json and ner_results.json.
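The first step above (preparing the input file) can be sketched in Python. This is a minimal sketch, not part of the TALLOR codebase: the helper name write_serving_file is hypothetical, and it assumes that the file holds one {"sentence": [...]} object per line and that your text is already tokenized so that splitting on whitespace is enough (note that punctuation is a separate token in the example sentence). Compare the result against data/bc5cdr_serving/serving.json before relying on it.

```python
import json
from pathlib import Path


def write_serving_file(sentences, out_path):
    """Write sentences as {"sentence": [tokens]} JSON objects, one per line.

    Assumptions (not confirmed by the README): one JSON object per line,
    and whitespace tokenization of pre-tokenized text.
    """
    out_path = Path(out_path)
    # Create data/[your dataset name]/ if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for text in sentences:
            tokens = text.split()
            if tokens:  # skip empty or whitespace-only sentences
                f.write(json.dumps({"sentence": tokens}) + "\n")
    return out_path
```

For example, write_serving_file(["A lesser degree of orthostatic hypotension occurred with standing ."], "data/my_dataset/serving.json") would produce a file that can then be passed via --dataset my_dataset --filename serving.json (my_dataset is a placeholder name).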
Our default parameters should give you a good starting point for tuning the hyperparameters. Refer to the table below when selecting your parameters.
Parameters | Description |
---|---|
--filename | Your dataset file. |
--dataset | Your dataset directory. |
--encoder | Pre-trained language model used in our model: bert or scibert. |
--epoch | Number of iterations. Empirically, a larger value gives better recall and a smaller value gives better precision. |
--train_step | Number of training steps of the neural model in each iteration. This number increases by 50 after each epoch. |
--update_threshold | The ratio of the most confident data used for evaluating and updating new rules. |
--rule_threshold | The minimum frequency of rules. |
--rule_topk | Number of rules per entity category selected as new rules in each epoch. |
--global_sample_times | Sample times for global scores. |
--threshold_sample_times | Sample times for computing dynamic threshold. |
--temperature | Temperature that controls the threshold. A larger value makes instance selection stricter. |
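To make the interaction between --epoch and --train_step concrete, here is a small illustrative calculation. It is a sketch under one assumption: that the description of --train_step means the step count grows by 50 after every epoch. The function name and the starting value of 100 are hypothetical and are not TALLOR defaults.

```python
def train_steps_schedule(initial_steps, epochs, increment=50):
    """Return the number of training steps used in each epoch.

    Assumes (as an interpretation of the table above) that the step
    count grows by a fixed increment after every epoch.
    """
    return [initial_steps + increment * i for i in range(epochs)]


def total_train_steps(initial_steps, epochs, increment=50):
    """Total training steps over all epochs under the same assumption."""
    return sum(train_steps_schedule(initial_steps, epochs, increment))
```

Under this reading, raising --epoch adds more iterations and also makes each later iteration longer, so total training cost grows faster than linearly with the number of epochs.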
Jiacheng Li
E-mail: [email protected]