LICENSE
MIT License

Copyright (c) 2020 yusanshi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md
# News Recommendation

The repository currently includes the following models.

**Models in published papers**

| Model | Full name                                                 | Paper                                      |
| ----- | --------------------------------------------------------- | ------------------------------------------ |
| NRMS  | Neural News Recommendation with Multi-Head Self-Attention | https://www.aclweb.org/anthology/D19-1671/ |
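NRMS builds contextual word representations with multi-head self-attention. The scaled dot-product core of one attention head can be sketched as follows (a plain-Python, single-head illustration only; the real model adds learned per-head projections and uses `num_attention_heads = 15` from `src/config.py`):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over small plain-list vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted average of the values.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Self-attention: the word vectors attend over themselves.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = attention(words, words, words)
```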
## Get started to train

Basic setup.

```bash
git clone https://github.com/yusanshi/NewsRecommendation
cd NewsRecommendation
pip3 install -r requirements.txt
```

Download and preprocess the data.

```bash
mkdir data && cd data
# Download the GloVe pre-trained word embeddings
wget https://nlp.stanford.edu/data/glove.840B.300d.zip
sudo apt install unzip
unzip glove.840B.300d.zip -d glove
rm glove.840B.300d.zip

# Download the MIND dataset
# By downloading the dataset, you agree to the Microsoft Research License Terms
# (https://go.microsoft.com/fwlink/?LinkID=206977). For more detail about the
# dataset, see https://msnews.github.io/.

# Uncomment the following lines to use the MIND Large dataset
# (note the MIND Large test set doesn't have labels, see #11)
# wget https://mind201910small.blob.core.windows.net/release/MINDlarge_train.zip https://mind201910small.blob.core.windows.net/release/MINDlarge_dev.zip https://mind201910small.blob.core.windows.net/release/MINDlarge_test.zip
# unzip MINDlarge_train.zip -d train
# unzip MINDlarge_dev.zip -d val
# unzip MINDlarge_test.zip -d test
# rm MINDlarge_*.zip

# The following lines use the MIND Small dataset (note MIND Small doesn't
# have a test set, so the validation set is copied as the test set)
wget https://mind201910small.blob.core.windows.net/release/MINDsmall_train.zip https://mind201910small.blob.core.windows.net/release/MINDsmall_dev.zip
unzip MINDsmall_train.zip -d train
unzip MINDsmall_dev.zip -d val
cp -r val test # MIND Small has no test set
rm MINDsmall_*.zip

# Preprocess the data into the appropriate format
cd ..
python3 src/data_preprocess.py
# Remember to modify the `num_*` values in `src/config.py`
# according to the output of `src/data_preprocess.py`
```
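The `num_*` values in `src/config.py` are vocabulary sizes plus one reserved index. A hypothetical helper (not the actual `src/data_preprocess.py`) showing where a value like `num_words = 1 + 70975` presumably comes from: count tokens, keep those at or above `word_freq_threshold`, and reserve index 0 for padding/unknown words.

```python
from collections import Counter

def build_vocab(tokens, freq_threshold=1):
    """Map each sufficiently frequent token to an integer index, from 1 up."""
    counts = Counter(tokens)
    vocab = {'<pad>': 0}  # index 0 reserved for padding/unknown
    for word, freq in counts.items():
        if freq >= freq_threshold:
            vocab[word] = len(vocab)
    return vocab

vocab = build_vocab(['news', 'news', 'sports', 'rare'], freq_threshold=2)
```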

Modify `src/config.py` to select the target model. The configuration file is organized into a general part (applied to all models) and model-specific parts (which some models do not have).

```bash
vim src/config.py
```
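Since `src/config.py` reads the `MODEL_NAME` environment variable (defaulting to `NRMS`), you can also switch models without editing the file, e.g. `MODEL_NAME=NAML python3 src/train.py`. The selection logic mirrors this snippet:

```python
import os

# The model is read from the MODEL_NAME environment variable, defaulting to
# NRMS, and must be one of the models included in the repository.
model_name = os.environ.get('MODEL_NAME', 'NRMS')
assert model_name in ['NRMS', 'NAML', 'LSTUR', 'DKN', 'HiFiArk', 'TANR', 'Exp1']
```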

Run.

```bash
# Train and save checkpoints into the `checkpoint/{model_name}/` directory
python3 src/train.py
# Load the latest checkpoint and evaluate on the test set
python3 src/evaluate.py
```
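Training presumably follows the (K+1)-way classification setup common to these models (`negative_sampling_ratio = 2` in `src/config.py` is this K): each clicked news item is scored against K sampled non-clicked candidates, and the loss is the softmax cross-entropy with the clicked item as the positive class. A minimal sketch with made-up scores:

```python
import math

def click_loss(scores):
    """-log softmax probability of the positive candidate (index 0)."""
    m = max(scores)  # shift for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[0]

# 1 clicked (positive) and K=2 sampled non-clicked candidate scores.
loss = click_loss([2.0, 0.5, -1.0])
```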

You can visualize metrics with TensorBoard.

```bash
tensorboard --logdir=runs

# or, for a specific model
tensorboard --logdir=runs/{model_name}
```

> Tip: by setting the `REMARK` environment variable, you can make the run names in TensorBoard more meaningful. For example, `REMARK=num-filters-300-window-size-5 python3 src/train.py`.

### Optim study in MIND-mini

| Model    | AUC    | MRR    | nDCG@5 | nDCG@10 | Remark |
| -------- | ------ | ------ | ------ | ------- | ------ |
| baseline | 0.6253 | 0.2823 | 0.3051 | 0.3731  |        |
| +SGD     | 0.5188 | 0.2148 | 0.2250 | 0.2905  |        |
| +AdamW   | 0.6298 | 0.2841 | 0.3091 | 0.3765  |        |

### Norm study in MIND-mini

| Model    | AUC    | MRR    | nDCG@5 | nDCG@10 | Remark |
| -------- | ------ | ------ | ------ | ------- | ------ |
| baseline | 0.6253 | 0.2823 | 0.3051 | 0.3731  |        |
| +BN      | 0.5252 | 0.2476 | 0.2565 | 0.3181  |        |
| +GN      | 0.6323 | 0.2884 | 0.3122 | 0.3795  |        |
| +IN      | 0.6321 | 0.2847 | 0.3101 | 0.3785  |        |
| +LN      | 0.6404 | 0.2905 | 0.3172 | 0.3835  |        |
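The winning +LN row is layer normalization: each feature vector is normalized to zero mean and unit variance, then (in the full version) rescaled by learnable parameters. An illustrative sketch with the affine parameters fixed to identity:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    # eps guards against division by zero for constant vectors.
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
```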

### Results in MIND-mini

| Model                     | AUC    | MRR    | nDCG@5 | nDCG@10 | Remark |
| ------------------------- | ------ | ------ | ------ | ------- | ------ |
| baseline                  | 0.6253 | 0.2823 | 0.3051 | 0.3731  |        |
| +LN +AdamW + Cosine decay | 0.6421 | 0.2960 | 0.3239 | 0.3890  |        |
## Get started to open the website

```bash
cd ..
python3 src/web.py
```

requirements.txt
torch
numpy
pandas
tensorboard
tqdm
nltk
scikit-learn
swifter
ray[tune]
elasticsearch
pyquery
flask

src/config.py
import os

model_name = os.environ.get('MODEL_NAME', 'NRMS')
# Currently included models
assert model_name in [
    'NRMS', 'NAML', 'LSTUR', 'DKN', 'HiFiArk', 'TANR', 'Exp1'
]


class BaseConfig():
    """
    General configurations applied to all models
    """
    num_epochs = 2
    num_batches_show_loss = 100  # Number of batches to show loss
    # Number of batches to check metrics on the validation dataset
    num_batches_validate = 1000
    batch_size = 128
    learning_rate = 0.0001
    num_workers = 4  # Number of workers for data loading
    num_clicked_news_a_user = 50  # Number of sampled click-history items for each user
    num_words_title = 20
    num_words_abstract = 50
    word_freq_threshold = 1
    entity_freq_threshold = 2
    entity_confidence_threshold = 0.5
    negative_sampling_ratio = 2  # K
    dropout_probability = 0.2
    # Modify the following by the output of `src/data_preprocess.py`
    num_words = 1 + 70975
    num_categories = 1 + 274
    num_entities = 1 + 12957
    num_users = 1 + 50000
    word_embedding_dim = 300
    category_embedding_dim = 100
    # Modify the following only if you use another dataset
    entity_embedding_dim = 100
    # For additive attention
    query_vector_dim = 200


class NRMSConfig(BaseConfig):
    dataset_attributes = {"news": ['title'], "record": []}
    # For multi-head self-attention
    num_attention_heads = 15


class NAMLConfig(BaseConfig):
    dataset_attributes = {
        "news": ['category', 'subcategory', 'title', 'abstract'],
        "record": []
    }
    # For CNN
    num_filters = 300
    window_size = 3


class LSTURConfig(BaseConfig):
    dataset_attributes = {
        "news": ['category', 'subcategory', 'title'],
        "record": ['user', 'clicked_news_length']
    }
    # For CNN
    num_filters = 300
    window_size = 3
    long_short_term_method = 'ini'
    # See the paper for more detail
    assert long_short_term_method in ['ini', 'con']
    masking_probability = 0.5


class DKNConfig(BaseConfig):
    dataset_attributes = {"news": ['title', 'title_entities'], "record": []}
    # For CNN
    num_filters = 50
    window_sizes = [2, 3, 4]
    # TODO: currently context is not available
    use_context = False


class HiFiArkConfig(BaseConfig):
    dataset_attributes = {"news": ['title'], "record": []}
    # For CNN
    num_filters = 300
    window_size = 3
    num_pooling_heads = 5
    regularizer_loss_weight = 0.1


class TANRConfig(BaseConfig):
    dataset_attributes = {"news": ['category', 'title'], "record": []}
    # For CNN
    num_filters = 300
    window_size = 3
    topic_classification_loss_weight = 0.1


class Exp1Config(BaseConfig):
    dataset_attributes = {
        # TODO ['category', 'subcategory', 'title', 'abstract'],
        "news": ['category', 'subcategory', 'title'],
        "record": []
    }
    # For multi-head self-attention
    num_attention_heads = 15
    ensemble_factor = 1  # Do not use ensemble since it is too expensive
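The configuration classes above are presumably resolved by name at runtime (e.g. in `src/train.py`, picking `{model_name}Config` for the chosen model). A minimal, self-contained sketch of that lookup pattern, using stripped-down stand-in classes rather than the real ones:

```python
# Stand-in classes mirroring the structure of src/config.py (hypothetical,
# reduced to two attributes for illustration).
class BaseConfig:
    num_epochs = 2

class NRMSConfig(BaseConfig):
    num_attention_heads = 15

def get_config(model_name):
    """Resolve the '<ModelName>Config' class defined in this module."""
    cls = globals().get(f'{model_name}Config')
    if cls is None:
        raise ValueError(f'Unknown model: {model_name}')
    return cls

config = get_config('NRMS')
```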