Commit 5ae80d4: update for presentation
Maguire1999 committed Aug 4, 2022
Showing 169 changed files with 38,400 additions and 0 deletions.
20 changes: 20 additions & 0 deletions .idea/NewsRecommendation-master.iml


21 changes: 21 additions & 0 deletions .idea/deployment.xml


6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml


4 changes: 4 additions & 0 deletions .idea/misc.xml


8 changes: 8 additions & 0 deletions .idea/modules.xml


16 changes: 16 additions & 0 deletions .idea/sonarlint/issuestore/index.pb


21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 yusanshi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
115 changes: 115 additions & 0 deletions README.md
@@ -0,0 +1,115 @@
# News Recommendation

The repository currently includes the following models.

**Models in published papers**

| Model | Full name | Paper |
| --------- | ------------------------------------------------------------------------- | -------------------------------------------------- |
| NRMS | Neural News Recommendation with Multi-Head Self-Attention | https://www.aclweb.org/anthology/D19-1671/ |


## Get started with training

Basic setup.

```bash
git clone https://github.com/yusanshi/NewsRecommendation
cd NewsRecommendation
pip3 install -r requirements.txt
```

Download and preprocess the data.

```bash
mkdir data && cd data
# Download GloVe pre-trained word embedding
wget https://nlp.stanford.edu/data/glove.840B.300d.zip
sudo apt install unzip
unzip glove.840B.300d.zip -d glove
rm glove.840B.300d.zip

# Download MIND dataset
# By downloading the dataset, you agree to the [Microsoft Research License Terms](https://go.microsoft.com/fwlink/?LinkID=206977). For more detail about the dataset, see https://msnews.github.io/.

# Uncomment the following lines to use the MIND Large dataset (Note MIND Large test set doesn't have labels, see #11)
# wget https://mind201910small.blob.core.windows.net/release/MINDlarge_train.zip https://mind201910small.blob.core.windows.net/release/MINDlarge_dev.zip https://mind201910small.blob.core.windows.net/release/MINDlarge_test.zip
# unzip MINDlarge_train.zip -d train
# unzip MINDlarge_dev.zip -d val
# unzip MINDlarge_test.zip -d test
# rm MINDlarge_*.zip

# Uncomment the following lines to use the MIND Small dataset (Note MIND Small doesn't have a test set, so we just copy the validation set as test set :)
wget https://mind201910small.blob.core.windows.net/release/MINDsmall_train.zip https://mind201910small.blob.core.windows.net/release/MINDsmall_dev.zip
unzip MINDsmall_train.zip -d train
unzip MINDsmall_dev.zip -d val
cp -r val test # MIND Small has no test set :)
rm MINDsmall_*.zip

# Preprocess data into appropriate format
cd ..
python3 src/data_preprocess.py
# Remember to modify `num_*` in `src/config.py` according to the output of `src/data_preprocess.py`
```

Modify `src/config.py` to select the target model. The configuration file is organized into a general part (applied to all models) and model-specific parts (which some models do not have).

```bash
vim src/config.py
```
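Alternatively, since `src/config.py` reads the `MODEL_NAME` environment variable (defaulting to `NRMS`), you can select the model without editing the file. The invocation below is illustrative:

```bash
MODEL_NAME=NAML python3 src/train.py
```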

Run.

```bash
# Train and save checkpoint into `checkpoint/{model_name}/` directory
python3 src/train.py
# Load latest checkpoint and evaluate on the test set
python3 src/evaluate.py
```

You can visualize metrics with TensorBoard.

```bash
tensorboard --logdir=runs

# or
tensorboard --logdir=runs/{model_name}
# for a specific model
```

> Tip: by setting the `REMARK` environment variable, you can make the run names in TensorBoard more meaningful. For example, `REMARK=num-filters-300-window-size-5 python3 src/train.py`.

### Optimizer study in MIND-mini

| Model | AUC | MRR | nDCG@5 | nDCG@10 | Remark |
| --------- | --- | --- | ------ | ------- | ------ |
| baseline | 0.6253 | 0.2823 | 0.3051 | 0.3731 | |
| +SGD | 0.5188 | 0.2148 | 0.2250 | 0.2905 | |
| +AdamW | 0.6298 | 0.2841 | 0.3091 | 0.3765 | |


### Norm study in MIND-mini

| Model | AUC | MRR | nDCG@5 | nDCG@10 | Remark |
| --------- | --- | --- | ------ | ------- | ------ |
| baseline | 0.6253 | 0.2823 | 0.3051 | 0.3731 | |
| +BN | 0.5252 | 0.2476 | 0.2565 | 0.3181 | |
| +GN | 0.6323 | 0.2884 | 0.3122 | 0.3795 | |
| +IN | 0.6321 | 0.2847 | 0.3101 | 0.3785 | |
| +LN | 0.6404 | 0.2905 | 0.3172 | 0.3835 | |
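As a rough illustration of what the +LN row changes: layer normalization rescales each representation vector to zero mean and unit variance over the feature dimension. The sketch below shows the math in NumPy; it is not the repository's actual code, which presumably uses `torch.nn.LayerNorm` with learnable scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each vector to zero mean / unit variance over its features;
    # the learnable gamma/beta parameters are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Two vectors with very different scales normalize to the same shape.
vecs = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
normed = layer_norm(vecs)
```

Note that both rows normalize to the same values: layer norm makes the representation invariant to per-vector scale and shift, which is one plausible reason the +LN row is the strongest single change in the table.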


### Results in MIND-mini
| Model | AUC | MRR | nDCG@5 | nDCG@10 | Remark |
| --------- | --- | --- | ------ | ------- | ------ |
| baseline | 0.6253 | 0.2823 | 0.3051 | 0.3731 | |
| +LN +AdamW + Cosine decay | 0.6421 | 0.2960 | 0.3239 | 0.3890 | |



## Get started with the website
```bash
cd ..
python3 src/web.py
```
12 changes: 12 additions & 0 deletions requirements.txt
@@ -0,0 +1,12 @@
torch
numpy
pandas
tensorboard
tqdm
nltk
scikit-learn
swifter
ray[tune]
elasticsearch
pyquery
flask
106 changes: 106 additions & 0 deletions src/config.py
@@ -0,0 +1,106 @@
import os

model_name = os.environ.get('MODEL_NAME', 'NRMS')
# Currently included models
assert model_name in [
    'NRMS', 'NAML', 'LSTUR', 'DKN', 'HiFiArk', 'TANR', 'Exp1'
]


class BaseConfig():
    """
    General configurations applied to all models
    """
    num_epochs = 2
    num_batches_show_loss = 100  # Number of batches to show loss
    # Number of batches to check metrics on validation dataset
    num_batches_validate = 1000
    batch_size = 128
    learning_rate = 0.0001
    num_workers = 4  # Number of workers for data loading
    num_clicked_news_a_user = 50  # Number of sampled click history for each user
    num_words_title = 20
    num_words_abstract = 50
    word_freq_threshold = 1
    entity_freq_threshold = 2
    entity_confidence_threshold = 0.5
    negative_sampling_ratio = 2  # K
    dropout_probability = 0.2
    # Modify the following by the output of `src/data_preprocess.py`
    num_words = 1 + 70975
    num_categories = 1 + 274
    num_entities = 1 + 12957
    num_users = 1 + 50000
    word_embedding_dim = 300
    category_embedding_dim = 100
    # Modify the following only if you use another dataset
    entity_embedding_dim = 100
    # For additive attention
    query_vector_dim = 200


class NRMSConfig(BaseConfig):
    dataset_attributes = {"news": ['title'], "record": []}
    # For multi-head self-attention
    num_attention_heads = 15


class NAMLConfig(BaseConfig):
    dataset_attributes = {
        "news": ['category', 'subcategory', 'title', 'abstract'],
        "record": []
    }
    # For CNN
    num_filters = 300
    window_size = 3


class LSTURConfig(BaseConfig):
    dataset_attributes = {
        "news": ['category', 'subcategory', 'title'],
        "record": ['user', 'clicked_news_length']
    }
    # For CNN
    num_filters = 300
    window_size = 3
    long_short_term_method = 'ini'
    # See paper for more detail
    assert long_short_term_method in ['ini', 'con']
    masking_probability = 0.5


class DKNConfig(BaseConfig):
    dataset_attributes = {"news": ['title', 'title_entities'], "record": []}
    # For CNN
    num_filters = 50
    window_sizes = [2, 3, 4]
    # TODO: currently context is not available
    use_context = False


class HiFiArkConfig(BaseConfig):
    dataset_attributes = {"news": ['title'], "record": []}
    # For CNN
    num_filters = 300
    window_size = 3
    num_pooling_heads = 5
    regularizer_loss_weight = 0.1


class TANRConfig(BaseConfig):
    dataset_attributes = {"news": ['category', 'title'], "record": []}
    # For CNN
    num_filters = 300
    window_size = 3
    topic_classification_loss_weight = 0.1


class Exp1Config(BaseConfig):
    dataset_attributes = {
        # TODO ['category', 'subcategory', 'title', 'abstract'],
        "news": ['category', 'subcategory', 'title'],
        "record": []
    }
    # For multi-head self-attention
    num_attention_heads = 15
    ensemble_factor = 1  # Don't use ensemble since it's too expensive
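A minimal sketch of how a training script might resolve `model_name` to the matching config class by name. The classes below are hypothetical stand-ins for the ones defined above; the repository's scripts may do this differently.

```python
import os

# Hypothetical stand-ins for the config classes defined above.
class BaseConfig:
    num_epochs = 2
    batch_size = 128

class NRMSConfig(BaseConfig):
    num_attention_heads = 15

class NAMLConfig(BaseConfig):
    num_filters = 300
    window_size = 3

# Same default-selection pattern as src/config.py.
model_name = os.environ.get('MODEL_NAME', 'NRMS')
# Resolve e.g. 'NRMS' -> NRMSConfig; shared fields come from BaseConfig.
config = globals()[f'{model_name}Config']()
```

Because every model class inherits from `BaseConfig`, shared hyperparameters like `batch_size` are available on any resolved config object.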