Commit 5ae80d4: update for presentation
Maguire1999 committed Aug 4, 2022
Showing 169 changed files with 38,400 additions and 0 deletions.
20 changes: 20 additions & 0 deletions .idea/NewsRecommendation-master.iml


21 changes: 21 additions & 0 deletions .idea/deployment.xml


6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml


4 changes: 4 additions & 0 deletions .idea/misc.xml


8 changes: 8 additions & 0 deletions .idea/modules.xml


16 changes: 16 additions & 0 deletions .idea/sonarlint/issuestore/index.pb


21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 yusanshi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
115 changes: 115 additions & 0 deletions README.md
@@ -0,0 +1,115 @@
# News Recommendation

The repository currently includes the following models.

**Models in published papers**

| Model | Full name | Paper |
| --------- | ------------------------------------------------------------------------- | -------------------------------------------------- |
| NRMS | Neural News Recommendation with Multi-Head Self-Attention | https://www.aclweb.org/anthology/D19-1671/ |


## Get started with training

Basic setup.

```bash
git clone https://github.com/yusanshi/NewsRecommendation
cd NewsRecommendation
pip3 install -r requirements.txt
```

Download and preprocess the data.

```bash
mkdir data && cd data
# Download GloVe pre-trained word embedding
wget https://nlp.stanford.edu/data/glove.840B.300d.zip
sudo apt install unzip
unzip glove.840B.300d.zip -d glove
rm glove.840B.300d.zip

# Download MIND dataset
# By downloading the dataset, you agree to the [Microsoft Research License Terms](https://go.microsoft.com/fwlink/?LinkID=206977). For more detail about the dataset, see https://msnews.github.io/.

# Uncomment the following lines to use the MIND Large dataset (Note MIND Large test set doesn't have labels, see #11)
# wget https://mind201910small.blob.core.windows.net/release/MINDlarge_train.zip https://mind201910small.blob.core.windows.net/release/MINDlarge_dev.zip https://mind201910small.blob.core.windows.net/release/MINDlarge_test.zip
# unzip MINDlarge_train.zip -d train
# unzip MINDlarge_dev.zip -d val
# unzip MINDlarge_test.zip -d test
# rm MINDlarge_*.zip

# Uncomment the following lines to use the MIND Small dataset (Note MIND Small doesn't have a test set, so we just copy the validation set as test set :)
wget https://mind201910small.blob.core.windows.net/release/MINDsmall_train.zip https://mind201910small.blob.core.windows.net/release/MINDsmall_dev.zip
unzip MINDsmall_train.zip -d train
unzip MINDsmall_dev.zip -d val
cp -r val test # MIND Small has no test set :)
rm MINDsmall_*.zip

# Preprocess data into appropriate format
cd ..
python3 src/data_preprocess.py
# Remember to modify `num_*` in `src/config.py` according to the output of `src/data_preprocess.py`
```

Modify `src/config.py` to select the target model. The configuration file is organized into a general part (applied to all models) and model-specific parts (which some models do not have).

```bash
vim src/config.py
```
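Alternatively, since `src/config.py` reads the `MODEL_NAME` environment variable (defaulting to `NRMS`), you can select the model without editing the file. The invocation below is illustrative:

```bash
MODEL_NAME=NAML python3 src/train.py
```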

Run.

```bash
# Train and save checkpoint into `checkpoint/{model_name}/` directory
python3 src/train.py
# Load latest checkpoint and evaluate on the test set
python3 src/evaluate.py
```

You can visualize metrics with TensorBoard.

```bash
tensorboard --logdir=runs

# or
tensorboard --logdir=runs/{model_name}
# for a specific model
```

> Tip: by setting the `REMARK` environment variable, you can make the run names in TensorBoard more meaningful. For example, `REMARK=num-filters-300-window-size-5 python3 src/train.py`.

### Optimizer study in MIND-mini

| Model | AUC | MRR | nDCG@5 | nDCG@10 | Remark |
| --------- | --- | --- | ------ | ------- | ------ |
| baseline | 0.6253 | 0.2823 | 0.3051 | 0.3731 | |
| +SGD | 0.5188 | 0.2148 | 0.2250 | 0.2905 | |
| +AdamW | 0.6298 | 0.2841 | 0.3091 | 0.3765 | |


### Norm study in MIND-mini

| Model | AUC | MRR | nDCG@5 | nDCG@10 | Remark |
| --------- | --- | --- | ------ | ------- | ------ |
| baseline | 0.6253 | 0.2823 | 0.3051 | 0.3731 | |
| +BN | 0.5252 | 0.2476 | 0.2565 | 0.3181 | |
| +GN | 0.6323 | 0.2884 | 0.3122 | 0.3795 | |
| +IN | 0.6321 | 0.2847 | 0.3101 | 0.3785 | |
| +LN | 0.6404 | 0.2905 | 0.3172 | 0.3835 | |
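As a rough illustration of what the +LN row changes: layer normalization rescales each representation vector to zero mean and unit variance over the feature dimension. The sketch below shows the math in NumPy; it is not the repository's actual code, which presumably uses `torch.nn.LayerNorm` with learnable scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each vector to zero mean / unit variance over its features;
    # the learnable gamma/beta parameters are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Two vectors with very different scales normalize to the same shape.
vecs = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
normed = layer_norm(vecs)
```

Note that both rows normalize to the same values: layer norm makes the representation invariant to per-vector scale and shift, which is one plausible reason the +LN row is the strongest single change in the table.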


### Results in MIND-mini
| Model | AUC | MRR | nDCG@5 | nDCG@10 | Remark |
| --------- | --- | --- | ------ | ------- | ------ |
| baseline | 0.6253 | 0.2823 | 0.3051 | 0.3731 | |
| +LN +AdamW + Cosine decay | 0.6421 | 0.2960 | 0.3239 | 0.3890 | |



## Get started with the website
```bash
cd ..
python3 src/web.py
```
12 changes: 12 additions & 0 deletions requirements.txt
@@ -0,0 +1,12 @@
torch
numpy
pandas
tensorboard
tqdm
nltk
scikit-learn
swifter
ray[tune]
elasticsearch
pyquery
flask
106 changes: 106 additions & 0 deletions src/config.py
@@ -0,0 +1,106 @@
import os

model_name = os.environ.get('MODEL_NAME', 'NRMS')
# Currently included models
assert model_name in [
    'NRMS', 'NAML', 'LSTUR', 'DKN', 'HiFiArk', 'TANR', 'Exp1'
]


class BaseConfig():
    """
    General configurations applied to all models
    """
    num_epochs = 2
    num_batches_show_loss = 100  # Number of batches to show loss
    # Number of batches to check metrics on validation dataset
    num_batches_validate = 1000
    batch_size = 128
    learning_rate = 0.0001
    num_workers = 4  # Number of workers for data loading
    num_clicked_news_a_user = 50  # Number of sampled click history for each user
    num_words_title = 20
    num_words_abstract = 50
    word_freq_threshold = 1
    entity_freq_threshold = 2
    entity_confidence_threshold = 0.5
    negative_sampling_ratio = 2  # K
    dropout_probability = 0.2
    # Modify the following by the output of `src/data_preprocess.py`
    num_words = 1 + 70975
    num_categories = 1 + 274
    num_entities = 1 + 12957
    num_users = 1 + 50000
    word_embedding_dim = 300
    category_embedding_dim = 100
    # Modify the following only if you use another dataset
    entity_embedding_dim = 100
    # For additive attention
    query_vector_dim = 200


class NRMSConfig(BaseConfig):
    dataset_attributes = {"news": ['title'], "record": []}
    # For multi-head self-attention
    num_attention_heads = 15


class NAMLConfig(BaseConfig):
    dataset_attributes = {
        "news": ['category', 'subcategory', 'title', 'abstract'],
        "record": []
    }
    # For CNN
    num_filters = 300
    window_size = 3


class LSTURConfig(BaseConfig):
    dataset_attributes = {
        "news": ['category', 'subcategory', 'title'],
        "record": ['user', 'clicked_news_length']
    }
    # For CNN
    num_filters = 300
    window_size = 3
    long_short_term_method = 'ini'
    # See paper for more detail
    assert long_short_term_method in ['ini', 'con']
    masking_probability = 0.5


class DKNConfig(BaseConfig):
    dataset_attributes = {"news": ['title', 'title_entities'], "record": []}
    # For CNN
    num_filters = 50
    window_sizes = [2, 3, 4]
    # TODO: currently context is not available
    use_context = False


class HiFiArkConfig(BaseConfig):
    dataset_attributes = {"news": ['title'], "record": []}
    # For CNN
    num_filters = 300
    window_size = 3
    num_pooling_heads = 5
    regularizer_loss_weight = 0.1


class TANRConfig(BaseConfig):
    dataset_attributes = {"news": ['category', 'title'], "record": []}
    # For CNN
    num_filters = 300
    window_size = 3
    topic_classification_loss_weight = 0.1


class Exp1Config(BaseConfig):
    dataset_attributes = {
        # TODO ['category', 'subcategory', 'title', 'abstract'],
        "news": ['category', 'subcategory', 'title'],
        "record": []
    }
    # For multi-head self-attention
    num_attention_heads = 15
    ensemble_factor = 1  # Don't use ensemble since it's too expensive
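A minimal sketch of how a training script might resolve `model_name` to the matching config class by name. The classes below are hypothetical stand-ins for the ones defined above; the repository's scripts may do this differently.

```python
import os

# Hypothetical stand-ins for the config classes defined above.
class BaseConfig:
    num_epochs = 2
    batch_size = 128

class NRMSConfig(BaseConfig):
    num_attention_heads = 15

class NAMLConfig(BaseConfig):
    num_filters = 300
    window_size = 3

# Same default-selection pattern as src/config.py.
model_name = os.environ.get('MODEL_NAME', 'NRMS')
# Resolve e.g. 'NRMS' -> NRMSConfig; shared fields come from BaseConfig.
config = globals()[f'{model_name}Config']()
```

Because every model class inherits from `BaseConfig`, shared hyperparameters like `batch_size` are available on any resolved config object.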