[feature] leave-k-out split mode #2121

Open
wants to merge 7 commits into base: master
14 changes: 14 additions & 0 deletions asset/model_list.json
@@ -154,6 +154,20 @@
"repository": "RecBole",
"repo_link": "https://github.com/RUCAIBox/RecBole"
},
{
"category": "General Recommendation",
"cate_link": "/docs/user_guide/model_intro.html#general-recommendation",
"year": "2013",
"pub": "RecSys'13",
"model": "AsymKNN",
"model_link": "/docs/user_guide/model/general/asymknn.html",
"paper": "Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets",
"paper_link": "https://doi.org/10.1145/2507157.2507189",
"authors": "Fabio Aiolli",
"ref_code": "",
"repository": "RecBole",
"repo_link": "https://github.com/RUCAIBox/RecBole"
},
{
"category": "General Recommendation",
"cate_link": "/docs/user_guide/model_intro.html#general-recommendation",
@@ -0,0 +1,4 @@
.. automodule:: recbole.model.general_recommender.asymknn
:members:
:undoc-members:
:show-inheritance:
1 change: 1 addition & 0 deletions docs/source/recbole/recbole.model.general_recommender.rst
@@ -4,6 +4,7 @@ recbole.model.general\_recommender
.. toctree::
:maxdepth: 4

recbole.model.general_recommender.admmslim
recbole.model.general_recommender.asymknn
recbole.model.general_recommender.bpr
recbole.model.general_recommender.cdae
2 changes: 1 addition & 1 deletion docs/source/user_guide/config/evaluation_settings.rst
@@ -12,7 +12,7 @@ Evaluation settings are designed to set parameters about model evaluation.

- ``order (str)``: decides how we sort the data in `.inter`. Now we support two kinds of ordering strategies: ``['RO', 'TO']``, which denotes the random ordering and temporal ordering. For ``RO``, we will shuffle the data and then split them in this order. For ``TO``, we will sort the data by the column of `TIME_FIELD` in ascending order and the split them in this order. The default value is ``RO``.

- ``split (dict)``: decides how we split the data in `.inter`. Now we support two kinds of splitting strategies: ``['RS','LS']``, which denotes the ratio-based data splitting and leave-one-out data splitting. If the key of ``split`` is ``RS``, you need to set the splitting ratio like ``[0.8,0.1,0.1]``, ``[7,2,1]`` or ``[8,0,2]``, which denotes the ratio of training set, validation set and testing set respectively. If the key of ``split`` is ``LS``, now we support three kinds of ``LS`` mode: ``['valid_and_test', 'valid_only', 'test_only']`` and you should choose one mode as the value of ``LS``. The default value of ``split`` is ``{'RS': [0.8,0.1,0.1]}``.
- ``split (dict)``: decides how we split the data in `.inter`. Now we support three kinds of splitting strategies: ``['RS', 'LS', 'LK']``, which denote ratio-based splitting, leave-one-out splitting, and leave-k-out splitting. If the key of ``split`` is ``RS``, you need to set the splitting ratio like ``[0.8,0.1,0.1]``, ``[7,2,1]`` or ``[8,0,2]``, which denotes the ratio of the training set, validation set and testing set respectively. If the key is ``LS``, you should choose one of three modes as its value: ``['valid_and_test', 'valid_only', 'test_only']``. If the key is ``LK``, you need to provide a list containing the mode and the number ``k``, in the format ``['valid_and_test', k]``, where ``k`` is the number of interactions left out per user according to the specified mode. The default value of ``split`` is ``{'RS': [0.8,0.1,0.1]}``.
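To make the new ``LK`` mode concrete, a minimal ``eval_args`` block could look as follows (a sketch only: it mirrors the ``RS``/``LS`` examples in this guide, and ``k=5`` is an arbitrary illustrative value):

.. code:: yaml

    eval_args:
        split: {'LK': ['valid_and_test', 5]}
        order: TO
        group_by: user

With this setting, the last 5 interactions of each user (under the chosen ordering) form the test set and the 5 interactions before those form the validation set.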

- ``mode (str|dict)``: decides the data range when we evaluate the model during ``valid`` and ``test`` phase. Now we support four kinds of evaluation mode: ``['full','unixxx','popxxx','labeled']``. ``full`` , ``unixxx`` and ``popxxx`` are designed for the evaluation on implicit feedback (data without label). For implicit feedback, we regard the items with observed interactions as positive items and those without observed interactions as negative items. ``full`` means evaluating the model on the set of all items. ``unixxx``, for example ``uni100``, means uniformly sample 100 negative items for each positive item in testing set, and evaluate the model on these positive items with their sampled negative items. ``popxxx``, for example ``pop100``, means sample 100 negative items for each positive item in testing set based on item popularity (:obj:`Counter(item)` in `.inter` file), and evaluate the model on these positive items with their sampled negative items. Here the `xxx` must be an integer. For explicit feedback (data with label), you should set the mode as ``labeled`` and we will evaluate the model based on your label. You can use ``valid`` and ``test`` as the dict key to set specific ``mode`` in different phases. The default value is ``full``, which is equivalent to ``{'valid': 'full', 'test': 'full'}``.

88 changes: 88 additions & 0 deletions docs/source/user_guide/model/general/asymknn.rst
@@ -0,0 +1,88 @@
AsymKNN
===========

Introduction
---------------------

`[paper] <https://dl.acm.org/doi/pdf/10.1145/2507157.2507189>`_

**Title:** Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets

**Authors:** Fabio Aiolli

**Abstract:** We present a simple and scalable algorithm for top-N recommendation able to deal with very large datasets and (binary rated) implicit feedback. We focus on memory-based collaborative filtering algorithms similar to the well known neighbor-based technique for explicit feedback. The major difference, that makes the algorithm particularly scalable, is that it uses positive feedback only and no explicit computation of the complete (user-by-user or item-by-item) similarity matrix needs to be performed. The study of the proposed algorithm has been conducted on data from the Million Songs Dataset (MSD) challenge whose task was to suggest a set of songs (out of more than 380k available songs) to more than 100k users given half of the user listening history and complete listening history of other 1 million people. In particular, we investigate on the entire recommendation pipeline, starting from the definition of suitable similarity and scoring functions and suggestions on how to aggregate multiple ranking strategies to define the overall recommendation. The technique we are proposing extends and improves the one that already won the MSD challenge last year.

In this article, we introduce a versatile class of recommendation algorithms that calculate either user-to-user or item-to-item similarities as the foundation for generating recommendations. This approach enables the flexibility to switch between UserKNN and ItemKNN models depending on the desired application.

A distinguishing feature of this class of algorithms, exemplified by AsymKNN, is its use of asymmetric cosine similarity, which generalizes the traditional cosine similarity. Specifically, when the asymmetry parameter
``alpha = 0.5``, the method reduces to the standard cosine similarity, while other values of ``alpha`` allow for tailored emphasis on specific aspects of the interaction data. Furthermore, setting the parameter
``beta = 1.0`` recovers a traditional UserKNN or ItemKNN, as the final scores are then only divided by a fixed positive constant, which preserves the order of recommendations.
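To illustrate the similarity at the heart of AsymKNN, here is a minimal, self-contained sketch of asymmetric cosine on binary feedback. This is only an illustration of the formula, not RecBole's vectorized implementation; the set-based profile representation is an assumption made for readability.

```python
def asymmetric_cosine(users_x, users_y, alpha=0.5):
    """Asymmetric cosine similarity between two binary interaction profiles.

    users_x, users_y: sets of user ids that interacted with item x / item y
    (swap the roles for a user-to-user similarity).
    With alpha = 0.5 this reduces to the standard cosine similarity.
    """
    overlap = len(users_x & users_y)
    if overlap == 0:
        return 0.0
    return overlap / (len(users_x) ** alpha * len(users_y) ** (1 - alpha))

# alpha = 0.5 recovers plain cosine: |X & Y| / sqrt(|X| * |Y|)
sim_cos = asymmetric_cosine({1, 2, 3, 4}, {3, 4}, alpha=0.5)
# alpha = 1.0 normalizes by |X| only, down-weighting popular target items
sim_asym = asymmetric_cosine({1, 2, 3, 4}, {3, 4}, alpha=1.0)
```

The ``beta`` and ``q`` hyper-parameters described below act on the scoring stage rather than on this similarity itself.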

Running with RecBole
-------------------------

**Model Hyper-Parameters:**

- ``k (int)`` : The neighborhood size. Defaults to ``100``.

- ``alpha (float)`` : Weight parameter for asymmetric cosine similarity. Defaults to ``0.5``.

- ``beta (float)`` : Parameter for controlling the balance between factors in the final score normalization. Defaults to ``1.0``.

- ``q (int)`` : The 'locality of scoring function' parameter. Defaults to ``1``.

**Additional Parameters:**

- ``knn_method (str)`` : Calculates user-to-user similarities if set to ``'user'``; otherwise calculates item-to-item similarities. Defaults to ``item``.


**A Running Example:**

Write the following code to a Python file, such as `run.py`:

.. code:: python

from recbole.quick_start import run_recbole

run_recbole(model='AsymKNN', dataset='ml-100k')

And then:

.. code:: bash

python run.py

Tuning Hyper Parameters
-------------------------

If you want to use ``HyperTuning`` to tune hyper parameters of this model, you can copy the following settings and name it as ``hyper.test``.

.. code:: bash

k choice [10,50,100,200,250,300,400,500,1000,1500,2000,2500]
alpha choice [0.0,0.2,0.5,0.8,1.0]
beta choice [0.0,0.2,0.5,0.8,1.0]
q choice [1,2,3,4,5,6]

Note that these hyper-parameter ranges are provided for reference only; we cannot guarantee that they contain the optimal values for this model.

Then, with the source code of RecBole (you can download it from GitHub), you can run ``run_hyper.py`` for tuning:

.. code:: bash

python run_hyper.py --model=[model_name] --dataset=[dataset_name] --config_files=[config_files_path] --params_file=hyper.test

For more details about Parameter Tuning, refer to :doc:`../../../user_guide/usage/parameter_tuning`.

If you want to change parameters, dataset or evaluation settings, take a look at

- :doc:`../../../user_guide/config_settings`
- :doc:`../../../user_guide/data_intro`
- :doc:`../../../user_guide/train_eval_intro`
- :doc:`../../../user_guide/usage`
1 change: 1 addition & 0 deletions docs/source/user_guide/model_intro.rst
@@ -13,6 +13,7 @@ task of top-n recommendation. All the collaborative filter(CF) based models are
.. toctree::
:maxdepth: 1

model/general/asymknn
model/general/pop
model/general/itemknn
model/general/bpr
3 changes: 2 additions & 1 deletion docs/source/user_guide/train_eval_intro.rst
@@ -42,6 +42,7 @@ items or a sampled-based ranking.
RO Random Ordering
TO Temporal Ordering
LS Leave-one-out Splitting
LK Leave-k-out Splitting
RS Ratio-based Splitting
full full ranking with all item candidates
uniN sample-based ranking: each positive item is paired with N sampled negative items in uniform distribution
@@ -54,7 +55,7 @@ The parameters used to control the evaluation method are as follows:
including ``split``, ``group_by``, ``order`` and ``mode``.

- ``split (dict)``: Control the splitting of dataset and the split ratio. The key is splitting method
and value is the list of split ratio. The range of key is ``[RS,LS]``. Defaults to ``{'RS':[0.8, 0.1, 0.1]}``
and value is the list of split ratio. The range of key is ``[RS,LS,LK]``. Defaults to ``{'RS':[0.8, 0.1, 0.1]}``
- ``group_by (str)``: Whether to split dataset with the group of user.
Range in ``[None, user]`` and defaults to ``user``.
- ``order (str)``: Control the ordering of data and affect the splitting of data.
72 changes: 72 additions & 0 deletions recbole/data/dataset/dataset.py
@@ -1729,6 +1729,74 @@ def leave_one_out(self, group_by, leave_one_mode):
next_ds = [self.copy(_) for _ in next_df]
return next_ds

def _split_index_by_leave_k_out(self, grouped_index, leave_k_num, k):
"""Split indexes by strategy leave k out.

Args:
grouped_index (list of list of int): Index to be split.
leave_k_num (int): Number of parts whose length is expected to be ``k``.
k (int): Number of interactions to leave out for each part.

Returns:
list: List of index that has been split.
"""
next_index = [[] for _ in range(leave_k_num + 1)]
for index in grouped_index:
index = list(index)
tot_cnt = len(index)
# shrink the number of held-out parts so that at least one
# interaction remains in the training split
legal_leave_k_num = min(leave_k_num, (tot_cnt - 1) // k)
pr = tot_cnt - legal_leave_k_num * k
next_index[0].extend(index[:pr])
for i in range(legal_leave_k_num):
# take disjoint chunks of k interactions from the tail
next_index[-legal_leave_k_num + i].extend(index[pr : pr + k])
pr += k
return next_index
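Outside of RecBole's class machinery, the intended index arithmetic can be sketched as a standalone helper (hypothetical names; a simplified illustration of the leave-k-out idea, which pads trailing parts with empty chunks when a user has too few interactions):

```python
def split_leave_k_out(index, parts, k):
    """Leave-k-out split of one user's ordered interaction indexes.

    Returns ``parts + 1`` lists: the training indexes first, then ``parts``
    held-out chunks of ``k`` interactions each, taken from the end of the
    sequence. At least one interaction is always kept for training.
    """
    total = len(index)
    # how many full chunks of size k can be held out
    legal_parts = min(parts, max((total - 1) // k, 0))
    cut = total - legal_parts * k
    splits = [index[:cut]]
    for i in range(legal_parts):
        splits.append(index[cut + i * k : cut + (i + 1) * k])
    # pad with empty chunks if the user could not fill every part
    splits += [[] for _ in range(parts - legal_parts)]
    return splits

train, valid, test = split_leave_k_out(list(range(10)), parts=2, k=2)
# train = [0, 1, 2, 3, 4, 5], valid = [6, 7], test = [8, 9]
```

The held-out chunks are disjoint slices of the tail, so no interaction ever appears in more than one split.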

def leave_k_out(self, group_by, leave_k_mode, k):
"""Split interaction records by leave k out strategy.

Args:
group_by (str): Field name that interaction records should grouped by before splitting.
leave_k_mode (str): The way to leave k out. It can only take three values:
'valid_and_test', 'valid_only' and 'test_only'.
k (int): Number of interactions to leave out for each held-out split.

Returns:
list: List of :class:`~Dataset`, whose interaction features has been split.
"""
self.logger.debug(
f"leave k out, group_by=[{group_by}], leave_k_mode=[{leave_k_mode}]"
)
if group_by is None:
raise ValueError("leave k out strategy requires a group field")

grouped_inter_feat_index = self._grouped_index(
self.inter_feat[group_by].numpy()
)
if leave_k_mode == "valid_and_test":
next_index = self._split_index_by_leave_k_out(
grouped_inter_feat_index, leave_k_num=2, k=k
)
elif leave_k_mode == "valid_only":
next_index = self._split_index_by_leave_k_out(
grouped_inter_feat_index, leave_k_num=1, k=k
)
next_index.append([])
elif leave_k_mode == "test_only":
next_index = self._split_index_by_leave_k_out(
grouped_inter_feat_index, leave_k_num=1, k=k
)
next_index = [next_index[0], [], next_index[1]]
else:
raise NotImplementedError(
f"The leave_k_mode [{leave_k_mode}] has not been implemented."
)

self._drop_unused_col()
next_df = [self.inter_feat[index] for index in next_index]
next_ds = [self.copy(_) for _ in next_df]
return next_ds

def shuffle(self):
"""Shuffle the interaction records inplace."""
self.inter_feat.shuffle()
@@ -1799,6 +1867,10 @@ def build(self):
datasets = self.leave_one_out(
group_by=self.uid_field, leave_one_mode=split_args["LS"]
)
elif split_mode == "LK":
datasets = self.leave_k_out(
group_by=self.uid_field, leave_k_mode=split_args["LK"][0], k=split_args["LK"][1]
)
else:
raise NotImplementedError(
f"The splitting_method [{split_mode}] has not been implemented."
1 change: 1 addition & 0 deletions recbole/model/general_recommender/__init__.py
@@ -1,3 +1,4 @@
from recbole.model.general_recommender.asymknn import AsymKNN
from recbole.model.general_recommender.bpr import BPR
from recbole.model.general_recommender.cdae import CDAE
from recbole.model.general_recommender.convncf import ConvNCF