This repository is our solution to the WSDM Cup 2023 Unbiased Learning for Web Search task. Building on the Dual Learning Algorithm (DLA), our solution conducts extensive research on unbiased learning to rank and proposes a strategy of using multiple behavioral features for unbiased learning, which greatly improves the performance of the ranking model.
The overall framework of the model is shown in Fig.1.
Taking the data of one search session as an example, as shown in Fig.1, the text features of the document at position n are fed into the relevance model to produce its relevance score r, while the other features of the document that can be used to estimate the examination propensity are fed into the propensity model to produce its propensity score p. The two scores p and r are then multiplied to obtain the click score s of position n.
Note that instead of inputting the entire document list of the session, we sample a group of 6 documents from the list: 1 clicked document (the positive sample) and 5 unclicked documents (the negative samples).
In addition, only the propensity score of the positive sample is predicted by the model, while the propensity score of each negative sample is fixed to 0.1; that is, p1, p3 and pn in Fig.1 are all set to 0.1.
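The snippet below is a minimal PyTorch sketch of this scoring scheme. The layer shapes, feature tensors, and softmax cross-entropy loss are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the two towers in Fig.1 (hypothetical sizes, not the repo's code).
relevance_model = torch.nn.Linear(768, 1)   # consumes text features of each document
propensity_model = torch.nn.Linear(8, 1)    # consumes bias-related features (position, etc.)

# One sampled group: index 0 is the clicked positive, indices 1-5 are unclicked negatives.
text_feats = torch.randn(6, 768)
bias_feats = torch.randn(6, 8)

r = relevance_model(text_feats).squeeze(-1)                          # relevance scores r
p_pos = torch.sigmoid(propensity_model(bias_feats[:1])).squeeze(-1)  # propensity of the positive
p = torch.cat([p_pos, torch.full((5,), 0.1)])                        # negatives use the fixed 0.1

s = p * r  # click score s = p * r for each position in the group

# Train with a softmax cross-entropy over the group, the clicked document being the target.
loss = F.cross_entropy(s.unsqueeze(0), torch.tensor([0]))
loss.backward()
```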
The environment of the unbiased learning to rank task is the same as that of the Pre-training for Web Search task.
Suppose you have downloaded the Web Search Session Data (training data) and annotation_data_0522.txt (test data) from Google Drive. For those who cannot access Google Drive:
Note: unzipping the training data may take a long time.
A pre-trained language model is important for the model in Fig.1. You can download the pre-trained language models we trained from the table below:
PTM Version | URL |
---|---|
Bert_Layer12_Head12 | Bert_Layer12_Head12 |
Bert_Layer12_Head12 wwm | Bert_Layer12_Head12 wwm |
Bert_Layer24_Head12 | Bert_Layer24_Head12 |
In the table, wwm means that whole word masking is used.
After the corpus and the pre-trained language model are ready, organize them with the following directory structure:
```
Your Data Root Path
|——baidu_ultr
| |——data
| | |——part-00000
| | |——part-00001
| | |——...
| |——annotate_data
| | |——annotation_data_0522.txt
| | |——wsdm_test_1.txt
| | |——wsdm_test_2_all.txt
| |——ckpt
| | |——submit
| | | |——model_name
| | | | |——config.json
| | | | |——pytorch.bin
| | |——pretrain
| | | |——model_name
| | | | |——config.json
| | | | |——pytorch.bin
```
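As a quick sanity check, the sketch below shows how a checkpoint stored in this layout could be inspected; the path is a hypothetical placeholder, and the repository's own scripts load the model via `model_name_or_path`.

```python
import json
import torch

# Hypothetical checkpoint directory following the layout above.
ckpt_dir = "/your/data/root/baidu_ultr/ckpt/pretrain/model_name"

# config.json describes the transformer configuration (e.g. number of layers and heads).
with open(f"{ckpt_dir}/config.json") as f:
    config = json.load(f)
print(config)

# pytorch.bin holds the model weights as a PyTorch state dict.
state_dict = torch.load(f"{ckpt_dir}/pytorch.bin", map_location="cpu")
print(len(state_dict), "tensors, e.g.", list(state_dict)[:3])
```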
- Modify `data_root` in `./pretrain/start.sh` to Your Data Root Path.
- Then, run `cd pretrain && sh start.sh`.
- You can run TensorBoard on `output_dir` to observe the trend of the model's training metrics (a minimal logging sketch is shown after this list).
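If you want to log extra metrics into the same `output_dir`, a minimal sketch with `torch.utils.tensorboard` (the tag names and values are illustrative, and the training script's own logging may differ):

```python
from torch.utils.tensorboard import SummaryWriter

# Write scalars into output_dir; they appear next to the existing training curves.
writer = SummaryWriter(log_dir="output_dir")
for step in range(100):
    writer.add_scalar("example/loss", 1.0 / (step + 1), step)  # illustrative values
writer.close()

# View with: tensorboard --logdir output_dir
```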
To quickly test model performance, you can directly download a model trained by us, whose DCG@10 is 10.25 on annotation_data_0522.txt (the validation dataset); a sketch of the DCG@10 metric is shown after the run commands below.
Then, in `./submit/start.sh`, set `data_root` to Your Data Root Path, `model_name_or_path` to the path of the model you want to test, and `model_w` to 1.
Finally, run `cd submit && sh start.sh`.
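For reference, DCG@10 is computed over the annotated relevance labels of the top 10 documents ranked by the model. The sketch below uses a linear gain; consult the official evaluation script for the exact gain and discount behind the reported numbers.

```python
import numpy as np

def dcg_at_k(relevance_labels, k=10):
    """DCG@k with linear gain: sum of rel_i / log2(i + 1) over the top-k positions."""
    rels = np.asarray(relevance_labels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))  # log2(2), log2(3), ...
    return float(np.sum(rels / discounts))

# Illustrative annotated labels of the top-10 documents returned for one query.
print(dcg_at_k([4, 2, 3, 0, 1, 0, 2, 0, 0, 1]))
```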
To further improve performance, we take as the final relevance score the weighted sum of the output scores of multiple models trained under different settings during our experiments.
You can download these models, trained with different settings, from the table below:
Model Name | URL | DCG@10 on val dataset |
---|---|---|
group6_pos_slipoff_mtype_serph_emb8_mlp5l_maxmeancls_bs48 | Download | 10.03 |
group6_pos_slipoff_mtype_serph_emb8_mlp5l_maxmeancls | Download | 10.14 |
group6_pos_slipoff_mtype_serph_emb8_mlp5l_wwm | Download | 10.16 |
group6_pos_slipoff_serph_emb8_mlp5l_24l | Download | 10.10 |
group6_pos_slipoff_serph_emb8_mlp5l | Download | 10.25 |
group6_pos_slipoff_mtype_serph_emb8_bnnoelu_mlp5l_relu | Download | 10.20 |
group6_pos_slipoff_mtype_serph_emb8_bnnoelu_dropout_mlp5l_relu | Download | 10.14 |
group6_pos_slipoff_mtype_serph_emb8_bnnoelu_mlp5l_relu_24l | Download | 10.23 |
group6_pos_slipoff_mtype_serh_emb8_bnnoelu | Download | 10.15 |
group6_pos_slipoff_mtype_emb8_bnnoelu | Download | 10.15 |
group6_pos_slipoff_serh_emb8 | Download | 10.05 |
group6_pos_slipoff_pad_with_pretrain_emb8 | Download | 10.05 |
Then, in `./submit/start.sh`, set `data_root` to Your Data Root Path, `model_name_or_path` to the paths of the models you want to ensemble, and `model_w` to 0.10,0.35,0.50,0.25,0.40,0.10,0.10,0.55,0.35,0.05,0.1,0.50, where the values of `model_w` are set manually.
Finally, run `cd submit && sh start.sh`.
The DCG@10 of the model ensemble on the validation dataset is 10.54 (10.14 on the final test dataset).
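The weighted sum itself is straightforward; below is a minimal sketch with hypothetical score arrays, one row per model in the table above, combined with the `model_w` weights.

```python
import numpy as np

# Hypothetical relevance scores of the 12 models for the same candidate documents.
model_scores = np.random.randn(12, 1000)

# Weights in the same order as model_w in ./submit/start.sh.
model_w = np.array([0.10, 0.35, 0.50, 0.25, 0.40, 0.10,
                    0.10, 0.55, 0.35, 0.05, 0.10, 0.50])

# Final relevance score for each document: weighted sum of the per-model scores.
final_scores = model_w @ model_scores   # shape: (1000,)
```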
- Xiaoshu Chen: [email protected]