MVA Altegrad challenge 2021

Link prediction

Team members :

Gabriel Baker : [email protected]
Yujin Cho : [email protected]
Fabien Merceron : [email protected]

This work intends to study different approaches of leveraging text and graph structure information to predict missing links in a scientific collaboration network. The problem was proposed in the ALTEGRAD course and is the objective of a Kaggle challenge.

Dataset

It contains different sources of data (abstract, authors and edge connections)

Runing the script

To facilitate the run of our model we created a colab notebook available here.

But you can also locally train the model as explained below.

Prepare the environment

We recommend the download of the following folders with the precomputed features and the dataset.

./data Google drive link
./saved_feats Google drive link

Please install the following not so common libraries

!pip install -qU sentence-transformers
!pip install -q node2vec
!pip install -q --upgrade gensim
!pip install -q dgl

Training

Use this command to reproduce our best model

python main.py --use_manual_features

Evaluate

To create the submission file run the following

python eval.py --use_manual_features

Experiments / Results

In order to leverage the most of our three different sources of data (abstracts, authors and edges connections), we tested different strategies of feature extraction for each source and different combinations of the generated features.

The table summarizes the results of our best input combinations, but not of the best models, as there were small variations that produced better models than the first 3 lines (the last line was indeed our best model). An example of this are the models of the second and third rows (SBERT & GAT / SBERT & Doc2Vec), which performed better when training with the whole dataset, but we decided to present the results of all models with the division of training and validation set to have a fairer comparison.

We reached top 5 in kaggle challenge private leaderboard with 0.09725

Abstract Embedding	Node Embedding	Hueristics/Manual features	Train	Val	Test
Doc2Vec	Node2Vec	No	0.078	0.052	0.225
SBERT	GAT	No	0.087	0.082	0.143
SBERT	Node2Vec	No	0.061	0.052	0.146
SBERT	Node2Vec	Yes	0.059	0.048	0.104

Result table : Negative - log likelihood obtained after training the network and choosing the model with the smallest validation loss. Test score is calculated with approximately 50% of the test dataset

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
Doc2Vec.py		Doc2Vec.py
GAT.py		GAT.py
Node2Vec.py		Node2Vec.py
README.md		README.md
SBERT.py		SBERT.py
data.py		data.py
eval.py		eval.py
main.py		main.py
manual_features.py		manual_features.py
model.py		model.py
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MVA Altegrad challenge 2021

Link prediction

Dataset

Runing the script

Prepare the environment

Training

Evaluate

Experiments / Results

About

Releases

Packages

Contributors 2

Languages

gabrielBarbutti/altegrad-challenge

Folders and files

Latest commit

History

Repository files navigation

MVA Altegrad challenge 2021

Link prediction

Dataset

Runing the script

Prepare the environment

Training

Evaluate

Experiments / Results

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages